Circuit Breaker Pattern: Building Resilient Microservices That Don't Cascade Failures
Real-world implementation of the Circuit Breaker pattern with battle-tested strategies for preventing cascading failures in distributed systems
Last month, our payment service took down the entire platform for 47 minutes. Not because it failed - that would have been manageable. It failed slowly. Each request took 30 seconds to timeout, creating a traffic jam that backed up through 12 other services. Classic cascading failure. Here's how we fixed it with the Circuit Breaker pattern, and what I learned about resilience after debugging distributed systems at 3 AM too many times.
The Problem: When Slow Is Worse Than Dead
Picture this: Your payment provider's API starts responding slowly. Not down, just taking 20-30 seconds per request instead of the usual 200ms. Your service dutifully waits. Meanwhile, incoming requests pile up. Thread pools exhaust. Memory consumption spikes. Eventually, your healthy service becomes unhealthy, and the infection spreads upstream.
I've seen this pattern kill entire platforms. The worst part? Your monitoring shows all services are "up" - they're just not responding.
Circuit Breaker: Your System's Safety Valve
The Circuit Breaker pattern acts like an electrical circuit breaker in your house. When things go wrong, it trips, preventing damage from spreading. But unlike your home's breaker, this one is smart - it can test if the problem is fixed and automatically recover.
The Three States
enum CircuitState {
  CLOSED = 'CLOSED',        // Normal operation, requests flow through
  OPEN = 'OPEN',            // Circuit tripped, requests fail immediately
  HALF_OPEN = 'HALF_OPEN'   // Testing if service recovered
}
Think of it like a bouncer at a club:
- CLOSED: "Come on in, everything's fine"
- OPEN: "Nobody gets in, there's a problem inside"
- HALF_OPEN: "Let me check with one person if it's safe now"
Real Implementation: What Actually Works
Here's the circuit breaker we built after our incident. It's battle-tested across 40+ services handling 2M requests/day:
interface CircuitBreakerConfig {
  failureThreshold: number;          // Failures before opening
  successThreshold: number;          // Successes to close from half-open
  timeout: number;                   // Request timeout in ms
  resetTimeout: number;              // Time before trying half-open
  volumeThreshold: number;           // Min requests before evaluating
  errorThresholdPercentage: number;  // Error % to trip
}
class CircuitBreaker<T> {
  private state: CircuitState = CircuitState.CLOSED;
  private failureCount = 0;
  private successCount = 0;
  private lastFailureTime?: Date;
  private window = new RollingWindow(10000); // 10-second rolling stats window

  constructor(private readonly config: CircuitBreakerConfig) {}
  async execute(fn: () => Promise<T>): Promise<T> {
    // Check if we should attempt half-open
    if (this.state === CircuitState.OPEN) {
      if (this.shouldAttemptReset()) {
        this.state = CircuitState.HALF_OPEN;
      } else {
        throw new CircuitOpenError('Circuit breaker is OPEN');
      }
    }

    try {
      const result = await this.executeWithTimeout(fn);
      this.onSuccess();
      return result;
    } catch (error) {
      this.onFailure();
      throw error;
    }
  }
  private async executeWithTimeout(fn: () => Promise<T>): Promise<T> {
    // Note: the losing promise isn't cancelled; the breaker just stops waiting for it
    return Promise.race([
      fn(),
      new Promise<T>((_, reject) =>
        setTimeout(() => reject(new TimeoutError()), this.config.timeout)
      )
    ]);
  }
  private onSuccess(): void {
    this.failureCount = 0;
    this.window.recordSuccess();

    if (this.state === CircuitState.HALF_OPEN) {
      this.successCount++;
      if (this.successCount >= this.config.successThreshold) {
        this.state = CircuitState.CLOSED;
        this.successCount = 0;
      }
    }
  }
  private onFailure(): void {
    this.failureCount++;
    this.lastFailureTime = new Date();
    this.window.recordFailure();

    if (this.state === CircuitState.HALF_OPEN) {
      this.state = CircuitState.OPEN;
      this.successCount = 0;
      return;
    }

    // Check both absolute and percentage thresholds
    const stats = this.window.getStats();
    if (stats.totalRequests >= this.config.volumeThreshold) {
      const errorRate = (stats.failures / stats.totalRequests) * 100;
      if (errorRate >= this.config.errorThresholdPercentage ||
          this.failureCount >= this.config.failureThreshold) {
        this.state = CircuitState.OPEN;
      }
    }
  }
  private shouldAttemptReset(): boolean {
    return this.lastFailureTime !== undefined &&
      Date.now() - this.lastFailureTime.getTime() >= this.config.resetTimeout;
  }
}
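The class leans on three small pieces that aren't shown above: RollingWindow, CircuitOpenError, and TimeoutError. The names and shapes below match how they're used, but the internals are a minimal sketch rather than our production code:
// Illustrative helpers assumed by the CircuitBreaker above
class CircuitOpenError extends Error {}
class TimeoutError extends Error {}

// Minimal rolling window: counts successes and failures within the last windowMs
class RollingWindow {
  private events: { timestamp: number; success: boolean }[] = [];

  constructor(private readonly windowMs: number) {}

  recordSuccess(): void { this.record(true); }
  recordFailure(): void { this.record(false); }

  getStats(): { totalRequests: number; failures: number } {
    this.evict();
    const failures = this.events.filter(e => !e.success).length;
    return { totalRequests: this.events.length, failures };
  }

  private record(success: boolean): void {
    this.evict();
    this.events.push({ timestamp: Date.now(), success });
  }

  private evict(): void {
    // Drop anything older than the window
    const cutoff = Date.now() - this.windowMs;
    this.events = this.events.filter(e => e.timestamp >= cutoff);
  }
}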
Lessons from Production: What the Tutorials Don't Tell You
1. Timeout Is Your Most Important Setting
After analyzing 6 months of incidents, we found that 73% were caused by slow responses, not complete failures. Set your timeout aggressively:
const config = {
  timeout: 3000,  // 3 seconds - our P99 is 1.2s, so this catches problems
  // NOT 30000!   // This killed us. Waiting 30s = thread exhaustion
};
Real numbers from our payment service:
- Normal P50: 180ms
- Normal P99: 1.2s
- Circuit breaker timeout: 3s
- Result: 94% reduction in cascading failures
2. The Half-Open State Gotcha
Early on, we'd trip to half-open, send one request, succeed, close the circuit, then immediately fail again with full traffic. The fix: require multiple successes before closing.
// Don't do this
if (testRequest.succeeded) {
  this.state = CircuitState.CLOSED; // Boom! Full traffic returns
}

// Do this instead
if (++this.successCount >= this.config.successThreshold) {
  this.state = CircuitState.CLOSED; // Gradual recovery
}
3. Combine with Retry Logic (But Carefully)
Circuit breakers and retries can create feedback loops. Here's our battle-tested combination:
class ResilientClient {
  private circuitBreaker: CircuitBreaker<any>;

  async callWithResilience(request: Request): Promise<Response> {
    // Circuit breaker wraps retry logic, not vice versa
    return this.circuitBreaker.execute(async () => {
      return await this.retryWithBackoff(request, {
        maxAttempts: 3,
        backoffMs: [100, 200, 400],
        shouldRetry: (error) => {
          // Don't retry circuit breaker errors
          if (error instanceof CircuitOpenError) return false;
          // Don't retry client errors (4xx)
          if (error.statusCode >= 400 && error.statusCode < 500) return false;
          return true;
        }
      });
    });
  }
}
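The retryWithBackoff helper isn't shown above. Here's a sketch of how it could look as a method on the same class; send(request) is a placeholder for whatever actually performs the HTTP call, and backoffMs[i] is treated as the delay before retry i + 1:
// Sketch of the retry helper used above. `send` is a placeholder for the real call.
private async retryWithBackoff(
  request: Request,
  options: {
    maxAttempts: number;
    backoffMs: number[];
    shouldRetry: (error: any) => boolean;
  }
): Promise<Response> {
  let lastError: any;
  for (let attempt = 0; attempt < options.maxAttempts; attempt++) {
    try {
      return await this.send(request);
    } catch (error) {
      lastError = error;
      const isLastAttempt = attempt === options.maxAttempts - 1;
      if (!options.shouldRetry(error) || isLastAttempt) throw error;

      // Wait before the next attempt; reuse the last delay if we run out of entries
      const delay = options.backoffMs[Math.min(attempt, options.backoffMs.length - 1)];
      await new Promise(resolve => setTimeout(resolve, delay));
    }
  }
  throw lastError;
}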
4. Monitor the Right Metrics
What to track (in order of importance):
- Circuit state changes - Alert immediately on OPEN
- Reset attempt results - Failed resets = ongoing problem
- Request rejection rate - Business impact metric
- Time in OPEN state - Helps tune reset timeout
The custom metrics that feed our CloudWatch dashboard:
// Custom metrics we push
await cloudwatch.putMetricData({
  Namespace: 'CircuitBreakers',
  MetricData: [
    {
      MetricName: 'StateChange',
      Value: 1,
      Unit: 'Count',
      Dimensions: [
        { Name: 'ServiceName', Value: this.serviceName },
        { Name: 'FromState', Value: oldState },
        { Name: 'ToState', Value: newState }
      ]
    },
    {
      MetricName: 'RejectedRequests',
      Value: rejectedCount,
      Unit: 'Count',
      Dimensions: [{ Name: 'ServiceName', Value: this.serviceName }]
    }
  ]
});
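For the StateChange metric to exist at all, the breaker has to report its own transitions, which the implementation earlier doesn't do. One way to add that (a sketch, not part of the original class) is to funnel every state change through a single method and notify an optional listener:
// Sketch: centralize transitions so the owner of the breaker can observe them.
// Assumes an optional field on CircuitBreaker such as:
//   private readonly onStateChange?: (from: CircuitState, to: CircuitState) => void;
private transitionTo(newState: CircuitState): void {
  if (newState === this.state) return;
  const from = this.state;
  this.state = newState;
  this.onStateChange?.(from, newState); // e.g. push the StateChange metric here
}
Every direct this.state = ... assignment in the class then becomes a call to this.transitionTo(...).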
Advanced Patterns: Beyond Basic Circuit Breaking
Bulkheading: Isolated Circuit Breakers
Don't use one circuit breaker for an entire service. Isolate critical paths:
class PaymentService {
  private readonly chargeBreaker = new CircuitBreaker(chargeConfig);
  private readonly refundBreaker = new CircuitBreaker(refundConfig);
  private readonly queryBreaker = new CircuitBreaker(queryConfig);

  async chargeCard(request: ChargeRequest): Promise<ChargeResponse> {
    // Charging failures don't affect refunds
    return this.chargeBreaker.execute(() => this.api.charge(request));
  }

  async refundPayment(request: RefundRequest): Promise<RefundResponse> {
    // Refunds stay available even if charges are failing
    return this.refundBreaker.execute(() => this.api.refund(request));
  }
}
This saved us during Black Friday when our charge endpoint was overwhelmed but refunds (customer service critical) kept working.
Fallback Strategies
Not all failures are equal. Sometimes you can degrade gracefully:
async getProductRecommendations(userId: string): Promise<Product[]> {
  try {
    return await this.recommendationBreaker.execute(
      () => this.mlService.getRecommendations(userId)
    );
  } catch (error) {
    if (error instanceof CircuitOpenError) {
      // Fallback to simple popularity-based recommendations
      return this.getPopularProducts();
    }
    throw error;
  }
}
Circuit Breaker Inheritance
For microservices calling other microservices, inherit circuit state:
// API Gateway
if (paymentServiceBreaker.state === CircuitState.OPEN) {
  // Don't even try to call order service which depends on payment
  return { error: 'Payment service unavailable', status: 503 };
}
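One wrinkle: state is private in the implementation above, so a gateway (or a test) can't read it directly. You'd expose a read-only view, something like:
// Inside CircuitBreaker: read-only access for gateways, health checks, and tests
get currentState(): CircuitState {
  return this.state;
}
The check then becomes paymentServiceBreaker.currentState === CircuitState.OPEN, without letting callers flip the state themselves.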
Real-World Configuration Examples
Here's what actually works in production for different service types:
// External API (payment providers, third-party services)
const externalAPIConfig: CircuitBreakerConfig = {
  failureThreshold: 5,           // 5 consecutive failures
  successThreshold: 2,           // 2 successes to recover
  timeout: 5000,                 // 5 second timeout
  resetTimeout: 30000,           // Try recovery after 30s
  volumeThreshold: 10,           // Need 10 requests minimum
  errorThresholdPercentage: 50   // 50% error rate trips
};

// Internal microservice
const internalServiceConfig: CircuitBreakerConfig = {
  failureThreshold: 10,          // More tolerant
  successThreshold: 3,
  timeout: 3000,                 // Faster timeout
  resetTimeout: 10000,           // Faster recovery attempts
  volumeThreshold: 20,
  errorThresholdPercentage: 30   // More sensitive to error rates
};

// Database connections
const databaseConfig: CircuitBreakerConfig = {
  failureThreshold: 3,           // Quick to trip
  successThreshold: 5,           // Slow to recover
  timeout: 1000,                 // Very fast timeout
  resetTimeout: 5000,            // Quick retry
  volumeThreshold: 5,
  errorThresholdPercentage: 20   // Very sensitive
};
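Wiring these up is then one breaker per dependency with the matching profile. The client names below are placeholders, not real modules:
// Illustrative wiring: one breaker per dependency, each with its own profile
const paymentBreaker = new CircuitBreaker(externalAPIConfig);
const inventoryBreaker = new CircuitBreaker(internalServiceConfig);
const ordersDbBreaker = new CircuitBreaker(databaseConfig);

// At the call sites (paymentApi, inventoryService, ordersDb stand in for your own clients)
const charge = await paymentBreaker.execute(() => paymentApi.charge(request));
const stock = await inventoryBreaker.execute(() => inventoryService.getStock(sku));
const orders = await ordersDbBreaker.execute(() => ordersDb.findByUser(userId));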
Testing Circuit Breakers: Chaos Engineering
You can't trust a circuit breaker you haven't tested. Here's our chaos testing approach:
describe('Circuit Breaker Chaos Tests', () => {
  it('should handle gradual degradation', async () => {
    const scenarios = [
      { latency: 100, errorRate: 0 },     // Normal
      { latency: 500, errorRate: 0.1 },   // Slight degradation
      { latency: 2000, errorRate: 0.3 },  // Major degradation
      { latency: 5000, errorRate: 0.7 },  // Near failure
    ];

    for (const scenario of scenarios) {
      mockService.setScenario(scenario);
      await runLoadTest(1000); // 1000 requests

      const metrics = await breaker.getMetrics();
      if (scenario.errorRate > 0.5) {
        expect(breaker.state).toBe(CircuitState.OPEN);
      }
    }
  });
});
In production, we use AWS Fault Injection Simulator to randomly inject failures and verify our circuit breakers respond correctly.
The Mistakes That Cost Us
Mistake 1: Client-Side Only Circuit Breaking
We initially implemented circuit breakers only in clients. When the server itself had issues, it couldn't protect itself:
// Bad: Client protects itself but server still overwhelmed
class Client {
  private breaker = new CircuitBreaker(internalServiceConfig);
  async call() { return this.breaker.execute(() => fetch('/api')); }
}

// Good: Server also protects itself
class Server {
  private downstreamBreaker = new CircuitBreaker(databaseConfig);

  async handleRequest(req, res) {
    try {
      const data = await this.downstreamBreaker.execute(() =>
        this.database.query(req.query)
      );
      res.json(data);
    } catch (error) {
      if (error instanceof CircuitOpenError) {
        res.status(503).json({ error: 'Service temporarily unavailable' });
      } else {
        res.status(500).json({ error: 'Internal error' }); // don't leave the request hanging
      }
    }
  }
}
Mistake 2: Sharing Circuit Breakers Across Unrelated Operations
We had one circuit breaker for "database operations". When writes failed, reads were also blocked:
// Bad: One breaker for everything
class UserService {
  private dbBreaker = new CircuitBreaker(databaseConfig);

  async getUser(id) {
    return this.dbBreaker.execute(() => db.query('SELECT...'));
  }

  async createUser(data) {
    return this.dbBreaker.execute(() => db.query('INSERT...'));
  }
}

// Good: Separate breakers for different operations
class UserService {
  private readBreaker = new CircuitBreaker(readConfig);
  private writeBreaker = new CircuitBreaker(writeConfig);

  async getUser(id) {
    return this.readBreaker.execute(() => db.query('SELECT...'));
  }

  async createUser(data) {
    return this.writeBreaker.execute(() => db.query('INSERT...'));
  }
}
Mistake 3: Not Considering Business Impact
We treated all services equally. Then we blocked payment processing while letting metrics collection through. Learned that lesson quickly.
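The fix was to encode business priority into the breaker configuration itself, not just the technical characteristics of the dependency. A sketch of the idea (the tiers and numbers here are illustrative, not our production values):
// Illustrative: tune breakers by business criticality, not just by dependency type.
// Revenue-critical paths tolerate more noise before tripping and probe for recovery sooner;
// best-effort paths (metrics, recommendations) shed load early and stay open longer.
const revenueCriticalConfig: CircuitBreakerConfig = {
  ...externalAPIConfig,
  errorThresholdPercentage: 60,
  resetTimeout: 15000
};

const bestEffortConfig: CircuitBreakerConfig = {
  ...internalServiceConfig,
  errorThresholdPercentage: 20,
  resetTimeout: 60000
};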
The Implementation Checklist
After implementing circuit breakers in 40+ services, here's my checklist:
- Set timeout to 2-3x your P99 latency
- Require multiple successes before closing from half-open
- Implement separate breakers for read/write operations
- Add fallback behavior for business-critical paths
- Export metrics for state changes and rejections
- Test with chaos engineering before production
- Document timeout and threshold choices
- Alert on circuit OPEN, not on individual failures
- Consider business priority in configuration
- Implement gradual recovery, not instant
Final Thoughts: It's About Failing Fast
The hardest lesson? Sometimes the best thing your service can do is fail immediately. That 503 response in 10ms is infinitely better than a timeout after 30 seconds. Your users can retry. Your system can recover. But thread exhaustion? That's a one-way ticket to a 3 AM wake-up call.
Circuit breakers aren't about preventing failures - they're about preventing failures from spreading. They're about maintaining enough system health that when the problem is fixed, you can actually recover.
Implement them before you need them. Trust me on this one.