Circuit Breaker Pattern: Building Resilient Microservices That Don't Cascade Failures
Real-world implementation of the Circuit Breaker pattern with battle-tested strategies for preventing cascading failures in distributed systems
Last month, our payment service took down the entire platform for 47 minutes. Not because it failed - that would have been manageable. It failed slowly. Each request took 30 seconds to timeout, creating a traffic jam that backed up through 12 other services. Classic cascading failure. Here's how we fixed it with the Circuit Breaker pattern, and what I learned about resilience after debugging distributed systems at 3 AM too many times.
The Problem: When Slow Is Worse Than Dead
Picture this: Your payment provider's API starts responding slowly. Not down, just taking 20-30 seconds per request instead of the usual 200ms. Your service dutifully waits. Meanwhile, incoming requests pile up. Thread pools exhaust. Memory consumption spikes. Eventually, your healthy service becomes unhealthy, and the infection spreads upstream.
I've seen this pattern kill entire platforms. The worst part? Your monitoring shows all services are "up" - they're just not responding.
Circuit Breaker: Your System's Safety Valve
The Circuit Breaker pattern acts like an electrical circuit breaker in your house. When things go wrong, it trips, preventing damage from spreading. But unlike your home's breaker, this one is smart - it can test if the problem is fixed and automatically recover.
The Three States
enum CircuitState {
  CLOSED = 'CLOSED',        // Normal operation, requests flow through
  OPEN = 'OPEN',            // Circuit tripped, requests fail immediately
  HALF_OPEN = 'HALF_OPEN'   // Testing if service recovered
}
Think of it like a bouncer at a club:
- CLOSED: "Come on in, everything's fine"
- OPEN: "Nobody gets in, there's a problem inside"
- HALF_OPEN: "Let me check with one person if it's safe now"
Real Implementation: What Actually Works
Here's the circuit breaker we built after our incident. It's battle-tested across 40+ services handling 2M requests/day:
interface CircuitBreakerConfig {
  failureThreshold: number;          // Failures before opening
  successThreshold: number;          // Successes to close from half-open
  timeout: number;                   // Request timeout in ms
  resetTimeout: number;              // Time before trying half-open
  volumeThreshold: number;           // Min requests before evaluating
  errorThresholdPercentage: number;  // Error % to trip
}
class CircuitBreaker<T> {
  private state: CircuitState = CircuitState.CLOSED;
  private failureCount = 0;
  private successCount = 0;
  private lastFailureTime?: Date;
  private window = new RollingWindow(10000); // 10-second rolling stats window

  constructor(private readonly config: CircuitBreakerConfig) {}
  async execute(fn: () => Promise<T>): Promise<T> {
    // Check if we should attempt half-open
    if (this.state === CircuitState.OPEN) {
      if (this.shouldAttemptReset()) {
        this.state = CircuitState.HALF_OPEN;
      } else {
        throw new CircuitOpenError('Circuit breaker is OPEN');
      }
    }

    try {
      const result = await this.executeWithTimeout(fn);
      this.onSuccess();
      return result;
    } catch (error) {
      this.onFailure();
      throw error;
    }
  }
  private async executeWithTimeout(fn: () => Promise<T>): Promise<T> {
    // Note: the losing promise isn't cancelled; the breaker just stops waiting for it
    return Promise.race([
      fn(),
      new Promise<T>((_, reject) =>
        setTimeout(() => reject(new TimeoutError()), this.config.timeout)
      )
    ]);
  }
  private onSuccess(): void {
    this.failureCount = 0;
    this.window.recordSuccess();

    if (this.state === CircuitState.HALF_OPEN) {
      this.successCount++;
      if (this.successCount >= this.config.successThreshold) {
        this.state = CircuitState.CLOSED;
        this.successCount = 0;
      }
    }
  }
  private onFailure(): void {
    this.failureCount++;
    this.lastFailureTime = new Date();
    this.window.recordFailure();

    if (this.state === CircuitState.HALF_OPEN) {
      this.state = CircuitState.OPEN;
      this.successCount = 0;
      return;
    }

    // Check both absolute and percentage thresholds
    const stats = this.window.getStats();
    if (stats.totalRequests >= this.config.volumeThreshold) {
      const errorRate = (stats.failures / stats.totalRequests) * 100;
      if (errorRate >= this.config.errorThresholdPercentage ||
          this.failureCount >= this.config.failureThreshold) {
        this.state = CircuitState.OPEN;
      }
    }
  }
  private shouldAttemptReset(): boolean {
    return this.lastFailureTime !== undefined &&
      Date.now() - this.lastFailureTime.getTime() >= this.config.resetTimeout;
  }
}
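The class leans on three small pieces that aren't shown above: RollingWindow, CircuitOpenError, and TimeoutError. The names and shapes below match how they're used, but the internals are a minimal sketch rather than our production code:
// Illustrative helpers assumed by the CircuitBreaker above
class CircuitOpenError extends Error {}
class TimeoutError extends Error {}

// Minimal rolling window: counts successes and failures within the last windowMs
class RollingWindow {
  private events: { timestamp: number; success: boolean }[] = [];

  constructor(private readonly windowMs: number) {}

  recordSuccess(): void { this.record(true); }
  recordFailure(): void { this.record(false); }

  getStats(): { totalRequests: number; failures: number } {
    this.evict();
    const failures = this.events.filter(e => !e.success).length;
    return { totalRequests: this.events.length, failures };
  }

  private record(success: boolean): void {
    this.evict();
    this.events.push({ timestamp: Date.now(), success });
  }

  private evict(): void {
    // Drop anything older than the window
    const cutoff = Date.now() - this.windowMs;
    this.events = this.events.filter(e => e.timestamp >= cutoff);
  }
}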
Lessons from Production: What the Tutorials Don't Tell You
1. Timeout Is Your Most Important Setting
After analyzing 6 months of incidents, we found that 73% were caused by slow responses, not complete failures. Set your timeout aggressively:
const config = {
  timeout: 3000,  // 3 seconds - our P99 is 1.2s, so this catches problems
  // NOT 30000!   // This killed us. Waiting 30s = thread exhaustion
};
Real numbers from our payment service:
- Normal P50: 180ms
- Normal P99: 1.2s
- Circuit breaker timeout: 3s
- Result: 94% reduction in cascading failures
2. The Half-Open State Gotcha
Early on, we'd trip to half-open, send one request, succeed, close the circuit, then immediately fail again with full traffic. The fix: require multiple successes before closing.
// Don't do this
if (testRequest.succeeded) {
  this.state = CircuitState.CLOSED; // Boom! Full traffic returns
}

// Do this instead
if (++this.successCount >= this.config.successThreshold) {
  this.state = CircuitState.CLOSED; // Gradual recovery
}
3. Combine with Retry Logic (But Carefully)
Circuit breakers and retries can create feedback loops. Here's our battle-tested combination:
class ResilientClient {
  private circuitBreaker: CircuitBreaker<any>;

  async callWithResilience(request: Request): Promise<Response> {
    // Circuit breaker wraps retry logic, not vice versa
    return this.circuitBreaker.execute(async () => {
      return await this.retryWithBackoff(request, {
        maxAttempts: 3,
        backoffMs: [100, 200, 400],
        shouldRetry: (error) => {
          // Don't retry circuit breaker errors
          if (error instanceof CircuitOpenError) return false;
          // Don't retry client errors (4xx)
          if (error.statusCode >= 400 && error.statusCode < 500) return false;
          return true;
        }
      });
    });
  }
}
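The retryWithBackoff helper isn't shown above. Here's a sketch of how it could look as a method on the same class; send(request) is a placeholder for whatever actually performs the HTTP call, and backoffMs[i] is treated as the delay before retry i + 1:
// Sketch of the retry helper used above. `send` is a placeholder for the real call.
private async retryWithBackoff(
  request: Request,
  options: {
    maxAttempts: number;
    backoffMs: number[];
    shouldRetry: (error: any) => boolean;
  }
): Promise<Response> {
  let lastError: any;
  for (let attempt = 0; attempt < options.maxAttempts; attempt++) {
    try {
      return await this.send(request);
    } catch (error) {
      lastError = error;
      const isLastAttempt = attempt === options.maxAttempts - 1;
      if (!options.shouldRetry(error) || isLastAttempt) throw error;

      // Wait before the next attempt; reuse the last delay if we run out of entries
      const delay = options.backoffMs[Math.min(attempt, options.backoffMs.length - 1)];
      await new Promise(resolve => setTimeout(resolve, delay));
    }
  }
  throw lastError;
}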
4. Monitor the Right Metrics
What to track (in order of importance):
- Circuit state changes - Alert immediately on OPEN
- Reset attempt results - Failed resets = ongoing problem
- Request rejection rate - Business impact metric
- Time in OPEN state - Helps tune reset timeout
The custom metrics that feed our CloudWatch dashboard:
// Custom metrics we push
await cloudwatch.putMetricData({
  Namespace: 'CircuitBreakers',
  MetricData: [
    {
      MetricName: 'StateChange',
      Value: 1,
      Unit: 'Count',
      Dimensions: [
        { Name: 'ServiceName', Value: this.serviceName },
        { Name: 'FromState', Value: oldState },
        { Name: 'ToState', Value: newState }
      ]
    },
    {
      MetricName: 'RejectedRequests',
      Value: rejectedCount,
      Unit: 'Count',
      Dimensions: [{ Name: 'ServiceName', Value: this.serviceName }]
    }
  ]
});
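For the StateChange metric to exist at all, the breaker has to report its own transitions, which the implementation earlier doesn't do. One way to add that (a sketch, not part of the original class) is to funnel every state change through a single method and notify an optional listener:
// Sketch: centralize transitions so the owner of the breaker can observe them.
// Assumes an optional field on CircuitBreaker such as:
//   private readonly onStateChange?: (from: CircuitState, to: CircuitState) => void;
private transitionTo(newState: CircuitState): void {
  if (newState === this.state) return;
  const from = this.state;
  this.state = newState;
  this.onStateChange?.(from, newState); // e.g. push the StateChange metric here
}
Every direct this.state = ... assignment in the class then becomes a call to this.transitionTo(...).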
Advanced Patterns: Beyond Basic Circuit Breaking
Bulkheading: Isolated Circuit Breakers
Don't use one circuit breaker for an entire service. Isolate critical paths:
class PaymentService {
  private readonly chargeBreaker = new CircuitBreaker(chargeConfig);
  private readonly refundBreaker = new CircuitBreaker(refundConfig);
  private readonly queryBreaker = new CircuitBreaker(queryConfig);

  async chargeCard(request: ChargeRequest): Promise<ChargeResponse> {
    // Charging failures don't affect refunds
    return this.chargeBreaker.execute(() => this.api.charge(request));
  }

  async refundPayment(request: RefundRequest): Promise<RefundResponse> {
    // Refunds stay available even if charges are failing
    return this.refundBreaker.execute(() => this.api.refund(request));
  }
}
This saved us during Black Friday when our charge endpoint was overwhelmed but refunds (customer service critical) kept working.
Fallback Strategies
Not all failures are equal. Sometimes you can degrade gracefully:
async getProductRecommendations(userId: string): Promise<Product[]> {
  try {
    return await this.recommendationBreaker.execute(
      () => this.mlService.getRecommendations(userId)
    );
  } catch (error) {
    if (error instanceof CircuitOpenError) {
      // Fallback to simple popularity-based recommendations
      return this.getPopularProducts();
    }
    throw error;
  }
}
Circuit Breaker Inheritance
For microservices calling other microservices, inherit circuit state:
// API Gateway
if (paymentServiceBreaker.state === CircuitState.OPEN) {
  // Don't even try to call order service which depends on payment
  return { error: 'Payment service unavailable', status: 503 };
}
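One wrinkle: state is private in the implementation above, so a gateway (or a test) can't read it directly. You'd expose a read-only view, something like:
// Inside CircuitBreaker: read-only access for gateways, health checks, and tests
get currentState(): CircuitState {
  return this.state;
}
The check then becomes paymentServiceBreaker.currentState === CircuitState.OPEN, without letting callers flip the state themselves.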
Real-World Configuration Examples
Here's what actually works in production for different service types:
// External API (payment providers, third-party services)
const externalAPIConfig: CircuitBreakerConfig = {
  failureThreshold: 5,           // 5 consecutive failures
  successThreshold: 2,           // 2 successes to recover
  timeout: 5000,                 // 5 second timeout
  resetTimeout: 30000,           // Try recovery after 30s
  volumeThreshold: 10,           // Need 10 requests minimum
  errorThresholdPercentage: 50   // 50% error rate trips
};

// Internal microservice
const internalServiceConfig: CircuitBreakerConfig = {
  failureThreshold: 10,          // More tolerant
  successThreshold: 3,
  timeout: 3000,                 // Faster timeout
  resetTimeout: 10000,           // Faster recovery attempts
  volumeThreshold: 20,
  errorThresholdPercentage: 30   // More sensitive to error rates
};

// Database connections
const databaseConfig: CircuitBreakerConfig = {
  failureThreshold: 3,           // Quick to trip
  successThreshold: 5,           // Slow to recover
  timeout: 1000,                 // Very fast timeout
  resetTimeout: 5000,            // Quick retry
  volumeThreshold: 5,
  errorThresholdPercentage: 20   // Very sensitive
};
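Wiring these up is then one breaker per dependency with the matching profile. The client names below are placeholders, not real modules:
// Illustrative wiring: one breaker per dependency, each with its own profile
const paymentBreaker = new CircuitBreaker(externalAPIConfig);
const inventoryBreaker = new CircuitBreaker(internalServiceConfig);
const ordersDbBreaker = new CircuitBreaker(databaseConfig);

// At the call sites (paymentApi, inventoryService, ordersDb stand in for your own clients)
const charge = await paymentBreaker.execute(() => paymentApi.charge(request));
const stock = await inventoryBreaker.execute(() => inventoryService.getStock(sku));
const orders = await ordersDbBreaker.execute(() => ordersDb.findByUser(userId));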
Testing Circuit Breakers: Chaos Engineering
You can't trust a circuit breaker you haven't tested. Here's our chaos testing approach:
describe('Circuit Breaker Chaos Tests', () => {
  it('should handle gradual degradation', async () => {
    const scenarios = [
      { latency: 100, errorRate: 0 },     // Normal
      { latency: 500, errorRate: 0.1 },   // Slight degradation
      { latency: 2000, errorRate: 0.3 },  // Major degradation
      { latency: 5000, errorRate: 0.7 },  // Near failure
    ];

    for (const scenario of scenarios) {
      mockService.setScenario(scenario);
      await runLoadTest(1000); // 1000 requests

      const metrics = await breaker.getMetrics();
      if (scenario.errorRate > 0.5) {
        expect(breaker.state).toBe(CircuitState.OPEN);
      }
    }
  });
});
In production, we use AWS Fault Injection Simulator to randomly inject failures and verify our circuit breakers respond correctly.
The Mistakes That Cost Us
Mistake 1: Client-Side Only Circuit Breaking
We initially implemented circuit breakers only in clients. When the server itself had issues, it couldn't protect itself:
// Bad: Client protects itself but server still overwhelmed
class Client {
  private breaker = new CircuitBreaker(internalServiceConfig);
  async call() { return this.breaker.execute(() => fetch('/api')); }
}

// Good: Server also protects itself
class Server {
  private downstreamBreaker = new CircuitBreaker(databaseConfig);

  async handleRequest(req, res) {
    try {
      const data = await this.downstreamBreaker.execute(() =>
        this.database.query(req.query)
      );
      res.json(data);
    } catch (error) {
      if (error instanceof CircuitOpenError) {
        res.status(503).json({ error: 'Service temporarily unavailable' });
      } else {
        res.status(500).json({ error: 'Internal error' }); // don't leave the request hanging
      }
    }
  }
}
Mistake 2: Sharing Circuit Breakers Across Unrelated Operations
We had one circuit breaker for "database operations". When writes failed, reads were also blocked:
// Bad: One breaker for everything
class UserService {
  private dbBreaker = new CircuitBreaker(databaseConfig);

  async getUser(id) {
    return this.dbBreaker.execute(() => db.query('SELECT...'));
  }

  async createUser(data) {
    return this.dbBreaker.execute(() => db.query('INSERT...'));
  }
}

// Good: Separate breakers for different operations
class UserService {
  private readBreaker = new CircuitBreaker(readConfig);
  private writeBreaker = new CircuitBreaker(writeConfig);

  async getUser(id) {
    return this.readBreaker.execute(() => db.query('SELECT...'));
  }

  async createUser(data) {
    return this.writeBreaker.execute(() => db.query('INSERT...'));
  }
}
Mistake 3: Not Considering Business Impact
We treated all services equally. Then we blocked payment processing while letting metrics collection through. Learned that lesson quickly.
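The fix was to encode business priority into the breaker configuration itself, not just the technical characteristics of the dependency. A sketch of the idea (the tiers and numbers here are illustrative, not our production values):
// Illustrative: tune breakers by business criticality, not just by dependency type.
// Revenue-critical paths tolerate more noise before tripping and probe for recovery sooner;
// best-effort paths (metrics, recommendations) shed load early and stay open longer.
const revenueCriticalConfig: CircuitBreakerConfig = {
  ...externalAPIConfig,
  errorThresholdPercentage: 60,
  resetTimeout: 15000
};

const bestEffortConfig: CircuitBreakerConfig = {
  ...internalServiceConfig,
  errorThresholdPercentage: 20,
  resetTimeout: 60000
};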
The Implementation Checklist
After implementing circuit breakers in 40+ services, here's my checklist:
- Set timeout to 2-3x your P99 latency
- Require multiple successes before closing from half-open
- Implement separate breakers for read/write operations
- Add fallback behavior for business-critical paths
- Export metrics for state changes and rejections
- Test with chaos engineering before production
- Document timeout and threshold choices
- Alert on circuit OPEN, not on individual failures
- Consider business priority in configuration
- Implement gradual recovery, not instant
Final Thoughts: It's About Failing Fast
The hardest lesson? Sometimes the best thing your service can do is fail immediately. That 503 response in 10ms is infinitely better than a timeout after 30 seconds. Your users can retry. Your system can recover. But thread exhaustion? That's a one-way ticket to a 3 AM wake-up call.
Circuit breakers aren't about preventing failures - they're about preventing failures from spreading. They're about maintaining enough system health that when the problem is fixed, you can actually recover.
Implement them before you need them. Trust me on this one.