
Everything Fails: Building a Production-Grade Resilience Layer

Everything fails, all the time. Design for it.

I've spent two decades watching systems fail. Networks partition. Databases time out. APIs return 500s. Disks fill up. Memory leaks compound. Certificates expire.

Everything fails, all the time.

The question isn't whether your system will fail. It's whether you've designed it to handle failure gracefully.

This is the story of how we rebuilt our resilience architecture from scattered band-aids into a proper engineering foundation.

The Problem We Inherited

Before the transformation, every Finance agent - Payroll, Treasury, Audit, Statutory Payments, and four others - implemented its own retry logic.

Eight different retry patterns. Eight different max-attempt configurations. Eight different delay strategies. 150+ lines of duplicate code scattered across the codebase.

The consequences were predictable:

  • Inconsistent behavior: Payroll retried 3 times with 2-second delays. Treasury retried 5 times with exponential backoff. Same failure, different response.
  • Cascade failures: When OpenAI went down, all eight agents hammered it simultaneously. No circuit breaking. No backpressure. Just a thundering herd making things worse.
  • Zero observability: How many retries happened last week? What's our success rate? Which agents fail most often? Nobody knew.
  • Maintenance nightmare: Change the retry strategy? Edit eight files. Miss one? Inconsistent behavior. Find out in production.

The Solution: Centralized Resilience

We built three core components:

1. RetryPolicy

A unified retry system with three configurable strategies:

  • Exponential backoff: For AI operations and external APIs. Start at 2 seconds, double each retry, cap at 60 seconds. Add jitter to prevent a thundering herd.
  • Fixed delay: For database operations and internal services. Consistent 1-second wait. Predictable behavior.
  • Immediate retry: For transient network blips. Retry once immediately, then fall back to exponential.

All strategies share the same interface. Swap one for another without changing calling code.
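In sketch form, the shared interface looks something like this (class names, signatures, and defaults here are illustrative, not the exact production code):

```python
import random
import time


class RetryStrategy:
    """Shared interface: given the attempt number, return the delay in seconds."""

    def delay(self, attempt: int) -> float:
        raise NotImplementedError


class ExponentialBackoff(RetryStrategy):
    def __init__(self, base: float = 2.0, cap: float = 60.0, jitter: bool = True):
        self.base, self.cap, self.jitter = base, cap, jitter

    def delay(self, attempt: int) -> float:
        # 2s, 4s, 8s, ... capped at 60s; full jitter spreads retries out.
        raw = min(self.cap, self.base * (2 ** attempt))
        return random.uniform(0, raw) if self.jitter else raw


class FixedDelay(RetryStrategy):
    def __init__(self, seconds: float = 1.0):
        self.seconds = seconds

    def delay(self, attempt: int) -> float:
        return self.seconds


class ImmediateThenExponential(RetryStrategy):
    def __init__(self, fallback: RetryStrategy | None = None):
        self.fallback = fallback or ExponentialBackoff()

    def delay(self, attempt: int) -> float:
        # First retry fires immediately; after that, back off exponentially.
        return 0.0 if attempt == 0 else self.fallback.delay(attempt)


class RetryPolicy:
    def __init__(self, strategy: RetryStrategy, max_attempts: int = 3):
        self.strategy, self.max_attempts = strategy, max_attempts

    def execute(self, fn, *args, **kwargs):
        for attempt in range(self.max_attempts):
            try:
                return fn(*args, **kwargs)
            except Exception:
                if attempt == self.max_attempts - 1:
                    raise
                time.sleep(self.strategy.delay(attempt))
```

Swapping strategies is then a one-line change: RetryPolicy(FixedDelay(1.0)) for database calls, RetryPolicy(ExponentialBackoff()) for AI calls.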

2. CircuitBreaker

Classic three-state machine:

  • CLOSED: Normal operation. Track failures.
  • OPEN: Too many failures. Fast-fail all requests. Don't waste resources on a broken dependency.
  • HALF_OPEN: Timeout expired. Let trial requests through. If they keep succeeding, close the circuit. If one fails, reopen.

Configuration:

  • 5 consecutive failures open the circuit
  • 60-second timeout before half-open
  • 2 consecutive successes in half-open close the circuit

These numbers came from production observation, not theory. They work for our traffic patterns. Yours may differ.
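A bare-bones version of that state machine, with those numbers as defaults (again a sketch, not the exact code we run):

```python
import time


class CircuitOpenError(Exception):
    """Raised when the circuit is OPEN and the call is fast-failed."""


class CircuitBreaker:
    def __init__(self, failure_threshold: int = 5, recovery_timeout: float = 60.0,
                 success_threshold: int = 2):
        self.failure_threshold = failure_threshold  # consecutive failures that open the circuit
        self.recovery_timeout = recovery_timeout    # seconds to wait before going half-open
        self.success_threshold = success_threshold  # half-open successes needed to close
        self.state = "CLOSED"
        self.failures = 0
        self.successes = 0
        self.opened_at = 0.0

    def call(self, fn, *args, **kwargs):
        if self.state == "OPEN":
            if time.monotonic() - self.opened_at >= self.recovery_timeout:
                # Timeout expired: probe the dependency with trial requests.
                self.state, self.successes = "HALF_OPEN", 0
            else:
                raise CircuitOpenError("dependency marked unhealthy, fast-failing")
        try:
            result = fn(*args, **kwargs)
        except Exception:
            self._record_failure()
            raise
        self._record_success()
        return result

    def _record_failure(self):
        self.failures += 1
        if self.state == "HALF_OPEN" or self.failures >= self.failure_threshold:
            self.state, self.failures = "OPEN", 0
            self.opened_at = time.monotonic()

    def _record_success(self):
        if self.state == "HALF_OPEN":
            self.successes += 1
            if self.successes >= self.success_threshold:
                self.state = "CLOSED"
        self.failures = 0
```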

3. Observability Dashboard

Real-time metrics:

  • Retry counts by agent and error type
  • Circuit breaker state transitions
  • Success rates over time
  • AI confidence trends
  • Error patterns and correlations

You can't improve what you can't measure.
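The dashboard itself is just a view; the part that matters is that every retry and state transition reports to one place. Roughly, something like the recorder below could sit behind it (an in-memory stand-in for illustration; a real deployment would export to whatever metrics backend you already run):

```python
from collections import Counter
from datetime import datetime, timezone


class ResilienceMetrics:
    def __init__(self):
        self.retries = Counter()      # (agent, error_type) -> retry count
        self.outcomes = Counter()     # (agent, "success" | "failure") -> call count
        self.transitions = []         # (timestamp, breaker, old_state, new_state)

    def record_retry(self, agent: str, error_type: str):
        self.retries[(agent, error_type)] += 1

    def record_outcome(self, agent: str, succeeded: bool):
        self.outcomes[(agent, "success" if succeeded else "failure")] += 1

    def record_transition(self, breaker: str, old_state: str, new_state: str):
        self.transitions.append((datetime.now(timezone.utc), breaker, old_state, new_state))

    def success_rate(self, agent: str) -> float:
        ok = self.outcomes[(agent, "success")]
        total = ok + self.outcomes[(agent, "failure")]
        return ok / total if total else 1.0
```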

The Implementation

All eight Finance agents now call the centralized resilience layer. Their code focuses purely on business logic:

  • Payroll calculates deductions
  • Treasury manages currencies
  • Audit classifies findings

Retry mechanics, circuit breaking, observability? Handled automatically by infrastructure.
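Wired up, that separation can be as small as a decorator at the agent boundary. The names below are illustrative and reuse the RetryPolicy and CircuitBreaker sketches from earlier:

```python
# One shared policy/breaker pair per external dependency, applied at the agent boundary.
openai_policy = RetryPolicy(ExponentialBackoff(), max_attempts=3)
openai_breaker = CircuitBreaker(failure_threshold=5, recovery_timeout=60.0, success_threshold=2)


def resilient(policy: RetryPolicy, breaker: CircuitBreaker):
    def wrap(fn):
        def inner(*args, **kwargs):
            # Breaker outermost: when it is OPEN the call fast-fails immediately,
            # without burning time on retries that cannot succeed.
            return breaker.call(policy.execute, fn, *args, **kwargs)
        return inner
    return wrap


@resilient(openai_policy, openai_breaker)
def extract_invoice_fields(document: str) -> dict:
    ...  # pure business logic: call the model, parse the response, return fields
```

In this sketch, the agent module imports the decorator and nothing else from the resilience layer.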

When OpenAI times out during invoice extraction:

  1. First retry after 2 seconds
  2. Second retry after 4 seconds
  3. If failures keep accumulating, the circuit breaker opens
  4. Subsequent requests fast-fail for 60 seconds
  5. Half-open test after timeout
  6. If recovered, normal operation resumes

All of this is automatic. The agent code doesn't know or care.
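A toy run of that sequence against the earlier sketches (jitter disabled so the delays are deterministic, and a stand-in dependency instead of OpenAI):

```python
backoff = ExponentialBackoff(base=2.0, cap=60.0, jitter=False)
print([backoff.delay(attempt) for attempt in range(4)])
# [2.0, 4.0, 8.0, 16.0] -- first retry after 2s, second after 4s, then capped growth

breaker = CircuitBreaker(failure_threshold=5, recovery_timeout=60.0, success_threshold=2)


def broken_dependency():
    raise TimeoutError("model timed out")


for _ in range(5):                   # five consecutive failures ...
    try:
        breaker.call(broken_dependency)
    except TimeoutError:
        pass

print(breaker.state)                 # ... leave the circuit OPEN
try:
    breaker.call(broken_dependency)  # now fast-fails without touching the dependency
except CircuitOpenError as exc:
    print("fast-failed:", exc)
```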

The Results

  • 150+ lines of duplicate code eliminated
  • One centralized retry system
  • 90%+ auto-correction rate across all Finance agents
  • Real-time error analytics tracking patterns and trends
  • Consistent behavior across all agents

More importantly: when we need to adjust retry parameters, we edit one file. When we need to tune the circuit breaker, one configuration change applies everywhere.

What We Learned

1. Resilience is infrastructure, not application code.

Business logic shouldn't know about retries. Separate concerns. Build proper abstractions.

2. Observability isn't optional.

A resilience layer without metrics is just complexity. You need to see what's happening to improve it.

3. Start with production requirements.

We didn't build this in a vacuum. We built it after watching real failures in real systems. The configuration values come from observation.

4. Jitter matters.

When a service recovers, you don't want all clients retrying simultaneously. Add randomness to your backoff. Spread the load.
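The fix fits in one line. A sketch of "full jitter", where each client picks a random delay under the exponential ceiling:

```python
import random


def jittered_delay(attempt: int, base: float = 2.0, cap: float = 60.0) -> float:
    # Random delay anywhere below the exponential ceiling: retries spread out
    # instead of arriving in synchronized waves when a dependency recovers.
    return random.uniform(0, min(cap, base * 2 ** attempt))
```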

5. Fast-fail is a feature.

When a dependency is down, the kindest thing you can do is fail immediately. Don't waste time retrying something that won't work. Don't pile load on a system trying to recover.


Everything fails, all the time.

The difference between a fragile system and a robust one isn't the absence of failure. It's the presence of proper failure handling.

We built that handling once. Now it works everywhere. That's what infrastructure should do.

Want to learn more about building resilient systems? Let's talk.