You’ve shipped your agent. It works in dev. It works on staging. Then OpenAI has an incident. Or your search API rate-limits. Or 1 in 100 of your tool calls times out. Production is now a series of small fires.
The three primitives that put out those fires: circuit breakers, retry policies, dead-letter queues. They’re old patterns (Netflix’s Hystrix, open-sourced in 2012, popularized circuit breakers), but most AI agent frameworks don’t ship them. Agentmatic does, in the open-source core. This is what they do and how to use them well.
The failure modes
Three things that go wrong in production agents:
- Upstream provider degrades. OpenAI is 4× slower than usual. Half your requests return 503. You’re getting paged.
- Tool flakes. Your search API rate-limits at 100 req/min; you have 200 concurrent agent runs.
- A long-running graph fails on step 28 of 30. Re-running from scratch is expensive and slow.
Three primitives for three failures.
Circuit breakers
The pattern: after N consecutive failures, open the circuit. While open, fail fast with a known error. After a cooldown, half-open: send one probe call. If it succeeds, close the circuit. If it fails, re-open and back off.
from agentmatic.resilience import CircuitBreaker
agent = (Agent.builder("ai")
    .llm(OpenAI())
    .circuit_breaker("openai", failure_threshold=5, cooldown_seconds=30)
    .circuit_breaker("search", failure_threshold=3, cooldown_seconds=10)
    .build())
Per-service breakers, not per-graph. Your OpenAI errors shouldn’t trip your search circuit; your search rate-limit shouldn’t trip OpenAI.
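To make the closed → open → half-open cycle concrete, here is a minimal single-threaded sketch of the state machine. It is not Agentmatic's implementation: SimpleCircuitBreaker, CircuitOpenError, and the injectable clock are hypothetical names for illustration. It also treats every exception as a failure and re-opens with the same cooldown after a failed probe, where a real breaker would classify errors and might lengthen the cooldown.

```python
import time
from typing import Callable, Optional


class CircuitOpenError(Exception):
    """Raised when a call is rejected because the circuit is open."""


class SimpleCircuitBreaker:
    """Hypothetical illustration only; not the Agentmatic CircuitBreaker."""

    def __init__(self, failure_threshold: int = 5, cooldown_seconds: float = 30.0,
                 clock: Callable[[], float] = time.monotonic):
        self.failure_threshold = failure_threshold
        self.cooldown_seconds = cooldown_seconds
        self._clock = clock
        self._failures = 0                       # consecutive failures so far
        self._opened_at: Optional[float] = None  # None means the circuit is closed

    @property
    def state(self) -> str:
        if self._opened_at is None:
            return "closed"
        if self._clock() - self._opened_at >= self.cooldown_seconds:
            return "half_open"                   # cooldown elapsed: allow one probe
        return "open"

    def call(self, fn, *args, **kwargs):
        state = self.state
        if state == "open":
            raise CircuitOpenError("circuit open; failing fast")
        try:
            result = fn(*args, **kwargs)
        except Exception:
            self._failures += 1
            # A failed probe re-opens at once; otherwise wait for the threshold.
            if state == "half_open" or self._failures >= self.failure_threshold:
                self._opened_at = self._clock()
            raise
        # Any success closes the circuit and resets the failure count.
        self._failures = 0
        self._opened_at = None
        return result
```

One breaker instance per upstream (one for the LLM provider, one for search) gives you the per-service isolation described above.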
Mistakes
- Single breaker for everything. One slow tool trips the entire agent.
- Too aggressive a threshold. 1 failure = open is too sensitive; flickery upstream networks will keep your circuit oscillating.
- Too long a cooldown. 5 minutes for a transient 503 means 5 minutes of all-fail.
- No half-open probe. You need a way to test recovery without sending full traffic.
Sensible defaults
- Failure threshold: 5 consecutive failures.
- Cooldown: 30 s for LLM providers, 10 s for fast services.
- Half-open: 1 probe, 60 s evaluation window.
- Failure definition: HTTP 5xx, timeouts, connection errors. Not 4xx (those are deterministic; retry won’t fix them).
Retry with backoff
The pattern: when a transient failure happens, wait, retry. Wait longer each time (exponential). Add randomness so you don’t synchronize a retry storm (jitter).
from agentmatic.resilience import RetryPolicy
policy = RetryPolicy.exponential(
    max_attempts=3,
    initial_delay=1.0,
    max_delay=30.0,
    jitter=True,
)
agent = Agent.builder("ai").retry_policy(policy).build()
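If you want to see what a policy like this might compute, here is a small sketch of the delay schedule. The doubling factor and the "full jitter" strategy (a uniform draw between 0 and the capped delay) are assumptions chosen for illustration; the source does not say which jitter variant jitter=True selects, and backoff_delays is a hypothetical helper, not an Agentmatic function.

```python
import random


def backoff_delays(max_attempts=3, initial_delay=1.0, max_delay=30.0,
                   jitter=True, rng=random.random):
    """Yield the wait before each retry (max_attempts - 1 waits in total)."""
    for retry in range(max_attempts - 1):
        # Double the delay on every retry, but never exceed max_delay.
        ceiling = min(max_delay, initial_delay * 2 ** retry)
        # Full jitter: draw uniformly from [0, ceiling) so clients spread out.
        yield ceiling * rng() if jitter else ceiling
```

With jitter off, five attempts wait 1, 2, 4, and 8 seconds. With jitter on, each wait lands somewhere below those ceilings, so a thousand clients that failed together no longer retry together.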
Per-tool overrides:
@tool(retry=RetryPolicy.fixed(max_attempts=5, delay=2.0))
def idempotent_lookup(id: str) -> dict: ...
@tool(retry=RetryPolicy.none())
def transfer_funds(amount: float) -> str: ... # don't retry money
Mistakes
- Retrying non-idempotent operations. Refunding twice is worse than refunding once and failing.
- No jitter. 1000 concurrent failures all retry at exactly T+1 s. You DoS the upstream.
- Linear backoff for LLM providers. OpenAI’s rate-limit recovery is slow; you want exponential.
- Retry on 4xx. A 400 Bad Request won’t succeed on retry. Stop.
Sensible defaults
- LLM calls: 3 attempts, exponential, 1 s → 30 s, jitter on.
- Idempotent tools: 5 attempts, exponential, 0.5 s → 10 s, jitter on.
- Non-idempotent tools: no retry (or a single retry for “retry-safe-by-design” tools you’re sure about).
- Always classify the error first: 5xx and timeouts retry; 429 retries with exponential backoff that respects the Retry-After header; every other 4xx fails.
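The classification rule in the last bullet can be sketched as a single function. retry_decision is a hypothetical helper, not part of Agentmatic, and it only parses the integer-seconds form of Retry-After; the header can also carry an HTTP date, which this sketch ignores.

```python
from typing import Optional, Tuple


def retry_decision(status: Optional[int], retry_after: Optional[str] = None,
                   timed_out: bool = False) -> Tuple[bool, Optional[float]]:
    """Return (should_retry, minimum_wait_seconds) for one failed call."""
    if timed_out or status is None:
        # No response at all: timeouts and connection errors are transient.
        return True, None
    if status == 429:
        # Honor Retry-After when it is given as whole seconds.
        header = (retry_after or "").strip()
        return True, float(header) if header.isdigit() else None
    if 500 <= status <= 599:
        return True, None
    # Every other status (400, 401, 404, ...) is deterministic: fail now.
    return False, None
```

When a minimum wait comes back, a reasonable choice is to sleep for the larger of that value and your own backoff delay, so you never retry sooner than the upstream asked.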
Dead-letter queues
The pattern: when a run exhausts retries, persist it. State, context, error trace, last checkpoint. Replay later when the upstream is healthy.
from agentmatic.resilience import DeadLetterQueue
dlq = DeadLetterQueue.postgres(POSTGRES_URL)
agent = Agent.builder("ai").dead_letter_queue(dlq).build()
# Inspect failed runs.
for run in await dlq.list(limit=10):
    print(run.run_id, run.last_error, run.checkpoint_id)
# Replay when ready.
await agent.replay(run.checkpoint_id)
The DLQ stores enough state to resume the agent from where it failed — not from scratch. This is huge for expensive multi-step workflows.
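Here is a toy, in-memory sketch of why resuming from a checkpoint beats starting over. Everything in it is hypothetical (DeadLetter, run_or_dead_letter, replay, and a plain list standing in for the queue); it says nothing about how Agentmatic stores entries, and it assumes retries were already exhausted by the time a step raises.

```python
from dataclasses import dataclass, field
from typing import Callable, Dict, List, Optional

Step = Callable[[Dict[str, object]], None]


@dataclass
class DeadLetter:
    run_id: str
    last_error: str
    checkpoint_id: str
    next_step: int                               # index of the step that failed
    state: Dict[str, object] = field(default_factory=dict)


def run_or_dead_letter(run_id: str, steps: List[Step],
                       dlq: List[DeadLetter]) -> Optional[Dict[str, object]]:
    """Run every step; on failure, park the run in the DLQ and return None."""
    state: Dict[str, object] = {}
    for i, step in enumerate(steps):
        checkpoint = dict(state)                 # shallow snapshot before the step
        try:
            step(state)
        except Exception as exc:
            dlq.append(DeadLetter(run_id, repr(exc), f"{run_id}:{i}", i, checkpoint))
            return None
    return state


def replay(entry: DeadLetter, steps: List[Step]) -> Dict[str, object]:
    """Resume from the failed step using the saved state, not from step 0."""
    state = dict(entry.state)
    for step in steps[entry.next_step:]:
        step(state)
    return state
```

If step 28 of 30 fails, the entry records next_step=27 (zero-based) and the state as of step 27, so a replay pays for three steps instead of thirty.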
Mistakes
- No DLQ, just logs. “We’ll figure out failures later” → users don’t get answers, you have no automated way to recover.
- DLQ for every error. Deterministic failures (bad input) clog the DLQ. They should fail to your error log, not enqueue.
- No retention policy. A DLQ that never drains becomes a swamp.
- No replay tooling. A DLQ you can’t replay from is just a fancy log.
Sensible defaults
- DLQ storage: same backend as your checkpointer (Postgres usually). Reuse the connection pool.
- Retention: 30 days for replayable, 7 days for known-bad.
- Replay tooling: a CLI or admin endpoint, e.g. agentmatic-cli dlq replay --run-id abc.
- Alerting: page on > N DLQ entries per minute (indicates upstream is down, not just a flake).
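The retention and alerting defaults above come down to two small checks. This is a sketch over plain dicts and timestamps; the field names ("kind", "created_at") and both helper functions are assumptions for illustration, not an Agentmatic schema.

```python
from datetime import datetime, timedelta, timezone

# Retention windows from the defaults above.
RETENTION = {
    "replayable": timedelta(days=30),
    "known_bad": timedelta(days=7),
}


def expired(entries, now):
    """Return the entries that have outlived the retention window for their kind."""
    return [e for e in entries if now - e["created_at"] > RETENTION[e["kind"]]]


def should_page(timestamps, now, threshold, window=timedelta(minutes=1)):
    """True when more than `threshold` entries arrived within the last window."""
    recent = [t for t in timestamps if timedelta(0) <= now - t <= window]
    return len(recent) > threshold
```

Run the first check on a schedule to delete what it returns; run the second on every DLQ write, or on a short timer, with the threshold set to your N.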
Putting it together
A production-grade agent uses all three:
agent = (Agent.builder("production")
    .llm(OpenAI("gpt-4o"))
    .tools([search, calculator, send_email])
    .checkpoint(PostgresSaver.from_env())
    .circuit_breaker("openai", failure_threshold=5, cooldown_seconds=30)
    .circuit_breaker("search", failure_threshold=3, cooldown_seconds=10)
    .retry_policy(RetryPolicy.exponential(max_attempts=3, jitter=True))
    .dead_letter_queue(DeadLetterQueue.postgres(POSTGRES_URL))
    .interrupt_before(["send_email"])  # human approval for outbound
    .build())
That’s a ~10-line config. It survives:
- OpenAI 4-hour incident (breaker opens; failed runs land in the DLQ; replay when healthy).
- Search rate-limit (per-tool breaker + retry; circuit independent of LLM).
- Tool flake (retry with backoff handles transients).
- Worker crash (checkpointer + DLQ preserve state).
- Bad LLM decision (HITL interrupt before risky tool).
Why this is in the open-source core
Most frameworks bundle these into a paid tier. Our position: production primitives belong in the open-source core. Teams shouldn’t have to choose between writing it themselves and paying for a SaaS to get the basics right.
We ship the primitives. You run them. Free, forever.