Do I really need a circuit breaker for an LLM call?

Yes, if you run at any nontrivial scale. Major LLM providers, OpenAI and Anthropic included, have had multi-hour incidents. Without a breaker, your retries keep hammering a degraded provider, which makes their incident worse and burns your budget.

What's a sensible retry policy?

Exponential backoff with jitter, capped at 3 attempts for LLM calls, capped at 5 for idempotent tools, no retries for non-idempotent tools (transfers, writes). Plus a per-service circuit breaker on top.

When should a failed run go to a DLQ?

When all retries are exhausted and the failure is transient (5xx, timeout). Don't DLQ deterministic failures (invalid input, bad tool args) — those go to your error log.

How does Agentmatic do this without configuration overhead?

Builder methods: .circuit_breaker(name, ...), .retry_policy(...), .dead_letter_queue(...). The runtime handles the state machine for breakers, the backoff for retries, and the persistence + replay for DLQs.

Deep dive

Production AI agents need circuit breakers, retry, and dead-letter queues

The three resilience primitives every production agent eventually needs — and the mistakes engineers make when they bolt them on themselves. Plus what 'built-in' looks like in Agentmatic.

Dipankar Sarkar May 25, 2026 9 min read

Tags: production, resilience, circuit-breakers, retry, dlq

You’ve shipped your agent. It works in dev. It works on staging. Then OpenAI has an incident. Or your search API rate-limits. Or one tool call in 100 times out. Production is now a series of small fires.

The three primitives that put out those fires: circuit breakers, retry policies, dead-letter queues. They’re old patterns (Netflix’s Hystrix popularized circuit breakers in the early 2010s), but most AI agent frameworks don’t ship them. Agentmatic does, in the open-source core. This is what they do and how to use them well.

The failure modes

Three things that go wrong in production agents:

Upstream provider degrades. OpenAI is 4× slower than usual. Half your requests return 503. You’re getting paged.
Tool flakes. Your search API rate-limits at 100 req/min; you have 200 concurrent agent runs.
A long-running graph fails on step 28 of 30. Re-running from scratch is expensive and slow.

Three primitives for three failures.

Circuit breakers

The pattern: after N consecutive failures, open the circuit. While open, fail fast with a known error. After a cooldown, half-open: send one probe call. If it succeeds, close the circuit. If it fails, re-open and back off.

from agentmatic.resilience import CircuitBreaker

agent = (Agent.builder("ai")
    .llm(OpenAI())
    .circuit_breaker("openai", failure_threshold=5, cooldown_seconds=30)
    .circuit_breaker("search", failure_threshold=3, cooldown_seconds=10)
    .build())

Per-service breakers, not per-graph. Your OpenAI errors shouldn’t trip your search circuit; your search rate-limit shouldn’t trip OpenAI.

Mistakes

Single breaker for everything. One slow tool trips the entire agent.
Too aggressive a threshold. Opening after a single failure is too sensitive; a flaky upstream network will keep your circuit oscillating.
Too long a cooldown. 5 minutes for a transient 503 means 5 minutes of all-fail.
No half-open probe. You need a way to test recovery without sending full traffic.

Sensible defaults

Failure threshold: 5 consecutive failures.
Cooldown: 30 s for LLM providers, 10 s for fast services.
Half-open: 1 probe, 60 s evaluation window.
Failure definition: HTTP 5xx, timeouts, connection errors. Not 4xx, apart from 429 rate limits (the rest are deterministic; retry won’t fix them).
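
To make the closed, open, half-open cycle concrete, here is a minimal single-threaded sketch of the state machine described above. It is not Agentmatic’s implementation: the SimpleBreaker and CircuitOpenError names and the injectable clock are assumptions made for illustration, and for brevity it counts every exception as a failure instead of classifying errors first.

```python
import time
from typing import Callable, Optional, TypeVar

T = TypeVar("T")


class CircuitOpenError(RuntimeError):
    """Raised when a call is rejected because the circuit is open."""


class SimpleBreaker:
    """Illustrative circuit breaker (hypothetical, not Agentmatic's class)."""

    def __init__(
        self,
        failure_threshold: int = 5,
        cooldown_seconds: float = 30.0,
        clock: Callable[[], float] = time.monotonic,
    ) -> None:
        self.failure_threshold = failure_threshold
        self.cooldown_seconds = cooldown_seconds
        self._clock = clock
        self._failures = 0  # consecutive failures since the last success
        self._opened_at: Optional[float] = None  # None means closed

    @property
    def state(self) -> str:
        if self._opened_at is None:
            return "closed"
        if self._clock() - self._opened_at >= self.cooldown_seconds:
            return "half_open"  # cooldown elapsed: the next call is the probe
        return "open"

    def call(self, fn: Callable[[], T]) -> T:
        state = self.state
        if state == "open":
            raise CircuitOpenError("circuit open; failing fast")
        try:
            result = fn()
        except Exception:
            self._failures += 1
            # A failed probe re-opens immediately; otherwise open at the threshold.
            if state == "half_open" or self._failures >= self.failure_threshold:
                self._opened_at = self._clock()
            raise
        # Any success closes the circuit and resets the failure count.
        self._failures = 0
        self._opened_at = None
        return result
```

One instance per upstream service gives you the per-service isolation discussed earlier. A concurrent version would also need a lock so that only one caller gets to send the half-open probe.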

Retry with backoff

The pattern: when a transient failure happens, wait, retry. Wait longer each time (exponential). Add randomness so you don’t synchronize a retry storm (jitter).

from agentmatic.resilience import RetryPolicy

policy = RetryPolicy.exponential(
    max_attempts=3,
    initial_delay=1.0,
    max_delay=30.0,
    jitter=True,
)

agent = Agent.builder("ai").retry_policy(policy).build()

Per-tool overrides:

@tool(retry=RetryPolicy.fixed(max_attempts=5, delay=2.0))
def idempotent_lookup(id: str) -> dict: ...

@tool(retry=RetryPolicy.none())
def transfer_funds(amount: float) -> str: ...   # don't retry money

Mistakes

Retrying non-idempotent operations. A refund issued twice is worse than a refund that fails once and gets reported.
No jitter. 1000 concurrent failures all retry at exactly T+1s. You DoS the upstream.
Linear backoff for LLM providers. Rate-limit recovery at providers like OpenAI can be slow; you want exponential.
Retry on 4xx other than 429. A 400 Bad Request won’t succeed on retry. Stop.
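
The jitter point is easiest to see in the delay arithmetic. The sketch below computes a capped exponential delay and, when jitter is on, draws uniformly between zero and that cap (the "full jitter" variant). The backoff_delay function and its parameters are hypothetical; how Agentmatic’s jitter=True spreads delays internally is not specified here.

```python
import random
from typing import Optional


def backoff_delay(
    attempt: int,
    initial_delay: float = 1.0,
    max_delay: float = 30.0,
    jitter: bool = True,
    rng: Optional[random.Random] = None,
) -> float:
    """Seconds to wait before retry number `attempt` (1-based)."""
    if attempt < 1:
        raise ValueError("attempt must be >= 1")
    # Double the delay on every attempt, but never exceed max_delay.
    ceiling = min(max_delay, initial_delay * 2 ** (attempt - 1))
    if not jitter:
        return ceiling
    # Full jitter: pick a random point in [0, ceiling] so clients desynchronize.
    return (rng or random.Random()).uniform(0.0, ceiling)
```

Without jitter the schedule is 1 s, 2 s, 4 s, and so on up to the 30 s cap, which is exactly the synchronized pattern that produces a retry storm when many runs fail at once.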

Sensible defaults

LLM calls: 3 attempts, exponential, 1 s → 30 s, jitter on.
Idempotent tools: 5 attempts, exponential, 0.5 s → 10 s, jitter on.
Non-idempotent tools: no retry (or a single retry for “retry-safe-by-design” tools you’re sure about).
Always classify the error first: 5xx and timeouts retry; 429 retries with exponential backoff that respects the Retry-After header; every other 4xx fails.
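
The classification rule above can be sketched as two small functions. The names classify, retry_after_seconds, and next_delay are illustrative and not part of Agentmatic. The parser only handles the delta-seconds form of Retry-After; the HTTP-date form is deliberately ignored in this sketch.

```python
from typing import Mapping, Optional


def classify(status: Optional[int], timed_out: bool = False) -> str:
    """Return 'retry' or 'fail' for one failed attempt."""
    if timed_out or status is None:
        return "retry"  # no response at all: timeout or connection error
    if status == 429 or 500 <= status <= 599:
        return "retry"  # rate limit or server-side failure
    return "fail"  # remaining 4xx are deterministic


def retry_after_seconds(headers: Mapping[str, str]) -> Optional[float]:
    """Parse a delta-seconds Retry-After header; None if absent or unparsable."""
    for name, value in headers.items():
        if name.lower() == "retry-after":
            try:
                return float(max(0, int(value.strip())))
            except ValueError:
                return None  # HTTP-date form is not handled here
    return None


def next_delay(backoff: float, headers: Mapping[str, str]) -> float:
    """Wait at least as long as the server asked for."""
    hinted = retry_after_seconds(headers)
    return backoff if hinted is None else max(backoff, hinted)
```

Taking the larger of the two values means the backoff never undercuts what the server asked for.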

Dead-letter queues

The pattern: when a run exhausts retries, persist it. State, context, error trace, last checkpoint. Replay later when the upstream is healthy.

from agentmatic.resilience import DeadLetterQueue

dlq = DeadLetterQueue.postgres(POSTGRES_URL)

agent = Agent.builder("ai").dead_letter_queue(dlq).build()

# Inspect failed runs.
for run in await dlq.list(limit=10):
    print(run.run_id, run.last_error, run.checkpoint_id)

# Replay when ready. `run` here is the last entry printed by the loop above.
await agent.replay(run.checkpoint_id)

The DLQ stores enough state to resume the agent from where it failed — not from scratch. This is huge for expensive multi-step workflows.

Mistakes

No DLQ, just logs. “We’ll figure out failures later” means users don’t get answers and you have no automated way to recover.
DLQ for every error. Deterministic failures (bad input) clog the DLQ. They should fail to your error log, not enqueue.
No retention policy. A DLQ that never drains becomes a swamp.
No replay tooling. A DLQ you can’t replay from is just a fancy log.
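
The "not every error belongs in the DLQ" rule comes down to a three-way routing decision. A minimal sketch, with hypothetical names (Failure, Destination, route) that are not Agentmatic APIs:

```python
from dataclasses import dataclass
from enum import Enum


class Destination(Enum):
    RETRY = "retry"
    DEAD_LETTER = "dead_letter"
    ERROR_LOG = "error_log"


@dataclass(frozen=True)
class Failure:
    transient: bool  # True for 5xx and timeouts; False for bad input or bad tool args
    attempts_made: int
    max_attempts: int


def route(failure: Failure) -> Destination:
    """Decide where a failed run goes next."""
    if not failure.transient:
        return Destination.ERROR_LOG  # deterministic: replaying would fail again
    if failure.attempts_made < failure.max_attempts:
        return Destination.RETRY  # transient and retry budget remains
    return Destination.DEAD_LETTER  # transient and retries exhausted
```

Checking for deterministic failures first keeps bad-input errors out of both the retry loop and the queue.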

Sensible defaults

DLQ storage: same backend as your checkpointer (Postgres usually). Reuse the connection pool.
Retention: 30 days for replayable, 7 days for known-bad.
Replay tooling: a CLI or admin endpoint. agentmatic-cli dlq replay --run-id abc.
Alerting: page on > N DLQ entries per minute (indicates upstream is down, not just a flake).
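
The retention and alerting defaults above are simple enough to sketch directly. The names below (is_expired, DlqRateAlert) are hypothetical helpers, not Agentmatic APIs, and the threshold N stays yours to pick.

```python
from collections import deque
from datetime import datetime, timedelta
from typing import Deque

# Retention windows from the defaults above, keyed by "is this entry replayable?"
RETENTION = {True: timedelta(days=30), False: timedelta(days=7)}


def is_expired(enqueued_at: datetime, now: datetime, replayable: bool) -> bool:
    """True once an entry has outlived its retention window."""
    return now - enqueued_at > RETENTION[replayable]


class DlqRateAlert:
    """Sliding-window counter: signal when more than `threshold` entries
    arrive within `window_seconds`."""

    def __init__(self, threshold: int, window_seconds: float = 60.0) -> None:
        self.threshold = threshold
        self.window_seconds = window_seconds
        self._arrivals: Deque[float] = deque()

    def record(self, now: float) -> bool:
        """Record one DLQ entry at time `now`; return True if paging is warranted."""
        self._arrivals.append(now)
        cutoff = now - self.window_seconds
        # Drop arrivals that have fallen out of the window.
        while self._arrivals and self._arrivals[0] <= cutoff:
            self._arrivals.popleft()
        return len(self._arrivals) > self.threshold
```

A burst that crosses the threshold within one window points at an upstream outage; isolated entries spread over hours are more likely individual flakes.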

Putting it together

A production-grade agent uses all three:

agent = (Agent.builder("production")
    .llm(OpenAI("gpt-4o"))
    .tools([search, calculator, send_email])
    .checkpoint(PostgresSaver.from_env())
    .circuit_breaker("openai", failure_threshold=5, cooldown_seconds=30)
    .circuit_breaker("search", failure_threshold=3, cooldown_seconds=10)
    .retry_policy(RetryPolicy.exponential(max_attempts=3, jitter=True))
    .dead_letter_queue(DeadLetterQueue.postgres(POSTGRES_URL))
    .interrupt_before(["send_email"])  # human approval for outbound
    .build())

That’s a ~10-line config. It survives:

OpenAI 4-hour incident (breaker opens; failed runs go to the DLQ; replay when healthy).
Search rate-limit (per-tool breaker + retry; circuit independent of LLM).
Tool flake (retry with backoff handles transients).
Worker crash (checkpointer + DLQ preserve state).
Bad LLM decision (HITL interrupt before risky tool).

Why this is in the open-source core

Frameworks that do offer these often bundle them into a paid tier. Our position: production primitives belong in the open-source core. Teams shouldn’t have to choose between writing it themselves and paying for a SaaS to get the basics right.

We ship the primitives. You run them. Free, forever.