You’ve shipped your agent. It works in dev. It works on staging. Then OpenAI has an incident. Or your search API rate-limits you. Or one tool call in 100 times out. Production is now a series of small fires.

The three primitives that put out those fires: circuit breakers, retry policies, dead-letter queues. They’re old patterns (Netflix’s Hystrix, open-sourced in 2012, popularized circuit breakers), but most AI agent frameworks don’t ship them. Agentmatic does, in the open-source core. Here is what they do and how to use them well.

The failure modes

Three things that go wrong in production agents:

  1. Upstream provider degrades. OpenAI is 4× slower than usual. Half your requests return 503. You’re getting paged.
  2. Tool flakes. Your search API rate-limits at 100 req/min; you have 200 concurrent agent runs.
  3. A long-running graph fails on step 28 of 30. Re-running from scratch is expensive and slow.

Three primitives for three failures.

Circuit breakers

The pattern: after N consecutive failures, open the circuit. While open, fail fast with a known error. After a cooldown, half-open: send one probe call. If it succeeds, close the circuit. If it fails, re-open and back off.

from agentmatic.resilience import CircuitBreaker

agent = (Agent.builder("ai")
    .llm(OpenAI())
    .circuit_breaker("openai", failure_threshold=5, cooldown_seconds=30)
    .circuit_breaker("search", failure_threshold=3, cooldown_seconds=10)
    .build())

Per-service breakers, not per-graph. Your OpenAI errors shouldn’t trip your search circuit; your search rate-limit shouldn’t trip OpenAI.
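
To make the closed, open, and half-open cycle concrete, here is a minimal single-threaded sketch of the pattern. It is not Agentmatic’s implementation: SimpleCircuitBreaker, CircuitOpenError, and the injectable clock are names made up for illustration, and a failed probe simply restarts the same cooldown; it does not back off further.

```python
import time
from typing import Callable, Optional, TypeVar

T = TypeVar("T")


class CircuitOpenError(RuntimeError):
    """Raised when a call is rejected because the circuit is open."""


class SimpleCircuitBreaker:
    """Illustrative breaker (hypothetical; not Agentmatic's implementation)."""

    def __init__(self, failure_threshold: int = 5, cooldown_seconds: float = 30.0,
                 clock: Callable[[], float] = time.monotonic) -> None:
        self.failure_threshold = failure_threshold
        self.cooldown_seconds = cooldown_seconds
        self._clock = clock
        self._consecutive_failures = 0
        self._opened_at: Optional[float] = None  # None means the circuit is closed

    @property
    def state(self) -> str:
        if self._opened_at is None:
            return "closed"
        if self._clock() - self._opened_at >= self.cooldown_seconds:
            return "half_open"  # cooldown elapsed: the next call is a probe
        return "open"

    def call(self, fn: Callable[[], T]) -> T:
        state = self.state
        if state == "open":
            raise CircuitOpenError("circuit open; failing fast")
        try:
            result = fn()
        except Exception:
            self._consecutive_failures += 1
            # A failed probe, or reaching the threshold, (re)opens the circuit.
            if state == "half_open" or self._consecutive_failures >= self.failure_threshold:
                self._opened_at = self._clock()
            raise
        # Any success closes the circuit and resets the count.
        self._consecutive_failures = 0
        self._opened_at = None
        return result
```

One instance per upstream gives you the per-service isolation described above. A concurrent version would also need a lock so that only one probe goes out while half-open.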

Mistakes

  • Single breaker for everything. One slow tool trips the entire agent.
  • Too aggressive a threshold. Opening after a single failure is too sensitive; a flaky upstream network will keep your circuit oscillating.
  • Too long a cooldown. 5 minutes for a transient 503 means 5 minutes of all-fail.
  • No half-open probe. You need a way to test recovery without sending full traffic.

Sensible defaults

  • Failure threshold: 5 consecutive failures.
  • Cooldown: 30 s for LLM providers, 10 s for fast services.
  • Half-open: 1 probe, 60 s evaluation window.
  • Failure definition: HTTP 5xx, timeouts, connection errors. Not ordinary 4xx client errors (those are deterministic; retrying won’t fix them).

Retry with backoff

The pattern: when a transient failure happens, wait, retry. Wait longer each time (exponential). Add randomness so you don’t synchronize a retry storm (jitter).

from agentmatic.resilience import RetryPolicy

policy = RetryPolicy.exponential(
    max_attempts=3,
    initial_delay=1.0,
    max_delay=30.0,
    jitter=True,
)

agent = Agent.builder("ai").retry_policy(policy).build()
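
To see what “exponential with jitter” means in numbers, here is a small sketch of one common way to compute the wait before each retry: double the delay per attempt, cap it, then pick a random point below the cap (“full jitter”). Whether RetryPolicy.exponential uses exactly this formula is an assumption; backoff_delay is a hypothetical helper, not part of Agentmatic.

```python
import random
from typing import Optional


def backoff_delay(attempt: int, initial_delay: float = 1.0, max_delay: float = 30.0,
                  jitter: bool = True, rng: Optional[random.Random] = None) -> float:
    """Seconds to wait before retry number `attempt` (1-based). Hypothetical helper."""
    if attempt < 1:
        raise ValueError("attempt must be >= 1")
    # Double the delay on each attempt, capped at max_delay.
    capped = min(max_delay, initial_delay * 2 ** (attempt - 1))
    if not jitter:
        return capped
    # Full jitter: choose uniformly in [0, capped] so clients spread out
    # and do not all retry at the same instant.
    return (rng or random).uniform(0.0, capped)
```

Without jitter the schedule is 1, 2, 4, 8, 16 seconds and then 30 from the sixth retry on. With max_attempts=3 there are at most two waits, so only the first two values ever apply.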

Per-tool overrides:

@tool(retry=RetryPolicy.fixed(max_attempts=5, delay=2.0))
def idempotent_lookup(id: str) -> dict: ...

@tool(retry=RetryPolicy.none())
def transfer_funds(amount: float) -> str: ...   # don't retry money

Mistakes

  • Retrying non-idempotent operations. Refunding twice is worse than refunding once and failing.
  • No jitter. 1000 concurrent failures all retry at exactly T+1s. You DoS the upstream.
  • Linear backoff for LLM providers. OpenAI’s rate-limit recovery is slow; you want exponential.
  • Retry on 4xx. A 400 Bad Request won’t succeed on retry. Stop.

Sensible defaults

  • LLM calls: 3 attempts, exponential, 1 s → 30 s, jitter on.
  • Idempotent tools: 5 attempts, exponential, 0.5 s → 10 s, jitter on.
  • Non-idempotent tools: no retry (or a single retry for “retry-safe-by-design” tools you’re sure about).
  • Always classify the error first: 5xx and timeouts retry; 429 retries with exponential backoff that respects the Retry-After header; every other 4xx fails immediately.
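
The “classify first” rule can be sketched as a single function that decides whether to retry and how long to wait. This is an illustration under stated assumptions, not Agentmatic’s code: retry_delay is a hypothetical name, the headers mapping is assumed to use the exact key "Retry-After", and only the seconds form of that header is handled (the HTTP-date form is ignored).

```python
from typing import Mapping, Optional


def retry_delay(status: Optional[int], headers: Mapping[str, str],
                backoff: float) -> Optional[float]:
    """Return seconds to wait before retrying, or None to fail immediately.

    `status` is None for timeouts and connection errors (no HTTP response).
    `backoff` is the delay the backoff policy would choose on its own.
    """
    if status is None or 500 <= status <= 599:
        return backoff  # transient: timeout, connection error, or 5xx
    if status == 429:
        # Wait at least as long as the server asks, when it says so in seconds.
        raw = headers.get("Retry-After", "").strip()
        if raw.isdigit():
            return max(backoff, float(raw))
        return backoff
    # Other 4xx are deterministic client errors; anything else is not a failure.
    return None
```

A caller would loop on this: None means stop and surface the error, any number means sleep that long and try again.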

Dead-letter queues

The pattern: when a run exhausts retries, persist it. State, context, error trace, last checkpoint. Replay later when the upstream is healthy.

from agentmatic.resilience import DeadLetterQueue

dlq = DeadLetterQueue.postgres(POSTGRES_URL)

agent = Agent.builder("ai").dead_letter_queue(dlq).build()

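# Inspect failed runs (these awaits must run inside an async function).
failed_runs = await dlq.list(limit=10)
for run in failed_runs:
    print(run.run_id, run.last_error, run.checkpoint_id)

# Replay when ready, one run at a time from its last checkpoint.
for run in failed_runs:
    await agent.replay(run.checkpoint_id)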

The DLQ stores enough state to resume the agent from where it failed, not from scratch. For an expensive multi-step workflow, that means a run that failed on step 28 of 30 resumes at step 28 and does not repeat the first 27.
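
To illustrate what “resume from where it failed” involves, here is a toy in-memory sketch. Everything in it (DeadLetter, InMemoryDLQ, run_from, replay) is hypothetical and far simpler than a real checkpointer-backed DLQ: it skips retries entirely and assumes each step returns a new state dict without mutating its input.

```python
from dataclasses import dataclass, field
from typing import Any, Callable, Dict, List, Optional

Step = Callable[[Dict[str, Any]], Dict[str, Any]]


@dataclass
class DeadLetter:
    run_id: str
    failed_step: int        # index of the step that raised
    state: Dict[str, Any]   # state as of the last completed step
    last_error: str


@dataclass
class InMemoryDLQ:
    entries: List[DeadLetter] = field(default_factory=list)


def run_from(run_id: str, steps: List[Step], state: Dict[str, Any],
             start: int, dlq: InMemoryDLQ) -> Optional[Dict[str, Any]]:
    """Run steps[start:]; on failure, park the run in the DLQ and return None."""
    for index in range(start, len(steps)):
        try:
            state = steps[index](state)
        except Exception as exc:
            dlq.entries.append(DeadLetter(run_id, index, dict(state), repr(exc)))
            return None
    return state


def replay(entry: DeadLetter, steps: List[Step],
           dlq: InMemoryDLQ) -> Optional[Dict[str, Any]]:
    """Resume a parked run at the step that failed, not at step 0."""
    return run_from(entry.run_id, steps, dict(entry.state), entry.failed_step, dlq)
```

The record carries the same four things the pattern calls for: an identifier, the state, the error, and a pointer to where to resume.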

Mistakes

  • No DLQ, just logs. “We’ll figure out failures later” means users don’t get answers and you have no automated way to recover.
  • DLQ for every error. Deterministic failures (bad input) clog the DLQ. They should fail to your error log, not enqueue.
  • No retention policy. A DLQ that never drains becomes a swamp.
  • No replay tooling. A DLQ you can’t replay from is just a fancy log.

Sensible defaults

  • DLQ storage: same backend as your checkpointer (Postgres usually). Reuse the connection pool.
  • Retention: 30 days for replayable, 7 days for known-bad.
  • Replay tooling: a CLI or admin endpoint. agentmatic-cli dlq replay --run-id abc.
  • Alerting: page on > N DLQ entries per minute (indicates upstream is down, not just a flake).
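
The alerting and retention defaults above are simple enough to sketch. This is an assumption-laden illustration, not something Agentmatic ships: DLQRateAlert and is_expired are hypothetical names, timestamps are plain seconds, and N is whatever threshold you choose.

```python
from collections import deque
from typing import Deque


class DLQRateAlert:
    """Fires when more than `threshold` entries arrive within `window_seconds`."""

    def __init__(self, threshold: int, window_seconds: float = 60.0) -> None:
        self.threshold = threshold
        self.window_seconds = window_seconds
        self._timestamps: Deque[float] = deque()

    def record(self, now: float) -> bool:
        """Record one DLQ entry at time `now`; return True if the alert should fire."""
        self._timestamps.append(now)
        # Drop entries that have aged out of the sliding window.
        while self._timestamps and now - self._timestamps[0] >= self.window_seconds:
            self._timestamps.popleft()
        return len(self._timestamps) > self.threshold


def is_expired(age_days: float, replayable: bool) -> bool:
    """Retention from the defaults above: 30 days if replayable, 7 if known-bad."""
    return age_days > (30 if replayable else 7)
```

A sliding window separates a burst (upstream down) from a trickle of isolated flakes, which is the distinction the paging rule is after.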

Putting it together

A production-grade agent uses all three:

agent = (Agent.builder("production")
    .llm(OpenAI("gpt-4o"))
    .tools([search, calculator, send_email])
    .checkpoint(PostgresSaver.from_env())
    .circuit_breaker("openai", failure_threshold=5, cooldown_seconds=30)
    .circuit_breaker("search", failure_threshold=3, cooldown_seconds=10)
    .retry_policy(RetryPolicy.exponential(max_attempts=3, jitter=True))
    .dead_letter_queue(DeadLetterQueue.postgres(POSTGRES_URL))
    .interrupt_before(["send_email"])  # human approval for outbound
    .build())

That’s a ~10-line config. It survives:

  • OpenAI 4-hour incident (breaker opens; failed runs go to the DLQ; replay when healthy).
  • Search rate-limit (per-tool breaker + retry; circuit independent of LLM).
  • Tool flake (retry with backoff handles transients).
  • Worker crash (checkpointer + DLQ preserve state).
  • Bad LLM decision (HITL interrupt before risky tool).
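
How the three layers nest around a single call can be sketched in one function: check the breaker, attempt the call, back off between attempts, and dead-letter the run if nothing worked. This is a simplified illustration of the layering, not Agentmatic’s internals; call_with_resilience and all of its parameters are hypothetical, and the breaker, sleep, and delay logic are passed in as plain callables to keep the example self-contained.

```python
from typing import Any, Callable, List, Tuple


def call_with_resilience(
    call: Callable[[], Any],
    circuit_is_open: Callable[[], bool],
    record_result: Callable[[bool], None],
    max_attempts: int,
    sleep: Callable[[float], None],
    delay_for: Callable[[int], float],
    dead_letters: List[str],
    run_id: str,
) -> Tuple[bool, Any]:
    """Return (True, result) on success, or (False, None) after dead-lettering."""
    last_error = "circuit open"
    for attempt in range(1, max_attempts + 1):
        if circuit_is_open():
            break  # fail fast: do not spend retries on a dead upstream
        try:
            result = call()
        except Exception as exc:
            record_result(False)  # let the breaker count the failure
            last_error = repr(exc)
            if attempt < max_attempts:
                sleep(delay_for(attempt))  # back off before the next attempt
            continue
        record_result(True)
        return True, result
    # Retries exhausted or circuit open: park the run for later replay.
    dead_letters.append(f"{run_id}: {last_error}")
    return False, None
```

The ordering is the point: the breaker is consulted before every attempt, retries handle transients, and the DLQ only sees runs that neither layer could save.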

Why this is in the open-source core

Where frameworks offer these at all, they tend to bundle them into a paid tier. Our position: production primitives belong in the open-source core. Teams shouldn’t have to choose between writing them themselves and paying for a SaaS to get the basics right.

We ship the primitives. You run them. Free, forever.