Resilience

Production primitives, in the box.

LangGraph bundles circuit breakers, retry, and DLQs into a paid SaaS tier. Agentmatic ships them in the open-source core. Same primitives, MIT-licensed, self-hosted.

Three primitives:

Circuit breakers — open the circuit after N consecutive failures; fail fast for a cooldown window; half-open with a probe call.
Retry policies — exponential backoff with jitter, capped attempts, per-tool overrides.
Dead-letter queues — failed runs land in a durable queue with full state, replayable later.

Circuit breakers

from agentmatic.resilience import CircuitBreaker

# Agent, OpenAI, and the search tool are assumed to be imported or defined elsewhere.

agent = (Agent.builder("ai")
    .llm(OpenAI())
    .tools([search])
    .circuit_breaker("openai", failure_threshold=5, cooldown_seconds=30)
    .circuit_breaker("search", failure_threshold=3, cooldown_seconds=10)
    .build())

Breakers are per-service: your OpenAI errors don't trip your search circuit. After the cooldown, a breaker goes half-open and lets a probe call through; a successful probe closes the circuit, so recovery is automatic.
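To make the closed, open, and half-open states concrete, here is a minimal single-threaded sketch of the state machine described above. It is not Agentmatic's implementation: the class name SimpleCircuitBreaker, the CircuitOpenError exception, and the injectable clock parameter are all hypothetical and exist only for illustration.

```python
import time
from typing import Any, Callable, Optional


class CircuitOpenError(RuntimeError):
    """Raised when a call is rejected because the circuit is open (hypothetical name)."""


class SimpleCircuitBreaker:
    """Illustrative breaker; an assumption-laden sketch, not the library's code."""

    def __init__(self, failure_threshold: int, cooldown_seconds: float,
                 clock: Callable[[], float] = time.monotonic) -> None:
        self.failure_threshold = failure_threshold
        self.cooldown_seconds = cooldown_seconds
        self._clock = clock              # injectable so the cooldown can be simulated
        self._failures = 0               # consecutive failures seen so far
        self._opened_at: Optional[float] = None

    @property
    def state(self) -> str:
        if self._opened_at is None:
            return "closed"
        if self._clock() - self._opened_at >= self.cooldown_seconds:
            return "half_open"           # cooldown elapsed: allow a probe call
        return "open"

    def call(self, fn: Callable[..., Any], *args: Any, **kwargs: Any) -> Any:
        state = self.state
        if state == "open":
            raise CircuitOpenError("circuit open; failing fast")
        try:
            result = fn(*args, **kwargs)
        except Exception:
            self._failures += 1
            # A failed probe, or reaching the threshold, (re)opens the circuit.
            if state == "half_open" or self._failures >= self.failure_threshold:
                self._opened_at = self._clock()
            raise
        # Any success closes the circuit and resets the consecutive-failure count.
        self._failures = 0
        self._opened_at = None
        return result
```

One breaker instance per upstream service gives the isolation described above. A production breaker would also need locking and a limit on concurrent probes, which this sketch leaves out.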

Retry with backoff

from agentmatic.resilience import RetryPolicy

policy = RetryPolicy.exponential(
    max_attempts=3,
    initial_delay=1.0,
    max_delay=30.0,
    jitter=True,
)

agent = Agent.builder("ai").llm(OpenAI()).retry_policy(policy).build()

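Retries happen at the node-execution level: a failed LLM call or tool call is retried on its own, and the graph state stays consistent. The framework guarantees re-execution from the same checkpoint, but idempotency is your responsibility. A retried node with side effects (sending an email, charging a card) may perform them twice unless the tool itself deduplicates.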
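The parameters above map onto a simple delay schedule. The sketch below shows one plausible reading, assuming the delay doubles on each attempt and that jitter means "full jitter" (a uniform draw between zero and the capped delay). The source does not say which jitter strategy Agentmatic uses, and backoff_delay and retry_call are hypothetical helper names, not library functions.

```python
import random
import time
from typing import Callable, Optional, TypeVar

T = TypeVar("T")


def backoff_delay(attempt: int, initial_delay: float, max_delay: float,
                  jitter: bool, rng: Optional[random.Random] = None) -> float:
    """Delay to wait after failed attempt number `attempt` (1-based)."""
    # Double the delay each attempt, capped at max_delay.
    delay = min(max_delay, initial_delay * 2 ** (attempt - 1))
    if jitter:
        # Assumed strategy: full jitter, uniform in [0, delay].
        source = rng if rng is not None else random
        delay = source.uniform(0.0, delay)
    return delay


def retry_call(fn: Callable[[], T], max_attempts: int = 3,
               initial_delay: float = 1.0, max_delay: float = 30.0,
               jitter: bool = True,
               sleep: Callable[[float], None] = time.sleep) -> T:
    """Call fn until it succeeds or max_attempts is exhausted."""
    for attempt in range(1, max_attempts + 1):
        try:
            return fn()
        except Exception:
            if attempt == max_attempts:
                raise                    # out of attempts: surface the last error
            sleep(backoff_delay(attempt, initial_delay, max_delay, jitter))
    raise ValueError("max_attempts must be at least 1")
```

With max_attempts=3 and jitter disabled, a call that fails twice waits 1.0 s and then 2.0 s before the third attempt. Jitter spreads those waits out so that many clients recovering at once do not retry in lockstep.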

Dead-letter queues

import asyncio

from agentmatic.resilience import DeadLetterQueue

dlq = DeadLetterQueue.sqlite("dlq.db")  # or .postgres(...) / .redis(...) / .s3(...)

agent = (Agent.builder("ai")
    .llm(OpenAI())
    .dead_letter_queue(dlq)
    .build())

async def replay_failed_runs():
    # A run that exhausts retries lands in dlq with full state + error trace.
    failed_runs = await dlq.list(limit=10)
    for run in failed_runs:
        print(run.run_id, run.last_error, run.checkpoint_id)
        # Replay later when the upstream is healthy.
        await agent.replay(run.checkpoint_id)

# `await` needs a running event loop; outside a notebook, drive it with asyncio.run.
asyncio.run(replay_failed_runs())
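For intuition about what a durable queue "with full state" might store, here is a small sketch built on Python's standard sqlite3 module. The table layout, the class name SqliteDeadLetterSketch, and its methods are assumptions made for illustration. They are not Agentmatic's schema or API, and the sketch is synchronous for brevity.

```python
import json
import sqlite3
from typing import Any, Dict, List


class SqliteDeadLetterSketch:
    """Hypothetical dead-letter store: one row per failed run."""

    def __init__(self, path: str = ":memory:") -> None:
        self._conn = sqlite3.connect(path)
        self._conn.execute(
            "CREATE TABLE IF NOT EXISTS dead_letters ("
            "run_id TEXT PRIMARY KEY, checkpoint_id TEXT NOT NULL, "
            "last_error TEXT NOT NULL, state_json TEXT NOT NULL)"
        )
        self._conn.commit()

    def enqueue(self, run_id: str, checkpoint_id: str,
                error: Exception, state: Dict[str, Any]) -> None:
        # Persist the last good checkpoint, the error, and the serialized state.
        self._conn.execute(
            "INSERT OR REPLACE INTO dead_letters VALUES (?, ?, ?, ?)",
            (run_id, checkpoint_id, repr(error), json.dumps(state)),
        )
        self._conn.commit()

    def list_failed(self, limit: int = 10) -> List[Dict[str, Any]]:
        rows = self._conn.execute(
            "SELECT run_id, checkpoint_id, last_error, state_json "
            "FROM dead_letters ORDER BY rowid LIMIT ?",
            (limit,),
        ).fetchall()
        return [
            {"run_id": r[0], "checkpoint_id": r[1],
             "last_error": r[2], "state": json.loads(r[3])}
            for r in rows
        ]

    def remove(self, run_id: str) -> None:
        # Call after a successful replay so the run is not replayed twice.
        self._conn.execute("DELETE FROM dead_letters WHERE run_id = ?", (run_id,))
        self._conn.commit()
```

Storing the checkpoint ID alongside the serialized state is what makes replay possible: the run can resume from the last good step instead of starting over.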

When this matters

LLM provider outages. Major providers, including Anthropic and OpenAI, have had multi-hour incidents. Circuit breakers keep your retry storm from making an outage worse.
Tool flakiness. Search APIs rate-limit. Database connections drop. Retry with backoff smooths it.
Long-running graphs. A 30-step graph that fails on step 28 doesn't restart from zero — the DLQ holds the state at step 27.

Observability

Every circuit state change, retry attempt, and DLQ enqueue is emitted as an OpenTelemetry span and a structured log line. Wire it to Datadog, Honeycomb, or Tempo. agent.metrics() returns a snapshot of every breaker and counter.

