Resilience
Production primitives, in the box.
LangGraph bundles circuit breakers, retry, and dead-letter queues (DLQs) into a paid SaaS tier. Agentmatic ships them in the open-source core: the same primitives, MIT-licensed and self-hosted.
Three primitives:
- Circuit breakers — open the circuit after N consecutive failures; fail fast for a cooldown window; half-open with a probe call.
- Retry policies — exponential backoff with jitter, capped attempts, per-tool overrides.
- Dead-letter queues — failed runs land in a durable queue with full state, replayable later.
Circuit breakers
from agentmatic.resilience import CircuitBreaker

# Agent, OpenAI, and the search tool are assumed to be imported or defined elsewhere.
agent = (Agent.builder("ai")
    .llm(OpenAI())
    .tools([search])
    .circuit_breaker("openai", failure_threshold=5, cooldown_seconds=30)
    .circuit_breaker("search", failure_threshold=3, cooldown_seconds=10)
    .build())

Breakers are per-service: your OpenAI errors don't trip your search circuit. After the cooldown, a half-open probe call tests the service, and the circuit closes automatically if the probe succeeds.
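To make the closed, open, and half-open states concrete, here is a minimal standalone sketch of the state machine. It is not Agentmatic's implementation: `SketchBreaker`, its `clock` parameter, and the `RuntimeError` raised while open are hypothetical names chosen for illustration, and only `failure_threshold` and `cooldown_seconds` mirror the parameters shown above.

```python
import time


class SketchBreaker:
    """Illustrative breaker (hypothetical, not the Agentmatic class)."""

    def __init__(self, failure_threshold=5, cooldown_seconds=30.0,
                 clock=time.monotonic):
        self.failure_threshold = failure_threshold
        self.cooldown_seconds = cooldown_seconds
        self.clock = clock      # injectable so the sketch is easy to test
        self.failures = 0       # consecutive failures seen so far
        self.opened_at = None   # None means the circuit is closed

    @property
    def state(self):
        if self.opened_at is None:
            return "closed"
        if self.clock() - self.opened_at >= self.cooldown_seconds:
            return "half-open"
        return "open"

    def call(self, fn, *args, **kwargs):
        if self.state == "open":
            # Fail fast during the cooldown without touching the service.
            raise RuntimeError("circuit open")
        try:
            # In the half-open state this call acts as the probe.
            result = fn(*args, **kwargs)
        except Exception:
            self.failures += 1
            if (self.state == "half-open"
                    or self.failures >= self.failure_threshold):
                # Open (or reopen) the circuit and restart the cooldown.
                self.opened_at = self.clock()
            raise
        # Any success resets the failure count and closes the circuit.
        self.failures = 0
        self.opened_at = None
        return result
```

Keeping one such object per service name is what would give the isolation described above: failures counted against "openai" never touch the counter for "search".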
Retry with backoff
from agentmatic.resilience import RetryPolicy

policy = RetryPolicy.exponential(
    max_attempts=3,
    initial_delay=1.0,
    max_delay=30.0,
    jitter=True,
)

agent = Agent.builder("ai").llm(OpenAI()).retry_policy(policy).build()

Retries happen at the node-execution level: a failed LLM call or a failed tool call is retried on its own, and the graph state stays correct. Making side effects idempotent is your responsibility; the framework guarantees that each retry re-executes from the same checkpoint.
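As a rough illustration of what "exponential backoff with jitter" means for the wait between attempts, the sketch below computes a delay schedule from the same four parameters. The doubling factor and the "full jitter" strategy (a uniform draw between zero and the capped delay) are assumptions; the source does not say which multiplier or jitter scheme Agentmatic uses, and `backoff_delays` is a hypothetical helper, not part of the library.

```python
import random


def backoff_delays(max_attempts=3, initial_delay=1.0, max_delay=30.0,
                   jitter=True, rng=random):
    """Return the wait before each retry (max_attempts - 1 retries)."""
    delays = []
    for retry in range(max_attempts - 1):
        # Assumed policy: double the delay on each retry, capped at max_delay.
        ceiling = min(max_delay, initial_delay * 2 ** retry)
        # Assumed "full jitter": a uniform draw in [0, ceiling] keeps many
        # clients from retrying in lockstep.
        delays.append(rng.uniform(0, ceiling) if jitter else ceiling)
    return delays


print(backoff_delays(jitter=False))  # -> [1.0, 2.0]
```

With `max_attempts=3` there are two retries, so two waits; without the cap, the schedule would keep doubling until it dwarfed the cooldown of any breaker in front of it.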
Dead-letter queues
from agentmatic.resilience import DeadLetterQueue

dlq = DeadLetterQueue.sqlite("dlq.db")  # or .postgres(...) / .redis(...) / .s3(...)

agent = (Agent.builder("ai")
    .llm(OpenAI())
    .dead_letter_queue(dlq)
    .build())

# A run that exhausts retries lands in dlq with full state + error trace.
# The awaits below must run inside an async function.
failed_runs = await dlq.list(limit=10)
for run in failed_runs:
    print(run.run_id, run.last_error, run.checkpoint_id)
    # Replay once the upstream is healthy.
    await agent.replay(run.checkpoint_id)

When this matters
- LLM provider outages. Anthropic and OpenAI both have multi-hour incidents quarterly. Circuit breakers prevent your retry storm from making it worse.
- Tool flakiness. Search APIs rate-limit. Database connections drop. Retry with backoff smooths it.
- Long-running graphs. A 30-step graph that fails on step 28 doesn't restart from zero — the DLQ holds the state at step 27.
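The long-running-graph case in the list above can be sketched with a toy in-memory model: state is saved after each successful step, a failure parks the last good state in a queue, and replay resumes at the failed step instead of step zero. Everything here (`FailedRun`, `run_graph`, `replay`, a plain list standing in for the queue) is hypothetical and only illustrates the idea; the real DLQ is durable and its record format is not described in the source.

```python
from dataclasses import dataclass


@dataclass
class FailedRun:
    run_id: str
    last_error: str
    next_step: int  # index of the step that failed
    state: dict     # state as of the last successful step


def run_graph(run_id, steps, dlq, state=None, start=0):
    """Run steps in order; on failure, park the last good state in dlq."""
    state = dict(state or {})
    for i in range(start, len(steps)):
        try:
            # Each step gets a shallow copy, so a step that fails midway
            # cannot replace top-level keys of the saved state.
            state = steps[i](dict(state))
        except Exception as exc:
            dlq.append(FailedRun(run_id, repr(exc), i, state))
            return None
    return state


def replay(failed, steps, dlq):
    """Resume at the failed step with the saved state, not from step 0."""
    return run_graph(failed.run_id, steps, dlq,
                     failed.state, failed.next_step)
```

In this model, steps before the failure are never re-executed on replay, which is why idempotency only matters for the step that failed and the ones after it.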
Observability
Every circuit state change, retry attempt, and DLQ enqueue is emitted as an OpenTelemetry span and a structured log line. Wire it to Datadog, Honeycomb, or Tempo. agent.metrics() returns a snapshot of every breaker and counter.