Run an agent across a distributed cluster
Coordinator + worker (or P2P mesh) over gRPC, with least-loaded routing and checkpoint hand-off. Ship the same agent code to a cluster without rewriting it.
5 min read · Published May 6, 2026 · Languages: python, typescript, go
The pattern
A single-host event loop maxes out at a few hundred concurrent agent runs. Beyond that you need to spread work across processes, machines, or regions. Agentmatic ships a distributed runtime in the open-source core.
Distributed in one line: Run agentmatic-worker on each box, point your app at the workers with ClusterConfig, and your graph executes across the cluster — no code change to the agent itself.
Workers
On each box:
agentmatic-worker --listen 0.0.0.0:9090 --metrics 0.0.0.0:9091
Coordinator/worker (centralized)
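from agentmatic.cluster import ClusterConfig

# Agent and OpenAI are assumed to be imported already; this recipe does not show their import paths.
config = ClusterConfig(
    topology="coordinator-worker",
    transport="grpc",
    # One entry per agentmatic-worker process, on its gRPC port
    workers=["worker-1.internal:9090", "worker-2.internal:9090", "worker-3.internal:9090"],
    load_balancing="least-loaded",
)
agent = Agent.builder("production").llm(OpenAI()).cluster(config).build()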
The application process is the coordinator. When you schedule a run, it picks the least-loaded worker, and checkpoints stream back over gRPC as the run progresses.
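To make least-loaded routing concrete, here is a minimal sketch of the selection step. It is not Agentmatic's implementation: the `WorkerLoad` class, its fields, and `pick_least_loaded` are hypothetical names chosen for illustration, and a real coordinator would presumably refresh load figures from the workers rather than keep them in a local list.

```python
from dataclasses import dataclass


@dataclass
class WorkerLoad:
    """Hypothetical view of one worker as a coordinator might track it."""
    address: str
    in_flight: int = 0    # runs currently executing on this worker
    healthy: bool = True  # set to False once readiness checks fail


def pick_least_loaded(workers: list[WorkerLoad]) -> WorkerLoad:
    """Return the healthy worker with the fewest in-flight runs.

    Ties go to the worker listed first, because min() keeps the first minimum.
    """
    candidates = [w for w in workers if w.healthy]
    if not candidates:
        raise RuntimeError("no healthy workers available")
    return min(candidates, key=lambda w: w.in_flight)


workers = [
    WorkerLoad("worker-1.internal:9090", in_flight=4),
    WorkerLoad("worker-2.internal:9090", in_flight=1),
    WorkerLoad("worker-3.internal:9090", in_flight=0, healthy=False),
]

chosen = pick_least_loaded(workers)
chosen.in_flight += 1  # record the newly scheduled run
print(chosen.address)  # worker-2.internal:9090
```

Note that worker-3 has the lowest count but is skipped because it is marked unhealthy, which is why routing decisions usually need both a load figure and a readiness signal.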
P2P mesh (no central coordinator)
config = ClusterConfig(
    topology="p2p",
    transport="grpc",
    peers=["peer-1:9090", "peer-2:9090", "peer-3:9090"],
    consensus="raft",
)
Raft consensus ensures the cluster agrees on which peer owns each in-flight run. The trade-off is higher availability in exchange for slightly more complex operations.
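When sizing a mesh, it helps to remember that Raft in general only makes progress while a majority of peers can reach each other. The sketch below works through that arithmetic; it describes the Raft majority rule, not anything specific to how Agentmatic configures it, and the function names are illustrative.

```python
def quorum_size(peers: int) -> int:
    """Smallest number of peers that forms a majority."""
    if peers < 1:
        raise ValueError("a cluster needs at least one peer")
    return peers // 2 + 1


def tolerated_failures(peers: int) -> int:
    """Peers that can be down while a majority is still reachable."""
    return peers - quorum_size(peers)


for n in (3, 4, 5):
    print(n, quorum_size(n), tolerated_failures(n))
# 3 2 1
# 4 3 1
# 5 3 2
```

A three-peer mesh like the one configured above survives the loss of one peer. Adding a fourth peer does not raise that number, which is why odd cluster sizes are the usual choice.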
Checkpoint hand-off
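When a worker dies mid-run, the cluster reads the last checkpoint from your shared store (Postgres/Redis/S3) and resumes on another worker. This only works if every worker can reach the store; an in-memory checkpointer lives inside a single process and is lost with it. Use a shared checkpointer: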
.checkpoint(PostgresSaver.from_env()) # any backend that's not Memory
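The hand-off idea can be sketched without any Agentmatic APIs. In the example below a plain dict stands in for the shared store, and `run_steps`, `make_step`, and the `fail_before` parameter are hypothetical helpers used only to simulate a worker dying partway through a run. It illustrates the resume-from-last-checkpoint pattern under those assumptions, not the library's actual mechanism.

```python
from typing import Callable, Optional

# Stand-in for a shared store such as Postgres, Redis, or S3.
# Maps run_id -> (index of the last completed step, state after that step).
SharedStore = dict[str, tuple[int, dict]]
Step = Callable[[dict], dict]

executed: list[str] = []  # records which steps actually ran, across all workers


def make_step(name: str) -> Step:
    def step(state: dict) -> dict:
        executed.append(name)
        return {**state, name: "done"}
    return step


def run_steps(run_id: str, steps: list[Step], store: SharedStore,
              fail_before: Optional[int] = None) -> dict:
    """Run steps in order, checkpointing after each one.

    If the store already holds a checkpoint for run_id, execution resumes
    at the step after the last completed one.
    """
    last_done, state = store.get(run_id, (-1, {}))
    state = dict(state)
    for i in range(last_done + 1, len(steps)):
        if fail_before == i:
            raise RuntimeError(f"worker died before step {i}")  # simulated crash
        state = steps[i](state)
        store[run_id] = (i, dict(state))  # checkpoint after every step
    return state


steps = [make_step("plan"), make_step("draft"), make_step("review")]
store: SharedStore = {}

try:
    run_steps("run-42", steps, store, fail_before=2)  # "worker A" dies before step 2
except RuntimeError:
    pass

final = run_steps("run-42", steps, store)  # "worker B" resumes from the checkpoint
print(executed)       # ['plan', 'draft', 'review']
print(sorted(final))  # ['draft', 'plan', 'review']
```

Each step ran exactly once even though two workers handled the run. The second worker skipped the steps that were already checkpointed, which is only possible because both workers read from the same store.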
Health probes
Every worker exposes:
:9090 - gRPC for graph execution
:9091/healthz - k8s liveness
:9091/readyz - k8s readiness
:9091/metrics - Prometheus
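Outside Kubernetes, the same endpoints can be polled by hand. The following is a small sketch using only the Python standard library. It assumes the probes are served over plain HTTP on the metrics port and that a 2xx status means success, which is the usual convention but is not stated above. `probe` and `worker_status` are illustrative names.

```python
import urllib.error
import urllib.request


def probe(url: str, timeout: float = 2.0) -> bool:
    """Return True if the endpoint answers with a 2xx status within the timeout."""
    try:
        with urllib.request.urlopen(url, timeout=timeout) as resp:
            return 200 <= resp.status < 300
    except (urllib.error.URLError, OSError):
        # Connection refused, timeouts, DNS failures, and HTTP error statuses
        return False


def worker_status(host: str, metrics_port: int = 9091) -> dict[str, bool]:
    """Check the liveness and readiness endpoints of one worker."""
    base = f"http://{host}:{metrics_port}"
    return {
        "live": probe(f"{base}/healthz"),
        "ready": probe(f"{base}/readyz"),
    }
```

Calling `worker_status("worker-1.internal")` returns a dict such as `{"live": True, "ready": False}`, which would describe a worker whose process is up but which is not yet accepting runs.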