Run an agent across a distributed cluster

Coordinator + worker (or P2P mesh) over gRPC, with least-loaded routing and checkpoint hand-off. Ship the same agent code to a cluster without rewriting it.

5 min read · Published May 6, 2026 · Languages: python, typescript, go

The pattern

A single-host event loop maxes out at a few hundred concurrent agent runs. Beyond that you need to spread work across processes, machines, or regions. Agentmatic ships a distributed runtime in the open-source core.

Distributed in one line: Run agentmatic-worker on each box, point your app at the workers with ClusterConfig, and your graph executes across the cluster — no code change to the agent itself.

Workers

On each box:

agentmatic-worker --listen 0.0.0.0:9090 --metrics 0.0.0.0:9091

Coordinator/worker (centralized)

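from agentmatic.cluster import ClusterConfig
# Agent and OpenAI must also be imported; their import paths are not shown in this recipe.

# List every worker started with agentmatic-worker; the port matches --listen.
config = ClusterConfig(
    topology="coordinator-worker",
    transport="grpc",
    workers=["worker-1.internal:9090", "worker-2.internal:9090", "worker-3.internal:9090"],
    load_balancing="least-loaded",
)
# Attaching the cluster config is the only change; the rest of the agent stays the same.
agent = Agent.builder("production").llm(OpenAI()).cluster(config).build()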

The application process is the coordinator. When you schedule a run, it picks the least-loaded worker, and checkpoints stream back to it over gRPC.
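To make least-loaded routing concrete, here is a minimal sketch of the selection rule. It is not Agentmatic's implementation: WorkerLoad, pick_least_loaded and schedule are hypothetical names, and the in-flight counts are made up for illustration.

```python
from dataclasses import dataclass


@dataclass
class WorkerLoad:
    """Hypothetical record of how many runs a worker is executing."""
    address: str
    in_flight: int = 0


def pick_least_loaded(workers: list[WorkerLoad]) -> WorkerLoad:
    """Return the worker with the fewest in-flight runs (first one wins ties)."""
    if not workers:
        raise ValueError("no workers available")
    return min(workers, key=lambda w: w.in_flight)


def schedule(workers: list[WorkerLoad], run_id: str, assignments: dict[str, str]) -> str:
    """Assign run_id to the least-loaded worker and account for the new load."""
    target = pick_least_loaded(workers)
    target.in_flight += 1
    assignments[run_id] = target.address
    return target.address


workers = [
    WorkerLoad("worker-1.internal:9090", in_flight=4),
    WorkerLoad("worker-2.internal:9090", in_flight=1),
    WorkerLoad("worker-3.internal:9090", in_flight=2),
]
assignments: dict[str, str] = {}
first = schedule(workers, "run-a", assignments)   # worker-2 has the fewest runs
second = schedule(workers, "run-b", assignments)  # worker-2 and worker-3 tie at 2; the first listed wins
print(first, second)
```

A real coordinator would also decrement the count when a run finishes and would likely use load reported by the workers rather than a local counter.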

P2P mesh (no central coordinator)

config = ClusterConfig(
    topology="p2p",
    transport="grpc",
    peers=["peer-1:9090", "peer-2:9090", "peer-3:9090"],
    consensus="raft",
)

Raft consensus ensures the peers agree on which one owns each in-flight run. With no coordinator to act as a single point of failure, availability is higher, at the cost of slightly more complex operations.
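When sizing a mesh, it helps to remember that Raft only makes progress while a majority of peers can reach each other. The sketch below is plain arithmetic, not an Agentmatic API; quorum_size and tolerated_failures are illustrative names.

```python
def quorum_size(peer_count: int) -> int:
    """Smallest majority of a cluster with peer_count voting peers."""
    if peer_count < 1:
        raise ValueError("peer_count must be at least 1")
    return peer_count // 2 + 1


def tolerated_failures(peer_count: int) -> int:
    """Number of peers that can be down while a majority is still up."""
    return peer_count - quorum_size(peer_count)


for n in (3, 4, 5):
    print(f"{n} peers: quorum {quorum_size(n)}, tolerates {tolerated_failures(n)} failure(s)")
```

The three-peer mesh above survives one peer failure. A fourth peer adds no extra tolerance, which is why odd peer counts are the usual choice.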

Checkpoint hand-off

When a worker dies mid-run, the cluster reads the last checkpoint from your shared store (Postgres/Redis/S3) and resumes on another worker. Use a shared checkpointer:

.checkpoint(PostgresSaver.from_env())  # any backend except Memory, which other workers cannot read
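The hand-off can be pictured with a small simulation. This is a sketch under stated assumptions, not the runtime's actual code: SharedStore is a dict standing in for Postgres/Redis/S3, and STEPS, WorkerCrashed and run_on_worker are hypothetical names.

```python
from typing import Optional


class SharedStore:
    """Stand-in for a shared checkpoint backend, keyed by run id."""

    def __init__(self) -> None:
        self._data: dict[str, dict] = {}

    def save(self, run_id: str, step: int, state: dict) -> None:
        self._data[run_id] = {"step": step, "state": dict(state)}

    def load(self, run_id: str) -> Optional[dict]:
        return self._data.get(run_id)


class WorkerCrashed(Exception):
    """Raised to simulate a worker dying mid-run."""


STEPS = ["plan", "search", "draft", "review"]  # hypothetical graph nodes


def run_on_worker(name: str, run_id: str, store: SharedStore,
                  crash_before: Optional[int] = None) -> dict:
    """Resume from the last checkpoint (if any) and checkpoint after every step."""
    checkpoint = store.load(run_id)
    start = checkpoint["step"] + 1 if checkpoint else 0
    state = dict(checkpoint["state"]) if checkpoint else {"log": []}
    for i in range(start, len(STEPS)):
        if crash_before == i:
            raise WorkerCrashed(f"{name} died before step {i}")
        state["log"] = state["log"] + [f"{STEPS[i]}@{name}"]
        store.save(run_id, i, state)
    return state


store = SharedStore()
try:
    final = run_on_worker("worker-1", "run-42", store, crash_before=2)
except WorkerCrashed:
    # Another worker picks the run up from the shared store.
    final = run_on_worker("worker-2", "run-42", store)
print(final["log"])
```

Completed steps are not repeated: worker-2 starts at the step after the last saved checkpoint. This only works because both workers read the same store.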

Health probes

With the --listen and --metrics flags shown under Workers, every worker exposes:

:9090 — gRPC for graph execution
:9091/healthz — k8s liveness
:9091/readyz — k8s readiness
:9091/metrics — Prometheus
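Outside Kubernetes, the same endpoints can be polled by hand. The sketch below uses only the Python standard library and assumes the endpoints are served over plain HTTP and answer with a 2xx status when healthy; probe_urls and is_ok are illustrative helpers, not part of Agentmatic.

```python
import urllib.error
import urllib.request


def probe_urls(host: str, metrics_port: int = 9091) -> dict[str, str]:
    """Build the probe URLs for one worker (port taken from --metrics)."""
    base = f"http://{host}:{metrics_port}"
    return {
        "liveness": f"{base}/healthz",
        "readiness": f"{base}/readyz",
        "metrics": f"{base}/metrics",
    }


def is_ok(url: str, timeout: float = 2.0) -> bool:
    """True if the endpoint answers with a 2xx status; False on any error."""
    try:
        with urllib.request.urlopen(url, timeout=timeout) as resp:
            return 200 <= resp.status < 300
    except (urllib.error.URLError, OSError, ValueError):
        # Covers HTTP error statuses, refused connections, timeouts, bad URLs.
        return False


urls = probe_urls("worker-1.internal")
print(urls["liveness"])
```

Calling is_ok(urls["readiness"]) before adding a worker to ClusterConfig is one way to avoid routing runs to a box that is still starting up.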