When do I need a cluster?

When a single host can't keep up. Symptoms: queue latency rises during peak, worker memory keeps growing across long-running graphs, or you need geo-distributed agents for data residency.

Coordinator/worker or P2P?

Coordinator/worker is simpler and 95% of teams should pick it. P2P (with Raft) is for high-availability requirements where no single coordinator is acceptable.

What about Kubernetes?

Agentmatic clusters run great under k8s. The worker is a stateless container; the coordinator is your application code. We ship Helm charts but you can also drop the binary into your existing stack.

Where does the checkpointer fit?

Use a shared checkpointer (Postgres, Redis, or S3) so any worker can resume a run when its predecessor dies. Memory-only checkpointers don't work in clusters.

Deep dive

Distributed AI agents: self-hosted at scale without a SaaS dependency

How to run an AI agent runtime across a cluster, in your VPC, without paying a platform tier. Coordinator/worker, P2P mesh, checkpoint hand-off, observability — all from open-source primitives.

Dipankar Sarkar · May 27, 2026 · 9 min read

Tags: distributed, self-hosted, production, infrastructure

You shipped your agent. Traffic grew. Single-host event loops are queuing. Memory crept up under long-running graphs. You need to scale out.

This is the playbook for running Agentmatic on a cluster, in your own infrastructure, without paying for a SaaS control plane.

When you need a cluster

Three honest symptoms:

Queue latency rising during peak. Your single-host event loop is the bottleneck.
Worker memory creeping up. Long-running graphs accumulate state; restarts are expensive.
Geo / residency requirements. You need agents to run only on nodes in a specific region, such as the EU.

If you aren't seeing those symptoms, scale up before you scale out. A single c7i.4xlarge handles a lot of concurrent agent runs with the Rust runtime. Don’t add cluster complexity unless you have to.

Coordinator/worker topology

The simpler of the two patterns. One coordinator (your app) talks to N workers over gRPC. The coordinator decides which worker runs a graph; the worker checkpoints back to a shared store.

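from agentmatic.cluster import ClusterConfig

# Assumes Agent and OpenAI are already imported from the Agentmatic package.
config = ClusterConfig(
    topology="coordinator-worker",
    transport="grpc",
    # host:port of each worker's gRPC listener
    workers=["worker-1:9090", "worker-2:9090", "worker-3:9090"],
    load_balancing="least-loaded",  # or "round-robin", "consistent-hash"
)
# Attach the cluster config when building the agent.
agent = Agent.builder("prod").llm(OpenAI()).cluster(config).build()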

On each worker box:

agentmatic-worker --listen 0.0.0.0:9090 --metrics 0.0.0.0:9091

The worker is stateless. State lives in the checkpointer.

Peer-to-peer topology

For when “what if the coordinator dies” matters. Workers form a Raft cluster; consensus decides which peer owns each run. Higher availability, more complex.

config = ClusterConfig(
    topology="p2p",
    peers=["peer-1:9090", "peer-2:9090", "peer-3:9090"],
    consensus="raft",
)

Choose this if your uptime SLA is above 99.95% and you can’t accept coordinator restart windows.
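When sizing a Raft peer group, the quorum arithmetic is worth keeping in mind. The sketch below is plain Python, not part of Agentmatic; the function names are hypothetical. It only encodes the standard majority rule that Raft relies on.

```python
def raft_quorum(cluster_size: int) -> int:
    """Votes needed for a majority in a Raft cluster of the given size."""
    if cluster_size < 1:
        raise ValueError("cluster_size must be at least 1")
    return cluster_size // 2 + 1


def tolerated_failures(cluster_size: int) -> int:
    """Peers that can be down while a majority is still reachable."""
    return cluster_size - raft_quorum(cluster_size)


# Three peers need 2 votes and survive 1 failure; five need 3 and survive 2.
for n in (3, 4, 5):
    print(f"{n} peers: quorum={raft_quorum(n)}, tolerates={tolerated_failures(n)}")
```

An even-sized group tolerates no more failures than the odd size just below it, which is why peer counts of 3 or 5 are the usual choice.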

The shared checkpointer

Critical: every worker needs to read/write the same checkpoint store. If you use MemorySaver, only the local worker can resume the run. Use Postgres, Redis, or S3.

# Assumes PostgresSaver is imported and config is the ClusterConfig from above.
agent = (Agent.builder("prod")
    .checkpoint(PostgresSaver.from_env())  # or Redis / S3
    .cluster(config)
    .build())

This is what enables checkpoint hand-off: when worker A crashes mid-run, worker B reads the last checkpoint from Postgres and resumes.
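To make the hand-off concrete, here is a toy simulation in plain Python. It does not use Agentmatic: `SharedStore`, `run_graph`, and `WorkerCrashed` are hypothetical names, and an in-process dict stands in for Postgres. It is a sketch of the idea, assuming a checkpoint is written after every node.

```python
import copy


class SharedStore:
    """Stand-in for Postgres/Redis/S3: thread_id -> (next_step, state)."""

    def __init__(self):
        self._data = {}

    def write(self, thread_id, next_step, state):
        # deepcopy mimics serialising the state to an external store
        self._data[thread_id] = (next_step, copy.deepcopy(state))

    def read(self, thread_id):
        return copy.deepcopy(self._data.get(thread_id, (0, {})))


class WorkerCrashed(Exception):
    pass


def run_graph(worker, store, thread_id, nodes, crash_before=None):
    """Run nodes in order from the last checkpoint, saving after each node."""
    step, state = store.read(thread_id)
    while step < len(nodes):
        if crash_before == step:
            raise WorkerCrashed(f"{worker} died before step {step}")
        name, fn = nodes[step]
        state = fn(state)
        state.setdefault("ran_on", []).append((name, worker))
        step += 1
        store.write(thread_id, step, state)
    return state


def demo_handoff():
    nodes = [
        ("fetch", lambda s: {**s, "docs": 3}),
        ("summarize", lambda s: {**s, "summary": f"{s['docs']} docs"}),
        ("publish", lambda s: {**s, "published": True}),
    ]
    store = SharedStore()
    try:
        run_graph("worker-a", store, "thread-1", nodes, crash_before=2)
    except WorkerCrashed:
        pass  # worker A is gone; its last checkpoint is still in the store
    return run_graph("worker-b", store, "thread-1", nodes)
```

In this sketch worker B never re-runs the steps worker A completed, because it starts from the step index saved in the store. If the store were local to worker A, worker B would start again from step 0.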

Load balancing strategies

least-loaded — sends each new run to the worker with the lowest active-run count. Best general default.
round-robin — fair, predictable. Use when graph cost is uniform.
consistent-hash — routes runs with the same thread_id to the same worker. Useful for hot-checkpoint locality but loses some availability.
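The three strategies above can be sketched in a few lines of plain Python. This is an illustration of the general techniques, not Agentmatic's implementation; every name below (`least_loaded`, `round_robin`, `ConsistentHashRing`, `vnodes`) is hypothetical.

```python
import bisect
import hashlib
import itertools


def least_loaded(active_runs):
    """Pick the worker with the fewest active runs (ties go to the lowest name)."""
    return min(sorted(active_runs), key=lambda w: active_runs[w])


def round_robin(workers):
    """Return an iterator that cycles through the workers in order."""
    return itertools.cycle(workers)


class ConsistentHashRing:
    """Hash ring with virtual nodes; the same thread_id maps to the same worker."""

    def __init__(self, workers, vnodes=64):
        self._ring = sorted(
            (self._hash(f"{w}#{i}"), w) for w in workers for i in range(vnodes)
        )
        self._keys = [h for h, _ in self._ring]

    @staticmethod
    def _hash(key):
        return int.from_bytes(hashlib.sha256(key.encode()).digest()[:8], "big")

    def pick(self, thread_id):
        # First ring position clockwise from the key's hash, wrapping at the end.
        idx = bisect.bisect_right(self._keys, self._hash(thread_id)) % len(self._ring)
        return self._ring[idx][1]
```

With a ring like this, removing a worker only moves the threads that worker owned; every other thread keeps its assignment, which is what preserves checkpoint locality.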

Health probes

Every worker exposes:

:9090 — gRPC for graph execution
:9091/healthz — k8s liveness
:9091/readyz — k8s readiness
:9091/metrics — Prometheus
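Outside Kubernetes, you may want to check these endpoints yourself, for example from a deploy script. The helper below is a minimal sketch using only the Python standard library; `probe` is a hypothetical name, and it assumes the endpoints answer HTTP 200 when healthy, which the source does not spell out.

```python
import urllib.error
import urllib.request

# Talk to workers directly and ignore any HTTP proxy set in the environment.
_opener = urllib.request.build_opener(urllib.request.ProxyHandler({}))


def probe(base_url, path, timeout=2.0):
    """Return True if GET base_url + path answers with HTTP 200."""
    try:
        with _opener.open(base_url + path, timeout=timeout) as resp:
            return resp.status == 200
    except (urllib.error.URLError, OSError):
        # Connection refused, timeout, or a non-2xx status all count as failing.
        return False
```

A call such as `probe("http://worker-1:9091", "/readyz")` would then tell you whether that worker should receive traffic.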

Observability

The runtime is wired with OpenTelemetry by default. Set OTEL_EXPORTER_OTLP_ENDPOINT and traces flow to your collector. Per-graph-run correlation IDs propagate across workers.

Useful spans:

agentmatic.graph.run — the whole graph execution.
agentmatic.node.execute — each node, with input / output state diffs.
agentmatic.tool.call — each tool call, with args and return.
agentmatic.checkpoint.write — every checkpoint write, with backend latency.
agentmatic.cluster.dispatch — coordinator → worker hand-off.

Kubernetes example

apiVersion: apps/v1
kind: Deployment
metadata:
  name: agentmatic-worker
spec:
  replicas: 12
  template:
    spec:
      containers:
      - name: worker
        image: ghcr.io/neul-labs/agentmatic-worker:0.1.0
        args: ["--listen", "0.0.0.0:9090", "--metrics", "0.0.0.0:9091"]
        ports:
        - containerPort: 9090
        - containerPort: 9091
        readinessProbe: { httpGet: { path: /readyz, port: 9091 } }
        livenessProbe: { httpGet: { path: /healthz, port: 9091 } }
        env:
        - name: OTEL_EXPORTER_OTLP_ENDPOINT
          value: "http://otel-collector:4318"

The coordinator (your app) discovers workers via the k8s Service and ClusterConfig.from_dns("agentmatic-worker.default.svc.cluster.local:9090").
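For intuition, DNS-based discovery boils down to resolving a name into a list of addresses. The sketch below uses only the standard library and is not how `ClusterConfig.from_dns` is implemented; `resolve_workers` is a hypothetical helper. With a headless Service, a lookup like this returns one address per ready pod; with a regular ClusterIP Service it returns the single virtual IP.

```python
import socket


def resolve_workers(host, port):
    """Resolve a DNS name to a sorted list of unique 'ip:port' strings."""
    infos = socket.getaddrinfo(host, port, type=socket.SOCK_STREAM)
    addrs = set()
    for family, _type, _proto, _canon, sockaddr in infos:
        ip = sockaddr[0]
        # IPv6 literals need brackets so the port separator stays unambiguous.
        addrs.add(f"[{ip}]:{port}" if family == socket.AF_INET6 else f"{ip}:{port}")
    return sorted(addrs)
```

Inside the cluster you would call it with the Service name, for example `resolve_workers("agentmatic-worker.default.svc.cluster.local", 9090)`.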

What this doesn’t do

Replace your job scheduler. k8s / Nomad / etc. still run the worker pods.
Auto-scale itself. Use the HPA on the worker Deployment (Prometheus metric: queue depth).
Provide a hosted control plane. That’s by design: everything stays in your VPC.
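On the autoscaling point, it helps to know what the HPA will do with a queue-depth metric. The sketch below reproduces the core formula from the Kubernetes HPA documentation, desired = ceil(current replicas × current metric / target metric), and ignores the tolerance band and stabilization window the real controller applies. The function name and the target of 20 queued runs per worker are assumptions for illustration only.

```python
import math


def desired_replicas(current_replicas, current_metric, target_metric,
                     min_replicas=1, max_replicas=100):
    """Core HPA formula, clamped to the configured replica bounds."""
    if target_metric <= 0:
        raise ValueError("target_metric must be positive")
    raw = math.ceil(current_replicas * current_metric / target_metric)
    return max(min_replicas, min(max_replicas, raw))


# 12 workers averaging 30 queued runs each, against a target of 20 per worker:
print(desired_replicas(12, 30, 20))
```

Under these assumptions the Deployment would grow from 12 to 18 workers, and shrink to 6 if the average queue depth fell to 10.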

A concrete success case

A Fortune 500 manufacturer ran 12 workers serving 6,800 queries/day across business units, with no SaaS dependency. Annual infra cost was $76k, versus roughly $400k for the hosted alternative. Full case study.

When you should NOT cluster

You’re under 100 concurrent agent runs at peak.
Your bottleneck is the LLM, not the framework.
You don’t have a shared checkpointer (and don’t want one).

Optimize single-host first. Add a cluster when you’re sure you need one.