You shipped your agent. Traffic grew. Single-host event loops are queuing. Memory crept up under long-running graphs. You need to scale out.
This is the playbook for running Agentmatic on a cluster, in your own infrastructure, without paying for a SaaS control plane.
When you need a cluster
Three honest symptoms:
- Queue latency rising during peak. Your single-host event loop is the bottleneck.
- Worker memory creeping up. Long-running graphs accumulate state; restarts are expensive.
- Geo / residency requirements. You need to run agents on EU-only nodes.
If none of these apply, scale up before you scale out. A single c7i.4xlarge handles a lot of concurrent agent runs with the Rust runtime. Don’t add cluster complexity unless you have to.
Coordinator/worker topology
The simpler of the two patterns. One coordinator (your app) talks to N workers over gRPC. The coordinator decides which worker runs a graph; the worker checkpoints back to a shared store.
from agentmatic.cluster import ClusterConfig

config = ClusterConfig(
    topology="coordinator-worker",
    transport="grpc",
    workers=["worker-1:9090", "worker-2:9090", "worker-3:9090"],
    load_balancing="least-loaded",  # or "round-robin", "consistent-hash"
)

agent = Agent.builder("prod").llm(OpenAI()).cluster(config).build()
On each worker box:
agentmatic-worker --listen 0.0.0.0:9090 --metrics 0.0.0.0:9091
The worker is stateless. State lives in the checkpointer.
Peer-to-peer topology
For when “what if the coordinator dies” matters. Workers form a Raft cluster; consensus decides which peer owns each run. Higher availability, more complex.
config = ClusterConfig(
    topology="p2p",
    peers=["peer-1:9090", "peer-2:9090", "peer-3:9090"],
    consensus="raft",
)
Choose this if your uptime SLA is above 99.95% and you can’t accept coordinator restart windows.
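Two numbers help with this decision: how many peer failures a Raft group can absorb, and how much downtime an SLA actually allows. The sketch below is plain arithmetic, not Agentmatic API, and the function names are made up for illustration.

```python
def raft_quorum(peers: int) -> int:
    """Votes needed for a majority in a Raft group of `peers` members."""
    return peers // 2 + 1


def raft_fault_tolerance(peers: int) -> int:
    """Peers that can fail while a majority is still reachable."""
    return (peers - 1) // 2


def downtime_budget_minutes(sla_percent: float, days: float = 365.0) -> float:
    """Allowed downtime in minutes over `days` at the given uptime SLA."""
    return (1 - sla_percent / 100) * days * 24 * 60


for n in (3, 4, 5):
    print(f"{n} peers: quorum={raft_quorum(n)}, tolerates={raft_fault_tolerance(n)}")

print(f"99.95% over a year: {downtime_budget_minutes(99.95):.1f} min")
print(f"99.95% over 30 days: {downtime_budget_minutes(99.95, days=30):.1f} min")
```

A three-peer group like the one above tolerates one failed peer. A fourth peer raises the quorum to three without tolerating a second failure, which is why odd group sizes are the usual choice. At 99.95%, the budget is about 262.8 minutes a year, so a handful of coordinator restarts can consume it.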
The shared checkpointer
Critical: every worker needs to read/write the same checkpoint store. If you use MemorySaver, only the local worker can resume the run. Use Postgres, Redis, or S3.
agent = (
    Agent.builder("prod")
    .checkpoint(PostgresSaver.from_env())  # or Redis / S3
    .cluster(config)
    .build()
)
This is what enables checkpoint hand-off: when worker A crashes mid-run, worker B reads the last checkpoint from Postgres and resumes.
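To make the hand-off concrete, here is a toy model in plain Python. It does not use Agentmatic: `SharedStore` is a hypothetical in-memory stand-in for Postgres / Redis / S3, and the four step names are invented. The point is only that a second worker can pick up where the first one stopped because both read and write the same store.

```python
from dataclasses import dataclass, field


@dataclass
class SharedStore:
    """Stand-in for a shared checkpointer: thread_id -> (next_step, state)."""
    rows: dict = field(default_factory=dict)

    def write(self, thread_id, next_step, state):
        self.rows[thread_id] = (next_step, dict(state))

    def read(self, thread_id):
        return self.rows.get(thread_id, (0, {}))


STEPS = ["plan", "search", "summarize", "answer"]  # hypothetical graph nodes


def run(worker, store, thread_id, crash_before=None):
    """Resume from the last checkpoint and checkpoint after every step."""
    step, state = store.read(thread_id)
    while step < len(STEPS):
        if step == crash_before:
            raise RuntimeError(f"{worker} crashed before {STEPS[step]!r}")
        state[STEPS[step]] = worker  # record which worker ran this node
        step += 1
        store.write(thread_id, step, state)
    return state


store = SharedStore()
try:
    run("worker-a", store, "thread-1", crash_before=2)
except RuntimeError:
    pass  # a coordinator would re-dispatch the run here

final = run("worker-b", store, "thread-1")
print(final)
```

Worker B runs only `summarize` and `answer`; the first two steps are not repeated. Give each worker its own private store (the MemorySaver situation) and worker B would start again from step 0.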
Load balancing strategies
- least-loaded: sends each new run to the worker with the lowest active-run count. Best general default.
- round-robin: fair, predictable. Use when graph cost is uniform.
- consistent-hash: routes runs with the same thread_id to the same worker. Useful for hot-checkpoint locality, at some cost to availability.
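The three strategies can be sketched in a few lines of plain Python. This is an illustration of the general techniques, not Agentmatic's internals: the function and class names are hypothetical, and the hash ring with virtual nodes is one common way to implement consistent hashing, assumed here rather than confirmed.

```python
import hashlib
from bisect import bisect_right
from itertools import cycle

WORKERS = ["worker-1:9090", "worker-2:9090", "worker-3:9090"]


def least_loaded(active_runs: dict) -> str:
    """Pick the worker with the fewest active runs (ties go to the first listed)."""
    return min(active_runs, key=active_runs.get)


def round_robin(workers):
    """Return an iterator that hands out workers in a fixed rotation."""
    return cycle(workers)


def _point(key: str) -> int:
    """Map a string to a stable 64-bit position on the ring."""
    return int.from_bytes(hashlib.sha256(key.encode()).digest()[:8], "big")


class HashRing:
    """Consistent-hash ring; each worker owns `vnodes` points on the ring."""

    def __init__(self, workers, vnodes=64):
        self._ring = sorted(
            (_point(f"{w}#{i}"), w) for w in workers for i in range(vnodes)
        )
        self._points = [p for p, _ in self._ring]

    def pick(self, thread_id: str) -> str:
        # First ring point clockwise from the key, wrapping around at the end.
        idx = bisect_right(self._points, _point(thread_id)) % len(self._ring)
        return self._ring[idx][1]


loads = {"worker-1:9090": 4, "worker-2:9090": 1, "worker-3:9090": 2}
print(least_loaded(loads))

rr = round_robin(WORKERS)
print([next(rr) for _ in range(4)])

ring = HashRing(WORKERS)
print(ring.pick("thread-42") == ring.pick("thread-42"))
```

The ring shows both sides of the consistent-hash trade-off. The same thread_id always lands on the same worker, and removing a worker only remaps the threads that worker owned. Until that remap happens, though, those threads have nowhere else to go.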
Health probes
Every worker exposes:
- :9090 for gRPC graph execution
- :9091/healthz for k8s liveness
- :9091/readyz for k8s readiness
- :9091/metrics for Prometheus
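Outside Kubernetes, you can poll the readiness endpoint yourself before sending traffic to a worker. A minimal sketch using only the standard library; `is_ready` is a hypothetical helper, and treating any 2xx response as ready is an assumption, since the response body and status codes of /readyz are not specified here.

```python
import urllib.error
import urllib.request


def is_ready(host: str, port: int = 9091, timeout: float = 2.0) -> bool:
    """Return True if http://host:port/readyz answers with a 2xx status."""
    url = f"http://{host}:{port}/readyz"
    try:
        with urllib.request.urlopen(url, timeout=timeout) as resp:
            return 200 <= resp.status < 300
    except (urllib.error.URLError, OSError):
        # Connection refused, timeout, DNS failure, or a 4xx/5xx response.
        return False
```

For example, `[w for w in ("worker-1", "worker-2", "worker-3") if is_ready(w)]` gives the workers that are currently accepting runs.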
Observability
The runtime is wired with OpenTelemetry by default. Set OTEL_EXPORTER_OTLP_ENDPOINT and traces flow to your collector. Per-graph-run correlation IDs propagate across workers.
Useful spans:
- agentmatic.graph.run: the whole graph execution.
- agentmatic.node.execute: each node, with input / output state diffs.
- agentmatic.tool.call: each tool call, with args and return.
- agentmatic.checkpoint.write: every checkpoint write, with backend latency.
- agentmatic.cluster.dispatch: coordinator → worker hand-off.
Kubernetes example
apiVersion: apps/v1
kind: Deployment
metadata:
  name: agentmatic-worker
spec:
  replicas: 12
  selector:                      # required by apps/v1; must match the pod labels below
    matchLabels:
      app: agentmatic-worker
  template:
    metadata:
      labels:
        app: agentmatic-worker   # label value is illustrative
    spec:
      containers:
        - name: worker
          image: ghcr.io/neul-labs/agentmatic-worker:0.1.0
          args: ["--listen", "0.0.0.0:9090", "--metrics", "0.0.0.0:9091"]
          ports:
            - containerPort: 9090
            - containerPort: 9091
          readinessProbe: { httpGet: { path: /readyz, port: 9091 } }
          livenessProbe: { httpGet: { path: /healthz, port: 9091 } }
          env:
            - name: OTEL_EXPORTER_OTLP_ENDPOINT
              value: "http://otel-collector:4318"
The coordinator (your app) discovers workers via the k8s Service and ClusterConfig.from_dns("agentmatic-worker.default.svc.cluster.local:9090").
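For intuition, DNS-based discovery boils down to resolving one name into a list of addresses. The sketch below does that with the standard library; it is not the implementation of `ClusterConfig.from_dns`, and `resolve_workers` is a hypothetical helper. One Kubernetes detail worth knowing: a headless Service (`clusterIP: None`) returns one record per ready pod, while a regular ClusterIP Service resolves to a single virtual IP.

```python
import socket


def resolve_workers(dns_name: str, port: int) -> list[str]:
    """Resolve a DNS name to a sorted, de-duplicated list of host:port strings."""
    infos = socket.getaddrinfo(dns_name, port, type=socket.SOCK_STREAM)
    addresses = sorted({info[4][0] for info in infos})
    # Bracket IPv6 literals so the port separator stays unambiguous.
    return [f"[{a}]:{port}" if ":" in a else f"{a}:{port}" for a in addresses]
```

Inside the cluster, `resolve_workers("agentmatic-worker.default.svc.cluster.local", 9090)` would return one entry per worker pod if the Service is headless.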
What this doesn’t do
- Replace your job scheduler. k8s / Nomad / etc. still run the worker pods.
- Auto-scale itself. Use the HPA on the worker Deployment (Prometheus metric: queue depth).
- Provide a hosted control plane. That’s by design: everything stays in your VPC.
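When you set up the HPA on queue depth, it helps to know the rule it applies. Kubernetes documents it as desired = ceil(current replicas × current metric / target metric), skipped when the ratio is within a tolerance of 1.0 (0.1 by default). The sketch below reproduces that arithmetic; the queue-depth numbers are invented for illustration, and min/max replica bounds and stabilization windows are left out.

```python
import math


def desired_replicas(current_replicas: int,
                     current_metric: float,
                     target_metric: float,
                     tolerance: float = 0.1) -> int:
    """Replica count the HPA formula would ask for, ignoring min/max bounds."""
    ratio = current_metric / target_metric
    if abs(ratio - 1.0) <= tolerance:
        return current_replicas  # close enough to target: no scaling
    return math.ceil(current_replicas * ratio)


# 12 workers, target average queue depth of 20 per worker (illustrative numbers)
print(desired_replicas(12, current_metric=30, target_metric=20))  # scale out
print(desired_replicas(12, current_metric=21, target_metric=20))  # within tolerance
print(desired_replicas(12, current_metric=10, target_metric=20))  # scale in
```

With these inputs the three calls give 18, 12, and 6 replicas.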
A concrete success case
A Fortune 500 manufacturer ran 12 workers serving 6,800 queries/day across business units, no SaaS dependency. Annual infra cost $76k vs ~$400k for the hosted alternative. Full case study.
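The figures above are easier to compare once they are broken down per query. The arithmetic below uses only the numbers quoted in the case, plus one assumption: that the 6,800 queries/day average holds for all 365 days.

```python
workers = 12
queries_per_day = 6_800
self_hosted_per_year = 76_000   # USD, from the case above
hosted_per_year = 400_000       # USD, quoted as approximate

# Assumption: traffic is steady at the quoted daily average all year.
queries_per_year = queries_per_day * 365

savings = hosted_per_year - self_hosted_per_year
savings_pct = savings / hosted_per_year * 100
self_hosted_per_query = self_hosted_per_year / queries_per_year
hosted_per_query = hosted_per_year / queries_per_year
queries_per_worker_per_day = queries_per_day / workers

print(f"queries/year:        {queries_per_year:,}")
print(f"savings:             ${savings:,} ({savings_pct:.0f}%)")
print(f"self-hosted / query: ${self_hosted_per_query:.4f}")
print(f"hosted / query:      ${hosted_per_query:.4f}")
print(f"queries/worker/day:  {queries_per_worker_per_day:.0f}")
```

That works out to roughly 3 cents per query self-hosted versus about 16 cents hosted, a difference of $324,000 a year.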
When you should NOT cluster
- You’re under 100 concurrent agent runs at peak.
- Your bottleneck is the LLM, not the framework.
- You don’t have a shared checkpointer (and don’t want one).
Optimize single-host first. Add cluster when you’re sure.