You shipped your agent. Traffic grew. Single-host event loops are queuing. Memory crept up under long-running graphs. You need to scale out.

This is the playbook for running Agentmatic on a cluster, in your own infrastructure, without paying for a SaaS control plane.

When you need a cluster

Three honest symptoms:

  1. Queue latency rising during peak. Your single-host event loop is the bottleneck.
  2. Worker memory creeping up. Long-running graphs accumulate state; restarts are expensive.
  3. Geo / residency requirements. You need to run agents on EU-only nodes.

If none of these symptoms apply, scale up before you scale out. A single c7i.4xlarge handles a lot of concurrent agent runs with the Rust runtime. Don’t add cluster complexity unless you have to.

Coordinator/worker topology

The simpler of the two patterns. One coordinator (your app) talks to N workers over gRPC. The coordinator decides which worker runs a graph; the worker checkpoints back to a shared store.

from agentmatic.cluster import ClusterConfig

config = ClusterConfig(
    topology="coordinator-worker",
    transport="grpc",
    workers=["worker-1:9090", "worker-2:9090", "worker-3:9090"],
    load_balancing="least-loaded",  # or "round-robin", "consistent-hash"
)
agent = Agent.builder("prod").llm(OpenAI()).cluster(config).build()

On each worker box:

agentmatic-worker --listen 0.0.0.0:9090 --metrics 0.0.0.0:9091

The worker is stateless. State lives in the checkpointer.

Peer-to-peer topology

For when “what if the coordinator dies” matters. Workers form a Raft cluster; consensus decides which peer owns each run. Higher availability, more complex.

config = ClusterConfig(
    topology="p2p",
    peers=["peer-1:9090", "peer-2:9090", "peer-3:9090"],
    consensus="raft",
)

Choose this if your uptime SLA is above 99.95% and you can’t accept coordinator restart windows.
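Peer count matters more than it looks, because Raft only makes progress while a majority of peers can reach each other. A minimal sketch of that arithmetic follows; the helper names are illustrative and not part of Agentmatic.

```python
def quorum(peers: int) -> int:
    """Smallest majority of a Raft cluster with `peers` members."""
    return peers // 2 + 1


def tolerated_failures(peers: int) -> int:
    """How many peers can be down while a majority is still reachable."""
    return peers - quorum(peers)


for n in (3, 4, 5):
    print(f"{n} peers: quorum={quorum(n)}, tolerates {tolerated_failures(n)} down")
```

Three peers survive one failure, and a fourth peer adds no extra tolerance, which is why odd cluster sizes are the usual choice.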

The shared checkpointer

Critical: every worker needs to read/write the same checkpoint store. If you use MemorySaver, only the local worker can resume the run. Use Postgres, Redis, or S3.

agent = (Agent.builder("prod")
    .checkpoint(PostgresSaver.from_env())  # or Redis / S3
    .cluster(config)
    .build())

This is what enables checkpoint hand-off: when worker A crashes mid-run, worker B reads the last checkpoint from Postgres and resumes.
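To see why the shared store is the whole trick, here is a toy model of hand-off. It is a sketch under loud assumptions: `SharedStore` is an in-memory stand-in for Postgres / Redis / S3, and `run`, `STEPS`, and `WorkerCrashed` are hypothetical names, not Agentmatic APIs.

```python
class SharedStore:
    """Stand-in for a shared checkpoint backend: thread_id -> (next_step, state)."""

    def __init__(self):
        self._rows = {}

    def write(self, thread_id, next_step, state):
        self._rows[thread_id] = (next_step, dict(state))

    def read(self, thread_id):
        next_step, state = self._rows.get(thread_id, (0, {}))
        return next_step, dict(state)


STEPS = ["plan", "search", "draft", "review"]  # hypothetical graph nodes


class WorkerCrashed(RuntimeError):
    pass


def run(worker, store, thread_id, crash_before=None):
    """Run the graph from the last checkpoint, recording which worker ran each step."""
    next_step, state = store.read(thread_id)
    for i in range(next_step, len(STEPS)):
        if crash_before is not None and i == crash_before:
            raise WorkerCrashed(f"{worker} died before step {STEPS[i]!r}")
        state[STEPS[i]] = worker
        store.write(thread_id, i + 1, state)  # checkpoint after every step
    return state


store = SharedStore()
try:
    run("worker-A", store, "thread-42", crash_before=2)
except WorkerCrashed:
    pass  # the coordinator would now re-dispatch the run

result = run("worker-B", store, "thread-42")
print(result)
```

Worker B picks up at the third step and never repeats the two that worker A already checkpointed. With a per-process store, worker B would have started from scratch.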

Load balancing strategies

  • least-loaded — sends each new run to the worker with the lowest active-run count. Best general default.
  • round-robin — fair, predictable. Use when graph cost is uniform.
  • consistent-hash — routes runs with the same thread_id to the same worker. Useful for hot-checkpoint locality but loses some availability.
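To make the trade-offs concrete, here is a standalone sketch of the three strategies. It is not Agentmatic's implementation: `least_loaded`, `HashRing`, and the virtual-node count are illustrative assumptions.

```python
import hashlib
import itertools
from bisect import bisect_right

WORKERS = ["worker-1:9090", "worker-2:9090", "worker-3:9090"]


def least_loaded(active_runs: dict) -> str:
    """Pick the worker with the fewest active runs (first one wins ties)."""
    return min(active_runs, key=active_runs.get)


# Round-robin is just a cycle over the worker list.
round_robin = itertools.cycle(WORKERS)


class HashRing:
    """Consistent hashing: each worker owns many points on a hash ring."""

    def __init__(self, workers, vnodes=64):
        self._ring = sorted(
            (self._hash(f"{w}#{i}"), w) for w in workers for i in range(vnodes)
        )
        self._keys = [h for h, _ in self._ring]

    @staticmethod
    def _hash(key: str) -> int:
        return int.from_bytes(hashlib.sha256(key.encode()).digest()[:8], "big")

    def pick(self, thread_id: str) -> str:
        # First ring point clockwise from the thread's hash, wrapping around.
        idx = bisect_right(self._keys, self._hash(thread_id)) % len(self._keys)
        return self._ring[idx][1]


ring = HashRing(WORKERS)
print(least_loaded({"worker-1:9090": 4, "worker-2:9090": 1, "worker-3:9090": 2}))
print([next(round_robin) for _ in range(4)])
print(ring.pick("thread-42") == ring.pick("thread-42"))
```

The ring is what keeps consistent-hash cheap to operate: when a worker leaves, only the threads it owned move, and every other thread keeps its worker.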

Health probes

Every worker exposes:

  • :9090 — gRPC for graph execution
  • :9091/healthz — k8s liveness
  • :9091/readyz — k8s readiness
  • :9091/metrics — Prometheus
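Outside Kubernetes you can poll the same endpoints yourself. A minimal sketch using only the standard library, assuming both probes answer HTTP 200 when healthy (the usual convention, not confirmed here); `probe` and `worker_status` are hypothetical helpers.

```python
import http.client


def probe(host: str, path: str, port: int = 9091, timeout: float = 2.0) -> bool:
    """Return True if GET http://host:port/path answers 200."""
    conn = http.client.HTTPConnection(host, port, timeout=timeout)
    try:
        conn.request("GET", path)
        return conn.getresponse().status == 200
    except (OSError, http.client.HTTPException):
        # Refused connection, timeout, DNS failure, or a malformed reply.
        return False
    finally:
        conn.close()


def worker_status(host: str, port: int = 9091) -> dict:
    """Liveness and readiness for one worker's metrics port."""
    return {
        "live": probe(host, "/healthz", port),
        "ready": probe(host, "/readyz", port),
    }
```

Calling `worker_status("worker-1")` returns a dict such as `{"live": True, "ready": False}`. A worker that is live but not ready should stay up and receive no new runs.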

Observability

The runtime is wired with OpenTelemetry by default. Set OTEL_EXPORTER_OTLP_ENDPOINT and traces flow to your collector. Per-graph-run correlation IDs propagate across workers.

Useful spans:

  • agentmatic.graph.run — the whole graph execution.
  • agentmatic.node.execute — each node, with input / output state diffs.
  • agentmatic.tool.call — each tool call, with args and return.
  • agentmatic.checkpoint.write — every checkpoint write, with backend latency.
  • agentmatic.cluster.dispatch — coordinator → worker hand-off.

Kubernetes example

apiVersion: apps/v1
kind: Deployment
metadata:
  name: agentmatic-worker
spec:
  replicas: 12
  # apps/v1 requires a selector that matches the pod template labels.
  selector:
    matchLabels:
      app: agentmatic-worker
  template:
    metadata:
      labels:
        app: agentmatic-worker
    spec:
      containers:
      - name: worker
        image: ghcr.io/neul-labs/agentmatic-worker:0.1.0
        args: ["--listen", "0.0.0.0:9090", "--metrics", "0.0.0.0:9091"]
        ports:
        - containerPort: 9090   # gRPC for graph execution
        - containerPort: 9091   # health probes and Prometheus metrics
        readinessProbe: { httpGet: { path: /readyz, port: 9091 } }
        livenessProbe: { httpGet: { path: /healthz, port: 9091 } }
        env:
        - name: OTEL_EXPORTER_OTLP_ENDPOINT
          value: "http://otel-collector:4318"

The coordinator (your app) discovers workers through the k8s Service for this Deployment. Pass the Service’s DNS name to ClusterConfig.from_dns("agentmatic-worker.default.svc.cluster.local:9090").
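How from_dns resolves that name internally isn't described here. As a rough mental model, assuming a headless Service (clusterIP: None) so that DNS returns one A record per worker pod, discovery could look like the sketch below. `resolve_workers` is a hypothetical helper, not an Agentmatic function.

```python
import socket


def resolve_workers(dns_name: str, resolver=socket.getaddrinfo) -> list:
    """Turn 'service-name:port' into a sorted list of 'ip:port' worker addresses."""
    host, _, port = dns_name.rpartition(":")
    infos = resolver(host, int(port), type=socket.SOCK_STREAM)
    # Each entry is (family, type, proto, canonname, sockaddr); sockaddr[0] is the IP.
    addresses = sorted({info[4][0] for info in infos})
    # IPv6 literals need brackets so the port separator stays unambiguous.
    return [f"[{a}]:{port}" if ":" in a else f"{a}:{port}" for a in addresses]
```

Inside the cluster, `resolve_workers("agentmatic-worker.default.svc.cluster.local:9090")` would return one entry per ready pod. The `resolver` parameter exists only so the function can be exercised without a real DNS server.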

What this doesn’t do

  • Replace your job scheduler. k8s / Nomad / etc. still run the worker pods.
  • Auto-scale itself. Use the HPA on the worker Deployment (Prometheus metric: queue depth).
  • Provide a hosted control plane. That’s by design: everything stays in your VPC.
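For sizing the HPA, it helps to know the rule Kubernetes documents for it: desired replicas = ceil(current replicas × current metric / target metric), with changes inside a default 10% tolerance ignored. A sketch of that rule applied to queue depth; the target of 20 queued runs per worker and the replica bounds are made-up numbers for illustration.

```python
import math


def desired_replicas(current: int, metric: float, target: float,
                     min_replicas: int = 1, max_replicas: int = 50,
                     tolerance: float = 0.1) -> int:
    """Replica count the HPA formula would ask for, clamped to the bounds."""
    ratio = metric / target
    if abs(ratio - 1.0) <= tolerance:
        return current  # close enough to target: no scaling
    return max(min_replicas, min(max_replicas, math.ceil(current * ratio)))


# 12 workers, average queue depth 30 against a (hypothetical) target of 20.
print(desired_replicas(current=12, metric=30, target=20))
```

Because the formula is proportional, a noisy queue-depth metric produces noisy replica counts, so it is worth averaging the metric over a window before you feed it to the HPA.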

A concrete success case

A Fortune 500 manufacturer ran 12 workers serving 6,800 queries/day across business units, with no SaaS dependency. Annual infra cost was $76k, versus ~$400k for the hosted alternative. Full case study.
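The arithmetic behind those numbers is worth running once. This sketch uses only the figures quoted above and assumes traffic on all 365 days of the year, which the case study does not state.

```python
hosted = 400_000         # approximate annual cost of the hosted alternative
self_hosted = 76_000     # annual infra cost of the 12-worker cluster
queries_per_day = 6_800

savings = hosted - self_hosted
savings_pct = 100 * savings / hosted
queries_per_year = queries_per_day * 365   # assumption: no idle days
cost_per_query = self_hosted / queries_per_year

print(f"saved ${savings:,} per year ({savings_pct:.0f}%)")
print(f"about ${cost_per_query:.3f} of infrastructure per query")
```

That is roughly $324k saved per year and about three cents of infrastructure per query, before counting the engineering time it takes to run the cluster.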

When you should NOT cluster

  • You’re under 100 concurrent agent runs at peak.
  • Your bottleneck is the LLM, not the framework.
  • You don’t have a shared checkpointer (and don’t want one).

Optimize single-host first. Add a cluster when you’re sure you need one.