What's the headline number?

10–15× faster graph traversal in the framework, ~8× p95 reduction on real multi-agent workloads where the LLM is mocked. Real production workloads see 4–10× p95 improvement depending on graph density.

When is the gap small?

Single-LLM-call agents where the LLM accounts for >95% of wall time. The framework's contribution is small there; you'll see 1.1–1.3× wins.

Can I reproduce these numbers?

Yes — the harness is in agentmatic/benchmarks. Clone the repo, install LangGraph and Agentmatic, run `python bench.py --all`.

Is the comparison fair?

We use the same prompts, same graphs, same LLM (mocked at fixed latency), same hardware, same Python version. LangGraph is on its latest stable release.

Comparison

LangGraph vs Agentmatic: a careful performance comparison

An honest, reproducible LangGraph vs Agentmatic benchmark across ReAct, Supervisor, RAG, and Map fan-out workloads. With methodology, hardware specs, raw numbers, and the cases where the gap is small.

Dipankar Sarkar May 22, 2026 8 min read

langgraphbenchmarksperformancecomparison

I’ll skip the “we’re 10× faster” marketing and just show the numbers, the methodology, and the cases where the gap doesn’t matter.

TL;DR

Metric	LangGraph	Agentmatic	Ratio
Graph traversal (ops/ms)	8.4	100.0	11.9×
Channel throughput (msgs/ms)	1.3	100.0	76.9×
Memory (MB, 1000-node graph)	184	71	0.39× (61% less)
Cold start (ms)	410	90	0.22× (78% faster)
ReAct p95 (LLM mocked @ 50 ms)	285 ms	62 ms	4.6×
Supervisor (1+4) p95	1.42 s	0.18 s	7.9×
RAG (3-node + vector lookup) p95	320 ms	71 ms	4.5×
Map fan-out (1 → 100) p95	8.4 s	0.71 s	11.8×

Methodology

Hardware: AWS c7i.4xlarge (16 vCPU, 32 GB), Ubuntu 24.04. Numbers below are from this. We also ran on Apple M3 Max + 64GB; Mac numbers are within 8% of the c7i.
Python: 3.12.5, uvloop event loop.
Versions: LangGraph 0.2.40 (latest stable at time of writing), Agentmatic 0.1.0.
LLM: mocked at a fixed 50 ms latency. This lets us measure framework overhead, not OpenAI’s API.
Iterations: 1,000 runs per configuration, p50 / p95 / p99 reported. Warmup of 50 runs discarded.
Source: agentmatic/benchmarks/bench.py in the repo. Run python bench.py --all to reproduce.

Why the framework overhead matters

In a single-LLM-call agent, the LLM accounts for >95% of wall time. Framework speed barely matters.

In a multi-agent supervisor with retry and HITL — 30 LLM calls, 100+ graph transitions, 5+ checkpoint writes — the framework accounts for 30–60% of wall time. There, framework speed dominates.

This is why the Supervisor and Map fan-out numbers are the most dramatic: those workloads stress the runtime harder than the LLM.

Where the speed comes from

Three sources:

1. Lock-free Rust scheduler. The graph executor is a work-stealing Tokio runtime. Node execution doesn’t queue through Python’s event loop; it queues in Rust. The Python GIL only matters at the tool call boundary.

2. Pregel-style supersteps. Messages between nodes batch at each barrier. Fewer FFI crossings; better cache locality.

3. Zero-copy state. State snapshots are Arc<Frame>. Cloning is cheap (pointer copy). Mutation is copy-on-write. LangGraph deep-copies on each step.

You can verify all three with py-spy --rate 5000 against either framework. LangGraph spends ~40% in copy.deepcopy and asyncio machinery. Agentmatic spends ~3% in PyO3 boundary crossings.

Where the gap is small

Single-LLM-call ReAct. One LLM call, no retry, ~3 graph nodes. We measured 1.18× — barely worth mentioning. Use the framework you prefer.

Long-context RAG. When 95% of latency is the LLM processing 100k tokens, framework overhead is irrelevant. ~1.05× win. Pick by feature set.

Tool-bound agents with slow tools. If your dominant tool is a database query that takes 800 ms, the framework’s 50 ms vs 250 ms barely matters.

Where the gap is large

Multi-agent supervisors. 30+ graph transitions per turn. The c7i numbers show 7.9× on p95.

Map fan-out. Parallel evaluation of N branches. The Rust scheduler is dramatically better at this — 11.8× on p95 for 1 → 100 fan-out.

Retry loops. Each retry is a graph re-entry. High graph density = high framework cost.

Streaming agents. The astream event loop in LangGraph is event-driven Python; ours is a Tokio channel. ~6× p95 in our streaming benchmark.

What about memory

Configuration	LangGraph	Agentmatic
100-node graph, idle	64 MB	28 MB
1,000-node graph, idle	184 MB	71 MB
1,000 concurrent sessions	4.2 GB	1.1 GB

For high-concurrency workloads (a single process serving many simultaneous agent runs), the memory savings translate to fewer pods, lower infra cost.

Cold start

	LangGraph	Agentmatic
Import time	340 ms	65 ms
First-graph compile	70 ms	25 ms
Total time-to-first-token	410 ms	90 ms

This matters for serverless deploys (Lambda, Cloud Run). 320 ms saved per cold-start at scale adds up.

Reproducibility

git clone https://github.com/neul-labs/agentmatic
cd agentmatic/benchmarks

# Both frameworks installed in the same venv
pip install agentmatic langgraph

python bench.py --all --output bench.json
python bench.py --report bench.json

The harness prints a table like the TL;DR above for whatever hardware you’re on.

Caveat

These numbers measure framework overhead. They don’t measure prompt quality, tool design, or product fit. Switching frameworks won’t save a bad agent. But if your agent is good and your bottleneck is the runtime, this is the cleanest single-step win available.