What's a 'superstep'?

A discrete unit of graph execution in the Pregel model — between two superstep barriers, all messages from the previous step have been delivered. Each superstep boundary is a natural checkpoint.

How much storage does checkpointing take?

Depends on state size. A typical multi-agent workflow writes 5–15 KB per superstep, compressed, so a 30-step run takes roughly 150–450 KB (about 300 KB at the midpoint). Postgres can hold millions of these on modest hardware.
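To size a backend for your own workload, the arithmetic is simple enough to script. The sketch below is a back-of-the-envelope estimate using the 5–15 KB per-superstep figure quoted above. The function name and the one-million-run scenario are illustrative assumptions, not part of Agentmatic.

```python
def estimate_run_storage_kb(steps: int, low_kb: float = 5.0, high_kb: float = 15.0) -> tuple[float, float]:
    """Return the (low, high) compressed storage estimate for one run, in KB."""
    if steps < 0:
        raise ValueError("steps must be non-negative")
    return steps * low_kb, steps * high_kb


low, high = estimate_run_storage_kb(30)
midpoint_kb = (low + high) / 2

# Hypothetical fleet size, used only to show the scale involved.
runs = 1_000_000
total_gb = midpoint_kb * runs / 1_000_000  # decimal units: 1 GB = 1,000,000 KB

print(f"30-step run: {low:.0f}-{high:.0f} KB, midpoint {midpoint_kb:.0f} KB")
print(f"{runs:,} such runs: about {total_gb:.0f} GB")
```

Swap in your own measured per-superstep size; state that carries long message histories or large tool outputs can sit well above the typical range.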

Can I delete old checkpoints?

Yes — every backend has a TTL or compaction API. Default retention is 30 days; many teams set it to 7 days for cost.

What's the perf cost of checkpointing?

Postgres: ~3–8 ms per checkpoint. Redis: ~0.5 ms. S3: ~50 ms. Choose based on the latency you can tolerate and the cost you can accept.
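Per-checkpoint latency matters most when multiplied by the number of supersteps in a run. The sketch below uses only the approximate figures quoted above and assumes checkpoint writes happen one after another on the critical path, which may not hold if your backend writes asynchronously. The names are illustrative.

```python
# Approximate per-checkpoint write latency in milliseconds (low, high),
# taken from the figures quoted above.
LATENCY_MS = {
    "postgres": (3.0, 8.0),
    "redis": (0.5, 0.5),
    "s3": (50.0, 50.0),
}


def run_overhead_ms(backend: str, steps: int) -> tuple[float, float]:
    """Return the (low, high) total checkpoint overhead for a run of `steps` supersteps."""
    low, high = LATENCY_MS[backend]
    return steps * low, steps * high


for backend in LATENCY_MS:
    low, high = run_overhead_ms(backend, steps=30)
    print(f"{backend}: {low:.0f}-{high:.0f} ms across 30 supersteps")
```

For a run dominated by LLM calls that take seconds each, even the S3 figure is often small relative to total runtime.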

Deep dive

Agent checkpointing, time travel, and human-in-the-loop done right

How checkpointing actually works in Agentmatic, why every superstep is a save point, how time-travel debugging changes development, and the HITL patterns that scale to production.

Dipankar Sarkar · May 29, 2026 · 9 min read

checkpointing, hitl, time-travel, production

Checkpointing isn’t a feature you tack on. It’s the foundation. Done right, it gives you time travel, human-in-the-loop, crash safety, and reproducibility — all from the same primitive. Done wrong, it’s an afterthought that fails at the worst moments.

This is how it works in Agentmatic, and how to use it well.

The primitive

Every superstep boundary is a checkpoint. A “superstep” is a unit of graph execution: between two barriers, all messages from the previous step have been delivered, and the state is consistent.

START → node A → [checkpoint] → node B → [checkpoint] → node C → [checkpoint] → END

The state at each [checkpoint] is immutable, identified by a checkpoint_id, and stored in your configured backend.
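To make the primitive concrete, here is a toy, in-memory model of the idea: run nodes in order and save an immutable snapshot after each one. This is a sketch of the concept, not Agentmatic's implementation. `Checkpoint`, `run_graph`, and the use of random hex IDs are assumptions made for illustration.

```python
import copy
import uuid
from dataclasses import dataclass
from types import MappingProxyType
from typing import Any, Callable, Mapping

State = dict[str, Any]


@dataclass(frozen=True)
class Checkpoint:
    checkpoint_id: str
    superstep: int
    state: Mapping[str, Any]  # read-only at the top level


def run_graph(nodes: list[Callable[[State], State]], initial: State) -> list[Checkpoint]:
    """Run nodes sequentially, saving one checkpoint at every superstep boundary."""
    state = dict(initial)
    checkpoints: list[Checkpoint] = []
    for superstep, node in enumerate(nodes, start=1):
        state = node(state)
        # Deep-copy so later nodes cannot alter what was saved.
        frozen = MappingProxyType(copy.deepcopy(state))
        checkpoints.append(Checkpoint(uuid.uuid4().hex, superstep, frozen))
    return checkpoints


checkpoints = run_graph(
    [
        lambda s: {**s, "a": "done"},
        lambda s: {**s, "b": "done"},
        lambda s: {**s, "c": "done"},
    ],
    {"input": "hello"},
)
for cp in checkpoints:
    print(cp.superstep, cp.checkpoint_id[:8], dict(cp.state))
```

A real backend would persist each snapshot instead of holding it in a list, but the contract is the same: one saved, addressable state per boundary.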

Backends

from agentmatic.checkpoint import (
    MemorySaver,           # in-process, for dev
    SQLiteSaver,           # single-host durability
    PostgresSaver,         # multi-host, transactional
    RedisSaver,            # fast, ephemeral-friendly
    S3Saver,               # cheap, geo-replicated
)

agent = (Agent.builder("durable")
    .llm(OpenAI())
    .checkpoint(S3Saver(bucket="agents"))
    .build())

Choose by:

Memory — tests and notebooks.
SQLite — single-host dev, small prod (fewer than 100 concurrent runs).
Postgres — multi-host prod, transactional. The default for serious workloads.
Redis — high throughput, fine with state loss on a crash (ephemeral sessions).
S3 — cheap retention. Higher latency; use for archive or long-tail.

Time travel: the killer feature

Because every checkpoint is immutable, you can resume from any prior superstep boundary.

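# List the history of a run. list() keeps the result indexable even if the
# call returns a lazy iterator.
history = list(agent.get_state_history(thread_id="run-42"))
for state in history:
    print(state.next, state.config["configurable"]["checkpoint_id"])

# Fork from the checkpoint at index 3. Check the ordering first: if history
# comes back newest first, as it does in LangGraph, index 3 is the fourth
# most recent checkpoint rather than superstep 3.
checkpoint_id = history[3].config["configurable"]["checkpoint_id"]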

# Replay with new input.
result = agent.invoke(
    {"messages": [HumanMessage("Try a different approach this time.")]},
    config={"configurable": {"thread_id": "run-42", "checkpoint_id": checkpoint_id}},
)

This changes how you debug. You hit a bad run in prod; you pull its thread_id; you load the state history; you find where it went wrong; you fork from before the wrong step; you fix the prompt; you replay. No re-running from scratch.
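The fork-and-replay loop is easier to reason about with a toy model. The sketch below is plain Python with no Agentmatic calls: `run`, `fork_and_replay`, and the two example nodes are hypothetical names. It shows the property that matters, which is that steps before the fork point are reused rather than re-executed, and the original history is left untouched.

```python
from typing import Any, Callable

State = dict[str, Any]
Node = Callable[[State], State]


def run(nodes: list[Node], state: State) -> list[State]:
    """Run nodes in order; history[0] is the input, history[i] the state after i nodes."""
    history = [dict(state)]
    for node in nodes:
        state = node(dict(state))
        history.append(dict(state))
    return history


def fork_and_replay(history: list[State], nodes: list[Node], step: int, patch: State) -> list[State]:
    """Resume after `step` completed nodes with `patch` applied, re-running only later nodes."""
    forked = {**history[step], **patch}
    return history[:step] + run(nodes[step:], forked)


def draft(s: State) -> State:
    s["draft"] = f"draft using {s['prompt']}"
    return s


def review(s: State) -> State:
    s["approved"] = s["prompt"] == "v2"
    return s


nodes = [draft, review]
original = run(nodes, {"prompt": "v1"})

# Fork after the draft step with a fixed prompt: only `review` runs again.
replayed = fork_and_replay(original, nodes, step=1, patch={"prompt": "v2"})
print(original[-1])
print(replayed[-1])
```

Note that the replayed run keeps the draft produced under the old prompt, because that step sits before the fork point. Fork earlier if the fix needs to affect it.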

HITL: pause the graph for a human

The pattern: pause before a tool fires that’s hard to undo. Surface the proposed call to a human. Let them approve, modify, or reject. Resume.

agent = (Agent.builder("payments")
    .llm(OpenAI())
    .tools([read_account, transfer_funds])
    .checkpoint(PostgresSaver.from_env())
    .interrupt_before(["transfer_funds"])
    .build())

# Run pauses at the first transfer_funds call.
state = await agent.ainvoke(
    {"messages": [HumanMessage("Refund order #42 for $250.")]},
    config={"configurable": {"thread_id": "case-42"}},
)
print(state.next)  # ('transfer_funds',)

# A different process, hours later, resumes after approval.
await agent.ainvoke(None, config={"configurable": {"thread_id": "case-42"}})

The key insight: because the state is checkpointed, the resume can happen in a different process, on a different machine, at a different time. The human approval workflow doesn’t have to share the agent process.
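A stripped-down model shows why this works: once the paused state lives in durable storage, nothing about the resume depends on the process that paused. The sketch below uses a JSON file in a temporary directory as a stand-in for the checkpoint backend. `pause_run`, `resume_run`, and the state layout are assumptions for illustration, not Agentmatic's storage format.

```python
import json
import tempfile
from pathlib import Path


def pause_run(store: Path, thread_id: str, executed: list, pending_call: dict) -> None:
    """Persist a run that stopped just before a gated tool call."""
    state = {"status": "paused", "executed": executed, "pending_call": pending_call}
    (store / f"{thread_id}.json").write_text(json.dumps(state))


def resume_run(store: Path, thread_id: str, approved: bool) -> dict:
    """Load the paused run from storage and apply the human's decision."""
    path = store / f"{thread_id}.json"
    state = json.loads(path.read_text())
    if state["status"] != "paused":
        raise RuntimeError(f"thread {thread_id} is not waiting for approval")
    call = state.pop("pending_call")
    if approved:
        state["executed"].append(call)
    state["status"] = "approved" if approved else "rejected"
    path.write_text(json.dumps(state))
    return state


with tempfile.TemporaryDirectory() as tmp:
    store = Path(tmp)
    # "Worker 1" pauses before the risky call.
    pause_run(store, "case-42", [{"tool": "read_account"}],
              {"tool": "transfer_funds", "args": {"amount": 250}})
    # "Worker 2" shares only the store and the thread_id with worker 1.
    final = resume_run(store, "case-42", approved=True)

print(final)
```

The status check guards against resuming the same thread twice, which is worth having in any approval flow where a double click could otherwise fire the tool twice.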

Conditional interrupts

Most interrupts shouldn’t fire on every call — only on risky ones:

.interrupt_before_when(
    tool="transfer_funds",
    predicate=lambda args: args["amount"] >= 100,
)

Refunds under $100 go through without approval; refunds of $100 or more wait for a human. This is the right granularity for most workflows.
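Because the predicate decides whether money moves without review, it is worth writing as a named function and testing at the boundary. The sketch below is one possible hardening of the threshold predicate above: `needs_approval` is a hypothetical helper, and the choice to pause on malformed arguments (rather than raise, as a bare `args["amount"]` lookup would) is an assumption about what you want in production.

```python
from typing import Any, Mapping

APPROVAL_THRESHOLD = 100  # dollars; mirrors the threshold in the predicate above


def needs_approval(args: Mapping[str, Any], threshold: float = APPROVAL_THRESHOLD) -> bool:
    """Return True when a proposed transfer should pause for a human."""
    amount = args.get("amount")
    # Fail closed: a missing or non-numeric amount always goes to a human.
    if isinstance(amount, bool) or not isinstance(amount, (int, float)):
        return True
    return amount >= threshold


print(needs_approval({"amount": 99.99}))
print(needs_approval({"amount": 100}))
print(needs_approval({}))
```

A function like this could be passed wherever the lambda is accepted, assuming the predicate parameter takes any callable over the tool arguments.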

Modify state during HITL

Sometimes the human wants to adjust the proposed call before resuming:

# Inspect the proposed tool call.
state = await agent.aget_state({"configurable": {"thread_id": "case-42"}})
proposed = state.next_tool_call
print(proposed.args)  # {"to": "...", "amount": 250}

# Override.
await agent.update_state(
    {"configurable": {"thread_id": "case-42"}},
    {"messages": [HumanMessage("Approved, but amount should be 245.")]},
)
await agent.ainvoke(None, config={"configurable": {"thread_id": "case-42"}})

The model sees the human’s modification as a new user message and re-proposes the tool call with the updated amount.

Branching

Every checkpoint is immutable, so branching is just resuming from an existing checkpoint_id under a new thread_id:

# A is the original run.
# B starts from A's superstep 5 with a new prompt.
await agent.ainvoke(
    new_input,
    config={"configurable": {
        "thread_id": "branch-B",
        # Read the ID the same way as in the time-travel example.
        "checkpoint_id": a_history[5].config["configurable"]["checkpoint_id"],
    }},
)

Useful for A/B testing prompt changes against the same state, or letting users explore “what if” continuations.

Retention

Postgres retention SQL (run nightly):

DELETE FROM agentmatic_checkpoints WHERE created_at < NOW() - INTERVAL '30 days';

Or use the built-in helper:

from agentmatic.checkpoint.postgres import compact

await compact(POSTGRES_URL, retention_days=30)
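If you write your own retention job, passing the cutoff in as a parameter makes it testable. The sketch below runs the same delete-by-age logic against an in-memory SQLite database using only the standard library. The `checkpoints` table, its columns, and `prune_checkpoints` are hypothetical and do not reflect Agentmatic's actual schema.

```python
import sqlite3
from datetime import datetime, timedelta, timezone
from typing import Optional


def prune_checkpoints(conn: sqlite3.Connection, retention_days: int,
                      now: Optional[datetime] = None) -> int:
    """Delete checkpoints older than the retention window; return rows removed."""
    now = now or datetime.now(timezone.utc)
    cutoff = int((now - timedelta(days=retention_days)).timestamp())
    with conn:  # commits on success, rolls back on error
        cursor = conn.execute("DELETE FROM checkpoints WHERE created_at < ?", (cutoff,))
    return cursor.rowcount


# Fixed clock so the example is deterministic.
now = datetime(2026, 5, 29, tzinfo=timezone.utc)
conn = sqlite3.connect(":memory:")
conn.execute(
    "CREATE TABLE checkpoints (checkpoint_id TEXT PRIMARY KEY, created_at INTEGER NOT NULL)"
)
for age_days in (1, 29, 31, 90):
    created = int((now - timedelta(days=age_days)).timestamp())
    conn.execute("INSERT INTO checkpoints VALUES (?, ?)", (f"cp-{age_days}d", created))
conn.commit()

removed = prune_checkpoints(conn, retention_days=30, now=now)
remaining = [row[0] for row in conn.execute(
    "SELECT checkpoint_id FROM checkpoints ORDER BY created_at DESC")]
print(removed, remaining)
```

Before deleting by age alone, check whether any paused HITL threads or branches still reference old checkpoints; a run waiting on approval for longer than the retention window would otherwise lose its resume point.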

Format compatibility with LangGraph

The Memory and SQLite checkpoint formats are wire-compatible with LangGraph. You can read existing LangGraph checkpoints in Agentmatic with no migration. Postgres requires a one-line schema upgrade we ship:

from agentmatic.checkpoint.postgres import upgrade
await upgrade(POSTGRES_URL)  # idempotent

What this enables in production

Long-horizon workflows. Multi-hour graphs survive process restarts, deploys, and crashes.
Multi-process coordination. The worker that pauses for HITL doesn’t have to be the worker that resumes.
Debuggable failures. Every failed run has its full state history. Time-travel to the moment it went wrong.
Compliance / audit. Every state mutation is recorded, timestamped, attributable. Optional tamper-evident chain.
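The article does not describe how the optional tamper-evident chain is built, so the following is a generic hash-chain sketch rather than Agentmatic's mechanism. Each checkpoint's hash covers its own state plus the previous hash, so editing any earlier state invalidates every hash after it. All names here are illustrative.

```python
import hashlib
import json

GENESIS = "0" * 64  # placeholder "previous hash" for the first checkpoint


def chain_hash(prev_hash: str, state: dict) -> str:
    """Hash a checkpoint's state together with the previous checkpoint's hash."""
    # Canonical JSON so the same state always serializes to the same bytes.
    payload = json.dumps(state, sort_keys=True, separators=(",", ":")).encode("utf-8")
    return hashlib.sha256(prev_hash.encode("ascii") + payload).hexdigest()


def build_chain(states: list[dict]) -> list[str]:
    """Return one hash per checkpoint, each linked to its predecessor."""
    hashes, prev = [], GENESIS
    for state in states:
        prev = chain_hash(prev, state)
        hashes.append(prev)
    return hashes


def verify_chain(states: list[dict], hashes: list[str]) -> bool:
    """Recompute the chain and compare it with the recorded hashes."""
    return len(states) == len(hashes) and build_chain(states) == hashes


states = [{"step": 1, "amount": 250}, {"step": 2, "approved": True}, {"step": 3, "sent": True}]
hashes = build_chain(states)
print(verify_chain(states, hashes))

tampered = [dict(s) for s in states]
tampered[0]["amount"] = 2500
print(verify_chain(tampered, hashes))
```

A plain hash chain only detects edits if the recorded hashes themselves are kept somewhere the editor cannot reach; signing the latest hash or storing it in a separate system is the usual complement.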

Checkpointing is the difference between “demo-quality agent” and “production agent.” It’s not optional.