Architecture · Apr 11, 2026 · 6 min read

Why persistent Python REPL across agent turns cuts your LLM bill 10x

Container-per-tool-call agents re-parse every CSV, re-import every library, re-run every setup cell on every turn. A persistent REPL eliminates all of it. Here's the math on why that saves 10x tokens.

Robel Tegegne, founder of Podflare

Most cloud-sandbox platforms for AI agents give you a fresh container per tool call. Simple to reason about, clean isolation between executions. Also: expensive in ways that don’t show up on the sandbox bill but absolutely show up on your LLM bill.

The hidden cost is that every time the model wants to do something with df, it has to re-create df. Which means re-running all the setup code. Which means re-emitting all that code from the model. Which means paying for the tokens.

A persistent Python REPL — one where globals() survives between tool calls — is what makes long-running agent sessions affordable. This post is the math on why.

The scenario

A data-analysis agent answering questions about a 500 MB parquet file. Realistic user journey: ten questions in sequence. Each question requires running some Python against the loaded DataFrame.

Without persistent REPL

Every tool call gets a fresh container. For each turn the model has to re-emit the setup:

# Tool call #1 (the model writes this)
import pandas as pd
df = pd.read_parquet('/data/sales.parquet')
print(df.groupby('region').revenue.sum())

# Tool call #2 — fresh container, globals wiped
import pandas as pd
df = pd.read_parquet('/data/sales.parquet')
print(df[df.region == 'EU'].describe())

# Tool call #3 — fresh container again
import pandas as pd
df = pd.read_parquet('/data/sales.parquet')
print(df.pivot_table(...))

Let’s count tokens for 10 turns:

  • Boilerplate per turn: ~20 tokens (import pandas... + df = ...)
  • Unique logic per turn: ~40 tokens
  • Per-turn code: 60 tokens emitted by the model, plus the same 60 tokens sent back to the model on the next turn as tool history
  • 10 turns × 2 × 60 = 1,200 tokens total for the code round-trip — 400 of which (10 × 2 × 20) is pure setup boilerplate
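A quick sanity check of that arithmetic (the token sizes and turn count are the illustrative figures above):

```python
BOILERPLATE = 20   # import pandas + read_parquet, per the estimate above
LOGIC = 40         # unique analysis code per turn
TURNS = 10

per_turn = BOILERPLATE + LOGIC          # 60 tokens emitted per turn
total = TURNS * per_turn * 2            # emitted once, echoed once as tool history
setup_share = TURNS * BOILERPLATE * 2   # the part that is pure setup boilerplate

print(total, setup_share)  # 1200 400
```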

And that ignores the runtime cost: re-parsing a 500 MB parquet file on every turn adds 200–500 ms of wall-clock time, which the user feels.

With persistent REPL

# Tool call #1
import pandas as pd
df = pd.read_parquet('/data/sales.parquet')
print(df.groupby('region').revenue.sum())

# Tool call #2 — same sandbox, df + pandas still in globals
print(df[df.region == 'EU'].describe())

# Tool call #3
print(df.pivot_table(...))

  • Boilerplate total: 20 tokens, emitted once
  • Per-turn unique logic: 40 tokens
  • 10 turns: 20 + (10 × 40 × 2) = 820 tokens

That’s a 32% token saving on this simple example. On longer sessions with bigger setups (load a trained ML model, fit an index, open a database connection) the savings go nonlinear — the setup cost you’re paying 10x without persistence might be 500 tokens, not 20.
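To see the nonlinearity, sketch the session cost as a function of setup size — a simplified model using the accounting above (each per-turn code token counted twice, once emitted and once echoed as tool history; the one-time setup counted once, as in the 820-token total):

```python
def session_tokens(setup: int, logic: int, turns: int, persistent: bool) -> int:
    """Total code tokens for a session under the article's accounting."""
    if persistent:
        return setup + turns * logic * 2   # setup emitted once
    return turns * (setup + logic) * 2     # setup re-emitted every turn

# 20-token setup: modest saving
small = session_tokens(20, 40, 10, False) / session_tokens(20, 40, 10, True)
# 500-token setup (ML model load, DB connection): the ratio blows up
big = session_tokens(500, 40, 10, False) / session_tokens(500, 40, 10, True)
print(round(small, 2), round(big, 2))  # 1.46 8.31
```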

Real-world numbers from our customers

We monitor this across Podflare agent workloads. Cases we’ve seen:

  • Data-analysis agent, 20-turn sessions. Setup cost: load a 2 GB parquet, fit a scikit-learn model, pre-compute some features. ~450 tokens of setup. Without persistence: 450 × 20 × 2 = 18,000 setup-tokens per session. With persistence: 450 × 2 = 900. 20x savings on that slice.
  • Trading research agent, 50-turn sessions. Setup: import pandas/numpy/scipy, fetch historical data, build a lookup table. ~300 tokens. 50-turn session without persistence: 300 × 50 × 2 = 30,000 tokens. With persistence: 600. 50x.
  • Coding agent doing multi-file refactors. Setup: clone the repo, parse the AST. ~2000 tokens of setup code. Without persistence, every file edit re-emits the full clone + parse. Measurably unviable at scale.
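All three cases are the same overhead formula; a quick check of the first two figures, using the session sizes quoted above:

```python
def setup_overhead(setup_tokens: int, turns: int) -> int:
    # Without persistence, setup is re-emitted on every turn and
    # echoed back in tool history, so it costs 2x per turn.
    return setup_tokens * turns * 2

data_analysis = setup_overhead(450, 20)   # 18,000 setup-tokens per session
trading = setup_overhead(300, 50)         # 30,000
print(data_analysis, trading)
```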

Runtime latency, not just tokens

Tokens are the cheap part. The time saved is often the bigger win:

  • Parquet parse: 200–500 ms per call
  • Model load: 1–10 s per call
  • Warm network connection (DB, API client): 50–200 ms per call

On a 20-turn session, that expensive setup happens once instead of 20 times — the difference between the user waiting 500 ms per turn and waiting 3 s.
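Back-of-the-envelope, using illustrative midpoints from the ranges above plus a hypothetical 500 ms of per-turn analysis time:

```python
SETUP_MS = 400 + 2000 + 100   # parquet parse + model load + warm connection
LOGIC_MS = 500                # hypothetical per-turn analysis time
TURNS = 20

fresh = TURNS * (SETUP_MS + LOGIC_MS)     # setup repeated every turn
persistent = SETUP_MS + TURNS * LOGIC_MS  # setup paid once
print(fresh / TURNS, persistent / TURNS)  # 3000.0 ms vs 625.0 ms per turn
```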

How Podflare implements it

Every sandbox starts an ipykernel process at boot, sitting on a vsock control channel. Your sb.run_code("...") call pushes the code to the kernel, which execs it in the main namespace. Variables, imports, open files — all sit in globals() between calls. The kernel outlives the RPC that created it.
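The core mechanic can be sketched with the standard library alone — a toy stand-in for the kernel, where one namespace dict outlives every run_code call (class and method names here are illustrative, not the Podflare SDK):

```python
import contextlib
import io

class PersistentREPL:
    """Toy persistent kernel: one namespace shared across tool calls."""

    def __init__(self):
        # A single dict serves as globals() for every execution.
        self._globals: dict = {}

    def run_code(self, src: str) -> str:
        buf = io.StringIO()
        with contextlib.redirect_stdout(buf):
            exec(src, self._globals)   # state lands in the shared namespace
        return buf.getvalue()

repl = PersistentREPL()
repl.run_code("x = 40")               # tool call #1: create state
out = repl.run_code("print(x + 2)")   # tool call #2: state survived
print(out.strip())                    # 42
```

The real implementation adds isolation, streaming output, and interrupt handling via ipykernel, but the persistence itself is exactly this: the namespace outlives the call.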

For cross-session persistence (user closes the chat, comes back tomorrow), use persistent=True at create time. Podflare freezes the full VM memory to a Space when idle, resumes it next time. The Python process is the same one — id() of any object in memory is the same before and after the freeze/resume.

sb = Sandbox(persistent=True)
sb.run_code("model = train_big_model()   # 10 min")
sb.idle()   # freeze to Space

# Tomorrow:
sb = Sandbox.resume(space_id)
sb.run_code("print(model.predict(...))") # model still in memory, no retrain

When you DON’T want persistent state

One real case: when the user's input to the agent is adversarial and the previous turn's state might have been tampered with. This is the classic prompt-injection defense pattern — for security-critical code paths, use a fresh sandbox per call. Podflare supports both; choose per use case.

Try it

Free Podflare account, $200 starter credit. Docs on persistent REPL + Spaces at docs.podflare.ai/concepts/repl.

