Most cloud-sandbox platforms for AI agents give you a fresh container per tool call. Simple to reason about, clean isolation between executions. Also: expensive in ways that don’t show up on the sandbox bill but absolutely show up on your LLM bill.
The hidden cost is that every time the model wants to do something with df, it has to re-create df. Which means re-running all the setup code. Which means re-emitting all that code from the model. Which means paying for the tokens.
A persistent Python REPL — one where globals() survives between tool calls — is what makes long-running agent sessions affordable. This post is the math on why.
The scenario
A data-analysis agent answering questions about a 500 MB parquet file. Realistic user journey: ten questions in sequence. Each question requires running some Python against the loaded DataFrame.
Without persistent REPL
Every tool call gets a fresh container. For each turn the model has to re-emit the setup:
```python
# Tool call #1 (the model writes this)
import pandas as pd
df = pd.read_parquet('/data/sales.parquet')
print(df.groupby('region').revenue.sum())

# Tool call #2 — fresh container, globals wiped
import pandas as pd
df = pd.read_parquet('/data/sales.parquet')
print(df[df.region == 'EU'].describe())

# Tool call #3 — fresh container again
import pandas as pd
df = pd.read_parquet('/data/sales.parquet')
print(df.pivot_table(...))
```

Let’s count tokens for 10 turns:
- Boilerplate per turn: ~20 tokens (`import pandas as pd` + `df = pd.read_parquet(...)`)
- Unique logic per turn: ~40 tokens
- Per-turn total: 60 tokens emitted by the model, plus the same 60 sent back to the model on the next turn as tool history
- 10 turns × 2 × 60 = 1,200 tokens of code round-trip, roughly 400 of them pure setup boilerplate
And that ignores the runtime cost: re-parsing a 500 MB parquet file on every turn is 200–500 ms of wall-clock time, which the user feels.
With persistent REPL
```python
# Tool call #1
import pandas as pd
df = pd.read_parquet('/data/sales.parquet')
print(df.groupby('region').revenue.sum())

# Tool call #2 — same sandbox, df + pandas still in globals
print(df[df.region == 'EU'].describe())

# Tool call #3
print(df.pivot_table(...))
```

- Boilerplate total: 20 tokens, emitted once
- Per-turn unique logic: 40 tokens
- 10 turns: (20 × 2) + (10 × 40 × 2) = 840 tokens (the one-time setup is also echoed back once as tool history)
That’s roughly a 30% token saving on this simple example. On longer sessions with bigger setups (load a trained ML model, fit an index, open a database connection) the savings grow much faster, because the setup cost you’d otherwise pay 10 times over might be 500 tokens, not 20.
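The arithmetic above can be packed into one back-of-envelope function. This is a sketch of the post's own cost model, not a Podflare API: every token of tool code is charged twice, once when the model emits it and once when it is echoed back as tool history, so the only difference persistence makes is how many times the setup is emitted.

```python
# Back-of-envelope token-cost model (illustrative, not a measurement tool).
def session_tokens(turns: int, setup: int, logic: int, persistent: bool) -> int:
    # Without persistence the setup boilerplate is re-emitted every turn;
    # with persistence it is emitted once. Everything is charged twice
    # (model output + tool-history echo).
    setups = 1 if persistent else turns
    return 2 * (setups * setup + turns * logic)

fresh = session_tokens(turns=10, setup=20, logic=40, persistent=False)
warm = session_tokens(turns=10, setup=20, logic=40, persistent=True)
print(fresh, warm)                      # 1200 840
print(f"{1 - warm / fresh:.0%} saved")  # 30% saved
```

Under this consistent double-counting the warm session costs 840 tokens rather than 820; plugging in heavier setups (say 450 setup tokens over 20 turns) reproduces the customer figures in the next section.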
Real-world numbers from our customers
We monitor this across Podflare agent workloads. Cases we’ve seen:
- Data-analysis agent, 20-turn sessions. Setup cost: load a 2 GB parquet, fit a scikit-learn model, pre-compute some features. ~450 tokens of setup. Without persistence: 450 × 20 × 2 = 18,000 setup-tokens per session. With persistence: 450 × 2 = 900. 20x savings on that slice.
- Trading research agent, 50-turn sessions. Setup: import pandas/numpy/scipy, fetch historical data, build a lookup table. ~300 tokens. 50-turn session without persistence: 300 × 50 × 2 = 30,000 tokens. With persistence: 600. 50x.
- Coding agent doing multi-file refactors. Setup: clone the repo, parse the AST. ~2,000 tokens of setup code. Without persistence, every file edit re-emits the full clone-and-parse sequence, which is simply not viable at scale.
Runtime latency, not just tokens
Tokens are the cheap part. The time saved is often the bigger win:
- Parquet parse: 200–500 ms per call
- Model load: 1–10 s per call
- Warm network connection (DB, API client): 50–200 ms per call
On a 20-turn session, that expensive setup happens once in total instead of 20 times. That is the difference between the user waiting 500 ms and 3 s on every turn.
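To make the latency claim concrete, here is a toy tally using assumed values picked from the ranges above (the specific numbers are illustrative, not measurements):

```python
# Illustrative per-call setup costs, in seconds (assumed midpoint-ish values).
TURNS = 20
parquet_parse_s = 0.35  # from the 200–500 ms range
model_load_s = 2.0      # low end of the 1–10 s range
warm_conn_s = 0.1       # from the 50–200 ms range

setup_s = parquet_parse_s + model_load_s + warm_conn_s

# Fresh container: the setup is paid on every one of the 20 turns.
print(f"fresh containers: {TURNS * setup_s:.1f}s of setup across the session")
# Persistent REPL: the setup is paid once, at the start.
print(f"persistent REPL: {setup_s:.2f}s of setup, paid once")
```

Even with these conservative assumptions the fresh-container session spends about 49 seconds of pure setup time that the persistent session spends once.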
How Podflare implements it
Every sandbox starts an ipykernel process at boot, sitting on a vsock control channel. Your sb.run_code("...") call pushes the code to the kernel, which execs it in the main namespace. Variables, imports, open files — all sit in globals() between calls. The kernel outlives the RPC that created it.
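The core mechanism is small enough to sketch in a few lines. This is a minimal illustration of the idea, not Podflare's actual kernel: one long-lived namespace dict that every `run_code` call executes against, so names defined in call N are still visible in call N+1.

```python
# Minimal persistent-REPL sketch: a single namespace dict shared across calls.
import io
from contextlib import redirect_stdout

class PersistentREPL:
    def __init__(self):
        self.namespace = {}  # stands in for the kernel's globals()

    def run_code(self, code: str) -> str:
        buf = io.StringIO()
        with redirect_stdout(buf):
            exec(code, self.namespace)  # same dict on every call
        return buf.getvalue()

repl = PersistentREPL()
repl.run_code("x = 40 + 2")       # call #1 defines x
out = repl.run_code("print(x)")   # call #2 still sees it
print(out)                        # -> 42
```

A real kernel adds process isolation, stderr/result capture, and interrupt handling on top, but the state model is the same: execution shares one namespace that outlives each call.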
For cross-session persistence (user closes the chat, comes back tomorrow), use persistent=True at create time. Podflare freezes the full VM memory to a Space when idle, resumes it next time. The Python process is the same one — id() of any object in memory is the same before and after the freeze/resume.
```python
sb = Sandbox(persistent=True)
sb.run_code("model = train_big_model()  # 10 min")
sb.idle()  # freeze to Space

# Tomorrow:
sb = Sandbox.resume(space_id)
sb.run_code("print(model.predict(...))")  # model still in memory, no retrain
```

When you DON’T want persistent state
One real case: when the user’s input to the agent is adversarial and the previous turn’s state may have been tampered with. This is the classic prompt-injection defense pattern: for security-critical code paths, use a fresh sandbox per call. Podflare supports both; choose per use case.
Related reading
- What is a cloud sandbox for AI agents? — the broader pitch.
- The fork() primitive — the other state primitive. Persistent + forkable is the combination that really changes what agent architectures are possible.
- Benchmark — per-call latency numbers.
Try it
Free Podflare account, $200 starter credit. Docs on persistent REPL + Spaces at docs.podflare.ai/concepts/repl.