Every serious AI agent framework shipping in 2026 — OpenAI Agents, Anthropic tool_use, Vercel AI SDK, LangChain, Google Gemini, MCP — exposes the same primitive: a code-execution tool the model can call. The model writes Python or Bash, the tool runs it, the stdout comes back, and the next turn of the loop uses the result.
That tool has to run the code somewhere. And because the code is written by a large language model — often on behalf of an end user you don't fully trust — it can't run on your own servers, can't run in the end user's browser, and probably shouldn't run in a container you built yourself.
It runs in a cloud sandbox. This post walks through what that means, why the category emerged, and what properties you should be shopping for.
The definition
A cloud sandbox (sometimes called a code-execution sandbox or microVM sandbox) is a short-lived, isolated Linux environment that your AI agent — or your application on behalf of an agent — can create on demand, run arbitrary code in, and destroy when done.
In practice that means an HTTP API like:
```python
from podflare import Sandbox

with Sandbox() as sb:
    sb.run_code("pip install scikit-learn pandas", language="bash")
    out = sb.run_code("""
import pandas as pd
from sklearn.linear_model import LinearRegression

df = pd.read_csv('https://example.com/data.csv')
model = LinearRegression().fit(df[['x']], df['y'])
print(model.coef_)
""")
    print(out.stdout)
```

Four things are happening under the hood:
- Create — a fresh, hardware-isolated Linux machine is handed to your agent. Good sandboxes do this in well under 200 ms by keeping a pool of pre-booted VMs.
- Execute — arbitrary code runs inside with real internet access, a real filesystem, and a real Python REPL whose state persists across calls.
- Capture — stdout, stderr, errors, and artifacts come back to your code over a streaming response.
- Destroy — the VM is terminated and its memory + disk are reclaimed. Nothing persists unless you explicitly ask it to.
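The four steps map naturally onto a context-manager lifecycle. As a purely local stand-in (no Podflare SDK, no isolation, just the same create/execute/capture/destroy shape), here is a minimal sketch; `LocalSandbox` and its `run_code` method are our own toy names, chosen to mirror the example above:

```python
import shutil
import subprocess
import sys
import tempfile

class LocalSandbox:
    """Toy stand-in for a cloud sandbox: same lifecycle, none of the isolation."""

    def __enter__(self):
        # Create: the real service boots (or leases from a warm pool) a microVM.
        self.workdir = tempfile.mkdtemp(prefix="sbx-")
        return self

    def run_code(self, code: str):
        # Execute + capture: run the snippet, collect stdout/stderr/exit code.
        return subprocess.run(
            [sys.executable, "-c", code],
            cwd=self.workdir,
            capture_output=True,
            text=True,
            timeout=30,
        )

    def __exit__(self, *exc):
        # Destroy: reclaim everything; nothing persists.
        shutil.rmtree(self.workdir, ignore_errors=True)

with LocalSandbox() as sb:
    out = sb.run_code("print(2 + 2)")
    print(out.stdout.strip())  # prints 4
```

The difference in production is *where* `run_code` executes: here it is a child process on your machine; with a cloud sandbox it is an HTTP call into a VM that is not yours.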
Why not just use a container?
This is the question every team asks first. The short answer: containers share a kernel with the host, and kernels are complicated.
Docker, Podman, and the runtimes underneath Kubernetes all rely on Linux namespaces and cgroups for isolation. These are kernel features. A kernel bug that lets a guest process escape its namespace is a container escape. The last five years of Linux kernel CVEs have included at least a dozen that could be used this way.
When your guest code is written by you, that's an acceptable risk — you're running your own trusted binaries. When your guest code is written by an LLM that was just prompt-injected by hostile user-supplied input, the risk calculus changes. Every Podflare customer we talk to sooner or later hits the same scenario: our model was asked to summarize a webpage, the webpage contained an instruction to `curl -X POST` our internal API, and the model wrote the exploit itself.
The cloud sandbox answer to that is hardware isolation: put each piece of LLM-written code inside its own dedicated microVM (a Podflare Pod), so the security boundary is KVM (the Linux hypervisor) instead of namespaces. A container escape CVE doesn't apply — the hypervisor is on the other side of the boundary.
Why not serverless functions (Lambda, Workers, Cloud Run)?
Serverless runtimes are close but miss three things:
- Persistence within a session. When your agent imports pandas, parses a 500 MB CSV, and then asks three follow-up questions, you don't want to re-parse the CSV three times. A sandbox keeps the Python REPL alive across calls; state in `globals()` survives. A serverless function discards it.
- Long-running execution. Lambda caps at 15 minutes. Workers cap at seconds. An agent training a small model, scraping a site, or running a test suite quickly exceeds both.
- Full root-level system access to build things. `apt install`, `git clone`, build a Docker image inside the sandbox, write to arbitrary files. Serverless locks most of this down; a sandbox doesn't.
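The persistence point is easy to demonstrate: a sandbox's REPL is essentially a long-lived interpreter with one shared namespace, so state set in one call is visible in the next. Here is a minimal local model of that semantics (the `StatefulREPL` class is illustrative, not any platform's API):

```python
class StatefulREPL:
    """Models a sandbox REPL: one namespace shared across run_code calls."""

    def __init__(self):
        self.namespace = {}  # survives between calls, like the sandbox's globals()

    def run_code(self, code: str):
        exec(code, self.namespace)

repl = StatefulREPL()
repl.run_code("df = list(range(1_000_000))  # stand-in for an expensive CSV parse")
repl.run_code("total = sum(df)")            # sees `df` from the previous call
print(repl.namespace["total"])
```

The serverless equivalent would be `exec(code, {})` on every invocation: a fresh dict each time, so `df` never survives to the follow-up question.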
V8 isolates (Cloudflare Workers) spawn in ~1 ms — dramatically faster than any microVM. But they're JavaScript or Wasm only: no native dependencies, no filesystem, no Python REPL state. Real agents that run `pip install scikit-learn && model.fit(X)` don't run in isolates.
The properties to shop for
If you're evaluating cloud sandbox platforms, these are the dimensions that matter most in production:
1. Cold-start latency, specifically p95/p99
p50 is easy to optimize; it's the tail that kills agent UX. A 95th percentile of 3 seconds means every tool call has a one-in-twenty chance of stalling the agent for three seconds, and the user feels it. We benchmarked the three major platforms head-to-head in this post; look specifically at the p99 columns.
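When you run your own evaluation, compute the tail directly from raw per-call timings rather than trusting a dashboard average. A quick standard-library sketch, assuming you've collected cold-start samples in milliseconds:

```python
import statistics

def tail_latency(samples_ms, q):
    """Return the q-th quantile (e.g. 0.95) of a list of latency samples."""
    # quantiles(n=100) returns the 1st..99th percentile cut points.
    cuts = statistics.quantiles(samples_ms, n=100)
    return cuts[int(q * 100) - 1]

# 95 fast cold starts plus a handful of slow outliers:
samples = [180] * 95 + [3000] * 5
p50 = statistics.median(samples)
p95 = tail_latency(samples, 0.95)
print(f"p50={p50:.0f} ms  p95={p95:.0f} ms")
```

Note how a platform can honestly advertise a ~180 ms median while one call in twenty takes seconds — which is exactly why the p95/p99 columns, not p50, should drive the decision.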
2. Isolation boundary
Hardware isolation (a Podflare Pod microVM) > user-space kernels and user-namespace sandboxing (gVisor, Sysbox) > OS-level containers (Docker's default). The stronger the isolation, the less you have to worry about a prompt-injected agent escaping into your host environment.
3. Persistence semantics
Your options span a spectrum: the sandbox dies completely on close (default), its filesystem is archived for later (Daytona, E2B), or its full VM memory + running processes are frozen to disk and resume-able later (Podflare Spaces). The last is what you want for genuinely long-running agent sessions — restart the sandbox tomorrow and your Python interpreter is still holding the same df.
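The difference between the spectrum's endpoints can be modeled at the interpreter level: serialize the session's state before the sandbox closes, restore it into a fresh one later. A toy sketch with `pickle` (a real memory freeze snapshots the whole VM — processes, open sockets, the interpreter itself — not just Python objects, which is what makes "the same df is still there tomorrow" possible):

```python
import pickle

# State built up during today's agent session.
session = {"df": [1, 2, 3], "notes": "fit looked good"}

# "Freeze": persist the session state before the sandbox closes.
blob = pickle.dumps(session)

# ...tomorrow, in a brand-new sandbox: "resume".
restored = pickle.loads(blob)
print(restored["df"])
```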
4. A fork primitive, if you're doing tree-search
Tree-of-thought, multi-attempt code synthesis, parallel hypothesis testing — these patterns all want "take the sandbox's current state and spawn N copies of it, each trying a different branch." Only one of the three major platforms exposes this today (Podflare's fork(n), in ~80 ms); on the others you either redo the expensive setup N times or take a snapshot, wait the seconds it costs to commit, and restore it for each branch.
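Whether the platform gives you a native fork or you emulate it with snapshot-and-restore, the agent-side pattern is the same: pay for the expensive shared state once, then branch N independent copies and score each. A local sketch using deep copies in place of a real VM fork (all names here are illustrative):

```python
import copy

def expensive_setup():
    # Stand-in for the slow part: pip installs, data loading, model fitting.
    return {"dataset": list(range(10)), "best": None}

def try_branch(state, candidate):
    # Each branch mutates only its own copy of the state.
    state["best"] = sum(x * candidate for x in state["dataset"])
    return state["best"]

base = expensive_setup()                            # pay the setup cost once
branches = [copy.deepcopy(base) for _ in range(3)]  # "fork" three copies
scores = [try_branch(b, c) for b, c in zip(branches, [1, 2, 3])]
print(max(scores))                                  # keep the winning branch
```

The value of a fast native fork is that `deepcopy` here becomes a ~80 ms VM-level operation, so the N branches also share everything outside the interpreter: installed packages, files on disk, running processes.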
5. Egress controls and outbound network policy
Most agent workloads want full internet access so pip install works. But for a security-sensitive workload, you want the option to lock egress down to a specific allowlist of domains ("only pypi.org and api.openai.com"), or to disable egress entirely for truly adversarial code. Check whether the platform lets you set this policy per sandbox.
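Even when the platform enforces egress policy, it's cheap to gate defensively on the agent side too: check any URL the model produces against the allowlist before fetching it. The host list and helper below are illustrative, not any platform's actual config shape:

```python
from urllib.parse import urlparse

# Hypothetical per-sandbox allowlist; exact hostnames only, no subdomains.
ALLOWED_HOSTS = {"pypi.org", "files.pythonhosted.org", "api.openai.com"}

def egress_allowed(url: str) -> bool:
    """True only if the URL's host is on the allowlist."""
    host = urlparse(url).hostname or ""
    return host.lower() in ALLOWED_HOSTS

print(egress_allowed("https://pypi.org/simple/pandas/"))      # True
print(egress_allowed("https://evil.example.com/exfiltrate"))  # False
```

Matching on the parsed hostname (rather than substring-searching the raw URL) matters: `https://pypi.org.evil.example.com/` would pass a naive `"pypi.org" in url` check but fails this one.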
6. Multi-region routing
If your users are global, you want sandboxes created near them. A single-region sandbox platform gives 50–300 ms of extra round-trip latency to users on the wrong continent. Multi-region with haversine routing + automatic failover is worth paying for.
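"Haversine routing" just means: compute the great-circle distance from the user's coordinates to each region and pick the nearest. The region list below is made up for illustration; only the formula is load-bearing:

```python
import math

# Hypothetical region coordinates (lat, lon) — illustrative only.
REGIONS = {
    "us-east": (39.0, -77.5),
    "eu-west": (53.3, -6.3),
    "ap-south": (19.1, 72.9),
}

def haversine_km(a, b):
    """Great-circle distance between two (lat, lon) points, in km."""
    lat1, lon1, lat2, lon2 = map(math.radians, (*a, *b))
    h = (math.sin((lat2 - lat1) / 2) ** 2
         + math.cos(lat1) * math.cos(lat2) * math.sin((lon2 - lon1) / 2) ** 2)
    return 2 * 6371 * math.asin(math.sqrt(h))

def nearest_region(user):
    return min(REGIONS, key=lambda r: haversine_km(user, REGIONS[r]))

print(nearest_region((48.9, 2.3)))  # a user near Paris routes to "eu-west"
```

Failover then falls out naturally: if the nearest region is unhealthy, drop it from the candidate set and re-run the same `min`.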
Where Podflare fits
Podflare is a cloud sandbox purpose-built for AI agents. Every sandbox is a hardware-isolated Podflare Pod microVM. Create + exec + close round-trips at ~190 ms p50 from a US laptop. fork(n) in ~80 ms for tree-of-thought patterns. Full VM memory freeze into a Space for cross-session persistence. Five production regions with Cloudflare-edge haversine routing. Drop-in integrations with every major agent framework.
The head-to-head benchmark post covers how we compare against E2B and Daytona on latency. For security thinking, see Why Docker isn't enough for LLM-generated code.
Ready to ship? Create a free account — you get a $200 starter credit, all 5 regions, and the full SDK in under a minute.