Every serious AI agent framework shipping in 2026 — OpenAI Agents, Anthropic tool_use, Vercel AI SDK, LangChain, Google Gemini, MCP — exposes the same primitive: a code-execution tool the model can call. The model writes Python or Bash, the tool runs it, the stdout comes back, and the next turn of the loop uses the result.
That tool has to run the code somewhere. And because the code is written by a large language model — often on behalf of an end user you don't fully trust — it can't run on your own servers, can't run in the end user's browser, and probably shouldn't run in a container you built yourself.
It runs in a cloud sandbox. This post walks through what that means, why the category emerged, and what properties you should be shopping for.
The definition
A cloud sandbox (sometimes called a code-execution sandbox or microVM sandbox) is a short-lived, isolated Linux environment that your AI agent — or your application on behalf of an agent — can create on demand, run arbitrary code in, and destroy when done.
In practice that means an HTTP API like:
```python
from podflare import Sandbox

with Sandbox() as sb:
    sb.run_code("pip install scikit-learn pandas", language="bash")
    out = sb.run_code("""
import pandas as pd
from sklearn.linear_model import LinearRegression

df = pd.read_csv('https://example.com/data.csv')
model = LinearRegression().fit(df[['x']], df['y'])
print(model.coef_)
""")
    print(out.stdout)
```

Four things are happening under the hood:
- Create — a fresh, hardware-isolated Linux machine is handed to your agent. Good sandboxes do this in well under 200 ms by keeping a pool of pre-booted VMs.
- Execute — arbitrary code runs inside with real internet access, a real filesystem, and a real Python REPL whose state persists across calls.
- Capture — stdout, stderr, errors, and artifacts come back to your code over a streaming response.
- Destroy — the VM is terminated and its memory + disk are reclaimed. Nothing persists unless you explicitly ask it to.
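The four steps map naturally onto a context-manager lifecycle. As a purely local stand-in (no Podflare SDK, no isolation, just the same create/execute/capture/destroy shape), here is a minimal sketch; `LocalSandbox` and its `run_code` method are our own toy names, chosen to mirror the example above:

```python
import shutil
import subprocess
import sys
import tempfile

class LocalSandbox:
    """Toy stand-in for a cloud sandbox: same lifecycle, none of the isolation."""

    def __enter__(self):
        # Create: the real service boots (or leases from a warm pool) a microVM.
        self.workdir = tempfile.mkdtemp(prefix="sbx-")
        return self

    def run_code(self, code: str):
        # Execute + capture: run the snippet, collect stdout/stderr/exit code.
        return subprocess.run(
            [sys.executable, "-c", code],
            cwd=self.workdir,
            capture_output=True,
            text=True,
            timeout=30,
        )

    def __exit__(self, *exc):
        # Destroy: reclaim everything; nothing persists.
        shutil.rmtree(self.workdir, ignore_errors=True)

with LocalSandbox() as sb:
    out = sb.run_code("print(2 + 2)")
    print(out.stdout.strip())  # prints 4
```

The difference in production is *where* `run_code` executes: here it is a child process on your machine; with a cloud sandbox it is an HTTP call into a VM that is not yours.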
Why not just use a container?
This is the question every team asks first. The short answer: containers share a kernel with the host, and kernels are complicated.
Docker, Podman, and the runtimes underneath Kubernetes all rely on Linux namespaces and cgroups for isolation. These are kernel features. A kernel bug that lets a guest process escape its namespace is a container escape. The last five years of Linux kernel CVEs have included at least a dozen that could be used this way.
When your guest code is written by you, that's an acceptable risk — you're running your own trusted binaries. When your guest code is written by an LLM that was just prompt-injected by hostile user-supplied input, the risk calculus changes. Every Podflare customer we talk to sooner or later hits the same scenario: our model was asked to summarize a webpage, the webpage contained an instruction to `curl -X POST` our internal API, and the model wrote the exploit itself.
The cloud sandbox answer to that is hardware isolation: put each piece of LLM-written code inside its own dedicated microVM (a Podflare Pod), so the security boundary is KVM (the Linux hypervisor) instead of namespaces. A container escape CVE doesn't apply — the hypervisor is on the other side of the boundary.
Why not serverless functions (Lambda, Workers, Cloud Run)?
Serverless runtimes are close but miss three things:
- Persistence within a session. When your agent imports pandas, parses a 500 MB CSV, and then asks three follow-up questions, you don't want to re-parse the CSV three times. A sandbox keeps the Python REPL alive across calls; state in `globals()` survives. A serverless function discards it.
- Long-running execution. Lambda caps at 15 minutes. Workers cap at seconds. An agent training a small model, scraping a site, or running a test suite quickly exceeds both.
- Full root-level system access to build things. `apt install`, `git clone`, build a Docker image inside the sandbox, write to arbitrary files. Serverless locks most of this down; a sandbox doesn't.
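The persistence point is easy to demonstrate: a sandbox's REPL is essentially a long-lived interpreter with one shared namespace, so state set in one call is visible in the next. Here is a minimal local model of that semantics (the `StatefulREPL` class is illustrative, not any platform's API):

```python
class StatefulREPL:
    """Models a sandbox REPL: one namespace shared across run_code calls."""

    def __init__(self):
        self.namespace = {}  # survives between calls, like the sandbox's globals()

    def run_code(self, code: str):
        exec(code, self.namespace)

repl = StatefulREPL()
repl.run_code("df = list(range(1_000_000))  # stand-in for an expensive CSV parse")
repl.run_code("total = sum(df)")            # sees `df` from the previous call
print(repl.namespace["total"])
```

The serverless equivalent would be `exec(code, {})` on every invocation: a fresh dict each time, so `df` never survives to the follow-up question.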
V8 isolates (Cloudflare Workers) spawn in ~1 ms — dramatically faster than any microVM. But they're JavaScript or Wasm only: no native dependencies, no filesystem, no Python REPL state. Real agents that run `pip install scikit-learn && model.fit(X)` don't run in isolates.
The properties to shop for
If you're evaluating cloud sandbox platforms, these are the dimensions that matter most in production:
1. Cold-start latency, specifically p95/p99
p50 is easy to optimize; it's the tail that kills agent UX. A 95th percentile of 3 seconds means every tool call has a one-in-twenty chance of stalling the agent for three seconds, and the user feels it. We benchmarked the three major platforms head-to-head in this post; look specifically at the p99 columns.
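When you run your own evaluation, compute the tail directly from raw per-call timings rather than trusting a dashboard average. A quick standard-library sketch, assuming you've collected cold-start samples in milliseconds:

```python
import statistics

def tail_latency(samples_ms, q):
    """Return the q-th quantile (e.g. 0.95) of a list of latency samples."""
    # quantiles(n=100) returns the 1st..99th percentile cut points.
    cuts = statistics.quantiles(samples_ms, n=100)
    return cuts[int(q * 100) - 1]

# 95 fast cold starts plus a handful of slow outliers:
samples = [180] * 95 + [3000] * 5
p50 = statistics.median(samples)
p95 = tail_latency(samples, 0.95)
print(f"p50={p50:.0f} ms  p95={p95:.0f} ms")
```

Note how a platform can honestly advertise a ~180 ms median while one call in twenty takes seconds — which is exactly why the p95/p99 columns, not p50, should drive the decision.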
2. Isolation boundary
Hardware isolation (a Podflare Pod microVM) > user-space kernels and user-namespace sandboxing (gVisor, Sysbox) > OS-level containers (Docker's default). The stronger the isolation, the less you have to worry about a prompt-injected agent escaping into your host environment.
3. Persistence semantics
Your options span a spectrum: the sandbox dies completely on close (default), its filesystem is archived for later (Daytona, E2B), or its full VM memory + running processes are frozen to disk and resume-able later (Podflare Spaces). The last is what you want for genuinely long-running agent sessions — restart the sandbox tomorrow and your Python interpreter is still holding the same df.
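The difference between the spectrum's endpoints can be modeled at the interpreter level: serialize the session's state before the sandbox closes, restore it into a fresh one later. A toy sketch with `pickle` (a real memory freeze snapshots the whole VM — processes, open sockets, the interpreter itself — not just Python objects, which is what makes "the same df is still there tomorrow" possible):

```python
import pickle

# State built up during today's agent session.
session = {"df": [1, 2, 3], "notes": "fit looked good"}

# "Freeze": persist the session state before the sandbox closes.
blob = pickle.dumps(session)

# ...tomorrow, in a brand-new sandbox: "resume".
restored = pickle.loads(blob)
print(restored["df"])
```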
4. A fork primitive, if you're doing tree-search
Tree-of-thought, multi-attempt code synthesis, parallel hypothesis testing — these patterns all want "take the sandbox's current state and spawn N copies of it, each trying a different branch." Only one of the three major platforms exposes this today (Podflare's fork(n), in ~80 ms); on the others you either redo the expensive setup N times or take a snapshot, wait the seconds it costs to commit, and restore it for each branch.
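Whether the platform gives you a native fork or you emulate it with snapshot-and-restore, the agent-side pattern is the same: pay for the expensive shared state once, then branch N independent copies and score each. A local sketch using deep copies in place of a real VM fork (all names here are illustrative):

```python
import copy

def expensive_setup():
    # Stand-in for the slow part: pip installs, data loading, model fitting.
    return {"dataset": list(range(10)), "best": None}

def try_branch(state, candidate):
    # Each branch mutates only its own copy of the state.
    state["best"] = sum(x * candidate for x in state["dataset"])
    return state["best"]

base = expensive_setup()                            # pay the setup cost once
branches = [copy.deepcopy(base) for _ in range(3)]  # "fork" three copies
scores = [try_branch(b, c) for b, c in zip(branches, [1, 2, 3])]
print(max(scores))                                  # keep the winning branch
```

The value of a fast native fork is that `deepcopy` here becomes a ~80 ms VM-level operation, so the N branches also share everything outside the interpreter: installed packages, files on disk, running processes.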
5. Egress controls and outbound network policy
Most agent workloads want full internet access so pip install works. But for a security-sensitive workload, you want the option to lock egress down to a specific allowlist of domains ("only pypi.org and api.openai.com"), or to disable egress entirely for truly adversarial code. Check whether the platform lets you set this policy per sandbox.
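Even when the platform enforces egress policy, it's cheap to gate defensively on the agent side too: check any URL the model produces against the allowlist before fetching it. The host list and helper below are illustrative, not any platform's actual config shape:

```python
from urllib.parse import urlparse

# Hypothetical per-sandbox allowlist; exact hostnames only, no subdomains.
ALLOWED_HOSTS = {"pypi.org", "files.pythonhosted.org", "api.openai.com"}

def egress_allowed(url: str) -> bool:
    """True only if the URL's host is on the allowlist."""
    host = urlparse(url).hostname or ""
    return host.lower() in ALLOWED_HOSTS

print(egress_allowed("https://pypi.org/simple/pandas/"))      # True
print(egress_allowed("https://evil.example.com/exfiltrate"))  # False
```

Matching on the parsed hostname (rather than substring-searching the raw URL) matters: `https://pypi.org.evil.example.com/` would pass a naive `"pypi.org" in url` check but fails this one.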
6. Multi-region routing
If your users are global, you want sandboxes created near them. A single-region sandbox platform gives 50–300 ms of extra round-trip latency to users on the wrong continent. Multi-region with haversine routing + automatic failover is worth paying for.
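"Haversine routing" just means: compute the great-circle distance from the user's coordinates to each region and pick the nearest. The region list below is made up for illustration; only the formula is load-bearing:

```python
import math

# Hypothetical region coordinates (lat, lon) — illustrative only.
REGIONS = {
    "us-east": (39.0, -77.5),
    "eu-west": (53.3, -6.3),
    "ap-south": (19.1, 72.9),
}

def haversine_km(a, b):
    """Great-circle distance between two (lat, lon) points, in km."""
    lat1, lon1, lat2, lon2 = map(math.radians, (*a, *b))
    h = (math.sin((lat2 - lat1) / 2) ** 2
         + math.cos(lat1) * math.cos(lat2) * math.sin((lon2 - lon1) / 2) ** 2)
    return 2 * 6371 * math.asin(math.sqrt(h))

def nearest_region(user):
    return min(REGIONS, key=lambda r: haversine_km(user, REGIONS[r]))

print(nearest_region((48.9, 2.3)))  # a user near Paris routes to "eu-west"
```

Failover then falls out naturally: if the nearest region is unhealthy, drop it from the candidate set and re-run the same `min`.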
Where Podflare fits
Podflare is a cloud sandbox purpose-built for AI agents. Every sandbox is a hardware-isolated Podflare Pod microVM. Create + exec + close round-trips at ~190 ms p50 from a US laptop. fork(n) in ~80 ms for tree-of-thought patterns. Full VM memory freeze into a Space for cross-session persistence. Five production regions with Cloudflare-edge haversine routing. Drop-in integrations with every major agent framework.
The head-to-head benchmark post covers how we compare against E2B and Daytona on latency. For security thinking, see Why Docker isn't enough for LLM-generated code.
Ready to ship? Create a free account — you get a $200 starter credit, all 5 regions, and the full SDK in under a minute.