Benchmarks · Apr 21, 2026 · 7 min read

Cloud sandbox benchmark for AI agents: E2B vs Daytona vs Podflare (April 2026)

We ran an identical-harness head-to-head across the three major cloud sandbox platforms for AI agents. Here's the full 30-iteration latency distribution — and why the numbers look the way they do.

Robel Tegegne, Podflare founder

If you're building an AI agent that runs LLM-generated code, three cloud sandbox platforms dominate the space today: E2B, Daytona, and Podflare. They all do the same basic thing — hand your agent a fresh, disposable Linux VM it can pip install, curl, and execute arbitrary code inside without risking anything on your host. The question we kept getting from customers evaluating us head-to-head was simple: how do they actually compare on latency?

So we wrote an identical-harness benchmark and ran all three from the same MacBook on residential wifi, within the same ten-minute window. Below are the numbers, the methodology, the architecture behind why the gap is what it is, and — most importantly — the commands you need to reproduce it from wherever your agent actually runs.

The setup

Thirty sequential Sandbox.create() → exec("echo ready") → close() cycles per platform. No warmup-and-discard, no cherry-picking, no selective sampling. Bench scripts are public at github.com/PodFlare-ai/demo:

pip install podflare e2b-code-interpreter daytona

PODFLARE_API_KEY=pf_live_... python benchmarks/bench-reliability.py podflare
E2B_API_KEY=...               python benchmarks/bench-reliability.py e2b
DAYTONA_API_KEY=...           python benchmarks/bench-reliability.py daytona

Each platform uses its own SDK's default region (E2B → us-east4, Daytona → nearest single region, Podflare → Cloudflare-edge-routed to us-west). Each iteration is a full create + exec + close round-trip, including TLS and the three HTTPS calls this implies.
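Stripped of SDK specifics, each timed iteration looks roughly like this. A minimal sketch of the harness loop: the `create`, `run`, and `close` callables are stand-ins for whichever SDK you're benchmarking, not real Podflare/E2B/Daytona calls:

```python
import time

def bench(create, run, close, iterations=30):
    """Time full create → exec → close round-trips, in milliseconds.

    create/run/close are placeholders for the platform SDK calls
    (e.g. Sandbox.create(), sandbox.exec("echo ready"), sandbox.close()).
    """
    timings_ms = []
    for _ in range(iterations):
        t0 = time.perf_counter()
        sandbox = create()          # provision a fresh VM/container
        run(sandbox, "echo ready")  # one exec round-trip
        close(sandbox)              # tear it down
        timings_ms.append((time.perf_counter() - t0) * 1000)
    return timings_ms

# Stub "platform" so the loop is runnable without any SDK installed:
timings = bench(lambda: object(), lambda s, cmd: "ready", lambda s: None)
```

Swapping the three lambdas for real SDK calls is the only per-platform difference in the published scripts.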

The results

30 iterations each, milliseconds:

               min   p50   p95   p99   max   mean
  Podflare     143   153   170   236   263   173
  E2B          418   467   750   852   888   509
  Daytona      439   713  1130  1136  1137   722

Podflare wins every percentile:

  • p50: 3.0× faster than E2B, 4.7× faster than Daytona
  • p95: 4.4× vs E2B, 6.6× vs Daytona
  • p99: 3.6× vs E2B, 4.8× vs Daytona
  • max: bounded under 270 ms — the other two have outliers above 850 ms

Zero errors across 90 total iterations. In the traditional uptime sense, reliability was identical across all three platforms in this run; the differentiator is the latency distribution, which is what interactive agent loops actually feel.
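For reference, the percentiles in the table fall out of the raw timing list directly. A sketch using Python's `statistics` module — the input numbers below are made-up placeholders, not the bench data:

```python
import statistics

def summarize(timings_ms):
    """Compute the same summary columns as the results table."""
    # quantiles(n=100) returns the 99 cut points p1..p99
    q = statistics.quantiles(timings_ms, n=100)
    return {
        "min": min(timings_ms),
        "p50": q[49],
        "p95": q[94],
        "p99": q[98],
        "max": max(timings_ms),
        "mean": statistics.fmean(timings_ms),
    }

# Placeholder timings, NOT the published run:
stats = summarize([150, 151, 153, 154, 160, 170, 200, 236, 143, 263])
```

With only 30 samples the tail percentiles are interpolated from very few points, which is exactly why the repo publishes the raw per-iteration numbers rather than just the summary row.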

Why the gap is what it is

Every exec() call looks identical from the SDK side, but the transport underneath is fundamentally different on each platform:

  Platform   First-exec transport stack
  Podflare   hostd → vsock binary protocol → in-VM agent. No TCP, no TLS inside the guest.
  E2B        client-proxy → orchestrator → ConnectRPC over TCP + TLS to envd in-VM.
  Daytona    proxy → runner → Docker / Sysbox + HTTP server inside the container.

vsock is a host-to-guest socket family that skips the TCP + TLS + HTTP framing E2B and Daytona pay inside the guest for every exec. Server-side round-trip on Podflare is around 3 ms; the remaining ~150 ms is the network between the caller and the region — which you can't optimize away on any of the platforms because physics.
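To make "binary protocol vs HTTP framing" concrete: a length-prefixed binary frame costs a few bytes and one struct pack, while an HTTP request costs headers, status lines, and a parser on both ends. A toy version of the kind of framing a vsock agent protocol might use — an illustrative sketch, not Podflare's actual wire format:

```python
import struct

def frame(payload: bytes) -> bytes:
    """Prefix the payload with a 4-byte big-endian length."""
    return struct.pack(">I", len(payload)) + payload

def unframe(buf: bytes) -> bytes:
    """Parse one length-prefixed frame back out of a byte buffer."""
    (length,) = struct.unpack(">I", buf[:4])
    return buf[4 : 4 + length]

wire = frame(b"exec echo ready")
# Total on-wire cost: payload + 4 bytes. Compare the couple hundred
# bytes of headers even a minimal HTTP/1.1 POST would add per exec.
echoed = unframe(wire)
```

The per-exec saving from framing alone is small; the bigger win is skipping TCP handshakes and TLS negotiation inside the guest entirely.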

Going through the edge is faster than going direct

One counter-intuitive finding worth calling out: api.podflare.ai is a Cloudflare Worker that haversine-routes incoming requests to the nearest region. Instinct says fewer hops = lower latency, and that going straight to a region URL like usw1.podflare.ai should be faster. From residential wifi it isn't:

  via api.podflare.ai  →  p99 = 236 ms, max = 263 ms
  direct to usw1       →  p99 = 483 ms, max = 594 ms

That's because Cloudflare's edge PoP is closer to the caller than any single origin, and Cloudflare's backbone to the origin is a cleaner path than the public-internet route your ISP gives you. The "extra hop" is shorter in wall-clock time. Reproducible.
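The routing decision itself is simple: compute great-circle distance from the caller's geolocation to each region and pick the closest. A sketch of that selection — the region coordinates here are illustrative, not Podflare's actual PoP list:

```python
from math import radians, sin, cos, asin, sqrt

def haversine_km(lat1, lon1, lat2, lon2):
    """Great-circle distance between two points, in kilometres."""
    lat1, lon1, lat2, lon2 = map(radians, (lat1, lon1, lat2, lon2))
    a = (sin((lat2 - lat1) / 2) ** 2
         + cos(lat1) * cos(lat2) * sin((lon2 - lon1) / 2) ** 2)
    return 2 * 6371 * asin(sqrt(a))

# Illustrative region coordinates (not the real PoP list):
REGIONS = {"usw1": (37.35, -121.95), "use1": (39.04, -77.49)}

def nearest_region(lat, lon):
    return min(REGIONS, key=lambda r: haversine_km(lat, lon, *REGIONS[r]))

region = nearest_region(47.61, -122.33)  # a caller in Seattle
```

In a Worker, the caller's latitude/longitude comes for free from the edge request metadata, so this lookup adds microseconds, not a round-trip.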

What this benchmark didn't measure

  • fork(n). Only Podflare exposes this primitive: snapshot a running VM mid-flight, spawn N children from the exact parent state. ~80 ms server-side for n=5. This is the primitive that tree-of-thought and multi-attempt code synthesis patterns actually want.
  • Persistent state across destroy. Podflare (full VM memory freeze into a "Space"), E2B (snapshot API), Daytona (container archive). Semantics differ enough that a single "ms to resume" number isn't apples-to-apples.
  • HTTP outbound from inside the sandbox. Geography dominates — E2B hits api.github.com/zen in 25 ms because their colo is near GitHub's Azure us-east peering; Podflare us-west is 89 ms, us-east is 29 ms. That's about which datacenter your sandbox lives in, not platform speed.
  • Cost per run. Ages poorly. All three land within an order of magnitude per execution-minute; pricing pages move.
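To illustrate why fork(n) matters for those patterns: each child starts from the parent's exact state and diverges independently, so an agent can explore N candidate fixes without replaying setup. A toy model of the semantics using deep-copied state — the shape is hypothetical, and the real primitive snapshots a running VM's memory, not a dict:

```python
import copy

class ToySandbox:
    """Toy stand-in for a VM; 'state' models memory + filesystem."""
    def __init__(self, state=None):
        self.state = state if state is not None else {}

    def fork(self, n):
        # Each child gets an independent copy of the parent's exact state.
        return [ToySandbox(copy.deepcopy(self.state)) for _ in range(n)]

parent = ToySandbox({"installed": ["numpy"], "attempt": None})
children = parent.fork(5)
children[0].state["attempt"] = "fix-a"  # children diverge...
children[1].state["attempt"] = "fix-b"
# ...without touching the parent or each other.
```

Without fork, the equivalent fan-out is N fresh sandboxes each repeating the parent's entire setup, which is where multi-attempt loops lose their time.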

Reproduce these numbers

The whole point of publishing a bench is that you don't have to trust any of this:

git clone https://github.com/PodFlare-ai/demo
cd demo
python benchmarks/bench-reliability.py podflare
python benchmarks/bench-reliability.py e2b
python benchmarks/bench-reliability.py daytona

If your numbers differ meaningfully from mine — especially if E2B or Daytona wins on some percentile from your vantage point — tell me. Include the SDK versions, your geography, and your network. I'd much rather rerun the bench from your machine than argue about whose laptop is faster. Open a PR on the demo repo, or email hello@podflare.ai.

Takeaways

  • Raw latency: Podflare wins every percentile measured here. If an interactive agent's tool-call loop is your bottleneck, this is the gap that matters.
  • Apache-2.0 + self-host on GCP/AWS: E2B.
  • AGPL + self-host on Docker: Daytona.
  • Fork-based tree-of-thought, persistent VM memory across restarts, multi-region edge routing with failover: Podflare.

Most importantly: run this bench yourself from wherever your agent actually lives. The p99 you care about is the one your own code measures, not the one a vendor puts in a comparison table — especially if that vendor is me.

For the full architecture comparison with per-operation breakdowns, see docs.podflare.ai/architecture/comparison. For per-operation server-side latency numbers (pool hit, fork diff snapshot, hot exec), see the Performance page.

#e2b alternative · #daytona alternative · #cloud sandbox · #ai agent infrastructure · #code interpreter · #benchmark · #pod microvm
