Everyone reports p50. Vendor pricing pages, benchmark blog posts, marketing comparison tables. p50 is the easy number — it’s the median, it’s stable, it’s easy to look good on.
The number that actually affects your users is p99. Or p95, depending on your tolerance. p50 is what the happy path feels like. p99 is what one call in a hundred looks like, and across a multi-call session that compounds into something a large fraction of users hit every session. It’s the number that, if bad, makes your product feel broken.
This post is about why AI-agent architects should be shopping for platforms on tail latency, not median, and how to interpret what you see.
The math nobody does at evaluation time
Suppose your agent makes 10 tool calls per user interaction. (Realistic for a non-trivial workflow. Some do 30.)
Your chosen sandbox has a 200 ms p50 and a 3000 ms p99. Looks fine at a glance. Do the math:
- Probability that a given call exceeds p99 = 1/100
- Probability that all 10 calls come in under p99 = (99/100)^10 ≈ 0.904
- Probability that at least one call hits the tail: ~9.6% of sessions
So ~1 in 10 user interactions stalls for 3 seconds somewhere. Not "1% of the time this is slow" — 10% of users, every session. The perceived reliability of your product is capped by this.
Now the same math for a platform with:
- 300 ms p50 (worse than the first)
- 500 ms p99 (much better than the first)
10 tool calls. Probability of any call going past 500 ms in 10 tries: ~9.6%. But the worst-case delay is 500 ms, not 3 seconds. Users barely notice.
Counterintuitive but true: the platform with worse p50 and better tail is the better user experience.
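The session-level arithmetic above generalizes to any call count and any tail quantile. A quick sketch in plain Python (it assumes the calls are independent, which real networks only approximate):

```python
def tail_hit_probability(n_calls: int, quantile: float = 0.99) -> float:
    """Probability that at least one of n independent calls lands past
    the given latency quantile (e.g. 0.99 for p99)."""
    return 1 - quantile ** n_calls

# 10 tool calls per interaction, p99 tail:
p = tail_hit_probability(10)   # ~0.096 -- roughly 1 session in 10
# The hit rate is the same for both platforms above; what differs is
# the cost of a hit: ~3000 ms on the first vs ~500 ms on the second.
```

Note that the hit rate climbs fast with workflow depth: at 30 tool calls, more than a quarter of sessions touch the tail at least once.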
Why tails are what they are
Real distributions have tails. Network jitter, TCP SYN drops, garbage collection, upstream rate limits, thermal throttling, noisy-neighbor VMs — something causes the long call. The question isn’t whether you’ll have them; it’s what the worst case looks like when they happen.
Common tail sources for a cloud-sandbox platform:
- Cold region. Your pool is empty; the sandbox is booted on-demand. Usually 500 ms–2s.
- TCP retransmit. 0.05–0.35% of internet SYNs get dropped. Linux retransmits the SYN on an exponential backoff, so retries land at roughly 1 s, 3 s, and 7 s after the first attempt. That’s how you get 3-second tails on a connect call that usually takes 30 ms.
- SDK retry amplification. A well-meaning retry loop multiplies a single slow handshake into three chained timeouts. (We shipped this pathology ourselves and wrote about fixing it.)
- Concurrent-limit check at the edge. If the enforcement path involves fanning out to N regions to count "how many sandboxes does this org have," a slow region stalls every create.
Every one of these is fixable. Which brings us to...
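To make the retry-amplification point concrete, here is a deliberately simplified model (my own sketch, not any SDK’s actual logic) of worst-case connect time under a per-attempt timeout and a retry budget, when one handshake is slowed by a SYN retransmit:

```python
def worst_case_connect_s(timeout_s: float, retries: int, handshake_s: float) -> float:
    """Wall-clock to establish a connection when the network is dropping SYNs
    and the real handshake needs `handshake_s` (e.g. ~1.1 s after one
    retransmit). Simplification: every attempt sees the same slow network."""
    if handshake_s <= timeout_s:
        return handshake_s                # first attempt rides out the retransmit
    return (retries + 1) * timeout_s      # every attempt burns its full timeout

tight = worst_case_connect_s(timeout_s=1.0, retries=2, handshake_s=1.1)     # 3.0 s
generous = worst_case_connect_s(timeout_s=2.5, retries=1, handshake_s=1.1)  # 1.1 s
```

The tight configuration turns a single ~1.1 s handshake into three chained timeouts; the generous one just waits the retransmit out.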
What a platform can do about p99
Things we’ve done at Podflare to pull the tail in (rough order of impact):
- Aggressive warm pool (120 sandboxes pre-booted per region). Cold boots only happen when the pool drains faster than refill, which is rare. See the warm pool doc.
- Per-region timeouts with generous fallback. Our fan-out concurrent-limit check used to have 1500 ms per-region timeouts, so one slow region stalled every create. We dropped it to 400 ms: a slight under-count is preferable to stalling creates on the happy path.
- SDK connect timeout tuned for real networks. Generous connect timeout (2.5 s) + one retry on a fresh socket. Avoids the "tight timeout + retries=2" compounding pathology.
- Cloudflare edge in front of every region. From most residential callers, the CF edge PoP is closer than any individual origin — the "extra hop" is shorter wall-clock. Counterintuitive, but measurable.
- vsock instead of in-VM HTTP. The hot-exec path doesn’t pay for TCP + TLS + HTTP framing inside the guest. That’s 3 ms server-side on hot exec vs 180 ms on platforms that use in-VM HTTP.
Net result: Podflare p99 from a residential laptop is 236 ms. p99 from an agent running in a nearby cloud region is ~188 ms. Max observed across 100 sequential iterations: 475 ms. No 3-second tails. The full comparison is in the benchmark post.
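The per-region-timeout change can be sketched with `asyncio` (illustrative only; the region-count RPC is a stand-in passed as a parameter, not Podflare’s actual code):

```python
import asyncio
from typing import Awaitable, Callable

REGION_TIMEOUT_S = 0.4   # down from 1.5 s

async def org_sandbox_count(
    regions: list[str],
    count_in_region: Callable[[str], Awaitable[int]],
) -> int:
    """Fan out to every region; a region slower than the budget contributes
    zero. A slight under-count beats stalling every create on one slow region."""
    async def guarded(region: str) -> int:
        try:
            return await asyncio.wait_for(count_in_region(region), REGION_TIMEOUT_S)
        except asyncio.TimeoutError:
            return 0   # under-count rather than block the happy path
    return sum(await asyncio.gather(*(guarded(r) for r in regions)))
```

The key property: total latency of the check is bounded by `REGION_TIMEOUT_S`, no matter how slow the slowest region is.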
What you, the agent developer, can do
Independent of which platform you use:
1. Pre-warm a sandbox before the user needs it
At the start of a chat session — or better, at page load if you can predict the user will start a chat — open a sandbox. The first tool call then lands on an already-live sandbox, which has sub-50 ms hot-exec latency. Cold-start cost becomes free from the user’s perspective.
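A minimal shape for this, with `create_sandbox` as a stand-in for your platform SDK’s create call (hypothetical; here it just simulates ~200 ms of create latency):

```python
import asyncio
import time

async def create_sandbox() -> dict:
    await asyncio.sleep(0.2)                 # simulated cold create
    return {"id": "sb-demo", "ready": True}

class ChatSession:
    """Fire the create the moment the session opens; by the time the model
    emits its first tool call, the sandbox is usually already live."""
    def __init__(self) -> None:
        self._sandbox = asyncio.ensure_future(create_sandbox())

    async def sandbox(self) -> dict:
        return await self._sandbox           # near-instant if create finished

async def demo() -> float:
    session = ChatSession()                  # e.g. at chat open / page load
    await asyncio.sleep(0.25)                # user is still typing...
    start = time.perf_counter()
    await session.sandbox()                  # first tool call lands here
    return time.perf_counter() - start       # effectively zero wait
```

The create and the user’s typing overlap, so the first tool call pays nothing.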
2. Use persistent REPL, not fresh containers per call
Already covered here. Every tool call on an already-open sandbox is ~46 ms hot-exec; every tool call on a fresh sandbox is ~190 ms cold-start. Pay the create cost once, amortize across every subsequent call.
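The reuse pattern, sketched against a generic SDK interface (the `sdk.create()` / `sandbox.exec()` names here are placeholders, not a specific platform’s API):

```python
class ToolExecutor:
    """Create the sandbox lazily on the first tool call, then keep it open:
    every later call rides the ~46 ms hot path instead of a ~190 ms create."""
    def __init__(self, sdk) -> None:
        self._sdk = sdk
        self._sandbox = None

    async def exec(self, code: str):
        if self._sandbox is None:
            self._sandbox = await self._sdk.create()   # paid exactly once
        return await self._sandbox.exec(code)          # hot path thereafter
```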
3. Streaming stdout to the UI hides latency
For code that takes 1+ seconds to run, stream stdout byte-by-byte to the UI. The user sees output while the code runs, which makes perceived latency feel like zero even when wall-clock is real. Every major sandbox platform supports streaming tool results.
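A sketch of the streaming pattern; `run_streaming` is a stand-in for your SDK’s streaming exec (faked here with three delayed chunks):

```python
import asyncio
from typing import AsyncIterator

async def run_streaming(code: str) -> AsyncIterator[bytes]:
    # Stand-in for the SDK: yields stdout chunks as the code produces them.
    for chunk in (b"step 1 done\n", b"step 2 done\n", b"result: 42\n"):
        await asyncio.sleep(0.3)      # the code really is slow...
        yield chunk                   # ...but output escapes as it happens

async def execute_tool(code: str, send_to_ui) -> bytes:
    """Forward every chunk to the UI immediately. Perceived latency becomes
    time-to-first-chunk (~0.3 s here), not total wall-clock (~0.9 s)."""
    buf = b""
    async for chunk in run_streaming(code):
        send_to_ui(chunk)
        buf += chunk
    return buf                        # complete output for the tool result
```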
4. Run benchmarks from your actual agent’s location
Our p99 is 236 ms from a California laptop. From an agent running in a Vercel serverless function in us-east-1, it might be 45 ms or 400 ms depending on which region the CF edge routes to. The p99 you care about is the one measured from your compute location, not the one in a vendor’s blog post.
What you should demand in benchmarks you read
If a vendor’s benchmark only shows p50, be skeptical. Ask for p95, p99, and max. Ask for the number of samples: 30+ is the floor for stable tail estimates; 5-sample numbers are noise. Ask where the bench was run from — a fiber-connected desktop looks 10× better than residential wifi, and both look very different from in-cloud.
We publish min / p50 / p90 / p95 / p99 / max / mean for all three platforms we compete against, with the script public so you can reproduce. This is the bar.
The takeaway
When you’re picking a cloud sandbox for an AI agent, weight p99 and max at least as heavily as p50. A platform with a 300 ms p50 and a 240 ms p99 is a better production bet than a platform with a 150 ms p50 and a 3000 ms p99 — even though the first platform looks slower on paper.
Most agent UX failures aren’t from the median case being bad. They’re from occasional long stalls that add up across multi-step workflows. Buy tight tails.