OpenAI’s code interpreter (via the Assistants API, or the code_interpreter tool in the newer Responses API) is the fastest way to add "the model can run Python" to an LLM application. You declare the tool, the model writes code, the platform runs it, stdout comes back. Zero infra.
It works. It also has three problems that compound as your application grows:
- The runtime is a black box. You can’t `pip install` a specific version of a library, pre-load a domain dataset into the container, or ship a custom base image. The environment is whatever OpenAI decides it is, and it changes without notice.
- Access is gated and priced opaquely. The code-interpreter capability is bundled into specific Assistants and model-tier features. You pay per model token plus a tools surcharge; cost is hard to forecast under load.
- All your data flows through OpenAI’s infrastructure. If your agent uploads a customer CSV, a trained model, or a proprietary dataset for the interpreter to analyze — that data crosses OpenAI’s perimeter. Many enterprises can’t ship that.
Self-hosting the code-execution layer with a dedicated cloud sandbox like Podflare solves all three. Here’s the recipe.
The architecture
Swap OpenAI’s code_interpreter tool for a custom run_python tool that calls your sandbox:
┌──────────┐    tool_use    ┌──────────────┐    run_code    ┌───────────────┐
│  Model   │ ─────────────▶ │ Your server  │ ─────────────▶ │   Podflare    │
│  (any)   │                │              │                │   sandbox     │
│          │ ◀───────────── │              │ ◀───────────── │ (Pod microVM) │
└──────────┘   tool_result  └──────────────┘     stdout     └───────────────┘
The left side is model-agnostic: any provider that supports tool calling works. Podflare is the execution side, which your code owns and controls.
The tool declaration
Every major LLM provider uses the same shape for tool/function declarations. Here it is for OpenAI; Anthropic, Gemini, LangChain, and others use the same structure with minor naming tweaks:
tools = [{
"type": "function",
"function": {
"name": "run_python",
"description": (
"Execute Python code in a persistent REPL. Variables, "
"imports, and file state carry across calls. Returns "
"stdout and stderr from the execution."
),
"parameters": {
"type": "object",
"properties": {
"code": {
"type": "string",
"description": "Python source code to execute.",
}
},
"required": ["code"],
},
},
}]

The execution loop
Open a sandbox at the start of the conversation; keep it alive for the duration; close it at the end. Each model turn that includes a tool_use block becomes a sandbox.run_code call.
import json

from openai import OpenAI
from podflare import Sandbox

client = OpenAI()
def chat_with_code(user_prompt: str) -> str:
    messages = [{"role": "user", "content": user_prompt}]
    with Sandbox() as sb:
        while True:
            resp = client.chat.completions.create(
                model="gpt-5",
                messages=messages,
                tools=tools,
            )
            msg = resp.choices[0].message
            messages.append(msg)
            if msg.tool_calls:
                for call in msg.tool_calls:
                    if call.function.name == "run_python":
                        args = json.loads(call.function.arguments)
                        out = sb.run_code(args["code"])
                        messages.append({
                            "role": "tool",
                            "tool_call_id": call.id,
                            "content": (out.stdout or "") + (out.stderr or "")
                                or "(no output)",
                        })
                continue
            return msg.content  # final text response

That’s the whole thing: a few dozen lines of your application code, plus Podflare handling the execution safely in a hardware-isolated microVM.
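The dispatch logic above is easy to exercise without any network dependency. Below is a minimal local stand-in for a persistent REPL (this is not the Podflare API, just an in-process sketch): `exec()` against a shared namespace, with stdout captured. It demonstrates the contract the tool description promises, that variables carry across calls.

```python
import io
from contextlib import redirect_stdout


class LocalREPL:
    """In-process stand-in for a sandbox REPL: variables and
    imports persist across run_code calls via a shared namespace."""

    def __init__(self):
        self.ns = {}

    def run_code(self, code: str) -> str:
        buf = io.StringIO()
        with redirect_stdout(buf):   # capture anything the code prints
            exec(code, self.ns)      # shared namespace = persistent state
        return buf.getvalue()


repl = LocalREPL()
repl.run_code("x = 40")              # state set in one call...
out = repl.run_code("print(x + 2)")  # ...is visible in the next
print(out.strip())                   # → 42
```

Useful for unit-testing the message-building loop before pointing it at a real sandbox.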
Why this is better than the built-in
- Pre-load anything. Want the agent to start with pandas, sklearn, and your custom datasets already loaded? Call `sb.run_code("import pandas as pd; df = pd.read_parquet(...)")` before the model’s first turn. The REPL keeps it for every subsequent `tool_use`.
- Swap the model without rewriting. Same sandbox, different model. Use Claude for reasoning, Gemini for multimodal, gpt-4o-mini for cheap turns — the execution layer doesn’t care which model called it.
- Data stays in your perimeter. The sandbox runs in a Podflare region you picked; your customer data never touches OpenAI. For HIPAA / GDPR / data-residency workloads, this is often the blocker.
- `fork(n)` is available. OpenAI doesn’t let you branch a code-interpreter session. Podflare does, in ~80 ms server-side. Tree-of-thought patterns want this.
- Persistent state across conversations. Create the sandbox with `persistent=True` and it freezes to a Podflare Space when idle. Resume tomorrow with the same Python process still holding the same DataFrame. OpenAI’s interpreter doesn’t do this.
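The fork semantics are easy to picture with a local sketch: branching conceptually clones the interpreter state, so each branch mutates its own copy while the parent stays frozen. Podflare does this server-side with microVM snapshots; here it is reduced to a namespace copy purely to illustrate the semantics.

```python
import copy

# Conceptual sketch of fork(n) semantics: each branch gets an
# independent copy of the parent interpreter's namespace.
parent = {"path": [0]}

def fork(ns, n):
    return [copy.deepcopy(ns) for _ in range(n)]

branches = fork(parent, 3)
for i, ns in enumerate(branches):
    exec(f"path.append({i + 1})", ns)   # each branch diverges independently

print([ns["path"] for ns in branches])  # → [[0, 1], [0, 2], [0, 3]]
print(parent["path"])                   # parent is untouched → [0]
```

This is exactly the shape a tree-of-thought controller wants: explore n continuations, keep the best, discard the rest.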
Performance
The head-to-head benchmark is in the cloud sandbox benchmark post. From a laptop, round-trip is p50 = 153 ms, p99 = 236 ms. From an agent running in a nearby cloud region, p50 drops to ~43 ms. Hot exec on an already-live sandbox is ~46 ms. Faster than every other self-hostable option we tested, and in our measurements comparable to or faster than OpenAI’s built-in.
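If you want to check these numbers against your own deployment, a timing harness is a few lines of stdlib. `measure()` below is a sketch that takes whatever callable you want to benchmark (e.g. `lambda: sb.run_code("1 + 1")`); the `time.sleep` stand-in is only there so the snippet runs on its own.

```python
import statistics
import time

def measure(call, n=50):
    """Time n round-trips and report p50/p99 latency in milliseconds."""
    samples = []
    for _ in range(n):
        t0 = time.perf_counter()
        call()
        samples.append((time.perf_counter() - t0) * 1000)
    qs = statistics.quantiles(samples, n=100)   # 99 percentile cut points
    return {"p50": qs[49], "p99": qs[98]}

# Stand-in workload so the sketch is self-contained; replace with
# a real sandbox call to benchmark your own region.
stats = measure(lambda: time.sleep(0.001))
print(f"p50={stats['p50']:.1f} ms  p99={stats['p99']:.1f} ms")
```

Run it from the same host your agent runs on; laptop-to-region and region-to-region numbers differ by design.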
Security
Every sandbox is a Podflare Pod microVM — hardware isolation via KVM, dedicated guest kernel, no shared filesystem with other tenants or with your host. The model can write the worst Python you’ve ever seen and the blast radius is a single disposable VM that gets destroyed on close. See Why Docker isn’t enough for the full threat model.
Ship it
pip install openai podflare
export OPENAI_API_KEY=sk-...
export PODFLARE_API_KEY=pf_live_...
Create a free Podflare account ($200 starter credit, 10 concurrent sandboxes, all 5 regions). The full working example is on GitHub at PodFlare-ai/demo.