OpenAI’s code interpreter (via the Assistants API, or the code_interpreter tool in the newer Responses API) is the fastest way to add "the model can run Python" to an LLM application. You declare the tool, the model writes code, the platform runs it, stdout comes back. Zero infra.
It works. It also has three problems that compound as your application grows:
- The runtime is a black box. You can’t `pip install` a specific version of a library, pre-load a domain dataset into the container, or ship a custom base image. The environment is whatever OpenAI decides it is, and it changes without notice.
- Access is gated and priced opaquely. The code-interpreter capability is bundled into specific Assistants and model-tier features. You pay per model token plus a tools surcharge; cost is hard to forecast under load.
- All your data flows through OpenAI’s infrastructure. If your agent uploads a customer CSV, a trained model, or a proprietary dataset for the interpreter to analyze — that data crosses OpenAI’s perimeter. Many enterprises can’t ship that.
Self-hosting the code-execution layer with a dedicated cloud sandbox like Podflare solves all three. Here’s the recipe.
The architecture
Swap OpenAI’s code_interpreter tool for a custom run_python tool that calls your sandbox:
┌──────────┐    tool_use    ┌──────────────┐    run_code    ┌───────────────┐
│  Model   │ ─────────────▶ │ Your server  │ ─────────────▶ │   Podflare    │
│  (any)   │                │              │                │   sandbox     │
│          │ ◀───────────── │              │ ◀───────────── │ (Pod microVM) │
└──────────┘   tool_result  └──────────────┘     stdout     └───────────────┘
The left side is model-agnostic: any provider that supports tool calling works. Podflare is the execution side, which your code owns and controls.
The tool declaration
Every major LLM provider uses the same shape for tool/function declarations. Here it is for OpenAI; Anthropic, Gemini, LangChain, and others use the same structure with minor naming tweaks:
tools = [{
"type": "function",
"function": {
"name": "run_python",
"description": (
"Execute Python code in a persistent REPL. Variables, "
"imports, and file state carry across calls. Returns "
"stdout and stderr from the execution."
),
"parameters": {
"type": "object",
"properties": {
"code": {
"type": "string",
"description": "Python source code to execute.",
}
},
"required": ["code"],
},
},
}]

The execution loop
Open a sandbox at the start of the conversation; keep it alive for the duration; close it at the end. Each model turn that includes a tool_use block becomes a sandbox.run_code call.
import json

from openai import OpenAI
from podflare import Sandbox

client = OpenAI()
def chat_with_code(user_prompt: str) -> str:
    messages = [{"role": "user", "content": user_prompt}]
    with Sandbox() as sb:
        while True:
            resp = client.chat.completions.create(
                model="gpt-5",
                messages=messages,
                tools=tools,
            )
            msg = resp.choices[0].message
            messages.append(msg)
            if msg.tool_calls:
                for call in msg.tool_calls:
                    if call.function.name == "run_python":
                        args = json.loads(call.function.arguments)
                        out = sb.run_code(args["code"])
                        messages.append({
                            "role": "tool",
                            "tool_call_id": call.id,
                            "content": (out.stdout or "") + (out.stderr or "")
                                or "(no output)",
                        })
                continue
            return msg.content  # final text response

That’s the whole thing: a few dozen lines of your application code, plus Podflare handling the execution safely in a hardware-isolated microVM.
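The dispatch logic above is easy to exercise without any network dependency. Below is a minimal local stand-in for a persistent REPL (this is not the Podflare API, just an in-process sketch): `exec()` against a shared namespace, with stdout captured. It demonstrates the contract the tool description promises, that variables carry across calls.

```python
import io
from contextlib import redirect_stdout


class LocalREPL:
    """In-process stand-in for a sandbox REPL: variables and
    imports persist across run_code calls via a shared namespace."""

    def __init__(self):
        self.ns = {}

    def run_code(self, code: str) -> str:
        buf = io.StringIO()
        with redirect_stdout(buf):   # capture anything the code prints
            exec(code, self.ns)      # shared namespace = persistent state
        return buf.getvalue()


repl = LocalREPL()
repl.run_code("x = 40")              # state set in one call...
out = repl.run_code("print(x + 2)")  # ...is visible in the next
print(out.strip())                   # → 42
```

Useful for unit-testing the message-building loop before pointing it at a real sandbox.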
Why this is better than the built-in
- Pre-load anything. Want the agent to start with pandas, sklearn, and your custom datasets already loaded? Call `sb.run_code("import pandas as pd; df = pd.read_parquet(...)")` before the model’s first turn. The REPL keeps it for every subsequent `tool_use`.
- Swap the model without rewriting. Same sandbox, different model. Use Claude for reasoning, Gemini for multimodal, gpt-4o-mini for cheap turns — the execution layer doesn’t care which model called it.
- Data stays in your perimeter. The sandbox runs in a Podflare region you picked; your customer data never touches OpenAI. For HIPAA / GDPR / data-residency workloads, this is often the blocker.
- `fork(n)` is available. OpenAI doesn’t let you branch a code-interpreter session. Podflare does, in ~80 ms server-side. Tree-of-thought patterns want this.
- Persistent state across conversations. Create the sandbox with `persistent=True` and it freezes to a Podflare Space when idle. Resume tomorrow with the same Python process still holding the same DataFrame. OpenAI’s interpreter doesn’t do this.
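The fork semantics are easy to picture with a local sketch: branching conceptually clones the interpreter state, so each branch mutates its own copy while the parent stays frozen. Podflare does this server-side with microVM snapshots; here it is reduced to a namespace copy purely to illustrate the semantics.

```python
import copy

# Conceptual sketch of fork(n) semantics: each branch gets an
# independent copy of the parent interpreter's namespace.
parent = {"path": [0]}

def fork(ns, n):
    return [copy.deepcopy(ns) for _ in range(n)]

branches = fork(parent, 3)
for i, ns in enumerate(branches):
    exec(f"path.append({i + 1})", ns)   # each branch diverges independently

print([ns["path"] for ns in branches])  # → [[0, 1], [0, 2], [0, 3]]
print(parent["path"])                   # parent is untouched → [0]
```

This is exactly the shape a tree-of-thought controller wants: explore n continuations, keep the best, discard the rest.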
Performance
The head-to-head benchmark is in the cloud sandbox benchmark post. From a laptop, round-trip is p50 = 153 ms, p99 = 236 ms. From an agent running in a nearby cloud region, p50 drops to ~43 ms. Hot exec on an already-live sandbox is ~46 ms. Faster than every other self-hostable option we tested, and in our measurements comparable to or faster than OpenAI’s built-in.
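If you want to check these numbers against your own deployment, a timing harness is a few lines of stdlib. `measure()` below is a sketch that takes whatever callable you want to benchmark (e.g. `lambda: sb.run_code("1 + 1")`); the `time.sleep` stand-in is only there so the snippet runs on its own.

```python
import statistics
import time

def measure(call, n=50):
    """Time n round-trips and report p50/p99 latency in milliseconds."""
    samples = []
    for _ in range(n):
        t0 = time.perf_counter()
        call()
        samples.append((time.perf_counter() - t0) * 1000)
    qs = statistics.quantiles(samples, n=100)   # 99 percentile cut points
    return {"p50": qs[49], "p99": qs[98]}

# Stand-in workload so the sketch is self-contained; replace with
# a real sandbox call to benchmark your own region.
stats = measure(lambda: time.sleep(0.001))
print(f"p50={stats['p50']:.1f} ms  p99={stats['p99']:.1f} ms")
```

Run it from the same host your agent runs on; laptop-to-region and region-to-region numbers differ by design.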
Security
Every sandbox is a Podflare Pod microVM — hardware isolation via KVM, dedicated guest kernel, no shared filesystem with other tenants or with your host. The model can write the worst Python you’ve ever seen and the blast radius is a single disposable VM that gets destroyed on close. See Why Docker isn’t enough for the full threat model.
Ship it
pip install openai podflare
export OPENAI_API_KEY=sk-...
export PODFLARE_API_KEY=pf_live_...
Create a free Podflare account ($200 starter credit, 10 concurrent sandboxes, all 5 regions). The full working example is on GitHub at PodFlare-ai/demo.