Anthropic's Messages API gives Claude a clean tool-use protocol: you declare tools, Claude emits tool_use blocks when it wants to call one, and you return tool_result blocks with the output. It's the right shape for everything except running code.
For real code execution you want three properties that eval() or a subprocess can't give you:
- Hardware isolation so Claude's generated code can't touch your process or host.
- Persistent state across turns so when Claude imports pandas on turn 3, it's still imported on turn 7.
- A fast round-trip so each tool call feels interactive.
This post walks through wiring Anthropic's tool_use up to a Podflare cloud sandbox: a hardware-isolated Podflare Pod microVM with a persistent Python REPL and a round-trip under 200 ms. The full, runnable example is on GitHub at PodFlare-ai/demo.
The shape of the integration
Both Anthropic and Podflare expose simple, well-typed APIs that compose with each other cleanly. The flow is:
- You open a `Sandbox` at the start of the conversation.
- On every Claude turn you include a tool definition for `run_python` that calls `sandbox.run_code()`.
- You loop: model turn → maybe a `tool_use` block → execute → return `tool_result` → repeat, until the model returns a final text message.
- You `close()` the sandbox at the end, or keep it open as a persistent Space to resume later.
The full example
Install the two SDKs:
pip install anthropic podflare
Set your API keys:
export ANTHROPIC_API_KEY=sk-ant-...
export PODFLARE_API_KEY=pf_live_...
And here's the full loop:
import os
from anthropic import Anthropic
from podflare import Sandbox
client = Anthropic()
# Define the code-execution tool Claude can call.
TOOLS = [
{
"name": "run_python",
"description": (
"Execute Python code in a persistent REPL. Variables, "
"imports, and state carry across calls. Returns stdout "
"and stderr from the execution."
),
"input_schema": {
"type": "object",
"properties": {
"code": {
"type": "string",
"description": "Python source to execute.",
}
},
"required": ["code"],
},
},
]
def run_conversation(user_prompt: str) -> str:
"""Run a tool-using conversation with Claude until it returns text."""
messages = [{"role": "user", "content": user_prompt}]
with Sandbox() as sb:
while True:
resp = client.messages.create(
model="claude-sonnet-4-5",
max_tokens=4096,
tools=TOOLS,
messages=messages,
)
# Append the assistant's full turn to history.
messages.append({"role": "assistant", "content": resp.content})
if resp.stop_reason == "end_turn":
# Claude is done — pull out the final text.
return "".join(
b.text for b in resp.content if b.type == "text"
)
if resp.stop_reason == "tool_use":
# Execute every tool_use block in the turn.
tool_results = []
for block in resp.content:
if block.type != "tool_use":
continue
                    if block.name == "run_python":
                        code = block.input["code"]
                        result = sb.run_code(code)
                        out = (result.stdout or "") + (result.stderr or "")
                    else:
                        # Every tool_use block needs a matching tool_result,
                        # even for a tool name we don't recognize.
                        out = f"unknown tool: {block.name}"
                    tool_results.append({
                        "type": "tool_result",
                        "tool_use_id": block.id,
                        "content": out or "(no output)",
                    })
messages.append({"role": "user", "content": tool_results})
continue
raise RuntimeError(f"unexpected stop_reason: {resp.stop_reason}")
if __name__ == "__main__":
answer = run_conversation(
"Fetch the last 10 days of Bitcoin price from "
"api.coingecko.com, compute the daily return, "
"and tell me the standard deviation."
)
    print(answer)

What's happening on each turn
Trace the flow for the Bitcoin example:
- Turn 1 (Claude): a `tool_use` block with `run_python(code="import requests; ... fetch JSON")`. Your loop runs that in the sandbox, gets back the raw JSON as stdout, and returns it as a `tool_result`.
- Turn 2 (Claude): another `tool_use`, this time parsing the JSON and computing daily returns. It can assume `requests` is already imported because the sandbox REPL kept it.
- Turn 3 (Claude): a `tool_use` for the stdev computation using numpy. Claude writes `import numpy as np` inline; the sandbox installs it with pip if needed (or it was already present).
- Turn 4 (Claude): `end_turn` — Claude writes a natural-language summary of the result based on the stdev it saw in turn 3.
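Concretely, the message history the loop accumulates alternates assistant turns carrying `tool_use` blocks with user turns carrying the matching `tool_result` blocks. A minimal sketch of that shape, with placeholder ids and elided code strings rather than real API output:

```python
# Sketch of the accumulated message history for the Bitcoin example.
# Ids, code strings, and outputs are placeholders, not real API data.
history = [
    {"role": "user", "content": "Fetch the last 10 days of Bitcoin price ..."},
    {"role": "assistant", "content": [
        {"type": "tool_use", "id": "toolu_01", "name": "run_python",
         "input": {"code": "import requests\n..."}},
    ]},
    {"role": "user", "content": [
        {"type": "tool_result", "tool_use_id": "toolu_01",
         "content": '{"prices": [...]}'},
    ]},
    {"role": "assistant", "content": "The standard deviation of daily returns is ..."},
]

# Invariant the loop maintains: every tool_use id in an assistant turn is
# answered by a tool_result with the same id in the next user turn.
for i, msg in enumerate(history):
    if msg["role"] == "assistant" and isinstance(msg["content"], list):
        uses = {b["id"] for b in msg["content"] if b["type"] == "tool_use"}
        results = {b["tool_use_id"] for b in history[i + 1]["content"]}
        assert uses == results
```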
The cost is 4 round-trips to the Anthropic API plus 3 round-trips to the sandbox. Each sandbox call is ~46 ms from an in-cloud agent, ~190 ms from a laptop. The whole interaction completes in a couple of seconds including Anthropic's model-side latency.
Persistent state is what makes this cheap
The big win over "spin up a container per tool call" is that the Python REPL stays alive between `run_code` calls. When Claude imports pandas on turn 1 and calls `pd.read_csv(...)` on turn 2, the import isn't re-evaluated; `globals()["pd"]` still points at the pandas module. Same for any heavy `df` already in memory.
This is the feature that makes Claude agents that do multi-turn data exploration actually affordable. Every container-per-call platform forces the agent to re-parse, re-import, re-load on every turn, and the cost compounds fast.
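The mechanics are easy to see with a local stand-in: a persistent REPL is, at its core, `exec` run against one long-lived namespace. This is a sketch of the concept, not Podflare's implementation:

```python
# Local stand-in for a persistent REPL: one namespace shared across calls.
ns: dict = {}

def run_code(code: str) -> None:
    exec(code, ns)  # same globals dict every time, so state accumulates

run_code("import math")           # "turn 1": the import happens once
run_code("x = math.sqrt(2)")      # "turn 2": math is still bound
run_code("y = x * x")             # "turn 3": x survived too

print(round(ns["y"], 6))  # → 2.0
```

A container-per-call design is the opposite: a fresh empty namespace on every call, so every turn pays the import and load cost again.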
Branching with fork() for tree-of-thought
For patterns where you want Claude to try multiple solutions in parallel and keep the best one, you can fork() the sandbox mid-conversation. Each child inherits the parent's full state — all imports, all variables:
with Sandbox() as parent:
parent.run_code("import pandas as pd")
parent.run_code("df = pd.read_csv('/data/big.csv')") # expensive
# Spawn 5 children, each with df already loaded
children = parent.fork(n=5)
for child, strategy in zip(children, strategies):
child.run_code(strategy.code)
    # ...pick the best, merge it back into parent, destroy the losers

Fork takes about 80 ms server-side for n=5, all parallel. No other cloud sandbox platform exposes this primitive today.
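The semantics are the same as snapshotting a namespace: each child starts from an independent copy of the parent's state and then diverges. A local analogue with `copy.deepcopy`, which mimics the behavior (though not the copy-on-write performance):

```python
import copy

# Local sketch of fork semantics: each child gets an independent snapshot
# of the parent's namespace, then mutates it without affecting anyone else.
parent_ns = {"df_rows": [1, 2, 3]}  # stand-in for the expensive df

children = [copy.deepcopy(parent_ns) for _ in range(3)]
children[0]["df_rows"].append(4)    # child 0 diverges

assert parent_ns["df_rows"] == [1, 2, 3]    # parent untouched
assert children[1]["df_rows"] == [1, 2, 3]  # siblings isolated
```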
What about streaming?
If you want to stream Claude's tool calls as they emerge — useful for a chat UI where the user sees the model "thinking" — use the client.messages.stream(...) variant. The tool-use loop structure stays the same; you just assemble the content blocks from the stream rather than reading them off resp.content. The Podflare call itself is already streaming: run_code returns stdout/stderr as NDJSON over the wire, and you can hook into it with sb.run_code(code, on_stdout=lambda chunk: ...).
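If you want to handle that stream yourself rather than use the callback, the consumer side is just line-at-a-time JSON decoding. A sketch under assumed field names (`stream` and `data` are illustrative here, not Podflare's documented wire format):

```python
import json
from typing import Callable, Iterable

# Illustrative NDJSON consumer: one JSON object per line, routed to the
# right callback. Field names are assumptions for this sketch.
def consume_ndjson(lines: Iterable[str],
                   on_stdout: Callable[[str], None],
                   on_stderr: Callable[[str], None]) -> None:
    for line in lines:
        event = json.loads(line)
        if event["stream"] == "stdout":
            on_stdout(event["data"])
        else:
            on_stderr(event["data"])

# Simulated wire traffic for the example.
wire = ['{"stream": "stdout", "data": "hello\\n"}',
        '{"stream": "stderr", "data": "warn\\n"}']
out, err = [], []
consume_ndjson(wire, out.append, err.append)
assert out == ["hello\n"] and err == ["warn\n"]
```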
Security defaults you probably want
If the user prompt is untrusted — say the agent is serving a consumer product — tighten the sandbox at create time:
with Sandbox(
egress=False, # no outbound network
max_lifetime_seconds=300,
idle_timeout_seconds=60,
) as sb:
    # ...

`egress=False` detaches the guest's tap device from the host bridge; the guest still sees eth0, but every outbound packet dies at the host. That's usually too restrictive for agent workloads (no pip install), but it's the right default when you're running known-adversarial code. Domain-allowlist egress is on the Enterprise roadmap.
Related posts and references
- Cloud sandbox benchmark for AI agents: E2B vs Daytona vs Podflare — latency distributions across all three platforms.
- Why Docker isn't enough for LLM-generated code — the security argument for microVMs when the code writer is a language model.
- Full working example on GitHub — the code in this post, ready to run.
- Anthropic tool-use API reference
Ready to try it? Create a free account, grab an API key, and the example above will be running in under a minute.