GitHub - computesdk/benchmarks: Compare startup time-to-interactive for top sandbox providers.

TTI (Time to Interactive) = API call to first command execution. Lower is better.

What We Measure

Daily: Time to Interactive (TTI)

API Request → Provisioning → Boot → Ready → First Command
└───────────────────── TTI ─────────────────────┘

Each benchmark creates a fresh sandbox, runs echo "benchmark", and records wall-clock time. 100 iterations per provider, every day, fully automated.
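The measurement above can be sketched roughly as follows. This is a minimal illustration, not the benchmark's actual code: the `Sandbox` and `SandboxProvider` interfaces are hypothetical stand-ins rather than the ComputeSDK API, and treating a non-zero exit code as a failed iteration is an assumption.

```typescript
// Hypothetical minimal interfaces for illustration; NOT the real ComputeSDK API.
interface Sandbox {
  runCommand(command: string): Promise<{ exitCode: number }>;
  destroy(): Promise<void>;
}

interface SandboxProvider {
  create(): Promise<Sandbox>;
}

interface TtiResult {
  ok: boolean;
  ttiMs: number | null; // null when the iteration failed
}

// Measures wall-clock time from the create request to the first command finishing.
async function measureTti(provider: SandboxProvider): Promise<TtiResult> {
  const start = Date.now();
  let sandbox: Sandbox | undefined;
  try {
    sandbox = await provider.create();
    const result = await sandbox.runCommand('echo "benchmark"');
    const ttiMs = Date.now() - start;
    // Assumption: a non-zero exit code counts as a failed iteration.
    return result.exitCode === 0 ? { ok: true, ttiMs } : { ok: false, ttiMs: null };
  } catch {
    return { ok: false, ttiMs: null };
  } finally {
    // Cleanup happens after the timing is captured, so it does not inflate TTI.
    if (sandbox) await sandbox.destroy().catch(() => undefined);
  }
}
```

Stopping the clock before `destroy()` keeps teardown cost out of the reported number, which matches the "API call to first command execution" definition.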

Powered by ComputeSDK — We use ComputeSDK, a multi-provider SDK, to test all sandbox providers with the same code. One API, multiple providers, fair comparison. Interested in multi-provider failover, sandbox packing, and warm pooling? Check out ComputeSDK.

Sponsor-only tests coming soon: Stress tests, warm starts, multi-region, and more. See roadmap →

Methodology

We run three test modes daily:

Sequential — Sandboxes are created one at a time. Each is created, tested, and destroyed before the next begins. 100 iterations per provider. This is the baseline — isolated cold-start performance with no contention.

Staggered — 100 sandboxes are launched per provider with a 200ms delay between each, gradually ramping up concurrent load. This reveals how TTI degrades under increasing pressure, including queue depth effects and rate-limiting behavior.

Burst — 100 sandboxes are created simultaneously with no delay between launches. This tests how providers handle sudden spikes: provisioning queue depth, rate limiting, and failure rates under peak demand.
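As a rough sketch, the three modes differ only in how launches are scheduled. The `runMode` helper and its parameter names below are hypothetical, and `task` is assumed to be any function that creates, measures, and destroys one sandbox.

```typescript
type Mode = "sequential" | "staggered" | "burst";

const sleep = (ms: number): Promise<void> =>
  new Promise<void>((resolve) => setTimeout(resolve, ms));

// Runs `task` `iterations` times using the launch schedule of the given mode.
async function runMode<T>(
  mode: Mode,
  task: () => Promise<T>,
  iterations = 100,
  staggerMs = 200,
): Promise<T[]> {
  if (mode === "sequential") {
    // One at a time: each task finishes before the next one starts.
    const results: T[] = [];
    for (let i = 0; i < iterations; i++) {
      results.push(await task());
    }
    return results;
  }
  // Staggered: launch i starts after i * staggerMs. Burst: all launches start together.
  const delayMs = mode === "staggered" ? staggerMs : 0;
  const launches = Array.from({ length: iterations }, (_, i) =>
    sleep(i * delayMs).then(task),
  );
  return Promise.all(launches);
}
```

In the staggered and burst modes, tasks overlap, so the measured TTI includes any queuing or rate limiting the provider applies under concurrent load.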

For each provider we report min, max, median, P95, P99, and average TTI, plus a composite score (0–100) that combines weighted timing metrics with success rate. Providers must be both fast and reliable to score well.
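The reported statistics could be computed along these lines. The README does not state which percentile method is used, so the nearest-rank method here is an assumption, as are the `summarize` and `TtiSummary` names.

```typescript
interface TtiSummary {
  min: number;
  max: number;
  median: number;
  p95: number;
  p99: number;
  average: number;
  successRate: number; // fraction of attempts that succeeded, 0 to 1
}

// Nearest-rank percentile (an assumed method): the smallest sample such that
// at least p percent of all samples are less than or equal to it.
function percentile(sorted: number[], p: number): number {
  const rank = Math.ceil((p * sorted.length) / 100);
  return sorted[Math.max(rank, 1) - 1];
}

// Summarizes the TTIs (in ms) of successful iterations out of `attempted` runs.
function summarize(ttisMs: number[], attempted: number): TtiSummary {
  if (ttisMs.length === 0) throw new Error("no successful iterations to summarize");
  const sorted = [...ttisMs].sort((a, b) => a - b);
  const mid = Math.floor(sorted.length / 2);
  const median =
    sorted.length % 2 === 0 ? (sorted[mid - 1] + sorted[mid]) / 2 : sorted[mid];
  const sum = sorted.reduce((acc, v) => acc + v, 0);
  return {
    min: sorted[0],
    max: sorted[sorted.length - 1],
    median,
    p95: percentile(sorted, 95),
    p99: percentile(sorted, 99),
    average: sum / sorted.length,
    successRate: ttisMs.length / attempted,
  };
}
```

With 100 iterations per provider, nearest-rank P99 is the second-slowest successful sample, so a single outlier shows up in max but not in P99.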

Composite Score

Each timing metric is scored against a fixed 10-second ceiling: score = 100 × (1 − value / 10,000ms). A 200ms median scores 98; anything ≥10s scores 0. These individual scores are combined with weighted emphasis on median (50%), P95 (20%), max (15%), P99 (10%), and min (5%), then multiplied by the provider's success rate (0–1). A provider with 90% success has its score reduced by 10% — reliability is non-negotiable.
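The scoring rule described above translates directly into a few lines of code. The ceiling, weights, and success-rate multiplier come from the description; the function and type names are illustrative only and may not match the repository's implementation.

```typescript
const CEILING_MS = 10000; // fixed 10-second ceiling

// Weights from the description; they sum to 1.
const WEIGHTS = { median: 0.5, p95: 0.2, max: 0.15, p99: 0.1, min: 0.05 } as const;

type TimingMetric = keyof typeof WEIGHTS;
type Timings = Record<TimingMetric, number>; // all values in milliseconds

// Scores one timing metric: 100 at 0 ms, falling linearly to 0 at the ceiling and beyond.
function metricScore(valueMs: number): number {
  return Math.max(0, 100 * (1 - valueMs / CEILING_MS));
}

// Weighted sum of the metric scores, multiplied by the success rate (0 to 1).
function compositeScore(timings: Timings, successRate: number): number {
  let weighted = 0;
  for (const metric of Object.keys(WEIGHTS) as TimingMetric[]) {
    weighted += WEIGHTS[metric] * metricScore(timings[metric]);
  }
  return weighted * successRate;
}
```

For example, a provider whose five timing metrics are all 200ms has a weighted timing score of about 98; at a 90% success rate its composite score drops to about 88.2.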

All tests run on GitHub Actions at 00:00 UTC daily. Providers are tested using ComputeSDK — no gateway or proxy layer.

Full methodology →

Transparency

📖 Open source — All benchmark code is public
📊 Raw data — Every result committed to repo
🔁 Reproducible — Anyone can run the same tests
⚙️ Automated — Daily at 5pm Pacific (00:00 UTC) via GitHub Actions on Namespace runners
🛡️ Independent — Sponsors cannot influence results

Roadmap

MIT License

