
fix(api): prevent thundering herd when scheduling sandboxes to orchestrator nodes#2201

Open
morningtzh wants to merge 2 commits into e2b-dev:main from morningtzh:optimize-placement

Conversation


@morningtzh morningtzh commented Mar 23, 2026

📖 Background & Problem Statement

Currently, the api module uses a "Best of K" algorithm internally to schedule and place user Sandbox creation requests onto the underlying Orchestrator nodes.

During high-concurrency stress tests (a massive influx of requests in a very short window), we observed severe load skew across the Orchestrator nodes.

Root Cause (Stale Metrics & Thundering Herd):
The scheduling Score within the api module relies heavily on periodic node.Metrics() reports from the Orchestrator nodes. Under high concurrency, node state reporting naturally lags behind due to network and heartbeat intervals, so the api module makes scheduling decisions from a stale, seemingly idle snapshot. Orchestrator nodes with initially low scores are selected again and again, while local in-progress allocations are ignored entirely. This creates a thundering-herd effect in which traffic piles onto a handful of nodes.

📉 Real-world Cluster Validation

To validate the accuracy of our newly introduced LaggyNode local benchmark, we conducted a real-world stress test with 500 concurrent requests on the actual cluster.

[chart: per-node sandbox placement counts during the 500-concurrent-request stress test]

As the runtime chart illustrates: when high concurrency hits, the placement distribution exhibits extreme, sudden spikes. Because of the real physical delay in metric reporting, a few specific nodes (the lines shooting up past 100) become "victims" of the scheduler and are instantly flooded with Sandboxes, while the vast majority of nodes in the cluster remain completely idle (the flat lines near 0 at the bottom).

This real-world pattern closely matches the extreme imbalance observed in our local benchmark, confirming that the refactored LaggyNode benchmark accurately reproduces the real-world scheduling pain point.

🛠️ Proposed Changes

This PR focuses on the placement and nodemanager packages within the api module. It resolves the load imbalance by introducing state prediction and compensating for the metric reporting delay.

  1. Scoring Phase: Introduce Local Shadow State
    • Modified packages/api/internal/orchestrator/placement/placement_best_of_K.go.
    • The Score function now actively incorporates pending CPUs from node.PlacementMetrics.InProgress() into the metrics.CpuAllocated calculation. This shifts the api module's scheduling logic from passive reporting to active expected load prediction.
  2. Execution Phase: Optimistic Updates
    • Addressed the "invisible window" between a successful RPC call and the next heartbeat sync.
    • Added the OptimisticAdd method to packages/api/internal/orchestrator/nodemanager/node.go. In PlaceSandbox, once a Sandbox is successfully created, the resources are immediately and optimistically deducted locally.
  3. Benchmark Refactoring: Custom Node Simulation & Visualization
    • Heavily refactored placement_benchmark_test.go to abstract the hardcoded node logic into a NodeSimulator interface, introducing LaggyNode to simulate "stale metrics" scenarios.
    • Added the BenchmarkPlacementDistribution suite, which outputs an ASCII histogram to visually demonstrate the Sandbox distribution and Coefficient of Variation (CV).
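Taken together, the scoring and execution changes can be sketched as follows. This is a minimal, self-contained illustration, not the actual e2b-dev code: the `node` struct, `score`, `optimisticAdd`, and `bestOfK` below are simplified stand-ins for the real `placement_best_of_K.go` and nodemanager APIs.

```go
package main

import (
	"fmt"
	"sync/atomic"
)

// node is a hypothetical stand-in for nodemanager.Node; the field and
// method names are illustrative, not the actual e2b-dev API.
type node struct {
	cpuAllocated  atomic.Int64 // last value reported by the heartbeat
	cpuInProgress atomic.Int64 // CPUs of placements started but not yet reported
	cpuTotal      int64
}

// score returns predicted CPU utilization: the reported allocation plus
// the local "shadow state" of in-progress placements. Lower is better.
func (n *node) score() float64 {
	predicted := n.cpuAllocated.Load() + n.cpuInProgress.Load()
	return float64(predicted) / float64(n.cpuTotal)
}

// optimisticAdd reserves resources locally as soon as a placement
// succeeds, bridging the window until the next metrics heartbeat.
func (n *node) optimisticAdd(cpus int64) {
	n.cpuInProgress.Add(cpus)
}

// bestOfK picks the lowest-score node among the first k candidates.
func bestOfK(nodes []*node, k int) *node {
	best := nodes[0]
	for _, n := range nodes[1:k] {
		if n.score() < best.score() {
			best = n
		}
	}
	return best
}

func main() {
	nodes := []*node{{cpuTotal: 64}, {cpuTotal: 64}, {cpuTotal: 64}}
	for i := 0; i < 6; i++ {
		n := bestOfK(nodes, len(nodes))
		n.optimisticAdd(2) // 2-CPU sandbox, counted immediately
	}
	// Because each placement is visible to the very next scoring pass,
	// the six sandboxes spread evenly: every node ends at 4 predicted CPUs.
	for i, n := range nodes {
		fmt.Printf("node %d predicted CPUs: %d\n", i, n.cpuAllocated.Load()+n.cpuInProgress.Load())
	}
}
```

Without the `cpuInProgress` term, all six placements in this loop would land on node 0, which is exactly the stale-metrics pile-up described above.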

📊 Benchmark Results

Tested under extreme conditions (10-node cluster, 500 burst traffic with simulated metric reporting delays):


Before Fix (Relying purely on stale metrics with BestOfK_K3):
Because the api relied on outdated snapshots, the cluster experienced severe load skew. The busiest node was assigned 120 Sandboxes, while the idlest nodes received only 2.

  • Imbalance(CV): 0.941 (Severe load imbalance)
[histogram: before fix, CV 0.941]

After Fix (BestOfK_K3 + InProgress Shadow State + Optimistic Update):
With local shadow state and optimistic updates, the api module accurately predicts expected load, and traffic is distributed smoothly and evenly across all available nodes.

  • Imbalance(CV): 0.021 (Near-perfect distribution)
[histogram: after fix, CV 0.021]
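The Coefficient of Variation reported by the benchmark is the standard deviation of per-node sandbox counts divided by their mean: 0 means a perfectly even spread, values near 1 mean severe skew. A small self-contained sketch (the sample counts below are illustrative, not the actual benchmark data):

```go
package main

import (
	"fmt"
	"math"
)

// coefficientOfVariation returns stddev/mean of per-node sandbox counts.
func coefficientOfVariation(counts []int) float64 {
	mean := 0.0
	for _, c := range counts {
		mean += float64(c)
	}
	mean /= float64(len(counts))

	variance := 0.0
	for _, c := range counts {
		d := float64(c) - mean
		variance += d * d
	}
	variance /= float64(len(counts))

	return math.Sqrt(variance) / mean
}

func main() {
	// Illustrative distributions for a 10-node cluster, 500 sandboxes.
	skewed := []int{120, 98, 75, 60, 52, 40, 30, 15, 8, 2}
	balanced := []int{51, 50, 49, 50, 50, 51, 49, 50, 50, 50}
	fmt.Printf("CV skewed:   %.3f\n", coefficientOfVariation(skewed))
	fmt.Printf("CV balanced: %.3f\n", coefficientOfVariation(balanced))
}
```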

📋 Checklist

  • Updated the scoring calculation logic for the BestOfK algorithm (Shadow State).
  • Implemented OptimisticAdd and integrated it into the placement flow.
  • Refactored the benchmark suite to support LaggyNode simulation.
  • Executed go test -v -bench=BenchmarkPlacementDistribution locally with passing, expected results.

@morningtzh morningtzh marked this pull request as ready for review March 23, 2026 09:33
Copy link

@chatgpt-codex-connector chatgpt-codex-connector bot left a comment


💡 Codex Review

Here are some automated review suggestions for this pull request.

Reviewed commit: e5bea0fd95


@morningtzh morningtzh force-pushed the optimize-placement branch 2 times, most recently from f8e7061 to 25ac879 on March 24, 2026 07:49
…unting and shadow state

High concurrency placement requests were causing severe "thundering herd" issues due to stale node metrics. The orchestrator would continuously schedule multiple sandboxes on the same seemingly "empty" node before it could report updated resource usage (the "invisibility gap").

This commit introduces active load prediction and optimistic resource reservation to ensure perfectly balanced placement even during metric reporting intervals.

Changes:
- fix(placement): factor `InProgress` pending resources into the `BestOfK` scoring calculation to predict expected load.
- fix(nodemanager): implement `OptimisticAdd` to immediately reserve resources upon successful placement, bridging the gap before async metric updates arrive.
- test(placement): refactor `SimulatedNode` into a `NodeSimulator` interface to support diverse node behavior simulations.
- test(placement): introduce `LaggyNode` to simulate real-world scenarios with stale/delayed node metrics.
- test(placement): add `BenchmarkPlacementDistribution` to visualize load distribution and verify the elimination of the thundering herd effect under high concurrency.

Signed-off-by: MorningTZH <morningtzh@yeah.net>
Member

@jakubno jakubno left a comment


Hey @morningtzh, thanks for submitting this. I really like the direction. There are some issues this introduces, but they seem smaller than the delay problem; still, there may be room for improvement.

Issues I currently see

  1. I don't see anything like OptimisticRemove. If some sandboxes are short-lived, won't this result in heavy skew?
  2. OptimisticAdd introduces a race condition between loading metrics and adding to them optimistically, so some sandboxes can temporarily be double-counted (sandbox is created, metrics sync, the api processes the response and adds to the metrics).

Could you please address the first, and for the second check whether a different order of operations helps?
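One possible shape for addressing both concerns is sketched below. This is purely hypothetical: `OptimisticRemove` and `OnHeartbeat` are not part of this PR, and resetting the pending delta on every heartbeat is a simplification (a placement made after the heartbeat snapshot was taken but before it is processed would be dropped; a sequence number or timestamp on reservations would be needed to close that gap).

```go
package main

import (
	"fmt"
	"sync"
)

// shadow pairs reported metrics with optimistic reservations. All names
// here are hypothetical illustrations, not the actual e2b-dev API.
type shadow struct {
	mu          sync.Mutex
	reportedCPU int64 // last heartbeat value
	pendingCPU  int64 // reservations not yet reflected in a heartbeat
}

func (s *shadow) OptimisticAdd(cpus int64) {
	s.mu.Lock()
	defer s.mu.Unlock()
	s.pendingCPU += cpus
}

// OptimisticRemove releases a reservation when a sandbox exits before the
// next heartbeat, so short-lived sandboxes don't skew future scores.
func (s *shadow) OptimisticRemove(cpus int64) {
	s.mu.Lock()
	defer s.mu.Unlock()
	if s.pendingCPU >= cpus {
		s.pendingCPU -= cpus
	}
}

// OnHeartbeat replaces the reported value and clears the pending delta:
// the heartbeat already includes everything placed before it was taken,
// which bounds double counting to one sync interval.
func (s *shadow) OnHeartbeat(reportedCPU int64) {
	s.mu.Lock()
	defer s.mu.Unlock()
	s.reportedCPU = reportedCPU
	s.pendingCPU = 0
}

func (s *shadow) PredictedCPU() int64 {
	s.mu.Lock()
	defer s.mu.Unlock()
	return s.reportedCPU + s.pendingCPU
}

func main() {
	var s shadow
	s.OptimisticAdd(2)            // sandbox placed, reserved locally
	fmt.Println(s.PredictedCPU()) // prediction includes the reservation
	s.OnHeartbeat(2)              // heartbeat now reports that sandbox
	fmt.Println(s.PredictedCPU()) // still 2, not 4: no double counting
}
```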
