
fix(api): prevent thundering herd when scheduling sandboxes to orchestrator nodes#2201

Open
morningtzh wants to merge 2 commits into e2b-dev:main from morningtzh:optimize-placement

Conversation


@morningtzh morningtzh commented Mar 23, 2026

📖 Background & Problem Statement

Currently, the api module uses a "Best of K" algorithm internally to schedule and place user Sandbox creation requests onto the underlying Orchestrator nodes.

During high-concurrency stress tests (a massive influx of requests in a very short window), we observed severe load skew across the Orchestrator nodes.

Root Cause (Stale Metrics & Thundering Herd):
The scheduling Score within the api module relies heavily on periodic node.Metrics() reports from the Orchestrator nodes. Under high concurrency, node state reporting naturally lags behind due to network and heartbeat intervals, so the api module makes scheduling decisions from a stale, seemingly idle snapshot. Orchestrator nodes with initially low scores are selected again and again, while local in-progress allocations are ignored entirely. This creates a thundering-herd effect in which traffic piles onto a handful of nodes.

📉 Real-world Cluster Validation

To validate the accuracy of our newly introduced LaggyNode local benchmark, we conducted a real-world stress test with 500 concurrent requests on the actual cluster.

[chart: per-node sandbox placement counts during the 500-concurrent-request stress test]

As the runtime chart illustrates: when high concurrency hits, the placement distribution exhibits extreme, sudden spikes. Because of the real physical delay in metric reporting, a few specific nodes (the lines shooting up past 100) become "victims" of the scheduler and are instantly flooded with Sandboxes, while the vast majority of nodes in the cluster remain completely idle (the flat lines near 0 at the bottom).

This real-world pattern closely matches the extreme imbalance observed in our local benchmark, confirming that the refactored LaggyNode benchmark accurately reproduces the real-world scheduling pain point.

🛠️ Proposed Changes

This PR focuses on the placement and nodemanager packages within the api module. It resolves the load imbalance by introducing state prediction and compensating for the metric reporting delay.

  1. Scoring Phase: Introduce Local Shadow State
    • Modified packages/api/internal/orchestrator/placement/placement_best_of_K.go.
    • The Score function now actively incorporates pending CPUs from node.PlacementMetrics.InProgress() into the metrics.CpuAllocated calculation. This shifts the api module's scheduling logic from passive reporting to active expected load prediction.
  2. Execution Phase: Optimistic Updates
    • Addressed the "invisible window" between a successful RPC call and the next heartbeat sync.
    • Added the OptimisticAdd method to packages/api/internal/orchestrator/nodemanager/node.go. In PlaceSandbox, once a Sandbox is successfully created, the resources are immediately and optimistically deducted locally.
  3. Benchmark Refactoring: Custom Node Simulation & Visualization
    • Heavily refactored placement_benchmark_test.go to abstract the hardcoded node logic into a NodeSimulator interface, introducing LaggyNode to simulate "stale metrics" scenarios.
    • Added the BenchmarkPlacementDistribution suite, which outputs an ASCII histogram to visually demonstrate the Sandbox distribution and Coefficient of Variation (CV).
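Taken together, the scoring and execution changes can be sketched as follows. This is a minimal, self-contained illustration, not the actual e2b-dev code: the `node` struct, `score`, `optimisticAdd`, and `bestOfK` below are simplified stand-ins for the real `placement_best_of_K.go` and nodemanager APIs.

```go
package main

import (
	"fmt"
	"sync/atomic"
)

// node is a hypothetical stand-in for nodemanager.Node; the field and
// method names are illustrative, not the actual e2b-dev API.
type node struct {
	cpuAllocated  atomic.Int64 // last value reported by the heartbeat
	cpuInProgress atomic.Int64 // CPUs of placements started but not yet reported
	cpuTotal      int64
}

// score returns predicted CPU utilization: the reported allocation plus
// the local "shadow state" of in-progress placements. Lower is better.
func (n *node) score() float64 {
	predicted := n.cpuAllocated.Load() + n.cpuInProgress.Load()
	return float64(predicted) / float64(n.cpuTotal)
}

// optimisticAdd reserves resources locally as soon as a placement
// succeeds, bridging the window until the next metrics heartbeat.
func (n *node) optimisticAdd(cpus int64) {
	n.cpuInProgress.Add(cpus)
}

// bestOfK picks the lowest-score node among the first k candidates.
func bestOfK(nodes []*node, k int) *node {
	best := nodes[0]
	for _, n := range nodes[1:k] {
		if n.score() < best.score() {
			best = n
		}
	}
	return best
}

func main() {
	nodes := []*node{{cpuTotal: 64}, {cpuTotal: 64}, {cpuTotal: 64}}
	for i := 0; i < 6; i++ {
		n := bestOfK(nodes, len(nodes))
		n.optimisticAdd(2) // 2-CPU sandbox, counted immediately
	}
	// Because each placement is visible to the very next scoring pass,
	// the six sandboxes spread evenly: every node ends at 4 predicted CPUs.
	for i, n := range nodes {
		fmt.Printf("node %d predicted CPUs: %d\n", i, n.cpuAllocated.Load()+n.cpuInProgress.Load())
	}
}
```

Without the `cpuInProgress` term, all six placements in this loop would land on node 0, which is exactly the stale-metrics pile-up described above.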

📊 Benchmark Results

Tested under extreme conditions (10-node cluster, 500 burst traffic with simulated metric reporting delays):


Before Fix (Relying purely on stale metrics with BestOfK_K3):
Because the api relied on outdated snapshots, the cluster experienced severe load skew. The busiest node was assigned 120 Sandboxes, while the idlest nodes received only 2.

  • Imbalance(CV): 0.941 (Severe load imbalance)
[histogram: before fix, CV 0.941]

After Fix (BestOfK_K3 + InProgress Shadow State + Optimistic Update):
With local shadow state and optimistic updates, the api module accurately predicts expected load, and traffic is distributed smoothly and evenly across all available nodes.

  • Imbalance(CV): 0.021 (Near-perfect distribution)
[histogram: after fix, CV 0.021]
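The Coefficient of Variation reported by the benchmark is the standard deviation of per-node sandbox counts divided by their mean: 0 means a perfectly even spread, values near 1 mean severe skew. A small self-contained sketch (the sample counts below are illustrative, not the actual benchmark data):

```go
package main

import (
	"fmt"
	"math"
)

// coefficientOfVariation returns stddev/mean of per-node sandbox counts.
func coefficientOfVariation(counts []int) float64 {
	mean := 0.0
	for _, c := range counts {
		mean += float64(c)
	}
	mean /= float64(len(counts))

	variance := 0.0
	for _, c := range counts {
		d := float64(c) - mean
		variance += d * d
	}
	variance /= float64(len(counts))

	return math.Sqrt(variance) / mean
}

func main() {
	// Illustrative distributions for a 10-node cluster, 500 sandboxes.
	skewed := []int{120, 98, 75, 60, 52, 40, 30, 15, 8, 2}
	balanced := []int{51, 50, 49, 50, 50, 51, 49, 50, 50, 50}
	fmt.Printf("CV skewed:   %.3f\n", coefficientOfVariation(skewed))
	fmt.Printf("CV balanced: %.3f\n", coefficientOfVariation(balanced))
}
```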

📋 Checklist

  • Updated the scoring calculation logic for the BestOfK algorithm (Shadow State).
  • Implemented OptimisticAdd and integrated it into the placement flow.
  • Refactored the benchmark suite to support LaggyNode simulation.
  • Executed go test -v -bench=BenchmarkPlacementDistribution locally with passing, expected results.

@morningtzh morningtzh marked this pull request as ready for review March 23, 2026 09:33
Copy link

@chatgpt-codex-connector chatgpt-codex-connector bot left a comment


💡 Codex Review

Here are some automated review suggestions for this pull request.

Reviewed commit: e5bea0fd95


@morningtzh morningtzh force-pushed the optimize-placement branch 2 times, most recently from f8e7061 to 25ac879 on March 24, 2026 07:49
…unting and shadow state

High concurrency placement requests were causing severe "thundering herd" issues due to stale node metrics. The orchestrator would continuously schedule multiple sandboxes on the same seemingly "empty" node before it could report updated resource usage (the "invisibility gap").

This commit introduces active load prediction and optimistic resource reservation to ensure perfectly balanced placement even during metric reporting intervals.

Changes:
- fix(placement): factor `InProgress` pending resources into the `BestOfK` scoring calculation to predict expected load.
- fix(nodemanager): implement `OptimisticAdd` to immediately reserve resources upon successful placement, bridging the gap before async metric updates arrive.
- test(placement): refactor `SimulatedNode` into a `NodeSimulator` interface to support diverse node behavior simulations.
- test(placement): introduce `LaggyNode` to simulate real-world scenarios with stale/delayed node metrics.
- test(placement): add `BenchmarkPlacementDistribution` to visualize load distribution and verify the elimination of the thundering herd effect under high concurrency.

Signed-off-by: MorningTZH <morningtzh@yeah.net>
Member

@jakubno jakubno left a comment


Hey @morningtzh, thanks for submitting this. I really like the direction. There are some issues this introduces, but they seem smaller than the delay problem; still, there may be room for improvement.

Issues I currently see

  1. I don't see anything like OptimisticRemove. If some sandboxes are short-lived, won't this result in heavy skew?
  2. OptimisticAdd introduces a race condition between loading metrics and adding to them optimistically, so some sandboxes can temporarily be double-counted (sandbox is created, metrics sync, the api processes the response and adds to the metrics).

Could you please address the first, and for the second check whether a different order of operations helps?
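One possible shape for addressing both concerns is sketched below. This is purely hypothetical: `OptimisticRemove` and `OnHeartbeat` are not part of this PR, and resetting the pending delta on every heartbeat is a simplification (a placement made after the heartbeat snapshot was taken but before it is processed would be dropped; a sequence number or timestamp on reservations would be needed to close that gap).

```go
package main

import (
	"fmt"
	"sync"
)

// shadow pairs reported metrics with optimistic reservations. All names
// here are hypothetical illustrations, not the actual e2b-dev API.
type shadow struct {
	mu          sync.Mutex
	reportedCPU int64 // last heartbeat value
	pendingCPU  int64 // reservations not yet reflected in a heartbeat
}

func (s *shadow) OptimisticAdd(cpus int64) {
	s.mu.Lock()
	defer s.mu.Unlock()
	s.pendingCPU += cpus
}

// OptimisticRemove releases a reservation when a sandbox exits before the
// next heartbeat, so short-lived sandboxes don't skew future scores.
func (s *shadow) OptimisticRemove(cpus int64) {
	s.mu.Lock()
	defer s.mu.Unlock()
	if s.pendingCPU >= cpus {
		s.pendingCPU -= cpus
	}
}

// OnHeartbeat replaces the reported value and clears the pending delta:
// the heartbeat already includes everything placed before it was taken,
// which bounds double counting to one sync interval.
func (s *shadow) OnHeartbeat(reportedCPU int64) {
	s.mu.Lock()
	defer s.mu.Unlock()
	s.reportedCPU = reportedCPU
	s.pendingCPU = 0
}

func (s *shadow) PredictedCPU() int64 {
	s.mu.Lock()
	defer s.mu.Unlock()
	return s.reportedCPU + s.pendingCPU
}

func main() {
	var s shadow
	s.OptimisticAdd(2)            // sandbox placed, reserved locally
	fmt.Println(s.PredictedCPU()) // prediction includes the reservation
	s.OnHeartbeat(2)              // heartbeat now reports that sandbox
	fmt.Println(s.PredictedCPU()) // still 2, not 4: no double counting
}
```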
