fix(api): prevent thundering herd when scheduling sandboxes to orchestrator nodes #2201

morningtzh wants to merge 2 commits into e2b-dev:main
Conversation
💡 Codex Review
Here are some automated review suggestions for this pull request.
Reviewed commit: e5bea0fd95
packages/api/internal/orchestrator/placement/placement_benchmark_test.go (resolved review thread)
packages/api/internal/orchestrator/placement/placement_benchmark_test.go (resolved review thread)
Force-pushed f8e7061 to 25ac879
…unting and shadow state

High concurrency placement requests were causing severe "thundering herd" issues due to stale node metrics. The orchestrator would continuously schedule multiple sandboxes on the same seemingly "empty" node before it could report updated resource usage (the "invisibility gap"). This commit introduces active load prediction and optimistic resource reservation to ensure balanced placement even during metric reporting intervals.

Changes:
- fix(placement): factor `InProgress` pending resources into the `BestOfK` scoring calculation to predict expected load.
- fix(nodemanager): implement `OptimisticAdd` to immediately reserve resources upon successful placement, bridging the gap before async metric updates arrive.
- test(placement): refactor `SimulatedNode` into a `NodeSimulator` interface to support diverse node behavior simulations.
- test(placement): introduce `LaggyNode` to simulate real-world scenarios with stale/delayed node metrics.
- test(placement): add `BenchmarkPlacementDistribution` to visualize load distribution and verify the elimination of the thundering herd effect under high concurrency.

Signed-off-by: MorningTZH <morningtzh@yeah.net>
Force-pushed 0e7e906 to c153598
jakubno left a comment:
Hey @morningtzh, thanks for submitting this. I really like the direction. There are some issues we're introducing here, but they seem smaller than having the delay; still, there may be room for improvement.
Issues I currently see
- I don't see anything like `OptimisticRemove`. So if only some sandboxes are short-lived, it may result in a very skewed distribution, won't it?
- `OptimisticAdd` introduces a race condition between loading metrics and adding to them optimistically, so some sandboxes can temporarily be double counted (sandbox is created, metrics sync, API processes the response, adds to the metrics).
Could you please address the first, and check whether a different order of operations helps with the second?
📖 Background & Problem Statement
Currently, the `api` module uses a "Best of K" algorithm internally to schedule and place user Sandbox creation requests onto the underlying Orchestrator nodes. During high-concurrency stress tests (a massive influx of requests in a very short window), we observed severe load skew across the Orchestrator nodes.
Root Cause (Stale Metrics & Thundering Herd):
The scheduling `Score` within the `api` module heavily relies on periodic `node.Metrics()` reports from the Orchestrator nodes. Under high concurrency, node state reporting naturally lags due to network and heartbeat intervals. Consequently, the `api` module makes scheduling decisions based on a stale, "idle" static snapshot. Orchestrator nodes with initially low scores are repeatedly selected, while local "in-progress" allocations are completely ignored. This creates a "Thundering Herd" effect where traffic piles up inappropriately.

📉 Real-world Cluster Validation
To validate the accuracy of our newly introduced `LaggyNode` local benchmark, we conducted a real-world stress test with 500 concurrent requests on the actual cluster. As the actual runtime chart illustrates:
When high concurrency hits, the placement distribution exhibits extreme, sudden spikes. Due to the real physical delay in metric reporting, a few specific nodes (the lines shooting up to 100+) become "victims" of the scheduler and are instantly flooded with Sandboxes. Meanwhile, the vast majority of the nodes in the cluster remain completely idle (the flat lines near 0 at the bottom).
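As an aside, the stale-metrics behavior a simulator like `LaggyNode` needs to reproduce can be sketched in a few lines. The `NodeSimulator` interface and all names below are assumptions for illustration, not the PR's actual benchmark code: the point is only that `Metrics()` reports a snapshot several placements behind reality.

```go
package main

import "fmt"

// NodeSimulator is a hypothetical version of the interface described in
// this PR: a node that accepts placements and reports (possibly stale)
// metrics.
type NodeSimulator interface {
	Place()       // record one sandbox placement
	Metrics() int // reported sandbox count, possibly lagging reality
}

// LaggyNode reports a snapshot that is `lag` placements old, mimicking
// the heartbeat/reporting delay that causes the thundering herd.
type LaggyNode struct {
	actual  int
	history []int // past snapshots, oldest first
	lag     int
}

func NewLaggyNode(lag int) *LaggyNode { return &LaggyNode{lag: lag} }

func (n *LaggyNode) Place() {
	n.history = append(n.history, n.actual)
	n.actual++
}

func (n *LaggyNode) Metrics() int {
	if len(n.history) < n.lag {
		return 0 // no report has arrived yet: node looks empty
	}
	return n.history[len(n.history)-n.lag]
}

func main() {
	var sim NodeSimulator = NewLaggyNode(3)
	for i := 0; i < 5; i++ {
		sim.Place()
	}
	// 5 sandboxes placed, but metrics still report an older snapshot.
	fmt.Println(sim.Metrics()) // 2
}
```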
This real-world behavioral pattern aligns with the extreme imbalance observed in our local benchmark, showing that the refactored `LaggyNode` benchmark accurately reproduces the real-world scheduling pain points.

🛠️ Proposed Changes
This PR focuses on the `placement` and `nodemanager` packages within the `api` module. It resolves the load imbalance by introducing state prediction and compensating for the metric reporting delay.

- `packages/api/internal/orchestrator/placement/placement_best_of_K.go`: the `Score` function now actively incorporates pending CPUs from `node.PlacementMetrics.InProgress()` into the `metrics.CpuAllocated` calculation. This shifts the `api` module's scheduling logic from passive reporting to active expected-load prediction.
- Added an `OptimisticAdd` method to `packages/api/internal/orchestrator/nodemanager/node.go`. In `PlaceSandbox`, once a Sandbox is successfully created, its resources are immediately and optimistically deducted locally.
- Refactored `placement_benchmark_test.go` to abstract the hardcoded node logic into a `NodeSimulator` interface, introducing `LaggyNode` to simulate "stale metrics" scenarios.
- Added a `BenchmarkPlacementDistribution` suite, which outputs an ASCII histogram to visually demonstrate the Sandbox distribution and the Coefficient of Variation (CV).
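The first change above can be reduced to a simple idea: score a node on its predicted allocation (last report plus in-flight placements) rather than the report alone. The sketch below is a hedged illustration of that idea; the types, field names, and weighting are assumptions, not the actual `placement` package code.

```go
package main

import "fmt"

// nodeMetrics is a stand-in for the last (possibly stale) metrics report.
type nodeMetrics struct {
	CpuAllocated int64 // CPU milli reported as allocated
	CpuTotal     int64 // node capacity in CPU milli
}

// placementMetrics is a stand-in for the local shadow state of in-flight
// placements this node has accepted since the last report.
type placementMetrics struct {
	inProgressCpu int64
}

func (p placementMetrics) InProgress() int64 { return p.inProgressCpu }

// score returns predicted CPU utilization (lower is better): the stale
// report plus pending reservations, so a node that looks idle but already
// has placements in flight scores worse.
func score(m nodeMetrics, p placementMetrics) float64 {
	predicted := m.CpuAllocated + p.InProgress()
	return float64(predicted) / float64(m.CpuTotal)
}

func main() {
	stale := nodeMetrics{CpuAllocated: 0, CpuTotal: 16000}
	// Node reports empty, but 8 cores of placements are already in flight.
	fmt.Println(score(stale, placementMetrics{inProgressCpu: 8000})) // 0.5
	fmt.Println(score(stale, placementMetrics{}))                    // 0
}
```

Without the `InProgress` term, both calls above would return 0 and the "empty-looking" node would keep winning every Best-of-K comparison until the next metrics report lands.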
Tested under extreme conditions (10-node cluster, 500 burst traffic with simulated metric reporting delays):
Before Fix (relying purely on stale metrics with `BestOfK_K3`):

Because the `api` relied on outdated snapshots, the cluster experienced severe load skew. The busiest node was assigned 120 Sandboxes, while the idlest nodes received only 2. CV: 0.941 (severe load imbalance).

After Fix (`BestOfK_K3` + InProgress Shadow State + Optimistic Update):

With local shadow state and optimistic updates, the `api` module accurately predicts expected loads, and traffic is distributed smoothly across all available nodes. CV: 0.021 (near-perfect distribution).

📋 Checklist
- [x] Factored `InProgress` pending resources into the `BestOfK` algorithm (Shadow State).
- [x] Implemented `OptimisticAdd` and integrated it into the placement flow.
- [x] Refactored the benchmark tests and added the `LaggyNode` simulation.
- [x] Ran `go test -v -bench=BenchmarkPlacementDistribution` locally with passing, expected results.
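For reference, the Coefficient of Variation quoted in the benchmark results is the standard statistic (standard deviation divided by mean of the per-node sandbox counts): 0 means a perfectly even distribution, while values approaching or exceeding 1 indicate severe skew. A minimal sketch of the computation, not the PR's exact benchmark code:

```go
package main

import (
	"fmt"
	"math"
)

// coefficientOfVariation returns stddev/mean of per-node sandbox counts.
// 0 means perfectly even placement; large values mean heavy skew.
func coefficientOfVariation(counts []int) float64 {
	mean := 0.0
	for _, c := range counts {
		mean += float64(c)
	}
	mean /= float64(len(counts))

	variance := 0.0
	for _, c := range counts {
		d := float64(c) - mean
		variance += d * d
	}
	variance /= float64(len(counts))

	return math.Sqrt(variance) / mean
}

func main() {
	// One "victim" node flooded with sandboxes vs. an even spread
	// (illustrative numbers, not the benchmark's actual data).
	skewed := []int{120, 2, 2, 3, 2, 2, 2, 3, 2, 2}
	even := []int{50, 50, 50, 50, 50, 50, 50, 50, 50, 50}
	fmt.Printf("skewed CV: %.3f\n", coefficientOfVariation(skewed))
	fmt.Printf("even CV:   %.3f\n", coefficientOfVariation(even))
}
```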