Kushagrathapar/optimize cosmos ci shared build #48210
Draft
kushagraThapar wants to merge 6 commits into Azure:main
1. PR-conditional emulator matrix (16 → 11 jobs): Drops redundant JDK variants for Spark/Kafka in PR builds. Full matrix on main merges.
   Dropped for PRs (5 jobs, ~5 agent hours saved):
   - Spark 3.3 Java 11 (keeping Java 8)
   - Spark 3.4 Java 8 (keeping Java 11)
   - Spark 3.5/Scala 2.12 Java 8 (keeping Java 17)
   - Spark 4.0/Scala 2.13 Java 17 (keeping Java 21)
   - Kafka Java 11 (keeping Java 17)
2. Increase BuildParallelization from 1 to 2 in all stages (Build, TestEmulator, TestVNextEmulator).
3. Skip maven-shade-plugin for non-Spark/non-Kafka emulator jobs: Core emulator, long emulator, and encryption jobs don't need Spark/Kafka uber JARs. Adding -Dshade.skip=true saves ~90s of shade plugin execution per Spark module × 5 modules = ~7-8 min per non-Spark job (5 jobs × 7 min = ~35 min agent time saved).
4. Remove outdated comment about emulator download time.

Co-authored-by: Copilot <223556219+Copilot@users.noreply.github.com>
…properties

NetworkFailureTest#createCollectionWithUnreachableHost takes 121s because it waits for ClientRetryPolicy to exhaust 120 retries × 1s interval.

Changes:
- Add COSMOS.CLIENT_ENDPOINT_FAILOVER_MAX_RETRY_COUNT and COSMOS.CLIENT_ENDPOINT_FAILOVER_RETRY_INTERVAL_IN_MS to Configs.java (defaults: 120 retries, 1000ms; no behavior change in production)
- ClientRetryPolicy reads from Configs at each usage point (not cached in final static), allowing runtime override via system properties
- NetworkFailureTest sets 5 retries × 100ms at test start, restores defaults in finally block → test completes in ~0.5s instead of 121s
- Other tests in the same JVM are unaffected (properties restored)

Co-authored-by: Copilot <223556219+Copilot@users.noreply.github.com>
Replace Thread.sleep(sleepTime) in validateChangeFeedProcessing with a polling loop that returns as soon as all documents are received. The previous implementation always slept the full duration (10-50s) even if documents arrived in 1-2s. The polling loop checks every 100ms if receivedDocuments.size() has reached the expected count, with sleepTime as the maximum timeout. Estimated savings: 5-40s per test invocation depending on how quickly documents are processed by the change feed processor. Co-authored-by: Copilot <223556219+Copilot@users.noreply.github.com>
The antrun 03-repack phase expects shade output (native .jnilib/.so files in target/tmp/). When -Dshade.skip=true, the shade output doesn't exist and antrun fails with 'Could not find file'. Add -Dmaven.antrun.skip=true alongside -Dshade.skip=true. Co-authored-by: Copilot <223556219+Copilot@users.noreply.github.com>
The test step runs 'clean verify' which recompiles everything from scratch, including Spark shade. Our BuildOptions only affected the build step. Add -Dshade.skip=true -Dmaven.antrun.skip=true to AdditionalArgs for non-Spark jobs so it flows into TestOptions too. Keep BuildOptions for the build step as well (both steps need it). Co-authored-by: Copilot <223556219+Copilot@users.noreply.github.com>
Add BuildOptions parameter through ci.yml → ci.tests.yml → build-and-test.yml pipeline chain. Defaults to empty string (no behavior change for other SDKs). Cosmos Build stage sets BuildOptions to '-Dshade.skip=true -Dmaven.antrun.skip=true' to skip Spark/Kafka uber JAR creation during unit test matrix jobs, saving ~14 min per job. The release artifact deploy step is unaffected. Co-authored-by: Copilot <223556219+Copilot@users.noreply.github.com>
Optimize Cosmos DB CI Pipeline — Emulator Tests
Summary
Reduce CI agent time and test execution time for the Cosmos DB emulator test pipeline. Based on analysis of build #5953527 (82 min wall time, 23.4 agent hours, 30 jobs).
Changes
1. PR-conditional emulator matrix (16 → 11 jobs)
For PR builds, use a reduced matrix that drops redundant JDK variant jobs for Spark and Kafka connectors. Full matrix continues to run on main branch merges.
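Dropped for PRs (5 jobs):
- Spark 3.3 Java 11 (keeping Java 8)
- Spark 3.4 Java 8 (keeping Java 11)
- Spark 3.5/Scala 2.12 Java 8 (keeping Java 17)
- Spark 4.0/Scala 2.13 Java 17 (keeping Java 21)
- Kafka Java 11 (keeping Java 17)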
Savings: ~5 agent hours per PR
2. Increase Maven build parallelization (1 → 2)
All three stages (Build, TestEmulator, TestVNextEmulator) now use BuildParallelization: 2 instead of 1.
Savings: ~5 min per job × 16 jobs = ~80 min agent time
3. Skip maven-shade-plugin for non-Spark/non-Kafka emulator jobs
Core emulator, long emulator, and encryption jobs don't need Spark/Kafka uber JARs. Added BuildOptions: "-Dshade.skip=true" to these 5 matrix entries. The build step in each emulator job spends ~14 min (88% of 17 min total) creating 5 Spark uber JARs that are never used by non-Spark tests.
Savings: ~7-8 min per non-Spark job × 5 jobs = ~35 min agent time
4. Configurable endpoint failover retry constants
NetworkFailureTest#createCollectionWithUnreachableHost takes 121s because it waits for ClientRetryPolicy to exhaust 120 retries × 1s interval.
- Added COSMOS.CLIENT_ENDPOINT_FAILOVER_MAX_RETRY_COUNT and COSMOS.CLIENT_ENDPOINT_FAILOVER_RETRY_INTERVAL_IN_MS system properties to Configs.java (defaults unchanged: 120 retries, 1000ms)
- ClientRetryPolicy reads from Configs at each usage point, allowing runtime override
- NetworkFailureTest overrides to 5 retries × 100ms, restores defaults after test
Savings: 121s → 0.5s per test run
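The read-at-each-use and override-then-restore pattern described above can be sketched roughly as follows. This is a minimal illustration, not the SDK's code: the class FailoverRetrySettings and its methods are hypothetical stand-ins for Configs and the test setup, and only the two property names and the default values come from this PR.

```java
// Hypothetical stand-in for the settings described above; NOT the SDK's Configs class.
final class FailoverRetrySettings {
    // Property names and defaults are taken from the PR description.
    static final String MAX_RETRY_COUNT_PROPERTY = "COSMOS.CLIENT_ENDPOINT_FAILOVER_MAX_RETRY_COUNT";
    static final String RETRY_INTERVAL_PROPERTY = "COSMOS.CLIENT_ENDPOINT_FAILOVER_RETRY_INTERVAL_IN_MS";
    static final int DEFAULT_MAX_RETRY_COUNT = 120;
    static final int DEFAULT_RETRY_INTERVAL_MS = 1000;

    // Read the system property on every call instead of caching it in a
    // static final field, so an override set at runtime is picked up.
    static int maxRetryCount() {
        return Integer.getInteger(MAX_RETRY_COUNT_PROPERTY, DEFAULT_MAX_RETRY_COUNT);
    }

    static int retryIntervalMs() {
        return Integer.getInteger(RETRY_INTERVAL_PROPERTY, DEFAULT_RETRY_INTERVAL_MS);
    }

    // Upper bound on the time spent sleeping between retries.
    static long worstCaseWaitMs() {
        return (long) maxRetryCount() * retryIntervalMs();
    }

    // Run an action with temporary overrides, then restore whatever was set
    // before, so other tests in the same JVM keep seeing the defaults.
    static void withOverrides(int retries, int intervalMs, Runnable action) {
        String previousCount = System.getProperty(MAX_RETRY_COUNT_PROPERTY);
        String previousInterval = System.getProperty(RETRY_INTERVAL_PROPERTY);
        System.setProperty(MAX_RETRY_COUNT_PROPERTY, Integer.toString(retries));
        System.setProperty(RETRY_INTERVAL_PROPERTY, Integer.toString(intervalMs));
        try {
            action.run();
        } finally {
            restore(MAX_RETRY_COUNT_PROPERTY, previousCount);
            restore(RETRY_INTERVAL_PROPERTY, previousInterval);
        }
    }

    private static void restore(String key, String previous) {
        if (previous == null) {
            System.clearProperty(key);
        } else {
            System.setProperty(key, previous);
        }
    }
}
```

With the defaults, worstCaseWaitMs() is 120 × 1000 ms = 120 s, which lines up with the 121s test duration quoted above. Inside withOverrides(5, 100, ...) it drops to 500 ms.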
5. Poll instead of fixed sleep in ChangeFeedProcessor tests
IncrementalChangeFeedProcessorTest.validateChangeFeedProcessing previously did Thread.sleep(sleepTime) for the full duration (10-50s) even if documents arrived in 1-2s. Replaced with a polling loop that checks every 100ms and returns as soon as all documents are received.
Savings: 5-40s per CFP test invocation
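A polling wait of this kind might look like the sketch below. It is an assumption-laden illustration rather than the code in the test: PollingWait and waitUntil are hypothetical names, and the condition stands in for the receivedDocuments.size() check mentioned in the commit message.

```java
import java.util.concurrent.TimeUnit;
import java.util.function.BooleanSupplier;

// Hypothetical helper illustrating poll-with-timeout; not the test's actual code.
final class PollingWait {
    // Checks the condition every pollIntervalMs and returns true as soon as it
    // holds. Returns false if timeoutMs elapses first or the thread is interrupted.
    static boolean waitUntil(BooleanSupplier condition, long timeoutMs, long pollIntervalMs) {
        long deadline = System.nanoTime() + TimeUnit.MILLISECONDS.toNanos(timeoutMs);
        while (true) {
            if (condition.getAsBoolean()) {
                return true;
            }
            long remainingMs = TimeUnit.NANOSECONDS.toMillis(deadline - System.nanoTime());
            if (remainingMs <= 0) {
                return false;
            }
            try {
                // Never sleep past the deadline.
                Thread.sleep(Math.min(pollIntervalMs, remainingMs));
            } catch (InterruptedException e) {
                Thread.currentThread().interrupt(); // preserve the interrupt flag
                return false;
            }
        }
    }
}
```

In the test, the call would be along the lines of waitUntil(() -> receivedDocuments.size() >= expectedCount, sleepTime, 100), where expectedCount is an assumed name. The old sleepTime then acts only as the maximum timeout.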
Impact Summary
Files Changed
Pipeline config
- eng/pipelines/templates/stages/cosmos-sdk-client.yml: PR-conditional matrix, build parallelization, shade skip
- eng/pipelines/templates/stages/cosmos-emulator-matrix.json: Added BuildOptions per job
- eng/pipelines/templates/stages/cosmos-emulator-matrix-pr.json: New reduced matrix for PRs
Production code (no behavior change)
- sdk/cosmos/azure-cosmos/.../Configs.java: New configurable retry properties
- sdk/cosmos/azure-cosmos/.../ClientRetryPolicy.java: Read retry constants from Configs
Test code
- sdk/cosmos/azure-cosmos-tests/.../NetworkFailureTest.java: Override retry for fast execution
- sdk/cosmos/azure-cosmos-tests/.../IncrementalChangeFeedProcessorTest.java: Poll instead of sleep
Testing
These changes will be validated by the CI pipeline itself. The production code changes (Configs/ClientRetryPolicy) maintain identical defaults — no behavioral change unless system properties are explicitly set.