Sonic dynamic model loading#25
Sonic dynamic model loading#25trevin-lee wants to merge 34 commits intofastmachinelearning:masterfrom
Conversation
…r method in TritonClient. Update BuildFile.xml and fix formatting in header files.
…tructor for TritonClient, and update BuildFile.xml to include Catch2 for testing.
…tests; remove old cfg
…lection; remove unused parameters and improve documentation.
- Introduced `loadModel` and `unloadModel` methods for managing model lifecycle. - Added mutex for thread safety during model operations. - Updated `TritonService` header and implementation to support dynamic model management. - Enhanced logging for model loading and unloading processes. - Updated test configurations to include dynamic model loading tests.
…requirements - Modified input handling to utilize actual model input for "x" instead of dummy data. - Adjusted shape and data allocation for input to meet base class expectations. - Updated parameter set description method to use TritonClient for configuration.
|
I have not looked at the test code yet because I think the logic in the TritonService needs to be addressed first. Another general point: part of the idea for dynamic loading with the fallback server would be to get rid of the |
- Applied scram b code-format to fix formatting issues - Added TRITON_THROW_IF_ERROR for error handling in loadModel/unloadModel - Fixed unload semantics to not erase tracking data on failure - Added comments explaining dynamic loading test requirements - Removed spurious formatting changes
- Removed loadedModels_ set, now derive loaded status from modelRefCount_ > 0 - Simplified loadModel() and unloadModel() to only work with fallback server - Removed server searching logic since only fallback is supported - Added documentation clarifying fallback-only limitation - Updated error messages to reflect this architectural decision - Fixed XML comment syntax in BuildFile.xml
…ading - Updated retry logic to switch to the fallback server directly for model loading. - Removed previous best server selection logic, simplifying the retry process. - Added logging for fallback loading failures and re-evaluation after loading. - Ensured dynamic loading is explicitly handled in TritonService with updated command options.
…in TritonService - Introduced FallbackModelState to track dynamic model state, including reference count and model path. - Updated loadModel and unloadModel methods to utilize FallbackModelState for managing model lifecycle. - Enhanced logging to reflect model state changes and operations. - Adjusted test configuration to reflect changes in module handling for dynamic model loading.
… handling - Eliminated unservedModels_ from TritonService, simplifying model management. - Updated addModel and preBeginJob methods to work directly with models_. - Enhanced error handling in loadModel for fallback server scenarios. - Improved comments for clarity on model path handling and fallback server usage.
HeterogeneousCore/SonicTriton/test/DynamicModelLoadingProducer.cc
Outdated
Show resolved
Hide resolved
- Move --model-control-mode flag from TritonService.cc to cmsTriton script with new -L option to disable (enabled by default) - Remove path parameter from loadModel(); retrieve path from models_ map - Use client's modelName() in DynamicModelLoadingProducer instead of separate testModelName/testModelPath parameters - Restore getBestServer() logic in RetryActionDiffServer to try remote servers before falling back to local server with dynamic loading - Remove unnecessary fallbackName variable in RetryActionDiffServer - Restore default --modules value to TritonGraphProducer in test config - Add clarifying comment for path fix-up in addModel() Deferred for future work: - Consolidate Model/FallbackModelState structs to eliminate refcount duplication - Further examination of lock scope in loadModel/unloadModel
- Add RefCount.cc: unit test for refcount correctness with 7 test cases - Verifies multiple loads increment refcount without server calls - Verifies unloads only call server when refcount reaches 0 - Verifies models are independent and paths are preserved - Rename test_RetryActionDiffServer.cc to RetryActionDiffServer.cc to match naming convention of other test files
- Fix error message: fallback server not found is global, not model-specific - Reorder checks: check startedFallback_ before checking servers_ map - Wrap LoadModel/UnloadModel calls directly in TRITON_THROW_IF_ERROR - Remove models_.emplace in loadModel (models already added during init) - Remove unnecessary parentheses in --state.refCount
- Add refCount member to Model struct (removes duplication) - Remove FallbackModelState struct (no longer needed) - Remove modelRefCount_ map (redundant with Model.refCount) - Rename fallbackModels_ to fallbackLoadedModels_ (now just a set of names) - Update loadModel/unloadModel to operate on Model directly This addresses review comment about unnecessary duplication between Model and FallbackModelState objects.
|
@trevin-lee thanks, this is in good shape now! Just a few more minor comments. Further thoughts on mutex for load/unload: I have actually changed my mind somewhat. Because we are doing both There is a more streamlined approach possible using |
…ror message in TritonService loadModel function. This improves clarity in model control handling and error reporting for fallback server scenarios.
…t and update model loading/unloading methods - Replace std::unordered_map with tbb::concurrent_hash_map for models_ to enhance thread safety during concurrent access. - Simplify loadModel and unloadModel methods to utilize accessors for model state management. - Update cmsTriton script to set MODELCONTROL to "none" for improved model control handling. - Remove deprecated comments and adjust test configurations for concurrent model loading.
| fi | ||
| DEVICE=auto | ||
| THREADCONTROL="" | ||
| MODELCONTROL="none" |
There was a problem hiding this comment.
this is an incorrect change. "explicit" should still be the (new) default. "none" is only needed on L101 (as the value if -L is specified)
There was a problem hiding this comment.
Fixed - reverted the default back to "explicit" and updated the -L flag to set MODELCONTROL="none" instead of empty string.
|
@trevin-lee the most recent commit has a lot of changes that I don't understand. At least one of them is definitely incorrect, as remarked above. There are also tests commented out for unclear reasons, etc. More broadly, maybe my previous comment wasn't clear. The key remark was here:
Making serial modifications where one concurrent map is locked with an accessor and then another normal map is locked with a mutex can still lead to data races. I would just revert the last commit. For now, the single mutex approach is safest. We can reconsider the design later (i.e. not in this PR) if it becomes a bottleneck. This will require careful design and implementation, not widespread changes with unclear motivations. |
* Reverts the fine-grained locking approach (concurrent_hash_map with * per-model accessors + separate serverModelsMutex_) in favor of the * original single mutex. Serial locking of one concurrent map accessor * followed by a separate mutex can still lead to data races across the * two maps. The single mutex approach is safest for now; the design can * be revisited later if it becomes a bottleneck. Reverts b6b7777 and 5c2acaf.
PR description:
This PR introduces dynamic model loading/unloading capabilities and server health monitoring to the SONIC Triton integration in CMSSW. The main features include:
1. Dynamic Model Loading and Unloading:
loadModel()andunloadModel()methods toTritonServicefor managing model lifecycle at runtimeDynamicModelLoadingProducertest module to validate dynamic model management4. Code Improvements:
customize.pyfor better configurabilityTritonClientwith new constructor for testing and enhanced server connection methodsExpected Output Changes:
Dependencies:
Files Modified:
HeterogeneousCore/SonicCore/src/SonicClientBase.ccHeterogeneousCore/SonicCore/plugins/BuildFile.xmlHeterogeneousCore/SonicTriton/interface/TritonService.hHeterogeneousCore/SonicTriton/interface/TritonClient.hHeterogeneousCore/SonicTriton/src/TritonService.ccHeterogeneousCore/SonicTriton/src/TritonClient.ccHeterogeneousCore/SonicTriton/src/RetryActionDiffServer.ccHeterogeneousCore/SonicTriton/test/BuildFile.xmlHeterogeneousCore/SonicTriton/test/tritonTest_cfg.pyNew Files:
HeterogeneousCore/SonicTriton/test/DynamicModelLoadingProducer.ccHeterogeneousCore/SonicTriton/test/test_RetryActionDiffServer.ccRemoved Files:
HeterogeneousCore/SonicTriton/test/RetryActionDiffServer.cc(replaced with unit test)PR validation:
Integration Tests:
scram b -j 8Known Issues:
Documentation:
Backport Information:
This PR is NOT a backport. It is intended for CMSSW_15_1_X release cycle.
If backporting becomes necessary, it would target future release cycles after initial integration and validation in CMSSW_15_1_X.