DFlash VLM training support with SGLang backend #505

Mandy3311 wants to merge 1 commit into sgl-project:main
Conversation
Summary of Changes (Gemini Code Assist): This pull request enhances the DFlash speculative decoding training framework by integrating support for Vision-Language Models (VLMs). It allows training on datasets containing both text-only and multimodal samples, addressing key aspects of VLM data handling such as image injection in conversations and consistent data schema management. The changes also include architectural adaptations for VLM-specific model configurations and optimized SGLang backend interactions for multimodal inputs.
Code Review
This pull request introduces significant enhancements to enable DFlash training for Vision-Language Models (VLMs), with a focus on the SGLang backend. The changes include adding multimodal data ingestion, constructing SGLang multimodal requests, and making weight key paths configurable for VLM architectures. The PR also contains several important correctness fixes and refactorings, such as ensuring images are only attached to the first user turn in multi-turn conversations and gracefully handling text-only samples during mixed VLM training. My review focuses on improving code clarity and robustness by addressing non-standard argument usage and adding explanatory comments for non-obvious code.
scripts/train_dflash.py (outdated)

```python
ckpt_info = None # 预定义以防万一,虽然下面的逻辑更稳妥
# ...
# --- 步骤 2: 尝试获取 checkpoint 信息 ---
```
These comments are in Chinese. For consistency and maintainability in this English-language codebase, please translate them to English.
```diff
- ckpt_info = None # 预定义以防万一,虽然下面的逻辑更稳妥
- # --- 步骤 2: 尝试获取 checkpoint 信息 ---
+ ckpt_info = None  # Pre-define to be safe, although the logic below is more robust
+ # --- Step 2: Try to get checkpoint information ---
```
```python
processor = AutoProcessor.from_pretrained(
    args.target_model_path,
    min_pixels=args.min_pixels,
    max_pixels=args.max_pixels,
    trust_remote_code=args.trust_remote_code,
    exist_ok=True
)
```
The `exist_ok=True` argument passed to `AutoProcessor.from_pretrained` does not appear to be a standard argument for this Hugging Face Transformers method. While it might be ignored if `trust_remote_code=True` allows for custom arguments in the model's loading code, that is not guaranteed and could lead to unexpected behavior or errors with different models or library versions. It would be safer to remove this argument unless it is strictly required by the Qwen-VL model's custom code.
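One way to address this without sprinkling conditionals at each call site is to filter keyword arguments against an explicit allow-list before calling `from_pretrained`. This is a hypothetical sketch, not code from the PR; the helper name and the allowed set are assumptions.

```python
def safe_processor_kwargs(kwargs,
                          allowed=("min_pixels", "max_pixels", "trust_remote_code")):
    """Drop kwargs not known to be accepted by AutoProcessor.from_pretrained.

    Non-standard arguments like `exist_ok` are silently discarded so they
    cannot trigger TypeError on library versions that validate kwargs.
    """
    return {k: v for k, v in kwargs.items() if k in allowed}


# Usage (hypothetical): processor = AutoProcessor.from_pretrained(
#     path, **safe_processor_kwargs(raw_kwargs))
```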
```python
if 'enable_piecewise_cuda_graph' in kwargs:
    del kwargs['enable_piecewise_cuda_graph']
```
The code explicitly deletes `enable_piecewise_cuda_graph` from `kwargs` before calling `SGLangDFlashTargetModel.from_pretrained`. This seems like a workaround for a potential issue, but there is no explanation. This argument is related to an SGLang inference optimization (CUDA graphs), which might not be compatible with the training-time hidden state generation being done here. Please add a comment explaining why this argument needs to be removed, to improve code clarity and maintainability.
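The reviewer's suggestion could be sketched as below: use `dict.pop` with a default instead of the `in` + `del` pair, and document the reason inline. The function name and comment wording are assumptions, not code from the PR.

```python
def strip_inference_only_kwargs(kwargs):
    """Remove SGLang inference-only options before engine construction."""
    # enable_piecewise_cuda_graph is an SGLang inference-time CUDA-graph
    # optimization; it is presumed inapplicable when the engine is used only
    # to generate training-time hidden states, so drop it if present.
    kwargs.pop("enable_piecewise_cuda_graph", None)
    return kwargs
```

`pop` with a default also avoids the double dictionary lookup of the original `in` check followed by `del`.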
Co-authored-by: hukongyi <hukongyi@cmbchina.com>
Summary

This PR extends DFlash speculative decoding training to support Vision-Language Models (VLMs), with Qwen3.5-VL as the primary validated target. It adds multimodal data ingestion, SGLang multimodal request construction using `MRotaryEmbedding`, configurable weight key paths for VLM architectures, and several correctness fixes to the base training loop.
Mixed Text + VLM Training

This PR supports training on a mixture of text-only and vision-language samples within the same run. When `--is-vlm` is set and the dataset contains samples where `image` is `None`, those samples are processed as text-only with empty tensor placeholders for `pixel_values`/`image_grid_thw` to maintain a consistent HuggingFace Arrow schema across parallel preprocessing shards.
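The placeholder idea can be sketched as follows. This is an illustrative reconstruction, not the PR's code: the function name is hypothetical, and the `patch_dim=1176` default assumes a Qwen2-VL-style flattened patch feature width; the key point is that text-only samples emit zero-length arrays with the same trailing shape and dtype as real VLM samples, so every Arrow shard shares one schema.

```python
import numpy as np

def vlm_or_text_features(pixel_values, image_grid_thw, patch_dim=1176):
    """Return schema-consistent multimodal features for a single sample.

    Text-only samples (pixel_values is None) get zero-length placeholder
    arrays whose dtype and trailing dimensions match real VLM samples, so
    parallel preprocessing shards all produce the same Arrow schema.
    """
    if pixel_values is None:
        return {
            "pixel_values": np.zeros((0, patch_dim), dtype=np.float32),
            "image_grid_thw": np.zeros((0, 3), dtype=np.int64),
        }
    return {"pixel_values": pixel_values, "image_grid_thw": image_grid_thw}
```

Downstream code can then distinguish the two cases by checking `len(features["pixel_values"]) == 0` rather than handling a missing column.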
Changes

- `scripts/train_dflash.py`
  - Add `--embed-key`/`--lm-head-key` to configure weight key paths; VLMs typically use `model.language_model.embed_tokens.weight` instead of the LLM default.
  - Load an `AutoProcessor` when `--is-vlm` is set and thread it through `build_dataloader`.
- `specforge/data/preprocessing.py`
  - Attach the image only to the first user turn. Previously every user turn received the image, breaking multi-turn conversation formatting. Currently we do not support multiple images in one data item. The chat template is responsible for system prompts.
  - Fix an `AttributeError: 'NoneType'.startswith` crash: samples with `image=None` are now handled gracefully, processed as text-only with empty tensor placeholders for `pixel_values`/`image_grid_thw` to keep the HF Arrow schema consistent across shards.
- `specforge/data/utils.py`
  - `VlmDataCollatorWithPadding`: enforce `batch_size=1` with an assertion; pass `pixel_values` directly for VLM samples and `None` for text-only samples so the training loop falls back to the text-only SGLang path.
- `specforge/modeling/target/dflash_target_model.py`
  - Add `_build_vlm_reqs()` to construct SGLang `Req` objects with full `MultimodalInputs`: split `pixel_values` per sample using `image_grid_thw` patch counts (t×h×w), build per-sample `image_grid_thw` tensors, compute mrope positions via `MRotaryEmbedding.get_rope_index` using the auto-detected `vlm_model_type`, and pad multimodal tokens with `MultiModalityDataPaddingPatternMultimodalTokens`.
  - Move `RadixCache` creation out of `_extend()` into `__init__` as `self.dummy_tree_cache` to avoid re-allocating the cache on every forward pass.
  - Extend `generate_dflash_data()` to branch on `is_vlm_batch` and dispatch to `_build_vlm_reqs`; refactor the text-only path to no longer maintain a `data_cache`.
  - Add `set_dflash_layers_to_capture()` dispatch alongside the existing `set_eagle3_layers_to_capture()` in `set_capture_layers()`.
  - Extend `HFDFlashTargetModel.generate_dflash_data()` to forward `pixel_values`/`image_grid_thw` to the HF model via `model_kwargs`.
- `specforge/modeling/target/target_utils.py`
  - Fall back to `text_config` for `vocab_size`, `hidden_size`, and `pad_token_id` when the top-level VLM config does not expose them directly.
References

- The `patch.py` and `model_runner.py` changes in this PR are a subset of that work.