Core Implementation:
- Add mask_phylogenetic_tags() to genomic_masking_functions.py
- State machine from Evo2 for detecting taxonomy tags
- Format: |d__Bacteria;p__Proteobacteria;...|
- Draws upon Evo2 NeMo implementation exactly
- Update GenomicDataCollator with mask_phylo_tags parameter

Features (Milestone 2):
- mask_phylo_tags: Default False (metagenomics doesn't need it)
- Enable for full OpenGenome2 pretraining (has phylo tags)
- Backward compatible: Existing code unaffected

Signed-off-by: savitha-eng <savithas@nvidia.com>
Add test_dataloader_with_phylo_masking:
- Tests full pipeline (dataloader + phylo collator)
- Verifies phylo tags are masked in batches
- Uses realistic sequences with taxonomy annotations

Signed-off-by: savitha-eng <savithas@nvidia.com>
Add Phylogenetic Tag Masking Support (for training with the full OpenGenome2 pretraining dataset)
Summary
Extends the genomic data collator to support masking phylogenetic taxonomy tags for training on the full OpenGenome2 dataset. Uses Evo2's phylogenetic tag detection algorithm for sequences containing taxonomy annotations.
Description
This PR adds support for masking phylogenetic taxonomy tags that appear in OpenGenome2's full pretraining dataset. These tags have the format `|d__Bacteria;p__Proteobacteria;c__Gammaproteobacteria|` and should not be predicted by the model since they are metadata annotations, not DNA sequence.

The implementation is the same as the one in NeMo's Evo2 dataset and uses a state machine to detect taxonomy patterns between pipe delimiters. The masking is optional and disabled by default (it is only needed for the full dataset, not for metagenomics-only training).
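To make the idea of pipe-delimited tag detection concrete, here is a minimal sketch of a two-state scanner. It is not the Evo2/NeMo code that this PR ports: the names `looks_like_taxonomy` and `sketch_phylo_tag_mask` are hypothetical, it works on plain strings rather than token IDs, and it assumes every tag is complete (it does not handle tags truncated at a sequence boundary).

```python
import re

# Hypothetical pattern for one taxonomy field such as "d__Bacteria":
# a lowercase rank letter, two underscores, then a name.
_FIELD = re.compile(r"^[a-z]__[A-Za-z0-9_ .\-]*$")


def looks_like_taxonomy(body: str) -> bool:
    """True if the text between two pipes is a ';'-separated list of rank fields."""
    fields = [f for f in body.split(";") if f]
    return bool(fields) and all(_FIELD.match(f) for f in fields)


def sketch_phylo_tag_mask(seq: str) -> list[int]:
    """Return 1 for positions kept in the loss and 0 for positions inside a tag."""
    mask = [1] * len(seq)
    tag_start = None  # index of the opening pipe while inside a candidate tag
    for i, ch in enumerate(seq):
        if ch != "|":
            continue
        if tag_start is None:
            tag_start = i  # state change: OUTSIDE -> INSIDE
            continue
        body = seq[tag_start + 1 : i]
        if looks_like_taxonomy(body):
            # Mask the whole tag, including both pipes.
            for j in range(tag_start, i + 1):
                mask[j] = 0
            tag_start = None  # state change: INSIDE -> OUTSIDE
        else:
            # Not a tag: treat this pipe as a new candidate opener.
            tag_start = i
    return mask


example = "ACGT|d__Bacteria;p__Proteobacteria|TTGA"
example_mask = sketch_phylo_tag_mask(example)
kept = "".join(ch for ch, keep in zip(example, example_mask) if keep)
```

In this sketch only the DNA on either side of the tag stays in `kept`; a pair of pipes that encloses something other than rank fields is left unmasked.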
Key changes:
- `mask_phylogenetic_tags()` function (160 lines from Evo2 implementation)
- `mask_phylo_tags` parameter to GenomicDataCollator (default: False)

Backward compatibility:
Usage
Enable phylogenetic tag masking for full OpenGenome2 pretraining:
Or via Hydra config:
Order of operations:
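As a hedged illustration of why ordering can matter, the toy sketch below applies a tag mask and a degenerate-base mask to byte-level labels in both orders. Everything in it is an assumption made for illustration: the helpers `mask_phylo` and `mask_degenerate`, the simplified regex, and the byte-level labels are hypothetical and are not the collator's actual code.

```python
import re

IGNORE = -100  # PyTorch's default ignore_index for cross-entropy loss
DNA = set(b"ACGT")  # byte values of the four uppercase bases
TAG = re.compile(rb"\|[a-z]__[^|]*\|")  # hypothetical, simplified tag pattern


def mask_phylo(labels):
    """Mask every label inside a pipe-delimited taxonomy tag."""
    # Already-masked positions become 0 bytes so the regex sees no pipe there.
    raw = bytes(l if l != IGNORE else 0 for l in labels)
    out = list(labels)
    for m in TAG.finditer(raw):
        out[m.start():m.end()] = [IGNORE] * (m.end() - m.start())
    return out


def mask_degenerate(labels):
    """Mask every label that is not an uppercase A/C/G/T byte."""
    return [l if l in DNA else IGNORE for l in labels]


seq = b"ACGT|d__Bacteria;c__Gammaproteobacteria|TTGA"
labels = list(seq)

# Tags first, then degenerate bases: the whole tag is found and masked.
good = mask_degenerate(mask_phylo(labels))
# Degenerate bases first: the pipes are already masked, so no tag is detected,
# and the uppercase "G" of "Gammaproteobacteria" survives as a training target.
bad = mask_phylo(mask_degenerate(labels))

kept_good = bytes(l for l in good if l != IGNORE)
kept_bad = bytes(l for l in bad if l != IGNORE)
```

In this toy setup the second ordering leaves one tag character in the loss because it happens to look like a valid base.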
This order is critical: phylo detection relies on pipes (`|`) and lowercase letters that would be removed by degenerate/uppercase masking.

Type of changes
Pre-submit Checklist