@realmarcin

The EC ontology (ec.json) contains relationships between enzyme commission numbers and protein sequences, including TrEMBL (unreviewed UniProt) entries. These protein nodes were being imported but not filtered out.

Fix:

  • Add TrEMBL: prefix to protein filtering in EC ontology post-processing
  • Now filters both UniprotKB: and TrEMBL: prefixes
  • Reduces EC nodes from 81,679 to 13,132 (removes 236,220 TrEMBL entries)
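
A minimal sketch of the multi-prefix filter described in the list above, assuming node and edge rows are held as lists of tab-separated strings. The names `PROTEIN_PREFIXES` and `drop_protein_lines` and the sample rows are hypothetical and not taken from the repository; only the `UniprotKB:` and `TrEMBL:` prefix strings come from this description.

```python
from typing import Iterable, List, Sequence

# Prefixes named in the PR description; the tuple name is hypothetical.
PROTEIN_PREFIXES = ("UniprotKB:", "TrEMBL:")


def drop_protein_lines(
    lines: Iterable[str], prefixes: Sequence[str] = PROTEIN_PREFIXES
) -> List[str]:
    """Keep only the lines that contain none of the given prefixes."""
    return [line for line in lines if not any(p in line for p in prefixes)]


# Made-up sample rows, for illustration only.
sample_nodes = [
    "EC:1.1.1.1\tbiolink:MolecularActivity\talcohol dehydrogenase",
    "UniprotKB:P00330\tbiolink:Protein\tADH1_YEAST",
    "TrEMBL:A0A000\tbiolink:Protein\tunreviewed entry",
]
filtered_nodes = drop_protein_lines(sample_nodes)
```

Because the check is a substring match on the whole line, an edge row is dropped when either its subject or its object carries one of the prefixes.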

This complements the earlier fix for Rhea mappings (commit 410491e) which filtered TrEMBL from that transform. TrEMBL protein nodes should not be in the knowledge graph as they are not relevant to microbial traits/media data.

🤖 Generated with Claude Code

Co-Authored-By: Claude <noreply@anthropic.com>

Copilot AI left a comment

Pull request overview

This PR extends the EC ontology transform to filter out TrEMBL (unreviewed UniProt) protein nodes in addition to the existing UniprotKB filtering. This reduces the EC node count from 81,679 to 13,132 by removing 236,220 TrEMBL entries that are not relevant to microbial traits/media data.

Key Changes:

  • Added TrEMBL prefix to protein filtering logic in EC ontology post-processing
  • Refactored filtering from single-prefix check to multi-prefix list-based approach
  • Updated inline comment to reflect both UniprotKB and TrEMBL filtering

-new_nf_lines = [line for line in new_nf_lines if UNIPROT_PREFIX not in line]
-new_ef_lines = [line for line in new_ef_lines if UNIPROT_PREFIX not in line]
+# Remove UniProt and TrEMBL nodes since accounted for elsewhere
+protein_prefixes = [UNIPROT_PREFIX, "TrEMBL:"]
Copilot AI Dec 23, 2025

The "TrEMBL:" prefix is hardcoded as a string literal, which is inconsistent with the codebase pattern. The UNIPROT_PREFIX constant is imported from constants.py (line 47), but no corresponding TREMBL_PREFIX constant exists. Consider defining a TREMBL_PREFIX constant in kg_microbe/transform_utils/constants.py (similar to UNIPROT_PREFIX at line 452) and importing it here for consistency and maintainability.
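
The suggestion could look roughly like the sketch below. It is collapsed into a single file for illustration; in the repository the constants would live in kg_microbe/transform_utils/constants.py and be imported by the transform. The value of `UNIPROT_PREFIX` is an assumption based on the `UniprotKB:` prefix named in the PR description, and `is_protein_line` is a hypothetical helper.

```python
# Would live in constants.py; the UNIPROT_PREFIX value is assumed here.
UNIPROT_PREFIX = "UniprotKB:"
TREMBL_PREFIX = "TrEMBL:"

# Would live in the transform module, after importing both constants.
protein_prefixes = [UNIPROT_PREFIX, TREMBL_PREFIX]


def is_protein_line(line: str) -> bool:
    """Return True if the line mentions any of the protein prefixes."""
    return any(prefix in line for prefix in protein_prefixes)
```

Keeping the prefix strings in one module means a later change to a prefix only has to be made in one place.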

realmarcin and others added 2 commits December 22, 2025 21:49
The comment claimed to filter TrEMBL entries, but:
1. The Rhea mappings transform doesn't produce TrEMBL nodes (verified: 0 TrEMBL entries)
2. The code only filters UniprotKB: and CHEBI: prefixes, not TrEMBL:
3. TrEMBL nodes were actually coming from EC transform (fixed separately)

Updated comment to accurately describe what the code does: only includes
CHEBI, EC, and GO entries, filtering out UniProt entries.
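
A check like the "0 TrEMBL entries" verification above can be sketched by counting CURIE prefixes in the first column of a nodes table. This is an illustrative sketch, not the verification that was actually run: `count_prefixes` and the sample rows are hypothetical, and it assumes tab-separated rows with the node ID in the first column.

```python
from collections import Counter
from typing import Iterable


def count_prefixes(rows: Iterable[str]) -> Counter:
    """Count CURIE prefixes (text up to the first ':') in the first column."""
    counts: Counter = Counter()
    for row in rows:
        node_id = row.split("\t", 1)[0]
        if ":" in node_id:  # skips header rows such as "id\tcategory"
            counts[node_id.split(":", 1)[0] + ":"] += 1
    return counts


# Made-up sample rows, for illustration only.
sample_rows = [
    "id\tcategory",
    "CHEBI:15377\tbiolink:ChemicalEntity",
    "EC:1.1.1.1\tbiolink:MolecularActivity",
    "GO:0008150\tbiolink:BiologicalProcess",
    "EC:2.7.1.1\tbiolink:MolecularActivity",
]
prefix_counts = count_prefixes(sample_rows)
```

A `Counter` returns 0 for a missing key, so `prefix_counts["TrEMBL:"]` can be compared against 0 directly.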

🤖 Generated with [Claude Code](https://claude.com/claude-code)

Co-Authored-By: Claude <noreply@anthropic.com>
Following codebase patterns, replaced the hardcoded "TrEMBL:" string with a
proper constant for consistency and maintainability.

Changes:
- Added TREMBL_PREFIX = "TrEMBL:" to constants.py (line 453)
- Imported TREMBL_PREFIX in ontologies_transform.py
- Replaced hardcoded "TrEMBL:" with TREMBL_PREFIX constant (line 477)

This matches the existing pattern used for UNIPROT_PREFIX and other prefix
constants throughout the codebase.

🤖 Generated with [Claude Code](https://claude.com/claude-code)

Co-Authored-By: Claude <noreply@anthropic.com>
@realmarcin realmarcin merged commit e2861c4 into master Dec 23, 2025
0 of 3 checks passed
@realmarcin realmarcin deleted the trembl-fix branch December 23, 2025 06:01