Skip to content

nucmer gives no alignment with certain records in subject file #238

@peterjc

Description

@peterjc

This is a reproducible bug where certain sequences in the reference/subject file result in no alignments. It is one of three similar examples found in a set of fungal genome assembliess as per pyani-plus/pyani-plus#417 and affects both NUCmer (NUCleotide MUMmer) version 3.1 and version 4.0.1 as well. Tested on both macOS and Linux, with mummer installed from BioConda.

Setup

Download smaller query genome:

❯ curl -o "GCA_009806305.1.fna.gz" "https://www.ebi.ac.uk/ena/browser/api/fasta/GCA_009806305.1?download=true&gzip=true"
  % Total    % Received % Xferd  Average Speed   Time    Time     Time  Current
                                 Dload  Upload   Total   Spent    Left  Speed
100 11.6M    0 11.6M    0     0  1481k      0 --:--:--  0:00:08 --:--:-- 1554k

Download larger genome where last two records are problematic:

❯ curl -o "GCA_011745365.1.fna.gz" "https://www.ebi.ac.uk/ena/browser/api/fasta/GCA_011745365.1?download=true&gzip=true"
  % Total    % Received % Xferd  Average Speed   Time    Time     Time  Current
                                 Dload  Upload   Total   Spent    Left  Speed
100 25.9M    0 25.9M    0     0  1203k      0 --:--:--  0:00:22 --:--:-- 1165k

Checksums:

❯ md5sum GCA_*.fna.gz
6448363d6650a4a6a1db37000b1dde80  GCA_009806305.1.fna.gz
aa6bc2e22196bbdf26daaae2185fde88  GCA_011745365.1.fna.gz

Decompress and make subset of subject file without last two records:

❯ cat GCA_009806305.1.fna.gz | gunzip > query.fasta
❯ cat GCA_011745365.1.fna.gz | gunzip > subject59.fasta
❯ head -n 1461172 subject59.fasta > subject57.fasta
❯ grep -c "^>" query.fasta subject59.fasta subject57.fasta
query.fasta:35
subject59.fasta:59
subject57.fasta:57

Notice record 57 is complete:

❯ tail -n 2 subject57.fasta
TAAGATATTTATACCAAACATAATTAATTATTTTCTAAAATAAAGCTAGAATATTTATTA
AATTAAAGAATATAATTAAATATTTATAAT

Checksums:

❯ md5sum query.fasta subject59.fasta subject57.fasta
b028729332c5b2071eb3bba1be58e8ba  query.fasta
74227e57109fede619e4f4ae473bf926  subject59.fasta
5b53388c322b5ebadc028ebc989af913  subject57.fasta

Running mummer4

With just 57 references, under 30s:

❯ nucmer --version
4.0.1
❯ time nucmer -p query_vs_subject57 --mum subject57.fasta query.fasta
(no output)
❯ cat query_vs_subject57.delta
/private/tmp/mum_bug/subject57.fasta /private/tmp/mum_bug/query.fasta
NUCMER
>ENA|WHUS01000032|WHUS01000032.1 ENA|RAGY01000030|RAGY01000030.1 2728 43253
2 2728 24861 27587 416 416 0
-111
9
1132
19
-4
-3
3
-267
536
14
-48
4
104
-150
6
-124
3
-143
-31
-20
0

With the full 59 references, similar run time:

❯ nucmer -p query_vs_subject59 --mum subject59.fasta query.fasta
(no output)
❯ cat query_vs_subject59.delta
/private/tmp/mum_bug/subject59.fasta /private/tmp/mum_bug/query.fasta
NUCMER

Expected output - similar or the same as the above.

Metadata

Metadata

Assignees

No one assigned

    Labels

    No labels
    No labels

    Type

    No type

    Projects

    No projects

    Milestone

    No milestone

    Relationships

    None yet

    Development

    No branches or pull requests

    Issue actions