From 55d6bfd806cf8819d698d1034bf928a92e20fa57 Mon Sep 17 00:00:00 2001 From: Your Name Date: Sun, 24 Aug 2025 15:29:00 +0200 Subject: [PATCH 01/10] added no-reassembly feature --- README.md | 67 +++++++++++++++++++---------- src/main.rs | 107 +++++++++++++++++++++++++++++++--------------- src/reassemble.rs | 69 +++++++++++++++++++++++------- 3 files changed, 170 insertions(+), 73 deletions(-) diff --git a/README.md b/README.md index 1615f1b..1d99ae2 100644 --- a/README.md +++ b/README.md @@ -17,7 +17,16 @@ The output directory contains dereplicated bins, and a text file listing the com magmax -b -r -m -f fasta -t 24 magmax -b -r -m -f fasta -t 24 -q quality_report.tsv // if CheckM2 result is already available - magmax -b -r -m -f fasta -t 24 --split // if input bins are not already split by sample id + magmax -b -r -m -f fasta -t 24 --split // if input bins are not already split by sample id + + +## Dereplication without reassembly +MAGmax provides an option to perform dereplication without reassembly using the `--no-reassembly` flag. In this mode, MAGmax selects the best bin within each genomic cluster based on a quality score (defined as completeness - 5 * contamination) that also meets the user-defined completeness and contamination thresholds. When this option is enabled, only the bin directory (`-b`) is required as input. 
+ + magmax -b --no-reassembly -t 24 -f fasta + magmax -b --no-reassembly -f fasta -t 24 -q quality_report.tsv // if CheckM2 result is already available + magmax -b --no-reassembly -f fasta -t 24 --split // if input bins are not already split by sample id + ## Install ### Prerequisites @@ -62,40 +71,61 @@ Option 2: Build from source ## Options -b, --bindir Directory containing fasta files of bins + -r, --readdir + Directory containing read files + -m, --mapdir + Directory containing mapids files -i, --ani ANI for clustering bins (%) [default: 99] -c, --completeness Minimum completeness of bins (%) [default: 50] -p, --purity Minimum purity (1 - contamination) of bins (%) [default: 95] - -m, --mapdir - Directory containing mapids files - -r, --readdir - Directory containing read files -f, --format Bin file extension [default: fasta] -t, --threads Number of threads to use [default: 8] + --no-reassembly + Perform dereplication without bin merging and reassembly --split Split clusters into sample-wise bins before processing -q, --qual Quality file produced by CheckM2 (quality_report.tsv) --assembler - assembler choice for reassembly step (spades|megahit) [default: spades, recommended] + Assembler choice for reassembly step (spades|megahit), spades is recommended [default: spades] -h, --help Print help -V, --version Print version -### Test run using toy data +## Test run using toy data This example test run demonstrates dereplication of bins using the provided toy dataset. In the `test/bins` directory, example bins generated with MetaBAT2 are given. In the `test/reads` directory, paired-end read files for two samples are given and in the `test/mapids` directory, mapid files mapping reads to contigs for each sample are given. Precomputed CheckM2 quality scores for the input bins are given in the `test/quality_report.tsv`. 
Run the following command to execute the test: magmax -b test/bins -r test/reads -m test/mapids -t 24 -q test/quality_report.tsv +To run without reassembly, -## Notes -1. Input contigs should have id prefixed with the sample ID, separated by 'C', as commonly practiced in the single-sample and multi-sample binning. Perform mapping and binning on contig files with these updated contig ids. -2. Mapid files can be generated using aligner2counts (https://github.com/soedinglab/binning_benchmarking/tree/main/util#aligner2counts) with `only-mapids` option. + magmax -b test/bins --no-reassembly -t 24 -q test/quality_report.tsv // run dereplication without reassembly + +After running MAGmax, an output folder named `mags_50comp_95purity` will be created in the `test` directory. This folder contains the following files: + +- `bins_checkm2_qualities.tsv` — Table summarizing the quality metrics of the dereplicated bins. +- `sample_ERR3405607_metabat2_results.63.fasta` — Final bin obtained after dereplication of the input bins. + +## Input specifications +1. Input contigs must have IDs prefixed with the sample ID, separated by a `C`. This is a common practice for both single- and multi-sample binning. Ensure mapping and binning are performed on contig files with these updated contig IDs. + +2. Ensure that headers in the fastq files have read ID separated from sequencer details by a space or tab, not by a dot. This is important for `seqtk`, which is used by MAGmax, to fetch reads correctly. + + `Correct format: @SRR25448374.1 A00214R:157:HLMVMDSXY:1:1101:19868:1016:N:0.length=151#0/1` + + `Wrong format: @SRR25448374.1.A00214R:157:HLMVMDSXY:1:1101:19868:1016:N:0.length=151#0/1` + +To fix, use the below bash command + + sed -i -E 's/^(@[^.]+\.[^.]+)\./\1 /' read.fastq + +3. Mapid files can be created using [`aligner2counts`] (https://github.com/soedinglab/binning_benchmarking/tree/main/util#aligner2counts) with the `only-mapids` option. 
An example file format is given below, File name: `_mapids` ``` @@ -107,18 +137,9 @@ This example test run demonstrates dereplication of bins using the provided toy read4_id Ccontig4_id ``` -3. If input bins are not separated by sample IDs, such as when using MetaBAT2 or COMEBin on a concatenated set of contigs, use the `--split` option to automatically separate input bin by sample IDs. -4. Make sure that headers in the read fastq files have read_id separated by space/tab (not by `.`) from other sequencer details. This is important for `seqtk` to fetch reads correctly. - - `Correct format: @SRR25448374.1 A00214R:157:HLMVMDSXY:1:1101:19868:1016:N:0.length=151#0/1` - - `Wrong format: @SRR25448374.1.A00214R:157:HLMVMDSXY:1:1101:19868:1016:N:0.length=151#0/1` - -When read ids are not seperated by space in the headers, run the below script and use the updated read file for mapping. - - sed -i -E 's/^(@[^.]+\.[^.]+)\./\1 /' read.fastq +4. FASTQ and MAPID filenames must contain the sample ID (e.g., SRR25448374.fastq, SRR25448374_mapids). This is the default unless filenames are renamed manually. -MAGma works for paired-end (in separate files: SRR25448374_1.fastq and SRR25448374_2.fastq) and single-end read files. +## Notes +1. If input bins are not separated by sample IDs (e.g., when using MetaBAT2 or COMEBin on concatenated contigs), use the `--split` option to let MAGmax automatically separate bins by sample ID. -5. Sample IDs must be in the file name of fastq and mapid files. (E.g., SRR25448374_1.fastq & SRR25448374_2.fastq or SRR25448374.fastq and SRR25448374_mapids) -6. We recommend Spades for reassembly which produces bins with higher purity than bins assembled using Megahit. +2. We recommend Spades for reassembly which produces bins with higher purity than bins assembled using Megahit. 
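Reviewer sketch (not part of the patch): the README text added in this patch defines the per-bin quality score as completeness - 5 * contamination and says the selected bin must also meet the completeness and contamination thresholds. The standalone Rust sketch below illustrates that selection rule on made-up numbers. The `BinQuality` struct, the `best_bin` function, and the bin names are hypothetical and chosen only for illustration; the contamination filter follows the README wording, and preferring the lower contamination on equal scores is an assumption of this sketch.

```rust
use std::cmp::Ordering;
use std::collections::HashMap;

/// Hypothetical per-bin quality record (values in percent).
#[derive(Debug, Clone, Copy)]
struct BinQuality {
    completeness: f64,
    contamination: f64,
}

/// Quality score as worded in the README: completeness - 5 * contamination.
fn quality_score(q: &BinQuality) -> f64 {
    q.completeness - 5.0 * q.contamination
}

/// Pick the best bin of one cluster among those passing both thresholds.
/// Equal scores are resolved in favour of the lower contamination (assumption).
fn best_bin<'a>(
    cluster: &[&'a str],
    qualities: &HashMap<&'a str, BinQuality>,
    min_completeness: f64,
    max_contamination: f64,
) -> Option<&'a str> {
    cluster
        .iter()
        // Bins without a quality record are skipped.
        .filter_map(|name| qualities.get(name).map(|q| (*name, *q)))
        .filter(|(_, q)| {
            q.completeness >= min_completeness && q.contamination <= max_contamination
        })
        .max_by(|(_, a), (_, b)| {
            quality_score(a)
                .partial_cmp(&quality_score(b))
                .unwrap_or(Ordering::Equal)
                .then_with(|| {
                    // Reversed comparison: lower contamination ranks higher.
                    b.contamination
                        .partial_cmp(&a.contamination)
                        .unwrap_or(Ordering::Equal)
                })
        })
        .map(|(name, _)| name)
}

fn main() {
    let mut qualities = HashMap::new();
    qualities.insert("binA", BinQuality { completeness: 90.0, contamination: 2.0 });
    qualities.insert("binB", BinQuality { completeness: 95.0, contamination: 4.0 });
    qualities.insert("binC", BinQuality { completeness: 99.0, contamination: 6.0 });
    qualities.insert("binD", BinQuality { completeness: 40.0, contamination: 0.0 });

    // Default thresholds from the README: 50% completeness, 95% purity (5% contamination).
    let cluster = ["binA", "binB", "binC", "binD"];
    println!("{:?}", best_bin(&cluster, &qualities, 50.0, 5.0));
}
```

With these numbers `binC` and `binD` fail a threshold, and `binA` (score 80) is preferred over `binB` (score 75). Note that the `find_bestqualitybin` helper introduced later in this patch filters on completeness only; whether the contamination threshold is applied elsewhere is not visible in this diff.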
diff --git a/src/main.rs b/src/main.rs index a00bef1..ab67397 100644 --- a/src/main.rs +++ b/src/main.rs @@ -20,10 +20,15 @@ mod reassemble; // check for valid input paths fn validate_paths(cli: &Cli) -> io::Result<(PathBuf, PathBuf, PathBuf)> { let bindir = utility::validate_path(Some(&cli.bindir), "bindir", &cli.format); - let mapdir = utility::validate_path(Some(&cli.mapdir), "mapdir", "_mapids"); - let readdir = utility::validate_path(Some(&cli.readdir), "readdir", ".fastq"); - Ok((bindir.to_path_buf(), mapdir.to_path_buf(), readdir.to_path_buf())) + if cli.no_reassembly { + Ok((bindir.to_path_buf(), PathBuf::new(), PathBuf::new())) + } else { + let mapdir = utility::validate_path(cli.mapdir.as_ref(), "mapdir", "_mapids"); + let readdir = utility::validate_path(cli.readdir.as_ref(), "readdir", ".fastq"); + Ok((bindir.to_path_buf(), mapdir.to_path_buf(), readdir.to_path_buf())) + } + } #[derive(Parser)] @@ -34,12 +39,14 @@ struct Cli { bindir: PathBuf, /// Directory containing read files - #[arg(short = 'r', long = "readdir", help = "Directory containing read files")] - readdir: PathBuf, + #[arg(short = 'r', long = "readdir", help = "Directory containing read files", + requires_if("false", "no_reassembly"))] + readdir: Option<PathBuf>, /// Directory containing mapids files derived from alignment sam/bam files - #[arg(short = 'm', long = "mapdir", help = "Directory containing mapids files")] - mapdir: PathBuf, + #[arg(short = 'm', long = "mapdir", help = "Directory containing mapids files", + requires_if("false", "no_reassembly"))] + mapdir: Option<PathBuf>, /// Average Nucleotide Identity cutoff #[arg(short = 'i', long = "ani", default_value_t = 99.0, help = "ANI for clustering bins (%)")] @@ -60,6 +67,11 @@ struct Cli { /// Number of threads to use #[arg(short = 't', long = "threads", default_value_t = 8, help = "Number of threads to use")] threads: usize, + + /// Disable reassembly step + #[arg(long = "no-reassembly", help = "Perform dereplication without bin merging and 
reassembly", + conflicts_with_all = ["readdir", "mapdir"])] + no_reassembly: bool, /// First split bins before merging (if provided, set to true) #[arg(long = "split", help = "Split clusters into sample-wise bins before processing")] @@ -70,7 +82,7 @@ struct Cli { qual: Option, /// Assembler choice - #[arg(long = "assembler", default_value = "spades", help = "assembler choice for reassembly step (spades|megahit), spades is recommended")] + #[arg(long = "assembler", default_value = "spades", help = "Assembler choice for reassembly step (spades|megahit), spades is recommended")] assembler: String, } @@ -90,34 +102,31 @@ fn main() -> io::Result<()> { let split = cli.split; let assembler: String = cli.assembler; let qual = cli.qual; + let no_reassembly = cli.no_reassembly; let parentdir = bindir.parent().map(PathBuf::from).unwrap_or_else(|| bindir.clone()); - info!("Starting MAGma with parameters:"); + info!("Starting MAGmax with parameters:"); info!(" 🔹 Bins Directory: {:?}", bindir); info!(" 🔹 ANI Cutoff: {:.1}%", cli.ani); info!(" 🔹 Completeness Cutoff: {:.1}%", cli.completeness_cutoff); info!(" 🔹 Purity/Contamination: {:.1}%/{:.1}%", cli.purity_cutoff, contamination_cutoff); - info!(" 🔹 Map Directory: {:?}", mapdir); - info!(" 🔹 Read Directory: {:?}", readdir); info!(" 🔹 File Format: {}", format); info!(" 🔹 Threads: {}", threads); - info!(" 🔹 Assembler: {}", assembler); - if !["spades", "megahit"].contains(&assembler.as_str()) { - error!("Error: Invalid assembler choice '{}'. Allowed options: 'spades' or 'megahit'.", assembler); - exit(1); + if !no_reassembly { + info!(" 🔹 Map Directory: {:?}", mapdir); + info!(" 🔹 Read Directory: {:?}", readdir); + info!(" 🔹 Assembler: {}", assembler); + if !["spades", "megahit"].contains(&assembler.as_str()) { + error!("Error: Invalid assembler choice '{}'. 
Allowed options: 'spades' or 'megahit'.", assembler); + exit(1); + } } - let is_paired: bool = utility::check_paired_reads(&readdir); - if is_paired { - info!("Detected paired end \ - reads in separate files as \ - _1.fastq \ - and _2.fastq.") - } else { - info!("Detected single-end reads as .fastq.") + if no_reassembly{ + info!(" 🔸 MAGmax runs dereplication of input bins without bin merging and reassembly"); } - + let binfiles = utility::get_binfiles(&bindir,&format)?; if binfiles.is_empty() { @@ -271,10 +280,10 @@ fn main() -> io::Result<()> { &format, &bin_qualities, &merged_bin_quality, - is_paired, &assembler, completeness_cutoff, contamination_cutoff, + no_reassembly, id, ) .map_err(|e| { @@ -284,19 +293,22 @@ fn main() -> io::Result<()> { }) .expect("Error during processing components"); }); - - { - let merged_bin_qualities = merged_bin_qualities.lock().unwrap(); - for (key, value) in merged_bin_qualities.iter() { - bin_qualities.insert(key.to_string(), value.clone()); + if !no_reassembly { + { + let merged_bin_qualities = merged_bin_qualities.lock().unwrap(); + + for (key, value) in merged_bin_qualities.iter() { + bin_qualities.insert(key.to_string(), value.clone()); + } } + } - + // Final dereplication using skani let _ = merge::drep_finalbins(&resultdir, &bin_qualities, ani_cutoff); - info!("MAGma is successfully completed!"); + info!("MAGmax is successfully completed!"); Ok(()) } @@ -312,10 +324,10 @@ fn process_components( format: &String, bin_qualities: &HashMap, merged_bin_quality: &Arc>>, - is_paired: bool, assembler: &String, completeness_cutoff: f64, contamination_cutoff: f64, + no_reassembly: bool, id: usize, ) -> io::Result<()> { @@ -342,6 +354,34 @@ fn process_components( return Ok(()); } + // If no_reassembly is enabled, just select the best quality bin from each cluster + if no_reassembly { + let mut selected_bin: Option = None; + if let Some((bin_name, _, _)) = + reassemble::find_bestqualitybin(component, &bin_qualities, completeness_cutoff) 
+ { + selected_bin = Some(bin_name); + } + + let _ = reassemble::select_bestqualitybin( + selected_bin, + bindir, + resultdir, + format + ); + + return Ok(()); + } + + let is_paired: bool = utility::check_paired_reads(&readdir); + if is_paired { + info!("Detected paired end \ + reads in separate files as \ + _1.fastq \ + and _2.fastq.") + } else { + info!("Detected single-end reads as .fastq.") + } // eg. selected_binset_path = /0_combined/ let selected_binset_path = resultdir.join(format!("{}_combined", id)); @@ -367,7 +407,6 @@ fn process_components( let all_enriched_scaffolds = utility::read_fasta( &selected_binset_path.join("combined.fasta").to_string_lossy() )?; - let scaffold_inputname:&str = "combined"; diff --git a/src/reassemble.rs b/src/reassemble.rs index d61d3be..ef5c3b4 100644 --- a/src/reassemble.rs +++ b/src/reassemble.rs @@ -53,27 +53,36 @@ pub fn run_reassembly( let mut selected_contamination: Option = None; // Find the best bin based on quality score within the cluster - if let Some((bin_name, completeness, contamination)) = component - .iter() - .filter_map(|bin| { - bin_qualities.get(bin).map(|quality| (bin, quality.completeness, quality.contamination)) - }) - .filter(|(_, completeness, _)| *completeness >= completeness_cutoff) - .max_by(|(_, completeness1, contamination1), (_, completeness2, contamination2)| { + // if let Some((bin_name, completeness, contamination)) = component + // .iter() + // .filter_map(|bin| { + // bin_qualities.get(bin).map(|quality| (bin, quality.completeness, quality.contamination)) + // }) + // .filter(|(_, completeness, _)| *completeness >= completeness_cutoff) + // .max_by(|(_, completeness1, contamination1), (_, completeness2, contamination2)| { - let score1 = completeness1 - (5.0 * contamination1); - let score2 = completeness2 - (5.0 * contamination2); + // let score1 = completeness1 - (5.0 * contamination1); + // let score2 = completeness2 - (5.0 * contamination2); - score1 - .partial_cmp(&score2) - 
.unwrap_or(std::cmp::Ordering::Equal) - .then_with(|| contamination1.partial_cmp(contamination2).unwrap_or(std::cmp::Ordering::Equal).reverse()) - }) + // score1 + // .partial_cmp(&score2) + // .unwrap_or(std::cmp::Ordering::Equal) + // .then_with(|| contamination1.partial_cmp(contamination2).unwrap_or(std::cmp::Ordering::Equal).reverse()) + // }) + // { + // selected_bin = Some(bin_name.to_string()); + // selected_completeness = Some(completeness); + // selected_contamination = Some(contamination); + // } + + if let Some((bin_name, completeness, contamination)) = + find_bestqualitybin(component, &bin_qualities, completeness_cutoff) { - selected_bin = Some(bin_name.to_string()); + selected_bin = Some(bin_name); selected_completeness = Some(completeness); selected_contamination = Some(contamination); } + let selected_quality_score = selected_completeness .zip(selected_contamination) .map(|(completeness, contamination)| completeness - (5.0 * contamination)) @@ -259,8 +268,36 @@ fn filterscaffold(input_file: &PathBuf) -> io::Result<()> { Ok(()) } +// Find the best bin from a cluster based on the quality score +pub fn find_bestqualitybin( + component: &HashSet, + bin_qualities: &HashMap, + completeness_cutoff: f64, +) -> Option<(String, f64, f64)> { + component + .iter() + .filter_map(|bin| { + bin_qualities.get(bin).map(|quality| { + (bin.clone(), quality.completeness, quality.contamination) + }) + }) + .filter(|(_, completeness, _)| *completeness >= completeness_cutoff) + .max_by(|(_, completeness1, contamination1), (_, completeness2, contamination2)| { + let score1 = completeness1 - (5.0 * contamination1); + let score2 = completeness2 - (5.0 * contamination2); + + score1 + .partial_cmp(&score2) + .unwrap_or(std::cmp::Ordering::Equal) + .then_with(|| contamination1 + .partial_cmp(contamination2) + .unwrap_or(std::cmp::Ordering::Equal) + .reverse()) + }) +} + // Select the best bin among the cluster members and reassembled bin -fn select_bestqualitybin( +pub fn 
select_bestqualitybin( selected_bin: Option, bindir: &PathBuf, outputpath: &PathBuf, From df55e76beba9c1f90a8537b2aff87cd5715e5e4c Mon Sep 17 00:00:00 2001 From: Your Name Date: Sun, 24 Aug 2025 15:29:26 +0200 Subject: [PATCH 02/10] added no-reassembly feature --- environment.yml | 3 --- 1 file changed, 3 deletions(-) diff --git a/environment.yml b/environment.yml index 5ea2a12..ffc7c1c 100644 --- a/environment.yml +++ b/environment.yml @@ -8,7 +8,4 @@ dependencies: - megahit - spades - seqtk - - checkm2 - pip - - pip: - - checkm2 From 1c3d0ad13ce9637cf46602381a224ff92a996d53 Mon Sep 17 00:00:00 2001 From: yazhinia Date: Sun, 24 Aug 2025 15:50:31 +0200 Subject: [PATCH 03/10] added no-reassembly feature --- README.md | 2 +- 1 file changed, 1 insertion(+), 1 deletion(-) diff --git a/README.md b/README.md index 1d99ae2..afaf470 100644 --- a/README.md +++ b/README.md @@ -125,7 +125,7 @@ To fix, use the below bash command sed -i -E 's/^(@[^.]+\.[^.]+)\./\1 /' read.fastq -3. Mapid files can be created using [`aligner2counts`] (https://github.com/soedinglab/binning_benchmarking/tree/main/util#aligner2counts) with the `only-mapids` option. An example file format is given below, +3. Mapid files can be created using `aligner2counts` (https://github.com/soedinglab/binning_benchmarking/tree/main/util#aligner2counts) with the `only-mapids` option. An example file format is given below, File name: `_mapids` ``` From 8cc3eaaaf22d70230bfe246140c574e9417a7e1f Mon Sep 17 00:00:00 2001 From: yazhinia Date: Sun, 24 Aug 2025 16:49:17 +0200 Subject: [PATCH 04/10] update --- README.md | 11 ++++++----- 1 file changed, 6 insertions(+), 5 deletions(-) diff --git a/README.md b/README.md index afaf470..79908b5 100644 --- a/README.md +++ b/README.md @@ -1,7 +1,7 @@ # MAGmax MAGmax is a dereplication tool designed to maximize the recovery of Metagenome-Assembled Genomes (MAGs) through bin Merging and reAssembly. 
It performs dereplication in three stages: (i) grouping bins based on average sequence identity, (ii) merging bins within each group, and (iii) reassembling the merged bins. -## INPUTS +## Inputs MAGmax requires three input directories, 1. `binsdir`, directory containing bin files in FASTA format that need to be dereplicated. (e.g., output files from any metagenome binning tool) @@ -9,11 +9,12 @@ MAGmax requires three input directories, 3. `mapid_dir`, directory containing mapping files for each sample. Each file is a text file listing read IDs and the corresponding contig IDs they mapped to. These files are used to retrieve reads that map to each merged bin from the FASTQ files in `readdir` and to generate new bin-specific FASTQ files for reassembly. -## OUTPUT -An output directory named `mags_comp_purity` will be created, where `x` and `y` correspond to the user-specified completeness and purity thresholds used to select final bins. By default, MAGmax uses a percentage of 50 for completeness and 95 for purity. +## Outputs +An output directory named `mags_comp_purity` will be created, where `x` and `y` correspond to the user-specified completeness and purity thresholds used to select final bins. By default, MAGmax uses a percentage of 50 for completeness and 95 for purity. + The output directory contains dereplicated bins, and a text file listing the completeness and contamination scores for each bin as calculated by CheckM2. -### Example command line call +## Example command line call magmax -b -r -m -f fasta -t 24 magmax -b -r -m -f fasta -t 24 -q quality_report.tsv // if CheckM2 result is already available @@ -28,7 +29,7 @@ MAGmax provides an option to peform dereplication without reassembly using `--no magmax -b --no-reassembly -f fasta -t 24 --split // if input bins are not already split by sample id -## Install +## Installation ### Prerequisites - **Rust**: Follow the instructions [here](https://www.rust-lang.org/tools/install) to install Rust. 
From 611a36e5e1199502b4cd49df1db36d6a6a3247ca Mon Sep 17 00:00:00 2001 From: yazhinia Date: Sun, 24 Aug 2025 16:51:22 +0200 Subject: [PATCH 05/10] update --- README.md | 2 +- 1 file changed, 1 insertion(+), 1 deletion(-) diff --git a/README.md b/README.md index 79908b5..f790192 100644 --- a/README.md +++ b/README.md @@ -24,7 +24,7 @@ The output directory contains dereplicated bins, and a text file listing the com ## Dereplication without reassembly MAGmax provides an option to perform dereplication without reassembly using the `--no-reassembly` flag. In this mode, MAGmax selects the best bin within each genomic cluster based on a quality score (defined as completeness - 5 * contamination) that also meets the user-defined completeness and contamination thresholds. When this option is enabled, only the bin directory (`-b`) is required as input. - magmax -b --no-reassembly -t 24 -f fasta + magmax -b --no-reassembly -f fasta -t 24 magmax -b --no-reassembly -f fasta -t 24 -q quality_report.tsv // if CheckM2 result is already available magmax -b --no-reassembly -f fasta -t 24 --split // if input bins are not already split by sample id From 670333add74fce9fabda335d0858be44acedd213 Mon Sep 17 00:00:00 2001 From: yazhini Date: Tue, 26 Aug 2025 16:29:35 +0200 Subject: [PATCH 06/10] update --- README.md | 24 ------------------------ 1 file changed, 24 deletions(-) diff --git a/README.md b/README.md index 41cc56d..42bd629 100644 --- a/README.md +++ b/README.md @@ -20,18 +20,6 @@ The output directory contains dereplicated bins, and a text file listing the com magmax -b -r -m -f fasta -t 24 -q quality_report.tsv // if CheckM2 result is already available magmax -b -r -m -f fasta -t 24 --split // if input bins are not already split by sample id -<<<<<<< HEAD -======= - -## Dereplication without reassembly -MAGmax provides an option to perform dereplication without reassembly using the `--no-reassembly` flag. 
In this mode, MAGmax selects the best bin within each genomic cluster based on a quality score (defined as completeness - 5 * contamination) that also meets the user-defined completeness and contamination thresholds. When this option is enabled, only the bin directory (`-b`) is required as input. - - magmax -b --no-reassembly -f fasta -t 24 - magmax -b --no-reassembly -f fasta -t 24 -q quality_report.tsv // if CheckM2 result is already available - magmax -b --no-reassembly -f fasta -t 24 --split // if input bins are not already split by sample id - - ->>>>>>> origin/no-reassembly ## Installation ### Prerequisites @@ -108,21 +96,9 @@ This example test run demonstrates dereplication of bins using the provided toy magmax -b test/bins -r test/reads -m test/mapids -t 24 -q test/quality_report.tsv After running MAGmax, an output folder named `mags_50comp_95purity` will be created in the `test` directory. This folder contains the following files: -<<<<<<< HEAD -- `bins_checkm2_qualities.tsv` — Table summarizing the quality metrics of the dereplicated bins. -- `sample_ERR3405607_metabat2_results.63.fasta` — Final bin obtained after dereplication of the input bins. - -======= -To run without reassembly, - - magmax -b test/bins --no-reassembly -t 24 -q test/quality_report.tsv // run dereplication without reassembly - -After running MAGmax, an output folder named `mags_50comp_95purity` will be created in the `test` directory. This folder contains the following files: - - `bins_checkm2_qualities.tsv` — Table summarizing the quality metrics of the dereplicated bins. - `sample_ERR3405607_metabat2_results.63.fasta` — Final bin obtained after dereplication of the input bins. ->>>>>>> origin/no-reassembly ## Input specifications 1. Input contigs must have IDs prefixed with the sample ID, separated by a `C`. This is a common practice for both single- and multi-sample binning. Ensure mapping and binning are performed on contig files with these updated contig IDs. 
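Reviewer sketch (not part of the patch): item 2 of the same Input specifications section (introduced in PATCH 01) asks for FASTQ headers in which the read ID is separated from the sequencer details by a space or tab, and gives a `sed` one-liner for repairing dot-separated headers. The Rust function below is a minimal sketch of that same substitution for a single line, using only the standard library. The name `fix_fastq_header` is hypothetical and is not part of MAGmax.

```rust
/// Per-line equivalent of `sed -E 's/^(@[^.]+\.[^.]+)\./\1 /'`:
/// on a line starting with '@', replace the second '.' with a space.
/// Lines that do not match the pattern are returned unchanged.
fn fix_fastq_header(line: &str) -> String {
    if !line.starts_with('@') {
        return line.to_string();
    }
    // First '.', with at least one character between '@' and the dot.
    let first = match line.find('.') {
        Some(i) if i > 1 => i,
        _ => return line.to_string(),
    };
    // Second '.', with at least one character after the first dot.
    match line[first + 1..].find('.') {
        Some(j) if j > 0 => {
            let second = first + 1 + j;
            format!("{} {}", &line[..second], &line[second + 1..])
        }
        _ => line.to_string(),
    }
}

fn main() {
    let wrong = "@SRR25448374.1.A00214R:157:HLMVMDSXY:1:1101:19868:1016:N:0.length=151#0/1";
    println!("{}", fix_fastq_header(wrong));
}
```

Like the `sed` expression, the sketch rewrites the second dot it finds on any line starting with `@`, so it is intended only for files that are still in the dot-separated format, and it does not distinguish header lines from quality lines that happen to begin with `@`.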
From 818e30d0e00789a0745b8a51081b042dd2a14b34 Mon Sep 17 00:00:00 2001 From: yazhini Date: Tue, 26 Aug 2025 16:30:45 +0200 Subject: [PATCH 07/10] update --- README.md | 15 +++++++++++++++ 1 file changed, 15 insertions(+) diff --git a/README.md b/README.md index 42bd629..b5a7df9 100644 --- a/README.md +++ b/README.md @@ -20,6 +20,15 @@ The output directory contains dereplicated bins, and a text file listing the com magmax -b -r -m -f fasta -t 24 -q quality_report.tsv // if CheckM2 result is already available magmax -b -r -m -f fasta -t 24 --split // if input bins are not already split by sample id + +## Dereplication without reassembly +MAGmax provides an option to perform dereplication without reassembly using the `--no-reassembly` flag. In this mode, MAGmax selects the best bin within each genomic cluster based on a quality score (defined as completeness - 5 * contamination) that also meets the user-defined completeness and contamination thresholds. When this option is enabled, only the bin directory (`-b`) is required as input. + + magmax -b --no-reassembly -f fasta -t 24 + magmax -b --no-reassembly -f fasta -t 24 -q quality_report.tsv // if CheckM2 result is already available + magmax -b --no-reassembly -f fasta -t 24 --split // if input bins are not already split by sample id + + ## Installation ### Prerequisites @@ -96,6 +105,12 @@ This example test run demonstrates dereplication of bins using the provided toy magmax -b test/bins -r test/reads -m test/mapids -t 24 -q test/quality_report.tsv After running MAGmax, an output folder named `mags_50comp_95purity` will be created in the `test` directory. This folder contains the following files: +To run without reassembly, + + magmax -b test/bins --no-reassembly -t 24 -q test/quality_report.tsv // run dereplication without reassembly + +After running MAGmax, an output folder named `mags_50comp_95purity` will be created in the `test` directory. 
This folder contains the following files: + - `bins_checkm2_qualities.tsv` — Table summarizing the quality metrics of the dereplicated bins. - `sample_ERR3405607_metabat2_results.63.fasta` — Final bin obtained after dereplication of the input bins. From db610abe3b0bf5a9c5239a42b20458911d99a9a2 Mon Sep 17 00:00:00 2001 From: yazhini Date: Tue, 26 Aug 2025 16:32:31 +0200 Subject: [PATCH 08/10] update --- README.md | 1 - 1 file changed, 1 deletion(-) diff --git a/README.md b/README.md index b5a7df9..c74ed61 100644 --- a/README.md +++ b/README.md @@ -103,7 +103,6 @@ Option 2: Build from source This example test run demonstrates dereplication of bins using the provided toy dataset. In the `test/bins` directory, example bins generated with MetaBAT2 are given. In the `test/reads` directory, paired-end read files for two samples are given and in the `test/mapids` directory, mapid files mapping reads to contigs for each sample are given. Precomputed CheckM2 quality scores for the input bins are given in the `test/quality_report.tsv`. Run the following command to execute the test: magmax -b test/bins -r test/reads -m test/mapids -t 24 -q test/quality_report.tsv -After running MAGmax, an output folder named `mags_50comp_95purity` will be created in the `test` directory. This folder contains the following files: To run without reassembly, From 2f627beac92db8fe99d5f2693a26a36f842169db Mon Sep 17 00:00:00 2001 From: yazhini Date: Tue, 26 Aug 2025 17:25:33 +0200 Subject: [PATCH 09/10] update --- README.md | 6 +++--- 1 file changed, 3 insertions(+), 3 deletions(-) diff --git a/README.md b/README.md index c74ed61..054976e 100644 --- a/README.md +++ b/README.md @@ -3,11 +3,11 @@ MAGmax is a dereplication tool designed to maximize the recovery of Metagenome-A ## Inputs MAGmax requires three input directories, -1. `binsdir`, directory containing bin files in FASTA format that need to be dereplicated. (e.g., output files from any metagenome binning tool) +1. 
``, directory containing bin files in FASTA format that need to be dereplicated. (e.g., output files from any metagenome binning tool) -2. `readdir`, directory containing read files in FASTQ format for each sample. +2. ``, directory containing read files in FASTQ format for each sample. -3. `mapid_dir`, directory containing mapping files for each sample. Each file is a text file listing read IDs and the corresponding contig IDs they mapped to. These files are used to retrieve reads that map to each merged bin from the FASTQ files in `readdir` and to generate new bin-specific FASTQ files for reassembly. +3. ``, directory containing mapping files for each sample. Each file is a text file listing read IDs and the corresponding contig IDs they mapped to. These files are used to retrieve reads that map to each merged bin from the FASTQ files in `readdir` and to generate new bin-specific FASTQ files for reassembly. ## Outputs An output directory named `mags_comp_purity` will be created, where `x` and `y` correspond to the user-specified completeness and purity thresholds used to select final bins. By default, MAGmax uses a percentage of 50 for completeness and 95 for purity. From 82a3c57e111ba80c91e261616458ee82c9303791 Mon Sep 17 00:00:00 2001 From: Yazhini Date: Tue, 26 Aug 2025 17:32:44 +0200 Subject: [PATCH 10/10] Update README.md --- README.md | 1 + 1 file changed, 1 insertion(+) diff --git a/README.md b/README.md index 054976e..226ccc7 100644 --- a/README.md +++ b/README.md @@ -1,5 +1,6 @@ # MAGmax MAGmax is a dereplication tool designed to maximize the recovery of Metagenome-Assembled Genomes (MAGs) through bin Merging and reAssembly. It performs dereplication in three stages: (i) grouping bins based on average sequence identity, (ii) merging bins within each group, and (iii) reassembling the merged bins. +![MAGmax](https://github.com/user-attachments/assets/802387bf-ae34-48b5-963f-978a0e2d10d5) ## Inputs MAGmax requires three input directories,