The MGnify genomes generation pipeline (GGP) produces prokaryotic and eukaryotic metagenome-assembled genomes (MAGs) from raw reads and their corresponding assemblies.
This pipeline does not support co-binning and has so far only been tested on short reads.
The pipeline performs the following tasks:
Pre-processing:
- Sanity-checks raw reads with seqkit.
- Renames raw-read identifiers to the corresponding assembly identifiers (this helps trace back which contigs were used to build a particular bin/MAG).
- Changes all dots to underscores in contig headers.
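The header clean-up in the last pre-processing step can be approximated with `sed`. This is only a sketch, assuming plain (uncompressed) FASTA input; it is not necessarily how the pipeline implements the step, and the contig names and file names are hypothetical.

```shell
# Hypothetical example input: a tiny FASTA file with dots in the contig headers.
tmpdir=$(mktemp -d)
printf '>ERZ1.contig.1\nACGT\n>ERZ1.contig.2\nGGCC\n' > "$tmpdir/contigs.fasta"

# Replace every dot with an underscore, but only on header lines (those starting with '>').
sed '/^>/ s/\./_/g' "$tmpdir/contigs.fasta" > "$tmpdir/contigs.renamed.fasta"

# Show the renamed file.
cat "$tmpdir/contigs.renamed.fasta"
```

Restricting the substitution to header lines leaves the sequence lines untouched.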
Data processing:
- Quality trims the reads and removes adapters using fastp.
- Runs a decontamination step using BWA to remove any host reads. By default, it uses hg38.fna.
- Bins the contigs using CONCOCT, MetaBAT2 and MaxBin2.
For prokaryotes:
- Refines the bins using the metaWRAP `bin_refinement` compatible subworkflow, which is supported separately.
- Conducts bin quality control with CAT, GUNC, and CheckM.
- Performs dereplication with dRep.
- Calculates coverage using MetaBAT2-calculated depths.
- Detects rRNA and tRNA using cmsearch.
- Assigns taxonomy with GTDB-Tk.
For eukaryotes:
- Estimates quality and merges bins using EukCC.
- Dereplicates MAGs using dRep.
- Calculates coverage using MetaBAT2-calculated depths.
- Assesses quality with BUSCO and EukCC.
- Assigns taxonomy with BAT.
Final steps:
- Tool versions are available in `software_versions.yml`.
- MultiQC report.
Optional steps:
- Uploads MAGs to ENA using the public MAG uploader. Applicable only if assemblies and reads were downloaded from ENA.
Requirements:
- Nextflow
- Docker/Singularity
You need to download the databases listed below and specify them as inputs to parameters (check `nextflow.config`).
- BUSCO
- CAT
- CheckM
- EukCC
- GUNC
- GTDB-Tk + ar53_metadata_r*.tsv, bac120_metadata_r*.tsv from here
- Rfam
- A reference genome of your choice for decontamination, for example the human genome hg38.
Note
If you want to use the pipeline on ENA data, follow these instructions. Otherwise, download your data and organise it in the recommended format described below.
Each row of the samplesheet corresponds to a specific dataset and contains:
- row identifier (`id`)
- paths to the raw reads files (`fastq_1` and `fastq_2`)
- assembly identifier (`assembly_accession`)
- the file path to the contigs file (`assembly`)
Additionally, an optional column `assembler` contains information about the tool and version used to produce the assembly.
| id | fastq_1 | fastq_2 | assembly_accession | assembly | assembler [optional] |
|---|---|---|---|---|---|
| ID | /path/to/RUN_1.fastq.gz | /path/to/RUN_2.fastq.gz | ASSEMBLY | /path/to/ASSEMBLY.fasta | metaspades_v3.15.5 |
There is an example here.
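The samplesheet layout described above can be written and checked from the command line. This is a minimal sketch, assuming a comma-separated file whose header matches the column names in the table; the identifiers and paths are the placeholders from the table, and `check_samplesheet` is a hypothetical helper, not part of the pipeline.

```shell
# Work in a scratch directory so nothing in the current directory is overwritten.
workdir=$(mktemp -d)

# Write a hypothetical samplesheet; all identifiers and paths are placeholders.
cat > "$workdir/samplesheet.csv" <<'EOF'
id,fastq_1,fastq_2,assembly_accession,assembly,assembler
ID,/path/to/RUN_1.fastq.gz,/path/to/RUN_2.fastq.gz,ASSEMBLY,/path/to/ASSEMBLY.fasta,metaspades_v3.15.5
EOF

# Report data rows where any of the five required columns is empty.
# The sixth column (assembler) is optional and is not checked.
check_samplesheet() {
    awk -F',' 'BEGIN { bad = 0 }
    NR > 1 {
        for (i = 1; i <= 5; i++)
            if ($i == "") { printf "row %d: column %d is empty\n", NR, i; bad = 1 }
    }
    END { exit bad }' "$1"
}

check_samplesheet "$workdir/samplesheet.csv" && echo "samplesheet looks complete"
```

The check only looks for empty required fields; it does not verify that the listed files exist.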
nextflow run ebi-metagenomics/genomes-generation \
-profile <profile(s)> \
--samplesheet samplesheet.csv \
--outdir <full path to output directory>
Optional arguments:
- `--skip_preprocessing_input` (default=false): skip the input data pre-processing step
- `--skip_prok` (default=false): do not generate prokaryotic MAGs
- `--skip_euk` (default=false): do not generate eukaryotic MAGs
- `--skip_concoct` (default=false): skip the CONCOCT binner in the binning process
- `--skip_maxbin2` (default=false): skip the MaxBin2 binner in the binning process
- `--skip_metabat2` (default=false): skip the MetaBAT2 binner in the binning process
- `--merge_pairs` (default=false): merge paired-end reads during the QC step with fastp
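The optional flags are appended to the base command. The sketch below assembles a prokaryote-only run without CONCOCT and prints it for review. It assumes the usual Nextflow behaviour that a bare `--flag` sets a boolean parameter to true; the profile name and paths are hypothetical placeholders.

```shell
# Hypothetical placeholders; replace with a profile defined for your system and real paths.
PROFILE="your_profile"
SAMPLESHEET="samplesheet.csv"
OUTDIR="/full/path/to/output"

# Prokaryote-only run without CONCOCT: each skip flag is passed without a value.
cmd="nextflow run ebi-metagenomics/genomes-generation -profile $PROFILE --samplesheet $SAMPLESHEET --outdir $OUTDIR --skip_euk --skip_concoct"

# Print the command for review before launching it.
echo "$cmd"
```

Printing the command first makes it easy to confirm the flags before starting a long run.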
unclassified_genomes.txt
bins
--- eukaryotes
------- run_accession
----------- bins.fa
--- prokaryotes
------- run_accession
----------- bins.fa
coverage
--- eukaryotes
------- coverage
----------- aggregated_contigs2bins.txt
------- run_accession_***_coverage
----------- coverage.tab
----------- ***_MAGcoverage.txt
--- prokaryotes
------- coverage
----------- aggregated_contigs2bins.txt
------- run_accession_***_coverage
----------- coverage.tab
----------- ***_MAGcoverage.txt
genomes_drep
--- eukaryotes
------- dereplicated_genomes.txt
------- genomes
----------- genomes.fa
--- prokaryotes
------- dereplicated_genomes.txt
------- genomes
----------- genomes.fa
intermediate_steps
--- binning
--- eukaryotes
------- eukcc
------- qs50
--- fastp
--- prokaryotes
------- gunc
------- refinement
rna
--- cluster_name
------- cluster_name_fasta
----------- ***_rRNAs.fasta
------- cluster_name_out
----------- ***_rRNAs.out
----------- ***_tRNA_20aa.out
stats
--- eukaryotes
------- busco_final_qc.csv
------- combined_busco_eukcc.qc.csv
------- eukcc_final_qc.csv
--- prokaryotes
------- checkm2
----------- aggregated_all_stats.csv
----------- aggregated_filtered_genomes.tsv
------- checkm_results_mags.tab
taxonomy
--- eukaryotes
------- all_bin2classification.txt
------- human_readable.taxonomy.csv
--- prokaryotes
------- gtdbtk_results.tar.gz
pipeline_info
--- software_versions.yml
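Once a run has finished, the dereplicated genomes can be tallied per domain. This is a minimal sketch, assuming the layout shown above with FASTA files ending in `.fa` under `genomes_drep/<domain>/genomes`. The mock directory and the `bin_*.fa` names are hypothetical and are created only so that the snippet runs on its own.

```shell
# Mock output directory mirroring the tree above (hypothetical; point OUTDIR at a real run instead).
OUTDIR="$(mktemp -d)/results"
mkdir -p "$OUTDIR/genomes_drep/eukaryotes/genomes" "$OUTDIR/genomes_drep/prokaryotes/genomes"
touch "$OUTDIR/genomes_drep/prokaryotes/genomes/bin_1.fa" \
      "$OUTDIR/genomes_drep/prokaryotes/genomes/bin_2.fa" \
      "$OUTDIR/genomes_drep/eukaryotes/genomes/bin_3.fa"

# Count the dereplicated genome FASTA files for one domain.
count_genomes() {
    find "$OUTDIR/genomes_drep/$1/genomes" -type f -name '*.fa' | wc -l | tr -d ' '
}

for domain in prokaryotes eukaryotes; do
    echo "$domain: $(count_genomes "$domain") dereplicated genome(s)"
done
```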
If you use this pipeline, please make sure to cite all the software used.