Genomes Generation Pipeline

The MGnify genomes generation pipeline (GGP) produces prokaryotic and eukaryotic metagenome-assembled genomes (MAGs) from raw reads and their corresponding assemblies.

This pipeline does not support co-binning and has so far only been tested on short reads.

Pipeline summary

The pipeline performs the following tasks:

Pre-processing:

Sanity-checks the raw reads with seqkit.
Renames raw-read identifiers to the corresponding assembly identifiers (this makes it possible to trace back which contigs were used to build a particular bin/MAG).
Changes all dots to underscores in contig headers.
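
The header clean-up in the last step can be illustrated with a short shell sketch. This is not the pipeline's own code: the function name sanitise_contig_headers is hypothetical, and the only assumption is that header lines start with ">", as in FASTA.

```shell
# Hypothetical helper (not part of the pipeline): replace every dot with an
# underscore on FASTA header lines, leaving sequence lines untouched.
sanitise_contig_headers() {
    sed '/^>/ s/\./_/g'
}

# Example: the header changes, the sequence line does not.
printf '>contig.1.2\nACGT\n' | sanitise_contig_headers
```

Restricting the substitution to lines matching "^>" keeps any dots elsewhere in the file out of scope.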

Data processing:

Quality trims the reads and removes adapters using fastp.
Runs a decontamination step using BWA to remove any host reads. By default, it uses the human genome hg38.fna.
Bins the contigs using CONCOCT, MetaBAT2 and MaxBin2.

For prokaryotes:

Refines the bins using a subworkflow that is compatible with metaWRAP bin_refinement and is supported separately.
Conducts bin quality control with CAT, GUNC, and CheckM.
Performs dereplication with dRep.
Calculates coverage using the depths computed by MetaBAT2.
Detects rRNA and tRNA using cmsearch.
Assigns taxonomy with GTDB-Tk.

For eukaryotes:

Estimates quality and merges bins using EukCC.
Dereplicates MAGs using dRep.
Calculates coverage using the depths computed by MetaBAT2.
Assesses quality with BUSCO and EukCC.
Assigns taxonomy with BAT.

Final steps:

Records the tool versions in software_versions.yml.
Generates a MultiQC report.

Optional steps:

Uploads MAGs to ENA using the public MAG uploader. This applies only if the assemblies and reads were downloaded from ENA.

Requirements

Nextflow
Docker/Singularity

Required reference databases

Download the databases listed below and pass their locations through the corresponding parameters (see nextflow.config).

BUSCO
CAT
CheckM
EukCC
GUNC
GTDB-Tk + ar53_metadata_r*.tsv, bac120_metadata_r*.tsv from here
Rfam
A reference genome of your choice for decontamination, for example the human genome hg38

Pipeline inputs

Note

If you want to use the pipeline on ENA data, follow these instructions. Otherwise, download your data and organise it in the recommended format described below.

samplesheet.csv

Each row describes one dataset and contains the following information:

row identifier id
paths to the raw reads files (fastq_1 and fastq_2)
assembly identifier assembly_accession
the file path to the contigs file (assembly)

An optional column, assembler, records the tool and version used to produce the assembly.

id	fastq_1	fastq_2	assembly_accession	assembly	assembler [optional]
ID	/path/to/RUN_1.fastq.gz	/path/to/RUN_2.fastq.gz	ASSEMBLY	/path/to/ASSEMBLY.fasta	metaspades_v3.15.5

There is an example here.
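
As a sketch, the table above could be written to disk as follows. It assumes the file is comma-separated (suggested by the .csv name, but not stated here) and reuses the placeholder values from the table. The column-count check is an illustrative addition and not part of the pipeline.

```shell
# Write a one-row samplesheet using the placeholder values from the table.
# Assumption: fields are comma-separated, as the .csv extension suggests.
cat > samplesheet.csv <<'EOF'
id,fastq_1,fastq_2,assembly_accession,assembly,assembler
ID,/path/to/RUN_1.fastq.gz,/path/to/RUN_2.fastq.gz,ASSEMBLY,/path/to/ASSEMBLY.fasta,metaspades_v3.15.5
EOF

# Illustrative check: every row must have the 5 required columns,
# plus at most one optional "assembler" column.
awk -F',' 'NF < 5 || NF > 6 { printf "line %d: %d fields\n", NR, NF; bad = 1 }
           END { exit bad }' samplesheet.csv && echo "samplesheet looks consistent"
```

Replace the placeholder identifiers and paths with real ones before running the pipeline.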

Run pipeline

nextflow run ebi-metagenomics/genomes-generation \
-profile `specify profile(s)` \
--samplesheet `samplesheet.csv` \
--outdir `full path to output directory`

Optional arguments

--skip_preprocessing_input (default=false): skip the input data pre-processing step
--skip_prok (default=false): do not generate prokaryotic MAGs
--skip_euk (default=false): do not generate eukaryotic MAGs
--skip_concoct (default=false): skip the CONCOCT binner during binning
--skip_maxbin2 (default=false): skip the MaxBin2 binner during binning
--skip_metabat2 (default=false): skip the MetaBAT2 binner during binning
--merge_pairs (default=false): merge paired-end reads with fastp during the QC step
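
The flags above are appended to the run command. The following sketch assembles a prokaryote-only run without CONCOCT and prints it instead of launching it. The profile name "docker" and both paths are placeholders chosen for illustration, so check the pipeline configuration for the profiles that actually exist. It relies on the Nextflow convention that a parameter given without a value is set to true.

```shell
# Placeholders: adjust the profile name and paths to your environment.
PROFILE="docker"
SAMPLESHEET="samplesheet.csv"
OUTDIR="/full/path/to/output"

# Build the command as positional parameters so each argument stays intact.
set -- nextflow run ebi-metagenomics/genomes-generation \
    -profile "$PROFILE" \
    --samplesheet "$SAMPLESHEET" \
    --outdir "$OUTDIR" \
    --skip_euk \
    --skip_concoct

# Dry run: print the command. Replace this line with "$@" to launch it.
echo "$*"
```

Printing the command first makes it easy to review the chosen options before starting a long run.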

Pipeline results

Structure

unclassified_genomes.txt

bins
--- eukaryotes
------- run_accession
----------- bins.fa
--- prokaryotes
------- run_accession
----------- bins.fa

coverage
--- eukaryotes
------- coverage
----------- aggregated_contigs2bins.txt
------- run_accession_***_coverage
----------- coverage.tab
----------- ***_MAGcoverage.txt
--- prokaryotes
------- coverage
----------- aggregated_contigs2bins.txt
------- run_accession_***_coverage
----------- coverage.tab
----------- ***_MAGcoverage.txt

genomes_drep
--- eukaryotes
------- dereplicated_genomes.txt
------- genomes
----------- genomes.fa
--- prokaryotes
------- dereplicated_genomes.txt
------- genomes
----------- genomes.fa

intermediate_steps
--- binning
--- eukaryotes
------- eukcc
------- qs50
--- fastp
--- prokaryotes
------- gunc
------- refinement

rna
--- cluster_name
------- cluster_name_fasta
----------- ***_rRNAs.fasta
------- cluster_name_out
----------- ***_rRNAs.out
----------- ***_tRNA_20aa.out

stats
--- eukaryotes
------- busco_final_qc.csv
------- combined_busco_eukcc.qc.csv
------- eukcc_final_qc.csv
--- prokaryotes
------- checkm2
----------- aggregated_all_stats.csv
----------- aggregated_filtered_genomes.tsv
------- checkm_results_mags.tab

taxonomy
--- eukaryotes
------- all_bin2classification.txt
------- human_readable.taxonomy.csv
--- prokaryotes
------- gtdbtk_results.tar.gz

pipeline_info
--- software_versions.yml
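
After a run, a quick way to compare an output directory against the layout above is to check for the top-level folders. The helper below is a hypothetical sketch, not something shipped with the pipeline. It looks only at the top-level directory names listed above and assumes all of them are created in a default run. Some may legitimately be absent, for example when steps are skipped.

```shell
# Hypothetical helper: report which of the top-level result directories
# listed above are missing from an output directory.
check_results_layout() {
    outdir="$1"
    missing=0
    for d in bins coverage genomes_drep intermediate_steps rna stats taxonomy pipeline_info; do
        if [ ! -d "$outdir/$d" ]; then
            echo "missing: $d"
            missing=1
        fi
    done
    return "$missing"
}

# Usage (the path is a placeholder):
# check_results_layout /full/path/to/output
```

The function returns a non-zero status when at least one directory is missing, so it can be used in a conditional.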

Citation

If you use this pipeline, please cite all the software it uses.
