Marie Hannaert
ILVO
The Nanopore pipeline is designed to analyze long reads from Nanopore sequencing. This repository contains a Snakemake workflow tailored to bacterial genome long-read data. I developed this pipeline during my traineeship at ILVO-Plant.
Snakemake is a workflow management system that helps create and execute data processing pipelines. It requires Python 3 and can be easily installed from the Bioconda channel.
First, install Miniforge:
$ curl -L -O "https://github.com/conda-forge/miniforge/releases/latest/download/Miniforge3-$(uname)-$(uname -m).sh"
$ bash Miniforge3-$(uname)-$(uname -m).sh
or
$ wget "https://github.com/conda-forge/miniforge/releases/latest/download/Miniforge3-$(uname)-$(uname -m).sh"
$ bash Miniforge3-$(uname)-$(uname -m).sh
If this works, Mamba is installed. If not, check the Miniforge documentation here: MiniForge
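The installer file name in the commands above is built from two command substitutions, $(uname) and $(uname -m), so it depends on your operating system and CPU architecture. A minimal sketch to preview the name before downloading; the function name installer_name is hypothetical and only used for illustration:

```shell
# Hypothetical helper: print the Miniforge installer file name that the
# download commands resolve to on this machine.
installer_name() {
    # uname prints the kernel name, uname -m the machine architecture
    printf 'Miniforge3-%s-%s.sh\n' "$(uname)" "$(uname -m)"
}
```

Calling installer_name on a 64-bit Linux server would typically print a name of the form Miniforge3-Linux-x86_64.sh.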
Perform a one-time setup of Bioconda with the following commands. This will modify your ~/.condarc file:
$ mamba config --add channels defaults
$ mamba config --add channels bioconda
$ mamba config --add channels conda-forge
$ mamba config --set channel_priority strict
If these steps are followed correctly, Bioconda should be configured. If not, refer to the documentation: Bioconda
Create a Mamba environment for Snakemake:
$ mamba create -c conda-forge -c bioconda -n snakemake snakemake
If successful, activate the environment and display the help text to confirm that Snakemake works:
$ mamba activate snakemake
$ snakemake --help
For more documentation on Snakemake, visit: Snakemake
To use the Nanopore pipeline, download the complete pipeline, including scripts and Conda environments, to your local machine. It's good practice to create a Snakemake/ directory to collect all your pipelines. Download the Nanopore pipeline into your Snakemake directory using:
$ cd Snakemake/
$ git clone https://github.com/MarieHannaert/Nanopore_only_Snakemake.git
To use skANI, you need to create a database. Follow the instructions here: Creating a database for skANI
Once your database is installed, update the path to the database in the Snakefile at Snakemake/Nanopore_only_Snakemake/Snakefile, line 155.
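Download the CheckM2 database (a DIAMOND database). To do so, activate the Conda environment that Snakemake created for CheckM2 under .snakemake/conda/. This environment exists only after Snakemake has built the pipeline's Conda environments:
$ conda activate .snakemake/conda/5e00f98a73e68467497de6f423dfb41e_ # The hash in this path will differ on your machine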
$ checkm2 database --download
$ checkm2 testrun
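Because the hash in the environment path differs per machine, it can help to look the environment up instead of copying the path. A minimal sketch, assuming that Snakemake stores its environments under .snakemake/conda/ and that the right one contains an executable bin/checkm2; the function name find_checkm2_env is hypothetical:

```shell
# Hypothetical helper: print every Snakemake-created Conda environment
# under the given directory that contains a checkm2 executable.
find_checkm2_env() {
    conda_dir="${1:-.snakemake/conda}"
    for env in "$conda_dir"/*/; do
        # The glob keeps a trailing slash, so append bin/checkm2 directly
        if [ -x "${env}bin/checkm2" ]; then
            printf '%s\n' "${env%/}"
        fi
    done
}
```

Run find_checkm2_env in the Nanopore_only_Snakemake/ directory and pass the printed path to conda activate. If nothing is printed, the environments have not been built yet.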
Now the Snakemake environment is ready for use with the pipeline.
Before executing the pipeline, perform the following preparatory steps:
In the Nanopore_only_Snakemake/ directory, create the directory data/samples (the -p flag also creates data/ if it does not exist yet):
$ cd Nanopore_only_Snakemake/
$ mkdir -p data/samples
Place the samples you want to analyze in the data/samples directory, named as follows:
- sample1.fq.gz
- sample2.fq.gz
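A quick check of the file names can catch misnamed samples before a run. A minimal sketch, assuming that the .fq.gz suffix shown above is the only one the pipeline accepts; the function name check_samples is hypothetical:

```shell
# Hypothetical helper: report each file in the samples directory and
# return a non-zero status if any name does not end in .fq.gz.
check_samples() {
    dir="${1:-data/samples}"
    status=0
    for f in "$dir"/*; do
        # Skip the literal pattern when the directory is empty
        [ -e "$f" ] || continue
        case "$f" in
            *.fq.gz) printf 'ok: %s\n' "$(basename "$f" .fq.gz)" ;;
            *) printf 'unexpected name: %s\n' "$(basename "$f")"; status=1 ;;
        esac
    done
    return "$status"
}
```

Run check_samples in the Nanopore_only_Snakemake/ directory. Each "ok" line shows the sample name with the suffix removed.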
Run the following command in the Snakemake/Nanopore_only_Snakemake/ directory to make the scripts executable:
$ chmod +x scripts/*
This is necessary to execute the scripts used in the pipeline.
The genome size is hardcoded on two lines of the Snakefile. Change both to match your genome size. The line numbers are:
- 53
- 109
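Before editing, it can be useful to print the lines in question and confirm that they still hold the genome size, since line numbers shift when the Snakefile changes. A minimal sketch using sed; the function name show_lines is hypothetical:

```shell
# Hypothetical helper: print the given line numbers of a file,
# each prefixed with its line number.
show_lines() {
    file="$1"
    shift
    for n in "$@"; do
        # sed -n "Np" prints only line N of the file
        printf '%s: %s\n' "$n" "$(sed -n "${n}p" "$file")"
    done
}
```

For example, show_lines Snakefile 53 109 prints the two genome size lines. The same helper works for line 155, which holds the skANI database path.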
Now everything is ready to run the pipeline. To do a dry run that lists the jobs and their shell commands without executing anything, use the following command in the Nanopore_only_Snakemake/ directory:
$ snakemake -np
This will give you an overview of all the steps in the pipeline.
To execute the pipeline with your samples in the data/samples directory, use:
$ snakemake -j 4 --use-conda
The -j option sets the maximum number of CPU cores Snakemake may use in parallel; adjust it to your local server. The --use-conda flag is required so that the steps run in the Conda environments shipped with the pipeline.
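If you are unsure which value to pass to -j, you can derive one from the number of cores on the machine. A minimal sketch, assuming that getconf _NPROCESSORS_ONLN is available (it is on most Linux and macOS systems) and that leaving one core free for other work is acceptable; both the function name suggest_jobs and the "one core free" rule are illustrative choices, not part of the pipeline:

```shell
# Hypothetical helper: suggest a value for -j, leaving one core free.
suggest_jobs() {
    # Fall back to 1 if the core count cannot be determined
    cores=$(getconf _NPROCESSORS_ONLN 2>/dev/null || echo 1)
    if [ "$cores" -gt 1 ]; then
        echo $((cores - 1))
    else
        echo 1
    fi
}
```

The result can be passed directly to Snakemake, for example: snakemake -j "$(suggest_jobs)" --use-conda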
The pipeline has nine major steps, along with some side steps for summaries and visualizations.
NanoPlot is a tool for long reads that provides an overview of the data quality, producing various visual outputs.
NanoPlot documentation: NanoPlot
Filtlong filters long reads based on their quality, using both read length and read identity.
Filtlong documentation: Filtlong
Porechop ABI processes adapter sequences in ONT reads, discovering adapters directly from the reads and trimming them.
Porechop ABI documentation: PorechopABI
Flye is a de novo assembler for long reads. It uses the output from Porechop ABI as input.
Flye documentation: Flye
Racon polishes the assembly into a high-quality genomic consensus. Minimap2 is first run to map the Porechop ABI reads onto the Flye assembly; Racon then combines the reads, the mapping, and the assembly to build the consensus.
Racon documentation: RACON
skANI calculates average nucleotide identity (ANI) from DNA sequences, outputting a summary file used for further analysis.
skANI documentation: skANI
Quast assesses genome assemblies, producing a summary file and various visualizations for quality assessment.
Quast documentation: Quast
BUSCO evaluates genome assembly and annotation completeness, providing a summary graph for up to fifteen samples.
BUSCO documentation: BUSCO
CheckM2 is similar to CheckM but uses universally trained machine learning models.
This allows it to incorporate many lineages in its training set that have few - or even just one - high-quality genomic representatives, by putting it in the context of all other organisms in the training set.
These results are collected into a summary table, which also serves as input for the xlsx file skANI_Quast_checkM2_output.xlsx.
CheckM2 documentation: CheckM2
After executing the pipeline, your Nanopore_only_Snakemake/ directory will have the following structure:
Snakemake/
├─ Nanopore_only_Snakemake/
│  ├─ .snakemake/
│  ├─ data/
│  │  ├─ samples/
│  ├─ envs/
│  ├─ scripts/
│  │  ├─ beeswarm_vis_assemblies.R
│  │  ├─ summaries_busco.sh
│  │  ├─ skani_quast_checkm2_to_xlsx.py
│  ├─ Snakefile
│  ├─ results/
│  │  ├─ 01_nanoplot/
│  │  ├─ 02_filtlong/
│  │  ├─ 03_porechopABI/
│  │  ├─ 04_flye/
│  │  ├─ 05_racon/
│  │  ├─ 06_skani/
│  │  ├─ 07_quast/
│  │  ├─ 08_busco/
│  │  ├─ 09_checkm2/
│  │  ├─ assemblies/
│  │  ├─ busco_summary/
│  ├─ README
│  ├─ logs/
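To confirm that a run produced every output directory in the structure above, you can check the results folder for each expected name. A minimal sketch, assuming the directory names listed above; the function name check_results is hypothetical:

```shell
# Hypothetical helper: report result directories that are missing and
# return a non-zero status if any of them is absent.
check_results() {
    results="${1:-results}"
    missing=0
    for d in 01_nanoplot 02_filtlong 03_porechopABI 04_flye 05_racon \
             06_skani 07_quast 08_busco 09_checkm2 assemblies busco_summary; do
        if [ ! -d "$results/$d" ]; then
            printf 'missing: %s\n' "$results/$d"
            missing=$((missing + 1))
        fi
    done
    # Succeed only when every expected directory exists
    [ "$missing" -eq 0 ]
}
```

Run check_results in the Nanopore_only_Snakemake/ directory after the pipeline finishes. No output and a zero exit status mean that all expected directories exist.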