Complete workflow for primary data analyses for bacterial genomes using nanopore

MarieHannaert/Nanopore_only_Snakemake

Nanopore_only pipeline

Marie Hannaert
ILVO

The Nanopore pipeline is a Snakemake workflow for analyzing long reads from Nanopore sequencing of bacterial genomes. I developed this pipeline during my traineeship at ILVO-Plant.

Installing the Nanopore pipeline

Snakemake is a workflow management system that helps create and execute data processing pipelines. It requires Python 3 and can be installed from the Bioconda channel.

Installing Mamba

First, install Miniforge:

Unix-like platforms (Mac OS & Linux)

$ curl -L -O "https://github.com/conda-forge/miniforge/releases/latest/download/Miniforge3-$(uname)-$(uname -m).sh"
$ bash Miniforge3-$(uname)-$(uname -m).sh

or

$ wget "https://github.com/conda-forge/miniforge/releases/latest/download/Miniforge3-$(uname)-$(uname -m).sh"
$ bash Miniforge3-$(uname)-$(uname -m).sh

If this works, Mamba is installed. If not, check the Miniforge documentation here: MiniForge

Installing Bioconda

Perform a one-time setup of Bioconda with the following commands. This will modify your ~/.condarc file:

$ mamba config --add channels defaults
$ mamba config --add channels bioconda
$ mamba config --add channels conda-forge
$ mamba config --set channel_priority strict

If these steps are followed correctly, Bioconda should be installed. If not, refer to the documentation: Bioconda

Installing Snakemake

Create a dedicated Mamba environment for Snakemake:

$ mamba create -c conda-forge -c bioconda -n snakemake snakemake

If successful, use the following commands to activate and check for help:

$ mamba activate snakemake
$ snakemake --help

For more documentation on Snakemake, visit: Snakemake

Downloading the Nanopore pipeline from GitHub

To use the Nanopore pipeline, download the complete pipeline, including scripts and Conda environments, to your local machine. It's good practice to create a Snakemake/ directory to collect all your pipelines. Download the Nanopore pipeline into your Snakemake directory using:

$ cd Snakemake/ 
$ git clone https://github.com/MarieHannaert/Nanopore_only_Snakemake.git

Making the database that is used for skANI

skANI needs a reference database. To create one, follow the instructions here: Creating a database for skANI

Once your database is installed, update the path to the database in the Snakefile at Snakemake/Nanopore_only_Snakemake/Snakefile, line 155.

Preparing checkM2

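Download the DIAMOND database that CheckM2 needs. Run these commands from inside the CheckM2 Conda environment that Snakemake creates under .snakemake/conda/:

$ conda activate .snakemake/conda/5e00f98a73e68467497de6f423dfb41e_ # the hash in this path will differ on your machine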
$ checkm2 database --download
$ checkm2 testrun
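
Because the environment hash differs per machine, it can help to look the path up instead of guessing it. The following is a minimal sketch, not part of the pipeline: find_checkm2_envs is a hypothetical helper name, and it assumes the environments already exist (for example after running snakemake --use-conda --conda-create-envs-only -j 1) and that CheckM2 is installed as bin/checkm2 inside its environment.

```shell
# Hypothetical helper: print every Conda environment below a Snakemake
# working directory that contains a checkm2 executable.
find_checkm2_envs() {
    workdir="${1:-.}"
    # Snakemake-created environments live in .snakemake/conda/<hash>_
    for exe in "$workdir"/.snakemake/conda/*/bin/checkm2; do
        # Skip the unexpanded glob when nothing matches
        [ -e "$exe" ] || continue
        # Strip the trailing /bin/checkm2 to get the environment prefix
        echo "${exe%/bin/checkm2}"
    done
}
```

Calling find_checkm2_envs from the Nanopore_only_Snakemake/ directory prints the path to pass to conda activate. If it prints nothing, the environments have not been created yet.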

Now the Snakemake environment is ready for use with the pipeline.

Executing the Nanopore pipeline

Before executing the pipeline, perform the following preparatory steps:

Preparing

In the Nanopore_only_Snakemake/ directory, create the following directory: data/samples

$ cd Nanopore_only_Snakemake/
$ mkdir -p data/samples

Place the gzipped FASTQ files you want to analyze in the data/samples directory. They should be named like:

  • sample1.fq.gz
  • sample2.fq.gz
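
A quick check of the file names before starting can save a failed run. This is a minimal sketch under the assumption that every file in data/samples should end in .fq.gz, as in the examples above; check_sample_names is a hypothetical helper, not a script shipped with the pipeline.

```shell
# Hypothetical helper: list files in a samples directory that do not
# follow the <name>.fq.gz convention. Prints nothing if all names are fine.
check_sample_names() {
    dir="${1:-data/samples}"
    bad=0
    for path in "$dir"/*; do
        # Skip the unexpanded glob when the directory is empty
        [ -e "$path" ] || continue
        case "$(basename "$path")" in
            *.fq.gz) ;;                                  # expected naming
            *) echo "unexpected name: $path"; bad=1 ;;   # report anything else
        esac
    done
    # Non-zero exit status if at least one file was reported
    return "$bad"
}
```

Run check_sample_names without arguments from the Nanopore_only_Snakemake/ directory, or pass another directory as the first argument.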

Making scripts executable

Run the following command in the Snakemake/Nanopore_only_Snakemake/ directory to make the scripts executable:

$ chmod +x scripts/*

This is necessary to execute the scripts used in the pipeline.

Personalize genome size

The genome size is hardcoded on two lines of the Snakefile. Change it to the genome size of your organism on the following lines:

  • 53
  • 109
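
To see what those lines currently contain before editing them, something like the following can be used. It is only a sketch: show_lines is a hypothetical helper, and the line numbers above may shift if the Snakefile has changed since this README was written, so check that the printed lines really mention the genome size.

```shell
# Print the given line numbers of a file, each prefixed with its number.
# Usage: show_lines FILE LINE [LINE...]
show_lines() {
    file="$1"
    shift
    for n in "$@"; do
        # sed -n 'Np' prints only line N of the file
        printf '%s: %s\n' "$n" "$(sed -n "${n}p" "$file")"
    done
}
```

For example, show_lines Snakefile 53 109 prints both lines listed above.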

Executing the Nanopore pipeline

Now, everything is ready to run the pipeline. To check the pipeline without generating output, use the following command in the Nanopore_only_Snakemake/ directory:

$ snakemake -np

This will give you an overview of all the steps in the pipeline.

To execute the pipeline with your samples in the data/samples directory, use:

$ snakemake -j 4 --use-conda

The -j option sets the maximum number of CPU cores Snakemake may use in parallel; adjust it to your local server. The --use-conda option is needed so that Snakemake runs the steps in the Conda environments of the pipeline.

Pipeline content

The pipeline has nine major steps, along with some side steps for summaries and visualizations.

Nanoplot

NanoPlot is a tool for long reads that gives an overview of the data quality and produces various plots.

Nanoplot documentation: Nanoplot

Filtlong

Filtlong filters long reads by quality, taking both read length and read identity into account.

Filtlong documentation: Filtlong

Porechop ABI

Porechop ABI processes adapter sequences in ONT reads, discovering adapters directly from the reads and trimming them.

Porechop ABI documentation: PorechopABI

Flye

Flye is a de novo assembler for long reads. It uses the output of Porechop ABI as input.

Flye documentation: Flye

Racon

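Racon polishes the Flye assembly into a high-quality genomic consensus. It needs the overlaps between reads and assembly, so Minimap2 first maps the Porechop ABI reads onto Flye's output; Racon then combines the reads, the overlaps, and the assembly.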

Racon documentation: RACON
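
The two-step polishing described above can be sketched as follows. This is an illustration of the data flow only, not the rule from the Snakefile: print_polish_cmds is a hypothetical helper, the three paths are placeholders supplied by the caller, and the options the pipeline actually passes to Minimap2 and Racon may differ. The commands are printed rather than executed so they can be reviewed first.

```shell
# Sketch of Minimap2 + Racon polishing as a dry run: prints the commands.
# All paths are placeholders, not the pipeline's actual file names.
print_polish_cmds() {
    reads="$1"      # trimmed reads from Porechop ABI
    assembly="$2"   # draft assembly from Flye
    outdir="$3"     # where the overlaps and polished assembly should go
    # Step 1: map the ONT reads back onto the draft assembly (PAF output)
    echo "minimap2 -x map-ont $assembly $reads > $outdir/overlaps.paf"
    # Step 2: Racon expects reads, overlaps, then the target assembly
    echo "racon $reads $outdir/overlaps.paf $assembly > $outdir/polished.fasta"
}
```

For example, print_polish_cmds reads.fq.gz assembly.fasta out prints one Minimap2 command and one Racon command that can be copied into a terminal.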

skANI

skANI calculates average nucleotide identity (ANI) from DNA sequences, outputting a summary file used for further analysis.

skANI documentation: skANI

Quast

Quast assesses genome assemblies, producing a summary file and various visualizations for quality assessment.

Quast documentation: Quast

Busco

BUSCO evaluates genome assembly and annotation completeness, providing a summary graph for up to fifteen samples.

Busco documentation: Busco

CheckM2

CheckM2 is similar to CheckM but uses universally trained machine learning models.

This allows it to incorporate many lineages in its training set that have few - or even just one - high-quality genomic representatives, by putting it in the context of all other organisms in the training set.

These results are collected into a summary table, which is also used as input for the xlsx file: skANI_Quast_checkM2_output.xlsx.

CheckM2 documentation: CheckM2

Finish

After executing the pipeline, your Nanopore_only_Snakemake/ directory will have the following structure:

Snakemake/
├─ Nanopore_only_Snakemake/
│  ├─ .snakemake
│  ├─ data/
│  │  ├─ samples/
│  ├─ envs
│  ├─ scripts/
│  │  ├─ beeswarm_vis_assemblies.R
│  │  ├─ summaries_busco.sh
│  │  ├─ skani_quast_checkm2_to_xlsx.py
│  ├─ Snakefile
│  ├─ results/
│  │  ├─ 01_nanoplot/
│  │  ├─ 02_filtlong/
│  │  ├─ 03_porechopABI/
│  │  ├─ 04_flye/
│  │  ├─ 05_racon/
│  │  ├─ 06_skani/
│  │  ├─ 07_quast/
│  │  ├─ 08_busco/
│  │  ├─ 09_checkm2/
│  │  ├─ assemblies/
│  │  ├─ busco_summary/
│  ├─ README
│  ├─ logs
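
To confirm that a run produced every result directory in this layout, a small check like the one below can be used. It is a sketch: check_results is a hypothetical helper, and the directory names are copied from the tree above, so adjust the list if your version of the pipeline writes different folders.

```shell
# Hypothetical helper: report which expected result directories are missing.
# The names are taken from the directory layout in this README.
check_results() {
    results="${1:-results}"
    missing=0
    for d in 01_nanoplot 02_filtlong 03_porechopABI 04_flye 05_racon \
             06_skani 07_quast 08_busco 09_checkm2 assemblies busco_summary; do
        if [ ! -d "$results/$d" ]; then
            echo "missing: $results/$d"
            missing=1
        fi
    done
    # Non-zero exit status if at least one directory is absent
    return "$missing"
}
```

Run check_results from the Nanopore_only_Snakemake/ directory after the pipeline finishes; no output means all directories are present.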

Overview of Nanopore pipeline

A DAG of the Nanopore pipeline in snakemake
