iganna/pannagram

Pannagram

Pannagram is a package for constructing pan-genome alignments, analyzing structural variants, and translating annotations between genomes. Additionally, Pannagram contains useful functions for visualization. The manual is available at the pannagram-page.

Setting Up the Working Environment

Follow these instructions to set up your Pannagram environment.

Prerequisites

Make sure you have one of these package managers installed: conda, mamba, or micromamba.

Use your selected package manager by replacing <manager> with conda, mamba, or micromamba.
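
If you are unsure which manager is installed, a small shell helper can check for you. This is a minimal sketch: the function name pick_manager and the preference order (micromamba, then mamba, then conda) are assumptions for illustration and are not part of Pannagram.

```shell
# Hypothetical helper: print the first conda-compatible manager found.
# The preference order below is an assumption, not a Pannagram requirement.
pick_manager() {
    for m in micromamba mamba conda; do
        # command -v succeeds for executables on PATH and for shell functions
        if command -v "$m" >/dev/null 2>&1; then
            printf '%s\n' "$m"
            return 0
        fi
    done
    echo "No conda, mamba, or micromamba found" >&2
    return 1
}
```

For example, manager=$(pick_manager) stores the detected name, which can then stand in for <manager> in the commands that follow.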

Linux and macOS (Intel)

<manager> env create -f pannagram.yml
<manager> activate pannagram

macOS (M-series chips)

<manager> env create --platform osx-64 -f pannagram_m4.yml
<manager> activate pannagram
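
The choice between the two variants above can be scripted. The sketch below is an illustration only: env_create_args is a hypothetical name, and it assumes that uname -m reports arm64 on M-series Macs, which holds for a native terminal but not for one running under Rosetta.

```shell
# Hypothetical helper: print the "env create" arguments for a platform.
# Arguments default to the output of uname; the file names are the ones
# used in the commands above.
env_create_args() {
    os=${1:-$(uname -s)}
    arch=${2:-$(uname -m)}
    if [ "$os" = "Darwin" ] && [ "$arch" = "arm64" ]; then
        # M-series Mac: request the Intel (osx-64) build of the environment
        echo "--platform osx-64 -f pannagram_m4.yml"
    else
        echo "-f pannagram.yml"
    fi
}
```

It could then be used as <manager> env create $(env_create_args), left unquoted so that the shell splits the output into separate arguments.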

Alternative: Setting Up the Environment Without Explicit Versions

Use this option if you prefer an environment where package versions are not explicitly specified, and packages are installed with the latest compatible versions available:

Linux and macOS (Intel)

<manager> env create -f pannagram_min.yml
<manager> activate pannagram

macOS (M-series chips)

<manager> env create --platform osx-64 -f pannagram_min.yml
<manager> activate pannagram

Running RStudio with the Environment

Make sure that RStudio Desktop is installed. Then run the following from the command line (open -a is a macOS command):

<manager> activate pannagram
open -a RStudio

You can also create an alias (this example uses micromamba):

alias panR="micromamba activate pannagram && open -a RStudio"

Included Dependencies

The environment provides the following dependencies, each accessible directly via the command line:

Windows users

Windows users can try running the code from this repository under WSL, because the code relies extensively on Bash and the / path separator. However, it has never been tested in that environment.

1. Pangenome linear alignment

1.1 Building the alignment

Pangenome alignment can be built in two modes:

  • reference-free:
./pannagram.sh -path_in '<genome files directory path>' \
    -path_out '<output files path>' \
    -cores 8
  • reference-based:
./pannagram.sh -ref '<reference genome name>' \
    -path_in '<genome files directory path>' \
    -path_out '<output files path>' \
    -cores 8
  • quick look: if no information about the genomes and their corresponding chromosomes is available, one can run the preparation steps:
./pannagram.sh -ref '<reference genome name>' \
    -path_in '<genome files directory path>' \
    -path_out '<output files path>' \
    -cores 8 -pre

An extended description of the parameters for all three scripts is available by running each script with the -help flag.
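
The difference between the reference-free and reference-based modes comes down to whether -ref is passed. The dry-run sketch below only prints the command it would run. The helper name pannagram_cmd and its argument order are hypothetical, and it uses only the flags shown in the examples above.

```shell
# Hypothetical dry-run helper: print the pannagram.sh call for a given mode.
# Usage: pannagram_cmd <path_in> <path_out> <cores> [reference genome name]
pannagram_cmd() {
    path_in=$1
    path_out=$2
    cores=$3
    ref=${4:-}
    if [ -n "$ref" ]; then
        # Reference-based mode: a reference genome name was given
        printf "./pannagram.sh -ref '%s' -path_in '%s' -path_out '%s' -cores %s\n" \
            "$ref" "$path_in" "$path_out" "$cores"
    else
        # Reference-free mode: no -ref flag
        printf "./pannagram.sh -path_in '%s' -path_out '%s' -cores %s\n" \
            "$path_in" "$path_out" "$cores"
    fi
}
```

For instance, pannagram_cmd genomes/ out/ 8 prints the reference-free call, and adding a fourth argument switches to the reference-based one. The directory names here are placeholders.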

1.2 Extract information from the pangenome alignment

Synteny blocks, SNPs, and sequence consensus (for the IGV browser) can be extracted from the alignment:

# -blocks: find synteny block information for visualisation
# -seq:    create the consensus sequence of the pangenome
# -snp:    call SNPs
./analys.sh -path_msa '<output path with consensus>' \
      -path_chr '<path with chromosomes>' \
      -blocks \
      -seq \
      -snp

1.3 Calling structural variants

When the pangenome linear alignment is built, SVs can be called using the following script:

# -sv_call:  create output .gff and .fasta files with SVs
# -sv_sim:   compare SVs with a set of sequences (e.g., TEs)
# -sv_graph: construct the graph of SVs
./analys.sh -path_msa '<output path with consensus>' \
      -sv_call \
      -sv_sim te.fasta \
      -sv_graph
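
Because -sv_sim takes a FASTA file, it can help to confirm that the file looks like FASTA before starting a long run. The check below is a rough sketch, not part of Pannagram: is_fasta is a hypothetical helper that only tests whether the first non-blank line is a header starting with ">".

```shell
# Hypothetical pre-flight check: succeed only if the file exists, is
# non-empty, and its first non-blank line starts with ">".
is_fasta() {
    [ -s "$1" ] || return 1
    # awk prints the first line that has at least one field, then stops
    first=$(awk 'NF { print; exit }' "$1")
    case $first in
        '>'*) return 0 ;;
        *)    return 1 ;;
    esac
}
```

A typical guard would be is_fasta te.fasta || echo "te.fasta does not look like FASTA" >&2 placed before the analys.sh call.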

2. Visualisation

Pannagram contains a number of useful methods for visualization in R.

2.1 Visualisation of the pangenome alignment

All genomes together:

A dotplot for a pair of genomes:

2.2 Graph of Nestedness on Structural variants

Every node is an SV:

Every node is a unique sequence; node size reflects the amount of this sequence in SVs:

2.3 Nucleotide plot for a fragment of the alignment

  • In the ACTG-mode:

# --- Quick start code ---
source('utils/utils.R')             # Functions to work with sequences
source('visualisation/msaplot.R')   # Visualisation
aln.seq = readFastaMy('aln.fasta')  # Vector of strings
aln.mx = aln2mx(aln.seq)            # Transform into a matrix
msaplot(aln.mx)                     # ggplot object
  • In the Polymorphism mode:

# --- Quick start code ---
msadiff(aln.mx)                     # ggplot object

2.4 Dotplots of Sequences

Simultaneously on forward (dark color) and reverse complement (pink color) strands:

# --- Quick start code ---
source('utils/utils.R')             # Functions to work with sequences
source('visualisation/dotplot.R')   # Visualisation
s = sample(c("A","C","G","T"), 100, replace = TRUE)  # Random nucleotide vector
dotplot(s, s, 15, 9)                # ggplot object

2.5 ORF-finder and visualisation

# --- Quick start code ---
source('utils/utils.R')             # Functions to work with sequences
source('visualisation/orfplot.R')   # Visualisation
s = sample(c("A","C","G","T"), 100, replace = TRUE)  # Random nucleotide vector, as in 2.4
str = nt2seq(s)
orfs = orfFinder(str)
orfplot(orfs$pos)                   # ggplot object

3. Additional useful tools

3.1 Search for similar sequences

...on the genome

The first approach involves searching against entire genomes or individual chromosomes. The quickstart toy-example is:

./simsearch.sh -in_seq genes.fasta -on_genome genome.fasta -out out.txt

The result is a GFF file with hits matching the similarity threshold.
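
To get a quick overview of the hits, they can be counted per sequence. The sketch below assumes that out.txt follows the standard tab-separated GFF layout, with the sequence ID in column 1 and comment lines starting with "#". This has not been checked against actual simsearch.sh output, and gff_hits_per_seq is a hypothetical name.

```shell
# Hypothetical summary: count hits per sequence ID in a GFF file.
# Assumes standard 9-column tab-separated GFF; lines starting with "#"
# are skipped.
gff_hits_per_seq() {
    awk -F '\t' '!/^#/ && NF >= 9 { n[$1]++ }
                 END { for (s in n) print s "\t" n[s] }' "$1" | sort
}
```

Running gff_hits_per_seq out.txt prints one line per sequence ID followed by its hit count.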

...on another set

The second approach, in contrast, is designed to search for similarities against another set of sequences. The quickstart toy-example is:

./simsearch.sh -in_seq genes.fasta -on_seq genome.fasta -out out.txt

The result is a table saved in RDS format (R's serialized object format, readable with readRDS). The table shows the coverage of one sequence by another and includes a flag column indicating whether the sequences meet the similarity threshold. This approach also takes the strand of the coverage into account: it determines not only whether a sequence is covered, but also whether it is covered in a specific orientation.

Acknowledgements

Development:

  • Anna Igolkina - Lead Developer and Project Initiator
  • Alexander Bezlepsky - Assistant

Testing:

  • Anna Igolkina: Lead Tester
  • Anna Glushkevich: Testing the alignment on A. lyrata genomes
  • Elizaveta Grigoreva: Testing the alignment on A. thaliana and A. lyrata genomes
  • Jilong Ma: Testing the SV-graph on spider genomes
  • Alexander Bezlepsky: Testing Pannagram's functionality on Rhizobial genomes
  • Gregoire Bohl-Viallefond: Testing the annotation converter on A. thaliana alignment

Resources:
