Pannagram is a package for constructing pan-genome alignments, analyzing structural variants, and translating annotations between genomes. Additionally, Pannagram contains useful functions for visualization. The manual is available at the pannagram-page.
Follow these instructions to set up your Pannagram environment.
Make sure you have one of the following package managers installed: conda, mamba, or micromamba.
In the commands below, replace <manager> with the one you use.
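Instead of substituting <manager> by hand, the installed manager can be detected automatically. The sketch below is only an illustration: the helper name pick_manager and the preference order (micromamba, then mamba, then conda) are assumptions and not part of Pannagram.

```shell
# Hypothetical helper (not part of Pannagram): print the name of the first
# package manager found on PATH. The preference order is an assumption.
pick_manager() {
    for candidate in micromamba mamba conda; do
        if command -v "$candidate" >/dev/null 2>&1; then
            printf '%s\n' "$candidate"
            return 0
        fi
    done
    echo "No supported package manager found (conda, mamba, micromamba)" >&2
    return 1
}

# Example (commented out so that sourcing this file has no side effects):
#   manager=$(pick_manager) && "$manager" env create -f pannagram.yml
```

The printed name can then stand in for <manager> in the commands that follow.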
<manager> env create -f pannagram.yml
<manager> activate pannagram
<manager> env create --platform osx-64 -f pannagram_m4.yml
<manager> activate pannagram
Use this option if you prefer an environment where package versions are not explicitly specified, and packages are installed with the latest compatible versions available:
Linux and macOS (Intel)
<manager> env create -f pannagram_min.yml
<manager> activate pannagram
macOS (M-series chips)
<manager> env create --platform osx-64 -f pannagram_min.yml
<manager> activate pannagram
Make sure that RStudio-Desktop is installed. Then run the following in the command line:
<manager> activate pannagram
open -a RStudio
One may also create an alias:
alias panR="micromamba activate pannagram && open -a RStudio"
The environment provides the following dependencies, each accessible directly via the command line:
On Windows, you can try running the code from this repository under WSL, since Bash and the / path separator are used extensively in the code. However, it has never been tested in that environment.
Pangenome alignment can be built in two modes:
- reference-free:
./pannagram.sh -path_in '<genome files directory path>' \
-path_out '<output files path>' \
-cores 8
- reference-based:
./pannagram.sh -ref '<reference genome name>' \
-path_in '<genome files directory path>' \
-path_out '<output files path>' \
-cores 8
- quick look: if no information about the genomes and their corresponding chromosomes is available, one can run the preparation steps:
./pannagram.sh -ref '<reference genome name>' \
-path_in '<genome files directory path>' \
-path_out '<output files path>' \
-cores 8 -pre
An extended description of the parameters for all three scripts is available by running each script with the -help flag.
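The reference-free and reference-based calls above differ only in the -ref flag, so both can be wrapped in one function. This is a minimal sketch under stated assumptions: the function name run_pannagram and the PANNAGRAM_SH variable (path to the script, defaulting to ./pannagram.sh) are hypothetical and not part of Pannagram; only the -ref, -path_in, -path_out, and -cores flags come from the commands above.

```shell
# Hypothetical wrapper (not part of Pannagram).
# Usage: run_pannagram <path_in> <path_out> <cores> [reference]
# The body runs in a subshell "( ... )" so its variables stay local.
run_pannagram() (
    path_in=$1
    path_out=$2
    cores=$3
    ref=${4:-}

    # Fail early if the genome directory does not exist.
    if [ ! -d "$path_in" ]; then
        echo "Input directory not found: $path_in" >&2
        exit 1
    fi

    # Flags shared by both modes.
    set -- -path_in "$path_in" -path_out "$path_out" -cores "$cores"

    # Reference-based mode only adds -ref in front.
    if [ -n "$ref" ]; then
        set -- -ref "$ref" "$@"
    fi

    # PANNAGRAM_SH is an assumed override for the script location.
    "${PANNAGRAM_SH:-./pannagram.sh}" "$@"
)
```

Calling it with three arguments corresponds to the reference-free mode; a fourth argument is passed on as the reference genome name.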
Synteny blocks, SNPs, and sequence consensus (for the IGV browser) can be extracted from the alignment:
# -blocks: find synteny block information for visualisation
# -seq:    create the consensus sequence of the pangenome
# -snp:    call SNPs
./analys.sh -path_msa '<output path with consensus>' \
-path_chr '<path with chromosomes>' \
-blocks \
-seq \
-snp
When the pangenome linear alignment is built, SVs can be called using the following script:
# -sv_call:  create output .gff and .fasta files with SVs
# -sv_sim:   compare SVs with a set of sequences (e.g., TEs)
# -sv_graph: construct the graph of SVs
./analys.sh -path_msa '<output path with consensus>' \
-sv_call \
-sv_sim te.fasta \
-sv_graph
Pannagram contains a number of useful methods for visualization in R.
All genomes together:
A dotplot for a pair of genomes:
Every node is an SV:
Every node is a unique sequence; node size reflects the amount of this sequence in SVs:
- In the ACTG-mode:
# --- Quick start code ---
source('utils/utils.R') # Functions to work with sequences
source('visualisation/msaplot.R') # Visualisation
aln.seq = readFastaMy('aln.fasta') # Vector of strings
aln.mx = aln2mx(aln.seq) # Transform into a matrix
msaplot(aln.mx) # ggplot object
- In the Polymorphism mode:
# --- Quick start code ---
msadiff(aln.mx) # ggplot object
Simultaneously on forward (dark color) and reverse complement (pink color) strands:
# --- Quick start code ---
source('utils/utils.R') # Functions to work with sequences
source('visualisation/dotplot.R') # Visualisation
s = sample(c("A","C","G","T"), 100, replace = TRUE) # Random nucleotide vector
dotplot(s, s, 15, 9) # ggplot object
# --- Quick start code ---
source('utils/utils.R') # Functions to work with sequences
source('visualisation/orfplot.R') # Visualisation
str = nt2seq(s) # s is the nucleotide vector from the dotplot example
orfs = orfFinder(str) # Find open reading frames
orfplot(orfs$pos) # ggplot object
The first approach involves searching against entire genomes or individual chromosomes. The quickstart toy-example is:
./simsearch.sh -in_seq genes.fasta -on_genome genome.fasta -out out.txt
The result is a GFF file with hits matching the similarity threshold.
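A quick way to inspect such a result is to count the hits per target sequence. The sketch below assumes that out.txt follows the standard nine-column, tab-separated GFF layout with the sequence name in the first column; the exact columns written by simsearch.sh are not described here, and the function name gff_hits_per_seq is hypothetical.

```shell
# Hypothetical helper (not part of Pannagram): count GFF records per sequence.
# Assumes tab-separated GFF with at least 9 columns; '#' lines are skipped.
gff_hits_per_seq() {
    awk -F '\t' '
        /^#/ { next }                 # skip comment and directive lines
        NF >= 9 { hits[$1]++ }        # column 1 holds the sequence name
        END { for (name in hits) printf "%s\t%d\n", name, hits[name] }
    ' "$1" | sort
}

# Example (commented out): gff_hits_per_seq out.txt
```

The output is one line per sequence name, followed by a tab and the number of hits on it.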
The second approach, in contrast, is designed to search for similarities against another set of sequences. The quickstart toy-example is:
./simsearch.sh -in_seq genes.fasta -on_seq genome.fasta -out out.txt
The result is a table saved in RDS format (R's serialized data format). The table shows the coverage of one sequence over another and includes a flag column indicating whether the sequences meet the similarity threshold. This approach also takes the coverage strand into account: it determines not only whether a sequence is covered, but also whether it is covered in a specific orientation.
Development:
- Anna Igolkina - Lead Developer and Project Initiator
- Alexander Bezlepsky - Assistant
Testing:
- Anna Igolkina: Lead Tester
- Anna Glushkevich: Testing the alignment on A. lyrata genomes
- Elizaveta Grigoreva: Testing the alignment on A. thaliana and A. lyrata genomes
- Jilong Ma: Testing the SV-graph on spider genomes
- Alexander Bezlepsky: Testing Pannagram's functionality on rhizobial genomes
- Gregoire Bohl-Viallefond: Testing the annotation converter on A. thaliana alignment
Resources:
- Logo was generated with the help of DALL-E
- Parallel Processing Tool: O. Tange (2018): GNU Parallel 2018, ISBN 9781387509881, DOI https://doi.org/10.5281/zenodo.1146014.