Skip to content
/ ntSynt Public

Detecting multi-genome synteny using minimizer graph mapping

License

Notifications You must be signed in to change notification settings

bcgsc/ntSynt

Repository files navigation

GitHub Release Conda Downloads Preprint

Logo

ntSynt

Multi-genome synteny detection using a dynamic minimizer graph approach.

Contents

  1. Description of the algorithm
  2. Credits
  3. Citing ntSynt
  4. Usage
  5. Installation instructions
  6. Example
  7. Output files
  8. Assessment
  9. Tips
  10. License

Description of the algorithm

ntSynt can take multiple genomes as input, and will compute synteny blocks that are in common with each of these input assemblies. ntSynt builds on the ntJoin codebase.

Main steps in the algorithm:

  1. Generate ordered minimizer sketches for each of the input genome assemblies
    • By default, the minimizer sketches will be generated using a "common" Bloom filter, which contains all the k-mers that are in common between each assembly
    • Only minimizers from k-mers that are found in this common Bloom filter will be included in the output sketch
  2. Filter the minimizers to retain only those that are found in all input assemblies and found in single copy in each assembly
  3. Build an initial minimizer graph, where the nodes are minimizers and there are edges between adjacent minimizers
  4. Run graph simplification to remove simple tangles in the graph
  5. Find linear paths through the graph, and compute the initial synteny blocks
  6. Compute minimizer sketches for regions not covered by the initial synteny blocks using a lower window size, and augment the minimizer graph
  7. Repeat the graph simplification and filtering steps, and output refined synteny blocks. Repeat steps 6-7 for each desired lower window size.
  8. Merge collinear synteny blocks within the specified range
  9. Output the final synteny blocks

Credits

Concept: Lauren Coombe and Rene Warren

Design and implementation: Lauren Coombe

Citing ntSynt

If you use ntSynt in your work, please cite:

Lauren Coombe, Parham Kazemi, Johnathan Wong, Inanc Birol, René L. Warren. Multi-genome synteny detection using minimizer graph mappings. bioRxiv (2024) https://doi.org/10.1101/2024.02.07.579356.

Usage

usage: ntSynt [-h] [--fastas_list FASTAS_LIST] -d DIVERGENCE [-p PREFIX] [-k K] [-w W] [-t T] [--fpr FPR] [-b BLOCK_SIZE] [--merge MERGE]
              [--w_rounds W_ROUNDS [W_ROUNDS ...]] [--indel INDEL] [-n] [--benchmark] [-f] [--dev] [-v]
              [fastas ...]

ntSynt: Multi-genome synteny detection using minimizer graphs

positional arguments:
  fastas                Input genome fasta files

optional arguments:
  -h, --help            show this help message and exit
  --fastas_list FASTAS_LIST
                        File listing input genome fasta files, one per line
  -d DIVERGENCE, --divergence DIVERGENCE
                        Approx. maximum percent sequence divergence between input genomes (Ex. -d 1 for 1% divergence).
                        This will be used to set --indel, --merge, --w_rounds, --block_size
                        See below for set values - You can also set any of those parameters yourself, which will override these settings.
  -p PREFIX, --prefix PREFIX
                        Prefix for ntSynt output files [ntSynt.k<k>.w<w>]
  -k K                  Minimizer k-mer size [24]
  -w W                  Minimizer window size [1000]
  -t T                  Number of threads [12]
  --fpr FPR             False positive rate for Bloom filter creation [0.025]
  -b BLOCK_SIZE, --block_size BLOCK_SIZE
                        Minimum synteny block size (bp)
  --merge MERGE         Maximum distance between collinear synteny blocks for merging (bp). 
                        Can also specify a multiple of the window size (ex. 3w)
  --w_rounds W_ROUNDS [W_ROUNDS ...]
                        List of decreasing window sizes for synteny block refinement
  --indel INDEL         Threshold for indel detection (bp)
  -n, --dry-run         Print out the commands that will be executed
  --benchmark           Store benchmarks for each step of the ntSynt pipeline
  -f, --force           Run all ntSynt steps, regardless of existing output files
  --dev                 Run in developer mode to retain intermediate files, log verbose output
  -v, --version         show program's version number and exit

Given the approximate maximum divergence between the supplied genomes, ntSynt will set these default parameters:

Divergence range Default parameters
< 1% --block_size 500 --indel 10000 --merge 10000 --w_rounds 100 10
1% - 10% --block_size 1000 --indel 50000 --merge 100000 --w_rounds 250 100
>10% --block_size 10000 --indel 100000 --merge 1000000 --w_rounds 500 250

Any of these parameters can be overridden by specifying them in your command. While these settings work generally well for the associated divergence range, we highly recommend customizing them for your particular requirements.

Installation

Installing via conda

conda install -c bioconda -c conda-forge ntsynt

Dependencies

Installing ntSynt from the source code

meson setup build --prefix=/path/to/desired/install/location
cd build
ninja install

Testing ntSynt installation

Test your ntSynt installation using our provided demo:

cd tests
./run_ntSynt_demo.sh 

Once the script has executed successfully, you can compare the output files with those in tests/expected_results

Example command

To compute the synteny blocks between 3 assemblies (assembly1.fa, assembly2.fa, assembly3.fa) with default parameters, where the maximum sequence divergence among these is ~5%, run:

ntSynt -d 5 assembly1.fa assembly2.fa assembly3.fa

Output files

The main output file has the naming scheme <prefix>.synteny_blocks.tsv. This contains the synteny blocks computed in a TSV format.

The columns of this output synteny blocks TSV:

  1. Synteny block ID - Lines with the same ID are part of the same synteny block
  2. Genome file name
  3. Genome chromosome/contig
  4. Genome start coordinate
  5. Genome end coordinate
  6. Chromosome/contig strand
  7. Number of mapped minimizers in this synteny block
  8. Reason for discontinuity with previous synteny block

Basic assessment of synteny blocks

For a basic statistical summary of the computed synteny blocks, you can use the script denovo_synteny_block_stats.py found in analysis_scripts:

python3 denovo_synteny_block_stats.py -h
usage: denovo_synteny_block_stats.py [-h] --tsv TSV --fai FAI [FAI ...]

Compute de novo stats on synteny blocks

optional arguments:
  -h, --help           show this help message and exit
  --tsv TSV            ntSynt synteny block file
  --fai FAI [FAI ...]  FAI files for the compared genomes

More information can be found on our wiki page

Tips

  • To lower the peak memory usage, increase the false positive rate (--fpr) for the constructed Bloom filter
  • Customize parameters such as --merge, --indel, --block_size and --w_rounds for your particular input data and research questions
  • For example scripts for visualizing the output synteny blocks, see the sub-directory visualization_scripts
  • If you do not know the approximate sequence divergence between the input assemblies, we recommend using Mash to estimate the divergences

License

ntSynt Copyright (c) 2023 British Columbia Cancer Agency Branch. All rights reserved.

ntSynt is released under the GNU General Public License v3

This program is free software: you can redistribute it and/or modify it under the terms of the GNU General Public License as published by the Free Software Foundation, version 3.

This program is distributed in the hope that it will be useful, but WITHOUT ANY WARRANTY; without even the implied warranty of MERCHANTABILITY or FITNESS FOR A PARTICULAR PURPOSE. See the GNU General Public License for more details.

You should have received a copy of the GNU General Public License along with this program. If not, see http://www.gnu.org/licenses/.

For commercial licensing options, please contact Patrick Rebstein [email protected]