ntStat is a toolkit that uses Bloom filters to track both k-mer count and depth information for use in downstream applications. ntStat also models the k-mer count histogram using a mixture model, and infers valuable insights about the genome (size and heterozygosity), sequencing data (ratio of non-erroneous k-mers), and individual k-mers (probability of a k-mer being erroneous, given its count), de novo.
- Make sure all dependencies are available in the setup environment.
- Download the latest release tarball
- Install ntStat using meson by running the following commands in the project's root directory:
meson setup build
meson install -C build
Add --prefix PREFIX to meson setup to change the installation prefix to PREFIX. For example, --prefix=$CONDA_PREFIX will install ntStat in the activated conda environment:
meson setup build --prefix=$CONDA_PREFIX
meson install -C build
Add --destdir DESTDIR to meson install to stage the installation under DESTDIR, with the prefix kept as a subdirectory (usr/local in the example below). This is useful for development purposes. Make sure the bin and site-packages directories in the installation path are available in PATH and PYTHONPATH, respectively:
meson setup build
meson install -C build --destdir install
export PATH="$(pwd)/build/install/usr/local/bin/:$PATH"
export PYTHONPATH="$(pwd)/build/install/usr/local/lib/python3.10/site-packages:$PYTHONPATH"
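The export lines above hard-code python3.10, while the directory name follows the Python version used at install time. The following is a minimal sketch, assuming the staged layout shown above (build/install/usr/local), that prints matching export lines for the running interpreter. The helper name staged_paths is hypothetical, and some systems use a different library directory (e.g., lib64), so treat the output as a starting point.

```python
import sys
from pathlib import Path


def staged_paths(build_dir, destdir="install", prefix="usr/local"):
    """Return (bin_dir, site_packages_dir) for a DESTDIR-staged install.

    Assumes the layout from the commands above:
    <build_dir>/<destdir>/<prefix>/{bin,lib/pythonX.Y/site-packages}
    """
    root = Path(build_dir) / destdir / prefix
    py = f"python{sys.version_info.major}.{sys.version_info.minor}"
    return root / "bin", root / "lib" / py / "site-packages"


if __name__ == "__main__":
    bin_dir, site_dir = staged_paths(Path.cwd() / "build")
    print(f'export PATH="{bin_dir}:$PATH"')
    print(f'export PYTHONPATH="{site_dir}:$PYTHONPATH"')
```

Run it from the project's root directory and paste the printed lines into the shell.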
To install all dependencies in a fresh conda environment called ntstat:
conda env create --file environment.yaml
Alternatively, install the required dependencies according to their own installation guides. Download all single-header C++ libraries and add their paths to CPATH. If the hist module will be used, install the required Python packages using any package manager (e.g., pip).
- Python >= 3.9
- C++ compiler with C++17 support
- Meson and meson-python
- ntCard
- btllib >=1.7.3
- matplotlib
- numpy
- scipy
- tabulate
Calculates TF-IDF information from sequencing data. This module outputs two counting Bloom filters: one for k-mer/seed counts (TF) and the other for depths (DF). The number of reads is also printed to stdout.
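To make the two quantities concrete, here is a toy sketch that computes exact counts and depths with dictionaries. It reads DF as the number of reads that contain a k-mer at least once, which is an interpretation of the description above. It also ignores canonical (reverse-complement) k-mers and the approximation introduced by Bloom filters, so it illustrates the definitions only and is not how ntStat is implemented.

```python
from collections import Counter


def count_and_depth(reads, k):
    """Return (counts, depths) for all k-mers in `reads`.

    counts[kmer]: total occurrences across all reads (TF).
    depths[kmer]: number of reads containing the k-mer at least once (DF).
    """
    counts, depths = Counter(), Counter()
    for read in reads:
        kmers = [read[i:i + k] for i in range(len(read) - k + 1)]
        counts.update(kmers)
        depths.update(set(kmers))  # each read contributes at most once
    return counts, depths


reads = ["ACGTACGT", "ACGTTT"]
counts, depths = count_and_depth(reads, k=4)
print(counts["ACGT"], depths["ACGT"])  # prints "3 2"
```

ACGT occurs twice in the first read and once in the second, so its count is 3 while its depth is 2.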
Usage: ntstat count [--help] [--version] [-k VAR] [-s VAR] -f VAR [-e VAR] [-b VAR] [--long] [-t VAR] -o VAR reads
Positional arguments:
reads path to sequencing data file(s) [nargs: 0 or more] [required]
Optional arguments:
-h, --help shows help message and exits
-v, --version prints version information and exits
-k k-mer length
-s path to spaced seeds file (one per line, if -k not specified)
-f path to k-mer spectrum file (from ntCard) [required]
-e target output false positive rate [nargs=0..1] [default: 0.0001]
-b size of each output CBF (bytes)
--long optimize for long read data
-t number of threads [nargs=0..1] [default: 1]
-o path to store output file [required]
- Set -s to count spaced seeds instead of k-mers. -k will be ignored in this case. The file pointed to by -s should be a plain text file containing spaced seed patterns, one per line, with 1 for care and 0 for don't care positions.
- -f should point to the histogram file generated by ntCard. Use the same -k for ntCard and ntStat. For spaced seeds, set ntCard's -k parameter to the seed length.
- Use -b to manually set the output data structure's size. Alternatively, set -e to the desired false positive rate (0.0001 by default) and ntStat will automatically calculate a suitable value for -b.
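The spaced seed file format described in the notes above can be checked before running ntStat. The sketch below parses such a file's contents, validates the patterns, and shows which bases of a sequence window a seed keeps. The function names are hypothetical, and the shared-length check is an assumption based on the fact that a single seed length is passed to ntCard.

```python
def parse_seeds(text):
    """Parse spaced seed patterns (one per line; 1 = care, 0 = don't care)."""
    seeds = [line.strip() for line in text.splitlines() if line.strip()]
    for seed in seeds:
        if set(seed) - {"0", "1"}:
            raise ValueError(f"invalid seed pattern: {seed!r}")
    # Assumption: all seeds share one length (the value given to ntCard's -k).
    if len({len(seed) for seed in seeds}) > 1:
        raise ValueError("all seed patterns should have the same length")
    return seeds


def apply_seed(seed, window):
    """Replace don't-care positions of `window` with '-' for display."""
    return "".join(b if s == "1" else "-" for s, b in zip(seed, window))


seeds = parse_seeds("110011\n101101\n")
print(len(seeds[0]))                   # prints 6, the seed length
print(apply_seed(seeds[0], "ACGTAC"))  # prints "AC--AC"
```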
Gathers k-mer counts ONLY and saves k-mers that occur at least -cmin times and at most -cmax times in a Bloom filter. Optionally, the output can be a counting Bloom filter (set --counts), which includes the filtered k-mers' counts, resulting in an 8x larger output file.
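The 8x figure is consistent with a Bloom filter storing one bit per cell and a counting Bloom filter storing an 8-bit counter per cell (counts are capped at 255). The sketch below uses the textbook Bloom filter sizing formula to compare the two. It is an illustration under those assumptions; ntStat's own sizing logic for -b and -e may differ, and the item count is hypothetical.

```python
import math


def bloom_bits(n_items, fpr):
    """Textbook optimal cell count: m = -n * ln(p) / (ln 2)^2."""
    return math.ceil(-n_items * math.log(fpr) / (math.log(2) ** 2))


def filter_bytes(n_items, fpr, bits_per_cell):
    """Size in bytes when every cell takes `bits_per_cell` bits."""
    return math.ceil(bloom_bits(n_items, fpr) * bits_per_cell / 8)


n = 1_000_000  # hypothetical number of distinct k-mers kept
bf = filter_bytes(n, 1e-4, bits_per_cell=1)   # plain Bloom filter
cbf = filter_bytes(n, 1e-4, bits_per_cell=8)  # 8-bit counters (max 255)
print(bf, cbf, round(cbf / bf, 2))
```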
Usage: ntstat filter [--help] [--version] [-k VAR] [-s VAR] -f VAR [-e VAR] [-b VAR] [-cmin VAR] [-cmax VAR] [--counts] [--long] [-t VAR] -o VAR reads
Positional arguments:
reads path to sequencing data file(s) [nargs: 0 or more] [required]
Optional arguments:
-h, --help shows help message and exits
-v, --version prints version information and exits
-k k-mer length
-s path to spaced seeds file (one per line, if -k not specified)
-f path to k-mer spectrum file (from ntCard) [required]
-e target output false positive rate [nargs=0..1] [default: 0.0001]
-b output BF/CBF size (bytes)
-cmin minimum count threshold (>=1, or 0 for first minimum) [nargs=0..1] [default: 1]
-cmax maximum count threshold (<=255) [nargs=0..1] [default: 255]
--counts output counts (requires ~8x RAM for CBF)
--long optimize for long read data
-t number of threads [nargs=0..1] [default: 1]
-o path to store output file [required]
- Refer to the count module's notes for information about the parameters.
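Setting -cmin to 0 selects the first minimum of the k-mer count histogram as the lower threshold. The sketch below shows one way to locate such a minimum in an ntCard-style histogram. The assumed file layout (F1 and F0 rows followed by count/frequency rows) and the definition of "first minimum" used here are assumptions for illustration; ntStat's exact rule may differ, and the example numbers are made up.

```python
def parse_ntcard_hist(text):
    """Split an ntCard-style histogram into totals and per-count rows.

    Assumed layout: whitespace-separated 'F1 <total>' and 'F0 <distinct>'
    rows, followed by '<count> <number of k-mers>' rows.
    """
    totals, hist = {}, {}
    for line in text.splitlines():
        fields = line.split()
        if len(fields) != 2:
            continue
        key, value = fields
        if key.isdigit():
            hist[int(key)] = int(value)
        else:
            totals[key] = int(value)
    return totals, hist


def first_minimum(hist):
    """Smallest count c whose frequency is lower than that of the next count.

    Expects a non-empty histogram.
    """
    counts = sorted(hist)
    for c, nxt in zip(counts, counts[1:]):
        if hist[c] < hist[nxt]:
            return c
    return counts[0]  # monotonically decreasing: no valley found


example = "F1\t2970\nF0\t1100\n1\t500\n2\t120\n3\t40\n4\t90\n5\t350\n"
totals, hist = parse_ntcard_hist(example)
print(totals["F0"], first_minimum(hist))  # prints "1100 3"
```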
Converts information from a BF/CBF to TSV format.
usage: ntstat query [-h] -b B [-s S] -o O data [data ...]
positional arguments:
data path to query data
options:
-h, --help show this help message and exit
-b B path to BF/CBF file
-s S path to spaced seeds file
-o O path to output TSV file
- The input data can be any FASTA, FASTQ, or BAM. Compressed files are supported. To read from stdin, use -.
- If the data structure contains spaced seed information, pass the path to the spaced seeds text file to -s.
Fits a mixture model to the k-mer count histogram to estimate k-mer error and heterozygous probabilities, genome size, and thresholds useful for the filter command.
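As a rough illustration of how a mixture model yields a per-count error probability, the sketch below applies Bayes' rule to a two-component mixture of negative binomials. The component choice, weights, and parameters are hypothetical and chosen only for illustration; they are not ntStat's fitted model, which is estimated from the histogram.

```python
import math


def nbinom_pmf(c, r, p):
    """Negative binomial PMF: P(X = c) with size r > 0 and success prob p."""
    log_pmf = (math.lgamma(c + r) - math.lgamma(r) - math.lgamma(c + 1)
               + r * math.log(p) + c * math.log(1 - p))
    return math.exp(log_pmf)


def error_posterior(c, weights, params):
    """P(component 0 | count = c) for a mixture of negative binomials.

    Component 0 plays the role of erroneous k-mers in this toy model.
    """
    joint = [w * nbinom_pmf(c, r, p) for w, (r, p) in zip(weights, params)]
    return joint[0] / sum(joint)


# Hypothetical parameters, chosen only for illustration.
weights = [0.6, 0.4]                # erroneous vs. genomic k-mers
params = [(1.0, 0.6), (30.0, 0.5)]  # (r, p): means of about 0.67 and 30
for c in (1, 5, 30):
    print(c, round(error_posterior(c, weights, params), 4))
```

Low counts are attributed almost entirely to the error component, and counts near the coverage peak almost entirely to the genomic component.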
usage: ntstat hist [-h]
[-f {asciidoc,double_grid,double_outline,fancy_grid,fancy_outline,github,grid,heavy_grid,heavy_outline,html,jira,latex,latex_booktabs,latex_longtable,latex_raw,mediawiki,mixed_grid,mixed_outline,moinmoin,orgtbl,outline,pipe,plain,presto,pretty,psql,rounded_grid,rounded_outline,rst,simple,simple_grid,simple_outline,textile,tsv,unsafehtml,youtrack}]
[-m STYLE] [-t TITLE] [--y-log | --no-y-log] [-r PLOT_RANGE] [-o PLOT] [--probs PROBS] [--fit-gif FIT_GIF] [-c CONFIG] [--no-model]
path
positional arguments:
path k-mer spectrum file (from ntCard)
options:
-h, --help show this help message and exit
-f {asciidoc,double_grid,double_outline,fancy_grid,fancy_outline,github,grid,heavy_grid,heavy_outline,html,jira,latex,latex_booktabs,latex_longtable,latex_raw,mediawiki,mixed_grid,mixed_outline,moinmoin,orgtbl,outline,pipe,plain,presto,pretty,psql,rounded_grid,rounded_outline,rst,simple,simple_grid,simple_outline,textile,tsv,unsafehtml,youtrack}, --table-format {asciidoc,double_grid,double_outline,fancy_grid,fancy_outline,github,grid,heavy_grid,heavy_outline,html,jira,latex,latex_booktabs,latex_longtable,latex_raw,mediawiki,mixed_grid,mixed_outline,moinmoin,orgtbl,outline,pipe,plain,presto,pretty,psql,rounded_grid,rounded_outline,rst,simple,simple_grid,simple_outline,textile,tsv,unsafehtml,youtrack}
stdout table format
-m STYLE, --style STYLE
matplotlib style file, url, or one of available style names: ntstat.hist.default, ntstat.hist.paper, Solarize_Light2, _classic_test_patch, _mpl-gallery, _mpl-gallery-nogrid,
bmh, classic, dark_background, fast, fivethirtyeight, ggplot, grayscale, seaborn-v0_8, seaborn-v0_8-bright, seaborn-v0_8-colorblind, seaborn-v0_8-dark, seaborn-v0_8-dark-
palette, seaborn-v0_8-darkgrid, seaborn-v0_8-deep, seaborn-v0_8-muted, seaborn-v0_8-notebook, seaborn-v0_8-paper, seaborn-v0_8-pastel, seaborn-v0_8-poster, seaborn-v0_8-talk,
seaborn-v0_8-ticks, seaborn-v0_8-white, seaborn-v0_8-whitegrid, tableau-colorblind10
-t TITLE, --title TITLE
title to put on plot
--y-log, --no-y-log plot y-axis in log scale
-r PLOT_RANGE, --plot-range PLOT_RANGE
plot x-axis limits (inclusive) separated by a colon, i.e., use a:b to show results in the range [a, b]. 'auto' will automatically adjust the limits for better visibility.
-o PLOT, --plot PLOT path to output plot
--probs PROBS path to output probabilities in csv format
--fit-gif FIT_GIF path to output model fit history animation
-c CONFIG, --config CONFIG
path to differential evolution config file (json)
- ntStat prints comprehensive information about the model, dataset, and k-mers. This information can be in various formats supported by the tabulate package, set by the -f parameter.
- The differential evolution algorithm's parameters can be controlled by a JSON file passed to -c. Refer to scipy's documentation for information about these parameters. Example configuration for increasing exploration, in case ntStat fails to find a good model:
{
"popsize": 8,
"init": "sobol",
"mutation": [
0.8,
1.2
],
"recombination": 0.5,
"strategy": "best1bin"
}
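A configuration like the one above can be sanity-checked before it is passed to -c. The sketch below is a minimal example, assuming only the ranges that scipy documents for the mutation and recombination arguments of its differential evolution routine. The helper check_de_config is hypothetical and does not cover every option.

```python
import json


def check_de_config(config):
    """Sanity-check a few differential evolution settings before use.

    The ranges follow scipy's documented bounds for these arguments.
    """
    mutation = config.get("mutation", 0.5)
    values = mutation if isinstance(mutation, list) else [mutation]
    if not all(0 <= m < 2 for m in values):
        raise ValueError("mutation values must lie in [0, 2)")
    if not 0 <= config.get("recombination", 0.7) <= 1:
        raise ValueError("recombination must lie in [0, 1]")
    return config


config = {
    "popsize": 8,
    "init": "sobol",
    "mutation": [0.8, 1.2],
    "recombination": 0.5,
    "strategy": "best1bin",
}
# Print the validated configuration; redirect the output to a file for -c.
print(json.dumps(check_de_config(config), indent=2))
```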
- If the -o/--plot parameter is set and points to a valid path for the output file, ntStat will plot the histogram, model, and estimated k-mer robustness rate, coverage, genome size, and heterozygosity. All matplotlib formats are supported (e.g., PNG, JPG, PDF).
- The output figure's title can be changed using the -t parameter. Title text can be italicized by putting words between ^s (e.g., ^H. sapiens^ ONT, k=30).
- Setting --y-log will plot the y-axis in log scale.
- Use -r to limit the range of the plot's x-axis. ntStat can set a suitable maximum count for visibility (0.999 of the final negative binomial component) using -r auto.
- The model's output value for each k-mer count can be saved to a CSV file pointed to by the --probs parameter.
- For debugging purposes and tweaking the DE algorithm's configuration (-c), --fit-gif can be used to plot the best model in each iteration and save the plots as a GIF file at the path given by this parameter. Generating a GIF will take a few minutes to complete.
- Parham Kazemi
- Lauren Coombe
- René L Warren
- Inanc Birol
ntStat Copyright (c) 2024-present British Columbia Cancer Agency Branch. All rights reserved.
ntStat is released under the GNU General Public License v3
This program is free software: you can redistribute it and/or modify it under the terms of the GNU General Public License as published by the Free Software Foundation, version 3.
This program is distributed in the hope that it will be useful, but WITHOUT ANY WARRANTY; without even the implied warranty of MERCHANTABILITY or FITNESS FOR A PARTICULAR PURPOSE. See the GNU General Public License for more details.
You should have received a copy of the GNU General Public License along with this program. If not, see http://www.gnu.org/licenses/.
For commercial licensing options, please contact Patrick Rebstein [email protected]