Skip empty FASTA sequences, and allow to force taxids #27

Open
wants to merge 117 commits into
base: master
4487ec1
Added .gitignore files
fbreitwieser Jul 10, 2015
a07b9a2
Skip empty FASTA sequences instead of exiting the program
fbreitwieser Jul 10, 2015
d6071da
Added options -T to force taxid of the sequences, and -v for verbose …
fbreitwieser Jul 10, 2015
14b74e2
added comments
fbreitwieser Jul 10, 2015
567ae7b
update
fbreitwieser Sep 17, 2015
dfecb31
Added '>kraken:taxid|' header parsing to set_lcas - makes it possible…
fbreitwieser Dec 4, 2015
9fc61e3
Only report missing kmers when verbose
fbreitwieser Dec 20, 2015
c71ffbc
Use 'find ... -exec cat' instead 'find .. -print0 | xargs -0 cat'
fbreitwieser Dec 9, 2016
8259c6a
Allow multiple --db arguments
fbreitwieser Dec 9, 2016
7c678d6
Count unique k-mers for each taxid
fbreitwieser Feb 9, 2017
3e5a009
Skip two gi2seqid and seqid2taxid map creation for now
fbreitwieser Feb 12, 2017
93904c7
Don't display counts
fbreitwieser Feb 15, 2017
905fa08
Fix Makefile
fbreitwieser Feb 15, 2017
7b45304
Added taxdb from k-SLAM for writing report after classification
fbreitwieser Feb 15, 2017
330e186
Build report and taxDB in Kraken
fbreitwieser Feb 16, 2017
c60b8a2
Allow compressed output, and fix report file generation
fbreitwieser Feb 18, 2017
aa6b179
Don't use nodes file in classify
fbreitwieser Feb 20, 2017
0587b5e
Added generate-taxonomy-ids-for-sequences and typed TaxonomyDB
fbreitwieser Feb 21, 2017
06e7f7e
Use less critical pragma - better parallelization
fbreitwieser Feb 21, 2017
ff63944
Fix for using multiple databases for search
fbreitwieser Feb 22, 2017
6476863
Update README.md
fbreitwieser Feb 22, 2017
8f4b184
Do not need taxonomy nodes anymore
fbreitwieser Apr 12, 2017
77cac13
Check if there's a taxonomy entry
fbreitwieser Apr 12, 2017
56c5872
Merge branch 'taxid-bitmap' of https://github.com/fbreitwieser/kraken…
fbreitwieser Apr 12, 2017
4073388
Merge branch 'master' of https://github.com/fbreitwieser/kraken
fbreitwieser Apr 12, 2017
dcaec29
Use template for ReadCounts
fbreitwieser Apr 12, 2017
2ddf645
Merge branch 'taxid-bitmap'
fbreitwieser Apr 12, 2017
53560de
Renamed to kraken_hll
fbreitwieser Apr 12, 2017
c6cf5bc
Fixed building of kraken_hll
fbreitwieser Apr 12, 2017
e76de48
Update README.md
fbreitwieser Apr 12, 2017
83c075f
Show numer of unique k-mers
fbreitwieser Apr 12, 2017
cf4aeae
Start all scripts with kraken_hll
fbreitwieser May 4, 2017
e683cbf
Refactor TaxonomyDB constructor
fbreitwieser May 4, 2017
ffea4a7
Build taxdb in the end
fbreitwieser May 4, 2017
c550a1b
Update build_taxdb.cpp
fbreitwieser May 4, 2017
112b89d
Include number of k-mers in database in taxDB
fbreitwieser May 5, 2017
78a61d4
Update to read columns
fbreitwieser May 7, 2017
7ae1f5d
Added taxdb.cpp
fbreitwieser Aug 13, 2017
c5318e0
Add UID mapping to Kraken-HLL
fbreitwieser Aug 25, 2017
2873a79
Renamed to KrakenU
fbreitwieser Aug 28, 2017
607cb0c
Put version string in separate file
fbreitwieser Aug 31, 2017
432d6ce
Fixed script paths
fbreitwieser Aug 31, 2017
e52b7e0
Major update for UID mapping and having taxon entries for assemblies
fbreitwieser Sep 22, 2017
bcca5a8
Updated build script, and add some info when loading database
fbreitwieser Sep 25, 2017
a13e9fc
Added jellyfish submodule
fbreitwieser Sep 25, 2017
4d36940
specify branch for jellyfish
fbreitwieser Sep 25, 2017
e8f6873
Instal jellyfish from code archive
fbreitwieser Sep 25, 2017
c60e100
Update .gitignore
fbreitwieser Sep 25, 2017
a2f75cb
Use /usr/bin/env
fbreitwieser Sep 25, 2017
8a434fc
Much faster krakenu-download using forks and LWP
fbreitwieser Sep 25, 2017
7185d73
Add test files
fbreitwieser Sep 25, 2017
c7ea4b8
Added files for automated tests
fbreitwieser Sep 28, 2017
f930764
Added OSX files
fbreitwieser Sep 28, 2017
a359ba3
Fix jellyfish installation - don't use make install
fbreitwieser Sep 28, 2017
b32fdd8
Add parameters --library-dir and --taxonomy-dir
fbreitwieser Sep 28, 2017
3c4e7e4
Look for locally installed jellyfish, first
fbreitwieser Sep 28, 2017
3b7642d
Add slurp_file for file reading - fix for OSX
fbreitwieser Sep 28, 2017
da0978e
Added classification rater (not working yet)
fbreitwieser Sep 28, 2017
f49630a
Allow multiple library directories on command line
fbreitwieser Sep 30, 2017
314f49c
Update on tests
fbreitwieser Sep 30, 2017
3898b25
Added script to dump taxDB to NCBI dump format
fbreitwieser Oct 1, 2017
94b4326
Added dusting and 'standard' db to testing
fbreitwieser Oct 1, 2017
4130390
Make building work for OSX and Linux
fbreitwieser Oct 1, 2017
cfd0422
Fixes for building and downloading
fbreitwieser Oct 1, 2017
a88535a
Fixed contaminants download
fbreitwieser Oct 1, 2017
0145ef4
Fix for Linux/OSX building
fbreitwieser Oct 1, 2017
bb1e65d
Minor speed improvements in classify
fbreitwieser Oct 1, 2017
7bbd986
Correct number of arguments
fbreitwieser Oct 1, 2017
1021369
Ignore taxa not in taxonomy DB
fbreitwieser Oct 1, 2017
c6b2d04
Add comments
fbreitwieser Oct 1, 2017
131daed
Update to read sim
fbreitwieser Oct 1, 2017
3f5faf1
Update to read sim
fbreitwieser Oct 1, 2017
436938d
Merge branch 'master' of https://github.com/fbreitwieser/kraken-hll
fbreitwieser Oct 1, 2017
6e909c1
Various improvements and fixes for building and classification
fbreitwieser Oct 2, 2017
69815b8
Special treatment for host and conaminant taxids
fbreitwieser Oct 2, 2017
04fa400
Added build
fbreitwieser Oct 2, 2017
4206549
Add new columns
fbreitwieser Oct 2, 2017
e84091a
Up default precision to 12
fbreitwieser Oct 2, 2017
0afdcf7
use unordered maps
fbreitwieser Oct 2, 2017
1fc7d59
Merge branch 'master' of https://github.com/fbreitwieser/kraken-hll
fbreitwieser Oct 2, 2017
13adecf
Check library directories exist
fbreitwieser Oct 2, 2017
0924755
Merge branch 'master' of https://github.com/fbreitwieser/kraken-hll
fbreitwieser Oct 2, 2017
1eda03e
Use uid_mapping headers
fbreitwieser Oct 2, 2017
e116059
Merge branch 'master' of https://github.com/fbreitwieser/kraken-hll
fbreitwieser Oct 2, 2017
b84380b
update init.sh
fbreitwieser Oct 2, 2017
a352122
Fix bug in reading names and nodes
fbreitwieser Oct 2, 2017
f05a219
Dump taxdb without ending separators (for kraken-report)
fbreitwieser Oct 18, 2017
2172caa
Allow to run build_taxdb with a taxdb as consistency check
fbreitwieser Oct 18, 2017
eb9447e
Update to classification grading
fbreitwieser Oct 18, 2017
75b24ae
Do not give separate taxids for human sequences (fixed now)
fbreitwieser Oct 18, 2017
f15112d
Add species subgroup and group ranks
fbreitwieser Oct 18, 2017
3090304
Don't use .at() for vector access
fbreitwieser Oct 18, 2017
9c11d87
Test modification of lca
fbreitwieser Oct 18, 2017
cdc9fb6
Merge branch 'master' of https://github.com/fbreitwieser/kraken-hll
fbreitwieser Oct 18, 2017
7d1ab24
Update
fbreitwieser Oct 23, 2017
5ea9f06
Update
fbreitwieser Oct 25, 2017
970c4ca
Many small changes to make it working for GCC 4.4 / C++0x
fbreitwieser Oct 25, 2017
d5b8dc2
Make sure TaxDB is not copied
fbreitwieser Nov 5, 2017
16813a7
Fox parent map generation in taxDB
fbreitwieser Nov 5, 2017
7466810
Add environment CPPFLAGS and LDFLAGS
fbreitwieser Nov 5, 2017
dd7fc4f
Change name to KrakenHLL
fbreitwieser Nov 6, 2017
0c33f0e
Change name to KrakenHLL
fbreitwieser Nov 6, 2017
274e41f
Update
fbreitwieser Nov 8, 2017
fadae7b
Merge branch 'master' of https://github.com/fbreitwieser/kraken-hll
fbreitwieser Nov 8, 2017
69d8352
Create taxDB if not present
fbreitwieser Nov 8, 2017
d83c579
Compute k-mer counts if not present
fbreitwieser Nov 8, 2017
96ecc4b
Merge branch 'master' of https://github.com/fbreitwieser/kraken-hll
fbreitwieser Nov 8, 2017
30a6538
Fix indent
fbreitwieser Nov 8, 2017
4e3cbd3
Merge branch 'master' of https://github.com/fbreitwieser/kraken-hll
fbreitwieser Nov 8, 2017
f34f8d4
Fix gzstream compilation, update on HLL
fbreitwieser Nov 9, 2017
93c155f
Fix gzstream building and licensing
fbreitwieser Nov 9, 2017
c6871c1
Fix licensing
fbreitwieser Nov 9, 2017
e4904a7
Merge branch 'master' of https://github.com/fbreitwieser/kraken-hll
fbreitwieser Nov 9, 2017
53314ae
Fix taxonomy reporting order and percentage
fbreitwieser Nov 9, 2017
10f5998
Fixed HLL bug introduced by using vector for sparse representation
fbreitwieser Nov 10, 2017
1d37b54
Fix Makefile
fbreitwieser Nov 10, 2017
a95bd85
Allow setting custom precision
fbreitwieser Nov 11, 2017
7 changes: 7 additions & 0 deletions .gitignore
@@ -0,0 +1,7 @@
/install/
/Debug/
/tests/dbs
/tests/data
/tests/install
*.dSYM
\.idea/
90 changes: 0 additions & 90 deletions CHANGELOG

This file was deleted.

49 changes: 42 additions & 7 deletions README.md
@@ -1,10 +1,45 @@
Kraken taxonomic sequence classification system
KrakenHLL taxonomic sequence classification system with unique k-mer counting
===============================================

Please see the [Kraken webpage] or the [Kraken manual]
for information on installing and operating Kraken.
A local copy of the [Kraken manual] is also present here
in the `docs/` directory (`MANUAL.html` and `MANUAL.markdown`).
[Kraken](https://github.com/DerrickWood/kraken) is a fast taxonomic classifier for metagenomics data. This project, kraken-hll, adds further functionality - most notably a unique k-mer count using the HyperLogLog algorithm. Spurious identifications due to sequence contamination in the dataset or database often lead to many reads; however, these reads usually cover only a small portion of the genome, so they yield few unique k-mers.

[Kraken webpage]: http://ccb.jhu.edu/software/kraken/
[Kraken manual]: http://ccb.jhu.edu/software/kraken/MANUAL.html
KrakenHLL adds two additional columns to the Kraken report - the total number of k-mers observed for a taxon, and the total number of unique k-mers observed for that taxon (columns 3 and 4, respectively).
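
To give a feel for how a unique k-mer count can be estimated in small, fixed memory, here is a minimal HyperLogLog sketch. It is not KrakenHLL's implementation: the class name, the hash function (SHA-1 truncated to 64 bits) and the linear-counting correction for small counts are assumptions made for this illustration. The precision of 12 follows the default mentioned in commit e84091a, on the assumption that it refers to the HLL precision.

```python
import hashlib
import math


class HyperLogLog:
    """Minimal HyperLogLog cardinality estimator (illustrative only)."""

    def __init__(self, precision=12):
        # The bias constant used in estimate() assumes at least 128 registers.
        if precision < 7:
            raise ValueError("this sketch assumes precision >= 7")
        self.p = precision
        self.m = 1 << precision            # number of registers
        self.registers = [0] * self.m

    def add(self, item):
        # 64-bit hash of the item (hash choice is an assumption of this sketch).
        digest = hashlib.sha1(item.encode()).digest()
        h = int.from_bytes(digest[:8], "big")
        tail_bits = 64 - self.p
        index = h >> tail_bits                     # top p bits select a register
        rest = h & ((1 << tail_bits) - 1)          # remaining bits
        # Rank = 1-based position of the leftmost 1-bit in the remaining bits.
        rank = tail_bits - rest.bit_length() + 1
        self.registers[index] = max(self.registers[index], rank)

    def estimate(self):
        alpha = 0.7213 / (1 + 1.079 / self.m)
        raw = alpha * self.m ** 2 / sum(2.0 ** -r for r in self.registers)
        zeros = self.registers.count(0)
        if raw <= 2.5 * self.m and zeros:
            # Small-range correction: fall back to linear counting.
            return self.m * math.log(self.m / zeros)
        return raw
```

Adding the same k-mer repeatedly never changes a register, which is why the estimate tracks distinct k-mers rather than total k-mer hits.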

Here's a small example of a classification against a viral database with k=25. There are three species identified by just one read - Enterobacteria phage BP-4795, Salmonella phage SEN22, and Sulfolobus monocaudavirus SMV1. Out of those, the identification of Salmonella phage SEN22 is the strongest, as the read was matched with 116 k-mers that are unique to the sequence, while the match to Sulfolobus monocaudavirus SMV1 is based on only a single 25-mer.

```
99.0958 2192 2192 255510 272869 no rank 0 unclassified
0.904159 20 0 2361 2318 no rank 1 root
0.904159 20 0 2361 2318 superkingdom 10239 Viruses
0.904159 20 0 2361 2318 no rank 35237 dsDNA viruses, no RNA stage
0.768535 17 0 2074 2063 order 548681 Herpesvirales
0.768535 17 0 2074 2063 family 10292 Herpesviridae
0.768535 17 0 2074 2063 subfamily 10374 Gammaherpesvirinae
0.768535 17 0 2074 2063 genus 10375 Lymphocryptovirus
0.768535 17 16 2001 1987 species 10376 Human gammaherpesvirus 4
0.045208 1 1 4 4 sequence 1000041143 KC207814.1 Human herpesvirus 4 strain Mutu, complete genome
0.0904159 2 0 254 254 order 28883 Caudovirales
0.045208 1 0 28 28 family 10699 Siphoviridae
0.045208 1 0 28 28 genus 186765 Lambdavirus
0.045208 1 0 28 28 no rank 335795 unclassified Lambda-like viruses
0.045208 1 1 28 28 species 196242 Enterobacteria phage BP-4795
0.045208 1 0 116 116 family 10744 Podoviridae
0.045208 1 0 116 116 no rank 196895 unclassified Podoviridae
0.045208 1 0 116 116 no rank 1758253 Escherichia phage phi191 sensu lato
0.045208 1 1 116 116 species 1647458 Salmonella phage SEN22
0.045208 1 0 1 1 no rank 51368 unclassified dsDNA viruses
0.045208 1 1 1 1 species 1351702 Sulfolobus monocaudavirus SMV1
```
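
A report like the one above can be screened programmatically for weakly supported species. The following is a sketch under stated assumptions: it assumes tab-separated lines with the field order seen in the sample (percentage, clade reads, taxon reads, k-mers, unique k-mers, rank, taxid, name). The names `ReportRow`, `parse_report_line` and `weak_species`, and the threshold of 10 unique k-mers, are hypothetical and not part of KrakenHLL.

```python
from typing import NamedTuple


class ReportRow(NamedTuple):
    percent: float
    clade_reads: int
    taxon_reads: int
    kmers: int
    unique_kmers: int
    rank: str
    taxid: int
    name: str


def parse_report_line(line: str) -> ReportRow:
    """Parse one tab-separated report line (field order assumed from the sample)."""
    f = line.rstrip("\n").split("\t", 7)
    return ReportRow(float(f[0]), int(f[1]), int(f[2]), int(f[3]), int(f[4]),
                     f[5], int(f[6]), f[7].strip())


def weak_species(rows, min_unique_kmers=10):
    """Species rows backed by fewer unique k-mers than the (arbitrary) threshold."""
    return [r for r in rows
            if r.rank == "species" and r.unique_kmers < min_unique_kmers]


# Three single-read species from the sample report, re-typed with tabs.
sample = [
    "0.045208\t1\t1\t28\t28\tspecies\t196242\tEnterobacteria phage BP-4795",
    "0.045208\t1\t1\t116\t116\tspecies\t1647458\tSalmonella phage SEN22",
    "0.045208\t1\t1\t1\t1\tspecies\t1351702\tSulfolobus monocaudavirus SMV1",
]
rows = [parse_report_line(line) for line in sample]
flagged = [r.name for r in weak_species(rows)]
```

With this threshold only Sulfolobus monocaudavirus SMV1 is flagged, matching the reasoning in the text; a sensible cutoff depends on read length and k.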

## Usage

For usage, see `krakenhll --help`. Note that you can use the same database as Kraken with one difference - instead of the files `DB_DIR/taxonomy/nodes.dmp` and `DB_DIR/taxonomy/names.dmp` that Kraken relies upon, `kraken-hll` needs the file `DB_DIR/taxDB`. This can be generated with the script `build_taxdb`: `KRAKEN_DIR/build_taxdb DB_DIR/taxonomy/names.dmp DB_DIR/taxonomy/nodes.dmp > DB_DIR/taxDB`. The code behind the taxDB is based on [k-SLAM](https://github.com/aindj/k-SLAM).

### Differences to `kraken`
- Use `krakenhll --report-file FILENAME ...` to write the kraken report to `FILENAME`.
- Use `krakenhll --db DB1 --db DB2 --db DB3 ...` to first attempt, for each k-mer, to assign it based on DB1, then DB2, then DB3. You can use this to prefer identifications based on DB1 (e.g. human and contaminant sequences), then DB2 (e.g. completed bacterial genomes), then DB3, etc. Note that this option is incompatible with `krakenhll-build --generate-taxonomy-ids-for-sequences`, since the taxDB has to be identical across all databases.
- Add the suffix `.gz` to an output file name to generate gzipped output.
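
The prioritized lookup across several `--db` arguments can be pictured as follows. This is only a sketch of the idea described above, using plain dictionaries as stand-ins for databases; the function name `lookup_kmer`, the example k-mers and the use of `0` for "not found" are assumptions, not KrakenHLL internals.

```python
def lookup_kmer(kmer, databases):
    """Return the taxid from the first database containing the k-mer, else 0."""
    for db in databases:              # ordered by priority: DB1, DB2, DB3, ...
        taxid = db.get(kmer)
        if taxid is not None:
            return taxid
    return 0                          # 0 stands for "not found" in this sketch


# Toy stand-ins: DB1 holds host k-mers, DB2 holds bacterial k-mers.
db1 = {"ACGTACGT": 9606}
db2 = {"ACGTACGT": 562, "TTTTGGGG": 562}

host_first = lookup_kmer("ACGTACGT", [db1, db2])
fallback = lookup_kmer("TTTTGGGG", [db1, db2])
missing = lookup_kmer("CCCCCCCC", [db1, db2])
```

Because the first hit wins, a k-mer shared between DB1 and DB2 is always attributed to DB1's taxon, which is the behaviour the bullet describes for preferring host and contaminant sequences.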

### Differences to `kraken-build`
- Use `krakenhll-build --generate-taxonomy-ids-for-sequences ...` to add pseudo-taxonomy IDs for each sequence header. An example of the result is in the output above - one read has been assigned specifically to `KC207814.1 Human herpesvirus 4 strain Mutu, complete genome`.
- `seqid2taxid.map`, which maps sequence IDs to taxonomy IDs, does NOT parse or require `>gi|`; rather, the sequence ID is the header up to just before the first space.
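
The sequence ID rule in the last bullet amounts to the following. This is a minimal sketch; the helper name `sequence_id` is hypothetical, and the `gi|...` header in the usage lines is a made-up example.

```python
def sequence_id(header: str) -> str:
    """Return the FASTA header up to (not including) the first space."""
    line = header.rstrip("\n")
    if line.startswith(">"):
        line = line[1:]               # drop the FASTA header marker
    return line.split(" ", 1)[0]


seq_id = sequence_id(">KC207814.1 Human herpesvirus 4 strain Mutu, complete genome")
gi_id = sequence_id(">gi|12345|ref|XX_000001.1| made-up description")
```

A `gi|...` style header is not treated specially here: everything before the first space is kept verbatim as the ID.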
1 change: 1 addition & 0 deletions VERSION
@@ -0,0 +1 @@
0.1
59 changes: 46 additions & 13 deletions install_kraken.sh → install_krakenhll.sh
@@ -1,9 +1,8 @@
#!/bin/bash

# Portions (c) 2017, Florian Breitwieser <[email protected]>
# Copyright 2013-2015, Derrick Wood <[email protected]>
#
# This file is part of the Kraken taxonomic classification system.
#
# Kraken is free software: you can redistribute it and/or modify
# it under the terms of the GNU General Public License as published by
# the Free Software Foundation, either version 3 of the License, or
@@ -19,11 +18,22 @@

set -e

VERSION="0.10.6-unreleased"
DIR=$(dirname $0)
VERSION=`cat $(dirname $0)/VERSION`

if [ "$1" == "--install-jellyfish" ]; then
INSTALL_JELLYFISH=1;
shift;
fi

if [ -z "$1" ] || [ -n "$2" ]
then
echo "Usage: $(basename $0) KRAKEN_DIR"
echo "Usage: $(basename $0) [--install-jellyfish] KRAKEN_DIR

If --install-jellyfish is specified, the source code for version 1.1
is downloaded from http://www.cbcb.umd.edu/software/jellyfish and installed
in KRAKEN_DIR. Note that this may overwrite other jellyfish installation in
the same path."
exit 64
fi

@@ -34,13 +44,32 @@ then
exit 1
fi


# Perl cmd used to canonicalize dirname - "readlink -f" doesn't work
# on OS X.
export KRAKEN_DIR=$(perl -MCwd=abs_path -le 'print abs_path(shift)' "$1")

mkdir -p "$KRAKEN_DIR"
make -C src install
for file in scripts/*
if [ "$INSTALL_JELLYFISH" == "1" ]; then
WD=`pwd`
cd $KRAKEN_DIR
if [[ ! -d jellyfish ]]; then
wget http://www.cbcb.umd.edu/software/jellyfish/jellyfish-1.1.11.tar.gz
tar xvvf jellyfish-1.1.11.tar.gz
mv jellyfish-1.1.11 jellyfish
fi
cd jellyfish
[[ -f Makefile ]] || ./configure
make
#make install ## doest not work for me on OSX
#cp $KRAKEN_DIR/jellyfish-install/bin/jellyfish $KRAKEN_DIR
#rm -r jellyfish-1.1.11.tar.gz jellyfish-1.1.11
cd $WD
fi

#make -C src clean
make -C $DIR/src install
for file in $DIR/scripts/*
do
perl -pl -e 'BEGIN { while (@ARGV) { $_ = shift; ($k,$v) = split /=/, $_, 2; $H{$k} = $v } }'\
-e 's/#####=(\w+)=#####/$H{$1}/g' \
@@ -52,12 +81,16 @@ do
fi
done

echo
echo "Kraken installation complete."
echo
echo "To make things easier for you, you may want to copy/symlink the following"
echo "files into a directory in your PATH:"
for file in $KRAKEN_DIR/kraken*
echo -n "
Kraken installation complete.

To make things easier for you, you may want to copy/symlink the following
files into a directory in your PATH:

ln -s"
for file in $KRAKEN_DIR/krakenhll*
do
[ -x "$file" ] && echo " $file"
[ -x "$file" ] && echo -n " $file"
done
echo " DEST_DIR"
exit 0
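
The Perl one-liner in the install loop above substitutes `#####=NAME=#####` placeholders in each script with values passed as `NAME=VALUE` arguments. For readers less familiar with Perl, the same substitution could be sketched in Python as below. The actual arguments passed to the one-liner are hidden in the collapsed part of the diff, so the keys used in the usage lines are hypothetical, as are the function names.

```python
import re

# Same pattern as the Perl substitution: #####=NAME=#####
PLACEHOLDER = re.compile(r"#####=(\w+)=#####")


def parse_assignments(args):
    """Turn ["KEY=VALUE", ...] into a dict, splitting on the first '=' only.

    Assumes every argument contains at least one '='.
    """
    return dict(arg.split("=", 1) for arg in args)


def fill_placeholders(text, values):
    """Replace each placeholder with values[NAME] (empty string if missing)."""
    return PLACEHOLDER.sub(lambda m: values.get(m.group(1), ""), text)


# Hypothetical keys and values, for illustration only.
values = parse_assignments(["KRAKEN_DIR=/opt/krakenhll", "VERSION=0.1"])
line = 'my $dir = "#####=KRAKEN_DIR=#####"; # v#####=VERSION=#####'
filled = fill_placeholders(line, values)
```

As in the Perl version, an unknown placeholder name is replaced by an empty string rather than raising an error.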