-
Notifications
You must be signed in to change notification settings - Fork 28
Custom Lineage Calling
This page describes how to perform custom lineage calling using mykrobe predict
.
The idea is that you have lineage information for your species of interest in a tree, with a variant defining each node in the tree. Mykrobe will check the genotype of each variant and then predict the lineage.
Mykrobe requires that the hierarchical lineage names are of the form "x", "x.1", "x.1.1", "x.1.2", "x.2" ...etc. The lineage tree is inferred by the dots in the names: "x" is the parent of "x.1", and "x.1" is the parent of "x.1.1" and "x.1.2" etc. However, It is possible to have a different lineage name reported from this "x.1.2" form (see later for an example).
The method involves two stages:
-
Run
mykrobe variants make-probes
, where variants and their associated lineages are used as input. This only needs to be done once. -
Run
mykrobe predict
on each sample, using the files made in 1 as input.
Essentially, we need to run make-probes
in almost the usual way - as described in the custom panels page.
The only difference is that we flag some nucleotide changes in the genome as lineage-defining. The command to run make probes is the same as if we were not calling lineage, but with the extra option --lineage
:
mykrobe variants make-probes \
-k 21 \
-t $PWD/vars.ref.txt \
--lineage lineage.json \
$PWD/ref_genome.fasta > probes.fa
This will write a new output file lineage.json
which is used as input to mykrobe predict
. Note: we used the option -k 21
so that the k-mer length matches the default for mykrobe genotype
. If you change this option, then you must also use that same -k
option when running mykrobe predict
!
The lineage-defining variants are flagged in vars.ref.txt
. The normal format of this file is five columns, for example:
ref 1000 A T DNA
This represents the nucleotide change A1000T. Using an extra end column (or two columns) adds a variant to the lineage definition. For example:
ref 1000 A T DNA lineage1
ref 2000 C A DNA lineage1.1
ref 3000 G C DNA lineage1.2 1.2_display_name
ref 4000 T A DNA lineage2
Note that there is an extra column for lineage1.2
. Column 6 defines the lineage tree, and the names in this column will be those reported by mykrobe. However, an entry in column 7 changes the name in the report. So in this case, lineage1.2
will be reported with the name 1.2_display_name
instead.
The assumption is that the non-reference genotype defines the lineage. For example, seeing a T
at position 1000 implies lineage1. If instead it is the reference genotype that defines the lineage, then add an asterisk:
ref 5000 G C DNA *lineage3
This means that seeing a G
at position 5000 (which happens to be the same as the reference genome) implies lineage3, whereas seeing a C
means not lineage3.
You can associate the same lineage to more than one variant. Mykrobe will take the presence of either variant as evidence of that lineage.
Run lineage calling on a reads file reads.fq
with this command:
mykrobe predict --sample sample_id \
--species custom \
--seq reads.fq \
--custom_lineage_json lineage.json \
--custom_probe_set_path probes.fa \
--format json \
-o out.json
where probes.fa
and lineage.json
are the files made earlier by make-probes
.
The lineage information is in the Phylogenetics
section of the JSON file - please see the description of JSON file output.