Skip to content

Custom Lineage Calling

Michael Hall edited this page Apr 16, 2021 · 2 revisions

This page describes how to perform custom lineage calling using mykrobe predict.

The idea is that you have lineage information for your species of interest in a tree, with a variant defining each node in the tree. Mykrobe will check the genotype of each variant and then predict the lineage.

Lineage names

Mykrobe requires that the hierarchical lineage names are of the form "x", "x.1", "x.1.1", "x.1.2", "x.2" ...etc. The lineage tree is inferred by the dots in the names: "x" is the parent of "x.1", and "x.1" is the parent of "x.1.1" and "x.1.2" etc. However, It is possible to have a different lineage name reported from this "x.1.2" form (see later for an example).

Methods

The method involves two stages:

  1. Run mykrobe variants make-probes, where variants and their associated lineages are used as input. This only needs to be done once.

  2. Run mykrobe predict on each sample, using the files made in 1 as input.

Make probes

Essentially, we need to run make-probes in almost the usual way - as described in the custom panels page. The only difference is that we flag some nucleotide changes in the genome as lineage-defining. The command to run make probes is the same as if we were not calling lineage, but with the extra option --lineage:

mykrobe variants make-probes \
  -k 21 \
  -t $PWD/vars.ref.txt \
  --lineage lineage.json \
  $PWD/ref_genome.fasta > probes.fa

This will write a new output file lineage.json which is used as input to mykrobe predict. Note: we used the option -k 21 so that the k-mer length matches the default for mykrobe genotype. If you change this option, then you must also use that same -k option when running mykrobe predict!

The lineage-defining variants are flagged in vars.ref.txt. The normal format of this file is five columns, for example:

ref	1000	A	T	DNA

This represents the nucleotide change A1000T. Using an extra end column (or two columns) adds a variant to the lineage definition. For example:

ref	1000	A	T	DNA	lineage1
ref	2000	C	A	DNA	lineage1.1
ref	3000	G	C	DNA	lineage1.2	1.2_display_name
ref	4000	T	A	DNA	lineage2

Note that there is an extra column for lineage1.2. Column 6 defines the lineage tree, and the names in this column will be those reported by mykrobe. However, an entry in column 7 changes the name in the report. So in this case, lineage1.2 will be reported with the name 1.2_display_name instead.

The assumption is that the non-reference genotype defines the lineage. For example, seeing a T at position 1000 implies lineage1. If instead it is the reference genotype that defines the lineage, then add an asterisk:

ref	5000	G	C	DNA	*lineage3

This means that seeing a G at position 5000 (which happens to be the same as the reference genome) implies lineage3, whereas seeing a C means not lineage3.

You can associate the same lineage to more than one variant. Mykrobe will take the presence of either variant as evidence of that lineage.

Genotype and identify the lineage

Run lineage calling on a reads file reads.fq with this command:

mykrobe predict --sample sample_id \
  --species custom \
  --seq reads.fq \
  --custom_lineage_json lineage.json \
  --custom_probe_set_path probes.fa \
  --format json \
  -o out.json

where probes.fa and lineage.json are the files made earlier by make-probes.

The lineage information is in the Phylogenetics section of the JSON file - please see the description of JSON file output.

Clone this wiki locally