Custom Panels
This page describes building a custom panel of variants and presence/absence genes. It walks through how to make and use a minimal panel that consists of just two variants and one presence/absence sequence, to use with staph. One variant is an amino acid change, and the other a SNP. The method is the same for tb (except it is unlikely that you would want presence/absence sequences for tb).
The aim is to make the two files that mykrobe needs to define a new panel:
- A probes FASTA file, probes.fa.
- A JSON file linking each variant or sequence in the probes FASTA file to a drug, var2res.json.
These two files can be used with the following command:
mykrobe predict --sample sample_name \
--species custom \
--panel custom \
--custom_probe_set_path $PWD/probes.fa \
--custom_variant_to_resistance_json $PWD/var2res.json \
--seq test_reads.fq.gz
The rest of this document describes how to make the two files probes.fa and var2res.json.
Download the data that mykrobe gets upon install, which includes all panels and reference genomes.
wget -O mykrobe-data.tar.gz https://ndownloader.figshare.com/files/20996829
tar xf mykrobe-data.tar.gz
We also need a genbank file of the reference genome (this is not currently included in the mykrobe data download). Download it:
wget -O BX571856.gb 'https://eutils.ncbi.nlm.nih.gov/entrez/eutils/efetch.fcgi?db=nucleotide&rettype=gb&retmode=text&id=BX571856.1'
We have two types of variant: those specified by gene coordinates, and those specified by coordinates in the reference genome. A file is needed for each (unless you are only interested in one type). All coordinates are 1-based. There are three different ways we can make probes for the panel:
- Using mutations anywhere in the genome (nucleotide changes only)
- Using mutations in genes (amino acid or nucleotide changes)
- Providing a sequence that must be present (no point mutations involved).
Unrelated "background" variants near resistance variants can cause false positives.
This can be mitigated by telling mykrobe about these background variants.
Please see the background variants page for a full description.
As per those instructions, the background variants will be stored in a mongo database. Then add the option --db_name my_name (replacing my_name with the name you used) to all mykrobe variants make-probes commands below.
The reference coordinate-based file has five columns: reference name, position, reference nucleotide, alternate nucleotide, and alphabet (which must be DNA). For example:
ref 1000 A T DNA
This represents a SNP from A to T at position 1000 of the genome. If this is in a file called vars.ref.txt, then we can make probes by running the following command.
mykrobe variants make-probes -t $PWD/vars.ref.txt $PWD/mykrobe-data/BX571856.1.fasta > probes.ref.fa
The gene coordinates file must have three columns: gene name, variant, and alphabet (either DNA or PROT). For example:
ileS D2E PROT
This represents an amino acid change from D to E at position 2 (i.e. the second amino acid) in gene ileS.
If this is in a file called vars.gene.txt, then we can make probes by running the following command.
mykrobe variants make-probes -t $PWD/vars.gene.txt -g BX571856.gb $PWD/mykrobe-data/BX571856.1.fasta > probes.gene.fa
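As a minimal sketch, the two variant files described above could be written from the shell as follows. It assumes the columns are separated by single tab characters, which the descriptions above do not state explicitly; adjust the delimiter if your version of mykrobe expects something else.

```shell
# Write the reference-coordinate variant (five columns) and the
# gene-coordinate variant (three columns) from the examples above.
# Assumption: columns are tab-separated.
printf 'ref\t1000\tA\tT\tDNA\n' > vars.ref.txt
printf 'ileS\tD2E\tPROT\n' > vars.gene.txt

# Sanity check: print the number of tab-separated fields on each line.
awk -F'\t' '{ print FILENAME ": " NF " columns" }' vars.ref.txt vars.gene.txt
```

Using printf rather than typing the line into an editor makes the tabs explicit, so they cannot be silently converted to spaces.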
If, in addition to mutations, you want to include sequences whose presence implies resistance to a drug, then you will need these sequences in a FASTA file. There is no need to run make-probes. The name of each sequence in the FASTA file needs the format seq_name?name=seq_name&version=1, where seq_name is the name of the sequence. More than one version can be provided using version=2, version=3, etc. For example, search for blaZ in the built-in file mykrobe-data/panels/staph-amr-probe_set_v0_3_13-160715.fasta.gz:
$ zgrep blaZ mykrobe-data/panels/staph-amr-probe_set_v0_3_13-160715.fasta.gz | head -n5
>blaZ?name=blaZ&version=1
>blaZ?name=blaZ&version=2
>blaZ?name=blaZ&version=3
>blaZ?name=blaZ&version=4
>blaZ?name=blaZ&version=5
For demonstration purposes, assume we have a FASTA file probes.pres_abs.fa with the following contents:
>presAbs?name=presAbs&version=1
GTGGCAAGGCTTTTTACACAGCCTTTAGCTTCCCCGTTTTTTTATAGCAAGTTCGTAATT
TCGGAAATTGGGACGCTCAGACATTAATCTGCGGTGGGCGTTAACCTGACTGCACAAGTA
GTTCTAAGGAACATCTTTGG
(This example is just a random sequence.)
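Because the header format is easy to mistype, a quick check along the following lines may help. This is only a sketch: bad_pres_abs_headers is a hypothetical helper name, not part of mykrobe, and it tests nothing beyond the seq_name?name=seq_name&version=N pattern described above.

```shell
# Print any FASTA header that does not match seq_name?name=seq_name&version=N,
# where both occurrences of seq_name must be identical.
# Prints nothing when every header follows the convention.
bad_pres_abs_headers() {
    grep '^>' "$1" \
        | grep -v '^>\([^?&][^?&]*\)?name=\1&version=[0-9][0-9]*$' \
        || true
}

# Usage (file name taken from the example above):
# bad_pres_abs_headers probes.pres_abs.fa
```

The pattern is a basic regular expression, so the ? is matched literally and \1 requires the name after name= to repeat the name before the ?.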
Earlier we made three FASTA files:
- Probes from reference coordinates
- Probes from gene coordinates
- Presence/absence sequences
To make the file needed for mykrobe, simply concatenate them into one file:
cat probes.gene.fa probes.ref.fa probes.pres_abs.fa > probes.fa
If you used the same method as above, the result should be:
>ref-D2E?var_name=GAT1212816GAA&num_alts=1&ref=BX571856.1&enum=0&gene=ileS&mut=D2E
TTTTTAAATTTTTAAGGAGTGAAAAAAATGGATTACAAAGAAACGTTATTAATGCCTAAAA
>alt-D2E?var_name=GAT1212816GAA&enum=0&gene=ileS&mut=D2E
TTTAAATTTTTAAGGAGTGAAAAAAATGGAATACAAAGAAACGTTATTAATGCCTAAAA
>ref-D2E?var_name=GAT1212816GAG&num_alts=1&ref=BX571856.1&enum=0&gene=ileS&mut=D2E
TTTTTAAATTTTTAAGGAGTGAAAAAAATGGATTACAAAGAAACGTTATTAATGCCTAAAA
>alt-D2E?var_name=GAT1212816GAG&enum=0&gene=ileS&mut=D2E
TTTAAATTTTTAAGGAGTGAAAAAAATGGAGTACAAAGAAACGTTATTAATGCCTAAAA
>ref-A1000T?var_name=A1000T&num_alts=1&ref=BX571856.1&enum=0&gene=NA&mut=A1000T
TTATTTATCTATGGAGGTGTTGGTTTAGGAAAAACCCATTTAATGCATGCCATTGGTCATC
>alt-A1000T?var_name=A1000T&enum=0&gene=NA&mut=A1000T
TTATTTATCTATGGAGGTGTTGGTTTAGGATAAACCCATTTAATGCATGCCATTGGTCATC
>presAbs?name=presAbs&version=1
GTGGCAAGGCTTTTTACACAGCCTTTAGCTTCCCCGTTTTTTTATAGCAAGTTCGTAATT
TCGGAAATTGGGACGCTCAGACATTAATCTGCGGTGGGCGTTAACCTGACTGCACAAGTA
GTTCTAAGGAACATCTTTGG
The first four sequences are probes for the gene mutation: E can be encoded by two codons (GAA and GAG), so there is one ref/alt pair for each. The next two are probes for the A to T SNP at position 1000, and the final sequence is the presence/absence sequence.
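A simple way to sanity-check the concatenated file is to count its records and look for repeated headers. The helper names below are hypothetical and the sketch uses only standard grep, sort and uniq.

```shell
# Count the records in a FASTA file (one header line per record).
count_fasta_records() {
    grep -c '^>' "$1"
}

# Print any header line that occurs more than once, for example because
# the same input file was concatenated twice. Prints nothing if all
# headers are unique.
duplicate_fasta_headers() {
    grep '^>' "$1" | sort | uniq -d
}

# Usage:
# count_fasta_records probes.fa      # the example above has 7 records
# duplicate_fasta_headers probes.fa
```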
A JSON file that links each variant and/or presence/absence sequence to a list of drugs is required. Each key should be either a variant name or a sequence name, and each value a list of drugs. The following example matches the example FASTA shown above.
{
"ileS_D2E": ["Drug1"],
"A1000T": ["Drug2"],
"presAbs": ["Drug3", "Drug4"]
}
The first entry corresponds to the D to E amino acid change at position 2 of the gene ileS, saying that the mutation causes resistance to "Drug1".
The second entry says that the SNP A to T at position 1000 in the genome causes resistance to "Drug2".
The third entry says that if the sample has the sequence "presAbs", then it is resistant to "Drug3" and "Drug4".
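Before passing the JSON file to mykrobe, it can be worth confirming that it parses and listing its keys for comparison with the probe names. The sketch below assumes python3 is on the PATH and uses only the Python standard library; is_valid_json and json_keys are hypothetical helper names.

```shell
# Exit status 0 if the file parses as JSON, non-zero otherwise.
# Assumes python3 is on the PATH.
is_valid_json() {
    python3 -m json.tool "$1" > /dev/null 2>&1
}

# Print the top-level keys, one per line, for comparison with the probe names.
json_keys() {
    python3 -c 'import json, sys; print("\n".join(json.load(open(sys.argv[1]))))' "$1"
}

# Usage:
# is_valid_json var2res.json && json_keys var2res.json
```

This only checks syntax; it does not verify that each key matches a probe in probes.fa.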