Custom Panels

martinghunt edited this page Apr 12, 2022 · 7 revisions

This page describes how to build a custom panel of variants and presence/absence genes. It walks through making and using a minimal panel for staph that consists of just two variants and one presence/absence sequence. One variant is an amino acid change, and the other is a SNP. The method is the same for tb (although you are unlikely to want presence/absence sequences for tb).

The aim is to make the two files that mykrobe needs to define a new panel:

  1. Probes FASTA, probes.fa.
  2. A JSON file linking each variant or sequence in the probes FASTA file to a drug, var2res.json.

These two files can be used with the following command:

mykrobe predict --sample sample_name \
  --species custom \
  --panel custom \
  --custom_probe_set_path $PWD/probes.fa \
  --custom_variant_to_resistance_json $PWD/var2res.json \
  --seq test_reads.fq.gz

The rest of this document describes how to make the two files probes.fa and var2res.json.

Get the reference data

Download the data that mykrobe gets upon install, which includes all panels and reference genomes.

wget -O mykrobe-data.tar.gz https://ndownloader.figshare.com/files/20996829
tar xf mykrobe-data.tar.gz

We also need a GenBank file of the reference genome (this is not currently included in the mykrobe data download). Download it:

wget -O BX571856.gb 'https://eutils.ncbi.nlm.nih.gov/entrez/eutils/efetch.fcgi?db=nucleotide&rettype=gb&retmode=text&id=BX571856.1'

Make probes from various sources

Variants come in two types: those specified by gene coordinates, and those specified by coordinates in the reference genome. Each type needs its own input file (you can omit a type you are not using). All coordinates are 1-based. Together with presence/absence sequences, this gives three ways to make probes for the panel:

  1. Using mutations anywhere in the genome (nucleotide changes only)
  2. Using mutations in genes (amino acid or nucleotide changes)
  3. Providing a sequence that must be present (no point mutations involved).

Important: background variants

Unrelated "background" variants near resistance variants can cause false positives. This can be mitigated by telling mykrobe about these background variants. Please see the background variants page for a full description. As per those instructions, the background variants will be stored in a mongo database. Once that is done, add the option --db_name my_name (replacing my_name with the name you used) to all mykrobe variants make-probes commands below.

Probes from reference coordinates

The reference coordinate-based file has five columns: reference name, position, reference nucleotide, alternate nucleotide, and alphabet (which must be DNA). For example:

ref	1000	A	T	DNA

This represents a SNP from A to T at position 1000 of the genome. If this is in a file called vars.ref.txt, then we can make probes by running the following command.

mykrobe variants make-probes -t $PWD/vars.ref.txt  $PWD/mykrobe-data/BX571856.1.fasta > probes.ref.fa
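The example line above uses tabs between its columns, and these are easy to lose when copying text between editors. As a minimal sketch (an illustration based only on the column description above, not a mykrobe command), the following writes the example line with explicit tabs and then checks that every line has five fields, a numeric position, and DNA as the alphabet:

```shell
# Write the example variant with explicit tab separators.
printf 'ref\t1000\tA\tT\tDNA\n' > vars.ref.txt

# Illustrative check (not a mykrobe command): flag any line that does not
# have five tab-separated fields, a numeric position, and DNA as alphabet.
awk -F'\t' '
  NF != 5 || $2 !~ /^[0-9]+$/ || $5 != "DNA" {
    print "bad line " NR ": " $0
    bad = 1
  }
  END { exit bad + 0 }
' vars.ref.txt && echo "vars.ref.txt: all lines have the expected five columns"
```

Any line printed as "bad line" is worth inspecting by hand before running make-probes.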

Probes from gene coordinates

The gene coordinates file must have three columns: gene name, variant, and alphabet (either DNA or PROT). For example:

ileS	D2E	PROT

This represents an amino acid change from D to E at position 2 (i.e. the second amino acid) in gene ileS. If this is in a file called vars.gene.txt, then we can make probes by running the following command.

mykrobe variants make-probes -t $PWD/vars.gene.txt -g BX571856.gb $PWD/mykrobe-data/BX571856.1.fasta > probes.gene.fa
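Before running make-probes, it can help to confirm that each gene name in the first column actually occurs in the GenBank file. The sketch below assumes the name is matched against the /gene="..." qualifier of the GenBank annotation; that is an assumption about how genes are looked up, and check_genes is a hypothetical helper, not part of mykrobe.

```shell
# Write the example gene variant with explicit tab separators.
printf 'ileS\tD2E\tPROT\n' > vars.gene.txt

# check_genes VARIANTS GENBANK (hypothetical helper): for each distinct gene
# name in column 1, report whether /gene="NAME" occurs in the GenBank file.
check_genes() {
  cut -f1 "$1" | sort -u | while read -r gene; do
    if grep -qF "/gene=\"$gene\"" "$2"; then
      echo "found: $gene"
    else
      echo "missing: $gene"
    fi
  done
}

# Only run the check if the GenBank file has already been downloaded.
if [ -f BX571856.gb ]; then
  check_genes vars.gene.txt BX571856.gb
fi
```

A "missing" line suggests the gene name is spelled differently in the annotation, or is not annotated with a gene qualifier at all.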

Probes from presence/absence sequences

If, in addition to mutations, you want to include sequences whose presence implies resistance to a drug, then you will need these sequences in a FASTA file. There is no need to run make-probes. The name of each sequence in the FASTA file must have the format seq_name?name=seq_name&version=1, where seq_name is the name of the sequence. More than one version can be provided using version=2, version=3, etc. For example, search for blaZ in the built-in file mykrobe-data/panels/staph-amr-probe_set_v0_3_13-160715.fasta.gz:

$ zgrep blaZ mykrobe-data/panels/staph-amr-probe_set_v0_3_13-160715.fasta.gz | head -n5
>blaZ?name=blaZ&version=1
>blaZ?name=blaZ&version=2
>blaZ?name=blaZ&version=3
>blaZ?name=blaZ&version=4
>blaZ?name=blaZ&version=5

For demonstration purposes, assume we have a FASTA file probes.pres_abs.fa with the following contents:

>presAbs?name=presAbs&version=1
GTGGCAAGGCTTTTTACACAGCCTTTAGCTTCCCCGTTTTTTTATAGCAAGTTCGTAATT
TCGGAAATTGGGACGCTCAGACATTAATCTGCGGTGGGCGTTAACCTGACTGCACAAGTA
GTTCTAAGGAACATCTTTGG

(This example is just a random sequence.)
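Since these headers are written by hand, a quick format check can catch typos. The following is a minimal sketch that tests each header against the seq_name?name=seq_name&version=N pattern described above; check_pres_abs_headers is a hypothetical helper name, and the check only covers the header format, not the sequences themselves.

```shell
# check_pres_abs_headers FASTA (hypothetical helper): print every header that
# does not look like seq_name?name=seq_name&version=N, where N is an integer.
# Exits non-zero if any header is malformed.
check_pres_abs_headers() {
  awk '
    /^>/ {
      header = substr($0, 2)
      q = index(header, "?")
      seq_name = substr(header, 1, q - 1)
      prefix = seq_name "?name=" seq_name "&version="
      version = substr(header, length(prefix) + 1)
      if (q < 2 || index(header, prefix) != 1 || version !~ /^[0-9]+$/) {
        print "bad header: " $0
        bad = 1
      }
    }
    END { exit bad + 0 }
  ' "$1"
}

# Only run the check if the file exists in the current directory.
if [ -f probes.pres_abs.fa ]; then
  check_pres_abs_headers probes.pres_abs.fa && echo "all headers OK"
fi
```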

Make a single probes FASTA file

Earlier we made three FASTA files:

  1. Probes from reference coordinates
  2. Probes from gene coordinates
  3. Presence/absence sequences

To make the file needed for mykrobe, simply concatenate them into one file:

cat probes.gene.fa probes.ref.fa probes.pres_abs.fa > probes.fa

If you used the same method as above, the result should be:

>ref-D2E?var_name=GAT1212816GAA&num_alts=1&ref=BX571856.1&enum=0&gene=ileS&mut=D2E
TTTTTAAATTTTTAAGGAGTGAAAAAAATGGATTACAAAGAAACGTTATTAATGCCTAAAA
>alt-D2E?var_name=GAT1212816GAA&enum=0&gene=ileS&mut=D2E
TTTAAATTTTTAAGGAGTGAAAAAAATGGAATACAAAGAAACGTTATTAATGCCTAAAA
>ref-D2E?var_name=GAT1212816GAG&num_alts=1&ref=BX571856.1&enum=0&gene=ileS&mut=D2E
TTTTTAAATTTTTAAGGAGTGAAAAAAATGGATTACAAAGAAACGTTATTAATGCCTAAAA
>alt-D2E?var_name=GAT1212816GAG&enum=0&gene=ileS&mut=D2E
TTTAAATTTTTAAGGAGTGAAAAAAATGGAGTACAAAGAAACGTTATTAATGCCTAAAA
>ref-A1000T?var_name=A1000T&num_alts=1&ref=BX571856.1&enum=0&gene=NA&mut=A1000T
TTATTTATCTATGGAGGTGTTGGTTTAGGAAAAACCCATTTAATGCATGCCATTGGTCATC
>alt-A1000T?var_name=A1000T&enum=0&gene=NA&mut=A1000T
TTATTTATCTATGGAGGTGTTGGTTTAGGATAAACCCATTTAATGCATGCCATTGGTCATC
>presAbs?name=presAbs&version=1
GTGGCAAGGCTTTTTACACAGCCTTTAGCTTCCCCGTTTTTTTATAGCAAGTTCGTAATT
TCGGAAATTGGGACGCTCAGACATTAATCTGCGGTGGGCGTTAACCTGACTGCACAAGTA
GTTCTAAGGAACATCTTTGG

The first four sequences are probes for the gene mutation, the next two are probes for the A to T SNP at position 1000, and the final sequence is the presence/absence sequence.
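A quick way to confirm that the concatenated file contains what you expect is to count headers by type. This sketch assumes, as in the example output above, that variant probe names start with ref- or alt- and that every other header is a presence/absence sequence; summarise_probes is a hypothetical helper, not a mykrobe command.

```shell
# summarise_probes FASTA (hypothetical helper): count ref probes, alt probes,
# and all remaining headers (assumed to be presence/absence sequences).
summarise_probes() {
  awk '
    /^>ref-/ { ref++; next }
    /^>alt-/ { alt++; next }
    /^>/     { other++ }
    END { printf "ref=%d alt=%d presence_absence=%d\n", ref, alt, other }
  ' "$1"
}

# Only run the summary if probes.fa exists in the current directory.
if [ -f probes.fa ]; then
  summarise_probes probes.fa
fi
```

For the example probes.fa shown above, this prints ref=3 alt=3 presence_absence=1.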

Resistance JSON file

A JSON file that links each variant and/or presence/absence gene to a list of drugs is required. Each key should be either a variant name or a sequence name, and the value a list of drugs. The following example matches the example FASTA shown above.

{
  "ileS_D2E": ["Drug1"],
  "A1000T": ["Drug2"],
  "presAbs": ["Drug3", "Drug4"]
}

The first entry corresponds to the D to E amino acid change at position 2 of the gene ileS, saying that the mutation causes resistance to "Drug1". The second entry says that the SNP A to T at position 1000 in the genome causes resistance to "Drug2". The third entry says that if the sample has the sequence "presAbs", then it is resistant to "Drug3" and "Drug4".
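To check that the JSON keys line up with the probes, the keys can be derived from the FASTA headers and looked up in the JSON file. The sketch below infers the key format from the example above only: gene_mut for gene variants, mut alone when gene=NA, and the sequence name for presence/absence sequences. Both expected_keys and check_keys are hypothetical helpers, and the lookup is a plain text match rather than a real JSON parse.

```shell
# expected_keys FASTA (hypothetical helper): list the JSON keys implied by
# the probe headers, using the naming seen in the example above.
expected_keys() {
  awk '
    /^>/ {
      header = substr($0, 2)
      q = index(header, "?")
      name = (q > 0) ? substr(header, 1, q - 1) : header
      # Headers without a ref-/alt- prefix are presence/absence sequences.
      if (name !~ /^(ref|alt)-/) { print name; next }
      gene = ""; mut = ""
      n = split(substr(header, q + 1), fields, "&")
      for (i = 1; i <= n; i++) {
        if (fields[i] ~ /^gene=/) gene = substr(fields[i], 6)
        if (fields[i] ~ /^mut=/)  mut  = substr(fields[i], 5)
      }
      key = (gene == "NA") ? mut : gene "_" mut
      print key
    }
  ' "$1" | LC_ALL=C sort -u
}

# check_keys FASTA JSON (hypothetical helper): report whether each expected
# key appears as a quoted string in the JSON file (text match only).
check_keys() {
  expected_keys "$1" | while read -r key; do
    if grep -qF "\"$key\"" "$2"; then
      echo "ok: $key"
    else
      echo "missing from JSON: $key"
    fi
  done
}

# Only run the check if both files exist in the current directory.
if [ -f probes.fa ] && [ -f var2res.json ]; then
  check_keys probes.fa var2res.json
fi
```

For the example files above, the expected keys are A1000T, ileS_D2E, and presAbs, which match the three entries in the JSON.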