Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

How do I interpret SNP2HLA output? #6

Open
rwdavies opened this issue Oct 14, 2020 · 2 comments
Open

How do I interpret SNP2HLA output? #6

rwdavies opened this issue Oct 14, 2020 · 2 comments

Comments

@rwdavies
Copy link

Hi,

I tried running a modified version of the README suggestion for HLA-TAPAS (code below), which gave the following example output for one individual for HLA-A (code to reproduce below as as well as log). Can you provide guidance on how to interpret this, if I'm specifically interested in 4-digit HLA types?

So these are the VCF entries for the first imputed sample for HLA-A

HLA_A*01        1|0:1.05:0.02,0.91,0.07
HLA_A*01:01:01:01       1|0:1.04:0.03,0.91,0.07
HLA_A*01:02     0|0:0.01:0.99,0.01,0
HLA_A*01:83:02  0|0:0.01:0.99,0.01,0
HLA_A*02        0|0:0.04:0.97,0.03,0
HLA_A*02:01:01:01       0|0:0.02:0.98,0.02,0
HLA_A*02:22:01:01       0|0:0.01:0.99,0.01,0
HLA_A*03        0|0:0.12:0.88,0.12,0
HLA_A*03:01:01:01       0|0:0.12:0.88,0.12,0
HLA_A*11        0|0:0.08:0.92,0.08,0
HLA_A*11:01:01:01       0|0:0.07:0.93,0.07,0
HLA_A*11:50Q    0|0:0.02:0.98,0.02,0
HLA_A*24        0|1:0.67:0.33,0.67,0
HLA_A*24:02:01:01       0|1:0.61:0.39,0.61,0
HLA_A*24:02:05  0|0:0.01:0.99,0.01,0
HLA_A*24:07:01  0|0:0.03:0.97,0.03,0
HLA_A*24:17     0|0:0.01:0.99,0.01,0
HLA_A*24:86N    0|0:0.01:0.99,0.01,0
HLA_A*36        0|0:0.01:0.99,0.01,0
HLA_A*36:01     0|0:0.01:0.99,0.01,0
HLA_A*68        0|0:0.01:0.99,0.01,0
HLA_A*68:02:01:01       0|0:0.01:0.99,0.01,0

My intuition is to a) disregard the 2-digit rows b) collapse >4 digit to 4-digit types c) sum 4-digit types. This gives something like the following, if I then sum by the second entry in (the DS = dosage) of the VCF

 1   HLA_A*01:01 1.04
 2   HLA_A*24:02 0.62
 3   HLA_A*03:01 0.12
 4   HLA_A*11:01 0.07
 5   HLA_A*24:07 0.03
 6   HLA_A*02:01 0.02
 7  HLA_A*11:50Q 0.02
 8   HLA_A*01:02 0.01
 9   HLA_A*01:83 0.01
 10  HLA_A*02:22 0.01
 11  HLA_A*24:17 0.01
 12 HLA_A*24:86N 0.01
 13  HLA_A*36:01 0.01
 14  HLA_A*68:02 0.01

My interpretation is that the most likely genotype for this sample would be HLA_A*01:01, HLA_A*24:02, with some sort of confidence (here perhaps confidence = (1.04 + 0.62) / (sum of the rest) = 0.83? If this is right, is there a recommended default confidence threshold (0.5? 0.9?)

Can you also confirm, for the imputation, that the only thing that matters from the reference samples is the *.phased.vcf.gz file? This is what this code seems to show

Thanks,
Robbie

Code I used

set -e

rm -r -f test_target
mkdir test_target
##cp SNP2HLA/example/1958BC.vcf.gz test_target/
cp SNP2HLA/example/1958BC.bed test_target/
cp SNP2HLA/example/1958BC.bim test_target/
cp SNP2HLA/example/1958BC.fam test_target/

rm -r -f test_reference
mkdir test_reference
## cp resources/1000G.bglv4.bgl.phased.vcf.gz test_reference/
## cp resources/1000G.bglv4.bim test_reference/
## cp resources/1000G.bglv4.FRQ.frq test_reference/
## cp resources/1000G.bglv4.markers test_reference/
cp resources/1000G.bglv4.bgl.phased.vcf.gz test_reference/
touch test_reference/1000G.bglv4.bim
touch test_reference/1000G.bglv4.FRQ.frq
touch test_reference/1000G.bglv4.markers

rm -r -f test_output
mkdir -p test_output

python -m SNP2HLA \
   --target test_target/1958BC \
   --reference test_reference/1000G.bglv4 \
   --out test_output/IMPUTED.1958BC \
   --nthreads 4 \
   --mem 16g

gunzip -c test_output/IMPUTED.1958BC.bgl.phased.vcf.gz | grep -F "HLA_A*" | cut -f3,10 | grep -v "0|0:0:1,0,0"

Output of code run

Namespace(dependency='dependency/', mem='16g', niterations=5, nthreads=4, out='test_output/IMPUTED.1958BC', reference='test_reference/1000G.bglv4', target='test_target/1958BC', tolerated_diff=0.15)
SNP2HLA: Performing HLA imputation for dataset test_target/1958BC
- Java memory = 16gb
[1] Extracting SNPs from the MHC.
[2] Performing SNP quality control.
[3] Converting PLINK to BEAGLE format.
[4] Converting BEAGLE to VCF format.
[5] Performing HLA imputation.
HLA_A*01        1|0:1.05:0.02,0.91,0.07
HLA_A*01:01:01:01       1|0:1.04:0.03,0.91,0.07
HLA_A*01:02     0|0:0.01:0.99,0.01,0
HLA_A*01:83:02  0|0:0.01:0.99,0.01,0
HLA_A*02        0|0:0.04:0.97,0.03,0
HLA_A*02:01:01:01       0|0:0.02:0.98,0.02,0
HLA_A*02:22:01:01       0|0:0.01:0.99,0.01,0
HLA_A*03        0|0:0.12:0.88,0.12,0
HLA_A*03:01:01:01       0|0:0.12:0.88,0.12,0
HLA_A*11        0|0:0.08:0.92,0.08,0
HLA_A*11:01:01:01       0|0:0.07:0.93,0.07,0
HLA_A*11:50Q    0|0:0.02:0.98,0.02,0
HLA_A*24        0|1:0.67:0.33,0.67,0
HLA_A*24:02:01:01       0|1:0.61:0.39,0.61,0
HLA_A*24:02:05  0|0:0.01:0.99,0.01,0
HLA_A*24:07:01  0|0:0.03:0.97,0.03,0
HLA_A*24:17     0|0:0.01:0.99,0.01,0
HLA_A*24:86N    0|0:0.01:0.99,0.01,0
HLA_A*36        0|0:0.01:0.99,0.01,0
HLA_A*36:01     0|0:0.01:0.99,0.01,0
HLA_A*68        0|0:0.01:0.99,0.01,0
HLA_A*68:02:01:01       0|0:0.01:0.99,0.01,0
@rwdavies
Copy link
Author

If there is some sort of parser that already does this that you can point me to that'd be great (e.g. takes as input the phased output file, and generates something like one row per sample, one column per HLA-region, with entry the most likely single genotype, and a confidence measure)

@yluo86
Copy link
Contributor

yluo86 commented Mar 26, 2021

Hi @rwdavies, sorry about the the slow response - I have only recently realise that i need to turn the watch on before getting notified with all open issues. You are right in interpreting the result. As for filtering - we usually filter out variants with imputation r2 < 0.5 and minor allele frequency < 1% (depends on the size of you dataset).

All required information for downstream association analysis is included in the *.phased.vcf.gz file.

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
None yet
Projects
None yet
Development

No branches or pull requests

2 participants