Using hap.py to compare the variant calls from HaplotypeCaller and Mutect2, in order to understand their performance and accuracy in different scenarios.
:::warning
Prerequisite: Have the results for the 3 WGS samples from the previous tutorial ready, each processed with two variant calling methods: HaplotypeCaller and Mutect2.
cd /work/username
mkdir vcf_for_happy
cd vcf_for_happy
rsync -avzP /work/u2499286/S14_M2_result/SRR13076392_S14_L002.sorted.markdup.m2.vcf.gz ./
rsync -avzP /work/u2499286/S15_M2_result/SRR13076396_S15_L002.sorted.markdup.m2.vcf.gz ./
rsync -avzP /work/u2499286/S16_M2_result/SRR13076396_S16_L002.sorted.markdup.m2.vcf.gz ./
rsync -avzP /work/u2499286/S14_HC_result/SRR13076392_S14_L002.sorted.markdup.hc.vcf.gz ./
rsync -avzP /work/u2499286/S15_HC_result/SRR13076393_S15_L002.sorted.markdup.hc.vcf.gz ./
rsync -avzP /work/u2499286/S16_HC_result/SRR13076396_S16_L002.sorted.markdup.hc.vcf.gz ./
:::
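After copying, it can help to confirm that one VCF per sample arrived for each caller. The snippet below is a minimal sketch, assuming the files sit in the current directory and keep the `.m2.vcf.gz` / `.hc.vcf.gz` suffixes used above; the function name `count_vcfs` is made up for illustration.

```shell
# Count the VCF files for one caller in a directory.
# $1: directory, $2: caller tag in the file name (m2 or hc).
count_vcfs() {
    find "$1" -maxdepth 1 -type f -name "*.$2.vcf.gz" | wc -l | tr -d ' '
}

# Report how many files each caller has in the current directory.
for caller in m2 hc; do
    echo "$caller: $(count_vcfs . "$caller") file(s)"
done
```

With the six files listed above in place, each caller should report 3 files.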
:::info
hap.py (Haplotype Comparison Tools) is a tool for comparing genomic variants. It is often used to evaluate the accuracy of variant calling algorithms, especially for variants such as SNPs and indels in somatic or germline cells. hap.py compares variant calling results against a known standard (e.g., a gold standard VCF) to assess the sensitivity (recall), precision, and other metrics of a variant detection method.
:::
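To make the metrics concrete, the sketch below computes recall, precision, and F1 from counts of true positives (TP), false negatives (FN), and false positives (FP), using the standard textbook definitions. The function name `happy_metrics` and the counts are illustrative assumptions, and hap.py's own bookkeeping may differ in detail.

```shell
# Compute recall, precision and F1 from TP, FN and FP counts.
# The counts passed in below are made up for illustration.
happy_metrics() {
    awk -v tp="$1" -v fn="$2" -v fp="$3" 'BEGIN {
        recall    = tp / (tp + fn)   # share of truth variants that were called
        precision = tp / (tp + fp)   # share of calls that match the truth
        f1 = 2 * precision * recall / (precision + recall)
        printf "recall=%.4f precision=%.4f f1=%.4f\n", recall, precision, f1
    }'
}

happy_metrics 90 10 30
# prints: recall=0.9000 precision=0.7500 f1=0.8182
```

A caller that finds most true variants but also reports many extra ones has high recall and low precision; F1 combines the two into a single number.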
- Copy the executable files and reference files required for the class.
cd /work/username
rsync -avz /work/u2499286/hap.sh ./
rsync -avz /work/u2499286/KnownPositives_hg38_Liftover.vcf ./
rsync -avz /work/u2499286/High-Confidence_Regions_v1.2.bed.gz ./
- Open hap.sh and fill in the correct paths. (1) Red part: change the TA's username to your own. (2) Blue file names: change them to match your files. (3) Yellow part: change to your own file path.
Reminder: the bcftools command in the script removes multiallelic variants from the VCF file so that hap.py can read it.
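The exact bcftools command in hap.sh is not reproduced here. As a rough illustration of what "removing multiallelic variants" means, the sketch below uses plain awk (not bcftools) to keep header lines and records whose ALT column holds a single allele. The function name `drop_multiallelic` and the two records are hypothetical.

```shell
# Keep VCF header lines and records with exactly one ALT allele.
# A multiallelic record lists several alleles in column 5, separated by commas.
drop_multiallelic() {
    awk -F'\t' '/^#/ || $5 !~ /,/'
}

# Tiny made-up VCF: the second record (ALT = T,G) is multiallelic and is dropped.
printf '##fileformat=VCFv4.2\n#CHROM\tPOS\tID\tREF\tALT\tQUAL\tFILTER\tINFO\nchr1\t100\t.\tA\tG\t50\tPASS\t.\nchr1\t200\t.\tC\tT,G\t50\tPASS\t.\n' \
    | drop_multiallelic
```

In the class script this step is done by bcftools, which also handles compressed and indexed VCFs; the awk version only shows the filtering rule.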
- Run hap.sh.
sbatch hap.sh
- Get the results: a total of 11 files.
- output_prefix.runinfo.json
- output_prefix.metrics.json.gz
- output_prefix.roc.Locations.SNP.csv.gz
- output_prefix.roc.Locations.SNP.PASS.csv.gz
- output_prefix.roc.Locations.INDEL.csv.gz
- output_prefix.roc.Locations.INDEL.PASS.csv.gz
- output_prefix.roc.all.csv.gz
- output_prefix.summary.csv
- output_prefix.extended.csv
- output_prefix.vcf.gz
- output_prefix.vcf.gz.tbi
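The summary CSV is usually the quickest file to read. The sketch below prints the variant type, filter, recall, precision, and F1 for each row, looking columns up by header name. It assumes the header uses the names `Type`, `Filter`, `METRIC.Recall`, `METRIC.Precision`, and `METRIC.F1_Score`; check the first line of your own file. The function name `summarize_happy` and the demo numbers are made up.

```shell
# Print Type, Filter, recall, precision and F1 from a hap.py-style summary CSV.
# Columns are located by header name, so their order does not matter.
summarize_happy() {
    awk -F',' '
        NR == 1 { for (i = 1; i <= NF; i++) col[$i] = i; next }
        {
            printf "%s\t%s\t%s\t%s\t%s\n", $col["Type"], $col["Filter"],
                $col["METRIC.Recall"], $col["METRIC.Precision"], $col["METRIC.F1_Score"]
        }' "$1"
}

# Demo on a made-up file (hypothetical numbers, not real results).
demo=$(mktemp)
cat > "$demo" <<'EOF'
Type,Filter,TRUTH.TP,TRUTH.FN,QUERY.FP,METRIC.Recall,METRIC.Precision,METRIC.F1_Score
SNP,PASS,90,10,30,0.90,0.75,0.82
EOF
summarize_happy "$demo"
rm -f "$demo"
```

On a real run, call it as `summarize_happy output_prefix.summary.csv`, once for the HaplotypeCaller output and once for the Mutect2 output, and compare the rows side by side.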
- Copy the executable files required for the class.
rsync -avz /work/u2499286/rocplot.sh ./
rsync -avz /work/u2499286/rocplot_test.Rscript ./
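- Start R and install the required packages. In the commands below, 61 is the answer to the CRAN mirror selection prompt, and n is the answer to the save-workspace prompt that appears after q().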
R
install.packages("ggplot2")
61
install.packages("tools")
q()
n
- Change the paths to the correct ones. (1) Blue part: check whether the file is a HaplotypeCaller or a Mutect2 file. (2) Red part: replace with your own username.
- Run rocplot.sh.
sbatch rocplot.sh
- The results are saved in the rocplot_HC or rocplot_M2 folder, each containing two plots.
- You can also open IGV to inspect the results from hap.py, which are saved in output_prefix.vcf.gz.
- Comparison between tools. Example: SNPs from S16 called by Mutect2 and by HaplotypeCaller. At position 241,884,674, the variant is found by HaplotypeCaller but not by Mutect2.
- Comparison between samples. Example: SNPs for S14, S15, and S16 called by Mutect2.
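Sites called by one tool but not the other can also be listed on the command line before opening IGV. The following is a minimal sketch, assuming two gzip-compressed VCFs. It compares records by chromosome and position only and ignores alleles; the function names and the demo coordinates are hypothetical.

```shell
# Print the sorted, unique CHROM:POS keys of a gzip-compressed VCF.
vcf_sites() {
    gzip -dc "$1" | awk -F'\t' '!/^#/ { print $1 ":" $2 }' | sort -u
}

# Print sites present in the first VCF but absent from the second.
only_in_first() {
    sites_a=$(mktemp)
    sites_b=$(mktemp)
    vcf_sites "$1" > "$sites_a"
    vcf_sites "$2" > "$sites_b"
    comm -23 "$sites_a" "$sites_b"   # lines unique to the first sorted list
    rm -f "$sites_a" "$sites_b"
}

# Demo with two tiny made-up call sets.
tmpdir=$(mktemp -d)
printf '#CHROM\tPOS\nchr1\t100\nchr1\t200\n' | gzip -c > "$tmpdir/hc.vcf.gz"
printf '#CHROM\tPOS\nchr1\t100\n' | gzip -c > "$tmpdir/m2.vcf.gz"
only_in_first "$tmpdir/hc.vcf.gz" "$tmpdir/m2.vcf.gz"
# prints: chr1:200
rm -rf "$tmpdir"
```

Running it with the HaplotypeCaller file first and the Mutect2 file second gives candidate positions to inspect in IGV; swapping the arguments lists sites found only by Mutect2.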