You signed in with another tab or window. Reload to refresh your session.You signed out in another tab or window. Reload to refresh your session.You switched accounts on another tab or window. Reload to refresh your session.Dismiss alert
In round tripping vcf- > pgen(pvar/psam) -> vcf, contig names are not preserved:
Original VCF (one contig named "chr1"):
##fileformat=VCFv4.2
##contig=<ID=chr1,length=248956422,assembly=38>
#CHROM POS ID REF ALT QUAL FILTER INFO FORMAT ERS4367804 ERS4367805 ERS4367803 ERS4367800 ERS4367801 ERS4367798 ERS4367799 ERS4367796 ERS4367797 ERS4367795
chr1 10146 . AC A . NO_HQ_GENOTYPES AC=3;AF=0.750;AN=4;AS_QUALapprox=0|224;AS_VQSLOD=.;AS_YNG=.;QUALapprox=137 GT:AD:GQ:RGQ 1/1:1,15:11:137 ./. ./. ./. ./. 0/1:3,5:47:87 ./. ./. ./. ./.
After converting to pgen (plink2 --vcf original.vcf --make-pgen --out round_trip), the generated .pvar has the contig name correctly preserved in contig header line, but the variants use "1" as the contig name:
##contig=<ID=chr1,length=248956422,assembly=38>
#CHROM POS ID REF ALT FILTER
1 10146 . AC A NO_HQ_GENOTYPES
(Ok, its a .pvar, not a .VCF, but its still confusing).
But on regenerating a .vcf from the pfile (command plink2 --pfile round_trip --export vcf --out round_trip), both the sequence dictionary and the variants have the altered contig names (despite the assembly property still referencing hg38):
##fileformat=VCFv4.3
##fileDate=20230328
##source=PLINKv2.00
##contig=<ID=1,length=248956422,assembly=38>
##FORMAT=<ID=GT,Number=1,Type=String,Description="Genotype">
#CHROM POS ID REF ALT QUAL FILTER INFO FORMAT ERS4367804 ERS4367805 ERS4367803 ERS4367800 ERS4367801 ERS4367798 ERS4367799 ERS4367796 ERS4367797 ERS4367795
1 10146 . AC A . NO_HQ_GENOTYPES . GT 1/1 ./. ./. ./. ./. 0/1 ./. ./. ./. ./.
This kind of change can cause lots of problems if the vcf is used further in downstream analysis code. The plink2 code that I'm running was built locally from commit a2e584b, 3/11/23.
The text was updated successfully, but these errors were encountered:
Right, thanks for the link. Perhaps there is some compelling reason to do otherwise, but my suggestion would be to have the default behavior be to preserve the scheme that is presented in the inputs.
That creates more interoperability problems than it solves. It is reasonable for programs taking plink-formatted input to currently assume no 'chr' prefix in .bim/.pvar files.
I will look into adding an "--output-chr infer" mode.
In round tripping vcf- > pgen(pvar/psam) -> vcf, contig names are not preserved:
Original VCF (one contig named "chr1"):
After converting to pgen (
plink2 --vcf original.vcf --make-pgen --out round_trip
), the generated .pvar has the contig name correctly preserved in contig header line, but the variants use "1" as the contig name:(Ok, its a .pvar, not a .VCF, but its still confusing).
But on regenerating a .vcf from the pfile (command
plink2 --pfile round_trip --export vcf --out round_trip
), both the sequence dictionary and the variants have the altered contig names (despite theassembly
property still referencing hg38):This kind of change can cause lots of problems if the vcf is used further in downstream analysis code. The plink2 code that I'm running was built locally from commit a2e584b, 3/11/23.
The text was updated successfully, but these errors were encountered: