Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Plink2 round-tripping vcf->pgen->vcf does not preserve contig names #236

Open
cmnbroad opened this issue Mar 28, 2023 · 3 comments
Open

Comments

@cmnbroad
Copy link

In round tripping vcf- > pgen(pvar/psam) -> vcf, contig names are not preserved:

Original VCF (one contig named "chr1"):

##fileformat=VCFv4.2
##contig=<ID=chr1,length=248956422,assembly=38>
#CHROM  POS     ID      REF     ALT     QUAL    FILTER  INFO    FORMAT  ERS4367804      ERS4367805      ERS4367803      ERS4367800      ERS4367801      ERS4367798      ERS4367799      ERS4367796      ERS4367797      ERS4367795
chr1    10146   .       AC      A       .       NO_HQ_GENOTYPES AC=3;AF=0.750;AN=4;AS_QUALapprox=0|224;AS_VQSLOD=.;AS_YNG=.;QUALapprox=137      GT:AD:GQ:RGQ    1/1:1,15:11:137 ./.     ./.     ./.     ./.     0/1:3,5:47:87   ./.     ./.     ./.     ./.

After converting to pgen (plink2 --vcf original.vcf --make-pgen --out round_trip), the generated .pvar has the contig name correctly preserved in contig header line, but the variants use "1" as the contig name:

##contig=<ID=chr1,length=248956422,assembly=38>
#CHROM  POS     ID      REF     ALT     FILTER
1       10146   .       AC      A       NO_HQ_GENOTYPES

(Ok, its a .pvar, not a .VCF, but its still confusing).

But on regenerating a .vcf from the pfile (command plink2 --pfile round_trip --export vcf --out round_trip), both the sequence dictionary and the variants have the altered contig names (despite the assembly property still referencing hg38):

##fileformat=VCFv4.3
##fileDate=20230328
##source=PLINKv2.00
##contig=<ID=1,length=248956422,assembly=38>
##FORMAT=<ID=GT,Number=1,Type=String,Description="Genotype">
#CHROM  POS     ID      REF     ALT     QUAL    FILTER  INFO    FORMAT  ERS4367804      ERS4367805      ERS4367803      ERS4367800      ERS4367801      ERS4367798      ERS4367799      ERS4367796      ERS4367797      ERS4367795
1       10146   .       AC      A       .       NO_HQ_GENOTYPES .       GT      1/1     ./.     ./.     ./.     ./.     0/1     ./.     ./.     ./.     ./.

This kind of change can cause lots of problems if the vcf is used further in downstream analysis code. The plink2 code that I'm running was built locally from commit a2e584b, 3/11/23.

@chrchang
Copy link
Owner

plink's way of controlling this is the --output-chr flag.

@cmnbroad
Copy link
Author

Right, thanks for the link. Perhaps there is some compelling reason to do otherwise, but my suggestion would be to have the default behavior be to preserve the scheme that is presented in the inputs.

@chrchang
Copy link
Owner

That creates more interoperability problems than it solves. It is reasonable for programs taking plink-formatted input to currently assume no 'chr' prefix in .bim/.pvar files.

I will look into adding an "--output-chr infer" mode.

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
None yet
Projects
None yet
Development

No branches or pull requests

2 participants