Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

fetch gives wrong results on MD tag when read CRAM files with multiple_iterators=True #1290

Open
fanxinping opened this issue May 22, 2024 · 0 comments

Comments

@fanxinping
Copy link

When we covnert SAM/BAM to CRAM, we use store_nm=1 and store_md=1 to store NM/MD tags in the CRAM file. When reading CRAM with pysam, decode_md=0 is added to disable automatic calculation of NM/MD tags, allowing it to retrieve the NM/MD tag values stored in the CRAM file. However, for the fetch method, setting multiple_iterators to True or False will yield different NM/MD tags.

Below is a simple example of a SAM file(test.sam). READ1, READ2 and READ4 have MD tags, but READ3 and READ5 not.

@HD     VN:1.5  SO:coordinate                                                                                                      
@SQ     SN:chrM LN:16571
READ1   99      chrM    1       60      21S80M  =       177     277     CTTAAATAAGACATCACGATGGATCACAGGTCTATCACCCTATTAACCACTCACGGGAGCTCTCCATGCATTTGGTATTTTCGTCTGGGGGGTGTGCACGC   FFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFF:FFFFFFFFFFF:FFFFFFFFFFFFFFFFFF   NM:i:0  MD:Z:80 MC:Z:101M       AS:i:80 XS:i:71 XA:Z:chr17,+22020706,101M,6;
READ2   163     chrM    1       60      1S100M  =       91      191     GGATCACAGGTCTATCACCCTATTAACCACTCACGGGAGCTCTCCATGCATTTGGTATTTTCGTCTGGGGGGTGTGCACGCGATAGCATTGCGAGACGCTG   FFFFFFFFFFFFFFFFFFFFFFFFFFFF:FFFFFFFFFFFFFFFFFFFFFFFFFFFFF:FFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFF   NM:i:0  MD:Z:100        MC:Z:101M       AS:i:100        XS:i:61
READ3   99      chrM    1       60      2S84M   =       1       84      TGGATCACAGGTCTATCACCCTATTAACCACTCACGGGAGCTCTCCATGCATTTGGTATTTTCGTCTGGGGGGTGTGCACGCGATA  FFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFF,FFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFF  MC:Z:2S84M      AS:i:84 XS:i:56
READ4   81      chrM    1       50      15S84M  =       16373   16290   TAAGACATCCCGATGGATCACAGGTCTATCACCCTATTACCCACTCACGGGAGCTCTCCATGCATTTGGTATTGTCGTCTGGGGGGTGTGCACGCGATA     FF:F,F,FF,:F:FF,F,:,FF,:F:FF,FFFFF:FFFF,F:FF,,FFFF,FF:FFFFFF,F:F,F:F::F,F,F:FFFFFFFFF,F:,FF,FFFFFF,     NM:i:2  MD:Z:24A33T25   MC:Z:101M       AS:i:74 XS:i:54
READ5   147     chrM    1       60      2S84M   =       1       -84     TGGATCACAGGTCTATCACCCTATTAACCACTCACGGGAGCTCTCCATGCATTTGGTATTTTCGTCTGGGGGGTGTGCACGCGATA  :FFFFFFFFFFFFFFFFFFFFFF:FFFFF:FFFFFFFFFFFFFFFFFFFFFFFFFFFF:FFFFFFFFFFFFFFFFFFFFFFFFFFF  MC:Z:2S84M      AS:i:84 XS:i:56

We can use samtools view to convert this SAM file to CRAM file and store NM/MD tag verbatim. Please note that tabs might be converted to spaces if you directly copy the example above.

samtools view -C -T ucsc.hg19.asta --output-fmt-option store_md=1 --output-fmt-option store_nm=1 -o test.cram test.sam
samtools index test.cram

Then, we use fetch to get reads

  1. without multiple_iterators=True
import pysam
f = pysam.AlignmentFile('test.cram', 'rc', format_options=[b"decode_md=0"])
for read in f.fetch("chrM", 1, 2):
    print(read)
f.close()

and we get:

READ1	99	#0	1	60	21S80M	#0	177	277	CTTAAATAAGACATCACGATGGATCACAGGTCTATCACCCTATTAACCACTCACGGGAGCTCTCCATGCATTTGGTATTTTCGTCTGGGGGGTGTGCACGC	array('B', [37, 37, 37, 37, 37, 37, 37, 37, 37, 37, 37, 37, 37, 37, 37, 37, 37, 37, 37, 37, 37, 37, 37, 37, 37, 37, 37, 37, 37, 37, 37, 37, 37, 37, 37, 37, 37, 37, 37, 37, 37, 37, 37, 37, 37, 37, 37, 37, 37, 37, 37, 37, 37, 37, 37, 37, 37, 37, 37, 37, 37, 37, 37, 37, 37, 37, 37, 37, 37, 37, 25, 37, 37, 37, 37, 37, 37, 37, 37, 37, 37, 37, 25, 37, 37, 37, 37, 37, 37, 37, 37, 37, 37, 37, 37, 37, 37, 37, 37, 37, 37])	[('NM', 0), ('MD', '80'), ('MC', '101M'), ('AS', 80), ('XS', 71), ('XA', 'chr17,+22020706,101M,6;')]
READ2	163	#0	1	60	1S100M	#0	91	191	GGATCACAGGTCTATCACCCTATTAACCACTCACGGGAGCTCTCCATGCATTTGGTATTTTCGTCTGGGGGGTGTGCACGCGATAGCATTGCGAGACGCTG	array('B', [37, 37, 37, 37, 37, 37, 37, 37, 37, 37, 37, 37, 37, 37, 37, 37, 37, 37, 37, 37, 37, 37, 37, 37, 37, 37, 37, 37, 25, 37, 37, 37, 37, 37, 37, 37, 37, 37, 37, 37, 37, 37, 37, 37, 37, 37, 37, 37, 37, 37, 37, 37, 37, 37, 37, 37, 37, 37, 25, 37, 37, 37, 37, 37, 37, 37, 37, 37, 37, 37, 37, 37, 37, 37, 37, 37, 37, 37, 37, 37, 37, 37, 37, 37, 37, 37, 37, 37, 37, 37, 37, 37, 37, 37, 37, 37, 37, 37, 37, 37, 37])	[('NM', 0), ('MD', '100'), ('MC', '101M'), ('AS', 100), ('XS', 61)]
READ3	99	#0	1	60	2S84M	#0	1	84	TGGATCACAGGTCTATCACCCTATTAACCACTCACGGGAGCTCTCCATGCATTTGGTATTTTCGTCTGGGGGGTGTGCACGCGATA	array('B', [37, 37, 37, 37, 37, 37, 37, 37, 37, 37, 37, 37, 37, 37, 37, 37, 37, 37, 37, 37, 37, 37, 37, 37, 37, 37, 37, 37, 37, 37, 37, 37, 37, 37, 37, 37, 37, 37, 37, 37, 37, 37, 37, 37, 37, 37, 37, 37, 37, 37, 37, 11, 37, 37, 37, 37, 37, 37, 37, 37, 37, 37, 37, 37, 37, 37, 37, 37, 37, 37, 37, 37, 37, 37, 37, 37, 37, 37, 37, 37, 37, 37, 37, 37, 37, 37])	[('MC', '2S84M'), ('AS', 84), ('XS', 56)]
READ4	81	#0	1	50	15S84M	#0	16373	16290	TAAGACATCCCGATGGATCACAGGTCTATCACCCTATTACCCACTCACGGGAGCTCTCCATGCATTTGGTATTGTCGTCTGGGGGGTGTGCACGCGATA	array('B', [37, 37, 25, 37, 11, 37, 11, 37, 37, 11, 25, 37, 25, 37, 37, 11, 37, 11, 25, 11, 37, 37, 11, 25, 37, 25, 37, 37, 11, 37, 37, 37, 37, 37, 25, 37, 37, 37, 37, 11, 37, 25, 37, 37, 11, 11, 37, 37, 37, 37, 11, 37, 37, 25, 37, 37, 37, 37, 37, 37, 11, 37, 25, 37, 11, 37, 25, 37, 25, 25, 37, 11, 37, 11, 37, 25, 37, 37, 37, 37, 37, 37, 37, 37, 37, 11, 37, 25, 11, 37, 37, 11, 37, 37, 37, 37, 37, 37, 11])	[('NM', 2), ('MD', '24A33T25'), ('MC', '101M'), ('AS', 74), ('XS', 54)]
READ5	147	#0	1	60	2S84M	#0	1	-84	TGGATCACAGGTCTATCACCCTATTAACCACTCACGGGAGCTCTCCATGCATTTGGTATTTTCGTCTGGGGGGTGTGCACGCGATA	array('B', [25, 37, 37, 37, 37, 37, 37, 37, 37, 37, 37, 37, 37, 37, 37, 37, 37, 37, 37, 37, 37, 37, 37, 25, 37, 37, 37, 37, 37, 25, 37, 37, 37, 37, 37, 37, 37, 37, 37, 37, 37, 37, 37, 37, 37, 37, 37, 37, 37, 37, 37, 37, 37, 37, 37, 37, 37, 37, 25, 37, 37, 37, 37, 37, 37, 37, 37, 37, 37, 37, 37, 37, 37, 37, 37, 37, 37, 37, 37, 37, 37, 37, 37, 37, 37, 37])	[('MC', '2S84M'), ('AS', 84), ('XS', 56)]

READ3 and READ5 don't have MD tags, which is expected.
2. with multiple_iterators=True

import pysam
f = pysam.AlignmentFile('test.cram', 'rc', format_options=[b"decode_md=0"])
for read in f.fetch("chrM", 1, 2, multiple_iterators=True):
    print(read)
f.close()

and we get:

READ1	99	#0	1	60	21S80M	#0	177	277	CTTAAATAAGACATCACGATGGATCACAGGTCTATCACCCTATTAACCACTCACGGGAGCTCTCCATGCATTTGGTATTTTCGTCTGGGGGGTGTGCACGC	array('B', [37, 37, 37, 37, 37, 37, 37, 37, 37, 37, 37, 37, 37, 37, 37, 37, 37, 37, 37, 37, 37, 37, 37, 37, 37, 37, 37, 37, 37, 37, 37, 37, 37, 37, 37, 37, 37, 37, 37, 37, 37, 37, 37, 37, 37, 37, 37, 37, 37, 37, 37, 37, 37, 37, 37, 37, 37, 37, 37, 37, 37, 37, 37, 37, 37, 37, 37, 37, 37, 37, 25, 37, 37, 37, 37, 37, 37, 37, 37, 37, 37, 37, 25, 37, 37, 37, 37, 37, 37, 37, 37, 37, 37, 37, 37, 37, 37, 37, 37, 37, 37])	[('NM', 0), ('MD', '80'), ('MC', '101M'), ('AS', 80), ('XS', 71), ('XA', 'chr17,+22020706,101M,6;')]
READ2	163	#0	1	60	1S100M	#0	91	191	GGATCACAGGTCTATCACCCTATTAACCACTCACGGGAGCTCTCCATGCATTTGGTATTTTCGTCTGGGGGGTGTGCACGCGATAGCATTGCGAGACGCTG	array('B', [37, 37, 37, 37, 37, 37, 37, 37, 37, 37, 37, 37, 37, 37, 37, 37, 37, 37, 37, 37, 37, 37, 37, 37, 37, 37, 37, 37, 25, 37, 37, 37, 37, 37, 37, 37, 37, 37, 37, 37, 37, 37, 37, 37, 37, 37, 37, 37, 37, 37, 37, 37, 37, 37, 37, 37, 37, 37, 25, 37, 37, 37, 37, 37, 37, 37, 37, 37, 37, 37, 37, 37, 37, 37, 37, 37, 37, 37, 37, 37, 37, 37, 37, 37, 37, 37, 37, 37, 37, 37, 37, 37, 37, 37, 37, 37, 37, 37, 37, 37, 37])	[('NM', 0), ('MD', '100'), ('MC', '101M'), ('AS', 100), ('XS', 61)]
READ3	99	#0	1	60	2S84M	#0	1	84	TGGATCACAGGTCTATCACCCTATTAACCACTCACGGGAGCTCTCCATGCATTTGGTATTTTCGTCTGGGGGGTGTGCACGCGATA	array('B', [37, 37, 37, 37, 37, 37, 37, 37, 37, 37, 37, 37, 37, 37, 37, 37, 37, 37, 37, 37, 37, 37, 37, 37, 37, 37, 37, 37, 37, 37, 37, 37, 37, 37, 37, 37, 37, 37, 37, 37, 37, 37, 37, 37, 37, 37, 37, 37, 37, 37, 37, 11, 37, 37, 37, 37, 37, 37, 37, 37, 37, 37, 37, 37, 37, 37, 37, 37, 37, 37, 37, 37, 37, 37, 37, 37, 37, 37, 37, 37, 37, 37, 37, 37, 37, 37])	[('MC', '2S84M'), ('AS', 84), ('XS', 56), ('MD', '84'), ('NM', 0)]
READ4	81	#0	1	50	15S84M	#0	16373	16290	TAAGACATCCCGATGGATCACAGGTCTATCACCCTATTACCCACTCACGGGAGCTCTCCATGCATTTGGTATTGTCGTCTGGGGGGTGTGCACGCGATA	array('B', [37, 37, 25, 37, 11, 37, 11, 37, 37, 11, 25, 37, 25, 37, 37, 11, 37, 11, 25, 11, 37, 37, 11, 25, 37, 25, 37, 37, 11, 37, 37, 37, 37, 37, 25, 37, 37, 37, 37, 11, 37, 25, 37, 37, 11, 11, 37, 37, 37, 37, 11, 37, 37, 25, 37, 37, 37, 37, 37, 37, 11, 37, 25, 37, 11, 37, 25, 37, 25, 25, 37, 11, 37, 11, 37, 25, 37, 37, 37, 37, 37, 37, 37, 37, 37, 11, 37, 25, 11, 37, 37, 11, 37, 37, 37, 37, 37, 37, 11])	[('NM', 2), ('MD', '24A33T25'), ('MC', '101M'), ('AS', 74), ('XS', 54)]
READ5	147	#0	1	60	2S84M	#0	1	-84	TGGATCACAGGTCTATCACCCTATTAACCACTCACGGGAGCTCTCCATGCATTTGGTATTTTCGTCTGGGGGGTGTGCACGCGATA	array('B', [25, 37, 37, 37, 37, 37, 37, 37, 37, 37, 37, 37, 37, 37, 37, 37, 37, 37, 37, 37, 37, 37, 37, 25, 37, 37, 37, 37, 37, 25, 37, 37, 37, 37, 37, 37, 37, 37, 37, 37, 37, 37, 37, 37, 37, 37, 37, 37, 37, 37, 37, 37, 37, 37, 37, 37, 37, 37, 25, 37, 37, 37, 37, 37, 37, 37, 37, 37, 37, 37, 37, 37, 37, 37, 37, 37, 37, 37, 37, 37, 37, 37, 37, 37, 37, 37])	[('MC', '2S84M'), ('AS', 84), ('XS', 56), ('MD', '84'), ('NM', 0)]

All reads have MD tags, which is unexpected.

I checked the source code of pysam, and found that the issue arises when multiple_iterators=True. It opens a new CRAM file object, but it doesn't strictly adhere to the original opening options, for example, it doesn't use the format_options. Due to the absence of setting decode_md=0, all reads' NM/MD tags will be automatically calculated, resulting in all reads containing NM/MD tags.

self.htsfile = hts_open(cfilename, 'r')

Thanks!

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
None yet
Projects
None yet
Development

No branches or pull requests

1 participant