When we convert SAM/BAM to CRAM, we use `store_nm=1` and `store_md=1` to store NM/MD tags in the CRAM file. When reading CRAM with pysam, `decode_md=0` is added to disable automatic calculation of NM/MD tags, allowing it to retrieve the NM/MD tag values stored in the CRAM file. However, for the `fetch` method, setting `multiple_iterators` to True or False yields different NM/MD tags.

Below is a simple example of a SAM file (test.sam). READ1, READ2 and READ4 have MD tags, but READ3 and READ5 do not.

We can use `samtools view` to convert this SAM file to a CRAM file and store the NM/MD tags verbatim. Please note that tabs might be converted to spaces if you directly copy the example above.
Then, we use `fetch` to get reads:

1. with `multiple_iterators=False` and we get:

READ3 and READ5 don't have MD tags, which is expected.

2. with `multiple_iterators=True` and we get:

All reads have MD tags, which is unexpected.
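The comparison above could be scripted roughly as follows. This is only a sketch: the function names `md_by_read` and `reads_with_changed_md` are hypothetical, passing `format_options` as a list of byte strings and using `until_eof=True` (so that no index is needed) are assumptions, and the exact calls used in this report may have differed.

```python
def md_by_read(cram_path, multiple_iterators, reference_filename=None):
    """Map each read name to its MD tag, or None when the read has no MD tag."""
    import pysam  # imported lazily so the pure helper below works without pysam

    with pysam.AlignmentFile(
        cram_path,
        "rc",
        reference_filename=reference_filename,
        format_options=[b"decode_md=0"],  # assumption: options passed as bytes
    ) as cram:
        reads = cram.fetch(until_eof=True, multiple_iterators=multiple_iterators)
        return {
            read.query_name: read.get_tag("MD") if read.has_tag("MD") else None
            for read in reads
        }


def reads_with_changed_md(single, multiple):
    """Return sorted read names whose MD value differs between the two mappings."""
    return sorted(name for name in single if single[name] != multiple.get(name))


# Hypothetical usage against the file from this report:
# single = md_by_read("test.cram", multiple_iterators=False)
# multiple = md_by_read("test.cram", multiple_iterators=True)
# print(reads_with_changed_md(single, multiple))
```

If the behaviour described here is reproduced, the reads reported by `reads_with_changed_md` would be the ones that gained a recalculated MD tag.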
I checked the source code of pysam and found that the issue arises when `multiple_iterators=True`. It opens a new CRAM file object, but it doesn't strictly adhere to the original opening options; for example, it doesn't use the `format_options`. Because `decode_md=0` is not set on the new file object, all reads' NM/MD tags are automatically calculated, resulting in all reads containing NM/MD tags.

pysam/pysam/libcalignmentfile.pyx
Line 2013 in 43c1066
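Until this is addressed, one possible workaround might be to avoid `multiple_iterators=True` and instead open a separate `AlignmentFile` for each independent iterator, passing the same `format_options` each time. The helper below is a hypothetical sketch, not part of pysam; the `opener` parameter exists only so the forwarding logic can be exercised without a CRAM file, and I have not verified this against every pysam version.

```python
def open_independent_reader(cram_path, format_options, opener=None, **kwargs):
    """Open a separate read handle that carries the same format options.

    `opener` defaults to pysam.AlignmentFile; any callable with a compatible
    signature can be substituted (for example, a stub in tests).
    """
    if opener is None:
        import pysam  # imported lazily; only needed for real files

        opener = pysam.AlignmentFile
    # Copy the options so each handle gets its own list.
    return opener(cram_path, "rc", format_options=list(format_options), **kwargs)


# Hypothetical usage: two handles iterated independently, both with decode_md=0.
# a = open_independent_reader("test.cram", [b"decode_md=0"])
# b = open_independent_reader("test.cram", [b"decode_md=0"])
# for read in a.fetch(until_eof=True):
#     ...
```

Each handle then calls `fetch` with the default `multiple_iterators=False`, so the reopening code path referenced above should not be involved.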
Thanks!