-
Notifications
You must be signed in to change notification settings - Fork 10
New issue
Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.
By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.
Already on GitHub? Sign in to your account
An issue about reproducibility. #16
Comments
Hi @fjmuzengyiheng, thanks for reporting this issue. Could you trying the latest code in the repository? I have tried it on my own data using different number of processes, the number of events remain the same. The resulting genotype may differ slightly as Sklearn's GMM clustering might give (mostly slightly) different results occasionally. |
Thank you so much for your reply. It really helped. But, another issue bothered me as well. simple_repeats_nochr_22XY_0000.bed #!/bin/bash singularity exec ${straglr_sif} python /usr/local/bin/straglr.py call TR_start: 2022-08-07 16:27:29 It took about 25h to genotype 100,000 loci. Did I set the parameter wrong? Can I make it faster? |
How deep is your sample sequenced? I suspect it's pretty deep? For a 20-30x sample for 100,000 STRs, it took 1.5h using 32 processors. If it's really deep, like 80-90x, it's gonna take longer as every read at the locus needs to be looked at. |
Hi, my ONT data is a 20-30x sample. If I selected loci with small size (eg. 3bp to 6bp), it runs faster within 2.5h using --nprocs 70. That shall work. Thank you so much for being patient to my issues. |
@fjmuzengyiheng, I believe you are using Straglr v1.2. If you want to give the current code in the repository a try, it should be much faster. I am going to make this a public release hopefully later this week, it will be v1.3. Thanks very much for trying Straglr. |
Thank you so much. Looking forward to the new public release v1.3. : - ) |
Hi, Prof.Readman Chiu.
Sorry for bother again.
I am using this first-tier tool for STR counting for my neurogenetic patients.
Here is one issue I want to report:
There is one disease named "CANVAS (Cerebellar ataxia, neuropathy, and vestibular areflexia syndrome)", which is caused by an expansion of (AAGGG)n repeat in RFC1 gene. (https://omim.org/entry/102579?search=RFC1&highlight=rfc1)
The sticky situation lies in:
The reference sequence is (AAAAG)n for this loci (hg38, 4:39,348,424-39,348,483).
There is an (AAGGG)n expansion in my data (ONT), which is confirmed by visualizing manually by IGV. (attached below)
When I use the bed file below, Straglr outputs nothing.
【bed file】
4 39348424 39348483 AAAAG"
I am not willing to give up this tool for its outstanding performance. Can Straglr deal with this "complex STRs (changed motif situation)"? I will be pleased if Straglr can deal with this situation, which will make it SUPER perfect.
Thank you again. This issue is also raised in Github (#19).
[attached file1: IGV visualization for my patient's ONT data]
[attached file2: 3510bp insertion in the first line]
AGGGAAGGGAAGGGAAGGAAGGAAGGGAAGGAAGGGAAGGGAAGGGAAGGGAAGGGAAGGGGAAGGAAGGAAGGAAGGAAGGAAGGAAGGAAGGAAGGAAGGAAGGAAGGAAGGAAGGGAAGGAAGGGAAGGGAAGGAAGGAAGGAAGGAAGGAAGGAAGGAAGGAAGGAAGGAAGGAAGGGAAGGAAGGGAAGGAAGGGAAGGGAAGGAAGGAAGGGAAGGGAAGGAAGGGAAGGGAAGGAAGGAAGGGAAGGAAGGGAAGGGAAGGGAAGGGAAGGGAAGGGAAGGGAAGGAAGGGAAGGGAAGGGAAGGGAAGGGAAGGGAAGGGAAGGGAAGGAAGGGAAGGGAAGGGAAGGGAAGGGAAGGGAAGGAAGGGAAGGAAGGAAGGAAGGAAGGAAGGAAGGAAGGAAGGAAGGAAGGAAGGAAGGAAGGAAGGGAAGGGAAGGGAAGGGAAGGGAAGGAAGGAAGGGAAGGGAAGGGAAAGGAAGGAAGGAAGGGAAGGAAGGGAAGGGAAGGGAAGGGAAGGGAAGGAAGGAAGGAAGGAAGGGAAGGGAAGGAAGGAAGGAAGGAAGGAAGGAAGGAAGGAAGGGAAGGAAGGAAGCGGGGAAGGGAAGGGAAGGAAGGAAGGGAAGGAAGGAAGGGAAGGGAAGGAAGGGAAGGGAAGGAAGGAAGGAAGGAAGGAAGGAAGAAGGGAAGGGAAGGGAAGGAAGGGAAGGGAAGGGAAGGGAAGGAAGGGAAGGGAAGGAAGGAAGGAAGGAAGGAAGGAAGGAAGGAAGGAAGGGAAGGGAAGGGAAGGGAAGGAAGGAAGGGAAGGGAAGGGAAGGGAAGGAAGGAAGGGAAGGAAGGAAGGGAAGGAAGGGAAGGGAAGGAAGGAAGGAAGGAAGGAAGGAAGGAAGGAAGGAAGGAAGGAAGGGGAAGGGAAGGGAAGGGAAGGGAAGGGAAGGAAGGAAGGGAAGGGAAGGGAAGGGAAGGAAGGAAGGAAGGAAGGAAGGAAGGAAGGGAAGGGAAGGGAAGGGAAGGGAAGGGAAGGGAAGGGAAGGGAAGGGAAGGGAAGGAAGGAAGGGAAGGGAAGGGAAGGGAAGGGAAGGAAGGAAGGGAAGGGAAGGGAAGGGGAAGGGAAGGAAGGGAAGGAAGGAAGGGAAGGGAAGGGAAGGAAGGAAGGGAAGGAAGGAAGGAAGGAAGGAAGGGAAGGGAAGGGAAGGAAGGAAGGGAAGGGAAGGGAAGGGAAGGAAGGGAAGGAAGGAAGGAAGGAAGGAAGGAAGGAAGGGAAGGAAGGAAGGAAGGGAAGGGAAGGAAGGGAAGGGAAGGAAGGAAGGAAGGAAGGAAGGAAGGAAGGAAGGAAGGAAGGAAGGAAGGAAGGAAGGAAGGGAAGGAAGGAAGGGAAGGAAGGAAGGAAGGGAAGGGAAGGGAAGGGAAGGGAAGGGAAGGAAGGAAGGAAGGAAGGAAGGGAAGGGAAGGAAGGAAGGAAGGAAGGAGAAGGAAGGAAGGAAGGAAGGAAGGAAGGAAGGAAGGAAGGAAGGAAGGGAAGGGAAGGGAAGGAAGGAAGGAAGGAAGGAAGGAAGGGAAGGAAGGAAGGGAAGGGAAGGGAAGGAAGGAAGGGAAGGGAAGGGAAGGGAAGGAAGGAAGGAAGGAAGGGAAGGAAGGAAGGAAGGGAAGGGAAGGGAAGGGAAGGAAGGAAGGAAGGAAGGAAGGGAAGGAAGGAAGGAAGGAAGGAAGGAAGGAAGGAAGGAAGGGAAGGAAGGAAGGCAATACAGAAGAAGAAGTAATACAGAAGGAAGGAAGGAAGGGAAGGGAAGGAAAGGGAAGGGAAGGGAAGGAAGGAAGGAAGGAAGGGAAGGAAGGAAGGAAGGGAAGGAAGGAAGGAAGGAAGGAAGGAAGGAAGGAAGGGAAGGGAAGGGAAGGGGAAGGGAAGGGAAGGAAGGAAGGGAAGGAAGGGAAGGGAAGGAAGGGAAGGGAAGGGAAGGAAGGGAAGGAAGGAAGGAAGGGAAGGAAGGAAGGAAGGAAGGAAGGGAAGGAAGGGGAAGGGAAGGGAAGGGAAGGGAAGGGAAGGGAAGGGAAGGGAAGGGAAGGAAGGGAAGGAAGGGAAGGGAAGGCAAGGGAAGGAAGGAAGGAAGGAAGGGAAGGAAGGAAGGGAAGGAAGGAAGGAAGGAAGGAAGGAAAGGAAGGAAGGGAAGGGAAGGGAAGGGAGGAAGGGAGGGAAGGGAAGGAAGGGAAGGGAAGGAAGGGAAGGGCGAAGGGAAGGAAGGAAGGGAAGGGAAGGGAAGGGAAGGGAAGGAAGGGAAGGGAAGGAAGGAAGGAAGGAAGGAAGGAAGGAAGGGAAGGGAAGGGAAGGAAGGAAGGAAGGGAAGGGAAGGGAAGGAAGGGAAGGGAAGGGAAGGAAGGAAGGGGAAGGAAGGGAAGGAAGGGAAGGAAGGAAGGGAAGGGAAGGGAAGGGAAGGGAAGGGAAGGAAGGAAGGAAGGAAGGAAGGGAAGGAAGGAAGGAAGGAAGGGAAGGGAAGGAAGGGAAGGGAAGGAAGGGAAGGAAGGAAGGAAGGGAAGGAAGGAAGGGAAGGAAGGGAAGGAAGGAAGGAAGGGAAGGAAGGAAGGGAAGGGAAGGGAAGGGAAGGGAAGGGAAGGGAAGGGAAGGGAAGGGAAGGGAAGGGAAGGGAAGGGAAGGGAAGGGAAGGAAGGGAAGGGAAGGAAGGAAGGGAAGGGAAGGAAGGGAAGGGAAGGGAAGGAAGGGAAGGGAAGGGAAGGGAAGGGAAGGAAGGAAGGGAAGGAAGGAAGGGAAGGAAGGAAGGAAGGAAGGAAGGAAGGAAGGAAGGAAGGGAAGGAAGGAAGGAAGGAAGGAAGGAAGGAAGGAAGGAAGGAAGGAAGGGAAGGGAAGGGAAGGGAAGGGAAGGGAAGGGAAGGAAGGGAAGGGAAGGAAGGAAGGGAAGGGAAGGGAAGGGAAGGGAAGGAAGGGAAGGGAAGGGAAGGGAAGGGAAGGGAAGGGAAGGGAAGGGAAGGAAGGAAGGAAGGGAAGGAAGGAAGGAAGGGAAGGGAAGGGAAGGGAAGGGAAGGGAGGGAAGGAAGGGAAGGAAGGAAGGAAGGAAGGGAAGGGAAGGAAGGAAGGAAGGGAAGGAAGGGAAGGGAAGGAAGGAAGGAAGGGAAGGGAAGGGAAGGGAAGGGAAGGGAAGGAAGGAAGGAAGGAAGGAAGGAAGGGAAGGGAAGGGAAGGGAAGGGAAGGGAAGGGAAGGAAGGGAAGGGAAGGAAGGGAAGGGAAGGGAAGGGAAGGGAAGGGAAGGGAAGGGAAGGGAAGGGAAGGGAAGGGAAGGGAAGGAAGGAAGGAAGGGAAGGGAAGGGAAGGGAAGGGAAGGGAAGGGAAGGGAAGGAAGGGAAGGGAAGGGAAGGGAAGGGAAGGGAAGGGAAGGAAGGGAAGGGAAGGGAAGGGAAGGGAAGGGAAGGGAAGGGAAGGAAGGAAGGAAGGAAGGAAGGGAAGGGAAGGGAAGGGAAGGGAAGGGAAGGAAGGAAGGAAGGAAGGAAGGGAAGGGAAGGAAGGGAAGGAAGGAAGGAAGGGAAGGGAAGGGAAGGGAAGGGAAGGGAAGGGAAGGGAAGGGAAGGG
…-----原始邮件-----
发件人:readmanchiu ***@***.***>
发送时间:2022-08-10 03:27:50 (星期三)
收件人: bcgsc/straglr ***@***.***>
抄送: fjmuzengyiheng ***@***.***>, Mention ***@***.***>
主题: Re: [bcgsc/straglr] An issue about reproducibility. (Issue #16)
@fjmuzengyiheng, I believe you are using Straglr v1.2. If you want to give the current code in the repository a try, it should be much faster. I am going to make this a public release hopefully later this week, it will be v1.3. Thanks very much for trying Straglr.
—
Reply to this email directly, view it on GitHub, or unsubscribe.
You are receiving this because you were mentioned.Message ID: ***@***.***>
|
Issue followed up through private communication |
**Hi, @readmanchiu
I am a new fan of the first-tier tool Straglr. I am trying to use this tool to genotype the genome-wide STR/VNTR for my ONT data.
I followed your guidance in the "Example application: genome-wide genotyping" part.
However, I found an issue about reproducibility.
Here is my steps:**
(1) I download UCSC Simple Repeats track as an tsv file;
(2) I deleted the prefix "chr" to be compatible with my genome reference;
(3) I also deleted the alternative contigs and kept those STR/VNTR loci in the 22+XY chromosome;
(4) I converted downloaded table into bed format using:
grep -v '^#' simple_repeats.downloaded.tsv | awk -vOFS='\t' 'length($17)>1 {print $2,$3,$4,$17}' > simple_repeats.bed
(5) I split whole-genome bed file into batches using:
split -l 10000 -d -a 4 --additional-suffix=.bed simple_repeats.bed simple_repeats_nochr
Then, I run the Genotype mode (locus given):
work_dir=$(pwd)
map_ref=/genetics/home/zengyiheng/project/zyh-pipeline/reference/grch38/GRCh38_alignment.fa
straglr_sif=/genetics/home/zengyiheng/project/zyh-pipeline/straglr/straglr_blastn_trf.sif
str_bed2=/genetics/home/zengyiheng/project/zyh-pipeline/straglr/str_db_ucsc_hg38/nochr/simple_repeats_nochr_0000.bed
mkdir ${work_dir}/str
mkdir ${work_dir}/tmp
cd ${work_dir}/str
singularity exec ${straglr_sif} python /usr/local/bin/straglr.py
${work_dir}/bam/19620.hg38.bam
${map_ref}
straglr_scan_19620
--loci ${str_bed2}
--min_str_len 3
--max_str_len 100
--min_ins_size 100
--genotype_in_size
--min_support 2
--max_num_clusters 2
--nprocs ${cpu_thread}
--tmpdir ${work_dir}/tmp
=============
The [simple_repeats_nochr_0000.bed] contains 10000 loci;
I run the above command four times respectively setting --nprocs ${cpu_thread} to be 2 / 4 / 48 / 48,
and found the results are not the same.
[the --nprocs i used] [the number of loci genotyped successfully recorded in the straglr_scan_19620.bed file ]
--nprocs 2 [9719 loci]
--nprocs 4 [9808 loci]
--nprocs 48 (1st run) [9878 loci]
--nprocs 48 (2nd run) [9880 loci]
The output of four runs is uploaded here. ( https://airportal.cn/965275/FmgSnxkGpH )(password: straglr)
It really confused me that the numbers of output loci were not 10000 and different from different runs.
Could you please be so kind to help me with this issue? Thank you so much. : )
The text was updated successfully, but these errors were encountered: