An issue about reproducibility. #16

fjmuzengyiheng · 2022-08-01T11:23:29Z

I am a new fan of the first-tier tool Straglr. I am trying to use this tool to genotype the genome-wide STR/VNTR for my ONT data.
I followed your guidance in the "Example application: genome-wide genotyping" part.
However, I found an issue about reproducibility.

Here is my steps:**

(1) I download UCSC Simple Repeats track as an tsv file;
(2) I deleted the prefix "chr" to be compatible with my genome reference;
(3) I also deleted the alternative contigs and kept those STR/VNTR loci in the 22+XY chromosome;
(4) I converted downloaded table into bed format using:

grep -v '^#' simple_repeats.downloaded.tsv | awk -vOFS='\t' 'length($17)>1 {print $2,$3,$4,$17}' > simple_repeats.bed

(5) I split whole-genome bed file into batches using:

split -l 10000 -d -a 4 --additional-suffix=.bed simple_repeats.bed simple_repeats_nochr

Then, I run the Genotype mode (locus given):

work_dir=$(pwd)
map_ref=/genetics/home/zengyiheng/project/zyh-pipeline/reference/grch38/GRCh38_alignment.fa
straglr_sif=/genetics/home/zengyiheng/project/zyh-pipeline/straglr/straglr_blastn_trf.sif
str_bed2=/genetics/home/zengyiheng/project/zyh-pipeline/straglr/str_db_ucsc_hg38/nochr/simple_repeats_nochr_0000.bed
mkdir ${work_dir}/str
mkdir ${work_dir}/tmp

cd ${work_dir}/str
singularity exec ${straglr_sif} python /usr/local/bin/straglr.py
${work_dir}/bam/19620.hg38.bam
${map_ref}
straglr_scan_19620
--loci ${str_bed2}
--min_str_len 3
--max_str_len 100
--min_ins_size 100
--genotype_in_size
--min_support 2
--max_num_clusters 2
--nprocs ${cpu_thread}
--tmpdir ${work_dir}/tmp

=============

The [simple_repeats_nochr_0000.bed] contains 10000 loci;
I run the above command four times respectively setting --nprocs ${cpu_thread} to be 2 / 4 / 48 / 48,
and found the results are not the same.

[the --nprocs i used] [the number of loci genotyped successfully recorded in the straglr_scan_19620.bed file ]
--nprocs 2 [9719 loci]
--nprocs 4 [9808 loci]
--nprocs 48 (1st run) [9878 loci]
--nprocs 48 (2nd run) [9880 loci]

The output of four runs is uploaded here. ( https://airportal.cn/965275/FmgSnxkGpH )(password: straglr)

It really confused me that the numbers of output loci were not 10000 and different from different runs.
Could you please be so kind to help me with this issue? Thank you so much. : )

readmanchiu · 2022-08-03T01:18:39Z

Hi @fjmuzengyiheng, thanks for reporting this issue. Could you trying the latest code in the repository? I have tried it on my own data using different number of processes, the number of events remain the same. The resulting genotype may differ slightly as Sklearn's GMM clustering might give (mostly slightly) different results occasionally.
Also note that you don't have to run with --min_str_len, --max_str_len, and --min_ins_size when you run in genotyping mode i.e. when --loci was provided, those parameters are only applicable when running in genome_scan mode.

fjmuzengyiheng · 2022-08-08T16:47:22Z

Thank you so much for your reply. It really helped.
And I also deleted the --min_str_len, --max_str_len, and --min_ins_size when I run in genotyping mode.

But, another issue bothered me as well.
I have about 900,000 loci from the simple_repeats_nochr_22XY.bed (hg38)
I split the simple_repeats_nochr_22XY.bed (hg38) into 9 files (100,000 loci per file):

simple_repeats_nochr_22XY_0000.bed
simple_repeats_nochr_22XY_0001.bed
....
simple_repeats_nochr_22XY_0008.bed

#!/bin/bash
#SBATCH -J zyh-mono0000
#SBATCH -p normal
#SBATCH -N 1
#SBATCH --output=log.%j.out
#SBATCH --error=log.%j.err
#SBATCH --cpus-per-task=48
#SBATCH --nodelist=node2
ulimit -HSn 1002400
ulimit -s unlimited
ulimit -l unlimited

singularity exec ${straglr_sif} python /usr/local/bin/straglr.py
${work_dir}/bam/19620.hg38.bam
${map_ref}
straglr_scan_19620
--loci simple_repeats_nochr_22XY_0000.bed
--genotype_in_size
--min_support 2
--max_num_clusters 2
*--nprocs 48 *
--tmpdir ${work_dir}/tmp

call TR_start: 2022-08-07 16:27:29
call TR_end: 2022-08-08 17:53:19

It took about 25h to genotype 100,000 loci.
If I want to genotype all the genome-wide STR/VNTR loci, it would take about 9 days to genotype one sample using 48 CPU.

Did I set the parameter wrong? Can I make it faster?

readmanchiu · 2022-08-08T20:31:20Z

How deep is your sample sequenced? I suspect it's pretty deep? For a 20-30x sample for 100,000 STRs, it took 1.5h using 32 processors. If it's really deep, like 80-90x, it's gonna take longer as every read at the locus needs to be looked at.

fjmuzengyiheng · 2022-08-09T05:53:08Z

Hi, my ONT data is a 20-30x sample. If I selected loci with small size (eg. 3bp to 6bp), it runs faster within 2.5h using --nprocs 70. That shall work. Thank you so much for being patient to my issues.

readmanchiu · 2022-08-09T19:27:40Z

@fjmuzengyiheng, I believe you are using Straglr v1.2. If you want to give the current code in the repository a try, it should be much faster. I am going to make this a public release hopefully later this week, it will be v1.3. Thanks very much for trying Straglr.

fjmuzengyiheng · 2022-08-10T01:11:58Z

Thank you so much. Looking forward to the new public release v1.3. : - )

fjmuzengyiheng · 2023-01-07T07:54:58Z

Hi, Prof.Readman Chiu. Sorry for bother again. I am using this first-tier tool for STR counting for my neurogenetic patients. Here is one issue I want to report: There is one disease named "CANVAS (Cerebellar ataxia, neuropathy, and vestibular areflexia syndrome)", which is caused by an expansion of (AAGGG)n repeat in RFC1 gene. (https://omim.org/entry/102579?search=RFC1&highlight=rfc1) The sticky situation lies in: The reference sequence is (AAAAG)n for this loci (hg38, 4:39,348,424-39,348,483). There is an (AAGGG)n expansion in my data (ONT), which is confirmed by visualizing manually by IGV. (attached below) When I use the bed file below, Straglr outputs nothing. 【bed file】 4 39348424 39348483 AAAAG" I am not willing to give up this tool for its outstanding performance. Can Straglr deal with this "complex STRs (changed motif situation)"? I will be pleased if Straglr can deal with this situation, which will make it SUPER perfect. Thank you again. This issue is also raised in Github (#19). [attached file1: IGV visualization for my patient's ONT data] [attached file2: 3510bp insertion in the first line] AGGGAAGGGAAGGGAAGGAAGGAAGGGAAGGAAGGGAAGGGAAGGGAAGGGAAGGGAAGGGGAAGGAAGGAAGGAAGGAAGGAAGGAAGGAAGGAAGGAAGGAAGGAAGGAAGGAAGGGAAGGAAGGGAAGGGAAGGAAGGAAGGAAGGAAGGAAGGAAGGAAGGAAGGAAGGAAGGAAGGGAAGGAAGGGAAGGAAGGGAAGGGAAGGAAGGAAGGGAAGGGAAGGAAGGGAAGGGAAGGAAGGAAGGGAAGGAAGGGAAGGGAAGGGAAGGGAAGGGAAGGGAAGGGAAGGAAGGGAAGGGAAGGGAAGGGAAGGGAAGGGAAGGGAAGGGAAGGAAGGGAAGGGAAGGGAAGGGAAGGGAAGGGAAGGAAGGGAAGGAAGGAAGGAAGGAAGGAAGGAAGGAAGGAAGGAAGGAAGGAAGGAAGGAAGGAAGGGAAGGGAAGGGAAGGGAAGGGAAGGAAGGAAGGGAAGGGAAGGGAAAGGAAGGAAGGAAGGGAAGGAAGGGAAGGGAAGGGAAGGGAAGGGAAGGAAGGAAGGAAGGAAGGGAAGGGAAGGAAGGAAGGAAGGAAGGAAGGAAGGAAGGAAGGGAAGGAAGGAAGCGGGGAAGGGAAGGGAAGGAAGGAAGGGAAGGAAGGAAGGGAAGGGAAGGAAGGGAAGGGAAGGAAGGAAGGAAGGAAGGAAGGAAGAAGGGAAGGGAAGGGAAGGAAGGGAAGGGAAGGGAAGGGAAGGAAGGGAAGGGAAGGAAGGAAGGAAGGAAGGAAGGAAGGAAGGAAGGAAGGGAAGGGAAGGGAAGGGAAGGAAGGAAGGGAAGGGAAGGGAAGGGAAGGAAGGAAGGGAAGGAAGGAAGGGAAGGAAGGGAAGGGAAGGAAGGAAGGAAGGAAGGAAGGAAGGAAGGAAGGAAGGAAGGAAGGGGAAGGGAAGGGAAGGGAAGGGAAGGGAAGGAAGGAAGGGAAGGGAAGGGAAGGGAAGGAAGGAAGGAAGGAAGGAAGGAAGGAAGGGAAGGGAAGGGAAGGGAAGGGAAGGGAAGGGAAGGGAAGGGAAGGGAAGGGAAGGAAGGAAGGGAAGGGAAGGGAAGGGAAGGGAAGGAAGGAAGGGAAGGGAAGGGAAGGGGAAGGGAAGGAAGGGAAGGAAGGAAGGGAAGGGAAGGGAAGGAAGGAAGGGAAGGAAGGAAGGAAGGAAGGAAGGGAAGGGAAGGGAAGGAAGGAAGGGAAGGGAAGGGAAGGGAAGGAAGGGAAGGAAGGAAGGAAGGAAGGAAGGAAGGAAGGGAAGGAAGGAAGGAAGGGAAGGGAAGGAAGGGAAGGGAAGGAAGGAAGGAAGGAAGGAAGGAAGGAAGGAAGGAAGGAAGGAAGGAAGGAAGGAAGGAAGGGAAGGAAGGAAGGGAAGGAAGGAAGGAAGGGAAGGGAAGGGAAGGGAAGGGAAGGGAAGGAAGGAAGGAAGGAAGGAAGGGAAGGGAAGGAAGGAAGGAAGGAAGGAGAAGGAAGGAAGGAAGGAAGGAAGGAAGGAAGGAAGGAAGGAAGGAAGGGAAGGGAAGGGAAGGAAGGAAGGAAGGAAGGAAGGAAGGGAAGGAAGGAAGGGAAGGGAAGGGAAGGAAGGAAGGGAAGGGAAGGGAAGGGAAGGAAGGAAGGAAGGAAGGGAAGGAAGGAAGGAAGGGAAGGGAAGGGAAGGGAAGGAAGGAAGGAAGGAAGGAAGGGAAGGAAGGAAGGAAGGAAGGAAGGAAGGAAGGAAGGAAGGGAAGGAAGGAAGGCAATACAGAAGAAGAAGTAATACAGAAGGAAGGAAGGAAGGGAAGGGAAGGAAAGGGAAGGGAAGGGAAGGAAGGAAGGAAGGAAGGGAAGGAAGGAAGGAAGGGAAGGAAGGAAGGAAGGAAGGAAGGAAGGAAGGAAGGGAAGGGAAGGGAAGGGGAAGGGAAGGGAAGGAAGGAAGGGAAGGAAGGGAAGGGAAGGAAGGGAAGGGAAGGGAAGGAAGGGAAGGAAGGAAGGAAGGGAAGGAAGGAAGGAAGGAAGGAAGGGAAGGAAGGGGAAGGGAAGGGAAGGGAAGGGAAGGGAAGGGAAGGGAAGGGAAGGGAAGGAAGGGAAGGAAGGGAAGGGAAGGCAAGGGAAGGAAGGAAGGAAGGAAGGGAAGGAAGGAAGGGAAGGAAGGAAGGAAGGAAGGAAGGAAAGGAAGGAAGGGAAGGGAAGGGAAGGGAGGAAGGGAGGGAAGGGAAGGAAGGGAAGGGAAGGAAGGGAAGGGCGAAGGGAAGGAAGGAAGGGAAGGGAAGGGAAGGGAAGGGAAGGAAGGGAAGGGAAGGAAGGAAGGAAGGAAGGAAGGAAGGAAGGGAAGGGAAGGGAAGGAAGGAAGGAAGGGAAGGGAAGGGAAGGAAGGGAAGGGAAGGGAAGGAAGGAAGGGGAAGGAAGGGAAGGAAGGGAAGGAAGGAAGGGAAGGGAAGGGAAGGGAAGGGAAGGGAAGGAAGGAAGGAAGGAAGGAAGGGAAGGAAGGAAGGAAGGAAGGGAAGGGAAGGAAGGGAAGGGAAGGAAGGGAAGGAAGGAAGGAAGGGAAGGAAGGAAGGGAAGGAAGGGAAGGAAGGAAGGAAGGGAAGGAAGGAAGGGAAGGGAAGGGAAGGGAAGGGAAGGGAAGGGAAGGGAAGGGAAGGGAAGGGAAGGGAAGGGAAGGGAAGGGAAGGGAAGGAAGGGAAGGGAAGGAAGGAAGGGAAGGGAAGGAAGGGAAGGGAAGGGAAGGAAGGGAAGGGAAGGGAAGGGAAGGGAAGGAAGGAAGGGAAGGAAGGAAGGGAAGGAAGGAAGGAAGGAAGGAAGGAAGGAAGGAAGGAAGGGAAGGAAGGAAGGAAGGAAGGAAGGAAGGAAGGAAGGAAGGAAGGAAGGGAAGGGAAGGGAAGGGAAGGGAAGGGAAGGGAAGGAAGGGAAGGGAAGGAAGGAAGGGAAGGGAAGGGAAGGGAAGGGAAGGAAGGGAAGGGAAGGGAAGGGAAGGGAAGGGAAGGGAAGGGAAGGGAAGGAAGGAAGGAAGGGAAGGAAGGAAGGAAGGGAAGGGAAGGGAAGGGAAGGGAAGGGAGGGAAGGAAGGGAAGGAAGGAAGGAAGGAAGGGAAGGGAAGGAAGGAAGGAAGGGAAGGAAGGGAAGGGAAGGAAGGAAGGAAGGGAAGGGAAGGGAAGGGAAGGGAAGGGAAGGAAGGAAGGAAGGAAGGAAGGAAGGGAAGGGAAGGGAAGGGAAGGGAAGGGAAGGGAAGGAAGGGAAGGGAAGGAAGGGAAGGGAAGGGAAGGGAAGGGAAGGGAAGGGAAGGGAAGGGAAGGGAAGGGAAGGGAAGGGAAGGAAGGAAGGAAGGGAAGGGAAGGGAAGGGAAGGGAAGGGAAGGGAAGGGAAGGAAGGGAAGGGAAGGGAAGGGAAGGGAAGGGAAGGGAAGGAAGGGAAGGGAAGGGAAGGGAAGGGAAGGGAAGGGAAGGGAAGGAAGGAAGGAAGGAAGGAAGGGAAGGGAAGGGAAGGGAAGGGAAGGGAAGGAAGGAAGGAAGGAAGGAAGGGAAGGGAAGGAAGGGAAGGAAGGAAGGAAGGGAAGGGAAGGGAAGGGAAGGGAAGGGAAGGGAAGGGAAGGGAAGGG

…

-----原始邮件----- 发件人:readmanchiu ***@***.***> 发送时间:2022-08-10 03:27:50 (星期三) 收件人: bcgsc/straglr ***@***.***> 抄送: fjmuzengyiheng ***@***.***>, Mention ***@***.***> 主题: Re: [bcgsc/straglr] An issue about reproducibility. (Issue #16) @fjmuzengyiheng, I believe you are using Straglr v1.2. If you want to give the current code in the repository a try, it should be much faster. I am going to make this a public release hopefully later this week, it will be v1.3. Thanks very much for trying Straglr. — Reply to this email directly, view it on GitHub, or unsubscribe. You are receiving this because you were mentioned.Message ID: ***@***.***>

readmanchiu · 2023-11-03T15:11:53Z

Issue followed up through private communication

fjmuzengyiheng changed the title ~~I have been using your tool~~ An issue about reproducibility. Aug 1, 2022

readmanchiu closed this as completed Nov 3, 2023

Provide feedback

Saved searches

Use saved searches to filter your results more quickly

An issue about reproducibility. #16

An issue about reproducibility. #16

fjmuzengyiheng commented Aug 1, 2022 •

edited

Loading

readmanchiu commented Aug 3, 2022

fjmuzengyiheng commented Aug 8, 2022

readmanchiu commented Aug 8, 2022

fjmuzengyiheng commented Aug 9, 2022

readmanchiu commented Aug 9, 2022

fjmuzengyiheng commented Aug 10, 2022

fjmuzengyiheng commented Jan 7, 2023 via email

readmanchiu commented Nov 3, 2023

An issue about reproducibility. #16

An issue about reproducibility. #16

Comments

fjmuzengyiheng commented Aug 1, 2022 • edited Loading

grep -v '^#' simple_repeats.downloaded.tsv | awk -vOFS='\t' 'length($17)>1 {print $2,$3,$4,$17}' > simple_repeats.bed

split -l 10000 -d -a 4 --additional-suffix=.bed simple_repeats.bed simple_repeats_nochr

The output of four runs is uploaded here. ( https://airportal.cn/965275/FmgSnxkGpH )(password: straglr)

readmanchiu commented Aug 3, 2022

fjmuzengyiheng commented Aug 8, 2022

readmanchiu commented Aug 8, 2022

fjmuzengyiheng commented Aug 9, 2022

readmanchiu commented Aug 9, 2022

fjmuzengyiheng commented Aug 10, 2022

fjmuzengyiheng commented Jan 7, 2023 via email

readmanchiu commented Nov 3, 2023

fjmuzengyiheng commented Aug 1, 2022 •

edited

Loading