Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

An issue about reproducibility. #16

Closed
fjmuzengyiheng opened this issue Aug 1, 2022 · 8 comments
Closed

An issue about reproducibility. #16

fjmuzengyiheng opened this issue Aug 1, 2022 · 8 comments

Comments

@fjmuzengyiheng
Copy link

fjmuzengyiheng commented Aug 1, 2022

**Hi, @readmanchiu

I am a new fan of the first-tier tool Straglr. I am trying to use this tool to genotype the genome-wide STR/VNTR for my ONT data.
I followed your guidance in the "Example application: genome-wide genotyping" part.
However, I found an issue about reproducibility.

Here is my steps:**

(1) I download UCSC Simple Repeats track as an tsv file;
(2) I deleted the prefix "chr" to be compatible with my genome reference;
(3) I also deleted the alternative contigs and kept those STR/VNTR loci in the 22+XY chromosome;
(4) I converted downloaded table into bed format using:

grep -v '^#' simple_repeats.downloaded.tsv | awk -vOFS='\t' 'length($17)>1 {print $2,$3,$4,$17}' > simple_repeats.bed

(5) I split whole-genome bed file into batches using:

split -l 10000 -d -a 4 --additional-suffix=.bed simple_repeats.bed simple_repeats_nochr

Then, I run the Genotype mode (locus given):

work_dir=$(pwd)
map_ref=/genetics/home/zengyiheng/project/zyh-pipeline/reference/grch38/GRCh38_alignment.fa
straglr_sif=/genetics/home/zengyiheng/project/zyh-pipeline/straglr/straglr_blastn_trf.sif
str_bed2=/genetics/home/zengyiheng/project/zyh-pipeline/straglr/str_db_ucsc_hg38/nochr/simple_repeats_nochr_0000.bed
mkdir ${work_dir}/str
mkdir ${work_dir}/tmp

cd ${work_dir}/str
singularity exec ${straglr_sif} python /usr/local/bin/straglr.py
${work_dir}/bam/19620.hg38.bam
${map_ref}
straglr_scan_19620
--loci ${str_bed2}
--min_str_len 3
--max_str_len 100
--min_ins_size 100
--genotype_in_size
--min_support 2
--max_num_clusters 2
--nprocs ${cpu_thread}
--tmpdir ${work_dir}/tmp

=============

The [simple_repeats_nochr_0000.bed] contains 10000 loci;
I run the above command four times respectively setting --nprocs ${cpu_thread} to be 2 / 4 / 48 / 48,
and found the results are not the same.

[the --nprocs i used] [the number of loci genotyped successfully recorded in the straglr_scan_19620.bed file ]
--nprocs 2 [9719 loci]
--nprocs 4 [9808 loci]
--nprocs 48 (1st run) [9878 loci]
--nprocs 48 (2nd run) [9880 loci]

The output of four runs is uploaded here. ( https://airportal.cn/965275/FmgSnxkGpH )(password: straglr)

It really confused me that the numbers of output loci were not 10000 and different from different runs.
Could you please be so kind to help me with this issue? Thank you so much. : )

@fjmuzengyiheng fjmuzengyiheng changed the title I have been using your tool An issue about reproducibility. Aug 1, 2022
@readmanchiu
Copy link
Collaborator

Hi @fjmuzengyiheng, thanks for reporting this issue. Could you trying the latest code in the repository? I have tried it on my own data using different number of processes, the number of events remain the same. The resulting genotype may differ slightly as Sklearn's GMM clustering might give (mostly slightly) different results occasionally.
Also note that you don't have to run with --min_str_len, --max_str_len, and --min_ins_size when you run in genotyping mode i.e. when --loci was provided, those parameters are only applicable when running in genome_scan mode.

@fjmuzengyiheng
Copy link
Author

Thank you so much for your reply. It really helped.
And I also deleted the --min_str_len, --max_str_len, and --min_ins_size when I run in genotyping mode.

But, another issue bothered me as well.
I have about 900,000 loci from the simple_repeats_nochr_22XY.bed (hg38)
I split the simple_repeats_nochr_22XY.bed (hg38) into 9 files (100,000 loci per file):

simple_repeats_nochr_22XY_0000.bed
simple_repeats_nochr_22XY_0001.bed
....
simple_repeats_nochr_22XY_0008.bed

#!/bin/bash
#SBATCH -J zyh-mono0000
#SBATCH -p normal
#SBATCH -N 1
#SBATCH --output=log.%j.out
#SBATCH --error=log.%j.err
#SBATCH --cpus-per-task=48
#SBATCH --nodelist=node2
ulimit -HSn 1002400
ulimit -s unlimited
ulimit -l unlimited

singularity exec ${straglr_sif} python /usr/local/bin/straglr.py
${work_dir}/bam/19620.hg38.bam
${map_ref}
straglr_scan_19620
--loci simple_repeats_nochr_22XY_0000.bed
--genotype_in_size
--min_support 2
--max_num_clusters 2
*--nprocs 48 *
--tmpdir ${work_dir}/tmp

call TR_start: 2022-08-07 16:27:29
call TR_end: 2022-08-08 17:53:19

It took about 25h to genotype 100,000 loci.
If I want to genotype all the genome-wide STR/VNTR loci, it would take about 9 days to genotype one sample using 48 CPU.

Did I set the parameter wrong? Can I make it faster?

@readmanchiu
Copy link
Collaborator

How deep is your sample sequenced? I suspect it's pretty deep? For a 20-30x sample for 100,000 STRs, it took 1.5h using 32 processors. If it's really deep, like 80-90x, it's gonna take longer as every read at the locus needs to be looked at.

@fjmuzengyiheng
Copy link
Author

Hi, my ONT data is a 20-30x sample. If I selected loci with small size (eg. 3bp to 6bp), it runs faster within 2.5h using --nprocs 70. That shall work. Thank you so much for being patient to my issues.

@readmanchiu
Copy link
Collaborator

@fjmuzengyiheng, I believe you are using Straglr v1.2. If you want to give the current code in the repository a try, it should be much faster. I am going to make this a public release hopefully later this week, it will be v1.3. Thanks very much for trying Straglr.

@fjmuzengyiheng
Copy link
Author

Thank you so much. Looking forward to the new public release v1.3. : - )

@fjmuzengyiheng
Copy link
Author

fjmuzengyiheng commented Jan 7, 2023 via email

@readmanchiu
Copy link
Collaborator

Issue followed up through private communication

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
None yet
Projects
None yet
Development

No branches or pull requests

2 participants