On Rosalind, I'm encountering an error when running the snp_call_nf pipeline, specifically at the GATK BaseRecalibrator step. The pipeline fails to find the expected known_variants.vcf file, and when I use an alternative file (known_snps_unsorted.vcf) instead, that file fails validation.
Error Message:
A USER ERROR has occurred: Couldn't read file file:///local/data/Malaria/Projects/Takala-Harrison/Temporal_Malawi_Matt/snp_call_pipe/snp_call_nf/ref/pf_crosses_v1/known_variants.vcf. Error was: It doesn't exist.
Troubleshooting Steps and Results:
Checked reference directory:
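The expected directory (/local/data/Malaria/Projects/Takala-Harrison/Temporal_Malawi_Matt/snp_call_pipe/snp_call_nf/ref/pf_crosses_v1/) exists but does not contain a file named known_variants.vcf. I'm not sure why the pipeline points to this directory, since I set the link to the ref folder like this:
/local/data/Malaria/Projects/Takala-Harrison/Cambodia_Bing/ref/known_snps_unsorted.vcf (and vcf.idx)
Checked symbolic links:
Attempted to recreate the symbolic links to the reference files, but they already existed.
Attempted to validate alternative VCF file:
Ran GATK ValidateVariants on known_snps_unsorted.vcf, which resulted in the following error:
A USER ERROR has occurred: Input ref/known_snps_unsorted.vcf fails strict validation of type ALL: one or more of the ALT allele(s) for the record at position Pf3D7_04_v3:37119 are not observed at all in the sample genotypes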
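To see how widespread the ValidateVariants complaint is, a small script could list every record whose ALT alleles never appear in any sample genotype. The following is only a rough sketch, not GATK's own check: the function name and the example records are made up for illustration, it reads plain-text (uncompressed) VCF lines, and it assumes that records without genotype columns should be skipped.

```python
def unobserved_alt_alleles(vcf_lines):
    """Yield (chrom, pos, unused_alts) for records whose genotypes never use some ALT allele."""
    for line in vcf_lines:
        line = line.rstrip("\n")
        if not line or line.startswith("#"):
            continue  # skip blank lines and header lines
        fields = line.split("\t")
        if len(fields) < 10:
            continue  # assumption: sites-only records (no samples) are not checked
        chrom, pos, alt = fields[0], int(fields[1]), fields[4]
        if alt == ".":
            continue  # no ALT allele declared
        fmt_keys = fields[8].split(":")
        if "GT" not in fmt_keys:
            continue  # no genotype field to inspect
        gt_index = fmt_keys.index("GT")
        seen = set()
        for sample in fields[9:]:
            parts = sample.split(":")
            if gt_index >= len(parts):
                continue
            # Genotypes look like 0/1 or 0|1; "." marks a missing call.
            for allele in parts[gt_index].replace("|", "/").split("/"):
                if allele.isdigit():
                    seen.add(int(allele))
        # ALT alleles are numbered from 1 (0 is REF).
        unused = [a for i, a in enumerate(alt.split(","), start=1) if i not in seen]
        if unused:
            yield chrom, pos, unused


# Made-up records for illustration only (not taken from known_snps_unsorted.vcf).
example = [
    "##fileformat=VCFv4.2",
    "#CHROM\tPOS\tID\tREF\tALT\tQUAL\tFILTER\tINFO\tFORMAT\tS1\tS2",
    "chrA\t100\t.\tA\tT,G\t.\t.\t.\tGT\t0/1\t0/0",
    "chrA\t200\t.\tC\tT\t.\t.\t.\tGT\t0/1\t1/1",
]
flagged = list(unobserved_alt_alleles(example))
```

For a real file, the lines could come from `open("ref/known_snps_unsorted.vcf")`. If only a few records are flagged, the problem may lie in how the file was subset rather than in the variant calls themselves.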
Questions:
From this comment in the nextflow.config file, it looks like 'known_variants.vcf' is supposed to be used in place of 'known_snps_unsorted.vcf'.
// Original known_snps_unsorted.vcf has 944,270 SNPs (from previous lab members).
// The updated file is pf_crosses_v1/known_variants.vcf which has 66,121 variants (snp/indels)
// The later is generated following the MalariaGen Pf6 paper:
// MalariaGEN, et al. (2021). An open dataset of Plasmodium falciparum
// genome variation in 7,000 worldwide samples. Wellcome Open Res 6, 42.
// 10.12688/wellcomeopenres.16168.2.
known_sites = ["$projectDir/ref/pf_crosses_v1/known_variants.vcf"]
Is it correct to use the 'known_variants.vcf' file, even though it has only 66,121 variants?
If it is desirable to use the 'known_variants.vcf' file, is this the correct file:
If 'known_snps_unsorted.vcf' is the correct file, is the error that GATK ValidateVariants produced a problem?
The pipeline seems to be looking for the file in a different location than where the reference files are symlinked. How can I correct this path issue in the Nextflow configuration? Maybe adding a param for reference directory that is different from projectDir like so:
Edit params as follows:
params {
    // Add this line near the top of the params section
    ref_dir = "$projectDir/ref"
    // Then update all the paths to use this new parameter
    parasite {
        fasta = "${params.ref_dir}/PlasmoDB-44_Pfalciparum3D7_Genome.fasta"
        fasta_prefix = "${params.ref_dir}/PlasmoDB-44_Pfalciparum3D7_Genome"
    }
    host {
        fasta = ["${params.ref_dir}/host/hg38.fasta"]
        fasta_prefix = ["${params.ref_dir}/host/hg38"]
    }
    known_sites = ["${params.ref_dir}/pf_crosses_v1/known_variants.vcf"]
Also, update VQSR resources as well:
    // vqsr is off by default, because the test data is too small and will cause error/crash
    vqsr = false
    vqsr_resources = [
        [name: 'jacob2014_microarray_liftover', type: 'truth', prior: 15, vcf: "${params.ref_dir}/jacob2014_chip_sites.vcf" ],
        [name: 'Pfcross1_3d7_hb3_gatk_pass', type: 'training', prior: 12, vcf: "${params.ref_dir}/pf_crosses_v1/pass_3d7_hb3.gatk.final.vcf.gz" ],
        [name: 'Pfcross1_7g8_gb4_gatk_pass', type: 'training', prior: 12, vcf: "${params.ref_dir}/pf_crosses_v1/pass_7g8_gb4.gatk.final.vcf.gz" ],
        [name: 'Pfcross1_hb3_dd2_gatk_pass', type: 'training', prior: 12, vcf: "${params.ref_dir}/pf_crosses_v1/pass_hb3_dd2.gatk.final.vcf.gz" ]
    ]
}
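Before launching with a custom reference directory, it might help to confirm that the files named in the config actually exist there, since a missing file or broken symlink otherwise only surfaces when GATK runs. This is a hedged sketch, not part of the pipeline: the helper name is hypothetical, and the list of relative paths would need to be filled in from the config entries.

```python
from pathlib import Path


def missing_reference_files(ref_dir, relative_paths):
    """Return (relative_path, resolved_path) pairs for files not found under ref_dir.

    Path.exists() follows symlinks, so a link whose target is gone counts as missing.
    """
    ref_dir = Path(ref_dir)
    missing = []
    for rel in relative_paths:
        candidate = ref_dir / rel
        if not candidate.exists():
            # resolve() shows where the path (or a dangling link) really points
            missing.append((rel, str(candidate.resolve())))
    return missing
```

For example, calling `missing_reference_files("/path/to/your/reference/directory", ["pf_crosses_v1/known_variants.vcf", "jacob2014_chip_sites.vcf"])` would return an empty list when both files are reachable, and otherwise the entries that are not, together with the location each path resolves to.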
Command to run pipeline for internal (Rosalind) users:
nextflow run main.nf --ref_dir /local/data/Malaria/Projects/Takala-Harrison/Cambodia_Bing/ref
If this is acceptable, here is the README update:
Reference Directory:
The pipeline uses a reference directory for various genome files and known variants.
By default, this is set to $projectDir/ref, but can be changed using the --ref_dir parameter:
nextflow run main.nf --ref_dir /path/to/your/reference/directory
Internal users should use:
nextflow run main.nf --ref_dir /local/data/Malaria/Projects/Takala-Harrison/Cambodia_Bing/ref
External users who have set up the reference directory in the default location can run the pipeline without this parameter.
Environment:
Nextflow version: version 22.10.6 build 5843
GATK version:
Using GATK jar /home/matt.adams/miniconda3/envs/snp_call_nf/share/gatk4-4.2.2.0-1/gatk-package-4.2.2.0-local.jar
Running:
java -Dsamjdk.use_async_io_read_samtools=false -Dsamjdk.use_async_io_write_samtools=true -Dsamjdk.use_async_io_write_tribble=false -Dsamjdk.compression_level=2 -jar /home/matt.adams/miniconda3/envs/snp_call_nf/share/gatk4-4.2.2.0-1/gatk-package-4.2.2.0-local.jar --version
The Genome Analysis Toolkit (GATK) v4.2.2.0
HTSJDK Version: 2.24.1
Picard Version: 2.25.4
Any guidance on resolving these issues would be greatly appreciated. Thank you!