DIAMOND DB Generation for UniRef90 #1
Also, the input settings could be simplified, since the Rhea DB version and the UniRef90 version can be found automatically using one-liners:
But I can't find a clean way to use this, because I need a Groovy string variable to use in |
Looks good @ochkalova. Some minor comments.
I also think we should create local and nf-core folders under modules. Move all your modules and the tax_db generation there, and put diamond into nf-core. I know those were done by @chrisAta and moving them will require fixes in module.nf, but we should be consistent.
configs/slurm.config
Outdated
 withLabel: 'light' {
     cpus = 1
     memory = { 3.GB * task.attempt }
-    errorStrategy = { task.exitStatus == 137 ? 'retry' : 'finish' }
+    errorStrategy = { task.exitStatus == 137 || task.exitStatus == 140 ? 'retry' : 'finish' }
Maybe you could use an even wider errorStrategy, like https://github.com/EBI-Metagenomics/miassembler/blob/main/conf/base.config#L17
Ok, Martin supported this suggestion, so I've changed the configs to use a wider error strategy
}

includeConfig 'configs/modules.config'

params {

    outdir = "/hps/nobackup/rdf/metagenomics/service-team/users/chrisata/taxdb_generation_v6"
    dummy_fasta = "/hps/software/users/rdf/metagenomics/service-team/users/chrisata/taxdb_generation_nf/assets/dummy.fasta"
    empty_file = "/hps/software/users/rdf/metagenomics/service-team/users/chrisata/taxdb_generation_nf/assets/EMPTY.txt"
Those 2 files and outdir should be moved to and output dir should be moved somewhere out of Chris's folder. We should assign a general location for that (I know those are not your changes, I just noticed because nobody reviewed the tax_db generation)
@chrisAta makes sense to review it I think
ah yeah this was written before I started using nf-core guidelines... I will fix all of this after this PR gets merged
@@ -57,6 +66,13 @@ params {
    itsonedb_download_taxdump = "https://ftp.ebi.ac.uk/pub/databases/ena/taxonomy/sdwca.zip"
    itsonedb_tax_header = "/hps/software/users/rdf/metagenomics/service-team/users/chrisata/taxdb_generation_nf/assets/itsonedb_tax_header.txt"
and this one also refers to @chrisAta
@ochkalova and one more comment. Please add a descriptive note/comment to modules that require downloading data from an API/FTP. It should be noted somewhere why we must/can't refer to |
@KateSakharova I've separated modules to local and nf-core and corrected paths in commit |
Very nice thank you, really just one small change after everything Kate has reviewed! Thanks for changing the format of some of the existing stuff (configs/modules)... there's a few more things that I want to change here to make it more like the rest of our nextflow pipelines, but I'll do that in a future PR when I have time
@@ -0,0 +1,32 @@
#!/usr/bin/env python
Can you add this header at the beginning of any new scripts:
# -*- coding: utf-8 -*-
# Copyright 2024 EMBL - European Bioinformatics Institute
#
# Licensed under the Apache License, Version 2.0 (the "License");
# you may not use this file except in compliance with the License.
# You may obtain a copy of the License at
# http://www.apache.org/licenses/LICENSE-2.0
#
# Unless required by applicable law or agreed to in writing, software
# distributed under the License is distributed on an "AS IS" BASIS,
# WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied.
# See the License for the specific language governing permissions and
# limitations under the License.
Thank you Chris, I've added this
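A quick way to verify that new scripts carry the header requested above is a small check like the following. This is only a sketch: the helper names `has_license_header` and `scripts_missing_header` are hypothetical, and scanning `bin/` for `*.py` files is an assumption about where new scripts live.

```python
from pathlib import Path

# Marker lines copied from the header quoted in the review comment.
REQUIRED_MARKERS = (
    "Copyright 2024 EMBL - European Bioinformatics Institute",
    'Licensed under the Apache License, Version 2.0 (the "License");',
)


def has_license_header(text: str, max_lines: int = 20) -> bool:
    """Return True if every marker appears within the first max_lines lines."""
    head = "\n".join(text.splitlines()[:max_lines])
    return all(marker in head for marker in REQUIRED_MARKERS)


def scripts_missing_header(directory: str) -> list[str]:
    """List *.py files directly under directory that lack the header."""
    return sorted(
        str(path)
        for path in Path(directory).glob("*.py")
        if not has_license_header(path.read_text(encoding="utf-8"))
    )


if __name__ == "__main__":
    # Assumed layout: scripts live in bin/ relative to the repository root.
    for script in scripts_missing_header("bin"):
        print(f"missing license header: {script}")
```

Only the first 20 lines are inspected, so a header pasted further down the file would be reported as missing.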
 }
 withLabel: 'medium' {
     cpus = 1
     memory = { 2.GB }
-    errorStrategy = { task.exitStatus == 137 ? 'retry' : 'finish' }
+    errorStrategy = { task.exitStatus in ((130..155) + 104) ? 'retry' : 'finish' }
Just wanted to check why these were changed, does it include other errors that were missing before?
This is a very wide list of errors that Kate suggested including.
I only needed to add 140, but Martin supported the idea of using a wider error strategy to account for transient errors. The main argument was that the effort required to catch those internally and try to fix/prevent them is massive compared to just retrying jobs every now and then, and I agreed with this.
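To make the effect of the wider strategy concrete, here is a small Python model of the Groovy expression `task.exitStatus in ((130..155) + 104)` from the config. It is not part of the pipeline, and the name `error_strategy` is made up for illustration. The one thing worth noting is that Groovy's `130..155` range is inclusive at both ends.

```python
# Illustrative model of the closure in configs/slurm.config:
#   errorStrategy = { task.exitStatus in ((130..155) + 104) ? 'retry' : 'finish' }
# Groovy's 130..155 includes 155, hence range(130, 156) here.
RETRY_EXIT_CODES = set(range(130, 156)) | {104}


def error_strategy(exit_status: int) -> str:
    """Return 'retry' for exit codes treated as transient, otherwise 'finish'."""
    return "retry" if exit_status in RETRY_EXIT_CODES else "finish"


if __name__ == "__main__":
    # 137 and 140 are the two codes the earlier, narrower rule covered.
    for code in (1, 104, 129, 137, 140, 155, 156):
        print(code, error_strategy(code))
```

Under this rule both 137 and 140 are still retried, as before, while ordinary failures such as exit code 1 still end the run with 'finish'.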
Nice, I'll add all of the other modules to this in a future PR
@@ -2,7 +2,7 @@
process CLEAN_FASTA {

    label 'light'
I'll make a note to change this from the hps path in another PR
nice thanks for changing those already 🙏
if [[ "${params.rhea_chebi_download_mapping}" =~ ^https?:// ]]; then
    # If it's a URL, download the file
    wget ${params.rhea_chebi_download_mapping} -O rhea-reactions.txt.gz
just want to check, is this a really big file? because input paths can be URLs in nextflow... but obviously if it's too large to download on a per-job basis then something like this does make sense
It's a tiny file, unfortunately nextflow fails to download it when I'm providing this url as input path to file() method...
WARN: Unable to stage foreign file: https://ftp.expasy.org/databases/rhea/txt/rhea-reactions.txt.gz (try 1) -- Cause: Unable to access path: https://ftp.expasy.org/databases/rhea/txt/rhea-reactions.txt.gz
I asked Martin and Vangelis about this and they suggested just using wget instead
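The URL-or-local-path decision made in the module's bash snippet can be sketched in Python as follows. This is a hedged illustration rather than the module's actual code: `is_remote` and `fetch_mapping` are hypothetical names, and the local-copy branch is an assumption, since the excerpt above only shows the URL branch.

```python
import re
import shutil
import urllib.request
from pathlib import Path

# Same pattern as the bash test in the module: [[ "$src" =~ ^https?:// ]]
URL_PATTERN = re.compile(r"^https?://")


def is_remote(source: str) -> bool:
    """Return True when the source starts with http:// or https://."""
    return URL_PATTERN.match(source) is not None


def fetch_mapping(source: str, dest: str = "rhea-reactions.txt.gz") -> Path:
    """Download the file if source is a URL, otherwise copy the local file.

    The copy branch is assumed; the reviewed snippet only shows the download.
    """
    target = Path(dest)
    if is_remote(source):
        # Network access happens only in this branch.
        urllib.request.urlretrieve(source, str(target))
    else:
        shutil.copyfile(source, str(target))
    return target
```

Doing the download inside the task sidesteps the staging failure reported in the warning above, at the cost of Nextflow no longer tracking the remote file as an input.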
Hello,
I'm creating this draft pull request to gather feedback and clarify the next steps for fitting my sub-workflow into the larger pipeline that generates various databases.
Functionality:
Downloads UniRef90 and generates two databases:
Required manual input:
Output structure:
135 is the Rhea DB version.
2024_05 is the UniRef90 version.
2024-07-31 is the UniprotKB access date.

Changes made:
bin/ directory.
diamond/makedb from nf-core.
publishDir directives for the modules.
"process_single" because in most of my modules I need 1 CPU and at least 6 GB of memory.
script directive.
configs/test_local.config to allow test runs on my local machine.

My Questions:
(community.wave.seqera.io/library/biopython_pip_taxoniq:61a7ad516ddf4b95) that includes Biopython and Taxoniq.
(--generate_amplicon_db and --generate_uniref90_db) for now.