Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

DIAMOND DB Generation for UniRef90 #1

Draft
wants to merge 64 commits into
base: main
Choose a base branch
from
Draft

Conversation

ochkalova
Copy link

@ochkalova ochkalova commented Dec 3, 2024

Hello,

I'm creating this draft pull request to gather feedback and clarify next steps to complete fitting my sub-workflow into the larger pipeline for generation of various databases.

Functionality:

Downloads UniRef90 and generates two databases:

  • A DIAMOND DB containing proteins from UniRef90 that have Rhea annotations + a mapping file of Rhea IDs to reaction definitions.
  • A DIAMOND DB containing proteins from UniRef90 that are non-viral, used for taxonomic assignments.

Required manual input:

  1. Rhea DB version (for mapping Rhea to reaction definitions)
  2. UniRef90 version
  3. UniprotKB access date (for mapping protein IDs to Rhea IDs)
  4. File with snapshot of UnirefKB that maps protein IDs to Rhea IDs

Output structure:

<output_dir_name>
├── UNIREF90_RHEA
│   ├── rhea_chebi_mapping_135.tsv
│   └── uniref90_rhea_2024_05_2024-07-31.dmnd
└── UNIREF90_TAXA
    └── uniref90_taxa_2024_05.dmnd
  • 135 is the Rhea DB version.
  • 2024_05 is the UniRef90 version.
  • 2024-07-31 is the UniprotKB access date.

Changes made:

  • Added three Python scripts to the bin/ directory.
  • Added four new modules (uniref90_non_viral_filter, uniref90_rhea_filter, reformat_rhea_chebi, diamond/makedb) including diamond/makedb from nf-core.
  • Added tests and test data.
  • Modified workflows/pipeline.nf to include the new sub-workflow, along with flags to control generation of amplicon DBs or UniRef90 DBs.
  • Modified nextflow.config.
  • Created a configs/modules.config file to set the publishDir directives for the modules.
  • Updated the retry logic in slurm.config to include error code 140 (out of memory error). Also added new label "process_single" because in most my modules I need 1 CPU and at least 6 GB of memory.
  • Minor adjustments to all Chris's modules due to issues raised by the Nextflow plugin 😅. I've only added script directive.
  • Created configs/test_local.config to allow test runs on my local machine.

My Questions:

  1. Are the names of modules, workflows, and output files clear and intuitive, or could they be improved?
  2. Are we okay with using Seqera containers? I created a container (community.wave.seqera.io/library/biopython_pip_taxoniq:61a7ad516ddf4b95) that includes Biopython and Taxoniq.
  3. If I'm using a module from nf-core, do I need to separate all modules into local and nf-core, or can they be mixed?
  4. The config files are starting to feel a bit disorganized. Any suggestions for better structuring?
  5. How should we handle DB generation control in the pipeline? Chris suggested commenting out unwanted sub-workflows in workflows/pipeline.nf, but Kate was against it. I’ve added flags (--generate_amplicon_db and --generate_uniref90_db) for now.
  6. Does my approach to version control make sense, or is there a more efficient way to manage it?

@ochkalova ochkalova self-assigned this Dec 3, 2024
@ochkalova
Copy link
Author

ochkalova commented Dec 3, 2024

Also, input setting could be simplified as Rhea DB version and UniRef90 version can be found automatically using one liners:
curl -s https://ftp.expasy.org/databases/rhea/rhea-release.properties | grep "rhea.release.number" | awk -F= '{print $2}'

curl -s https://ftp.uniprot.org/pub/databases/uniprot/uniref/uniref90/uniref90.release_note | grep -oP 'Release: \K[0-9_]+'

But I can't find a clean solution to use this, because I need a groovy str variable to use in diamond/makedb meta input. And if I use these one liners in a module the output is a channel of values.

Copy link

@KateSakharova KateSakharova left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Looks good @ochkalova. Some minor comments.
I also think we should create local and nf-core folders under modules. Move all your modules and tax_db generaiton there, put diamond into nf-core. I know those were done by @chrisAta and movement will require fixes in module.nf but we should be consistent.

bin/uniref90_non_viral_filter.py Outdated Show resolved Hide resolved
configs/modules.config Outdated Show resolved Hide resolved
configs/modules.config Outdated Show resolved Hide resolved
configs/modules.config Outdated Show resolved Hide resolved
withLabel: 'light' {
cpus = 1
memory = { 3.GB * task.attempt }
errorStrategy = { task.exitStatus == 137 ? 'retry' : 'finish' }
errorStrategy = { task.exitStatus == 137 || task.exitStatus == 140 ? 'retry' : 'finish' }

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

maybe you can add even more wide errorStrategy like https://github.com/EBI-Metagenomics/miassembler/blob/main/conf/base.config#L17

Copy link
Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Ok, Martin supported this suggestion so I've made changes to configs in use wider error strategy

modules/uniref90_rhea_filter/main.nf Outdated Show resolved Hide resolved
}

includeConfig 'configs/modules.config'

params {

outdir = "/hps/nobackup/rdf/metagenomics/service-team/users/chrisata/taxdb_generation_v6"
dummy_fasta = "/hps/software/users/rdf/metagenomics/service-team/users/chrisata/taxdb_generation_nf/assets/dummy.fasta"
empty_file = "/hps/software/users/rdf/metagenomics/service-team/users/chrisata/taxdb_generation_nf/assets/EMPTY.txt"

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Those 2 files and outdir should be moved to and output dir should be moved somewhere from Chris's folder. We should assign a general location for that (and I know those are not your fixes, just noticed because nobody reviewed tax_db generation)
@chrisAta makes sense to review it I think

Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

ah yeah this was written before I started using nf-core guidelines... I will fix all of this after this PR gets merged

@@ -57,6 +66,13 @@ params {
itsonedb_download_taxdump = "https://ftp.ebi.ac.uk/pub/databases/ena/taxonomy/sdwca.zip"
itsonedb_tax_header = "/hps/software/users/rdf/metagenomics/service-team/users/chrisata/taxdb_generation_nf/assets/itsonedb_tax_header.txt"

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

and this one also refers to @chrisAta

subworkflows/uniref90_generation/main.nf Outdated Show resolved Hide resolved
modules/reformat_rhea_chebi/main.nf Outdated Show resolved Hide resolved
@KateSakharova
Copy link

@ochkalova and one more comment. Please, add a descriptive note/comment to modules that require download of data from API/FTP. It should be noted sowhere why we must/can't refer to latest release and set arguments of dbs in params. We can easily forget the structure of those dbs.

@ochkalova
Copy link
Author

Looks good @ochkalova. Some minor comments. I also think we should create local and nf-core folders under modules. Move all your modules and tax_db generaiton there, put diamond into nf-core. I know those were done by @chrisAta and movement will require fixes in module.nf but we should be consistent.

@KateSakharova I've separated modules to local and nf-core and corrected paths in commit

Copy link
Member

@chrisAta chrisAta left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Very nice thank you, really just one small change after everything Kate has reviewed! Thanks for changing the format of some of the existing stuff (configs/modules)... there's a few more things that I want to change here to make it more like the rest of our nextflow pipelines, but I'll do that in a future PR when I have time

@@ -0,0 +1,32 @@
#!/usr/bin/env python
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Can you add this header at the beginning of any new scripts:

# -*- coding: utf-8 -*-

# Copyright 2024 EMBL - European Bioinformatics Institute
#
# Licensed under the Apache License, Version 2.0 (the "License");
# you may not use this file except in compliance with the License.
# You may obtain a copy of the License at
# http://www.apache.org/licenses/LICENSE-2.0
#
# Unless required by applicable law or agreed to in writing, software
# distributed under the License is distributed on an "AS IS" BASIS,
# WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied.
# See the License for the specific language governing permissions and
# limitations under the License.

Copy link
Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Thank you Chris, I've added this

}
withLabel: 'medium' {
cpus = 1
memory = { 2.GB }
errorStrategy = { task.exitStatus == 137 ? 'retry' : 'finish' }
errorStrategy = { task.exitStatus in ((130..155) + 104) ? 'retry' : 'finish' }
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Just wanted to check why these were changed, does it include other errors that were missing before?

Copy link
Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

This is a very wide list of errors Kate suggested to include
I needed only to add 140, but Martin supported the idea of using wider error strategy to account for transient errors. The main argument was

the effort required to catch those internally and try fix/prevent them is massive comparatively to just retrying jobs every now and then

and I agreed with this

Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Nice, I'll add all of the other modules to this in a future PR

@@ -2,7 +2,7 @@
process CLEAN_FASTA {

label 'light'
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I'll make a note to change this from the hps path in another PR

Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

nice thanks for changing those already 🙏

Comment on lines +12 to +14
if [[ "${params.rhea_chebi_download_mapping}" =~ ^https?:// ]]; then
# If it's a URL, download the file
wget ${params.rhea_chebi_download_mapping} -O rhea-reactions.txt.gz
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

just want to check, is this a really big file? because input paths can be URLs in nextflow... but obviously if it's too large to download on a per-job basis then something like this does make sense

Copy link
Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

It's a tiny file, unfortunately nextflow fails to download it when I'm providing this url as input path to file() method...

WARN: Unable to stage foreign file: https://ftp.expasy.org/databases/rhea/txt/rhea-reactions.txt.gz (try 1) -- Cause: Unable to access path: https://ftp.expasy.org/databases/rhea/txt/rhea-reactions.txt.gz

I asked Martin and Vangelis about this and they suggested just using wget instead

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
None yet
Projects
None yet
Development

Successfully merging this pull request may close these issues.

3 participants