DIAMOND DB Generation for UniRef90 #1

ochkalova · 2024-12-03T11:42:32Z

Hello,

I'm opening this draft pull request to gather feedback and clarify the next steps for fitting my sub-workflow into the larger database-generation pipeline.

Functionality:

Downloads UniRef90 and generates two databases:

A DIAMOND DB containing proteins from UniRef90 that have Rhea annotations + a mapping file of Rhea IDs to reaction definitions.
A DIAMOND DB containing proteins from UniRef90 that are non-viral, used for taxonomic assignments.
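
For readers following the thread, the Rhea filtering step described above can be sketched roughly as follows. This is only an illustration, not the script in bin/ (which is not shown in this thread): it assumes the set of annotated proteins is keyed by UniProtKB accession and that each UniRef90 cluster ID is the accession prefixed with `UniRef90_`. The function names `read_fasta` and `keep_annotated` are hypothetical.

```python
from typing import Iterable, Iterator, Set, Tuple

UNIREF90_PREFIX = "UniRef90_"


def read_fasta(lines: Iterable[str]) -> Iterator[Tuple[str, str]]:
    """Yield (header, sequence) pairs from FASTA-formatted lines."""
    header = None
    chunks = []
    for line in lines:
        line = line.rstrip("\n")
        if line.startswith(">"):
            if header is not None:
                yield header, "".join(chunks)
            header, chunks = line[1:], []
        elif line:
            chunks.append(line)
    if header is not None:
        yield header, "".join(chunks)


def keep_annotated(
    records: Iterable[Tuple[str, str]], annotated_ids: Set[str]
) -> Iterator[Tuple[str, str]]:
    """Keep records whose accession appears in annotated_ids.

    Assumption: the first whitespace-separated token of the header is the
    cluster ID, e.g. 'UniRef90_P12345'.
    """
    for header, sequence in records:
        fields = header.split()
        if not fields:
            continue  # skip records with an empty header
        cluster_id = fields[0]
        if cluster_id.startswith(UNIREF90_PREFIX):
            accession = cluster_id[len(UNIREF90_PREFIX):]
        else:
            accession = cluster_id
        if accession in annotated_ids:
            yield header, sequence
```

The filtered records would then be written back out as FASTA and passed to diamond/makedb.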

Required manual input:

Rhea DB version (for mapping Rhea to reaction definitions)
UniRef90 version
UniProtKB access date (for mapping protein IDs to Rhea IDs)
File with a snapshot of UniProtKB that maps protein IDs to Rhea IDs

Output structure:

<output_dir_name>
├── UNIREF90_RHEA
│   ├── rhea_chebi_mapping_135.tsv
│   └── uniref90_rhea_2024_05_2024-07-31.dmnd
└── UNIREF90_TAXA
    └── uniref90_taxa_2024_05.dmnd

135 is the Rhea DB version.
2024_05 is the UniRef90 version.
2024-07-31 is the UniProtKB access date.
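
The naming scheme above can be expressed as a small helper. This is a minimal sketch under the assumption that the three version strings are available as plain strings; the function name `output_layout` is hypothetical and does not come from the PR.

```python
from typing import Dict, List


def output_layout(
    rhea_version: str, uniref90_version: str, uniprotkb_date: str
) -> Dict[str, List[str]]:
    """Return the expected output files, grouped by output sub-directory."""
    return {
        "UNIREF90_RHEA": [
            f"rhea_chebi_mapping_{rhea_version}.tsv",
            f"uniref90_rhea_{uniref90_version}_{uniprotkb_date}.dmnd",
        ],
        "UNIREF90_TAXA": [
            f"uniref90_taxa_{uniref90_version}.dmnd",
        ],
    }
```

Calling `output_layout("135", "2024_05", "2024-07-31")` reproduces the file names in the tree shown above.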

Changes made:

Added three Python scripts to the bin/ directory.
Added four new modules (uniref90_non_viral_filter, uniref90_rhea_filter, reformat_rhea_chebi, diamond/makedb) including diamond/makedb from nf-core.
Added tests and test data.
Modified workflows/pipeline.nf to include the new sub-workflow, along with flags to control generation of amplicon DBs or UniRef90 DBs.
Modified nextflow.config.
Created a configs/modules.config file to set the publishDir directives for the modules.
Updated the retry logic in slurm.config to include error code 140 (out of memory error). Also added a new label "process_single", because most of my modules need 1 CPU and at least 6 GB of memory.
Minor adjustments to all of Chris's modules due to issues raised by the Nextflow plugin 😅. I've only added the script directive.
Created configs/test_local.config to allow test runs on my local machine.

My Questions:

Are the names of modules, workflows, and output files clear and intuitive, or could they be improved?
Are we okay with using Seqera containers? I created a container (community.wave.seqera.io/library/biopython_pip_taxoniq:61a7ad516ddf4b95) that includes Biopython and Taxoniq.
If I'm using a module from nf-core, do I need to separate all modules into local and nf-core, or can they be mixed?
The config files are starting to feel a bit disorganized. Any suggestions for better structuring?
How should we handle DB generation control in the pipeline? Chris suggested commenting out unwanted sub-workflows in workflows/pipeline.nf, but Kate was against it. I’ve added flags (--generate_amplicon_db and --generate_uniref90_db) for now.
Does my approach to version control make sense, or is there a more efficient way to manage it?

ochkalova · 2024-12-03T13:56:29Z

Also, the input settings could be simplified, since the Rhea DB version and the UniRef90 version can be found automatically using one-liners:
curl -s https://ftp.expasy.org/databases/rhea/rhea-release.properties | grep "rhea.release.number" | awk -F= '{print $2}'

curl -s https://ftp.uniprot.org/pub/databases/uniprot/uniref/uniref90/uniref90.release_note | grep -oP 'Release: \K[0-9_]+'

But I can't find a clean way to use this, because I need a Groovy string variable for the diamond/makedb meta input, and if I run these one-liners in a module, the output is a channel of values.
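
The two one-liners above can be mirrored in Python as follows. This is a sketch, not something from the PR: the parsing functions assume the files contain a `rhea.release.number=<n>` line and a `Release: <version>` string, which is what the grep patterns above look for, and the function names are hypothetical. Resolving the versions outside the pipeline (for example in a launcher script) and passing them in as parameters would be one way to get plain strings rather than channel values, though whether that fits this pipeline is a judgment call.

```python
import re
import urllib.request

RHEA_PROPERTIES_URL = "https://ftp.expasy.org/databases/rhea/rhea-release.properties"
UNIREF90_NOTE_URL = (
    "https://ftp.uniprot.org/pub/databases/uniprot/uniref/uniref90/"
    "uniref90.release_note"
)


def fetch_text(url: str, timeout: float = 30.0) -> str:
    """Download a small text file and return its content."""
    with urllib.request.urlopen(url, timeout=timeout) as response:
        return response.read().decode("utf-8")


def parse_rhea_release(properties_text: str) -> str:
    """Return the value of the 'rhea.release.number' property."""
    for line in properties_text.splitlines():
        key, sep, value = line.partition("=")
        if sep and key.strip() == "rhea.release.number":
            return value.strip()
    raise ValueError("rhea.release.number not found")


def parse_uniref_release(note_text: str) -> str:
    """Return the version that follows 'Release: ' (digits and underscores)."""
    match = re.search(r"Release: ([0-9_]+)", note_text)
    if match is None:
        raise ValueError("release version not found")
    return match.group(1)


if __name__ == "__main__":
    # Requires network access to the two FTP mirrors.
    print(parse_rhea_release(fetch_text(RHEA_PROPERTIES_URL)))
    print(parse_uniref_release(fetch_text(UNIREF90_NOTE_URL)))
```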

KateSakharova

Looks good @ochkalova. Some minor comments.
I also think we should create local and nf-core folders under modules. Move all your modules and tax_db generation there, and put diamond into nf-core. I know those were done by @chrisAta and moving them will require fixes in module.nf, but we should be consistent.

bin/uniref90_non_viral_filter.py

configs/modules.config

KateSakharova · 2024-12-03T13:53:28Z

configs/slurm.config

    withLabel: 'light' {
        cpus = 1
        memory = { 3.GB * task.attempt }
-        errorStrategy = { task.exitStatus == 137 ? 'retry' : 'finish' }
+        errorStrategy = { task.exitStatus == 137 || task.exitStatus == 140 ? 'retry' : 'finish' }


Maybe you could use an even wider errorStrategy, like https://github.com/EBI-Metagenomics/miassembler/blob/main/conf/base.config#L17

OK, Martin supported this suggestion, so I've changed the configs to use a wider error strategy.

modules/uniref90_rhea_filter/main.nf

KateSakharova · 2024-12-03T14:04:17Z

nextflow.config

 }

+includeConfig 'configs/modules.config'
+
 params {

    outdir = "/hps/nobackup/rdf/metagenomics/service-team/users/chrisata/taxdb_generation_v6"
    dummy_fasta = "/hps/software/users/rdf/metagenomics/service-team/users/chrisata/taxdb_generation_nf/assets/dummy.fasta"
    empty_file = "/hps/software/users/rdf/metagenomics/service-team/users/chrisata/taxdb_generation_nf/assets/EMPTY.txt"


Those 2 files and the output dir should be moved somewhere out of Chris's folder. We should assign a general location for that (and I know those are not your changes, I just noticed because nobody reviewed tax_db generation).
@chrisAta I think it makes sense to review it

ah yeah this was written before I started using nf-core guidelines... I will fix all of this after this PR gets merged

KateSakharova · 2024-12-03T14:04:52Z

nextflow.config

@@ -57,6 +66,13 @@ params {
    itsonedb_download_taxdump = "https://ftp.ebi.ac.uk/pub/databases/ena/taxonomy/sdwca.zip"
    itsonedb_tax_header = "/hps/software/users/rdf/metagenomics/service-team/users/chrisata/taxdb_generation_nf/assets/itsonedb_tax_header.txt"


and this one also refers to @chrisAta

subworkflows/uniref90_generation/main.nf

modules/reformat_rhea_chebi/main.nf

KateSakharova · 2024-12-03T14:18:21Z

@ochkalova and one more comment. Please add a descriptive note/comment to modules that require downloading data from an API/FTP. It should be noted somewhere why we must/can't refer to the latest release and why we set the DB arguments in params. We can easily forget the structure of those DBs.

ochkalova · 2024-12-04T13:02:51Z

Looks good @ochkalova. Some minor comments. I also think we should create local and nf-core folders under modules. Move all your modules and tax_db generation there, and put diamond into nf-core. I know those were done by @chrisAta and moving them will require fixes in module.nf, but we should be consistent.

@KateSakharova I've separated modules to local and nf-core and corrected paths in commit

chrisAta

Very nice thank you, really just one small change after everything Kate has reviewed! Thanks for changing the format of some of the existing stuff (configs/modules)... there's a few more things that I want to change here to make it more like the rest of our nextflow pipelines, but I'll do that in a future PR when I have time

chrisAta · 2024-12-10T11:58:09Z

bin/reformat_rhea_chebi_mapping.py

@@ -0,0 +1,32 @@
+#!/usr/bin/env python


Can you add this header at the beginning of any new scripts:

    # -*- coding: utf-8 -*-
    # Copyright 2024 EMBL - European Bioinformatics Institute
    #
    # Licensed under the Apache License, Version 2.0 (the "License");
    # you may not use this file except in compliance with the License.
    # You may obtain a copy of the License at
    # http://www.apache.org/licenses/LICENSE-2.0
    #
    # Unless required by applicable law or agreed to in writing, software
    # distributed under the License is distributed on an "AS IS" BASIS,
    # WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied.
    # See the License for the specific language governing permissions and
    # limitations under the License.

Thank you Chris, I've added this

chrisAta · 2024-12-10T12:56:34Z

configs/local.config

    }
    withLabel: 'medium' {
        cpus = 1
        memory = { 2.GB }
-        errorStrategy = { task.exitStatus == 137 ? 'retry' : 'finish' }
+        errorStrategy = { task.exitStatus in ((130..155) + 104) ? 'retry' : 'finish' }


Just wanted to check why these were changed, does it include other errors that were missing before?

This is a very wide list of errors that Kate suggested including.
I only needed to add 140, but Martin supported the idea of using a wider error strategy to account for transient errors. The main argument was:

the effort required to catch those internally and try fix/prevent them is massive comparatively to just retrying jobs every now and then

and I agreed with this
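
To make the new condition concrete: `(130..155) + 104` in the Groovy closure above is the inclusive range 130 to 155 with 104 appended. A minimal Python mirror of that predicate, for illustration only (the names `RETRYABLE_EXIT_CODES` and `error_strategy` are hypothetical and not part of the configs):

```python
# Exit codes treated as retryable: 130..155 inclusive, plus 104.
RETRYABLE_EXIT_CODES = frozenset(range(130, 156)) | {104}


def error_strategy(exit_status: int) -> str:
    """Return 'retry' for exit codes in the wide list, otherwise 'finish'."""
    return "retry" if exit_status in RETRYABLE_EXIT_CODES else "finish"
```

Both 137 (the code retried before this PR) and 140 (the code originally needed) fall inside the range, so the wider list is a superset of the earlier behaviour.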

chrisAta · 2024-12-10T12:57:04Z

configs/modules.config

Nice, I'll add all of the other modules to this in a future PR

chrisAta · 2024-12-10T12:58:09Z

modules/local/clean_fasta/main.nf

@@ -2,7 +2,7 @@
 process CLEAN_FASTA {

    label 'light'


I'll make a note to change this from the hps path in another PR

chrisAta · 2024-12-10T12:58:48Z

modules/local/clean_fasta/tests/main.nf.test

nice thanks for changing those already 🙏

chrisAta · 2024-12-10T14:17:39Z

modules/local/reformat_rhea_chebi/main.nf

+    if [[ "${params.rhea_chebi_download_mapping}" =~ ^https?:// ]]; then
+        # If it's a URL, download the file
+        wget ${params.rhea_chebi_download_mapping} -O rhea-reactions.txt.gz


just want to check, is this a really big file? because input paths can be URLs in nextflow... but obviously if it's too large to download on a per-job basis then something like this does make sense

It's a tiny file. Unfortunately, Nextflow fails to download it when I provide this URL as the input path to the file() method...

WARN: Unable to stage foreign file: https://ftp.expasy.org/databases/rhea/txt/rhea-reactions.txt.gz (try 1) -- Cause: Unable to access path: https://ftp.expasy.org/databases/rhea/txt/rhea-reactions.txt.gz

I asked Martin and Vangelis about this and they suggested just using wget instead
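
For reference, the URL check in the module's bash snippet (`=~ ^https?://`) corresponds to the following logic. This is a sketch in Python rather than the module's own code; what the module does with a non-URL value is not shown in the diff, so returning the local path unchanged is an assumption here, and the names `is_remote` and `stage` are hypothetical.

```python
import re
import urllib.request

REMOTE_PATTERN = re.compile(r"^https?://")


def is_remote(location: str) -> bool:
    """True when the location starts with http:// or https://."""
    return REMOTE_PATTERN.match(location) is not None


def stage(location: str, destination: str = "rhea-reactions.txt.gz") -> str:
    """Download remote files to `destination`; assume anything else is a local path."""
    if is_remote(location):
        urllib.request.urlretrieve(location, destination)
        return destination
    return location
```

Note that the pattern only matches HTTP(S), so an `ftp://` address would be treated as a local path.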

ochkalova added 30 commits November 4, 2024 09:56

add script to filter viral proteins from Uniref90

ab21f46

add script to filter proteins without Rheas from Uniref90

12dabc6

add draft modules and subworkflows

89fbe9d

add variables to config

0c04257

add script to reformat rhea-reactions.txt file

86a5b33

correct script

d42f2ed

correct script

ad0d24d

correct script

72fed20

correct script

b490bac

update draft modules

ab1ad54

raname and update scripts

ae1101f

update draft subworkflows

310a8b1

update modules

06160ce

update the workflow

6af26c8

correct typo

345f438

chmod +x python scripts

89499a8

add test for uniref90_rhea_filter

4af026d

add test for uniref90_non_viral_filter

cd648ea

add config to run tests locally

9f2a116

modify to correctly capture ENZYME field

9fa42ae

add test

e66e5c4

fix inconsistent naming

f17f303

enable docker for locl tests

dd90672

add test configs

cebbc3a

add subworkflow test

2baeecd

modify subworkflow

97dc697

add snapshot ids

81916a5

set workDir in test_local config

4421517

fix error

04701b8

update snapshot

bfe2303

ochkalova added 8 commits November 14, 2024 12:27

crete modules.config

3482e63

add version specification in file names

6e041f3

add complete nf-core diamond module

c1a860e

update tests and snapshots

5442e52

change label from medium to light

b3d9ae9

remove CheckIfExists

d83d11b

use wget instead of file() for remote file

ffc3ea6

add error code to retry logic

0b999a5

ochkalova self-assigned this Dec 3, 2024

add new label "process_single"

0428374

ochkalova requested review from chrisAta, mberacochea and KateSakharova December 3, 2024 12:17

KateSakharova requested changes Dec 3, 2024

ochkalova added 7 commits December 4, 2024 11:26

update test profile

fff25a6

use """script""" instead of "line"

088b810

create parameter for magic number

ffe336b

fix module to be able run tests

688281b

remove unnecessary input arg

013addc

add publish_dir_mode to params

7eeab87

separate modules to local and nf-core and fix paths

9d611df

ochkalova added 4 commits December 5, 2024 11:28

add retry to process_single

9a78d7d

remove check_max method

ee7a8bc

remove email specification for Entrez

0e37ab5

use wider error strategy

2568e2e

chrisAta approved these changes Dec 10, 2024

add copyright info

27d6e89
