Extend annotations #1

Closed
wants to merge 106 commits into from
Commits
33341ac
Added pre-processing script for UniRule
tgurbich Nov 16, 2023
123be4b
Added species names to UniRule prep script
tgurbich Nov 17, 2023
4ca71fb
Added GECCO
tgurbich Nov 17, 2023
fc88a23
Added accounting for empty GECCO results
tgurbich Nov 20, 2023
83dbf96
Removed old code
tgurbich Nov 20, 2023
4f153c6
Fixed typo
tgurbich Nov 20, 2023
05ab91d
Edited syntax
tgurbich Nov 20, 2023
43778d7
Edited syntax
tgurbich Nov 20, 2023
121d831
Bug fixes
tgurbich Nov 20, 2023
089762c
Added Defense Finder
tgurbich Nov 20, 2023
d72f870
Bug fixes
tgurbich Nov 20, 2023
37b2cba
Added UniFire
tgurbich Nov 20, 2023
590a513
Bug fix
tgurbich Nov 20, 2023
2371b8c
A few fixes to the new tools
mberacochea Nov 21, 2023
6a92fb6
Fix GECCO invocation and some typos
mberacochea Nov 21, 2023
14ae4ee
Fixed GECCO, UniRULE and MultiQC
mberacochea Nov 22, 2023
af6ff56
Added UniFire memory requirements
tgurbich Nov 22, 2023
07b5f86
Fixed typo
tgurbich Nov 22, 2023
4446eeb
Add tower.yml
mberacochea Nov 22, 2023
ddafcc5
Merge branch 'extend-annotations' of github.com:EBI-Metagenomics/mett…
mberacochea Nov 22, 2023
dd81bce
Patch rRNA detection step with tmp folder fix.
mberacochea Nov 22, 2023
b3e5d4f
Rename tower.yml
mberacochea Nov 22, 2023
8026915
Added UniRule post-processing script - WIP
tgurbich Nov 23, 2023
4bf4054
Combined parsing functions
tgurbich Nov 23, 2023
a147b65
Add antiSMASH 7.1
mberacochea Nov 24, 2023
8da6072
Merge branch 'extend-annotations' of github.com:EBI-Metagenomics/mett…
mberacochea Nov 24, 2023
0c77757
Adjust antiSMASH outdir results structure
mberacochea Nov 27, 2023
ddb8903
Adjust the outdir results structure
mberacochea Nov 27, 2023
38f08f6
Typo on process selector
mberacochea Nov 27, 2023
2fa0916
Added dbcan
tgurbich Nov 29, 2023
9fb11ba
Merge branch 'extend-annotations' of https://github.com/EBI-Metagenom…
tgurbich Nov 29, 2023
f0f0f56
Bug fix
tgurbich Nov 29, 2023
b606043
Bug fix
tgurbich Nov 29, 2023
f92bb13
Input format change
tgurbich Nov 29, 2023
ac6719b
Fixed typos and removed testing edits
tgurbich Nov 30, 2023
7b0aca6
Bug fix and typo fixes
tgurbich Nov 30, 2023
8ce9bc6
Added GO-terms to the annotate_gff script
tgurbich Nov 30, 2023
8f05c94
Removed reporting of empty IPS
tgurbich Nov 30, 2023
7b2df5e
Removed empty values from GFF
tgurbich Nov 30, 2023
d679dbe
Bug fix in annotate_gff
tgurbich Nov 30, 2023
566de3d
Added dbcan post-processing
tgurbich Dec 1, 2023
efba874
dbCAN GFF format edits
tgurbich Dec 1, 2023
b541105
Tweak the results folder structure.
mberacochea Dec 1, 2023
2180c9f
Annotate GFF versions fix
mberacochea Dec 1, 2023
b323010
Format edits for dbcan processing script
tgurbich Dec 1, 2023
725665b
Added unifire post-processing
tgurbich Dec 1, 2023
7ec709b
WIP
tgurbich Dec 1, 2023
ce7a245
WIP
tgurbich Dec 1, 2023
3b5f16a
Added join to arguments
tgurbich Dec 1, 2023
af84985
WIP
tgurbich Dec 1, 2023
18c7333
WIP
tgurbich Dec 1, 2023
f484ef1
Bug fix
tgurbich Dec 5, 2023
6387305
Bug fix
tgurbich Dec 5, 2023
0825e45
GFF format changes
tgurbich Dec 6, 2023
3033c5e
Added defense finder post-processing script
tgurbich Dec 7, 2023
fd053c6
Added DefenseFinder post-processing to nf
tgurbich Dec 7, 2023
1b0c9ca
Added duplicate resolution script - WIP
tgurbich Dec 7, 2023
c3c70ee
Format edit
tgurbich Dec 7, 2023
2b7389a
Changed dbcan output name format
tgurbich Dec 7, 2023
5024ed0
Added AMRFinderPlus processing script
tgurbich Dec 7, 2023
db27d87
Added AMR post-processing to the AMR module
tgurbich Dec 7, 2023
86b6021
Fixed unifire post-processing script
tgurbich Dec 7, 2023
3b11069
Temporarily removed amrfinder post-processing
tgurbich Dec 7, 2023
7e22e4b
Fixed typo
tgurbich Dec 7, 2023
8a4726c
Fixed typo
tgurbich Dec 7, 2023
4037c7d
Removed post-processing
tgurbich Dec 7, 2023
699e42e
Add an antismash to gff script
mberacochea Dec 7, 2023
150b35e
Move PROKKA results to the functional annotation folder.
mberacochea Dec 7, 2023
3af5a9f
Move sanntis to the bgc results folder
mberacochea Dec 7, 2023
0d721f3
Changed antismash GFF name
tgurbich Dec 8, 2023
6f036db
Text edits
tgurbich Dec 8, 2023
9d4bde2
Format edits
tgurbich Dec 8, 2023
18549fe
Merge pull request #2 from EBI-Metagenomics/feature/antismash-gff
tgurbich Dec 8, 2023
c74a36c
Bug fix
tgurbich Dec 8, 2023
12e7f86
Added an extra GFF processing script to the annotate_gff process
mberacochea Dec 8, 2023
e039cf2
Fixed the schema
mberacochea Dec 8, 2023
93a0f72
Fix nf-core pipeline linting issues
mberacochea Dec 8, 2023
d4956be
convert_cds_into_multiple_lines correction
mberacochea Dec 8, 2023
d9c49de
CRISPR fix for correct visualisation
tgurbich Dec 8, 2023
e9d4b9e
Added protein component filter
tgurbich Dec 8, 2023
b94715f
Removed uniprot fields that are not in the existing dictionary
tgurbich Dec 8, 2023
4fc6da8
Add an AMRFinderPlus GFF generation step
mberacochea Dec 8, 2023
a513267
Fixed crispr naming
tgurbich Dec 8, 2023
b891c33
Typo in convert_cds script. Added tag to IPS
mberacochea Dec 8, 2023
9d90096
Typos in amrfinder post processing
mberacochea Dec 8, 2023
836227d
AMRFINDER_PLUS_TO_GFF add required -v param
mberacochea Dec 8, 2023
6aa4bcc
Removed parent field from flank sequence in CRISPRCas results
tgurbich Dec 8, 2023
1b2b761
Tweak DBCan outputfolder
mberacochea Dec 8, 2023
bd873c2
Remove the structure annotation folder.
mberacochea Dec 8, 2023
abd9e22
Merge pull request #3 from EBI-Metagenomics/feature/gff_post_processi…
tgurbich Dec 8, 2023
d5526d3
Mobilome merger script added
Ales-ibt Dec 8, 2023
dad4d02
Finished GECCO addition to annotate_gff script
tgurbich Dec 8, 2023
f371d33
Resolved
tgurbich Dec 8, 2023
bbb596c
Added antismash to annotate_gff
tgurbich Dec 8, 2023
1c18835
Removed extra spaces from dbcan protein fams
tgurbich Dec 8, 2023
7b76e63
Added dbCAN to the annotate_gff script
tgurbich Dec 8, 2023
c1f98b9
Added defense finder to annotate_gff
tgurbich Dec 8, 2023
62c3f8a
Added arguments to annotate_gff
tgurbich Dec 11, 2023
574dd1e
Added combined unifire output
tgurbich Dec 11, 2023
a960896
Format
tgurbich Dec 11, 2023
d634e08
Removed gff line splitting
tgurbich Dec 11, 2023
ae7e2fa
Removed mmseqs from the duplicate resolution script
tgurbich Jan 15, 2024
bfd2174
Fixed integer checked in duplicate gene names
tgurbich Jan 15, 2024
d3bc930
Moved duplicate loading into a separate function
tgurbich Jan 15, 2024
4813b3f
Refactored the duplicate resolution script
tgurbich Jan 16, 2024
c15f62a
WIP
tgurbich Jan 19, 2024
2 changes: 1 addition & 1 deletion .editorconfig
@@ -8,7 +8,7 @@ trim_trailing_whitespace = true
 indent_size = 4
 indent_style = space

-[*.{md,yml,yaml,html,css,scss,js}]
+[*.{md,yml,yaml,html,css,scss,js,json}]
 indent_size = 2

 # These files are edited and tested upstream in nf-core/modules
2 changes: 1 addition & 1 deletion .github/workflows/linting.yml
@@ -78,7 +78,7 @@ jobs:

       - uses: actions/setup-python@v4
         with:
-          python-version: "3.8"
+          python-version: "3.11"
           architecture: "x64"

       - name: Install dependencies
13 changes: 13 additions & 0 deletions assets/multiqc_config.yml
@@ -9,11 +9,24 @@ report_section_order:
  "ebi-metagenomics-mettannotator-summary":
    order: -1002

run_modules:
  - quast
  - prokka
  - custom_content

top_modules:
  - quast
  - prokka

prokka_table: true
prokka_fn_snames: true

sp:
  quast_config:
    fn: "*.tsv"

export_plots: true

## Prettification
custom_logo_url: https://github.com/ebi-metagenomics/mettannotator
custom_logo_title: "ebi-metagenomics/mettannotator"
290 changes: 229 additions & 61 deletions bin/annotate_gff.py

Large diffs are not rendered by default.

155 changes: 155 additions & 0 deletions bin/antismash_to_gff.py
@@ -0,0 +1,155 @@
#!/usr/bin/env python3
# -*- coding: utf-8 -*-

# Copyright 2023 EMBL - European Bioinformatics Institute
#
# Licensed under the Apache License, Version 2.0 (the "License");
# you may not use this file except in compliance with the License.
# You may obtain a copy of the License at
# http://www.apache.org/licenses/LICENSE-2.0
#
# Unless required by applicable law or agreed to in writing, software
# distributed under the License is distributed on an "AS IS" BASIS,
# WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied.
# See the License for the specific language governing permissions and
# limitations under the License.
#
import argparse
import json
from collections import namedtuple

import logging

logger = logging.getLogger(__name__)

BGC = namedtuple("BGC", "contig_name bgc_name start end orfs")
ORF = namedtuple("ORF", "locus_tag start end strand type product")


def load_regions_json(json_file) -> list[BGC]:
    """Load the gene types from the json
    The structure of the json is:
    {
        "length": 4688977,
        "seq_id": "contig_4",
        "regions": [{
            "start": 1185393,
            "end": 1205746,
            "idx": 1,
            "orfs": [
                {
                    "start": 1187624,
                    "end": 1188664,
                    "strand": 1,
                    "locus_tag": "BU_ATCC8492_00951",
                    "type": "biosynthetic-additional",
                    "description": "<div class=\"focus-intro\">\n <strong><span class=\"serif\">BU_ATCC8492_00951</span></strong><br>... ",
                    "dna": "ATGCAGAAACGACCTCT...",
                    "translation": "MQKRPLLGLT...",
                    "product": "Dual-specificity RNA methyltransferase RlmN"
                },...
            ],..
    }
    Note: this json file is built from the regions.js file of the html output of antiSMASH.
    To build the json file:
    echo ";var fs = require('fs'); fs.writeFileSync('./regions.json', JSON.stringify(recordData));" >> convert_to_json.js
    node geneclusters.js # this will generate the geneclusters.json file
    """

    bgcs = []

    with open(json_file) as json_handle:
        json_dict = json.load(json_handle)
        for bgc_entry in json_dict:
            regions = bgc_entry.get("regions")
            if not regions:
                # ignore this contig, no bgc found
                continue
            for region in regions:
                index = region["idx"]
                region_start = region["start"]
                region_end = region["end"]
                bgc_orfs = []
                for orf in region.get("orfs", []):
                    bgc_orfs.append(
                        ORF(
                            locus_tag=orf["locus_tag"],
                            start=orf["start"],
                            end=orf["end"],
                            strand=int(orf["strand"]),
                            product=orf["product"],
                            type=orf["type"],
                        )
                    )
                bgcs.append(
                    BGC(
                        contig_name=bgc_entry["seq_id"],
                        bgc_name=f"{bgc_entry['seq_id']}_bgc{index}",
                        start=region_start,
                        end=region_end,
                        orfs=bgc_orfs,
                    )
                )

    return bgcs


def build_gff(regions_json, antismash_version):
    """Build the GFF rows from the antiSMASH regions json"""
    bgc_regions: list[BGC] = load_regions_json(regions_json)
    for bgc in bgc_regions:
        # BGC region "parent"
        yield [
            bgc.contig_name,
            f"antiSMASH:{antismash_version}",
            "biosynthetic-gene-cluster",
            bgc.start,
            bgc.end,
            ".",
            ".",
            ".",
            f"ID={bgc.bgc_name}",
        ]
        orf: ORF
        for orf in bgc.orfs:
            yield [
                bgc.contig_name,
                f"antiSMASH:{antismash_version}",
                "CDS",
                orf.start,  # FIXME, should we correct the offset (gff are +1)?
                orf.end,
                ".",  # TODO, it should be possible to get the confidence score from the antismash gbk result file
                "+" if orf.strand == 1 else "-",
                ".",
                ";".join(
                    [
                        f"ID={orf.locus_tag}",
                        f"Parent={bgc.bgc_name}",
                        f"product={orf.product}",
                        f"function={orf.type}"
                    ]
                ),
            ]


if __name__ == "__main__":
    parser = argparse.ArgumentParser(description="Build an antiSMASH GFF from the json-fied regions.js file")
    parser.add_argument(
        "-r",
        dest="regions",
        help="antiSMASH json-fied regions.js file (it should be regions.json not .js)",
        required=True,
    )
    parser.add_argument(
        "-a",
        dest="antismash_version",
        help="The version of antiSMASH",
        required=True,
    )
    parser.add_argument("-o", dest="out", help="Output GFF file name", required=True)
    args = parser.parse_args()

    with open(args.out, "w") as out_handle:
        print("##gff-version 3", file=out_handle)
        for row in build_gff(args.regions, args.antismash_version):
            print("\t".join(map(str, row)), file=out_handle)
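
To make the output of this script easier to review, here is a minimal sketch of the rows it appears to emit for one region with one ORF. It mirrors the row layout of build_gff but is not the script itself: `sketch_rows` and `SAMPLE_JSON` are hypothetical names, the sample is wrapped in a list on the assumption that the real regions.json is a list of contig records (the loader iterates over it), and "7.1.0" is only an illustrative version string. Coordinates are passed through unchanged, so the FIXME about the +1 offset stays open here as well.

```python
import json

# Hypothetical input shaped like the regions.json described in the docstring,
# wrapped in a list because load_regions_json iterates over contig records.
SAMPLE_JSON = """
[
  {"seq_id": "contig_4", "length": 4688977, "regions": [
    {"idx": 1, "start": 1185393, "end": 1205746, "orfs": [
      {"locus_tag": "BU_ATCC8492_00951", "start": 1187624, "end": 1188664,
       "strand": 1, "type": "biosynthetic-additional",
       "product": "Dual-specificity RNA methyltransferase RlmN"}
    ]}
  ]},
  {"seq_id": "contig_5", "length": 1000, "regions": []}
]
"""


def sketch_rows(records, antismash_version):
    """Yield GFF rows in the same layout as build_gff, without the namedtuples."""
    source = f"antiSMASH:{antismash_version}"
    for record in records:
        # Contigs without regions yield nothing, as in load_regions_json.
        for region in record.get("regions") or []:
            bgc_name = f"{record['seq_id']}_bgc{region['idx']}"
            # Parent row for the whole cluster region.
            yield [record["seq_id"], source, "biosynthetic-gene-cluster",
                   region["start"], region["end"], ".", ".", ".", f"ID={bgc_name}"]
            # One CDS row per ORF, linked to the region through Parent=.
            for orf in region.get("orfs", []):
                attributes = ";".join([
                    f"ID={orf['locus_tag']}",
                    f"Parent={bgc_name}",
                    f"product={orf['product']}",
                    f"function={orf['type']}",
                ])
                yield [record["seq_id"], source, "CDS", orf["start"], orf["end"],
                       ".", "+" if int(orf["strand"]) == 1 else "-", ".", attributes]


# "7.1.0" is an illustrative version string, not taken from the pipeline config.
gff_lines = ["\t".join(map(str, row))
             for row in sketch_rows(json.loads(SAMPLE_JSON), "7.1.0")]
print("##gff-version 3")
print("\n".join(gff_lines))
```

GFF3 reserves `;`, `=`, `&` and `,` inside attribute values, so a product string containing one of them would need percent-encoding. The sketch, like the script, writes the value unescaped.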