Extend annotations #1

Closed
wants to merge 106 commits into from
Changes from 54 commits
106 commits
33341ac
Added pre-processing script for UniRule
tgurbich Nov 16, 2023
123be4b
Added species names to UniRule prep script
tgurbich Nov 17, 2023
4ca71fb
Added GECCO
tgurbich Nov 17, 2023
fc88a23
Added accounting for empty GECCO results
tgurbich Nov 20, 2023
83dbf96
Removed old code
tgurbich Nov 20, 2023
4f153c6
Fixed typo
tgurbich Nov 20, 2023
05ab91d
Edited syntax
tgurbich Nov 20, 2023
43778d7
Edited syntax
tgurbich Nov 20, 2023
121d831
Bug fixes
tgurbich Nov 20, 2023
089762c
Added Defense Finder
tgurbich Nov 20, 2023
d72f870
Bug fixes
tgurbich Nov 20, 2023
37b2cba
Added UniFire
tgurbich Nov 20, 2023
590a513
Bug fix
tgurbich Nov 20, 2023
2371b8c
A few fixes to the new tools
mberacochea Nov 21, 2023
6a92fb6
Fix GECCO invocation and some typos
mberacochea Nov 21, 2023
14ae4ee
Fixed GECCO, UniRULE and MultiQC
mberacochea Nov 22, 2023
af6ff56
Added UniFire memory requirements
tgurbich Nov 22, 2023
07b5f86
Fixed typo
tgurbich Nov 22, 2023
4446eeb
Add tower.yml
mberacochea Nov 22, 2023
ddafcc5
Merge branch 'extend-annotations' of github.com:EBI-Metagenomics/mett…
mberacochea Nov 22, 2023
dd81bce
Patch rRNA detection step with tmp folder fix.
mberacochea Nov 22, 2023
b3e5d4f
Rename tower.yml
mberacochea Nov 22, 2023
8026915
Added UniRule post-processing script - WIP
tgurbich Nov 23, 2023
4bf4054
Combined parsing functions
tgurbich Nov 23, 2023
a147b65
Add antiSMASH 7.1
mberacochea Nov 24, 2023
8da6072
Merge branch 'extend-annotations' of github.com:EBI-Metagenomics/mett…
mberacochea Nov 24, 2023
0c77757
Adjust antiSMASH outdir results structure
mberacochea Nov 27, 2023
ddb8903
Adjust the outdir results structure
mberacochea Nov 27, 2023
38f08f6
Typo on process selector
mberacochea Nov 27, 2023
2fa0916
Added dbcan
tgurbich Nov 29, 2023
9fb11ba
Merge branch 'extend-annotations' of https://github.com/EBI-Metagenom…
tgurbich Nov 29, 2023
f0f0f56
Bug fix
tgurbich Nov 29, 2023
b606043
Bug fix
tgurbich Nov 29, 2023
f92bb13
Input format change
tgurbich Nov 29, 2023
ac6719b
Fixed typos and removed testing edits
tgurbich Nov 30, 2023
7b0aca6
Bug fix and typo fixes
tgurbich Nov 30, 2023
8ce9bc6
Added GO-terms to the annotate_gff script
tgurbich Nov 30, 2023
8f05c94
Removed reporting of empty IPS
tgurbich Nov 30, 2023
7b2df5e
Removed empty values from GFF
tgurbich Nov 30, 2023
d679dbe
Bug fix in annotate_gff
tgurbich Nov 30, 2023
566de3d
Added dbcan post-processing
tgurbich Dec 1, 2023
efba874
dbCAN GFF format edits
tgurbich Dec 1, 2023
b541105
Tweak the results folder structure.
mberacochea Dec 1, 2023
2180c9f
Annotate GFF versions fix
mberacochea Dec 1, 2023
b323010
Format edits for dbcan processing script
tgurbich Dec 1, 2023
725665b
Added unifire post-processing
tgurbich Dec 1, 2023
7ec709b
WIP
tgurbich Dec 1, 2023
ce7a245
WIP
tgurbich Dec 1, 2023
3b5f16a
Added join to arguments
tgurbich Dec 1, 2023
af84985
WIP
tgurbich Dec 1, 2023
18c7333
WIP
tgurbich Dec 1, 2023
f484ef1
Bug fix
tgurbich Dec 5, 2023
6387305
Bug fix
tgurbich Dec 5, 2023
0825e45
GFF format changes
tgurbich Dec 6, 2023
3033c5e
Added defense finder post-processing script
tgurbich Dec 7, 2023
fd053c6
Added DefenseFinder post-processing to nf
tgurbich Dec 7, 2023
1b0c9ca
Added duplicate resolution script - WIP
tgurbich Dec 7, 2023
c3c70ee
Format edit
tgurbich Dec 7, 2023
2b7389a
Changed dbcan output name format
tgurbich Dec 7, 2023
5024ed0
Added AMRFinderPlus processing script
tgurbich Dec 7, 2023
db27d87
Added AMR post-processing to the AMR module
tgurbich Dec 7, 2023
86b6021
Fixed unifire post-processing script
tgurbich Dec 7, 2023
3b11069
Temporarily removed amrfinder post-processing
tgurbich Dec 7, 2023
7e22e4b
Fixed typo
tgurbich Dec 7, 2023
8a4726c
Fixed typo
tgurbich Dec 7, 2023
4037c7d
Removed post-processing
tgurbich Dec 7, 2023
699e42e
Add an antismash to gff script
mberacochea Dec 7, 2023
150b35e
Move PROKKA results to the functional annotation folder.
mberacochea Dec 7, 2023
3af5a9f
Move sanntis to the bgc results folder
mberacochea Dec 7, 2023
0d721f3
Changed antismash GFF name
tgurbich Dec 8, 2023
6f036db
Text edits
tgurbich Dec 8, 2023
9d4bde2
Format edits
tgurbich Dec 8, 2023
18549fe
Merge pull request #2 from EBI-Metagenomics/feature/antismash-gff
tgurbich Dec 8, 2023
c74a36c
Bug fix
tgurbich Dec 8, 2023
12e7f86
Added an extra GFF processing script to the annotate_gff process
mberacochea Dec 8, 2023
e039cf2
Fixed the schema
mberacochea Dec 8, 2023
93a0f72
Fix nf-core pipeline linting issues
mberacochea Dec 8, 2023
d4956be
convert_cds_into_multiple_lines correction
mberacochea Dec 8, 2023
d9c49de
CRISPR fix for correct visualisation
tgurbich Dec 8, 2023
e9d4b9e
Added protein component filter
tgurbich Dec 8, 2023
b94715f
Removed uniprot fields that are not in the existing dictionary
tgurbich Dec 8, 2023
4fc6da8
Add an AMRFinderPlus GFF generation step
mberacochea Dec 8, 2023
a513267
Fixed crispr naming
tgurbich Dec 8, 2023
b891c33
Typo in convert_cds script. Added tag to IPS
mberacochea Dec 8, 2023
9d90096
Typos in amrfinder post processing
mberacochea Dec 8, 2023
836227d
AMRFINDER_PLUS_TO_GFF add required -v param
mberacochea Dec 8, 2023
6aa4bcc
Removed parent field from flank sequence in CRISPRCas results
tgurbich Dec 8, 2023
1b2b761
Tweak DBCan outputfolder
mberacochea Dec 8, 2023
bd873c2
Remove the structure annotation folder.
mberacochea Dec 8, 2023
abd9e22
Merge pull request #3 from EBI-Metagenomics/feature/gff_post_processi…
tgurbich Dec 8, 2023
d5526d3
Mobilome merger script added
Ales-ibt Dec 8, 2023
dad4d02
Finished GECCO addition to annotate_gff script
tgurbich Dec 8, 2023
f371d33
Resolved
tgurbich Dec 8, 2023
bbb596c
Added antismash to annotate_gff
tgurbich Dec 8, 2023
1c18835
Removed extra spaces from dbcan protein fams
tgurbich Dec 8, 2023
7b76e63
Added dbCAN to the annotate_gff script
tgurbich Dec 8, 2023
c1f98b9
Added defense finder to annotate_gff
tgurbich Dec 8, 2023
62c3f8a
Added arguments to annotate_gff
tgurbich Dec 11, 2023
574dd1e
Added combined unifire output
tgurbich Dec 11, 2023
a960896
Format
tgurbich Dec 11, 2023
d634e08
Removed gff line splitting
tgurbich Dec 11, 2023
ae7e2fa
Removed mmseqs from the duplicate resolution script
tgurbich Jan 15, 2024
bfd2174
Fixed integer checked in duplicate gene names
tgurbich Jan 15, 2024
d3bc930
Moved duplicate loading into a separate function
tgurbich Jan 15, 2024
4813b3f
Refactored the duplicate resolution script
tgurbich Jan 16, 2024
c15f62a
WIP
tgurbich Jan 19, 2024
13 changes: 13 additions & 0 deletions assets/multiqc_config.yml
@@ -9,11 +9,24 @@ report_section_order:
  "ebi-metagenomics-mettannotator-summary":
    order: -1002

run_modules:
  - quast
  - prokka
  - custom_content

top_modules:
  - quast
  - prokka

prokka_table: true
prokka_fn_snames: true

sp:
  quast_config:
    fn: "*.tsv"

export_plots: true

## Prettification
custom_logo_url: https://github.com/ebi-metagenomics/mettannotator
custom_logo_title: "ebi-metagenomics/mettannotator"
33 changes: 20 additions & 13 deletions bin/annotate_gff.py
@@ -32,7 +32,8 @@ def get_iprs(ipr_annot):
                 iprs[protein][0].add(pfam)
             if len(cols) > 12:
                 ipr = cols[11]
-                iprs[protein][1].add(ipr)
+                if not ipr == "-":
+                    iprs[protein][1].add(ipr)
     return iprs


@@ -49,22 +50,27 @@ def get_eggnog(eggnog_annot):
             eggnog = [cols[1]]
             try:
                 cog = cols[eggnog_fields["cog_func"]]
-                cog = cog.split()
+                cog = list(cog)
                 if len(cog) > 1:
                     cog = ["R"]
             except Exception:
                 cog = ["NA"]
             kegg = cols[eggnog_fields["KEGG_ko"]].split(",")
-            eggnogs[protein] = [eggnog, cog, kegg]
+            go = cols[eggnog_fields["GOs"]]
+            eggnogs[protein] = [eggnog, cog, kegg, go]
     return eggnogs


 def get_eggnog_fields(line):
     cols = line.strip().split("\t")
+    try:
+        index_of_go = cols.index("GOs")
+    except ValueError:
+        sys.exit("Cannot find the GO terms column.")
     if cols[8] == "KEGG_ko" and cols[15] == "CAZy":
-        eggnog_fields = {"KEGG_ko": 8, "cog_func": 20}
+        eggnog_fields = {"KEGG_ko": 8, "cog_func": 20, "GOs": index_of_go}
     elif cols[11] == "KEGG_ko" and cols[18] == "CAZy":
-        eggnog_fields = {"KEGG_ko": 11, "cog_func": 6}
+        eggnog_fields = {"KEGG_ko": 11, "cog_func": 6, "GOs": index_of_go}
     else:
         sys.exit("Cannot parse eggNOG - unexpected field order or naming")
     return eggnog_fields
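The `cog.split()` → `list(cog)` change above matters for proteins assigned more than one COG category: eggNOG reports them as a single string such as "EG", which `str.split()` leaves as one element, while `list()` separates the letters so the multi-category case can be collapsed. A minimal sketch of that logic (`collapse_cog` is a hypothetical helper, not part of the script):

```python
def collapse_cog(cog_str):
    # Split a COG category string into individual letters;
    # str.split() on "EG" would give ["EG"], hiding the multi-category case.
    cog = list(cog_str)  # "EG" -> ["E", "G"]
    if len(cog) > 1:
        # Multiple categories collapse to "R" (general function prediction only)
        cog = ["R"]
    return cog

print(collapse_cog("E"), collapse_cog("EG"))  # ['E'] ['R']
```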
@@ -195,9 +201,7 @@ def get_amr(amr_file):


 def add_gff(in_gff, eggnog_file, ipr_file, sanntis_file, amr_file):
-    eggnogs = {}
-    if eggnog_file:
-        eggnogs = get_eggnog(eggnog_file)
+    eggnogs = get_eggnog(eggnog_file)
     iprs = get_iprs(ipr_file)
     sanntis_bgcs = get_sanntis(sanntis_file, in_gff)
     amr_annotations = {}
@@ -226,6 +230,8 @@ def add_gff(in_gff, eggnog_file, ipr_file, sanntis_file, amr_file):
                         added_annot[protein]["COG"] = a
                     elif pos == 3:
                         added_annot[protein]["KEGG"] = a
+                    elif pos == 4:
+                        added_annot[protein]["ontology_term"] = a
             except Exception:
                 pass
             try:
@@ -259,7 +265,8 @@ def add_gff(in_gff, eggnog_file, ipr_file, sanntis_file, amr_file):
                 if a == "AMR":
                     cols[8] = "{};{}".format(cols[8], value)
                 else:
-                    cols[8] = "{};{}={}".format(cols[8], a, value)
+                    if not value == "-":
+                        cols[8] = "{};{}={}".format(cols[8], a, value)
         line = "\t".join(cols)
         out_gff.append(line)
     return out_gff
@@ -380,13 +387,13 @@ def add_ncrnas_and_crispr_to_gff(gff_outfile, ncrnas, crispr_annotations, res):
     parser.add_argument(
         "-i",
         dest="ips",
-        help="InterproScan annontations results for the cluster rep",
+        help="InterproScan annotations results for the cluster rep",
         required=True,
     )
     parser.add_argument(
         "-e",
-        dest="eggnong",
-        help="eggnog annontations for the clutser repo",
+        dest="eggnog",
+        help="eggnog annotations for the cluster repo",
         required=False,
     )
     parser.add_argument(
@@ -416,7 +423,7 @@ def add_ncrnas_and_crispr_to_gff(gff_outfile, ncrnas, crispr_annotations, res):

     extended_gff = add_gff(
         in_gff=gff,
-        eggnog_file=args.eggnong,
+        eggnog_file=args.eggnog,
         ipr_file=args.ips,
         sanntis_file=args.sanntis,
         amr_file=args.amr,
99 changes: 99 additions & 0 deletions bin/prepare_unirule_input.py
@@ -0,0 +1,99 @@
#!/usr/bin/env python3

import argparse
import logging
import os
import sys

logging.basicConfig(level=logging.INFO)


def main(infile, outdir):
    taxid = assign_taxid(infile)
    check_dir(outdir)
    outfile = "proteins.fasta"
    outpath = os.path.join(outdir, outfile)
    with open(outpath, "w") as file_out, open(infile, "r") as file_in:
        for line in file_in:
            if line.startswith(">"):
                formatted_line = reformat_line(line, taxid)
                file_out.write(formatted_line)
            else:
                file_out.write(line)


def check_dir(directory_path):
    if not os.path.exists(directory_path):
        try:
            os.makedirs(directory_path)
        except OSError as e:
            logging.error(f"Error: Failed to create directory '{directory_path}'. {e}")


def reformat_line(line, taxid):
    line = line.lstrip(">").strip()
    id, description = line.split(maxsplit=1)
    if taxid == "820":
        sp_name = "Bacteroides uniformis"
    elif taxid == "821":
        sp_name = "Phocaeicola vulgatus"
    elif taxid == "46503":
        sp_name = "Parabacteroides merdae"
    else:
        raise ValueError("Unknown species")
    formatted_line = ">tr|{id}|{description} OS={sp_name} OX={taxid}\n".format(
        id=id, description=description, sp_name=sp_name, taxid=taxid
    )
    return formatted_line


def assign_taxid(infile):
    try:
        with open(infile, "r") as file:
            # Read the first line
            first_line = file.readline().strip()
            species_code = first_line[1:3]

            # Assign taxid based on species code
            if species_code == "BU":
                taxid = "820"
            elif species_code == "PV":
                taxid = "821"
            elif species_code == "PM":
                taxid = "46503"
            else:
                raise ValueError("Unknown species")
            return taxid
    except Exception as e:
        logging.error(f"Error: {e}")
        sys.exit(1)


def parse_args():
    parser = argparse.ArgumentParser(
        description=(
            "The script reformats the fasta faa file to prepare it for UniRule."
        )
    )
    parser.add_argument(
        "-i",
        dest="infile",
        required=True,
        help="Input protein fasta file.",
    )
    parser.add_argument(
        "-o",
        dest="outdir",
        required=True,
        help="Path to the folder where the output will be saved to.",
    )
    return parser.parse_args()


if __name__ == "__main__":
    args = parse_args()
    main(
        args.infile,
        args.outdir,
    )
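To illustrate what the script above does to FASTA headers, here is a trimmed-down sketch of `reformat_line` with the species lookup passed in as arguments (the example header is made up):

```python
def reformat_header(line, taxid, sp_name):
    # Rewrite ">ID description" into a UniProt-style header, as reformat_line does,
    # adding OS (organism name) and OX (taxon id) fields expected by UniRule/UniFire.
    line = line.lstrip(">").strip()
    seq_id, description = line.split(maxsplit=1)
    return ">tr|{}|{} OS={} OX={}".format(seq_id, description, sp_name, taxid)

print(reformat_header(">BU_00001 hypothetical protein", "820", "Bacteroides uniformis"))
# >tr|BU_00001|hypothetical protein OS=Bacteroides uniformis OX=820
```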
124 changes: 124 additions & 0 deletions bin/process_dbcan_result.py
@@ -0,0 +1,124 @@
#!/usr/bin/env python3

import argparse
import logging
import os
import sys

logging.basicConfig(level=logging.INFO)


def main(input_folder, outfile, dbcan_version):
    if not check_folder_completeness(input_folder):
        sys.exit("Missing dbCAN outputs. Exiting.")
    substrates = load_substrates(input_folder)
    cgc_locations = load_cgcs(input_folder)
    print_gff(input_folder, outfile, dbcan_version, substrates, cgc_locations)


def load_cgcs(input_folder):
    cgc_locations = dict()
    with open(os.path.join(input_folder, "cgc_standard.out")) as file_in:
        for line in file_in:
            if not line.startswith("CGC#"):
                cgc, _, contig, _, start, end, _, _ = line.strip().split("\t")
                if cgc in cgc_locations:
                    if cgc_locations[cgc]["start"] > int(start):
                        cgc_locations[cgc]["start"] = int(start)
                    if cgc_locations[cgc]["end"] < int(end):
                        cgc_locations[cgc]["end"] = int(end)
                else:
                    cgc_locations[cgc] = {
                        "start": int(start),
                        "end": int(end),
                        "contig": contig,
                    }
    return cgc_locations
def print_gff(input_folder, outfile, dbcan_version, substrates, cgc_locations):
Review comment (Member):
I think this method can be more pythonic:

import csv

def print_gff(input_folder, outfile, dbcan_version, substrates, cgc_locations):
    with open(outfile, "w") as f_out:
        writer = csv.writer(f_out, delimiter="\t")

        writer.writerow(["##gff-version 3"])

        printed_cgcs = []

        with open(os.path.join(input_folder, "cgc_standard.out")) as f_in:
            reader = csv.reader(f_in, delimiter="\t")
            for line in reader:
                if line[0].startswith("CGC#"):
                    continue

                cgc, gene_type, contig, prot_id, start, end, strand, protein_fam = line

                if cgc not in printed_cgcs:
                    substrate = substrates.get(cgc, "substrate_dbcan_pul=N/A;substrate_ecami=N/A")

                    writer.writerow([contig, f"dbCAN:{dbcan_version}", "predicted PUL",
                                     cgc_locations[cgc]["start"], cgc_locations[cgc]["end"],
                                     ".", ".", ".", f"ID={cgc};{substrate}"])
                    printed_cgcs.append(cgc)

                writer.writerow([contig, f"dbCAN:{dbcan_version}", gene_type, start, end,
                                 ".", strand, ".", f"ID={prot_id};Parent={cgc},protein_family={protein_fam}"])

this is not tested

    with open(outfile, "w") as file_out:
        file_out.write("##gff-version 3\n")
        cgcs_printed = list()
        with open(os.path.join(input_folder, "cgc_standard.out")) as file_in:
            for line in file_in:
                if not line.startswith("CGC#"):
                    cgc, gene_type, contig, prot_id, start, end, strand, protein_fam = line.strip().split("\t")
                    if cgc not in cgcs_printed:
                        substrate = substrates[cgc] if cgc in substrates else "substrate_dbcan_pul=N/A;substrate_ecami=N/A"
                        file_out.write("{}\tdbCAN:{}\tpredicted PUL\t{}\t{}\t.\t.\t.\tID={};{}\n".format(
                            contig, dbcan_version, cgc_locations[cgc]["start"], cgc_locations[cgc]["end"], cgc,
                            substrate))
                        cgcs_printed.append(cgc)
                    file_out.write("{}\tdbCAN:{}\t{}\t{}\t{}\t.\t{}\t.\tID={};Parent={},protein_family={}\n".format(
                        contig, dbcan_version, gene_type, start, end, strand, prot_id, cgc, protein_fam))


def load_substrates(input_folder):
    substrates = dict()
    with open(os.path.join(input_folder, "sub.prediction.out"), "r") as file_in:
        for line in file_in:
            if not line.startswith("#"):
                parts = line.strip().split("\t")
                cgc = parts[0].split("|")[1]
                try:
                    substrate_pul = parts[2]
                except IndexError:
                    substrate_pul = "N/A"
                try:
                    substrate_ecami = parts[5]
                except IndexError:
                    substrate_ecami = "N/A"
                if not substrate_pul:
                    substrate_pul = "N/A"
                if not substrate_ecami:
                    substrate_ecami = "N/A"
                substrates[cgc] = "substrate_dbcan_pul={};substrate_ecami={}".format(substrate_pul, substrate_ecami)
    return substrates


def check_folder_completeness(input_folder):
    status = True
    for file in ["cgc_standard.out", "overview.txt", "sub.prediction.out"]:
        if not os.path.exists(os.path.join(input_folder, file)):
            logging.error("File {} does not exist.".format(file))
            status = False
    return status


def parse_args():
    parser = argparse.ArgumentParser(
        description=(
            "The script takes dbCAN output and parses it to create a standalone GFF."
        )
    )
    parser.add_argument(
        "-i",
        dest="input_folder",
        required=True,
        help="Path to the folder with dbCAN results.",
    )
    parser.add_argument(
        "-o",
        dest="outfile",
        required=True,
        help="Path to the output file.",
    )
    parser.add_argument(
        "-v",
        dest="dbcan_ver",
        required=True,
        help="dbCAN version used.",
    )
    return parser.parse_args()


if __name__ == "__main__":
    args = parse_args()
    main(
        args.input_folder,
        args.outfile,
        args.dbcan_ver,
    )
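The interval handling in `load_cgcs` above (widen the stored CGC boundaries whenever a new gene extends them) can be sketched compactly with `min`/`max`; this is an illustrative rewrite, not code from the PR:

```python
def merge_location(locations, cgc, contig, start, end):
    # Widen an existing CGC interval or create a new one,
    # matching the boundary logic of load_cgcs above.
    if cgc in locations:
        locations[cgc]["start"] = min(locations[cgc]["start"], start)
        locations[cgc]["end"] = max(locations[cgc]["end"], end)
    else:
        locations[cgc] = {"start": start, "end": end, "contig": contig}

locs = {}
merge_location(locs, "CGC1", "contig_1", 100, 500)
merge_location(locs, "CGC1", "contig_1", 50, 700)
print(locs["CGC1"])  # {'start': 50, 'end': 700, 'contig': 'contig_1'}
```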