Extend annotations #1
Closed
+2,902 −389
106 commits
33341ac  Added pre-processing script for UniRule (tgurbich)
123be4b  Added species names to UniRule prep script (tgurbich)
4ca71fb  Added GECCO (tgurbich)
fc88a23  Added accounting for empty GECCO results (tgurbich)
83dbf96  Removed old code (tgurbich)
4f153c6  Fixed typo (tgurbich)
05ab91d  Edited syntax (tgurbich)
43778d7  Edited syntax (tgurbich)
121d831  Bug fixes (tgurbich)
089762c  Added Defense Finder (tgurbich)
d72f870  Bug fixes (tgurbich)
37b2cba  Added UniFire (tgurbich)
590a513  Bug fix (tgurbich)
2371b8c  A few fixes to the new tools (mberacochea)
6a92fb6  Fix GECCO invocation and some typos (mberacochea)
14ae4ee  Fixed GECCO, UniRULE and MultiQC (mberacochea)
af6ff56  Added UniFire memory requirements (tgurbich)
07b5f86  Fixed typo (tgurbich)
4446eeb  Add tower.yml (mberacochea)
ddafcc5  Merge branch 'extend-annotations' of github.com:EBI-Metagenomics/mett… (mberacochea)
dd81bce  Patch rRNA detection step with tmp folder fix. (mberacochea)
b3e5d4f  Rename tower.yml (mberacochea)
8026915  Added UniRule post-processing script - WIP (tgurbich)
4bf4054  Combined parsing functions (tgurbich)
a147b65  Add antiSMASH 7.1 (mberacochea)
8da6072  Merge branch 'extend-annotations' of github.com:EBI-Metagenomics/mett… (mberacochea)
0c77757  Adjust antiSMASH outdir results structure (mberacochea)
ddb8903  Adjust the outdir results structure (mberacochea)
38f08f6  Typo on process selector (mberacochea)
2fa0916  Added dbcan (tgurbich)
9fb11ba  Merge branch 'extend-annotations' of https://github.com/EBI-Metagenom… (tgurbich)
f0f0f56  Bug fix (tgurbich)
b606043  Bug fix (tgurbich)
f92bb13  Input format change (tgurbich)
ac6719b  Fixed typos and removed testing edits (tgurbich)
7b0aca6  Bug fix and typo fixes (tgurbich)
8ce9bc6  Added GO-terms to the annotate_gff script (tgurbich)
8f05c94  Removed reporting of empty IPS (tgurbich)
7b2df5e  Removed empty values from GFF (tgurbich)
d679dbe  Bug fix in annotate_gff (tgurbich)
566de3d  Added dbcan post-processing (tgurbich)
efba874  dbCAN GFF format edits (tgurbich)
b541105  Tweak the results folder structure. (mberacochea)
2180c9f  Annotate GFF versions fix (mberacochea)
b323010  Format edits for dbcan processing script (tgurbich)
725665b  Added unifire post-processing (tgurbich)
7ec709b  WIP (tgurbich)
ce7a245  WIP (tgurbich)
3b5f16a  Added join to arguments (tgurbich)
af84985  WIP (tgurbich)
18c7333  WIP (tgurbich)
f484ef1  Bug fix (tgurbich)
6387305  Bug fix (tgurbich)
0825e45  GFF format changes (tgurbich)
3033c5e  Added defense finder post-processing script (tgurbich)
fd053c6  Added DefenseFinder post-processing to nf (tgurbich)
1b0c9ca  Added duplicate resolution script - WIP (tgurbich)
c3c70ee  Format edit (tgurbich)
2b7389a  Changed dbcan output name format (tgurbich)
5024ed0  Added AMRFinderPlus processing script (tgurbich)
db27d87  Added AMR post-processing to the AMR module (tgurbich)
86b6021  Fixed unifire post-processing script (tgurbich)
3b11069  Temporarily removed amrfinder post-processing (tgurbich)
7e22e4b  Fixed typo (tgurbich)
8a4726c  Fixed typo (tgurbich)
4037c7d  Removed post-processing (tgurbich)
699e42e  Add an antismash to gff script (mberacochea)
150b35e  Move PROKKA results to the functional annotation folder. (mberacochea)
3af5a9f  Move sanntis to the bgc results folder (mberacochea)
0d721f3  Changed antismash GFF name (tgurbich)
6f036db  Text edits (tgurbich)
9d4bde2  Format edits (tgurbich)
18549fe  Merge pull request #2 from EBI-Metagenomics/feature/antismash-gff (tgurbich)
c74a36c  Bug fix (tgurbich)
12e7f86  Added an extra GFF processing script to the annotate_gff process (mberacochea)
e039cf2  Fixed the schema (mberacochea)
93a0f72  Fix nf-core pipeline linting issues (mberacochea)
d4956be  convert_cds_into_multiple_lines correction (mberacochea)
d9c49de  CRISPR fix for correct visualisation (tgurbich)
e9d4b9e  Added protein component filter (tgurbich)
b94715f  Removed uniprot fields that are not in the existing dictionary (tgurbich)
4fc6da8  Add an AMRFinderPlus GFF generation step (mberacochea)
a513267  Fixed crispr naming (tgurbich)
b891c33  Typo in convert_cds script. Added tag to IPS (mberacochea)
9d90096  Typos in amrfinder post processing (mberacochea)
836227d  AMRFINDER_PLUS_TO_GFF add required -v param (mberacochea)
6aa4bcc  Removed parent field from flank sequence in CRISPRCas results (tgurbich)
1b2b761  Tweak DBCan outputfolder (mberacochea)
bd873c2  Remove the structure annotation folder. (mberacochea)
abd9e22  Merge pull request #3 from EBI-Metagenomics/feature/gff_post_processi… (tgurbich)
d5526d3  Mobilome merger script added (Ales-ibt)
dad4d02  Finished GECCO addition to annotate_gff script (tgurbich)
f371d33  Resolved (tgurbich)
bbb596c  Added antismash to annotate_gff (tgurbich)
1c18835  Removed extra spaces from dbcan protein fams (tgurbich)
7b76e63  Added dbCAN to the annotate_gff script (tgurbich)
c1f98b9  Added defense finder to annotate_gff (tgurbich)
62c3f8a  Added arguments to annotate_gff (tgurbich)
574dd1e  Added combined unifire output (tgurbich)
a960896  Format (tgurbich)
d634e08  Removed gff line splitting (tgurbich)
ae7e2fa  Removed mmseqs from the duplicate resolution script (tgurbich)
bfd2174  Fixed integer checked in duplicate gene names (tgurbich)
d3bc930  Moved duplicate loading into a separate function (tgurbich)
4813b3f  Refactored the duplicate resolution script (tgurbich)
c15f62a  WIP (tgurbich)
@@ -0,0 +1,99 @@
#!/usr/bin/env python3

import argparse
import logging
import os
import sys

logging.basicConfig(level=logging.INFO)


def main(infile, outdir):
    taxid = assign_taxid(infile)
    check_dir(outdir)
    outfile = "proteins.fasta"
    outpath = os.path.join(outdir, outfile)
    with open(outpath, "w") as file_out, open(infile, "r") as file_in:
        for line in file_in:
            if line.startswith(">"):
                formatted_line = reformat_line(line, taxid)
                file_out.write(formatted_line)
            else:
                file_out.write(line)


def check_dir(directory_path):
    if not os.path.exists(directory_path):
        try:
            os.makedirs(directory_path)
        except OSError as e:
            logging.error(f"Error: Failed to create directory '{directory_path}'. {e}")


def reformat_line(line, taxid):
    line = line.lstrip('>').strip()
    id, description = line.split(maxsplit=1)
    if taxid == "820":
        sp_name = "Bacteroides uniformis"
    elif taxid == "821":
        sp_name = "Phocaeicola vulgatus"
    elif taxid == "46503":
        sp_name = "Parabacteroides merdae"
    else:
        raise ValueError("Unknown species")
    formatted_line = ">tr|{id}|{description} OS={sp_name} OX={taxid}\n".format(id=id, description=description,
                                                                               sp_name=sp_name, taxid=taxid)
    return formatted_line


def assign_taxid(infile):
    try:
        with open(infile, 'r') as file:
            # Read the first line
            first_line = file.readline().strip()
            species_code = first_line[1:3]

            # Assign taxid based on species code
            if species_code == "BU":
                taxid = "820"
            elif species_code == "PV":
                taxid = "821"
            elif species_code == "PM":
                taxid = "46503"
            else:
                raise ValueError("Unknown species")
            return taxid
    except Exception as e:
        logging.error(f"Error: {e}")
        exit(1)


def parse_args():
    parser = argparse.ArgumentParser(
        description=(
            "The script reformats the fasta faa file to prepare it for UniRule."
        )
    )
    parser.add_argument(
        "-i",
        dest="infile",
        required=True,
        help="Input protein fasta file.",
    )
    parser.add_argument(
        "-o",
        dest="outdir",
        required=True,
        help=(
            "Path to the folder where the output will be saved to."
        ),
    )
    return parser.parse_args()


if __name__ == "__main__":
    args = parse_args()
    main(
        args.infile,
        args.outdir,
    )
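To make the header rewrite above concrete, here is a small illustrative snippet. The protein ID in the header is invented for the example (only the BU/PV/PM prefixes, species names, and taxids come from the script itself), and it assumes reformat_line and assign_taxid are in scope, e.g. when run from this module.

# Illustrative example only; the protein ID "BU_xxx" is hypothetical.
# An input header such as ">BU_xxx hypothetical protein" has species code "BU",
# so assign_taxid() returns taxid "820" and reformat_line() produces a
# UniProt-style header with OS/OX fields for the UniRule/UniFire step:
#
#   >tr|BU_xxx|hypothetical protein OS=Bacteroides uniformis OX=820
#
header = ">BU_xxx hypothetical protein\n"
print(reformat_line(header, "820"), end="")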
@@ -0,0 +1,124 @@
#!/usr/bin/env python3

import argparse
import logging
import os
import sys

logging.basicConfig(level=logging.INFO)


def main(input_folder, outfile, dbcan_version):
    if not check_folder_completeness(input_folder):
        sys.exit("Missing dbCAN outputs. Exiting.")
    substrates = load_substrates(input_folder)
    cgc_locations = load_cgcs(input_folder)
    print_gff(input_folder, outfile, dbcan_version, substrates, cgc_locations)


def load_cgcs(input_folder):
    cgc_locations = dict()
    with open(os.path.join(input_folder, "cgc_standard.out")) as file_in:
        for line in file_in:
            if not line.startswith("CGC#"):
                cgc, _, contig, _, start, end, _, _ = line.strip().split("\t")
                if cgc in cgc_locations:
                    if cgc_locations[cgc]["start"] > int(start):
                        cgc_locations[cgc]["start"] = int(start)
                    if cgc_locations[cgc]["end"] < int(end):
                        cgc_locations[cgc]["end"] = int(end)
                else:
                    cgc_locations[cgc] = {"start": int(start),
                                          "end": int(end),
                                          "contig": contig}
    return cgc_locations


def print_gff(input_folder, outfile, dbcan_version, substrates, cgc_locations):
    with open(outfile, "w") as file_out:
        file_out.write("##gff-version 3\n")
        cgcs_printed = list()
        with open(os.path.join(input_folder, "cgc_standard.out")) as file_in:
            for line in file_in:
                if not line.startswith("CGC#"):
                    cgc, gene_type, contig, prot_id, start, end, strand, protein_fam = line.strip().split("\t")
                    if not cgc in cgcs_printed:
                        substrate = substrates[cgc] if cgc in substrates else "substrate_dbcan_pul=N/A;substrate_ecami=N/A"
                        file_out.write("{}\tdbCAN:{}\tpredicted PUL\t{}\t{}\t.\t.\t.\tID={};{}\n".format(
                            contig, dbcan_version, cgc_locations[cgc]["start"], cgc_locations[cgc]["end"], cgc,
                            substrate))
                        cgcs_printed.append(cgc)
                    file_out.write("{}\tdbCAN:{}\t{}\t{}\t{}\t.\t{}\t.\tID={};Parent={},protein_family={}\n".format(
                        contig, dbcan_version, gene_type, start, end, strand, prot_id, cgc, protein_fam))


def load_substrates(input_folder):
    substrates = dict()
    with open(os.path.join(input_folder, "sub.prediction.out"), "r") as file_in:
        for line in file_in:
            if not line.startswith("#"):
                parts = line.strip().split("\t")
                cgc = parts[0].split("|")[1]
                try:
                    substrate_pul = parts[2]
                except IndexError:
                    substrate_pul = "N/A"
                try:
                    substrate_ecami = parts[5]
                except IndexError:
                    substrate_ecami = "N/A"
                if not substrate_pul:
                    substrate_pul = "N/A"
                if not substrate_ecami:
                    substrate_ecami = "N/A"
                substrates[cgc] = "substrate_dbcan_pul={};substrate_ecami={}".format(substrate_pul, substrate_ecami)
    return substrates


def check_folder_completeness(input_folder):
    status = True
    for file in ["cgc_standard.out", "overview.txt", "sub.prediction.out"]:
        if not os.path.exists(os.path.join(input_folder, file)):
            logging.error("File {} does not exist.".format(file))
            status = False
    return status


def parse_args():
    parser = argparse.ArgumentParser(
        description=(
            "The script takes dbCAN output and parses it to create a standalone GFF."
        )
    )
    parser.add_argument(
        "-i",
        dest="input_folder",
        required=True,
        help="Path to the folder with dbCAN results.",
    )
    parser.add_argument(
        "-o",
        dest="outfile",
        required=True,
        help=(
            "Path to the output file."
        ),
    )
    parser.add_argument(
        "-v",
        dest="dbcan_ver",
        required=True,
        help=(
            "dbCAN version used."
        ),
    )
    return parser.parse_args()


if __name__ == "__main__":
    args = parse_args()
    main(
        args.input_folder,
        args.outfile,
        args.dbcan_ver
    )
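A minimal usage sketch for this dbCAN post-processing script. The folder name, output file name, and version string below are placeholders, not the pipeline's actual invocation; the sketch only illustrates the expected inputs: a dbCAN results folder containing cgc_standard.out, overview.txt, and sub.prediction.out, plus the dbCAN version string that ends up in the GFF source column.

# Hypothetical invocation; all paths and the version string are placeholders.
# Equivalent to running the script with: -i dbcan_output/ -o predicted_cgcs.gff -v 4.0.0
main(
    input_folder="dbcan_output/",   # must contain cgc_standard.out, overview.txt, sub.prediction.out
    outfile="predicted_cgcs.gff",   # GFF3 written here: one "predicted PUL" line per CGC, then its genes
    dbcan_version="4.0.0",          # recorded in column 2 as "dbCAN:<version>"
)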
Review comment:

I think this method can be more pythonic:

import csv

def print_gff(input_folder, outfile, dbcan_version, substrates, cgc_locations):

this is not tested
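The suggested snippet above is cut off, so for illustration only, here is one untested sketch of what a csv.reader-based print_gff could look like. It mirrors the logic of the version in the diff and is not the reviewer's actual code.

import csv
import os

def print_gff(input_folder, outfile, dbcan_version, substrates, cgc_locations):
    # Untested sketch of the csv-based rewrite hinted at in the review comment;
    # it reproduces the behaviour of the original function, nothing more.
    cgc_path = os.path.join(input_folder, "cgc_standard.out")
    with open(outfile, "w") as file_out, open(cgc_path) as file_in:
        file_out.write("##gff-version 3\n")
        cgcs_printed = set()
        for row in csv.reader(file_in, delimiter="\t"):
            if not row or row[0] == "CGC#":
                continue  # skip blank lines and the header row
            cgc, gene_type, contig, prot_id, start, end, strand, protein_fam = row
            if cgc not in cgcs_printed:
                # One "predicted PUL" parent feature per cluster, with substrates if known
                substrate = substrates.get(cgc, "substrate_dbcan_pul=N/A;substrate_ecami=N/A")
                file_out.write("{}\tdbCAN:{}\tpredicted PUL\t{}\t{}\t.\t.\t.\tID={};{}\n".format(
                    contig, dbcan_version, cgc_locations[cgc]["start"], cgc_locations[cgc]["end"],
                    cgc, substrate))
                cgcs_printed.add(cgc)
            file_out.write("{}\tdbCAN:{}\t{}\t{}\t{}\t.\t{}\t.\tID={};Parent={},protein_family={}\n".format(
                contig, dbcan_version, gene_type, start, end, strand, prot_id, cgc, protein_fam))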