Feature request - GenBank format #15

bmichanderson · 2024-08-28T05:31:21Z

Hi Ian,

Thanks for making this great tool available!
I've been able to get this running in a Singularity container on my machine, but I would like to have output in GenBank format (I can't find an option for that).

Generating a GFF3 file works, but I am having issues converting it to GenBank format (e.g. combining features). I've tried https://github.com/chapmanb/bcbb/blob/master/gff/Scripts/gff/gff_to_genbank.py but the output formatting doesn't parse well (unsure why) and the features are not combined properly (e.g. rps12 is broken into multiple features).

I've tried using https://chloe.plastid.org/annotate.html, but the GenBank output only includes CDS features (I want "gene" features and "intron" features too), and again rps12 is broken up.

Using https://chlorobox.mpimp-golm.mpg.de/geseq.html works well and properly combines features, including Chloe annotations.
I'd like to find a way to run Chloe on many fasta files and get GenBank files without having to manually upload them to GeSeq.

Is this a planned feature? Are there other tools that you would recommend?

Cheers,
Ben

ian-small · 2024-08-28T07:04:16Z

Hi Ben, What GenBank format do you want? The .tbl file format for GenBank submission, or the .gbk flat file format that you download from GenBank? (or both?) I should be able to do the latter quite easily now, using https://github.com/BioJulia/GenomicAnnotations.jl as it covers .gbk format at least. Previously the way it was designed meant GenomicAnnotations.jl couldn’t handle trans-spliced genes such as rps12, but 3 weeks ago it was updated and now it looks like it can. When I get a moment I’ll give it a try. Internally Chloe treats rps12 as two distinct genes (rps12A and rps12B), but the intention is to merge them for the output. Unlike most other tools, Chloe can in principle annotate the two half-introns that are trans-spliced. Cheers Ian From: Ben Anderson ***@***.***> Date: Wednesday, 28 August 2024 at 1:31 PM To: ian-small/chloe ***@***.***> Cc: Subscribed ***@***.***> Subject: [ian-small/chloe] Feature request - GenBank format (Issue #15) Hi Ian, Thanks for making this great tool available! I've been able to get this running in a Singularity container on my machine, but I would like to have output in GenBank format (I can't find an option for that). Generating a GFF3 file works, but I am having issues converting it to GenBank format (e.g. combining features). I've tried https://github.com/chapmanb/bcbb/blob/master/gff/Scripts/gff/gff_to_genbank.py but the output formatting doesn't parse well (unsure why) and the features are not combined properly (e.g. rps12 is broken into multiple features). I've tried using https://chloe.plastid.org/annotate.html, but the GenBank output only includes CDS features (I want "gene" features and "intron" features too), and again rps12 is broken up. Using https://chlorobox.mpimp-golm.mpg.de/geseq.html works well and properly combines features, including Chloe annotations. I'd like to find a way to run Chloe on many fasta files and get GenBank files without having to manually upload them to GeSeq. Is this a planned feature? Are there other tools that you would recommend? Cheers, Ben — Reply to this email directly, view it on GitHub<#15>, or unsubscribe<https://github.com/notifications/unsubscribe-auth/AD4WKG3Q2BXS4EFK3YFFC6TZTVOD3AVCNFSM6AAAAABNHPP7BCVHI2DSMVQWIX3LMV43ASLTON2WKOZSGQ4TCMBZGAYDAOA>. You are receiving this because you are subscribed to this thread.Message ID: ***@***.***>

bmichanderson · 2024-08-28T07:16:54Z

Thanks Ian!
Yeah, I would like the flat file format (.gb or .gbk or .gbff) for parsing out genes/CDS/introns/etc. for downstream alignments, if that is possible.

ian-small · 2024-11-20T07:15:17Z

Hi Ben, I've just added code so that Chloe now uses GenomicAnnotations.jl (https://github.com/BioJulia/GenomicAnnotations.jl) for any output other than Chloe's internal .sff format
That provides .gbk and .embl as well as .gff (which is now the default, add --no-gff if you don't want it)
I needed some fixes for GenomicAnnotations.jl to handle trans-spliced genes like rps12 which are not yet in the released version of GenomicAnnotations.jl, so temporarily, Chloe depends on GenomicAnnotations.jl#master (i.e. the github repo, not the official Julia repository). Hopefully that doesn't cause any installation issues.
So to get .gbk instead of .gff, run Chloe with --gbk --no-gff
The .gbk files have the feature headers and the sequence, but none of the metadata info in the header is filled
Incidentally, if you're writing your own code to parse out and extract features, imho it is much easier from Chloe's .sff format files than from any other format, and they contain introns as explicit features.

bmichanderson · 2024-11-22T03:40:58Z

Hi Ian,
Thanks for that. One of the reasons I want the GenBank format is because I already have scripts for parsing those files and I want to be able to download data from GenBank and parse it as well as new assemblies. You're probably right that the other format would be easier to parse, and perhaps I should do that at some point, but then what would you recommend for downloaded GenBank formats?

Anyway, I'm trying to test your new version but I was unable to install it. Here's the error I got from Julia:

+ cd chloe
+ julia --project=. -e using Pkg; Pkg.instantiate()
    Updating registry at `/tmp/registries/General.toml`
ERROR: Unsatisfiable requirements detected for package UUIDs [cf7118a7]:
 UUIDs [cf7118a7] log:
 ├─possible versions are: 1.10.4 or uninstalled (package in sysimage!)
 └─restricted to versions 1.11.0-1 by Chloe [ca11047d] — no versions left
   └─Chloe [ca11047d] log:
     ├─possible versions are: 0.1.15 or uninstalled
     └─Chloe [ca11047d] is fixed to version 0.1.15
Stacktrace:
  [1] check_constraints(graph::Pkg.Resolve.Graph)
    @ Pkg.Resolve /julia-1.10.4/share/julia/stdlib/v1.10/Pkg/src/Resolve/graphtype.jl:998
  [2] Pkg.Resolve.Graph(compat::Dict{Base.UUID, Dict{VersionNumber, Dict{Base.UUID, Pkg.Versions.VersionSpec}}}, compat_weak::Dict{Base.UUID, Dict{VersionNumber, Set{Base.UUID}}}, uuid_to_name::Dict{Base.UUID, String}, reqs::Dict{Base.UUID, Pkg.Versions.VersionSpec}, fixed::Dict{Base.UUID, Pkg.Resolve.Fixed}, verbose::Bool, julia_version::VersionNumber)
    @ Pkg.Resolve /julia-1.10.4/share/julia/stdlib/v1.10/Pkg/src/Resolve/graphtype.jl:345
  [3] deps_graph(env::Pkg.Types.EnvCache, registries::Vector{Pkg.Registry.RegistryInstance}, uuid_to_name::Dict{Base.UUID, String}, reqs::Dict{Base.UUID, Pkg.Versions.VersionSpec}, fixed::Dict{Base.UUID, Pkg.Resolve.Fixed}, julia_version::VersionNumber, installed_only::Bool)
    @ Pkg.Operations /julia-1.10.4/share/julia/stdlib/v1.10/Pkg/src/Operations.jl:587
  [4] resolve_versions!(env::Pkg.Types.EnvCache, registries::Vector{Pkg.Registry.RegistryInstance}, pkgs::Vector{Pkg.Types.PackageSpec}, julia_version::VersionNumber, installed_only::Bool)
    @ Pkg.Operations /julia-1.10.4/share/julia/stdlib/v1.10/Pkg/src/Operations.jl:407
  [5] up(ctx::Pkg.Types.Context, pkgs::Vector{Pkg.Types.PackageSpec}, level::Pkg.Types.UpgradeLevel; skip_writing_project::Bool, preserve::Nothing)
    @ Pkg.Operations /julia-1.10.4/share/julia/stdlib/v1.10/Pkg/src/Operations.jl:1538
  [6] up
    @ /julia-1.10.4/share/julia/stdlib/v1.10/Pkg/src/Operations.jl:1520 [inlined]
  [7] up(ctx::Pkg.Types.Context, pkgs::Vector{Pkg.Types.PackageSpec}; level::Pkg.Types.UpgradeLevel, mode::Pkg.Types.PackageMode, preserve::Nothing, update_registry::Bool, skip_writing_project::Bool, kwargs::@Kwargs{})
    @ Pkg.API /julia-1.10.4/share/julia/stdlib/v1.10/Pkg/src/API.jl:351
  [8] up
    @ /julia-1.10.4/share/julia/stdlib/v1.10/Pkg/src/API.jl:326 [inlined]
  [9] up
    @ /julia-1.10.4/share/julia/stdlib/v1.10/Pkg/src/API.jl:164 [inlined]
 [10] instantiate(ctx::Pkg.Types.Context; manifest::Nothing, update_registry::Bool, verbose::Bool, platform::Base.BinaryPlatforms.Platform, allow_build::Bool, allow_autoprecomp::Bool, kwargs::@Kwargs{})
    @ Pkg.API /julia-1.10.4/share/julia/stdlib/v1.10/Pkg/src/API.jl:1797
 [11] instantiate
    @ /julia-1.10.4/share/julia/stdlib/v1.10/Pkg/src/API.jl:1773 [inlined]
 [12] instantiate(; kwargs::@Kwargs{})
    @ Pkg.API /julia-1.10.4/share/julia/stdlib/v1.10/Pkg/src/API.jl:1772
 [13] top-level scope
    @ none:1
FATAL:   While performing build: while running engine: exit status 1

ian-small · 2024-11-22T04:02:06Z

Hi Ben, if you upgrade to the latest version of Julia it will fix that problem. I’ll see if I can relax the dependency, though. From: Ben Anderson ***@***.***> Date: Friday, 22 November 2024 at 11:41 AM To: ian-small/Chloe.jl ***@***.***> Cc: Ian Small ***@***.***>, Comment ***@***.***> Subject: Re: [ian-small/Chloe.jl] Feature request - GenBank format (Issue #15) Hi Ian, Thanks for that. One of the reasons I want the GenBank format is because I already have scripts for parsing those files and I want to be able to download data from GenBank and parse it as well as new assemblies. You're probably right that the other format would be easier to parse, and perhaps I should do that at some point, but then what would you recommend for downloaded GenBank formats? Anyway, I'm trying to test your new version but I was unable to install it. Here's the error I got from Julia: + cd chloe + julia --project=. -e using Pkg; Pkg.instantiate() Updating registry at `/tmp/registries/General.toml` ERROR: Unsatisfiable requirements detected for package UUIDs [cf7118a7]: UUIDs [cf7118a7] log: ├─possible versions are: 1.10.4 or uninstalled (package in sysimage!) └─restricted to versions 1.11.0-1 by Chloe [ca11047d] — no versions left └─Chloe [ca11047d] log: ├─possible versions are: 0.1.15 or uninstalled └─Chloe [ca11047d] is fixed to version 0.1.15 Stacktrace: [1] check_constraints(graph::Pkg.Resolve.Graph) @ Pkg.Resolve /julia-1.10.4/share/julia/stdlib/v1.10/Pkg/src/Resolve/graphtype.jl:998 [2] Pkg.Resolve.Graph(compat::Dict{Base.UUID, Dict{VersionNumber, Dict{Base.UUID, Pkg.Versions.VersionSpec}}}, compat_weak::Dict{Base.UUID, Dict{VersionNumber, Set{Base.UUID}}}, uuid_to_name::Dict{Base.UUID, String}, reqs::Dict{Base.UUID, Pkg.Versions.VersionSpec}, fixed::Dict{Base.UUID, Pkg.Resolve.Fixed}, verbose::Bool, julia_version::VersionNumber) @ Pkg.Resolve /julia-1.10.4/share/julia/stdlib/v1.10/Pkg/src/Resolve/graphtype.jl:345 [3] deps_graph(env::Pkg.Types.EnvCache, registries::Vector{Pkg.Registry.RegistryInstance}, uuid_to_name::Dict{Base.UUID, String}, reqs::Dict{Base.UUID, Pkg.Versions.VersionSpec}, fixed::Dict{Base.UUID, Pkg.Resolve.Fixed}, julia_version::VersionNumber, installed_only::Bool) @ Pkg.Operations /julia-1.10.4/share/julia/stdlib/v1.10/Pkg/src/Operations.jl:587 [4] resolve_versions!(env::Pkg.Types.EnvCache, registries::Vector{Pkg.Registry.RegistryInstance}, pkgs::Vector{Pkg.Types.PackageSpec}, julia_version::VersionNumber, installed_only::Bool) @ Pkg.Operations /julia-1.10.4/share/julia/stdlib/v1.10/Pkg/src/Operations.jl:407 [5] up(ctx::Pkg.Types.Context, pkgs::Vector{Pkg.Types.PackageSpec}, level::Pkg.Types.UpgradeLevel; skip_writing_project::Bool, preserve::Nothing) @ Pkg.Operations /julia-1.10.4/share/julia/stdlib/v1.10/Pkg/src/Operations.jl:1538 [6] up @ /julia-1.10.4/share/julia/stdlib/v1.10/Pkg/src/Operations.jl:1520 [inlined] [7] up(ctx::Pkg.Types.Context, pkgs::Vector{Pkg.Types.PackageSpec}; level::Pkg.Types.UpgradeLevel, mode::Pkg.Types.PackageMode, preserve::Nothing, update_registry::Bool, skip_writing_project::Bool, ***@***.***{}) @ Pkg.API /julia-1.10.4/share/julia/stdlib/v1.10/Pkg/src/API.jl:351 [8] up @ /julia-1.10.4/share/julia/stdlib/v1.10/Pkg/src/API.jl:326 [inlined] [9] up @ /julia-1.10.4/share/julia/stdlib/v1.10/Pkg/src/API.jl:164 [inlined] [10] instantiate(ctx::Pkg.Types.Context; manifest::Nothing, update_registry::Bool, verbose::Bool, platform::Base.BinaryPlatforms.Platform, allow_build::Bool, allow_autoprecomp::Bool, ***@***.***{}) @ Pkg.API /julia-1.10.4/share/julia/stdlib/v1.10/Pkg/src/API.jl:1797 [11] instantiate @ /julia-1.10.4/share/julia/stdlib/v1.10/Pkg/src/API.jl:1773 [inlined] [12] instantiate(; ***@***.***{}) @ Pkg.API /julia-1.10.4/share/julia/stdlib/v1.10/Pkg/src/API.jl:1772 [13] top-level scope @ none:1 FATAL: While performing build: while running engine: exit status 1 — Reply to this email directly, view it on GitHub<#15 (comment)>, or unsubscribe<https://github.com/notifications/unsubscribe-auth/AD4WKG7RWIFT4DDO6AAE4QD2B2RWBAVCNFSM6AAAAABNHPP7BCVHI2DSMVQWIX3LMV43OSLTON2WKQ3PNVWWK3TUHMZDIOJSHAYTIOBRGE>. You are receiving this because you commented.Message ID: ***@***.***>

bmichanderson · 2024-11-22T04:12:38Z

Thanks Ian. I tried updating Julia to v. 1.11.1 and it finished the install. I'm now hitting a different error when attempting to annotate. Here's my command (to a Singularity container with the install):

julia --project=/chloe /chloe/chloe.jl annotate --reference=/chloe_references --gbk --no-gff *.fasta

But it doesn't load the references

[ Info: [CUN1-01] seq length: 157942bp
[ Info: [CUN1-01] using 0 reference(s): 0.615s
ERROR: LoadError: BoundsError: attempt to access 3-element Vector{SubString{String}} at index [4]
Stacktrace:
  [1] throw_boundserror(A::Vector{SubString{String}}, I::Tuple{Int64})
    @ Base ./essentials.jl:14
...

I can include more of the error message if helpful.
Here's my Singularity recipe: https://github.com/bmichanderson/singularity-containers/blob/master/Singularity.chloe

ian-small · 2024-11-22T04:48:20Z

Hi Ben, You’ll need to add Chloe again to the new version of Julia, it keeps the packages separate between versions. You may need to instantiate again too. I’ve taken the compatibility requirement for 1.11 out, and now it runs fine under 1.10.4 on my machine, so if you still have 1.10 around and can’t get 1.11 to work you can try that. Good luck Ian

bmichanderson · 2024-11-22T05:00:32Z

Hi Ian,
Yeah, installing for 1.11 meant rebuilding the Singularity container, so Chloe was again cloned and instantiated in that process.
Rebuilding for 1.10.4 again results in the same error and inability to load references. I'm not sure why that is (worked before).

bmichanderson · 2024-11-22T05:04:04Z

Update: adding the argument -r cp to my command and now it loads references

bmichanderson · 2024-11-22T05:08:39Z

Update 2: my parser (https://github.com/bmichanderson/scripts/blob/master/genbank_parse.py) doesn't like the GenBank formatted files.

/home/benjamin/.local/lib/python3.8/site-packages/Bio/GenBank/Scanner.py:1553: BiopythonParserWarning: Malformed LOCUS line found - is this correct?
:'LOCUS       unknown    157942 bp     DNA\n'
  warnings.warn(
/home/benjamin/.local/lib/python3.8/site-packages/Bio/GenBank/__init__.py:727: BiopythonParserWarning: Could not parse feature location 'join(143142..143373'; setting feature location to None.
  warnings.warn(
/home/benjamin/.local/lib/python3.8/site-packages/Bio/GenBank/__init__.py:727: BiopythonParserWarning: Could not parse feature location 'join(complement(101144..101169)'; setting feature location to None.
  warnings.warn(
/home/benjamin/.local/lib/python3.8/site-packages/Bio/GenBank/Scanner.py:1234: BiopythonParserWarning: Invalid indentation for sequence line
  warnings.warn(
Traceback (most recent call last):
  File "/home/benjamin/scripts/genbank_parse.py", line 67, in <module>
    for gbk in gbks:
  File "/home/benjamin/.local/lib/python3.8/site-packages/Bio/SeqIO/Interfaces.py", line 85, in __next__
    return next(self.records)
  File "/home/benjamin/.local/lib/python3.8/site-packages/Bio/GenBank/Scanner.py", line 515, in parse_records
    record = self.parse(handle, do_features)
  File "/home/benjamin/.local/lib/python3.8/site-packages/Bio/GenBank/Scanner.py", line 498, in parse
    if self.feed(handle, consumer, do_features):
  File "/home/benjamin/.local/lib/python3.8/site-packages/Bio/GenBank/Scanner.py", line 474, in feed
    misc_lines, sequence_string = self.parse_footer()
  File "/home/benjamin/.local/lib/python3.8/site-packages/Bio/GenBank/Scanner.py", line 1239, in parse_footer
    raise ValueError(f"Sequence line mal-formed, '{line}'")
ValueError: Sequence line mal-formed, '      1 TGCTTATGTT AAGTAAATAA AACATTACTT AATTTTAAAC TTAATTCTAA ATTAAGGAGC'

ian-small · 2024-11-22T05:28:08Z

You’d need to take that up with the author(s) of GenomicAnnotations.jl… (or I can on your behalf, if you can tell me how the output deviates from the spec)

bmichanderson · 2024-11-22T07:46:59Z

Thanks Ian.

In the mal-formed sequence ValueError, it appears that the "Scanner.py" of BioPython (https://github.com/biopython/biopython/blob/master/Bio/GenBank/Scanner.py) raises an error when it detects improper indentation of the sequence. Adding a single space to each line of the sequence entry at the bottom of the file removes that error.

The header line is improper (missing at least one field). It needs:
"LOCUS" "name" "length" "molecule type" "other field" "date"

I think it couldn't parse features with incorrectly formatted "join" and "complement" (?)
Their way: join(complement(72814..73059),complement(73723..74014),complement(74819..74889))
Correct way: complement(join(72814..73059,73723..74014,74819..74889))

Actually, the errors were raised when they nested another "join" inside a "join"

join(complement(72533..72646),join(complement(101144..101169),complement(101706..101937)))

(these are related to rps12)

bmichanderson · 2024-11-22T09:01:51Z

Note @kdyrhage's comment here as well:
BioJulia/GenomicAnnotations.jl#16

The join(complement(...),complement(...)) is not equivalent to complement(join(... , ....))

ian-small · 2024-11-22T11:15:09Z

Yes, indeed, Chloe was joining the complemented fragments in the wrong order, my bad
But now @kdyrhage has fixed the original issue in GenomicAnnotations.jl, I can go back to complement(join...))
I didn't realise nested joins weren't allowed; I've attempted to avoid that for rps12 now. All looks OK for the genomes I've tested...

bmichanderson · 2024-11-22T11:53:07Z

It still seems to be having a problem.

Here is the re-built (after your recent push) Chloe's rps12 CDS feature:
join(complement(72533..72646),complement(101144..101169),complement(101706..101937))

Here it is when downloaded from Chloe via GeSeq
complement(join(101144..101169,101706..101937,72533..72646))

ian-small · 2024-11-22T12:35:48Z

Try now; should be correct, if not as neat as the GeSeq version
And I've added a basic header, waiting on GenomicsAnnotations to do it properly
Thanks for catching all these bugs, by the way!

bmichanderson · 2024-11-22T13:06:37Z

Looks good!

rps12 in a different sample:

re-built Chloe:
join(complement(72608..72721),142743..142974,143511..143536)
join(complement(72608..72721),complement(101308..101539),complement(100746..100771))

Chloe via GeSeq:
join(complement(72608..72721),142743..142974,143511..143536)
complement(join(100746..100771,101308..101539,72608..72721))

It still complains a bit about the header (one too many fields now -- removed circular and put more spaces between PLN and the date and it doesn't complain anymore). Not a big deal to me, since I can adjust it pretty simply I think.

Only the sequence indentation needs fixing now (assuming for the GenomicsAnnotations people).

bmichanderson · 2024-11-22T13:12:00Z

Ah, another minor issue. The CDS features are missing a "/gene=..." field, which affects my parsing script.
(I think all features, e.g. tRNA and introns as well, have a /gene field)

ian-small · 2024-11-22T13:40:52Z

OK, added a 'gene' attribute to CDS/intron/tRNA/rRNA entries.

bmichanderson · 2024-11-25T02:49:21Z

Thanks Ian, looks good!
And the sequence is properly formatted now (saw the issue on GenomicAnnotations)!
I think it is now working fine (except for header edits and the addition of e.g. ORGANISM info that I can add), so you can probably close this issue if you like.

Provide feedback

Saved searches

Use saved searches to filter your results more quickly

Feature request - GenBank format #15

Feature request - GenBank format #15

bmichanderson commented Aug 28, 2024

ian-small commented Aug 28, 2024 via email

bmichanderson commented Aug 28, 2024

ian-small commented Nov 20, 2024

bmichanderson commented Nov 22, 2024

ian-small commented Nov 22, 2024 via email

bmichanderson commented Nov 22, 2024 •

edited

Loading

ian-small commented Nov 22, 2024 via email

bmichanderson commented Nov 22, 2024

bmichanderson commented Nov 22, 2024

bmichanderson commented Nov 22, 2024

ian-small commented Nov 22, 2024 via email

bmichanderson commented Nov 22, 2024 •

edited

Loading

bmichanderson commented Nov 22, 2024

ian-small commented Nov 22, 2024

bmichanderson commented Nov 22, 2024

ian-small commented Nov 22, 2024

bmichanderson commented Nov 22, 2024

bmichanderson commented Nov 22, 2024 •

edited

Loading

ian-small commented Nov 22, 2024

bmichanderson commented Nov 25, 2024

Feature request - GenBank format #15

Feature request - GenBank format #15

Comments

bmichanderson commented Aug 28, 2024

ian-small commented Aug 28, 2024 via email

bmichanderson commented Aug 28, 2024

ian-small commented Nov 20, 2024

bmichanderson commented Nov 22, 2024

ian-small commented Nov 22, 2024 via email

bmichanderson commented Nov 22, 2024 • edited Loading

ian-small commented Nov 22, 2024 via email

bmichanderson commented Nov 22, 2024

bmichanderson commented Nov 22, 2024

bmichanderson commented Nov 22, 2024

ian-small commented Nov 22, 2024 via email

bmichanderson commented Nov 22, 2024 • edited Loading

bmichanderson commented Nov 22, 2024

ian-small commented Nov 22, 2024

bmichanderson commented Nov 22, 2024

ian-small commented Nov 22, 2024

bmichanderson commented Nov 22, 2024

bmichanderson commented Nov 22, 2024 • edited Loading

ian-small commented Nov 22, 2024

bmichanderson commented Nov 25, 2024

bmichanderson commented Nov 22, 2024 •

edited

Loading

bmichanderson commented Nov 22, 2024 •

edited

Loading

bmichanderson commented Nov 22, 2024 •

edited

Loading