Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Feature request - GenBank format #15

Open
bmichanderson opened this issue Aug 28, 2024 · 20 comments
Open

Feature request - GenBank format #15

bmichanderson opened this issue Aug 28, 2024 · 20 comments

Comments

@bmichanderson
Copy link

Hi Ian,

Thanks for making this great tool available!
I've been able to get this running in a Singularity container on my machine, but I would like to have output in GenBank format (I can't find an option for that).

Generating a GFF3 file works, but I am having issues converting it to GenBank format (e.g. combining features). I've tried https://github.com/chapmanb/bcbb/blob/master/gff/Scripts/gff/gff_to_genbank.py but the output formatting doesn't parse well (unsure why) and the features are not combined properly (e.g. rps12 is broken into multiple features).

I've tried using https://chloe.plastid.org/annotate.html, but the GenBank output only includes CDS features (I want "gene" features and "intron" features too), and again rps12 is broken up.

Using https://chlorobox.mpimp-golm.mpg.de/geseq.html works well and properly combines features, including Chloe annotations.
I'd like to find a way to run Chloe on many fasta files and get GenBank files without having to manually upload them to GeSeq.

Is this a planned feature? Are there other tools that you would recommend?

Cheers,
Ben

@ian-small
Copy link
Owner

ian-small commented Aug 28, 2024 via email

@bmichanderson
Copy link
Author

Thanks Ian!
Yeah, I would like the flat file format (.gb or .gbk or .gbff) for parsing out genes/CDS/introns/etc. for downstream alignments, if that is possible.

@ian-small
Copy link
Owner

Hi Ben, I've just added code so that Chloe now uses GenomicAnnotations.jl (https://github.com/BioJulia/GenomicAnnotations.jl) for any output other than Chloe's internal .sff format
That provides .gbk and .embl as well as .gff (which is now the default, add --no-gff if you don't want it)
I needed some fixes for GenomicAnnotations.jl to handle trans-spliced genes like rps12 which are not yet in the released version of GenomicAnnotations.jl, so temporarily, Chloe depends on GenomicAnnotations.jl#master (i.e. the github repo, not the official Julia repository). Hopefully that doesn't cause any installation issues.
So to get .gbk instead of .gff, run Chloe with --gbk --no-gff
The .gbk files have the feature headers and the sequence, but none of the metadata info in the header is filled
Incidentally, if you're writing your own code to parse out and extract features, imho it is much easier from Chloe's .sff format files than from any other format, and they contain introns as explicit features.

@bmichanderson
Copy link
Author

Hi Ian,
Thanks for that. One of the reasons I want the GenBank format is because I already have scripts for parsing those files and I want to be able to download data from GenBank and parse it as well as new assemblies. You're probably right that the other format would be easier to parse, and perhaps I should do that at some point, but then what would you recommend for downloaded GenBank formats?

Anyway, I'm trying to test your new version but I was unable to install it. Here's the error I got from Julia:

+ cd chloe
+ julia --project=. -e using Pkg; Pkg.instantiate()
    Updating registry at `/tmp/registries/General.toml`
ERROR: Unsatisfiable requirements detected for package UUIDs [cf7118a7]:
 UUIDs [cf7118a7] log:
 ├─possible versions are: 1.10.4 or uninstalled (package in sysimage!)
 └─restricted to versions 1.11.0-1 by Chloe [ca11047d] — no versions left
   └─Chloe [ca11047d] log:
     ├─possible versions are: 0.1.15 or uninstalled
     └─Chloe [ca11047d] is fixed to version 0.1.15
Stacktrace:
  [1] check_constraints(graph::Pkg.Resolve.Graph)
    @ Pkg.Resolve /julia-1.10.4/share/julia/stdlib/v1.10/Pkg/src/Resolve/graphtype.jl:998
  [2] Pkg.Resolve.Graph(compat::Dict{Base.UUID, Dict{VersionNumber, Dict{Base.UUID, Pkg.Versions.VersionSpec}}}, compat_weak::Dict{Base.UUID, Dict{VersionNumber, Set{Base.UUID}}}, uuid_to_name::Dict{Base.UUID, String}, reqs::Dict{Base.UUID, Pkg.Versions.VersionSpec}, fixed::Dict{Base.UUID, Pkg.Resolve.Fixed}, verbose::Bool, julia_version::VersionNumber)
    @ Pkg.Resolve /julia-1.10.4/share/julia/stdlib/v1.10/Pkg/src/Resolve/graphtype.jl:345
  [3] deps_graph(env::Pkg.Types.EnvCache, registries::Vector{Pkg.Registry.RegistryInstance}, uuid_to_name::Dict{Base.UUID, String}, reqs::Dict{Base.UUID, Pkg.Versions.VersionSpec}, fixed::Dict{Base.UUID, Pkg.Resolve.Fixed}, julia_version::VersionNumber, installed_only::Bool)
    @ Pkg.Operations /julia-1.10.4/share/julia/stdlib/v1.10/Pkg/src/Operations.jl:587
  [4] resolve_versions!(env::Pkg.Types.EnvCache, registries::Vector{Pkg.Registry.RegistryInstance}, pkgs::Vector{Pkg.Types.PackageSpec}, julia_version::VersionNumber, installed_only::Bool)
    @ Pkg.Operations /julia-1.10.4/share/julia/stdlib/v1.10/Pkg/src/Operations.jl:407
  [5] up(ctx::Pkg.Types.Context, pkgs::Vector{Pkg.Types.PackageSpec}, level::Pkg.Types.UpgradeLevel; skip_writing_project::Bool, preserve::Nothing)
    @ Pkg.Operations /julia-1.10.4/share/julia/stdlib/v1.10/Pkg/src/Operations.jl:1538
  [6] up
    @ /julia-1.10.4/share/julia/stdlib/v1.10/Pkg/src/Operations.jl:1520 [inlined]
  [7] up(ctx::Pkg.Types.Context, pkgs::Vector{Pkg.Types.PackageSpec}; level::Pkg.Types.UpgradeLevel, mode::Pkg.Types.PackageMode, preserve::Nothing, update_registry::Bool, skip_writing_project::Bool, kwargs::@Kwargs{})
    @ Pkg.API /julia-1.10.4/share/julia/stdlib/v1.10/Pkg/src/API.jl:351
  [8] up
    @ /julia-1.10.4/share/julia/stdlib/v1.10/Pkg/src/API.jl:326 [inlined]
  [9] up
    @ /julia-1.10.4/share/julia/stdlib/v1.10/Pkg/src/API.jl:164 [inlined]
 [10] instantiate(ctx::Pkg.Types.Context; manifest::Nothing, update_registry::Bool, verbose::Bool, platform::Base.BinaryPlatforms.Platform, allow_build::Bool, allow_autoprecomp::Bool, kwargs::@Kwargs{})
    @ Pkg.API /julia-1.10.4/share/julia/stdlib/v1.10/Pkg/src/API.jl:1797
 [11] instantiate
    @ /julia-1.10.4/share/julia/stdlib/v1.10/Pkg/src/API.jl:1773 [inlined]
 [12] instantiate(; kwargs::@Kwargs{})
    @ Pkg.API /julia-1.10.4/share/julia/stdlib/v1.10/Pkg/src/API.jl:1772
 [13] top-level scope
    @ none:1
FATAL:   While performing build: while running engine: exit status 1

@ian-small
Copy link
Owner

ian-small commented Nov 22, 2024 via email

@bmichanderson
Copy link
Author

bmichanderson commented Nov 22, 2024

Thanks Ian. I tried updating Julia to v. 1.11.1 and it finished the install. I'm now hitting a different error when attempting to annotate. Here's my command (to a Singularity container with the install):

julia --project=/chloe /chloe/chloe.jl annotate --reference=/chloe_references --gbk --no-gff *.fasta

But it doesn't load the references

[ Info: [CUN1-01] seq length: 157942bp
[ Info: [CUN1-01] using 0 reference(s): 0.615s
ERROR: LoadError: BoundsError: attempt to access 3-element Vector{SubString{String}} at index [4]
Stacktrace:
  [1] throw_boundserror(A::Vector{SubString{String}}, I::Tuple{Int64})
    @ Base ./essentials.jl:14
...

I can include more of the error message if helpful.
Here's my Singularity recipe: https://github.com/bmichanderson/singularity-containers/blob/master/Singularity.chloe

@ian-small
Copy link
Owner

ian-small commented Nov 22, 2024 via email

@bmichanderson
Copy link
Author

Hi Ian,
Yeah, installing for 1.11 meant rebuilding the Singularity container, so Chloe was again cloned and instantiated in that process.
Rebuilding for 1.10.4 again results in the same error and inability to load references. I'm not sure why that is (worked before).

@bmichanderson
Copy link
Author

Update: adding the argument -r cp to my command and now it loads references

@bmichanderson
Copy link
Author

Update 2: my parser (https://github.com/bmichanderson/scripts/blob/master/genbank_parse.py) doesn't like the GenBank formatted files.

/home/benjamin/.local/lib/python3.8/site-packages/Bio/GenBank/Scanner.py:1553: BiopythonParserWarning: Malformed LOCUS line found - is this correct?
:'LOCUS       unknown    157942 bp     DNA\n'
  warnings.warn(
/home/benjamin/.local/lib/python3.8/site-packages/Bio/GenBank/__init__.py:727: BiopythonParserWarning: Could not parse feature location 'join(143142..143373'; setting feature location to None.
  warnings.warn(
/home/benjamin/.local/lib/python3.8/site-packages/Bio/GenBank/__init__.py:727: BiopythonParserWarning: Could not parse feature location 'join(complement(101144..101169)'; setting feature location to None.
  warnings.warn(
/home/benjamin/.local/lib/python3.8/site-packages/Bio/GenBank/Scanner.py:1234: BiopythonParserWarning: Invalid indentation for sequence line
  warnings.warn(
Traceback (most recent call last):
  File "/home/benjamin/scripts/genbank_parse.py", line 67, in <module>
    for gbk in gbks:
  File "/home/benjamin/.local/lib/python3.8/site-packages/Bio/SeqIO/Interfaces.py", line 85, in __next__
    return next(self.records)
  File "/home/benjamin/.local/lib/python3.8/site-packages/Bio/GenBank/Scanner.py", line 515, in parse_records
    record = self.parse(handle, do_features)
  File "/home/benjamin/.local/lib/python3.8/site-packages/Bio/GenBank/Scanner.py", line 498, in parse
    if self.feed(handle, consumer, do_features):
  File "/home/benjamin/.local/lib/python3.8/site-packages/Bio/GenBank/Scanner.py", line 474, in feed
    misc_lines, sequence_string = self.parse_footer()
  File "/home/benjamin/.local/lib/python3.8/site-packages/Bio/GenBank/Scanner.py", line 1239, in parse_footer
    raise ValueError(f"Sequence line mal-formed, '{line}'")
ValueError: Sequence line mal-formed, '      1 TGCTTATGTT AAGTAAATAA AACATTACTT AATTTTAAAC TTAATTCTAA ATTAAGGAGC'

@ian-small
Copy link
Owner

ian-small commented Nov 22, 2024 via email

@bmichanderson
Copy link
Author

bmichanderson commented Nov 22, 2024

Thanks Ian.

In the mal-formed sequence ValueError, it appears that the "Scanner.py" of BioPython (https://github.com/biopython/biopython/blob/master/Bio/GenBank/Scanner.py) raises an error when it detects improper indentation of the sequence. Adding a single space to each line of the sequence entry at the bottom of the file removes that error.

The header line is improper (missing at least one field). It needs:
"LOCUS" "name" "length" "molecule type" "other field" "date"

I think it couldn't parse features with incorrectly formatted "join" and "complement" (?)
Their way: join(complement(72814..73059),complement(73723..74014),complement(74819..74889))
Correct way: complement(join(72814..73059,73723..74014,74819..74889))

Actually, the errors were raised when they nested another "join" inside a "join"

join(complement(72533..72646),join(complement(101144..101169),complement(101706..101937)))

(these are related to rps12)

@bmichanderson
Copy link
Author

Note @kdyrhage's comment here as well:
BioJulia/GenomicAnnotations.jl#16

The join(complement(...),complement(...)) is not equivalent to complement(join(... , ....))

@ian-small
Copy link
Owner

Yes, indeed, Chloe was joining the complemented fragments in the wrong order, my bad
But now @kdyrhage has fixed the original issue in GenomicAnnotations.jl, I can go back to complement(join...))
I didn't realise nested joins weren't allowed; I've attempted to avoid that for rps12 now. All looks OK for the genomes I've tested...

@bmichanderson
Copy link
Author

It still seems to be having a problem.

Here is the re-built (after your recent push) Chloe's rps12 CDS feature:
join(complement(72533..72646),complement(101144..101169),complement(101706..101937))

Here it is when downloaded from Chloe via GeSeq
complement(join(101144..101169,101706..101937,72533..72646))

@ian-small
Copy link
Owner

Try now; should be correct, if not as neat as the GeSeq version
And I've added a basic header, waiting on GenomicsAnnotations to do it properly
Thanks for catching all these bugs, by the way!

@bmichanderson
Copy link
Author

Looks good!

rps12 in a different sample:

re-built Chloe:
join(complement(72608..72721),142743..142974,143511..143536)
join(complement(72608..72721),complement(101308..101539),complement(100746..100771))

Chloe via GeSeq:
join(complement(72608..72721),142743..142974,143511..143536)
complement(join(100746..100771,101308..101539,72608..72721))

It still complains a bit about the header (one too many fields now -- removed circular and put more spaces between PLN and the date and it doesn't complain anymore). Not a big deal to me, since I can adjust it pretty simply I think.

Only the sequence indentation needs fixing now (assuming for the GenomicsAnnotations people).

@bmichanderson
Copy link
Author

bmichanderson commented Nov 22, 2024

Ah, another minor issue. The CDS features are missing a "/gene=..." field, which affects my parsing script.
(I think all features, e.g. tRNA and introns as well, have a /gene field)

@ian-small
Copy link
Owner

OK, added a 'gene' attribute to CDS/intron/tRNA/rRNA entries.

@bmichanderson
Copy link
Author

Thanks Ian, looks good!
And the sequence is properly formatted now (saw the issue on GenomicAnnotations)!
I think it is now working fine (except for header edits and the addition of e.g. ORGANISM info that I can add), so you can probably close this issue if you like.

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
None yet
Projects
None yet
Development

No branches or pull requests

3 participants
@ian-small @bmichanderson and others