Convert Model Collections to hdf5? #195

Open
ethanplunkett opened this issue Jun 26, 2024 · 0 comments

ethanplunkett commented Jun 26, 2024

@Miguel-Fuentes has requested that we make models from the model collections retrievable and usable by Python.
Currently the collections store the models as serialized R objects (.Rds files) and therefore are only readable by R.

Two possible solutions:

  1. Keep models as .Rds. To use in Python you'd have to:
  2. Store the models as hdf5 files
    • This requires changing the code that builds the model collections and the code that downloads the models.
    • export_birdflow() and import_birdflow() now work with fully fit, imported and re-exported models.

hdf5 BirdFlow Model file states

There are several states of an hdf5 file that contains a BirdFlow model.

  1. Preprocessed model, created by preprocess_species(). Relative to a fitted and imported model, these:
    • Lack marginals and metadata related to fitting.
    • Have the 1st distribution and date duplicated as the 53rd distribution and as an extra (53rd) column in the dates component.
    • Have a distances component with great circle distances.
  2. Initial fitted model. BirdFlowPy makes minimal changes:
    • Adds the marginals.
    • Adds metadata on the fitting process.
  3. Imported and re-exported models. These are fairly different from the initial fitted model because of changes made during and after import.
    • import_birdflow():
      • Renames marginals to match BirdFlowR's convention (e.g. "M_01-02").
      • Makes the last marginal link to the first: "M_52-01".
      • Drops great circle distances. They can be recreated quickly in R.
      • Drops the 53rd distribution and the 53rd column in dates; these were duplicates of the first.
      • Adds a marginal index.
    • Other functions may edit the imported BirdFlowR model and thus affect a re-exported hdf5.
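
To make the renaming and de-duplication steps above concrete, here is a minimal Python sketch. It is not BirdFlowR or BirdFlowPy code: the function names are hypothetical, and it assumes 52 timesteps with zero-padded two-digit labels, which is what the "M_01-02" and "M_52-01" examples suggest.

```python
def marginal_names(n_timesteps=52):
    """Build cyclical marginal names: M_01-02, ..., M_51-52, M_52-01.

    Hypothetical helper; n_timesteps=52 is an assumption based on the
    53rd distribution being a duplicate of the 1st.
    """
    names = []
    for start in range(1, n_timesteps + 1):
        # The last timestep wraps around and links back to the first.
        end = start % n_timesteps + 1
        names.append(f"M_{start:02d}-{end:02d}")
    return names


def drop_duplicate_last(items):
    """Drop the final element if it repeats the first (the extra 53rd copy)."""
    if len(items) > 1 and items[-1] == items[0]:
        return items[:-1]
    return items


names = marginal_names()
print(names[0], names[-1])
```

A Python reader of the hdf5 files would need to know which of the three states a file is in before relying on names like these, since a preprocessed or freshly fitted file will not have been through this renaming.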

Over time things have been added to the file so older files may differ in various ways.

Pros and cons of switching to hdf5 files for collections:

Pro

  • Allows downloading and using the fitted model directly in Python (or another non-R language).
  • Open format.

Con

  • .Rds is an exact and complete representation of the R object.
  • Exporting to and importing from hdf5 isn't perfect, and it's hard to verify that no data is lost or corrupted during the process. Thus using hdf5 files creates a bit of a maintenance hassle. For instance, I have no idea if a model with transitions (add_transitions()) will export and re-import properly.
  • Reading an hdf5 file is slower in R than reading a .Rds file.
  • Importing a model from a collection would require the rhdf5 package. This eliminates one of the benefits of breaking up the package.
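
One way to approach the verification concern is a structural round-trip check. The sketch below is only an illustration: plain nested dicts stand in for a model's contents before export and after re-import, and the component names (including "transitions" and "T_01-02") are made up for the example. A real check would have to read the actual hdf5 contents and compare numeric arrays with a tolerance.

```python
def diff_paths(before, after, path=""):
    """Return the paths at which two nested dicts differ.

    Hypothetical helper: 'before' and 'after' are stand-ins for a model's
    components before export and after re-import.
    """
    if isinstance(before, dict) and isinstance(after, dict):
        diffs = []
        for key in sorted(set(before) | set(after)):
            child = f"{path}/{key}"
            if key not in before:
                diffs.append(f"{child} (added)")
            elif key not in after:
                diffs.append(f"{child} (missing)")
            else:
                diffs.extend(diff_paths(before[key], after[key], child))
        return diffs
    # Leaves are compared exactly; real data would need a tolerance.
    return [] if before == after else [f"{path} (changed)"]


# Made-up example: the re-imported model has lost its transitions.
original = {
    "dates": [1, 2, 3],
    "marginals": {"M_01-02": [0.5, 0.5]},
    "transitions": {"T_01-02": [1.0]},
}
reimported = {
    "dates": [1, 2, 3],
    "marginals": {"M_01-02": [0.5, 0.5]},
}
print(diff_paths(original, reimported))
```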