The intention is that end-users wouldn't have to do this step. Instead an organisation (e.g. Open Climate Fix and/or dynamical.org) would perform this step and publish the metadata.
- Get a list of init datetimes, ensemble members, and steps.
- Record if/when the number of ensemble members and/or steps changes.
- Get a list of parameters and vertical levels by reading the bodies of a minimal set of `.idx` files.
- Decode the parameter abbreviation string and the string summarising the vertical level using the `grib_tables` sub-crate (so the user gets more information about what these mean, and so the levels can be put into order).
- Get the horizontal spatial coordinates.
- Record the dimension names, array shape, and coordinate labels in a JSON file (see the sketch after this list). Record the decoded GRIB parameter names and GRIB vertical levels so the end-user doesn't need to use `grib_tables` (maybe have a mapping from each abbreviation string used in the dataset to the full GRIB ProductTemplate). Also record when the coordinates change. Changes in horizontal resolution probably have to be loaded as different xarray datasets (see #15 and #17).
- Implement an efficient way to update the `hypergrib` metadata (e.g. when NODD publishes new forecasts).
- Also need to decode `.idx` parameter strings like this (from HRRR): `var discipline=0 center=7 local_table=1 parmcat=16 parm=201`
- Open other GRIB datasets. (If we have to parse the step from the body of `.idx` files then consider using `nom`.)
- Optimise the extraction of the horizontal spatial coords from the GRIBs by only loading the relevant sections from the GRIBs (using the `.idx` files). Although this optimisation isn't urgent: users will never have to run this step.
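For illustration, here is a minimal sketch of what the recorded metadata JSON might contain. All field names (`dims`, `coords`, `shape`, `param_table`) and the filename `metadata.json` are hypothetical, not a finalised `hypergrib` schema:

```python
import json

# Hypothetical structure for the dataset metadata produced by this step.
# Field names and values are illustrative only.
metadata = {
    "dims": ["init_time", "variable", "vertical_level", "step", "ensemble_member"],
    "coords": {
        "init_time": ["2024-01-01T00", "2024-01-01T06"],
        "variable": ["TMP", "UGRD", "VGRD"],
        "vertical_level": ["2 m above ground", "10 m above ground"],
        "step": ["0h", "3h", "6h"],
        "ensemble_member": ["c00", "p01", "p02"],
    },
    "shape": [2, 3, 2, 3, 3],
    # Mapping from each abbreviation string used in the dataset to the decoded
    # GRIB parameter info, so end-users don't need grib_tables at query time.
    "param_table": {
        "TMP": {"name": "Temperature", "units": "K",
                "discipline": 0, "category": 0, "number": 0},
    },
}

with open("metadata.json", "w") as f:
    json.dump(metadata, f, indent=2)
```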
Open dataset:
```python
da = xr.open_dataset(URL, engine="hypergrib")
```
`hypergrib` loads the metadata and passes to xarray the full list of coordinates and dimension names, e.g.:
```python
dims = ["init_time", "variable", "vertical_level", "timestep", "ensemble_member"]
coords = {
    "init_time": ["2024-01-01", "2024-01-02"],
    # etc.
}
```
User request:
```python
da.sel(
    init_time="2024-01-01",
    nwp_variable=["temperature", "wind_speed"],
    vertical_level="2meters",
    # all forecast time steps
    # all ensemble members
)
```
xarray converts these coordinate labels to integer indexes:
```python
da.isel(
    init_time=0,
    nwp_variable=[0, 1],
    vertical_level=0,
    # all forecast time steps
    # all ensemble members
)
```
The integer indexes get passed to the `hypergrib` backend for xarray. (In the future, `hypergrib` may implement a custom xarray index, so we can avoid the redundant conversion to integer indexes and back to coordinate labels.)
- Load the `hypergrib` metadata (which was produced by step 1).
- Convert integer indices back to coordinate labels by looking up the appropriate labels in `hypergrib`'s coords arrays.
- Find the unique tuples of init date, init hour, ensemble member, and step.
- Algorithmically generate the location of all the `.idx` files we need (see the sketch after this list). For example, the GEFS location strings look like this: `noaa-gefs-pds/gefs.<init date>/<init hour>/pgrb2b/gep<ensemble member>.t<init hour>z.pgrb2af<step>`
- In parallel, submit GET requests for all these `.idx` files.
- As soon as an `.idx` file arrives, decode it, look up the byte ranges of the GRIB messages we need, and immediately submit GET requests for those byte ranges of the GRIB file. (This step is probably so fast that we don't need to multi-thread it... for the MVP, let's use a single thread for decoding `.idx` files, and if that's too slow then we can add more threads.) Maybe stop decoding rows in the `.idx` file once we've found all the metadata we need.
- If an `.idx` file doesn't exist then:
  - Allow the user to determine what happens if `hypergrib` tries but fails to read an `.idx` file. Three options:
    - Silent: Don't complain about the missing `.idx`. Just load the GRIB, scan it, and keep it in memory (because we'll soon extract binary data from it).
    - Warn: Log a warning about the missing `.idx`. And load the GRIB, scan it, and keep it in memory.
    - Fail: Complain loudly about the missing `.idx`! Don't load the GRIB.
  - (Maybe, in a future version, we could offer the option to generate and cache `.idx` files locally.)
- If no GRIB exists then log another warning and insert the MISSING DATA indicator into the array (which will probably be NaN for floating point data).
- As soon as GRIB data arrives, decode it, and place it into the final array. Decoding GRIB data should be multi-threaded.
- Benchmark! See recent discussion on "Large Scale Geospatial Benchmarks" on the Pangeo forum.
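As a rough illustration of the middle steps above (generating `.idx` locations from the unique tuples, then decoding each `.idx` inventory into byte ranges), here is a minimal Python sketch. The helper names, the `.idx` filename suffix, and the exact inventory line format are assumptions based on the GEFS example path and typical wgrib2-style inventories, not `hypergrib`'s actual implementation:

```python
from itertools import product

def idx_urls(init_dates, init_hours, members, steps):
    """Algorithmically generate the .idx object keys for every unique tuple."""
    for date, hour, member, step in product(init_dates, init_hours, members, steps):
        yield (
            f"noaa-gefs-pds/gefs.{date}/{hour}/pgrb2b/"
            f"gep{member}.t{hour}z.pgrb2af{step}.idx"
        )

def byte_ranges(idx_text, wanted_params, wanted_levels):
    """Decode a wgrib2-style .idx inventory and return (start, end) byte ranges
    for the GRIB messages we need. `end is None` means 'to the end of the file'."""
    lines = idx_text.strip().splitlines()
    ranges = []
    for i, line in enumerate(lines):
        # Typical line: "31:38562919:d=2024010100:TMP:2 m above ground:3 hour fcst:ENS=+1"
        _msg_num, start, _init, param, level, *_ = line.split(":")
        if param in wanted_params and level in wanted_levels:
            end = int(lines[i + 1].split(":")[1]) - 1 if i + 1 < len(lines) else None
            ranges.append((int(start), end))
    return ranges

# Example: one init, two ensemble members, two steps -> four .idx GET requests.
for url in idx_urls(["20240101"], ["00"], ["01", "02"], ["06", "12"]):
    print(url)
```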
Allow the user to set a threshold for when to load `.idx` files. If the user requests more than THRESHOLD% of the GRIB messages in any GRIB file then skip the `.idx` and just load the GRIB. Otherwise, attempt to load the `.idx`. (The motivation being that, if the user wants to read most of the GRIB file, then loading the `.idx` first will add unnecessary latency.)

Set the threshold to 100% to always try to load the `.idx` file before the GRIB. Set the threshold to 0% to never load the `.idx`, and always load the GRIB file first. See #17.
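A minimal sketch of this decision rule, assuming a hypothetical `idx_threshold` setting expressed as a fraction:

```python
def should_load_idx(n_messages_wanted, n_messages_in_grib, idx_threshold=1.0):
    """Decide whether to fetch the .idx before the GRIB file.

    idx_threshold is a hypothetical user setting in [0, 1]:
      1.0 -> always try the .idx first; 0.0 -> always load the GRIB directly.
    """
    fraction_wanted = n_messages_wanted / n_messages_in_grib
    return fraction_wanted <= idx_threshold
```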
- Keep a few hundred network requests in-flight at any given moment (user configurable). (Why? Because the AnyBlob paper suggests that this is what's required to achieve max throughput.)
- Consolidate nearby byte-range requests (user configurable) to minimise overhead and reduce the total number of IO operations. (See the sketch after this list.)
- (Maybe we could even read just part of each GRIB message. For example, some GRIBs are compressed in JPEG2000, and JPEG2000 allows parts of the image to be decompressed. And maybe, whilst making the manifest, we could decompress each GRIB file and save the state of the decompressor every, say, 4 kB. Then, at query time, if we want a single pixel we'd have to stream at most 4 kB of data from disk. Although that has its own issues.)
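To illustrate the byte-range consolidation idea mentioned above, here is a minimal sketch. The function name and the `max_gap` parameter are hypothetical, not part of `hypergrib`:

```python
def coalesce_byte_ranges(ranges, max_gap=1_000_000):
    """Merge byte ranges that are within `max_gap` bytes of each other,
    trading a little wasted bandwidth for far fewer GET requests.

    `ranges` is a list of (start, end) tuples (end inclusive).
    """
    merged = []
    for start, end in sorted(ranges):
        if merged and start - merged[-1][1] <= max_gap:
            # Close enough to the previous range: extend it.
            merged[-1] = (merged[-1][0], max(merged[-1][1], end))
        else:
            merged.append((start, end))
    return merged

# e.g. two nearby messages collapse into one request:
print(coalesce_byte_ranges([(0, 500_000), (600_000, 900_000)]))
# -> [(0, 900000)]
```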
- Get hypergrib working for as many NWPs as possible
- Run a service to continually update metadata
- Caching. (Maybe start with caching for a single user, on that user's machine. Then consider a caching service of some sort. For example, if lots of people request "churro-shaped" data arrays then it will be far faster to load those from a "churro-shaped" dataset cached in cloud object storage). ("churro-shaped" means, for example, a long timeseries for a single geographical point).
- Analysis tool for comparing different NWPs against each other and against ground truth. (Where would `hypergrib` run? Perhaps in the browser, using `wasm`?! (But tokio's `rt-multi-thread` feature doesn't work on `wasm`, which might be a deal-breaker.) Or perhaps run `hypergrib` as a web service in the cloud, close to the data, across multiple machines, and expose a standards-compliant API like Environmental Data Retrieval for the front-end?)
- Implement existing protocols.
- On-the-fly processing and analytics, e.g. reprojection.
- Distribute `hypergrib`'s workload across multiple machines. So, for example, users can get acceptable IO performance even if they ask for "churro-shaped" data arrays.
- For small GRIB files, just read the entirety of each GRIB file?
- Store `.idx` files locally?
- Convert `.idx` files to a more concise and cloud-friendly file format, which is published in a cloud bucket?
- Put all the `.idx` data into a cloud-side database?
- Put all the `.idx` data into a local database? DuckDB?
- We probably want to avoid using a manifest file, or putting metadata for every GRIB message into a database, because we want to scale to datasets with trillions of GRIB messages. See #14