Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

DAS-2292: add SPL2SMAP collection to the SMAP-L2-Gridder #13

Merged
merged 20 commits into from
Jan 30, 2025

Conversation

flamingbear
Copy link
Member

@flamingbear flamingbear commented Jan 27, 2025

This PR adds the configuration and support to allow the Harmony SMAP L2 gridding service to grid SPL2SMAP data.

Jira Issue ID

DAS-2292

Local Test Steps

  1. Check out this PR's branch

  2. Build and test the docker images

    ❯ ./bin/build-image && ./bin/build-test && ./bin/run-test
    
  3. Verify tests pass and coverage is good.

  4. If you have lots of memory: Deploy Harmony-In-A-Box with your freshly built Docker image.

    • Update your local harmony configuration to update the memory limit for the harmony-smap-l2-gridder in `services/harmony/env-defaults

      HARMONY_SMAP_L2_GRIDDER_LIMITS_MEMORY=16Gi
      
    • ensure your Harmony .env has the smap-l2-gridder

    LOCALLY_DEPLOYED_SERVICES=<[blah,blah,etc]>,harmony-smap-l2-gridder
    
    • Restart harmony from scratch to be safe.
     ❯ ./bin/stop-harmony-and-services
     ❯ ./bin/bootstrap-harmony
    
  5. Make a Harmony request for this collection

    http://localhost:3000/C1268429748-EEDTEST/ogc-api-coverages/1.0.0/collections/all/coverage/rangeset?forceAsync=true&outputcrs=EPSG%3A4326&granuleId=G1268429760-EEDTEST&format=application%2Fx-netcdf4&maxResults=1
    
  6. Open your local workflow-ui, download the completed asset. SMAP_L2_SM_AP_01061_D_20150414T025639_R13080_001_regridded.nc

  7. Open in panoply and verify that you can plot variables, examine the metadata, particularly the x-dim, y-dim and crs values.
    landcover_class should show up obviously following the contours of the earth along the coast of Africa

More test steps...

Unfortunately, we don't have the NSIDC production collection available in UAT. In order to validate the EGI outputs with the harmony-smap-l2-gridder outputs, we have to download and run the subsetter by hand. There is a notebook attached to the ticket validate-DAS-2292.ipynb that has instructions within. Please also follow along with those to verify the outputs are the same as the on-prem system.

PR Acceptance Checklist

  • [X*] Jira ticket acceptance criteria met.
  • CHANGELOG.md updated to include high level summary of PR changes.
  • docker/service_version.txt updated if publishing a release.
  • Tests added/updated and passing.
  • Documentation updated (if needed).

*Regression tests need to be updated as well, same ticket, different PR.

group is a better name since that is actually what they are.
This should be fixed up a bit.
It turns out for two reasons. We don't have spacecraft_overpass_time_utc.
The first is that it's huge, it's a 3km grid so 11568 x 4872 entries, and those
entries are 24 character String variables that is  1,352,623,104 bytes of data for
that one variable.  The second is that the NetCDF-C lib is not compressing
string data even if it's constant length.
Also updates fetching metadata from config.
@flamingbear flamingbear changed the title DAS-2292: WIP: refactoring tests and adding new collection DAS-2292: add SPL2SMAP collection to the SMAP-L2-Gridder Jan 27, 2025
@flamingbear flamingbear marked this pull request as ready for review January 28, 2025 17:27
Copy link
Member Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

This is where all of the information about the differences in data structures between the collections live. Right now all of the information for all collections is here, but not all of the collections work yet. I just didn't want to separate out the pieces just for review, all of these will be used when all collections are added.

'col': 'Soil_Moisture_Retrieval_Data/EASE_column_index',
}

GRIDS = {
Copy link
Member Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

This just links the grids with their EPSG for simplicity, in a perfect world they would be in the gpd file.

'row': 'Soil_Moisture_Retrieval_Data_3km/EASE_row_index_3km',
'col': 'Soil_Moisture_Retrieval_Data_3km/EASE_column_index_3km',
**GRIDS['M03km'],
'dropped_variables': ['spacecraft_overpass_time_utc'],
Copy link
Member Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

The on prem system doesn't include this variable in it's outputs. I found that including it, adds about 20x the size of the file without it.

       168897423 Jan 24 14:58 output-without-timevar_SMAP_L2_SM_A_02289_D_20150707T020806_R13080_001.nc
      3349625663 Jan 23 17:24 output_SMAP_L2_SM_A_02289_D_20150707T020806_R13080_001.nc

"""
return ['Metadata']
collection_config = get_collection_info(get_collection_shortname(in_data))
return collection_config['metadata']
Copy link
Member Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

This still just returns ['Metadata'], but it felt like it was fine to leave it in the collection configuration.


def get_collection_shortname(in_data: DataTree) -> str:
"""Extract the short name identifier from the dataset metadata."""
return in_data['Metadata/DatasetIdentification'].shortName
Copy link
Member Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

If this ever changes location, I can add it to the configuration and grab it from there.

Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

This is another place where it feels like earthdata-varinfo has code that can fulfil this task. A VarInfoFromNetCDF4 or VarInfoFromDmr object looks for the collection short name in a bunch of locations, including this one, if a short name isn't given when the object is instantiated.

Copy link
Member

@owenlittlejohns owenlittlejohns left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

The code looks good, the Docker image builds and the unit tests pass. I ran the example request for Harmony in a Box (after bumping up the memory for the service to 16G), but my request ran out of memory and failed.

My biggest question is about having a bunch of configuration in a style that is unique to this service for doing things like identifying variables to be transformed, denoting variables to be excluded from transformation, and extracting the short name from locations of the file. A lot of those things that can be done with earthdata-varinfo. That could potentially also give you a few helper methods for identifying things like science variables, too.

That said, if you think I just have an earthdata-varinfo shaped hammer and I'm seeing everything as a nail, let me know, but there's definitely code that looks nail-like from a first glance.

CHANGELOG.md Show resolved Hide resolved
}


COLLECTION_INFORMATION = {
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Parts of this dictionary feel very similar to how earthdata-varinfo could be used with a configuration file. Could that be used instead? I think the main reason I'm not sure is that I don't know the best metadata attribute names that would be used.

@@ -58,11 +66,11 @@ def prepare_variable(var: DataTree | DataArray, grid_info: dict) -> DataArray:
"""Grid and annotate intput variable."""
grid_data = grid_variable(var, grid_info)
grid_data.attrs = {**var.attrs, 'grid_mapping': 'crs'}
unzippable = ['tb_time_utc'] # can't zip strings
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Is there a more general way of doing this, like looking at the data type for var, rather than hardcoding around a specific variable name?

Copy link
Member Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Probably so. It came up because this was the var that was breaking when I started with the SPL2SMP_E files. Turns out that the SPL2SMAP files have a a different string variable that causes enough trouble that the EGI system just drops it. I will look and see if it really is just string vars or this particular string var.

(It might be good because I think this variable is also going to end up in the dropped vars portion of the config.)

Copy link
Member Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.


def get_collection_shortname(in_data: DataTree) -> str:
"""Extract the short name identifier from the dataset metadata."""
return in_data['Metadata/DatasetIdentification'].shortName
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

This is another place where it feels like earthdata-varinfo has code that can fulfil this task. A VarInfoFromNetCDF4 or VarInfoFromDmr object looks for the collection short name in a bunch of locations, including this one, if a short name isn't given when the object is instantiated.

@@ -29,23 +35,25 @@ def process_input(in_data: DataTree, output_file: Path, logger: None | Logger =
"""Process input file to generate gridded output file."""
out_data = DataTree()

short_name = get_collection_shortname(in_data)
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

(Putting this as close to the function signature for process_input as possible)

pylint reckons logger is an unused argument above. Maybe that can be cut from the function signature?

************* Module smap_l2_gridder.grid
smap_l2_gridder/grid.py:34:56: W0613: Unused argument 'logger' (unused-argument)

Copy link
Member Author

@flamingbear flamingbear Jan 29, 2025

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Yes, I saw that, I left it because I've used the logger in other services, I just hadn't used it yet. I guess I can remember it's easy to add back.

Copy link
Member Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

@owenlittlejohns
Copy link
Member

I ran the example request for Harmony in a Box (after bumping up the memory for the service to 16G), but my request ran out of memory and failed.

Update: I made my Docker Desktop settings more beefy, and the request passed:

  • Memory Limit: 20 GB
  • Swap: 4 GB

@D-Auty
Copy link

D-Auty commented Jan 29, 2025

We do have sample configuration data in VarInfo for other applications, and both CRS, geotransform and exclude_variables could be defined in the configuration. I'm not sure it would buy much as this is already targeted as a SMAP only app. The more general way to get the geotransform, or grid extents is to use EPSG codes and PyProj to get the CRS and then the projection parameters (CF, as is done here), and datum parameters (extents, but not resolution). I'm not sure which is more "authoritative" - the implemented approach, or the EPSG & PyProj, but the later does not include the resolution, so a configuration entry still remaining to define. I'm ok as it is.

Copy link
Member

@owenlittlejohns owenlittlejohns left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

@flamingbear and I had a quick tag-up and talked through the possibility of using earthdata-varinfo. We both agreed that the implementation here is pretty close to what earthdata-varinfo does, but the big question was what the metadata attributes for the information being conveyed in the configuration file would be. (There aren't exact analogues in something like the CF Conventions) Given that, and that it would probably be a day or so of coding for a service that supports 4 collections, we agreed it wasn't likely worth the change.

We also discussed writing down some notes on a potential earthdata-varinfo-based implementation, in case that becomes relevant in the future. Here are those notes:

  • Using ExcludedScienceVariables to indicate what are currently referred to as dropped_variables.
  • Adding MetadataOverrides using a regular expression for the VariablePattern to add metadata attributes for the row and column information specific to the variables in a single data group (e.g., /Soil_Moisture_Retrieval_Data_3km/.*).
  • Using the instantiation of a VarInfoFromNetCDF4 to pull the collection short name out of the Metadata group.
  • Probably iterating through the variables overall, rather than iterating through the data_groups.

@vutrannasa
Copy link
Collaborator

Able to pass test after increasing memory and swap space per Owen's comments.
https://forums.docker.com/t/how-to-increase-memory-size-that-is-available-for-a-docker-container/78483

@vutrannasa vutrannasa self-requested a review January 30, 2025 12:10
@D-Auty
Copy link

D-Auty commented Jan 30, 2025

I agree we need to drop the string variables (...utc) from the output (excluded variables). The lack of compression, either due to NetCDF libraries failures or other library issues, makes the output extremely unwieldy, and as noted the time stamps are available elsewhere in the data. Add to that the implementation issues (machine size) and on-demand costs make inclusion a complicating and expensive feature. Also, the data is not available in the on-prem solution.

@flamingbear flamingbear merged commit 4c31171 into main Jan 30, 2025
8 checks passed
@flamingbear flamingbear deleted the mhs/DAS-2292/add-SPL2SMAP branch January 30, 2025 22:13
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
None yet
Projects
None yet
Development

Successfully merging this pull request may close these issues.

4 participants