DAS-2292: add SPL2SMAP collection to the SMAP-L2-Gridder #13
Conversation
group is a better name since that is actually what they are.
This should be fixed up a bit.
It turns out we don't have spacecraft_overpass_time_utc for two reasons. The first is that it's huge: it's a 3 km grid, so 11568 x 4872 entries, and those entries are 24-character string variables, which is 1,352,623,104 bytes of data for that one variable. The second is that the NetCDF-C library does not compress string data, even if it's constant length.
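As a back-of-the-envelope check on those numbers (a sketch; the 11568 x 4872 grid shape and 24-character strings are taken from the comment above):

```python
# Uncompressed size of a fixed-length string variable on the 3 km grid,
# assuming one byte per character.
rows, cols = 11568, 4872
chars_per_entry = 24

uncompressed_bytes = rows * cols * chars_per_entry
print(uncompressed_bytes)                      # 1,352,623,104 bytes
print(round(uncompressed_bytes / 2**30, 2))    # about 1.26 GiB for one variable
```

Since NetCDF-C writes that string data uncompressed, the whole ~1.26 GiB lands in the output file.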
Also updates fetching metadata from config.
This is where all of the information about the differences in data structures between the collections lives. Right now the information for all collections is here, even though not all of the collections work yet. I just didn't want to separate out the pieces for review; all of these entries will be used when the remaining collections are added.
'col': 'Soil_Moisture_Retrieval_Data/EASE_column_index',
}

GRIDS = {
This just links the grids with their EPSG codes for simplicity; in a perfect world they would be in the gpd file.
'row': 'Soil_Moisture_Retrieval_Data_3km/EASE_row_index_3km',
'col': 'Soil_Moisture_Retrieval_Data_3km/EASE_column_index_3km',
**GRIDS['M03km'],
'dropped_variables': ['spacecraft_overpass_time_utc'],
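To make the shape of this configuration concrete, here is a minimal sketch of how a GRIDS mapping and a collection entry could fit together. The key names (`epsg`, `dropped_variables`) and the EPSG code are assumptions for illustration, not taken from the repository; EPSG 6933 is the EASE-Grid 2.0 global cylindrical CRS, but the actual configuration may differ.

```python
# Hypothetical sketch: each grid name maps to its CRS information, and a
# collection entry pulls that in with dictionary unpacking.
GRIDS = {
    'M03km': {'epsg': 6933},  # EASE-Grid 2.0 global (illustrative)
    'M09km': {'epsg': 6933},
}

COLLECTION_INFORMATION = {
    'SPL2SMAP': {
        'Soil_Moisture_Retrieval_Data_3km': {
            'row': 'Soil_Moisture_Retrieval_Data_3km/EASE_row_index_3km',
            'col': 'Soil_Moisture_Retrieval_Data_3km/EASE_column_index_3km',
            **GRIDS['M03km'],
            'dropped_variables': ['spacecraft_overpass_time_utc'],
        },
    },
}

group_config = COLLECTION_INFORMATION['SPL2SMAP']['Soil_Moisture_Retrieval_Data_3km']
print(group_config['epsg'])  # 6933
```

The `**GRIDS[...]` unpacking keeps the EPSG code defined in exactly one place per grid.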
The on-prem system doesn't include this variable in its outputs. I found that including it makes the file about 20x larger than without it:
168897423 Jan 24 14:58 output-without-timevar_SMAP_L2_SM_A_02289_D_20150707T020806_R13080_001.nc
3349625663 Jan 23 17:24 output_SMAP_L2_SM_A_02289_D_20150707T020806_R13080_001.nc
"""
return ['Metadata']
collection_config = get_collection_info(get_collection_shortname(in_data))
return collection_config['metadata']
This still just returns ['Metadata'], but it felt like it was fine to leave it in the collection configuration.

def get_collection_shortname(in_data: DataTree) -> str:
    """Extract the short name identifier from the dataset metadata."""
    return in_data['Metadata/DatasetIdentification'].shortName
If this ever changes location, I can add it to the configuration and grab it from there.
This is another place where it feels like earthdata-varinfo has code that can fulfil this task. A VarInfoFromNetCDF4 or VarInfoFromDmr object looks for the collection short name in a bunch of locations, including this one, if a short name isn't given when the object is instantiated.
The code looks good, the Docker image builds and the unit tests pass. I ran the example request for Harmony in a Box (after bumping up the memory for the service to 16G), but my request ran out of memory and failed.
My biggest question is about having a bunch of configuration in a style that is unique to this service for doing things like identifying variables to be transformed, denoting variables to be excluded from transformation, and extracting the short name from locations in the file. A lot of those things can be done with earthdata-varinfo. That could potentially also give you a few helper methods for identifying things like science variables, too.
That said, if you think I just have an earthdata-varinfo-shaped hammer and I'm seeing everything as a nail, let me know, but there's definitely code that looks nail-like at first glance.
}


COLLECTION_INFORMATION = {
Parts of this dictionary feel very similar to how earthdata-varinfo could be used with a configuration file. Could that be used instead? I think the main reason I'm not sure is that I don't know the best metadata attribute names that would be used.
smap_l2_gridder/grid.py
Outdated
@@ -58,11 +66,11 @@ def prepare_variable(var: DataTree | DataArray, grid_info: dict) -> DataArray:
    """Grid and annotate input variable."""
    grid_data = grid_variable(var, grid_info)
    grid_data.attrs = {**var.attrs, 'grid_mapping': 'crs'}
    unzippable = ['tb_time_utc']  # can't zip strings
Is there a more general way of doing this, like looking at the data type for var, rather than hardcoding around a specific variable name?
Probably so. It came up because this was the var that was breaking when I started with the SPL2SMP_E files. It turns out that the SPL2SMAP files have a different string variable that causes enough trouble that the EGI system just drops it. I will look and see whether it really is all string vars or just this particular string var.
(It might be good because I think this variable is also going to end up in the dropped vars portion of the config.)
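One possible dtype-based check, rather than a hardcoded variable name: decide compressibility from the dtype kind. This is a sketch; the function name is invented, and it assumes numpy-style single-character kind codes ('U' unicode, 'S' bytes, 'O' object), where with a real DataArray you would pass `var.dtype.kind`.

```python
# numpy dtype kind codes for string-like data that NetCDF-C won't compress.
STRING_DTYPE_KINDS = {'U', 'S', 'O'}


def is_compressible(dtype_kind: str) -> bool:
    """Return False for string-like dtypes; True for numeric ones."""
    return dtype_kind not in STRING_DTYPE_KINDS


print(is_compressible('f'))  # True: float data compresses fine
print(is_compressible('U'))  # False: fixed-length unicode strings
```

This would catch any future string variable, not just tb_time_utc.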
@@ -29,23 +35,25 @@ def process_input(in_data: DataTree, output_file: Path, logger: None | Logger = None):
    """Process input file to generate gridded output file."""
    out_data = DataTree()

    short_name = get_collection_shortname(in_data)
(Putting this as close to the function signature for process_input as possible.) pylint reckons logger is an unused argument above. Maybe that can be cut from the function signature?
************* Module smap_l2_gridder.grid
smap_l2_gridder/grid.py:34:56: W0613: Unused argument 'logger' (unused-argument)
Yes, I saw that. I left it because I've used the logger in other services; I just hadn't used it here yet. I guess I can remember that it's easy to add back.
Update: I made my Docker Desktop settings more beefy, and the request passed:
We do have sample configuration data in VarInfo for other applications, and the CRS, geotransform, and exclude_variables could all be defined in the configuration. I'm not sure it would buy much, as this is already targeted as a SMAP-only app. The more general way to get the geotransform, or grid extents, is to use EPSG codes and PyProj to get the CRS, and then the projection parameters (CF, as is done here) and datum parameters (extents, but not resolution). I'm not sure which is more "authoritative" (the implemented approach, or EPSG & PyProj), but the latter does not include the resolution, so a configuration entry would still remain to define it. I'm OK with it as it is.
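To make the resolution gap concrete: given grid extents (which an EPSG lookup can supply) and a resolution (which still has to come from configuration), the geotransform itself is simple arithmetic. A sketch with illustrative EASE-Grid 2.0-like numbers, not the service's actual values:

```python
def geotransform(x_min: float, y_max: float, resolution: float) -> tuple:
    """GDAL-style geotransform (no rotation) from upper-left corner + pixel size."""
    return (x_min, resolution, 0.0, y_max, 0.0, -resolution)


# Illustrative extents only; resolution (3 km here) must come from config.
gt = geotransform(x_min=-17367530.45, y_max=7314540.83, resolution=3000.0)
print(gt)
```

The six-tuple is the standard (origin x, pixel width, row rotation, origin y, column rotation, negative pixel height) layout.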
@flamingbear and I had a quick tag-up and talked through the possibility of using earthdata-varinfo. We both agreed that the implementation here is pretty close to what earthdata-varinfo does, but the big question was what the metadata attributes for the information being conveyed in the configuration file would be. (There aren't exact analogues in something like the CF Conventions.) Given that, and that it would probably be a day or so of coding for a service that supports 4 collections, we agreed it wasn't likely worth the change.
We also discussed writing down some notes on a potential earthdata-varinfo-based implementation, in case that becomes relevant in the future. Here are those notes:
- Using `ExcludedScienceVariables` to indicate what are currently referred to as `dropped_variables`.
- Adding `MetadataOverrides` using a regular expression for the `VariablePattern` to add metadata attributes for the row and column information specific to the variables in a single data group (e.g., `/Soil_Moisture_Retrieval_Data_3km/.*`).
- Using the instantiation of a `VarInfoFromNetCDF4` to pull the collection short name out of the `Metadata` group.
- Probably iterating through the variables overall, rather than iterating through the `data_groups`.
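For the record, a varinfo-style configuration expressing the first two notes might look roughly like this. The schema details are approximate, and the `row_index`/`col_index` attribute names are invented for illustration (that naming question is exactly the open issue discussed above):

```json
{
  "ExcludedScienceVariables": [
    {
      "Applicability": {"Mission": "SMAP", "ShortNamePath": "SPL2SMAP"},
      "VariablePattern": ["/Soil_Moisture_Retrieval_Data.*/spacecraft_overpass_time_utc"]
    }
  ],
  "MetadataOverrides": [
    {
      "Applicability": {
        "Mission": "SMAP",
        "ShortNamePath": "SPL2SMAP",
        "VariablePattern": "/Soil_Moisture_Retrieval_Data_3km/.*"
      },
      "Attributes": [
        {"Name": "row_index", "Value": "/Soil_Moisture_Retrieval_Data_3km/EASE_row_index_3km"},
        {"Name": "col_index", "Value": "/Soil_Moisture_Retrieval_Data_3km/EASE_column_index_3km"}
      ]
    }
  ]
}
```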
Able to pass the test after increasing memory and swap space, per Owen's comments.
I agree we need to drop the string variables (...utc) from the output (excluded variables). The lack of compression, whether due to NetCDF library failures or other library issues, makes the output extremely unwieldy, and as noted the time stamps are available elsewhere in the data. Add to that the implementation issues (machine size) and on-demand costs, and inclusion becomes a complicating and expensive feature. Also, the data is not available in the on-prem solution.
This PR adds the configuration and support to allow the Harmony SMAP L2 gridding service to grid SPL2SMAP data.
Jira Issue ID
DAS-2292
Local Test Steps
Check out this PR's branch
Build and test the docker images
Verify tests pass and coverage is good.
If you have lots of memory: Deploy Harmony-In-A-Box with your freshly built Docker image.
Update your local Harmony configuration to update the memory limit for the harmony-smap-l2-gridder in `services/harmony/env-defaults`.
Ensure your Harmony `.env` includes the smap-l2-gridder.
Make a Harmony request for this collection
Open your local workflow-ui and download the completed asset `SMAP_L2_SM_AP_01061_D_20150414T025639_R13080_001_regridded.nc`.
Open it in Panoply and verify that you can plot variables and examine the metadata, particularly the x-dim, y-dim, and crs values.
landcover_class should obviously follow the contours of the Earth along the coast of Africa.
More test steps...
Unfortunately, we don't have the NSIDC production collection available in UAT. In order to validate the EGI outputs against the harmony-smap-l2-gridder outputs, we have to download and run the subsetter by hand. There is a notebook attached to the ticket, `validate-DAS-2292.ipynb`, that has instructions within. Please also follow along with those to verify the outputs are the same as the on-prem system's.

PR Acceptance Checklist

- `CHANGELOG.md` updated to include a high-level summary of PR changes.
- `docker/service_version.txt` updated if publishing a release.*

*Regression tests need to be updated as well, same ticket, different PR.