DAS-2292: add SPL2SMAP collection to the SMAP-L2-Gridder #13
Conversation
group is a better name since that is actually what they are.
This should be fixed up a bit.
It turns out we don't have spacecraft_overpass_time_utc for two reasons. The first is that it's huge: it's a 3 km grid, so 11568 x 4872 entries, and those entries are 24-character string variables, which is 1,352,623,104 bytes of data for that one variable. The second is that the NetCDF-C library does not compress string data, even if it's constant length.
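As a back-of-the-envelope check on those numbers (a sketch; the 11568 x 4872 grid shape and 24-character strings are taken from the comment above):

```python
# Uncompressed size of a fixed-length string variable on the 3 km grid,
# assuming one byte per character.
rows, cols = 11568, 4872
chars_per_entry = 24

uncompressed_bytes = rows * cols * chars_per_entry
print(uncompressed_bytes)                      # 1,352,623,104 bytes
print(round(uncompressed_bytes / 2**30, 2))    # about 1.26 GiB for one variable
```

Since NetCDF-C writes that string data uncompressed, the whole ~1.26 GiB lands in the output file.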
Also updates fetching metadata from config.
This is where all of the information about the differences in data structures between the collections lives. Right now the information for all collections is here, even though not all of the collections work yet. I just didn't want to separate out the pieces for review; all of these entries will be used when the remaining collections are added.
'col': 'Soil_Moisture_Retrieval_Data/EASE_column_index',
}

GRIDS = {
This just links the grids with their EPSG codes for simplicity; in a perfect world they would be in the gpd file.
'row': 'Soil_Moisture_Retrieval_Data_3km/EASE_row_index_3km',
'col': 'Soil_Moisture_Retrieval_Data_3km/EASE_column_index_3km',
**GRIDS['M03km'],
'dropped_variables': ['spacecraft_overpass_time_utc'],
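To make the shape of this configuration concrete, here is a minimal sketch of how a GRIDS mapping and a collection entry could fit together. The key names (`epsg`, `dropped_variables`) and the EPSG code are assumptions for illustration, not taken from the repository; EPSG 6933 is the EASE-Grid 2.0 global cylindrical CRS, but the actual configuration may differ.

```python
# Hypothetical sketch: each grid name maps to its CRS information, and a
# collection entry pulls that in with dictionary unpacking.
GRIDS = {
    'M03km': {'epsg': 6933},  # EASE-Grid 2.0 global (illustrative)
    'M09km': {'epsg': 6933},
}

COLLECTION_INFORMATION = {
    'SPL2SMAP': {
        'Soil_Moisture_Retrieval_Data_3km': {
            'row': 'Soil_Moisture_Retrieval_Data_3km/EASE_row_index_3km',
            'col': 'Soil_Moisture_Retrieval_Data_3km/EASE_column_index_3km',
            **GRIDS['M03km'],
            'dropped_variables': ['spacecraft_overpass_time_utc'],
        },
    },
}

group_config = COLLECTION_INFORMATION['SPL2SMAP']['Soil_Moisture_Retrieval_Data_3km']
print(group_config['epsg'])  # 6933
```

The `**GRIDS[...]` unpacking keeps the EPSG code defined in exactly one place per grid.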
The on-prem system doesn't include this variable in its outputs. I found that including it makes the file about 20x larger than without it:
168897423 Jan 24 14:58 output-without-timevar_SMAP_L2_SM_A_02289_D_20150707T020806_R13080_001.nc
3349625663 Jan 23 17:24 output_SMAP_L2_SM_A_02289_D_20150707T020806_R13080_001.nc
"""
return ['Metadata']
collection_config = get_collection_info(get_collection_shortname(in_data))
return collection_config['metadata']
This still just returns ['Metadata'], but it felt like it was fine to leave it in the collection configuration.

def get_collection_shortname(in_data: DataTree) -> str:
    """Extract the short name identifier from the dataset metadata."""
    return in_data['Metadata/DatasetIdentification'].shortName
If this ever changes location, I can add it to the configuration and grab it from there.
This is another place where it feels like earthdata-varinfo has code that can fulfil this task. A VarInfoFromNetCDF4 or VarInfoFromDmr object looks for the collection short name in a bunch of locations, including this one, if a short name isn't given when the object is instantiated.
The code looks good, the Docker image builds and the unit tests pass. I ran the example request for Harmony in a Box (after bumping up the memory for the service to 16G), but my request ran out of memory and failed.
My biggest question is about having a bunch of configuration in a style that is unique to this service for doing things like identifying variables to be transformed, denoting variables to be excluded from transformation, and extracting the short name from locations in the file. A lot of those things can be done with earthdata-varinfo. That could potentially also give you a few helper methods for identifying things like science variables, too.
That said, if you think I just have an earthdata-varinfo-shaped hammer and I'm seeing everything as a nail, let me know, but there's definitely code that looks nail-like at first glance.
}


COLLECTION_INFORMATION = {
Parts of this dictionary feel very similar to how earthdata-varinfo could be used with a configuration file. Could that be used instead? I think the main reason I'm not sure is that I don't know the best metadata attribute names that would be used.
smap_l2_gridder/grid.py
Outdated
@@ -58,11 +66,11 @@ def prepare_variable(var: DataTree | DataArray, grid_info: dict) -> DataArray:
    """Grid and annotate input variable."""
    grid_data = grid_variable(var, grid_info)
    grid_data.attrs = {**var.attrs, 'grid_mapping': 'crs'}
    unzippable = ['tb_time_utc']  # can't zip strings
Is there a more general way of doing this, like looking at the data type for var, rather than hardcoding around a specific variable name?
Probably so. It came up because this was the var that was breaking when I started with the SPL2SMP_E files. It turns out that the SPL2SMAP files have a different string variable that causes enough trouble that the EGI system just drops it. I will look and see whether it really is all string vars or just this particular string var.
(It might be good because I think this variable is also going to end up in the dropped vars portion of the config.)
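One possible dtype-based check, rather than a hardcoded variable name: decide compressibility from the dtype kind. This is a sketch; the function name is invented, and it assumes numpy-style single-character kind codes ('U' unicode, 'S' bytes, 'O' object), where with a real DataArray you would pass `var.dtype.kind`.

```python
# numpy dtype kind codes for string-like data that NetCDF-C won't compress.
STRING_DTYPE_KINDS = {'U', 'S', 'O'}


def is_compressible(dtype_kind: str) -> bool:
    """Return False for string-like dtypes; True for numeric ones."""
    return dtype_kind not in STRING_DTYPE_KINDS


print(is_compressible('f'))  # True: float data compresses fine
print(is_compressible('U'))  # False: fixed-length unicode strings
```

This would catch any future string variable, not just tb_time_utc.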
@@ -29,23 +35,25 @@ def process_input(in_data: DataTree, output_file: Path, logger: None | Logger = None):
    """Process input file to generate gridded output file."""
    out_data = DataTree()

    short_name = get_collection_shortname(in_data)
(Putting this as close to the function signature for process_input as possible.) pylint reckons logger is an unused argument above. Maybe that can be cut from the function signature?
************* Module smap_l2_gridder.grid
smap_l2_gridder/grid.py:34:56: W0613: Unused argument 'logger' (unused-argument)
Yes, I saw that. I left it because I've used the logger in other services; I just hadn't used it here yet. I guess I can remember that it's easy to add back.
Update: I made my Docker Desktop settings more beefy, and the request passed:
We do have sample configuration data in VarInfo for other applications, and the CRS, geotransform, and exclude_variables could all be defined in the configuration. I'm not sure it would buy much, as this is already targeted as a SMAP-only app. The more general way to get the geotransform, or grid extents, is to use EPSG codes and PyProj to get the CRS, and then the projection parameters (CF, as is done here) and datum parameters (extents, but not resolution). I'm not sure which is more "authoritative" (the implemented approach, or EPSG & PyProj), but the latter does not include the resolution, so a configuration entry would still remain to define it. I'm OK with it as it is.
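To make the resolution gap concrete: given grid extents (which an EPSG lookup can supply) and a resolution (which still has to come from configuration), the geotransform itself is simple arithmetic. A sketch with illustrative EASE-Grid 2.0-like numbers, not the service's actual values:

```python
def geotransform(x_min: float, y_max: float, resolution: float) -> tuple:
    """GDAL-style geotransform (no rotation) from upper-left corner + pixel size."""
    return (x_min, resolution, 0.0, y_max, 0.0, -resolution)


# Illustrative extents only; resolution (3 km here) must come from config.
gt = geotransform(x_min=-17367530.45, y_max=7314540.83, resolution=3000.0)
print(gt)
```

The six-tuple is the standard (origin x, pixel width, row rotation, origin y, column rotation, negative pixel height) layout.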
@flamingbear and I had a quick tag-up and talked through the possibility of using earthdata-varinfo. We both agreed that the implementation here is pretty close to what earthdata-varinfo does, but the big question was what the metadata attributes for the information being conveyed in the configuration file would be. (There aren't exact analogues in something like the CF Conventions.) Given that, and that it would probably be a day or so of coding for a service that supports 4 collections, we agreed it wasn't likely worth the change.
We also discussed writing down some notes on a potential earthdata-varinfo-based implementation, in case that becomes relevant in the future. Here are those notes:
- Using `ExcludedScienceVariables` to indicate what are currently referred to as `dropped_variables`.
- Adding `MetadataOverrides` using a regular expression for the `VariablePattern` to add metadata attributes for the row and column information specific to the variables in a single data group (e.g., `/Soil_Moisture_Retrieval_Data_3km/.*`).
- Using the instantiation of a `VarInfoFromNetCDF4` to pull the collection short name out of the `Metadata` group.
- Probably iterating through the variables overall, rather than iterating through the `data_groups`.
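For the record, a varinfo-style configuration expressing the first two notes might look roughly like this. The schema details are approximate, and the `row_index`/`col_index` attribute names are invented for illustration (that naming question is exactly the open issue discussed above):

```json
{
  "ExcludedScienceVariables": [
    {
      "Applicability": {"Mission": "SMAP", "ShortNamePath": "SPL2SMAP"},
      "VariablePattern": ["/Soil_Moisture_Retrieval_Data.*/spacecraft_overpass_time_utc"]
    }
  ],
  "MetadataOverrides": [
    {
      "Applicability": {
        "Mission": "SMAP",
        "ShortNamePath": "SPL2SMAP",
        "VariablePattern": "/Soil_Moisture_Retrieval_Data_3km/.*"
      },
      "Attributes": [
        {"Name": "row_index", "Value": "/Soil_Moisture_Retrieval_Data_3km/EASE_row_index_3km"},
        {"Name": "col_index", "Value": "/Soil_Moisture_Retrieval_Data_3km/EASE_column_index_3km"}
      ]
    }
  ]
}
```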
Able to pass the test after increasing memory and swap space, per Owen's comments.
I agree we need to drop the string variables (...utc) from the output (excluded variables). The lack of compression, whether due to NetCDF library failures or other library issues, makes the output extremely unwieldy, and as noted the time stamps are available elsewhere in the data. Add to that the implementation issues (machine size) and on-demand costs, and inclusion becomes a complicating and expensive feature. Also, the data is not available in the on-prem solution.
This PR adds the configuration and support to allow the Harmony SMAP L2 gridding service to grid SPL2SMAP data.
Jira Issue ID
DAS-2292
Local Test Steps
Check out this PR's branch
Build and test the docker images
Verify tests pass and coverage is good.
If you have lots of memory: Deploy Harmony-In-A-Box with your freshly built Docker image.
Update your local Harmony configuration to update the memory limit for the harmony-smap-l2-gridder in `services/harmony/env-defaults`.
Ensure your Harmony `.env` includes the smap-l2-gridder.
Make a Harmony request for this collection
Open your local workflow-ui and download the completed asset `SMAP_L2_SM_AP_01061_D_20150414T025639_R13080_001_regridded.nc`.
Open it in Panoply and verify that you can plot variables and examine the metadata, particularly the x-dim, y-dim, and crs values.
landcover_class should obviously follow the contours of the Earth along the coast of Africa.
More test steps...
Unfortunately, we don't have the NSIDC production collection available in UAT. In order to validate the EGI outputs against the harmony-smap-l2-gridder outputs, we have to download and run the subsetter by hand. There is a notebook attached to the ticket, `validate-DAS-2292.ipynb`, that has instructions within. Please also follow along with those to verify the outputs are the same as the on-prem system's.

PR Acceptance Checklist

- `CHANGELOG.md` updated to include a high-level summary of PR changes.
- `docker/service_version.txt` updated if publishing a release.*

*Regression tests need to be updated as well, same ticket, different PR.