Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

DAS-2232 - Functionality added to support SMAP L3 products #15

Closed
wants to merge 41 commits into from
Closed
Show file tree
Hide file tree
Changes from 1 commit
Commits
Show all changes
41 commits
Select commit Hold shift + click to select a range
79b4c3f
DAS-2232 Initial commit for SMAP L3 spatial subset changes
sudha-murthy Sep 9, 2024
789cc6b
corrections from pre-commit checks
sudha-murthy Sep 9, 2024
c4abf26
DAS-2232 - updates to correct unit test failures
sudha-murthy Sep 9, 2024
be280f0
DAS-2232 updates for unit test failures
sudha-murthy Sep 10, 2024
2a48f93
DAS-2232 - all unit tests pass
sudha-murthy Sep 10, 2024
2c93a4b
DAS-2232 updates to version number
sudha-murthy Sep 10, 2024
dfb1e15
DAS-2232 - updated notebook version due to Synk vulnerability. Also r…
sudha-murthy Sep 10, 2024
e2ff61f
DAS-2232 fixed spatial subsetting bugs introduced when removing dupli…
sudha-murthy Sep 13, 2024
756f7c0
DAS-2232 initial commit after PR feedback
sudha-murthy Sep 26, 2024
53f1660
DAS-2232 - fixed get_variable_crs and a few minor corrections
sudha-murthy Sep 26, 2024
f9f5e8b
DAS-2232 - comments updated for the get_spatial_index_ranges method
sudha-murthy Sep 26, 2024
f836ee6
DAS-2232 simplified get_spatial_index_ranges method
sudha-murthy Sep 26, 2024
c35c8ee
DAS-2232 - changed override to coordinates and dimension_file to dime…
sudha-murthy Sep 26, 2024
ffe035a
DAS-2232 - updates to CHANGELOG.md
sudha-murthy Sep 26, 2024
9c9f190
DAS-2232 - PR feedback update - to detrmine if latitude is ascending …
sudha-murthy Sep 27, 2024
7eda980
DAS-2232 - added get_x_y_index_ranges_from_coordinates as a separate …
sudha-murthy Oct 1, 2024
f07b544
DAS-2232 - fixed assumption of top left origin
sudha-murthy Oct 1, 2024
3b453e5
DAS=2232 - some refactoring
sudha-murthy Oct 1, 2024
681b20d
DAS-2232 - updates to hoss_config.json description fields
sudha-murthy Oct 1, 2024
5d609c9
DAS-2232 - updated the names for required_dimensions to required_vari…
sudha-murthy Oct 1, 2024
91c51c0
DAS-2232 - added exception if one of the coordinate datasets are missing
sudha-murthy Oct 1, 2024
2296a35
DAS-2232 - updated CHANGELOG.md
sudha-murthy Oct 1, 2024
2efc4c7
DAS-2232 - small bug fix in get_override_projection_dimension_name
sudha-murthy Oct 1, 2024
f628166
DAS-2232 - updates to comments
sudha-murthy Oct 1, 2024
ebac2a0
DAS-2232 - added check for fillvalue
sudha-murthy Oct 2, 2024
3b6d605
DAS-2232 - comments and minor changes
sudha-murthy Oct 2, 2024
802fe0e
DAS-2232 - simplified get_coordinate_variables based on PR feedback
sudha-murthy Oct 3, 2024
60fb22a
DAS-2232 - added method to check for variables with no dimensions
sudha-murthy Oct 3, 2024
631dc24
DAS-2232 - added unittest for get_variable_crs
sudha-murthy Oct 3, 2024
1e7bc35
DAS-2232 - added a module to contain coordinate methods
sudha-murthy Oct 4, 2024
36e15c7
DAS-2232 - moved new methods to a separate file
sudha-murthy Oct 4, 2024
5c5eb85
DAS-2232 - removed accidental checkin of an incomplete unit test
sudha-murthy Oct 4, 2024
80c2fb2
DAS-2232 - added unit test for the new method - get_x_y_index_ranges_…
sudha-murthy Oct 7, 2024
30eccd0
DAS-2232 - added unit test for the new method - get_x_y_index_ranges_…
sudha-murthy Oct 8, 2024
16872b7
DAS-2232 - added some unit tests and some bug fixes
sudha-murthy Oct 10, 2024
822758f
DAS-2232 - added some unit tests and some bug fixes
sudha-murthy Oct 10, 2024
7883465
DAS-2232 - minor initialization fix
sudha-murthy Oct 10, 2024
dd98e81
DAS-2232 - added unit test for get_valid_indices
sudha-murthy Oct 10, 2024
de350b6
DAS-2232 - replaced get_geo_grid_corners method to get 2 valid geo gr…
sudha-murthy Oct 16, 2024
0666248
DAS-2232 - minor name from dataset to variable
sudha-murthy Oct 16, 2024
e07099a
DAS-2232 - added unit tests
sudha-murthy Oct 17, 2024
File filter

Filter by extension

Filter by extension


Conversations
Failed to load comments.
Loading
Jump to
Jump to file
Failed to load files.
Loading
Diff view
Diff view
193 changes: 166 additions & 27 deletions hoss/dimension_utilities.py
sudha-murthy marked this conversation as resolved.
Show resolved Hide resolved
Original file line number Diff line number Diff line change
Expand Up @@ -20,15 +20,23 @@
from harmony.util import Config
from netCDF4 import Dataset
from numpy.ma.core import MaskedArray
from numpy import ndarray
from varinfo import VariableFromDmr, VarInfoFromDmr

from hoss.bbox_utilities import flatten_list
from hoss.exceptions import InvalidNamedDimension, InvalidRequestedRange
from hoss.utilities import (
format_variable_set_string,
format_dictionary_string,
get_opendap_nc4,
get_value_or_default,
)
from hoss.projection_utilities import (
get_projected_x_y_extents,
get_projected_x_y_variables,
get_variable_crs,
get_x_y_extents_from_geographic_points
)

IndexRange = Tuple[int]
IndexRanges = Dict[str, IndexRange]
Expand Down Expand Up @@ -63,7 +71,7 @@ def prefetch_dimension_variables(
logger: Logger,
access_token: str,
config: Config,
) -> str:
) -> tuple[str, set[str]]:
"""Determine the dimensions that need to be "pre-fetched" from OPeNDAP in
order to derive index ranges upon them. Initially, this was just
spatial and temporal dimensions, but to support generic dimension
Expand All @@ -73,21 +81,13 @@ def prefetch_dimension_variables(

"""
required_dimensions = varinfo.get_required_dimensions(required_variables)

# Iterate through all requested dimensions and extract a list of bounds
# references for each that has any. This will produce a list of lists,
# which should be flattened into a single list and then combined into a set
# to remove duplicates.
bounds = set(
flatten_list(
[
list(varinfo.get_variable(dimension).references.get('bounds'))
for dimension in required_dimensions
if varinfo.get_variable(dimension).references.get('bounds') is not None
]
)
)

if(len(required_dimensions) == 0):
#check if coordinate variables are provided
required_dimensions = varinfo.get_references_for_attribute(required_variables, 'coordinates')
logger.info('coordinates: ' f'{required_dimensions}')
flamingbear marked this conversation as resolved.
Show resolved Hide resolved
sudha-murthy marked this conversation as resolved.
Show resolved Hide resolved

bounds = varinfo.get_references_for_attribute(required_dimensions, 'bounds')
owenlittlejohns marked this conversation as resolved.
Show resolved Hide resolved
logger.info('bounds: ' f'{bounds}')
owenlittlejohns marked this conversation as resolved.
Show resolved Hide resolved
required_dimensions.update(bounds)

logger.info(
Expand All @@ -101,10 +101,139 @@ def prefetch_dimension_variables(

# Create bounds variables if necessary.
add_bounds_variables(required_dimensions_nc4, required_dimensions, varinfo, logger)
return required_dimensions_nc4, required_dimensions


def is_variable_one_dimensional(
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

You do not need this function. The equivalent check is just if len(variable.shape) == 1 (where variable is a netCDF4.Variable instance).

Generally, there are a bunch of nice numpy array things that are syntactic sugar you can use for array shape, size, etc. The netCDF4 library does a pretty good job of exposing a lot of them on netCDF4.Variable objects.

Copy link
Collaborator Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Yeah right I did not need that. I could just get the ndim in the calling method

Copy link
Collaborator Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

756f7c0

removed the method

prefetch_dataset: Dataset, dimension_variable: VariableFromDmr
) -> bool:
"""Check if a dimension variable is 1D

"""
dimensions_array = prefetch_dataset[dimension_variable.full_name_path][:]
return dimensions_array.ndim == 1
flamingbear marked this conversation as resolved.
Show resolved Hide resolved


def update_dimension_variables(
owenlittlejohns marked this conversation as resolved.
Show resolved Hide resolved
prefetch_dataset: Dataset,
required_dimensions: Set[str],
varinfo: VarInfoFromDmr,
logger: Logger,
) -> Dict[str, ndarray]:
"""Augment a NetCDF4 file with artificial 1D dimensions variable for each
2D dimension variable"

For each dimension variable:
(1) Check if the dimension variable is 1D.
(2) If it is not 1D and is 2D create a dimensions array from
within the `write_1D_dimensions`
function.
(3) Then write the 1D dimensions variable to the NetCDF4 URL.

return required_dimensions_nc4
"""
for dimension_name in required_dimensions:
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I don't think you want to iterate through all required_dimensions, just the ones that are identified as overriding things from the coordinates.

Copy link
Collaborator Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Only the override dimensions are passed to the method.

Copy link
Collaborator Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

changed the names to make it clear
c35c8ee

dimension_variable = varinfo.get_variable(dimension_name)
logger.info('dimension name: ' f'{dimension_name}')
logger.info('dimension path:' f'{dimension_variable.full_name_path}')

if is_variable_one_dimensional(prefetch_dataset, dimension_variable):
logger.info(
'No changes needed: 'f'{dimension_name}'
)
else:
col_size = prefetch_dataset[dimension_variable.full_name_path][:].shape[0]
row_size = prefetch_dataset[dimension_variable.full_name_path][:].shape[1]
flamingbear marked this conversation as resolved.
Show resolved Hide resolved
crs = get_variable_crs(dimension_name, varinfo,logger)
logger.info('row_size=' f'{row_size}' ',col_size=' f'{col_size}')

geo_grid_corners = get_geo_grid_corners(
prefetch_dataset,
required_dimensions,
varinfo,
logger
)

x_y_extents = get_x_y_extents_from_geographic_points(
geo_grid_corners, crs
)
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I think this methodology is pretty complicated, and doing some maths it doesn't need to. It might be simpler to:

  1. Check if the "dimension" is 1-D or 2-D
  2. If the dimension is 1-D then you're good (although a check for it being regular might be really good, because HOSS assumes that).
  3. If the "dimension" is 2-D (e.g., a coordinate variable), then:
    3a) If the coordinate is in the same projection as the grid, grab a row slice and a column slice.
    3b) If the coordinate is a different projection to the grid (e.g., one is geographic and the other isn't), then project the coordinate variables to grid variables. Then grab a column and a row slice.
  4. If there are fill values in the slices you get, make them correct. So long as you have two consecutive non-fill values, you can derive the resolution (something like numpy.diff is your friend here).

Copy link
Collaborator Author

@sudha-murthy sudha-murthy Sep 24, 2024

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

thanks. will change it


#get grid size and resolution
x_min = x_y_extents['x_min']
x_max = x_y_extents['x_max']
y_min = x_y_extents['y_min']
y_max = x_y_extents['y_max']
x_resolution = (x_max-x_min)/row_size
y_resolution = (y_max-y_min)/col_size

logger.info('x_min:' f'{x_min}' ',x_max=' f'{x_max}'
',y_min:' f'{y_min}' ',y_max=' f'{y_max}'
',x_res=' f'{x_resolution}' 'y_res=' f'{y_resolution}')

#create the xy dim scales
x_dim = np.arange(x_min, x_max, x_resolution)

#ascending versus descending..should be based on the coordinate grid
y_dim = np.arange(y_max,y_min, -y_resolution)
logger.info('x_dim:' f'{x_dim}')
logger.info('y_dim:' f'{y_dim}')
return {'projected_y': y_dim, 'projected_x': x_dim}

# to calculate grid corners when there are fill values
def get_geo_grid_corners(
prefetch_dataset: Dataset,
required_dimensions: Set[str],
varinfo: VarInfoFromDmr,
logger: Logger,
) -> list[Tuple[float]]:
"""
This method is used to return the lat lon corners from a 2D
coordinate dataset

"""
for dimension_name in required_dimensions:
dimension_variable = varinfo.get_variable(dimension_name)
if dimension_variable.is_latitude():
lat_arr = prefetch_dataset[dimension_variable.full_name_path][:]
elif dimension_variable.is_longitude():
lon_arr = prefetch_dataset[dimension_variable.full_name_path][:]
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

The code block above is likely to cause errors. There is no check for if you manage to find a latitude variable but not a longitude variable (or vice versa). It cannot be assumed that they both will be present. My gut instinct is that this function isn't needed, but if it is, it would be better to catch a missing lat/lon variable here and then raise a human readable error, rather than letting Harmony spit out a complicated and unintuitive stack trace to the end user.

Copy link
Collaborator Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Changes are in commit 756f7c0


#skip fill values when calculating min values
#topleft = minlon, maxlat
#bottomright = maxlon, minlat
top_left_row_idx = 0
top_left_col_idx = 0
lon_row = lon_arr[top_left_row_idx,:]
lon_row_valid_indices = np.where(lon_row >= -180.0)[0]
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

It would be better to check for where the array values are not the fill value for the variable.

Copy link
Collaborator Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

We had a ticket to address all the fill value cases. https://bugs.earthdata.nasa.gov/browse/DAS-2236

Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I am mainly suggesting that your np.where condition would be better to look for np.where(lon_row != fill_value)[0], where fill_value is derived from the metadata of the variable (via varinfo). As it is now, this check would indicate good values for any element that is at least -180.0. If a fill value is a large positive number, then that's a problem.

Generally, I think this indicates the work has been split a bit of a weird way. Instead of implementing a first pass of everything, with know issues in several places, it might be better to implement one piece at a time making those pieces correct in one go.

Copy link
Collaborator Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

the coordinate variables do not have fill value attributes (at least in the test case I am using - SPL3SMP). It may need to be added to the json file.. will check into it. I can add a check for -180 to +180 as w a valid range maybe till we resolve DAS-2236

Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Perhaps the better alternative is to see if there is a fill value, if so use a check based around that, otherwise fallback on this alternative check.

I still generally think it's not the best approach to write a function with known issues, and then come back to it later. The more I think about it, the more attractive a feature branch is sounding.

top_left_col_idx = lon_row_valid_indices[lon_row[lon_row_valid_indices].argmin()]
minlon = lon_row[top_left_col_idx]
top_right_col_idx = lon_row_valid_indices[lon_row[lon_row_valid_indices].argmax()]
maxlon = lon_row[top_right_col_idx]

lat_col = lat_arr[:,top_right_col_idx]
lat_col_valid_indices = np.where(lat_col >= -180.0)[0]
bottom_right_row_idx = lat_col_valid_indices[lat_col[lat_col_valid_indices].argmin()]
minlat = lat_col[bottom_right_row_idx]
top_right_row_idx = lat_col_valid_indices[lat_col[lat_col_valid_indices].argmax()]
maxlat = lat_col[top_right_row_idx]

topleft_corner = [minlon, maxlat]
topright_corner = [maxlon, maxlat]
bottomright_corner = [maxlon, minlat]
bottomleft_corner = [minlon, minlat]

geo_grid_corners = [
topleft_corner,
topright_corner,
bottomright_corner,
bottomleft_corner
]
logger.info('topleft:' f'{topleft_corner}'
'topright:' f'{topright_corner}'
',bottomright:' f'{bottomright_corner}'
',bottomleft:' f'{bottomleft_corner}')

return geo_grid_corners
flamingbear marked this conversation as resolved.
Show resolved Hide resolved

def add_bounds_variables(
dimensions_nc4: str,
required_dimensions: Set[str],
Expand Down Expand Up @@ -409,7 +538,8 @@ def get_dimension_indices_from_bounds(


def add_index_range(
variable_name: str, varinfo: VarInfoFromDmr, index_ranges: IndexRanges
variable_name: str, varinfo: VarInfoFromDmr, index_ranges: IndexRanges,
logger:Logger
) -> str:
owenlittlejohns marked this conversation as resolved.
Show resolved Hide resolved
"""Append the index ranges of each dimension for the specified variable.
If there are no dimensions with listed index ranges, then the full
Expand All @@ -422,16 +552,25 @@ def add_index_range(

"""
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

For this function: READ THIS COMMENT FIRST:

I think this is really overcomplicated. Fundamentally, all this function is doing is getting a list of dimensions for a variable. Then it is looking in a cache to see if there is an index range for any of the dimension names, before using that cache value to tack something on the end of a string.

Maybe I'm being naive, but it feels like really all you need is something to determine the correct list of variable dimensions, then all the rest of the logic (looking in the cache and appending strings to the end of the variable name) is the same. That latter stuff 100% does not need to be in the duplicated in multiple condition branches. It's making this function unnecessarily hard to read.

The other, really important comment: I am super suspicious of the bit where you are needing to reverse the order of the dimensions list. However that is derived, it should be reliably in the ordering of the variable axes. I'm wary that what this means is that you have found that for your particular use-case the "random" ordering in a set is predictable for the pseudo-dimensions you have for SMAP L3, and you can coincidentally impose the order you need coincidentally by reversing the set. I really think dimensions should not be passed around in sets, because you need that ordering. I strongly suspect this is the root cause of your issues with 3-D variables.

Copy link
Collaborator Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

The ordering and the shape are things I need to get from varinfo. I dont have that information. Updated DAS-2241 for this.

Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Maybe rephrasing a different comment: I'm very uncomfortable with the thought of merging code with known problems into main. Going back to one of the things mentioned in the TRT breakout last week:

Teams ought to have a high-quality, maintainable baseline without known defects (particularly in new features) that is deployable to the production environment at all times.

Instead of making a massive PR that does 80% of a lot of stuff, this probably should be submitted piecemeal, with each piece being made 100% solid in a series of smaller PRs.

If you and David feel we're at a point that we can't break the changes down in that way, then a compromise might be to update this PR to merge into a feature branch. Then once all of the multiple PRs are merged into the feature branch, the feature branch can be merged into main.

Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I think terminology has confused some of this discussion. The question of reversing is not one of reversing dimensions, but the lat/lon variables coming from the Coordinates attribute. The recommendation here is that the get_coordinates method itself should return in a standard order (reversed in case), based upon the variable name - and not using reverse as shown here.

I don't think this is a case of "Known Issue".

We are planning to release a version for NSIDC to start testing with, which may not handle the 3D cases or other issues, but this release should not break any existing use cases, and should not truly break, but simply not handle as desired. Incremental feature release is a tenet of agile development.

Copy link
Collaborator Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Changes are in commit 756f7c0

I have simplified the method. The order for SMAP is 'projected_y', 'projected_x'. The override section of the code is only used by SMAP at the moment. It can be generalized if we can get that order of dimensions from varinfo. I am not sure if the order of dimensions is used for other products currently handled by hoss.

Copy link
Collaborator Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

The use of sets for required_dimensions is based on how it is returned by varinfo and how it was originally in hoss before my changes. the bounds update requires the dimensions to be a set , It fails for a list

Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

There's a difference between a set of all dimensions for all required variables (as used in the prefetch function), which aggregates all Variable.dimensions attributes (which individually are lists), and the dimensions on an individual variable. Variable.dimensions is not a set for ordering purposes, it is a list.

With regards to bounds - we know that the pseudo-dimensions won't have a bounds attribute, so you might be better to not try to find bounds variables for them. Then you'll avoid any compatibility issues there.

Copy link
Collaborator Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

The add_index_range is a lot simpler now

variable = varinfo.get_variable(variable_name)

logger.info('variable name:' f'{variable_name}')
range_strings = []

for dimension in variable.dimensions:
dimension_range = index_ranges.get(dimension)

if dimension_range is not None and dimension_range[0] <= dimension_range[1]:
range_strings.append(f'[{dimension_range[0]}:{dimension_range[1]}]')
else:
range_strings.append('[]')
if(variable.dimensions == []):
for dimension in index_ranges.keys():
dimension_range = index_ranges.get(dimension)
logger.info('dimension=' f'{dimension}' ', dimension_range=' f'{dimension_range}')
if dimension_range is not None and dimension_range[0] <= dimension_range[1]:
range_strings.append(f'[{dimension_range[0]}:{dimension_range[1]}]')
else:
range_strings.append('[]')

else :
for dimension in variable.dimensions:
dimension_range = index_ranges.get(dimension)

if dimension_range is not None and dimension_range[0] <= dimension_range[1]:
range_strings.append(f'[{dimension_range[0]}:{dimension_range[1]}]')
else:
range_strings.append('[]')
flamingbear marked this conversation as resolved.
Show resolved Hide resolved

if all(range_string == '[]' for range_string in range_strings):
indices_string = ''
Expand Down
Loading