DAS-2219 - Better handle non-projectable variables #27

Open · wants to merge 7 commits into base: main · Changes from 5 commits
8 changes: 8 additions & 0 deletions CHANGELOG.md
@@ -1,5 +1,12 @@
# Changelog

## [v1.2.1] - 2024-11-26

### Changed

- [[DAS-2219](https://bugs.earthdata.nasa.gov/browse/DAS-2219)]
The Swath Projector has been updated to copy the science variables that fail reprojection to the output. Additionally, dimension variables that are referenced by output metadata variables are copied to the output. With this handling of failed reprojection of science variables in place, the configuration file has been modified to remove the TEMPO_O3TOT_L2 excluded variables.

## [v1.2.0] - 2024-10-10

### Changed
@@ -53,6 +60,7 @@ Repository structure changes include:
For more information on internal releases prior to NASA open-source approval,
see legacy-CHANGELOG.md.

[v1.2.1]: (https://github.com/nasa/harmony-swath-projector/releases/tag/1.2.1)
[v1.2.0]: (https://github.com/nasa/harmony-swath-projector/releases/tag/1.2.0)
[v1.1.1]: (https://github.com/nasa/harmony-swath-projector/releases/tag/1.1.1)
[v1.1.0]: (https://github.com/nasa/harmony-swath-projector/releases/tag/1.1.0)
2 changes: 1 addition & 1 deletion docker/service_version.txt

Member:

This feels like a backwards-compatible change, so I'd suggest a minor version bump to 1.3.0. Agree?

Collaborator Author:

Agree. Addressed in e8c19ae

@@ -1 +1 @@
1.2.0
1.2.1
21 changes: 0 additions & 21 deletions swath_projector/earthdata_varinfo_config.json
@@ -9,27 +9,6 @@
"VNP10": "VIIRS",
"TEMPO_O3TOT_L2": "TEMPO"
},
"ExcludedScienceVariables": [
{
"Applicability": {
"Mission": "TEMPO",
"ShortNamePath": "TEMPO_O3TOT_L2"
},
"VariablePattern": [
"/support_data/a_priori_layer_o3",
"/support_data/cal_adjustment",
"/support_data/dNdR",
"/support_data/layer_efficiency",
"/support_data/lut_wavelength",
"/support_data/N_value",
"/support_data/N_value_residual",
"/support_data/ozone_sensitivity_ratio",
"/support_data/step_1_N_value_residual",
"/support_data/step_2_N_value_residual",
"/support_data/temp_sensitivity_ratio"
]
}
],
"MetadataOverrides": [
{
"Applicability": {
8 changes: 6 additions & 2 deletions swath_projector/interpolation.py
@@ -53,17 +53,20 @@ def resample_all_variables(
temp_directory: str,
logger: Logger,
var_info: VarInfoFromNetCDF4,
) -> List[str]:
) -> Tuple[List[str], List[str]]:
"""Iterate through all science variables and reproject to the target
coordinate grid.

Returns:
output_variables: A list of names of successfully reprojected
variables.
failed_variables: A list of names of variables that failed
reprojection.
"""
output_extension = os.path.splitext(message_parameters['input_file'])[-1]
reprojection_cache = get_reprojection_cache(message_parameters)
output_variables = []
failed_variables = []

check_for_valid_interpolation(message_parameters, logger)

@@ -91,8 +94,9 @@ def resample_all_variables(
# other error conditions.
logger.error(f'Cannot reproject {variable}')
logger.exception(error)
failed_variables.append(variable)

Member:

This feels like a very impactful change to solve an issue with one collection. I think the important question here is: why do some of the TEMPO L2 science variables fail reprojection? What is distinct about them that makes them fail? And then, how do we detect that in a general sense, so we can handle those variables differently from how a "science variable" is handled?

Alternative implementations could include:

  1. Once it's pinned down what makes a variable viable/non-viable for projection beyond just being a "science variable", you could filter the list of science variables that earthdata-varinfo provides and chuck any non-viable variables into the metadata_variables set (or not if they shouldn't be copied over - see next comment in thread). See VarInfoBase.get_science_variables and VarInfoBase.is_science_variable for current definitions of a "science variable".
  2. You could identify if there is a specific exception raised for science variables that can't be projected but should be copied over into the output, and handle that with a separate except block here. So something like:
try:
    <all the stuff currently in try>
except SpecificExceptionMeaningCopyVariableIntoOutput as error:
    logger.warning(f'Cannot reproject {variable}, will copy into output')
    variables_to_copy.append(variable)
except Exception as error:
    logger.error(f'Cannot reproject {variable}')
    logger.exception(error)

There are probably other ways to do this but, of the two suggestions above, the cleaner approach feels like (1) for a couple of reasons. Firstly, it feels like these things that will fail perhaps aren't science variables after all. Secondly, and perhaps more importantly, reprojection is a computationally expensive operation. It would be more efficient to not even try to reproject things that are going to fail.
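
Option (1) amounts to partitioning the candidate variables before any resampling is attempted. A minimal sketch of that idea follows; the predicate used here (checking that a variable's trailing dimensions match the swath's spatial dimensions) is illustrative only, and is not an existing earthdata-varinfo API:

```python
def partition_science_variables(science_variables, variable_dimensions, swath_dimensions):
    """Split candidate variables into projectable and non-projectable sets.

    `variable_dimensions` maps each variable name to its dimension tuple;
    `swath_dimensions` is the pair of spatial dimensions the resampler
    expects. The viability check below is a stand-in for whatever
    general-purpose predicate gets pinned down.
    """
    projectable = set()
    non_projectable = set()
    for variable in science_variables:
        dimensions = variable_dimensions[variable]
        # A variable is considered viable only if its two trailing
        # dimensions are the swath's spatial dimensions:
        if tuple(dimensions[-2:]) == tuple(swath_dimensions):
            projectable.add(variable)
        else:
            non_projectable.add(variable)
    return projectable, non_projectable
```

The non-projectable set could then be merged into the metadata variables (or dropped), so the expensive resampling loop only ever sees variables expected to succeed.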

Member:

Backing up a little bit...

It feels a bit weird to now have an implementation that could put swath variables into a gridded output file. Beforehand we were producing something that should be an entirely gridded file (with the addition of metadata variables that did not have spatial dimensions). Now this opens us up to having mixed output with gridded and swath-based variables, so we'd be producing a L3-but-also-possibly-partially-L2 file.

Thinking about the specific variables from the ExcludedScienceVariables above - is something like /support_data/cal_adjustment actually useful in the output if it's still a swath, but all the actual science variables are gridded? Someone with such an output file wouldn't be able to just easily map from the pixels in /support_data/cal_adjustment to the pixels in the grid. How would they use these swath variables?

Collaborator Author:

@autydp David can you weigh in on this?

D-Auty (Dec 31, 2024):

I don’t think this is as much of an issue as noted. It seems an almost unanswerable philosophical question: what is the result of a gridded data file when some of the datasets cannot be gridded? For example, when a variable's dimensions (in particular, those determined to be spatially locating) do not match the lat/lon dimensions, and therefore do not match the mapping data created for projecting the data to the target grid, we are left with data that cannot be projected. (Similarly for non-numeric datasets, but that is probably a separate philosophical discussion; it is possible to reproject string data using nearest-neighbor or EWA-NN, though it requires special handling.)

(Note that at the moment there are some 3D datasets that cannot be projected until we address the proper handling of 3D and/or multi-dimensional data beyond the 2D spatial dimensions - a future ticket, I think the next in our backlog. Following that fix, the number of unprojected datasets should come down a fair bit).

What should we do with such data - either not include in the output, or - I propose copy as is, with coordinate and dimension references as in the source data file. The end result will have a mix of gridded and original source data. It is not unlike the treatment of non-spatially aligned data in a spatial subset, or the “metadata” datasets which are not spatially aligned and are often copied as is, even though some of that metadata may no longer be valid after the subsetting has been applied. Unfortunately, I don’t think we can determine the relevance of non-projectable variables to the end-result, so I think better safe than sorry.

These questions are not resolvable in an absolute sense of right or wrong. Even consistency may be argued, but I imagine the policy of processing what we can and copying all else “as-is”, is most in keeping with how we handle other processing requests.

Perhaps the question here is should we first eliminate the known issues, vs. simply casting all projection failures into the non-projectable bucket - and perhaps missing some kind of inadvertent error in programming that should have enabled this variable to be projected. I suspect that analysis and programmatic implementation is less relevant to the end-user than simply the note that this variable was not projected. I agree this should be noted, perhaps with some known issues noted as well (dimensional mismatch, non-numeric data type), but needn’t be exhaustive. Ultimately, the question comes down to what do we include in messages about the request, and what happens to the non-projected data. The latter question is addressed, I think, by consistency with all other non-projected data.

Member:

Sorry to say, I disagree a bit here. I'm trying to look at this from the perspective of an end-user, and I think the current implementation really sets things up for confusion for someone making requests:

  1. Including swath data in a gridded output. Firstly, that's just downright confusing. Secondly, that information is unusable. That end-user can't relate the pixels in the swath variables to corresponding grid pixels. Further, I can definitely imagine the issues that would arise if we (TRT or DAS) were asked to process a file that was partially swath and partially a grid - we'd be very frustrated with the upstream creation of that file. If we would find a mixed swath/grid file difficult to handle, then we can probably expect that downstream processing will have similar problems.
  2. We are silently handling a failure. An end-user can request to reproject a specific swath variable that will not be able to be projected. The Swath Projector will just copy that variable over in a swath format, and the request will be deemed a "success". The only place that we record that we just copied the variable is a log message, and most end-users don't have access to those Harmony log messages (and won't think to look at them, because the request will be "successful").
  3. The implementation as it is doesn't just silently fail for this specific issue. It swallows all failures in the projection step. This indiscriminate swallowing of failure feels particularly off to me.

The ideal case is to find out why these particular variables are failing and address the underlying reason but, assuming that's not possible, my gut instinct is that an explicit failure is a far better UX than doing something that says it was successful, only to supply the end-user with unexpected/unusable content in their request output. Indeed, if they explicitly requested the projection of only one of the non-projectable variables, it seems incorrect to just copy that variable over as a swath and say we were successful.

(Definitely a personal take on behaviour, and maybe we need to run it past some user-needs folks to work out what their users would expect)

Comment:

I think we need to “take this offline”; it does not feel like something we can resolve in a pull request.

Unfortunately, we do not have clean data to begin with, so expecting clean results is perhaps asking too much. The source data is not cleanly defined as just swaths, and it lacks the sort of self-describing metadata that allows a clear interpretation without, e.g., an ATBD reference and often considerable analysis beyond that. Removing the non-compliant data is potentially just as invalid as leaving it in.

I’m afraid what you are asking for is way beyond the scope of this ticket.

Member:

I agree we should take this offline, but I think that means that we should not do this ticket and/or merge this PR until we've completed those discussions. I don't think there is consensus here that the proposed implementation in this PR is an improvement on the previous implementation (previously, the problematic variables were indicated as not science variables via the earthdata-varinfo configuration file).

What I would likely propose offline is:

  1. Don't silently fail (i.e., re-raise the exception in the block that currently swallows all failures in projection - that's a one line code change).
  2. Diagnose why these variables are failing to determine if there's a way they might be able to be projected (probably via a research spike ticket).
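
The one-line change in proposal (1) would turn the silent copy into an explicit failure. A sketch of a re-raising `except` block follows; the helper name and structure are illustrative, not the actual Swath Projector code:

```python
import logging

logger = logging.getLogger(__name__)


def resample_or_fail(variable, resample_one):
    """Attempt to resample one variable; log and re-raise on failure.

    Re-raising makes the whole request fail loudly instead of silently
    copying swath data into a gridded output.
    """
    try:
        resample_one(variable)
    except Exception as error:
        logger.error(f'Cannot reproject {variable}')
        logger.exception(error)
        raise  # the proposed one-line change: propagate the failure
```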


return output_variables
return output_variables, failed_variables


def resample_variable(
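
For contrast, the accumulate-and-continue behaviour this PR currently implements in `resample_all_variables` can be reduced to the following standalone sketch (names simplified from the diff above):

```python
import logging

logger = logging.getLogger(__name__)


def resample_all(variables, resample_one):
    """Resample each variable, collecting failures instead of raising.

    Mirrors the two-list return added in this PR: successfully
    reprojected names in the first list, failed names in the second.
    """
    output_variables = []
    failed_variables = []
    for variable in variables:
        try:
            resample_one(variable)
            output_variables.append(variable)
        except Exception as error:
            logger.error(f'Cannot reproject {variable}')
            logger.exception(error)
            failed_variables.append(variable)
    return output_variables, failed_variables
```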
18 changes: 17 additions & 1 deletion swath_projector/nc_merge.py
@@ -240,14 +240,30 @@ def set_metadata_dimensions(
"""Iterate through the dimensions of the metadata variable, and ensure
that all are present in the reprojected output file. This function is
necessary if any of the metadata variables, that aren't to be projected
use the swath-based dimensions from the input granule.
use the swath-based dimensions from the input granule. If the dimension
exists as a variable in the source file, copy it to the output file.

"""
for dimension in source_dataset[metadata_variable].dimensions:
if dimension not in output_dataset.dimensions:
output_dataset.createDimension(
dimension, source_dataset.dimensions[dimension].size
)
if dimension in source_dataset.variables:

Member:

This is a good catch - definitely a previous oversight!

attributes = read_attrs(source_dataset[dimension])
fill_value = get_fill_value_from_attributes(attributes)

output_dataset.createVariable(
dimension,
source_dataset[dimension].datatype,
dimensions=source_dataset[dimension].dimensions,
fill_value=fill_value,
zlib=True,
complevel=6,
)

output_dataset[dimension][:] = source_dataset[dimension][:]
output_dataset[dimension].setncatts(attributes)


def copy_metadata_variable(
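
The `get_fill_value_from_attributes` helper used in the new dimension-copying code is not shown in this diff. A plausible minimal version, assuming (as netCDF4 requires) that the fill value must go through the `fill_value` keyword of `createVariable` and must not reappear in the later `setncatts` call:

```python
def get_fill_value_from_attributes(variable_attributes: dict):
    """Remove and return `_FillValue` from a variable's attributes.

    The attribute is popped, not just read, so the dictionary handed to
    `setncatts` afterwards no longer contains `_FillValue`.
    """
    return variable_attributes.pop('_FillValue', None)
```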
5 changes: 3 additions & 2 deletions swath_projector/reproject.py
@@ -69,7 +69,7 @@ def reproject(

# Loop through each dataset and reproject
logger.debug('Using pyresample for reprojection.')
outputs = resample_all_variables(
outputs, failed_variables = resample_all_variables(
D-Auty marked this conversation as resolved.
parameters, science_variables, temp_dir, logger, var_info
)

@@ -78,11 +78,12 @@

# Now merge outputs (unless we only have one)
metadata_variables = var_info.get_metadata_variables()
metadata_variables.update(failed_variables)
nc_merge.create_output(
parameters,
output_file,
temp_dir,
science_variables,
outputs,
metadata_variables,
logger,
var_info,
Binary file modified tests/data/VNL2_test_data.nc
10 changes: 6 additions & 4 deletions tests/unit/test_interpolation.py
@@ -91,11 +91,11 @@ def test_resample_all_variables(self, mock_resample_variable):
self.var_info,
)

expected_output = ['/red_var', '/green_var', '/blue_var', '/alpha_var']
expected_output = (['/red_var', '/green_var', '/blue_var', '/alpha_var'], [])
self.assertEqual(output_variables, expected_output)
self.assertEqual(mock_resample_variable.call_count, 4)

for variable in expected_output:
for variable in expected_output[0]:
variable_output_path = f'/tmp/01234{variable}.nc'
mock_resample_variable.assert_any_call(
parameters,
@@ -125,11 +125,13 @@ def test_resample_single_exception(self, mock_resample_variable):
self.var_info,
)

expected_output = ['/green_var', '/blue_var', '/alpha_var']
reprojectable_variables = ['/green_var', '/blue_var', '/alpha_var']
non_reprojectable_variables = ['/red_var']
expected_output = (reprojectable_variables, non_reprojectable_variables)
self.assertEqual(output_variables, expected_output)
self.assertEqual(mock_resample_variable.call_count, 4)

all_variables = expected_output + ['/red_var']
all_variables = reprojectable_variables + non_reprojectable_variables

for variable in all_variables:
variable_output_path = f'/tmp/01234{variable}.nc'
10 changes: 7 additions & 3 deletions tests/unit/test_nc_merge.py
@@ -42,7 +42,7 @@ def setUpClass(cls):
'/wind_speed',
}

cls.metadata_variables = set()
cls.metadata_variables = {'/fake_var'}
cls.var_info = VarInfoFromNetCDF4(
cls.properties['input_file'],
short_name='VIIRS_NPP-NAVO-L2P-v3.0',
@@ -70,9 +70,13 @@ def test_output_has_all_variables(self):
for expected_variable in self.science_variables:
self.assertIn(expected_variable.lstrip('/'), output_dataset.variables)

# Output also has a CRS grid_mapping variable, and three dimensions:
# Output has all projected metadata variables:
for expected_variable in self.metadata_variables:
self.assertIn(expected_variable.lstrip('/'), output_dataset.variables)

# Output also has a CRS grid_mapping variable, and four dimensions:
self.assertIn('latitude_longitude', output_dataset.variables)
for expected_dimension in {'lat', 'lon', 'time'}:
for expected_dimension in {'lat', 'lon', 'time', 'fake_dim'}:
self.assertIn(expected_dimension, output_dataset.variables)

def test_same_dimensions(self):