
DAS-2219 - Better handle non-projectable variables #27

Open: wants to merge 7 commits into base: main

Conversation

@joeyschultz (Collaborator) commented Dec 18, 2024

Description

This PR updates the Swath Projector to copy the science variables that fail reprojection to the output. Additionally, dimension variables that are referenced by output metadata variables are copied to the output. These changes were initiated to support TEMPO_O3TOT_L2.

Jira Issue ID

DAS-2219

Local Test Steps

Checkout this branch, build the docker images and run the test image.
./bin/build-image && ./bin/build-test && ./bin/run-test

Ensure the tests pass
Ran 92 tests in 63.965s OK

Start up Harmony-In-A-Box. If you already have it running, reload Harmony-In-A-Box to pick up this latest image. During my testing, the swath-projector container exceeded its memory limit. To avoid this error, add SWATH_PROJECTOR_LIMITS_MEMORY=2Gi to your .env file in harmony and then reload Harmony-In-A-Box:

./bin/restart-services
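If it helps, the two steps above might look like this in a shell session (this is a sketch; the location of your local harmony checkout is an assumption):

```shell
# From your local harmony repository checkout (path assumed), append the
# memory limit override for the swath-projector container to .env:
echo 'SWATH_PROJECTOR_LIMITS_MEMORY=2Gi' >> .env

# Reload Harmony-In-A-Box so the new limit takes effect:
./bin/restart-services
```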

Download the "TEMPO_O3TOT_L2_demo.ipynb" notebook from DAS-2219 and run it to test the swath-projector in Harmony-In-A-Box. The notebook should run successfully.

Open the output file in Panoply and inspect it to ensure the following variables have been copied over from the source file correctly:

Non-projectable variables:

  • /support_data/a_priori_layer_o3
  • /support_data/cal_adjustment
  • /support_data/dNdR
  • /support_data/layer_efficiency
  • /support_data/lut_wavelength
  • /support_data/N_value
  • /support_data/N_value_residual
  • /support_data/ozone_sensitivity_ratio
  • /support_data/step_1_N_value_residual
  • /support_data/step_2_N_value_residual
  • /support_data/temp_sensitivity_ratio

Dimension variables referenced by other variables:

  • /mirror_step
  • /xtrack

PR Acceptance Checklist

  • Jira ticket acceptance criteria met.
  • CHANGELOG.md updated to include high level summary of PR changes.
  • docker/service_version.txt updated if publishing a release.
  • Tests added/updated and passing.
  • Documentation updated (if needed).

@joeyschultz requested review from owenlittlejohns, lyonthefrog and flamingbear, and removed the request for lyonthefrog, December 18, 2024 14:25
@flamingbear (Member) left a comment:

This looks pretty complete, nice work.

I think it should be a minor version, though.

Also, I was checking to make sure your test coverage didn't drop, and I see that coverage reporting is broken in this repository. Can you change this line at https://github.com/nasa/harmony-swath-projector/blob/DAS-2219/tests/run_tests.sh#L22
to coverage html --omit="tests/*" -d /home/tests/coverage (adding an 's' to test)?

Member:

This feels like a backwards-compatible change, so I'd suggest a minor bump to 1.3.0. Agree?

Collaborator (Author):

Agree. Addressed in e8c19ae

(Review thread on swath_projector/reproject.py: outdated, resolved)
@owenlittlejohns (Member) left a comment:

The code itself looks good, but I had a few questions on the general methodology. Essentially it boils down to a few points:

  1. Putting swath variables in a gridded output feels wonky, and is potentially confusing for users.
  2. What value is there in putting these swath variables in a gridded output? How could they be used?
  3. It would probably be cleaner to find some property of a non-projectable "science variable", and adjust the science_variables and metadata_variables sets defined in reproject.py, instead of asking the service to try and project something that will fail. (Hopefully, this is something that can be generic and not bespoke to TEMPO L2 data)

@@ -19,7 +19,7 @@ fi

 echo "Test Coverage Estimates"
 coverage report --omit="tests/*"
-coverage html --omit="tests/*" -d /home/test/coverage
+coverage html --omit="tests/*" -d /home/tests/coverage
Member:

Good catch! (Historically the tests directory was called test, but I tweaked it for consistency when migrating to GitHub. There was always going to be one spot I missed in the renaming.)

@@ -91,8 +94,9 @@ def resample_all_variables(
 # other error conditions.
 logger.error(f'Cannot reproject {variable}')
 logger.exception(error)
+failed_variables.append(variable)
Member:

This feels like a very impactful change to solve an issue with one collection. I think the important question here is why do some of the TEMPO L2 science variables fail reprojection? What is distinct about them that makes them fail? And then... how do we detect that in a general sense, so we can handle those variables differently to what happens for a "science variable".

Alternative implementations could include:

  1. Once it's pinned down what makes a variable viable/non-viable for projection beyond just being a "science variable", you could filter the list of science variables that earthdata-varinfo provides and chuck any non-viable variables into the metadata_variables set (or not if they shouldn't be copied over - see next comment in thread). See VarInfoBase.get_science_variables and VarInfoBase.is_science_variable for current definitions of a "science variable".
  2. You could identify if there is a specific exception raised for science variables that can't be projected but should be copied over into the output, and handle that with a separate except block here. So something like:
try:
    <all the stuff currently in try>
except SpecificExceptionMeaningCopyVariableIntoOutput as error:
    logger.warning(f'Cannot reproject {variable}, will copy into output')
    variables_to_copy.append(variable)
except Exception as error:
    logger.error(f'Cannot reproject {variable}')
    logger.exception(error)

There are probably other ways to do this but, of the two suggestions above, the cleaner approach feels like (1) for a couple of reasons. Firstly, it feels like these things that will fail perhaps aren't science variables after all. Secondly, and perhaps more importantly, reprojection is a computationally expensive operation. It would be more efficient to not even try to reproject things that are going to fail.
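Approach (1) could be sketched roughly as follows. This is a minimal, hypothetical illustration, not the actual earthdata-varinfo API or service code: the `partition_projectable` function, the `dims_of` mapping, the example variable names, and the `swath_dims` set are all assumptions. A real implementation would derive each variable's dimensions from VarInfo rather than a hard-coded dictionary.

```python
def partition_projectable(science_variables, dims_of, swath_dims):
    """Split candidate science variables into those whose dimensions
    include the swath's spatial dimensions (projectable) and those that
    should instead be treated as metadata and copied through as-is.
    """
    projectable = set()
    copy_as_metadata = set()
    for variable in science_variables:
        if swath_dims.issubset(dims_of[variable]):
            projectable.add(variable)
        else:
            copy_as_metadata.add(variable)
    return projectable, copy_as_metadata


# Hypothetical dimensions for two TEMPO-like variables:
dims_of = {
    '/product/ozone_column': {'mirror_step', 'xtrack'},
    '/support_data/lut_wavelength': {'wavelength'},
}
projectable, to_copy = partition_projectable(
    set(dims_of), dims_of, {'mirror_step', 'xtrack'}
)
```

Filtering this way up front would also avoid spending the cost of an attempted reprojection on variables that are known in advance to fail.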

Member:

Backing up a little bit...

It feels a bit weird to now have an implementation that could put swath variables into a gridded output file. Beforehand we were producing something that should be an entirely gridded file (with the addition of metadata variables that did not have spatial dimensions). Now this opens us up to having mixed output with gridded and swath-based variables, so we'd be producing a L3-but-also-possibly-partially-L2 file.

Thinking about the specific variables from the ExcludedScienceVariables above - is something like /support_data/cal_adjustment actually useful in the output if it's still a swath, but all the actual science variables are gridded? Someone with such an output file wouldn't be able to just easily map from the pixels in /support_data/cal_adjustment to the pixels in the grid. How would they use these swath variables?

Collaborator (Author):

@autydp David can you weigh in on this?

@D-Auty commented Dec 31, 2024:

I don’t think this is as much of an issue as noted. It seems an almost unanswerable philosophical question: what is the result of a gridded data file when some of the datasets cannot be gridded? For example, when a variable's dimensions (in particular, those determined to be spatially locating) do not match the lat/lon dimensions, and therefore the mapping data created for projecting the data to the target grid, we are left with data that cannot be projected. (Similarly for non-numeric datasets, though that is probably a separate philosophical discussion; it is possible to reproject string data using nearest-neighbour or EWA-NN, though it requires special handling.)

(Note that at the moment there are some 3D datasets that cannot be projected until we address the proper handling of 3D and/or multi-dimensional data beyond the 2D spatial dimensions - a future ticket, I think the next in our backlog. Following that fix, the number of unprojected datasets should come down a fair bit).

What should we do with such data? Either exclude it from the output or, as I propose, copy it as-is, with coordinate and dimension references as in the source data file. The end result will have a mix of gridded and original source data. It is not unlike the treatment of non-spatially-aligned data in a spatial subset, or of the “metadata” datasets which are not spatially aligned and are often copied as-is, even though some of that metadata may no longer be valid after the subsetting has been applied. Unfortunately, I don’t think we can determine the relevance of non-projectable variables to the end result, so I think better safe than sorry.

These questions are not resolvable in an absolute sense of right or wrong. Even consistency may be argued, but I imagine the policy of processing what we can and copying all else “as-is”, is most in keeping with how we handle other processing requests.

Perhaps the question here is whether we should first eliminate the known issues, vs. simply casting all projection failures into the non-projectable bucket, and perhaps missing some inadvertent programming error that should have allowed a variable to be projected. I suspect that analysis and its programmatic implementation are less relevant to the end-user than a simple note that the variable was not projected. I agree this should be noted, perhaps with some known issues called out as well (dimensional mismatch, non-numeric data type), but it needn’t be exhaustive. Ultimately, the question comes down to what we include in messages about the request, and what happens to the non-projected data. The latter question is addressed, I think, by consistency with all other non-projected data.

Member:

Sorry to say, I disagree a bit here. I'm trying to look at this from the perspective of an end-user, and I think the current implementation really sets things up for confusion for someone making requests:

  1. Including swath data in a gridded output. Firstly, that's just downright confusing. Secondly, that information is unusable. That end-user can't relate the pixels in the swath variables to corresponding grid pixels. Further, I can definitely imagine the issues that would arise if we (TRT or DAS) were asked to process a file that was partially swath and partially a grid - we'd be very frustrated with the upstream creation of that file. If we would find a mixed swath/grid file difficult to handle, then we can probably expect that downstream processing will have similar problems.
  2. We are silently handling a failure. An end-user can request to reproject a specific swath variable that will not be able to be projected. The Swath Projector will just copy that variable over in a swath format, and the request will be deemed a "success". The only place that we record that we just copied the variable is a log message, and most end-users don't have access to those Harmony log messages (and won't think to look at them, because the request will be "successful").
  3. The implementation as it is doesn't just silently fail for this specific issue. It swallows all failures in the projection step. This indiscriminate swallowing of failure feels particularly off to me.

The ideal case is to find out why these particular variables are failing and address the underlying reason but, assuming that's not possible, my gut instinct is that an explicit failure is a far better UX than doing something that says it was successful, only to supply the end-user with unexpected/unusable content in their request output. Indeed, if they explicitly requested the projection of only one of the non-projectable variables, it seems incorrect to just copy that variable over as a swath and say we were successful.

(Definitely a personal take on behaviour, and maybe we need to run it past some user-needs folks to work out what their users would expect)

Reply:

I think we need to “take this offline”. It does not feel like something we can resolve in a pull request.

Unfortunately, we do not have clean data to begin with, so expecting clean results is perhaps asking too much. The source data is not cleanly defined as just swaths and is without the sort of self-describing metadata that allows a clear interpretation without e.g., ATBD reference and often considerable analysis beyond. Unfortunately, removing the non-compliant data is potentially just as invalid as leaving it in.

I’m afraid what you are asking for is way beyond the scope of this ticket.

Member:

I agree we should take this offline, but I think that means that we should not do this ticket and/or merge this PR until we've completed those discussions. I don't think there is consensus here that the proposed implementation in this PR is an improvement on the previous implementation (previously, the problematic variables were indicated as not science variables via the earthdata-varinfo configuration file).

What I would likely propose offline is:

  1. Don't silently fail (i.e., re-raise the exception in the block that currently swallows all failures in projection - that's a one line code change).
  2. Diagnose why these variables are failing to determine if there's a way they might be able to be projected (probably via a research spike ticket).
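Suggestion (1), re-raising after logging, might look like this minimal sketch. The function names, the exception type, and the variable path are illustrative placeholders, not the actual service code:

```python
import logging

logger = logging.getLogger(__name__)


def resample_variable(variable):
    """Placeholder standing in for the real reprojection call."""
    raise ValueError(f'No matching dimensions for {variable}')


def resample_with_propagation(variable):
    try:
        resample_variable(variable)
    except Exception as error:
        logger.error(f'Cannot reproject {variable}')
        logger.exception(error)
        raise  # Propagate instead of silently swallowing the failure.
```

With the bare `raise`, the failure is still logged for operators, but the request as a whole fails loudly rather than reporting success with unexpected content.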


"""
for dimension in source_dataset[metadata_variable].dimensions:
if dimension not in output_dataset.dimensions:
output_dataset.createDimension(
dimension, source_dataset.dimensions[dimension].size
)
if dimension in source_dataset.variables:
Member:

This is a good catch - definitely a previous oversight!
