DAS-2219 - Better handle non-projectable variables #27
base: main

Commits: 4f3dd3a, d53fd81, a9cda68, 4c31cd2, 38892d6, e8c19ae, 946b3a3
(The diff below shows changes from the first 5 commits.)
```diff
@@ -1 +1 @@
-1.2.0
+1.2.1
```
```diff
@@ -53,17 +53,20 @@ def resample_all_variables(
     temp_directory: str,
     logger: Logger,
     var_info: VarInfoFromNetCDF4,
-) -> List[str]:
+) -> Tuple[List[str], List[str]]:
     """Iterate through all science variables and reproject to the target
     coordinate grid.

     Returns:
         output_variables: A list of names of successfully reprojected
             variables.
+        failed_variables: A list of names of variables that failed
+            reprojection.
     """
     output_extension = os.path.splitext(message_parameters['input_file'])[-1]
     reprojection_cache = get_reprojection_cache(message_parameters)
     output_variables = []
+    failed_variables = []

     check_for_valid_interpolation(message_parameters, logger)
```
```diff
@@ -91,8 +94,9 @@ def resample_all_variables(
             # other error conditions.
             logger.error(f'Cannot reproject {variable}')
             logger.exception(error)
+            failed_variables.append(variable)
```
Review thread on `failed_variables.append(variable)`:

This feels like a very impactful change to solve an issue with one collection. I think the important question here is: why do some of the TEMPO L2 science variables fail reprojection? What is distinct about them that makes them fail? And then, how do we detect that in a general sense, so we can handle those variables differently to what happens for a "science variable"? Alternative implementations could include:

There are probably other ways to do this but, of the two suggestions above, the cleaner approach feels like (1) for a couple of reasons. Firstly, it feels like these things that will fail perhaps aren't science variables after all. Secondly, and perhaps more importantly, reprojection is a computationally expensive operation. It would be more efficient to not even try to reproject things that are going to fail.

Backing up a little bit... It feels a bit weird to now have an implementation that could put swath variables into a gridded output file. Beforehand we were producing something that should be an entirely gridded file (with the addition of metadata variables that did not have spatial dimensions). Now this opens us up to having mixed output with gridded and swath-based variables, so we'd be producing an L3-but-also-possibly-partially-L2 file. Thinking about the specific variables from the …

@autydp David, can you weigh in on this?

I don't think this is as much of an issue as noted. It seems almost an unanswerable philosophical question: what is the result of a gridded data file when some of the datasets cannot be gridded? For example, when the variable dimensions (in particular, those determined to be spatially locating) do not match the lat/lon dimensions, and therefore do not match the mapping data created for projecting the data to the target grid, we are left with data that cannot be projected. (Similarly for non-numeric datasets, but that is probably a separate philosophical discussion; e.g., it is possible to reproject string data using nearest-neighbor or EWA-NN, though it requires special handling.)

(Note that at the moment there are some 3D datasets that cannot be projected until we address the proper handling of 3D and/or multi-dimensional data beyond the 2D spatial dimensions; a future ticket, I think the next in our backlog. Following that fix, the number of unprojected datasets should come down a fair bit.)

What should we do with such data? Either not include it in the output, or, as I propose, copy it as-is, with coordinate and dimension references as in the source data file. The end result will have a mix of gridded and original source data. It is not unlike the treatment of non-spatially-aligned data in a spatial subset, or the "metadata" datasets which are not spatially aligned and are often copied as-is, even though some of that metadata may no longer be valid after the subsetting has been applied. Unfortunately, I don't think we can determine the relevance of non-projectable variables to the end result, so I think better safe than sorry.

These questions are not resolvable in an absolute sense of right or wrong. Even consistency may be argued, but I imagine the policy of processing what we can and copying all else "as-is" is most in keeping with how we handle other processing requests. Perhaps the question here is: should we first eliminate the known issues, versus simply casting all projection failures into the non-projectable bucket, and perhaps miss some kind of inadvertent programming error that should have enabled a variable to be projected? I suspect that analysis and its programmatic implementation are less relevant to the end-user than simply a note that the variable was not projected. I agree this should be noted, perhaps with some known issues noted as well (dimensional mismatch, non-numeric data type), but it needn't be exhaustive. Ultimately, the question comes down to what we include in messages about the request, and what happens to the non-projected data. The latter question is addressed, I think, by consistency with all other non-projected data.

Sorry to say, I disagree a bit here. I'm trying to look at this from the perspective of an end-user, and I think the current implementation really sets things up for confusion for someone making requests:

The ideal case is to find out why these particular variables are failing and address the underlying reason but, assuming that's not possible, my gut instinct is that an explicit failure is a far better UX than doing something that says it was successful, only to supply the end-user with unexpected or unusable content in their request output. Indeed, if they explicitly requested the projection of only one of the non-projectable variables, it seems incorrect to just copy that variable over as a swath and say we were successful. (Definitely a personal take on behaviour, and maybe we need to run it past some user-needs folks to work out what their users would expect.)

I think we need to "take this offline"; it does not feel like something we can resolve in a pull request. Unfortunately, we do not have clean data to begin with, so expecting clean results is perhaps asking too much. The source data is not cleanly defined as just swaths, and it lacks the sort of self-describing metadata that allows a clear interpretation without, e.g., ATBD reference and often considerable analysis beyond that. Unfortunately, removing the non-compliant data is potentially just as invalid as leaving it in. I'm afraid what you are asking for is well beyond the scope of this ticket.

I agree we should take this offline, but I think that means we should not do this ticket and/or merge this PR until we've completed those discussions. I don't think there is consensus here that the proposed implementation in this PR is an improvement on the previous implementation (previously, the problematic variables were indicated as not science variables via the …). What I would likely propose offline is:
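To make the "detect it up front" suggestion concrete, here is a rough, hypothetical sketch of a pre-check that screens out the two known issues David lists (dimensional mismatch and non-numeric data types) before attempting an expensive reprojection. The function name, signature, and heuristics are illustrative assumptions, not code from this PR or the repository.

```python
# Hypothetical sketch only; not code from this PR.
import numpy as np
from netCDF4 import Dataset


def is_projectable(dataset: Dataset, variable_name: str,
                   swath_dimensions: set) -> bool:
    """Return True if a variable looks safe to send to the resampler."""
    variable = dataset[variable_name]

    # Known issue 1: dimensional mismatch. The variable must use the
    # spatially locating (swath) dimensions from which the lat/lon
    # mapping to the target grid was derived.
    if not swath_dimensions.issubset(set(variable.dimensions)):
        return False

    # Known issue 2: non-numeric data types (e.g., strings) cannot be
    # interpolated by the standard methods without special handling.
    if not np.issubdtype(variable.dtype, np.number):
        return False

    return True
```

With swath dimensions like `{'mirror_step', 'xtrack'}` (an assumed example), a 1-D time variable or a string-typed metadata variable would be screened out and copied as-is or skipped, rather than being sent to the resampler only to fail.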
```diff
-    return output_variables
+    return output_variables, failed_variables


 def resample_variable(
```
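For readers skimming the diff, this is the shape of the change: failures are now collected instead of only logged, and the function returns both lists. A minimal, self-contained sketch of that pattern follows; the `reproject` stand-in and the caller-side handling are assumptions for illustration, not the repository's actual resampling code.

```python
# Illustrative sketch of the success/failure collection pattern in the diff.
from logging import Logger, getLogger
from typing import List, Tuple


def resample_all_variables_sketch(
    variables: List[str], logger: Logger
) -> Tuple[List[str], List[str]]:
    """Collect successes and failures rather than aborting on first error."""
    output_variables = []
    failed_variables = []

    for variable in variables:
        try:
            reproject(variable)  # hypothetical stand-in for the real resampler
            output_variables.append(variable)
        except Exception as error:
            logger.error(f'Cannot reproject {variable}')
            logger.exception(error)
            failed_variables.append(variable)

    return output_variables, failed_variables


def reproject(variable: str) -> None:
    """Hypothetical stand-in that fails for swath-incompatible input."""
    if variable.startswith('bad_'):
        raise ValueError(f'{variable} does not share the swath dimensions')


successful, failed = resample_all_variables_sketch(
    ['science_var', 'bad_var'], getLogger(__name__)
)
# successful == ['science_var'], failed == ['bad_var']
```

A caller receiving the tuple can then decide, per the review discussion above, whether failed variables should be copied as-is, reported in the request message, or treated as a hard error.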
```diff
@@ -240,14 +240,30 @@ def set_metadata_dimensions(
     """Iterate through the dimensions of the metadata variable, and ensure
     that all are present in the reprojected output file. This function is
     necessary if any of the metadata variables, that aren't to be projected
-    use the swath-based dimensions from the input granule.
+    use the swath-based dimensions from the input granule. If the dimension
+    exists as a variable in the source file, copy it to the output file.

     """
     for dimension in source_dataset[metadata_variable].dimensions:
         if dimension not in output_dataset.dimensions:
             output_dataset.createDimension(
                 dimension, source_dataset.dimensions[dimension].size
             )
+            if dimension in source_dataset.variables:
+                attributes = read_attrs(source_dataset[dimension])
+                fill_value = get_fill_value_from_attributes(attributes)
+
+                output_dataset.createVariable(
+                    dimension,
+                    source_dataset[dimension].datatype,
+                    dimensions=source_dataset[dimension].dimensions,
+                    fill_value=fill_value,
+                    zlib=True,
+                    complevel=6,
+                )
+
+                output_dataset[dimension][:] = source_dataset[dimension][:]
+                output_dataset[dimension].setncatts(attributes)


 def copy_metadata_variable(
```

Review comment on `if dimension in source_dataset.variables:`:

This is a good catch; definitely a previous oversight!
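As a standalone illustration of the new branch, the sketch below copies a dimension variable between netCDF4 datasets. The bodies of `read_attrs` and `get_fill_value_from_attributes` are my assumptions about those helpers (reading all attributes into a dict, and popping `_FillValue` so it can be passed at creation time), not the repository's actual implementations.

```python
from netCDF4 import Dataset, Variable


def read_attrs(variable: Variable) -> dict:
    """Assumed helper: read all of a variable's attributes into a dict."""
    return {attr: variable.getncattr(attr) for attr in variable.ncattrs()}


def get_fill_value_from_attributes(attributes: dict):
    """Assumed helper: remove and return _FillValue, which netCDF4 only
    accepts at variable creation time (not via setncatts)."""
    return attributes.pop('_FillValue', None)


def copy_dimension_variable(source_dataset: Dataset,
                            output_dataset: Dataset,
                            dimension: str) -> None:
    """Copy a dimension variable to the output file, as in the diff above.

    Assumes the dimension itself was already created with createDimension.
    """
    attributes = read_attrs(source_dataset[dimension])
    fill_value = get_fill_value_from_attributes(attributes)

    output_dataset.createVariable(
        dimension,
        source_dataset[dimension].datatype,
        dimensions=source_dataset[dimension].dimensions,
        fill_value=fill_value,
        zlib=True,
        complevel=6,
    )

    # Copy the data values, then the remaining attributes.
    output_dataset[dimension][:] = source_dataset[dimension][:]
    output_dataset[dimension].setncatts(attributes)
```

The `_FillValue` handling is the subtle part: netCDF4 raises an error if `_FillValue` is set on an existing variable, which is presumably why the diff resolves it before the `createVariable` call.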
On the version change (1.2.0 → 1.2.1): this feels like a backwards-compatible change, so I'd suggest a minor bump to 1.3.0. Agree?

Agree. Addressed in e8c19ae.