Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Min/Max value check does not handle ASCII_Time_* Strings #1135

Open
jmafi opened this issue Feb 13, 2025 · 12 comments
Open

Min/Max value check does not handle ASCII_Time_* Strings #1135

jmafi opened this issue Feb 13, 2025 · 12 comments
Assignees
Labels
B15.1 bug Something isn't working s.medium

Comments

@jmafi
Copy link

jmafi commented Feb 13, 2025

Checked for duplicates

Yes - I've already checked

🐛 Describe the bug

Validate generates the following error when run on a data files containing an ASCII time string:

ERROR [error.table.field_value_not_a_number] data object 2, record 1, field 1: Cannot cast field value '2004-183T04:11:26.809' to a Number data type to validate against the min/max values defined in the label.

The error message suggests that when checking data values against File_Area_Observational.Table_.Field_.Special_Constants.valid_minimum/maximum values Validate tool assumes that the valid_minimum/maximum values are numeric. In this example the data values are times strings.

🕵️ Expected behavior

In the case of time string data values, Validate should use an alphabetic instead of numeric comparison for determining whether the data fall within the range defined by the valid_minimum/maximum attributes.

📜 To Reproduce

  1. Run Validate tool on the data files attached in the Test Data section. Using the command:
    validate -t xmlfile -r logfile

🖥 Environment Info

  • Version of this software: v.3.6.3
  • Operating System: Ubuntu Linux 22.04.4

📚 Version of Software Used

No response

🩺 Test Data / Additional context

The attached files include Table_Binary, Table_Character, and Table_Delimited examples, all of which produce similar results.

min_max_value_time_strings-20250212.zip

🦄 Related requirements

🦄 #xyz

Acceptance Criteria

Given
When I perform
Then I expect

⚙️ Engineering Details

No response

🎉 Integration & Test

No response

@al-niessner
Copy link
Contributor

@jmafi @jordanpadams

I do not think an alpha comparison would work. The time regular expressions allow for parts of the date time field to be missing. While all of the examples are consistent enough to use alpha comparison, I think that validate would have to translate them to DataTime objects then compare to cover all allowable cases.

@jordanpadams
Copy link
Member

@jmafi If I understand the definitions of min/max, can we assume that the datetime format used in the min/max values needs to match the format in the data?

For instance if min is 2025-01-01, then the values in the data cannot be 2025-01-01T04:03:00Z?

@jmafi
Copy link
Author

jmafi commented Feb 20, 2025

@jordanpadams: I believe that the specific case that you gave would not be allowed, but only on a technicality. The only requirement is that the min/max values must be the same data type as the data. In the case you gave 2025-01-01 is ASCII_Date_Time_YMD whereas 2025-01-01T04:03:00Z is ASCII_Date_Time_YMD_UTC. If you were to throw "Z" on the min value or drop the "Z" from the data value then that would be legal, just not guaranteed to work correctly for an alpha search.

@al-niessner: You are undoubtedly right. I just offered up an alpha comparison as I guessed that it would be a much less costly approach. Would it be possible to "zero pad" the min/max values to match the format of the data?

@jordanpadams
Copy link
Member

@jmafi we can probably make this happen, but as @al-niessner mentioned, this will just be a very costly operation from an algorithm perspective.

To confirm, is the datetime format described in the table description? That would help a little bit versus having to guess for every column.

@jmafi
Copy link
Author

jmafi commented Feb 20, 2025

@jordanpadams typically the only required description of the datetime format in the table description is the data_type attribute which has values like ASCII_Date_Time_DOY, ASCII_Date_Time_DOY_UTC, etc. An example label that I pulled up did have a field_format for a time column, but it was just given as %23s.

@al-niessner
Copy link
Contributor

al-niessner commented Feb 20, 2025

@jmafi @jordanpadams

To be workable, the format for the field must be given in one of those magic ascii keywords. For instance, if the field was 1981 is that year or an integer value? Guessing is not going to workout well but I could buy a new airplane with all coding required to make it even half work.

I do not know the PDS rules at all. I just fix the code. So, here are some questions that would need to be settled:

  1. Is data_type describing the same field as field_format?
  2. From @jmafi comment above, it seems like data_type is required but he used the word typically. Is data_type required by the XML/schematron?

If 1 and 2 are true, then the solution is straight forward. If 1 is true but 2 is false, then we can do it if 1 is provided; otherwise, what is 1981.

If 1 is false, then I have a host of other questions relating to field_format and if it is required and if it can contain one of the ascii date time keywords.

@jordanpadams
Copy link
Member

@al-niessner

  1. Is data_type describing the field as field_format?

Not sure what this is asking. Apparently field_format can be a string, even if the the data_type is a datetime type, which doesn't make any sense.

  1. From @jmafi comment above, it seems like data_type is required. Is it required?

Yes.

@jordanpadams
Copy link
Member

As Joe mentioned, the data_type should key us in on the format of the column, e.g. ASCII_Date_Time_DOY, ASCII_Date_Time_DOY_UTC. It sounds like we should ignore field_format when we are doing these conversions from String to a Datetime.

@jordanpadams
Copy link
Member

Starting here are the possible values for datetime strings from the IM: https://pds.nasa.gov/datastandards/documents/im/v1/index_1N00.html#19.5%C2%A0%C2%A0class_pds_ascii_date

@al-niessner
Copy link
Contributor

al-niessner commented Feb 20, 2025

@jordanpadams

My question (1) was are data_type and field_format children of the same direct parent. Something like:

<field>
   <data_type>
   <field_format>
</field>

If they are, then we can ignore field_format. We handle all of the magic ascii strings and can detect which to use from the data_type since it is describing the column.

From your answers (both true) this looks pretty straight forward. What to do if the special constants do not match the data_type or cannot be converted to a date time object - warning/error/blowup?

Special constants is getting complicated. Might be a trick to add it without breaking everything, but the actual additive should be straight forward.

@jmafi
Copy link
Author

jmafi commented Feb 20, 2025

@al-niessner

  1. Is data_type describing the same field as field_format?
  2. From @jmafi comment above, it seems like data_type is required but he used the word typically. Is data_type required by the XML/schematron?

I'm sorry if my previous posts were misleading. Yes, the Field_* class includes both data_type and field_format attributes. data_type is required, field_format is optional. For Field_Character it looks like this:
<Field_Character>
<data_type>
<field_format>
</Field_Character>

Field_Binary and Field_Delimited are identical wrt these two attributes.

data_type provides a hint as to the actual format of the time string, but only a hint since all time elements after the year are optional. In other words all of the following would be valid ASCII_Date_Time_DOY_UTC values:
2025-02-20T12:08:57.239Z
2025-02-20T12:08:57Z
2025-02-20Z
2025Z

@jordanpadams

Not sure what this is asking. Apparently field_format can be a string, even if the the data_type is a datetime type, which doesn't make any sense.

First off, any ASCII data field can accurately be described as being an ASCII string (%s), it's just not particularly useful. That said, ISO time strings may contain a number non-numeric characters ("T" and ":", plus "-" in weird places). An alternative to the "%23s" field format_value would have been something like:

%4d-%03dT%02d:%02d:%05.3fZ

However, the pattern defined for field_format (%[\+,-]?[0-9]+(\.([0-9]+))?[doxfeEs]) doesn't allow anything anywhere near that complex. Currently "%23s" is the only option.

@jstone-psi
Copy link

We had some previous discussions about handling datetime strings (and their formatting and precision issues) in this issue:

#964

I'm mostly pointing it out so that whatever solutions we come up with are consistent.

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
B15.1 bug Something isn't working s.medium
Projects
Status: ToDo
Development

No branches or pull requests

4 participants