-
Notifications
You must be signed in to change notification settings - Fork 254
Orc supports Asia/Shanghai timezone [databricks] #13052
New issue
Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.
By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.
Already on GitHub? Sign in to your account
base: branch-25.10
Are you sure you want to change the base?
Conversation
TODO: |
build |
Signed-off-by: Chong Gao <[email protected]>
Signed-off-by: Chong Gao <[email protected]>
Signed-off-by: Chong Gao <[email protected]>
9b57447
to
b3a403f
Compare
Signed-off-by: Chong Gao <[email protected]>
build |
Will convert to "Ready for reveiw" when premerge is passed. |
build |
1 similar comment
build |
build |
…s Spark does. Signed-off-by: Chong Gao <[email protected]>
build |
Premerge failed, I reproduced the error locally: TZ=America/New_York ./integration_tests/run_pyspark_from_build.sh -s -k test_read_round_trip When It's weird, GpuTimeZoneDB.cpuChangeTimestampTz produces diff result compared to Spark CPU. |
Signed-off-by: Chong Gao <[email protected]>
build |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Pull Request Overview
This PR enables support for non-UTC timezones (specifically Asia/Shanghai) in ORC file operations by updating timezone validation logic and implementing timezone rebasing when reading ORC files.
- Updated timezone validation to support both UTC and Asia/Shanghai timezones instead of only UTC
- Added timezone rebasing functionality to convert timestamps to system default timezone when reading ORC files
- Modified ORC write operations to skip timezone checks since ORC always uses UTC for writing timestamps
Reviewed Changes
Copilot reviewed 6 out of 7 changed files in this pull request and generated no comments.
Show a summary per file
File | Description |
---|---|
GpuOrcFileFormat.scala | Updated timezone validation to support UTC and Asia/Shanghai timezones |
RapidsMeta.scala | Changed checkTimeZone from val to def for overriding in subclasses |
GpuOverrides.scala | Added timezone check override for ORC write operations |
GpuOrcScan.scala | Implemented timezone rebasing logic and updated validation for ORC reads |
orc_test.py | Added comprehensive tests for non-UTC timezone support and updated timestamp ranges |
conftest.py | Added helper function to check for Shanghai timezone |
Tip: Customize your code reviews with copilot-instructions.md. Create the file or learn how to get started.
* @param input the input table, it will be closed after returning | ||
* @return a new table with rebased timestamp columns | ||
*/ | ||
def rebaseTimeZone(input: Table): Table = { |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Similar to SchemaUtils.evolveSchemaIfNeededAndClose
Maybe the name rebaseTimeZoneIfNeededAndClose
is better
if (types.exists(GpuOverrides.isOrContainsDateOrTimestamp)) { | ||
val defaultJvmTimeZone = TimeZone.getDefault | ||
if (defaultJvmTimeZone != TimeZone.getTimeZone("UTC") | ||
&& defaultJvmTimeZone != TimeZone.getTimeZone("Asia/Shanghai")) { |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Only supports UTC and Asia/Shanghai, other timezones like America/New_York need new implementation.
sql-plugin/src/main/scala/com/nvidia/spark/rapids/GpuOrcScan.scala
Outdated
Show resolved
Hide resolved
Signed-off-by: Chong Gao <[email protected]>
build |
sql-plugin/src/main/scala/com/nvidia/spark/rapids/GpuOrcTimezoneUtils.scala
Outdated
Show resolved
Hide resolved
sql-plugin/src/main/scala/com/nvidia/spark/rapids/GpuOverrides.scala
Outdated
Show resolved
Hide resolved
Signed-off-by: Chong Gao <[email protected]>
Signed-off-by: Chong Gao <[email protected]>
Signed-off-by: Chong Gao <[email protected]>
Signed-off-by: Chong Gao <[email protected]>
It's hard to implement this feature without CUDA kernel support. if (!hasSameTZRules) {
offset = SerializationUtils.convertBetweenTimezones(writerTimeZone,
readerTimeZone, millis);
} We can not know the /**
* Find the relative offset when moving between timezones at a particular
* point in time.
*
* This is a function of ORC v0 and v1 writing timestamps relative to the
* local timezone. Therefore, when we read, we need to convert from the
* writer's timezone to the reader's timezone.
*
* @param writer the timezone we are moving from
* @param reader the timezone we are moving to
* @param millis the point in time
* @return the change in milliseconds
*/
public static long convertBetweenTimezones(TimeZone writer, TimeZone reader,
long millis) {
final long writerOffset = writer.getOffset(millis);
final long readerOffset = reader.getOffset(millis);
long adjustedMillis = millis + writerOffset - readerOffset;
// If the timezone adjustment moves the millis across a DST boundary, we
// need to reevaluate the offsets.
long adjustedReader = reader.getOffset(adjustedMillis);
return writerOffset - adjustedReader;
} If the |
Solution: |
Write: Read: For example:
|
I have a concern about the performance: If there is any year > 2200, we need to call to CPU, there may be a perf issue, since should call |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Okay if we are running into issue with other time zones, than I am fine with doing that work as a follow on. It just was not clear that it did not work in the other time zones or why.
Just FYI. The issue of being off by an hour might be related to CUDF also trying to interpret the timezone. Perhaps the correct fix would be to ask CUDF to provide a config to disable modifying the timestamps and let us handle doing the transitions as needed. |
It is totally doable to do all of the time zone transitions on the GPU.
|
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
The code looks find, but the PR is still in draft, so I hesitate to approve it right now. Is it waiting on something more?
This PR can not handle the following scenaria: /**
* Find the relative offset when moving between timezones at a particular
* point in time.
*
* This is a function of ORC v0 and v1 writing timestamps relative to the
* local timezone. Therefore, when we read, we need to convert from the
* writer's timezone to the reader's timezone.
*
* @param writer the timezone we are moving from
* @param reader the timezone we are moving to
* @param millis the point in time
* @return the change in milliseconds
*/
public static long convertBetweenTimezones(TimeZone writer, TimeZone reader,
long millis) {
final long writerOffset = writer.getOffset(millis);
final long readerOffset = reader.getOffset(millis);
long adjustedMillis = millis + writerOffset - readerOffset;
// If the timezone adjustment moves the millis across a DST boundary, we
// need to reevaluate the offsets.
long adjustedReader = reader.getOffset(adjustedMillis);
return writerOffset - adjustedReader;
} From above code, it uses 3 times of We need to:
More relevant info: |
closes #12882
Depends on
Analysis
Refer to link1 , link2 and link3
timestamp types
ORC has two timestamp types:
Refer to ORC types.
Spark only supports the first type. For the second type, Spark throws an analysis error, for more details, refer to the comments in the corresponding issue.
Spark session timezone
Spark ignores the session timezone when reading/writing ORC file.
Can not find any timezone metadata in the ORC file.
JVM timezone
Spark ignores JVM timezone when writing ORC file.
Spark rebases the timezone when reading ORC file according to JVM timezone.
In conclusion:
For timezone other than UTC/Shanghai timezone, this PR does not support.
Refer to ORC code:
Code link1
Code link2
It's not the same as our code GpuTimeZoneDB.cpuChangeTimestampTz
changes
Add cases:
Signed-off-by: Chong Gao [email protected]