@res-life res-life commented Jul 3, 2025

closes #12882

Depends on

Analysis

Refer to link1, link2 and link3.

Timestamp types

ORC has two timestamp types:

  • Timestamp: a date and time without a time zone; it does not change based on the time zone of the reader.
  • Timestamp with local time zone: a fixed instant in time; it does change based on the time zone of the reader.

Refer to ORC types.
Spark only supports the first type. For the second type, Spark throws an analysis error; for more details, refer to the comments in the corresponding issue.

Spark session timezone

Spark ignores the session timezone when reading/writing ORC files.
No session-timezone metadata can be found in the ORC file.

JVM timezone

Spark ignores the JVM timezone when writing ORC files.
Spark rebases timestamps according to the JVM timezone when reading ORC files.
In conclusion:

  • Only the JVM timezone impacts ORC reading.
  • The Spark session timezone has no effect.
  • The JVM timezone has no effect on ORC writing.

This PR does not support timezones other than UTC and Asia/Shanghai.

Refer to ORC code:
Code link1
Code link2
It's not the same as our code in GpuTimeZoneDB.cpuChangeTimestampTz.

Changes

Add cases:

  • Write on CPU, assert reads on GPU and CPU are identical
  • Write on GPU, assert reads on GPU and CPU are identical
  • Test that neither CPU nor GPU supports the ORC timestamp-with-local-time-zone (LTZ) type.

Signed-off-by: Chong Gao [email protected]

res-life commented Jul 3, 2025

TODO:
Need to support and test write.

@res-life res-life changed the title Orc supports non utc timezone Orc supports non utc timezone [databricks] Jul 3, 2025
res-life commented Jul 3, 2025

build

@sameerz sameerz added the feature request New feature or request label Jul 3, 2025
@GaryShen2008 GaryShen2008 changed the base branch from branch-25.08 to branch-25.10 July 29, 2025 06:30
Chong Gao added 4 commits August 7, 2025 13:16
Signed-off-by: Chong Gao <[email protected]>
Signed-off-by: Chong Gao <[email protected]>
@res-life res-life force-pushed the orc-non-utc-timezone branch from 9b57447 to b3a403f Compare August 7, 2025 05:16
res-life commented Aug 7, 2025

build

res-life commented Aug 7, 2025

Will convert to "Ready for review" when premerge passes.

@res-life res-life closed this Aug 8, 2025
@res-life res-life reopened this Aug 8, 2025
res-life commented Aug 8, 2025

build

res-life commented Aug 8, 2025

build

@res-life res-life closed this Aug 13, 2025
@res-life res-life reopened this Aug 13, 2025
@res-life

build

@res-life

build

res-life commented Aug 18, 2025

Premerge failed; I reproduced the error locally:

TZ=America/New_York ./integration_tests/run_pyspark_from_build.sh -s -k test_read_round_trip

When TZ=Asia/Shanghai, it passes.
When years < 2200, it also passes.
In summary: the combination of TZ=America/New_York and years > 2200 produces errors.

It's weird: GpuTimeZoneDB.cpuChangeTimestampTz produces a different result from Spark on the CPU.

@res-life

build

@res-life res-life self-assigned this Aug 19, 2025
@res-life res-life marked this pull request as ready for review August 19, 2025 23:37
@Copilot Copilot AI review requested due to automatic review settings August 19, 2025 23:37

@Copilot Copilot AI left a comment


Pull Request Overview

This PR enables support for non-UTC timezones (specifically Asia/Shanghai) in ORC file operations by updating timezone validation logic and implementing timezone rebasing when reading ORC files.

  • Updated timezone validation to support both UTC and Asia/Shanghai timezones instead of only UTC
  • Added timezone rebasing functionality to convert timestamps to system default timezone when reading ORC files
  • Modified ORC write operations to skip timezone checks since ORC always uses UTC for writing timestamps

Reviewed Changes

Copilot reviewed 6 out of 7 changed files in this pull request and generated no comments.

Summary per file:

  • GpuOrcFileFormat.scala: Updated timezone validation to support UTC and Asia/Shanghai timezones
  • RapidsMeta.scala: Changed checkTimeZone from val to def for overriding in subclasses
  • GpuOverrides.scala: Added timezone check override for ORC write operations
  • GpuOrcScan.scala: Implemented timezone rebasing logic and updated validation for ORC reads
  • orc_test.py: Added comprehensive tests for non-UTC timezone support and updated timestamp ranges
  • conftest.py: Added helper function to check for Shanghai timezone


@res-life res-life changed the title Orc supports non utc timezone [databricks] Orc supports Asia/Shanghai timezone [databricks] Aug 19, 2025
* @param input the input table, it will be closed after returning
* @return a new table with rebased timestamp columns
*/
def rebaseTimeZone(input: Table): Table = {
@res-life res-life Aug 19, 2025


Similar to SchemaUtils.evolveSchemaIfNeededAndClose.
Maybe the name rebaseTimeZoneIfNeededAndClose would be better.

if (types.exists(GpuOverrides.isOrContainsDateOrTimestamp)) {
  val defaultJvmTimeZone = TimeZone.getDefault
  if (defaultJvmTimeZone != TimeZone.getTimeZone("UTC")
      && defaultJvmTimeZone != TimeZone.getTimeZone("Asia/Shanghai")) {

Only UTC and Asia/Shanghai are supported; other timezones like America/New_York need a new implementation.

@res-life res-life requested a review from revans2 August 20, 2025 09:05
Signed-off-by: Chong Gao <[email protected]>
@res-life

build

Chong Gao added 3 commits August 22, 2025 18:40
@res-life res-life marked this pull request as draft August 22, 2025 12:01
Signed-off-by: Chong Gao <[email protected]>
@res-life

It's hard to implement this feature without CUDA kernel support.
Refer to TreeReaderFactory:

      if (!hasSameTZRules) {
        offset = SerializationUtils.convertBetweenTimezones(writerTimeZone,
            readerTimeZone, millis);
      }

We cannot get the writerTimeZone via the Table.readORC API, because the writerTimeZone is stored in the stripe footer.
Each stripe footer has its own writerTimeZone, and the writerTimeZones within an ORC file are not guaranteed to be the same.
When readerTimeZone and writerTimeZone differ, we need a cuDF kernel to implement the following logic:

In SerializationUtils

  /**
   * Find the relative offset when moving between timezones at a particular
   * point in time.
   *
   * This is a function of ORC v0 and v1 writing timestamps relative to the
   * local timezone. Therefore, when we read, we need to convert from the
   * writer's timezone to the reader's timezone.
   *
   * @param writer the timezone we are moving from
   * @param reader the timezone we are moving to
   * @param millis the point in time
   * @return the change in milliseconds
   */
  public static long convertBetweenTimezones(TimeZone writer, TimeZone reader,
                                             long millis) {
    final long writerOffset = writer.getOffset(millis);
    final long readerOffset = reader.getOffset(millis);
    long adjustedMillis = millis + writerOffset - readerOffset;
    // If the timezone adjustment moves the millis across a DST boundary, we
    // need to reevaluate the offsets.
    long adjustedReader = reader.getOffset(adjustedMillis);
    return writerOffset - adjustedReader;
  }
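As a sanity check on the formula above, the following minimal Java sketch re-implements convertBetweenTimezones (the body is copied from the quoted ORC code; the UTC/New_York choice is only an illustration) and shows the 5-hour shift a UTC-written, New_York-read timestamp receives at the epoch:

```java
import java.util.TimeZone;

public class TzConvertDemo {
    // Same logic as ORC's SerializationUtils.convertBetweenTimezones quoted above.
    public static long convertBetweenTimezones(TimeZone writer, TimeZone reader,
                                               long millis) {
        final long writerOffset = writer.getOffset(millis);
        final long readerOffset = reader.getOffset(millis);
        long adjustedMillis = millis + writerOffset - readerOffset;
        // If the adjustment crosses a DST boundary, re-evaluate the reader offset.
        long adjustedReader = reader.getOffset(adjustedMillis);
        return writerOffset - adjustedReader;
    }

    public static void main(String[] args) {
        TimeZone utc = TimeZone.getTimeZone("UTC");
        TimeZone ny = TimeZone.getTimeZone("America/New_York");
        // At the epoch New_York is on EST (UTC-5, no DST), so the shift is +5 hours.
        System.out.println(convertBetweenTimezones(utc, ny, 0L)); // 18000000
    }
}
```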

If the writerTimeZone and readerTimeZone (the JVM timezone) are the same, then this PR and the cuDF PR work.

@res-life

Solution:

  • cuDF: provide an API to get the writer timezones in the stripe footers, and check whether all the writer timezones are the same. We do not support different writer timezones within one ORC file because it is a rare case, but we must throw an error when this constraint is not met.
  • JNI: provide a kernel implementing TimeZone.getOffset. If all the times are before year 2200, use the kernel; otherwise copy to the CPU and compute there.
  • Spark-Rapids: implement SerializationUtils.convertBetweenTimezones.

@res-life

Write:
Given a time and a writer timezone, create a java.sql.Timestamp,
call Timestamp.getTime to get milliseconds from the epoch (1970-01-01) in UTC,
get the base timestamp (ms) of 2015-01-01 00:00:00 in the writer timezone,
and write the difference between the two (i.e. the time relative to 2015) into ORC.

Read:
The base time is 2015-01-01 00:00:00.
Use MS_2015_in_UTC = 1420070400000.

For example:

| write time | writer tz | Timestamp.getTime | reader tz | base timestamp (ms) in writer tz | base timestamp in writer tz - 2015 timestamp in UTC | write value | read: apply base timestamp | offset between reader/writer timezones | final time in UTC |
|---|---|---|---|---|---|---|---|---|---|
| 1970-01-01 00:00:00 +00:00 | UTC | 0 | UTC | ms(2015-01-01 in writer tz) = MS_2015_in_UTC | 0 | -MS_2015_in_UTC | 0 | 0 | 1970-01-01 00:00:00 |
| 1970-01-01 00:00:00 +00:00 | UTC | 0 | +08:00 | ms(2015-01-01 in writer tz) = MS_2015_in_UTC | 0 | -MS_2015_in_UTC | 0 | -8 hours | 1969-12-31 16:00:00 |
| 1970-01-01 08:00:00 +08:00 | +08:00 | 0 | UTC | ms(2015-01-01 in writer tz) = 1420041600000 | -8 hours | -MS_2015_in_UTC + 8 hours | 0 | +8 hours | 1970-01-01 08:00:00 |
| 1970-01-01 08:00:00 +08:00 | +08:00 | 0 | +08:00 | ms(2015-01-01 in writer tz) = 1420041600000 | -8 hours | -MS_2015_in_UTC + 8 hours | 0 | 0 | 1970-01-01 00:00:00 |
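The two base constants in the example above can be checked with java.time; a small sketch (the constant names mirror MS_2015_in_UTC from the discussion):

```java
import java.time.ZoneId;
import java.time.ZonedDateTime;

public class OrcBaseConstants {
    // 2015-01-01 00:00:00 in UTC: the MS_2015_in_UTC constant above.
    public static final long MS_2015_IN_UTC =
        ZonedDateTime.of(2015, 1, 1, 0, 0, 0, 0, ZoneId.of("UTC"))
            .toInstant().toEpochMilli();

    // 2015-01-01 00:00:00 in the +08:00 writer timezone (Asia/Shanghai).
    public static final long MS_2015_IN_SHANGHAI =
        ZonedDateTime.of(2015, 1, 1, 0, 0, 0, 0, ZoneId.of("Asia/Shanghai"))
            .toInstant().toEpochMilli();

    public static void main(String[] args) {
        System.out.println(MS_2015_IN_UTC);      // 1420070400000
        System.out.println(MS_2015_IN_SHANGHAI); // 1420041600000, i.e. 8 hours earlier
    }
}
```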

@res-life

I have a concern about performance: if any year is > 2200 we need to fall back to the CPU, which may be a perf issue, since TimeZone.getOffset has to be called 3 times. On another topic: for the DST timezones it is totally doable on the GPU, IIUC. It is easy to get the day of week from a date, so we can dynamically compute the offset for a time while accounting for DST; DST usually switches offsets on a Sunday around midnight.


@revans2 revans2 left a comment


Okay, if we are running into issues with other time zones, then I am fine with doing that work as a follow-on. It just was not clear that it did not work in the other time zones, or why.

revans2 commented Aug 25, 2025

Just FYI. The issue of being off by an hour might be related to CUDF also trying to interpret the timezone.

https://github.com/rapidsai/cudf/blob/5be7bebf6d650d4606e545c9f90908fef050446a/cpp/src/io/orc/reader_impl_chunking.cu#L262-L264

Perhaps the correct fix would be to ask CUDF to provide a config to disable modifying the timestamps and let us handle doing the transitions as needed.

revans2 commented Aug 25, 2025

> I have a concern about the performance: If there is any year > 2200, we need to call to CPU, there may be a perf issue, since should call Timezone.getOffset 3 times. Another topic, for the DST timezones, it's doable totally on GPU IIUC. It's easy to get the day of week from a date, we can dynamically get the offset for a time considering DST. Usually DST switches offset on Sunday around the midnight.

It is totally doable to do all of the time zone transitions on the GPU.

  1. The actual Java code that covers the transitions is under a GPL license, so it is not something we can look at to help us debug things. That is not a big deal, because it should be documented well enough that we don't need it, but it may slow us down a little.
  2. When looking at what the algorithm would look like to calculate the transition tables on the GPU, I determined that there is a high probability of thread divergence, especially compared to the simple lookup-table operations supported today. Because Java ships a time zone DB that is frozen when the JVM is released, we know that as that JVM ages, the thread divergence issue would likely become more and more of a problem over time. We thought it would be best to cache a transition table covering what we expect to happen (up to 2200), and then do different processing for the rest. The first step was to just make it work, which is where we have the CPU fallback. The next step is to remove the need for the CPU fallback without taking a huge performance hit. I thought we had an issue filed on the backlog for this, but I could not find one. @res-life if you want to file one, feel free to do so. If not, let me know and I will do it.
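The cached-transition-table approach described above amounts to a binary search over sorted transition instants, which maps naturally to one GPU thread per timestamp. A minimal CPU sketch of that lookup (the transitions and offsets here are made-up illustrative values, not a real timezone DB):

```java
import java.util.Arrays;

public class TransitionLookup {
    // transitions[i] is the instant (ms since the epoch) at which offsetsMs[i]
    // takes effect; the array is sorted ascending. Illustrative values only.
    static final long[] transitions = { Long.MIN_VALUE, 100_000_000L };
    static final int[] offsetsMs = { -18_000_000, -14_400_000 };

    // Returns the offset in effect at the given instant: the entry of the
    // last transition that is <= millis.
    public static int getOffset(long millis) {
        int idx = Arrays.binarySearch(transitions, millis);
        if (idx < 0) {
            idx = -idx - 2; // insertion point minus one
        }
        return offsetsMs[idx];
    }

    public static void main(String[] args) {
        System.out.println(getOffset(0L));           // -18000000
        System.out.println(getOffset(200_000_000L)); // -14400000
    }
}
```

On the GPU the same per-element search would run against a cached table covering transitions up to year 2200, with the CPU fallback handling anything beyond it.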

@res-life
Copy link
Collaborator Author

@revans2 Filed an issue


@revans2 revans2 left a comment


The code looks fine, but the PR is still in draft, so I hesitate to approve it right now. Is it waiting on something more?

@res-life

This PR cannot handle the following scenario:
write with one timezone and read with another.
cuDF cannot know the reader timezone, because the reader timezone comes from the JVM default timezone.
Spark-Rapids cannot know the writer timezone via Table.readORC, because it only returns data.
The logic of ORC:
In SerializationUtils

  /**
   * Find the relative offset when moving between timezones at a particular
   * point in time.
   *
   * This is a function of ORC v0 and v1 writing timestamps relative to the
   * local timezone. Therefore, when we read, we need to convert from the
   * writer's timezone to the reader's timezone.
   *
   * @param writer the timezone we are moving from
   * @param reader the timezone we are moving to
   * @param millis the point in time
   * @return the change in milliseconds
   */
  public static long convertBetweenTimezones(TimeZone writer, TimeZone reader,
                                             long millis) {
    final long writerOffset = writer.getOffset(millis);
    final long readerOffset = reader.getOffset(millis);
    long adjustedMillis = millis + writerOffset - readerOffset;
    // If the timezone adjustment moves the millis across a DST boundary, we
    // need to reevaluate the offsets.
    long adjustedReader = reader.getOffset(adjustedMillis);
    return writerOffset - adjustedReader;
  }

The code above calls getOffset three times to handle DST timezones.
This seems to be why I always get a one-hour difference when reading with the New_York timezone.

We need to:

  • Implement a GPU version of TimeZone.getOffset.
    Two options:
    1. When the year is > 2200, fall back to the CPU; when the year is < 2200, use the kernel.
    2. A single kernel instead of the hybrid mode; maybe refer to the joda-time OSS library. We already have to_epoch_day and to_date.
  • Get all the writer timezones from the ORC file's stripe footers and check that they are the same.
  • Have cuDF read as UTC, then add the diff using ColumnView.add(long value).

More relevant info:
The ORC Java library uses java.sql.Timestamp, which uses the JVM default timezone, so the reader also uses the JVM default timezone.
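That JVM-default-timezone dependency is easy to demonstrate; a small sketch (it only illustrates Timestamp's rendering behavior, not the ORC reader itself):

```java
import java.sql.Timestamp;
import java.util.TimeZone;

public class TimestampDefaultTzDemo {
    public static void main(String[] args) {
        // java.sql.Timestamp stores millis since the epoch, but toString()
        // renders that instant in the JVM default timezone.
        TimeZone.setDefault(TimeZone.getTimeZone("UTC"));
        System.out.println(new Timestamp(0L)); // 1970-01-01 00:00:00.0

        TimeZone.setDefault(TimeZone.getTimeZone("Asia/Shanghai"));
        System.out.println(new Timestamp(0L)); // 1970-01-01 08:00:00.0
    }
}
```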

Successfully merging this pull request may close these issues.

[FEA] It would be nice if we support Asia/Shanghai timezone for orc file scan