@res-life res-life commented Jul 3, 2025

closes #12882

Depends on

Analysis

Refer to link1, link2 and link3.

Timestamp types

ORC has two timestamp types:

  • Timestamp: a date and time without a time zone; it does not change based on the time zone of the reader.
  • Timestamp with local time zone: a fixed instant in time; it does change based on the time zone of the reader.

Refer to ORC types.
Spark only supports the first type. For the second type, Spark throws an analysis error; for more details, refer to the comments in the corresponding issue.

Spark session timezone

Spark ignores the session timezone when reading/writing ORC files.
No session-timezone metadata can be found in the ORC file.

JVM timezone

Spark ignores the JVM timezone when writing ORC files.
Spark rebases timestamps according to the JVM timezone when reading ORC files.
In conclusion:

  • Only the JVM timezone impacts ORC reading.
  • The Spark session timezone has no effect.
  • The JVM timezone has no effect on ORC writing.

This PR does not support timezones other than UTC and Asia/Shanghai.

Refer to ORC code:
Code link1
Code link2
It's not the same as our code in GpuTimeZoneDB.cpuChangeTimestampTz.

Changes

Add cases:

  • Write on CPU, assert reads on GPU and CPU are identical
  • Write on GPU, assert reads on GPU and CPU are identical
  • Test that neither CPU nor GPU supports the ORC timestamp-with-local-time-zone (LTZ) type.

Signed-off-by: Chong Gao [email protected]

res-life commented Jul 3, 2025

TODO:
Need to support and test write.

@res-life res-life changed the title Orc supports non utc timezone Orc supports non utc timezone [databricks] Jul 3, 2025
res-life commented Jul 3, 2025

build

@sameerz sameerz added the feature request New feature or request label Jul 3, 2025
@GaryShen2008 GaryShen2008 changed the base branch from branch-25.08 to branch-25.10 July 29, 2025 06:30
Chong Gao added 4 commits August 7, 2025 13:16
Signed-off-by: Chong Gao <[email protected]>
Signed-off-by: Chong Gao <[email protected]>
@res-life res-life force-pushed the orc-non-utc-timezone branch from 9b57447 to b3a403f Compare August 7, 2025 05:16
res-life commented Aug 7, 2025

build

res-life commented Aug 7, 2025

Will convert to "Ready for review" when premerge passes.

@res-life res-life closed this Aug 8, 2025
@res-life res-life reopened this Aug 8, 2025
res-life commented Aug 8, 2025

build

res-life commented Aug 8, 2025

build

@res-life res-life closed this Aug 13, 2025
@res-life res-life reopened this Aug 13, 2025
@res-life

build

@res-life

build

res-life commented Aug 18, 2025

Premerge failed; I reproduced the error locally:

TZ=America/New_York ./integration_tests/run_pyspark_from_build.sh -s -k test_read_round_trip

When TZ=Asia/Shanghai, it passes.
When years < 2200, it also passes.
In summary: the combination of TZ=America/New_York and years > 2200 produces errors.

It's weird: GpuTimeZoneDB.cpuChangeTimestampTz produces a different result from Spark on the CPU.

@res-life

build

@res-life res-life self-assigned this Aug 19, 2025
@res-life res-life marked this pull request as ready for review August 19, 2025 23:37
@Copilot Copilot AI review requested due to automatic review settings August 19, 2025 23:37

@Copilot Copilot AI left a comment


Pull Request Overview

This PR enables support for non-UTC timezones (specifically Asia/Shanghai) in ORC file operations by updating timezone validation logic and implementing timezone rebasing when reading ORC files.

  • Updated timezone validation to support both UTC and Asia/Shanghai timezones instead of only UTC
  • Added timezone rebasing functionality to convert timestamps to system default timezone when reading ORC files
  • Modified ORC write operations to skip timezone checks since ORC always uses UTC for writing timestamps

Reviewed Changes

Copilot reviewed 6 out of 7 changed files in this pull request and generated no comments.

Summary per file:

  • GpuOrcFileFormat.scala: Updated timezone validation to support UTC and Asia/Shanghai timezones
  • RapidsMeta.scala: Changed checkTimeZone from val to def for overriding in subclasses
  • GpuOverrides.scala: Added timezone check override for ORC write operations
  • GpuOrcScan.scala: Implemented timezone rebasing logic and updated validation for ORC reads
  • orc_test.py: Added comprehensive tests for non-UTC timezone support and updated timestamp ranges
  • conftest.py: Added helper function to check for Shanghai timezone


@res-life res-life changed the title Orc supports non utc timezone [databricks] Orc supports Asia/Shanghai timezone [databricks] Aug 19, 2025
* @param input the input table, it will be closed after returning
* @return a new table with rebased timestamp columns
*/
def rebaseTimeZone(input: Table): Table = {
@res-life res-life Aug 19, 2025


Similar to SchemaUtils.evolveSchemaIfNeededAndClose.
Maybe the name rebaseTimeZoneIfNeededAndClose would be better.

if (types.exists(GpuOverrides.isOrContainsDateOrTimestamp)) {
  val defaultJvmTimeZone = TimeZone.getDefault
  if (defaultJvmTimeZone != TimeZone.getTimeZone("UTC")
      && defaultJvmTimeZone != TimeZone.getTimeZone("Asia/Shanghai")) {

Only UTC and Asia/Shanghai are supported; other timezones like America/New_York need a new implementation.

@res-life res-life requested a review from revans2 August 20, 2025 09:05
Signed-off-by: Chong Gao <[email protected]>
@res-life

build

Chong Gao added 3 commits August 22, 2025 18:40
@res-life res-life marked this pull request as draft August 22, 2025 12:01
Signed-off-by: Chong Gao <[email protected]>
@res-life

It's hard to implement this feature without CUDA kernel support.
Refer to TreeReaderFactory:

      if (!hasSameTZRules) {
        offset = SerializationUtils.convertBetweenTimezones(writerTimeZone,
            readerTimeZone, millis);
      }

We cannot get the writerTimeZone via the Table.readORC API, because the writerTimeZone is stored in the stripe footer.
Each stripe footer has its own writerTimeZone, and the writerTimeZones within an ORC file are not guaranteed to be the same.
When readerTimeZone and writerTimeZone differ, we need a cuDF kernel to implement the following logic:

In SerializationUtils

  /**
   * Find the relative offset when moving between timezones at a particular
   * point in time.
   *
   * This is a function of ORC v0 and v1 writing timestamps relative to the
   * local timezone. Therefore, when we read, we need to convert from the
   * writer's timezone to the reader's timezone.
   *
   * @param writer the timezone we are moving from
   * @param reader the timezone we are moving to
   * @param millis the point in time
   * @return the change in milliseconds
   */
  public static long convertBetweenTimezones(TimeZone writer, TimeZone reader,
                                             long millis) {
    final long writerOffset = writer.getOffset(millis);
    final long readerOffset = reader.getOffset(millis);
    long adjustedMillis = millis + writerOffset - readerOffset;
    // If the timezone adjustment moves the millis across a DST boundary, we
    // need to reevaluate the offsets.
    long adjustedReader = reader.getOffset(adjustedMillis);
    return writerOffset - adjustedReader;
  }
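As a sanity check on the formula above, the following minimal Java sketch re-implements convertBetweenTimezones (the body is copied from the quoted ORC code; the UTC/New_York choice is only an illustration) and shows the 5-hour shift a UTC-written, New_York-read timestamp receives at the epoch:

```java
import java.util.TimeZone;

public class TzConvertDemo {
    // Same logic as ORC's SerializationUtils.convertBetweenTimezones quoted above.
    public static long convertBetweenTimezones(TimeZone writer, TimeZone reader,
                                               long millis) {
        final long writerOffset = writer.getOffset(millis);
        final long readerOffset = reader.getOffset(millis);
        long adjustedMillis = millis + writerOffset - readerOffset;
        // If the adjustment crosses a DST boundary, re-evaluate the reader offset.
        long adjustedReader = reader.getOffset(adjustedMillis);
        return writerOffset - adjustedReader;
    }

    public static void main(String[] args) {
        TimeZone utc = TimeZone.getTimeZone("UTC");
        TimeZone ny = TimeZone.getTimeZone("America/New_York");
        // At the epoch New_York is on EST (UTC-5, no DST), so the shift is +5 hours.
        System.out.println(convertBetweenTimezones(utc, ny, 0L)); // 18000000
    }
}
```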

If the writerTimeZone and readerTimeZone (the JVM timezone) are the same, then this PR and the cuDF PR work.

@res-life

Solution:

  • cuDF: provide an API to get the writer timezones in the stripe footers, and check whether all the writer timezones are the same. We do not support different writer timezones within one ORC file because it is a rare case, but we must throw an error when this constraint is not met.
  • JNI: provide a kernel implementing TimeZone.getOffset. If all the times are before year 2200, use the kernel; otherwise copy to the CPU and compute there.
  • Spark-Rapids: implement SerializationUtils.convertBetweenTimezones.

@res-life

Write:
Given a time and a writer timezone, create a java.sql.Timestamp,
call Timestamp.getTime to get milliseconds from the epoch (1970-01-01) in UTC,
get the base timestamp (ms) of 2015-01-01 00:00:00 in the writer timezone,
and write the difference between the two (i.e. the time relative to 2015) into ORC.

Read:
The base time is 2015-01-01 00:00:00.
Use MS_2015_in_UTC = 1420070400000.

For example:

| write time | writer tz | Timestamp.getTime | reader tz | base timestamp (ms) in writer tz | base timestamp in writer tz - 2015 timestamp in UTC | write value | read: apply base timestamp | offset between reader/writer timezones | final time in UTC |
|---|---|---|---|---|---|---|---|---|---|
| 1970-01-01 00:00:00 +00:00 | UTC | 0 | UTC | ms(2015-01-01 in writer tz) = MS_2015_in_UTC | 0 | -MS_2015_in_UTC | 0 | 0 | 1970-01-01 00:00:00 |
| 1970-01-01 00:00:00 +00:00 | UTC | 0 | +08:00 | ms(2015-01-01 in writer tz) = MS_2015_in_UTC | 0 | -MS_2015_in_UTC | 0 | -8 hours | 1969-12-31 16:00:00 |
| 1970-01-01 08:00:00 +08:00 | +08:00 | 0 | UTC | ms(2015-01-01 in writer tz) = 1420041600000 | -8 hours | -MS_2015_in_UTC + 8 hours | 0 | +8 hours | 1970-01-01 08:00:00 |
| 1970-01-01 08:00:00 +08:00 | +08:00 | 0 | +08:00 | ms(2015-01-01 in writer tz) = 1420041600000 | -8 hours | -MS_2015_in_UTC + 8 hours | 0 | 0 | 1970-01-01 00:00:00 |
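The two base constants in the example above can be checked with java.time; a small sketch (the constant names mirror MS_2015_in_UTC from the discussion):

```java
import java.time.ZoneId;
import java.time.ZonedDateTime;

public class OrcBaseConstants {
    // 2015-01-01 00:00:00 in UTC: the MS_2015_in_UTC constant above.
    public static final long MS_2015_IN_UTC =
        ZonedDateTime.of(2015, 1, 1, 0, 0, 0, 0, ZoneId.of("UTC"))
            .toInstant().toEpochMilli();

    // 2015-01-01 00:00:00 in the +08:00 writer timezone (Asia/Shanghai).
    public static final long MS_2015_IN_SHANGHAI =
        ZonedDateTime.of(2015, 1, 1, 0, 0, 0, 0, ZoneId.of("Asia/Shanghai"))
            .toInstant().toEpochMilli();

    public static void main(String[] args) {
        System.out.println(MS_2015_IN_UTC);      // 1420070400000
        System.out.println(MS_2015_IN_SHANGHAI); // 1420041600000, i.e. 8 hours earlier
    }
}
```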

@res-life

I have a concern about performance: if any year is > 2200 we need to fall back to the CPU, which may be a perf issue, since TimeZone.getOffset has to be called 3 times. On another topic: for the DST timezones it is totally doable on the GPU, IIUC. It is easy to get the day of week from a date, so we can dynamically compute the offset for a time while accounting for DST; DST usually switches offsets on a Sunday around midnight.


@revans2 revans2 left a comment


Okay, if we are running into issues with other time zones, then I am fine with doing that work as a follow-on. It just was not clear that it did not work in the other time zones, or why.

revans2 commented Aug 25, 2025

Just FYI. The issue of being off by an hour might be related to CUDF also trying to interpret the timezone.

https://github.com/rapidsai/cudf/blob/5be7bebf6d650d4606e545c9f90908fef050446a/cpp/src/io/orc/reader_impl_chunking.cu#L262-L264

Perhaps the correct fix would be to ask CUDF to provide a config to disable modifying the timestamps and let us handle doing the transitions as needed.

revans2 commented Aug 25, 2025

> I have a concern about the performance: If there is any year > 2200, we need to call to CPU, there may be a perf issue, since should call Timezone.getOffset 3 times. Another topic, for the DST timezones, it's doable totally on GPU IIUC. It's easy to get the day of week from a date, we can dynamically get the offset for a time considering DST. Usually DST switches offset on Sunday around the midnight.

It is totally doable to do all of the time zone transitions on the GPU.

  1. The actual Java code that covers the transitions is under a GPL license, so it is not something we can look at to help us debug things. That is not a big deal, because it should be documented well enough that we don't need it, but it may slow us down a little.
  2. When looking at what the algorithm would look like to calculate the transition tables on the GPU, I determined that there is a high probability of thread divergence, especially compared to the simple lookup-table operations supported today. Because Java ships a time zone DB that is frozen when the JVM is released, we know that as that JVM ages, the thread divergence issue would likely become more and more of a problem over time. We thought it would be best to cache a transition table covering what we expect to happen (up to 2200), and then do different processing for the rest. The first step was to just make it work, which is where we have the CPU fallback. The next step is to remove the need for the CPU fallback without taking a huge performance hit. I thought we had an issue filed on the backlog for this, but I could not find one. @res-life if you want to file one, feel free to do so. If not, let me know and I will do it.
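The cached-transition-table approach described above amounts to a binary search over sorted transition instants, which maps naturally to one GPU thread per timestamp. A minimal CPU sketch of that lookup (the transitions and offsets here are made-up illustrative values, not a real timezone DB):

```java
import java.util.Arrays;

public class TransitionLookup {
    // transitions[i] is the instant (ms since the epoch) at which offsetsMs[i]
    // takes effect; the array is sorted ascending. Illustrative values only.
    static final long[] transitions = { Long.MIN_VALUE, 100_000_000L };
    static final int[] offsetsMs = { -18_000_000, -14_400_000 };

    // Returns the offset in effect at the given instant: the entry of the
    // last transition that is <= millis.
    public static int getOffset(long millis) {
        int idx = Arrays.binarySearch(transitions, millis);
        if (idx < 0) {
            idx = -idx - 2; // insertion point minus one
        }
        return offsetsMs[idx];
    }

    public static void main(String[] args) {
        System.out.println(getOffset(0L));           // -18000000
        System.out.println(getOffset(200_000_000L)); // -14400000
    }
}
```

On the GPU the same per-element search would run against a cached table covering transitions up to year 2200, with the CPU fallback handling anything beyond it.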

@res-life
Copy link
Collaborator Author

@revans2 Filed an issue


@revans2 revans2 left a comment


The code looks fine, but the PR is still in draft, so I hesitate to approve it right now. Is it waiting on something more?

@res-life

This PR cannot handle the following scenario:
write with one timezone and read with another.
cuDF cannot know the reader timezone, because the reader timezone comes from the JVM default timezone.
Spark-Rapids cannot know the writer timezone via Table.readORC, because it only returns data.
The logic of ORC:
In SerializationUtils

  /**
   * Find the relative offset when moving between timezones at a particular
   * point in time.
   *
   * This is a function of ORC v0 and v1 writing timestamps relative to the
   * local timezone. Therefore, when we read, we need to convert from the
   * writer's timezone to the reader's timezone.
   *
   * @param writer the timezone we are moving from
   * @param reader the timezone we are moving to
   * @param millis the point in time
   * @return the change in milliseconds
   */
  public static long convertBetweenTimezones(TimeZone writer, TimeZone reader,
                                             long millis) {
    final long writerOffset = writer.getOffset(millis);
    final long readerOffset = reader.getOffset(millis);
    long adjustedMillis = millis + writerOffset - readerOffset;
    // If the timezone adjustment moves the millis across a DST boundary, we
    // need to reevaluate the offsets.
    long adjustedReader = reader.getOffset(adjustedMillis);
    return writerOffset - adjustedReader;
  }

The code above calls getOffset three times to handle DST timezones.
This seems to be why I always get a one-hour difference when reading with the New_York timezone.

We need to:

  • Implement a GPU version of TimeZone.getOffset.
    Two options:
    1. When the year is > 2200, fall back to the CPU; when the year is < 2200, use the kernel.
    2. A single kernel instead of the hybrid mode; maybe refer to the joda-time OSS library. We already have to_epoch_day and to_date.
  • Get all the writer timezones from the ORC file's stripe footers and check that they are the same.
  • Have cuDF read as UTC, then add the diff using ColumnView.add(long value).

More relevant info:
The ORC Java library uses java.sql.Timestamp, which uses the JVM default timezone, so the reader also uses the JVM default timezone.
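That JVM-default-timezone dependency is easy to demonstrate; a small sketch (it only illustrates Timestamp's rendering behavior, not the ORC reader itself):

```java
import java.sql.Timestamp;
import java.util.TimeZone;

public class TimestampDefaultTzDemo {
    public static void main(String[] args) {
        // java.sql.Timestamp stores millis since the epoch, but toString()
        // renders that instant in the JVM default timezone.
        TimeZone.setDefault(TimeZone.getTimeZone("UTC"));
        System.out.println(new Timestamp(0L)); // 1970-01-01 00:00:00.0

        TimeZone.setDefault(TimeZone.getTimeZone("Asia/Shanghai"));
        System.out.println(new Timestamp(0L)); // 1970-01-01 08:00:00.0
    }
}
```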

Successfully merging this pull request may close these issues.

[FEA] It would be nice if we support Asia/Shanghai timezone for orc file scan