Skip to content

HIVE-20889: Support timestamp-micros in AvroSerDe #5779

New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Open
wants to merge 5 commits into
base: master
Choose a base branch
from

Conversation

armitage420
Copy link
Contributor

What changes were proposed in this pull request?

Timestamps for avro file format in Hive supports timestamp-millis. This PR brings in the feature where Hive can now support Timestamp-micros logicaltype for avro too.
Some details on Logical Types of Avro here: https://gist.github.com/armitage420/fa1432f4dbda56f67bdb08c252415623
More details on Jira description is already present: HIVE-20889

Why are the changes needed?

These changes will increase the reading of timestamps precision for users.

Does this PR introduce any user-facing change?

Yes

How was this patch tested?

Added new qfile and other existing tests

Copy link
Contributor

@SourabhBadhya SourabhBadhya left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

@armitage420 This is nice contribution!
However, you might have to cover with some tests for a variety of cases to avoid any timestamp issues. I would highly advise to extend existing tests for timestamp-millis to timestamp-micros.

Comment on lines 183 to 184
(AvroSerDe.TIMESTAMP_TYPE_NAME_MILLIS.equals(schema.getProp(AvroSerDe.AVRO_PROP_LOGICAL_TYPE)) ||
AvroSerDe.TIMESTAMP_TYPE_NAME_MICROS.equals(schema.getProp(AvroSerDe.AVRO_PROP_LOGICAL_TYPE)))) {
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Consider creating a function to handle all these timestamp cases. Currently its only 2 cases, but looking at the extension itself, in the future it can be more.

Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

@armitage420 I am already seeing a Timestamp-nanos now in Avro specification.
https://avro.apache.org/docs/1.12.0/specification/#timestamps

@@ -159,7 +159,7 @@ private Schema createAvroPrimitive(TypeInfo typeInfo) {
case TIMESTAMP:
schema = AvroSerdeUtils.getSchemaFor("{" +
"\"type\":\"" + AvroSerDe.AVRO_LONG_TYPE_NAME + "\"," +
"\"logicalType\":\"" + AvroSerDe.TIMESTAMP_TYPE_NAME + "\"}");
"\"logicalType\":\"" + AvroSerDe.TIMESTAMP_TYPE_NAME_MICROS + "\"}");
Copy link
Contributor

@SourabhBadhya SourabhBadhya May 2, 2025

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Why this change? Does this have the effect of making the type name as timestamp-micros as default?

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Yes, it does make parsing avro timestamps in microsecond precision in Hive as default. Before it was parsed in millisecond precision by default.

Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Consider keeping it as millis to make it backward compatible. We dont want to surprise users with this change.

I wonder how will users differentiate between millis and micros in their table schema. Is it only possible by providing a custom avro schema in avro.schema.url which explicitly mentions the timestamp column as timestamp-micros?

long millis = defaultProleptic ? timestamp.toEpochMilli() :
CalendarUtils.convertTimeToHybrid(timestamp.toEpochMilli());
CalendarUtils.convertTimeToHybrid(timestamp.toEpochMilli());
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

nit: Remove unnecessary whitespace changes to avoid rewriting git history

Comment on lines 210 to 273
public static long convertTimeToHybridMicros(long prolepticMicros) {
long hybridMicros = prolepticMicros;
long prolepticMillis = prolepticMicros / 1_000L; // Convert micros to millis

if (prolepticMillis < SWITCHOVER_MILLIS) {
String dateStr = PROLEPTIC_TIME_FORMAT.get().format(new Date(prolepticMillis));
try {
hybridMicros = HYBRID_TIME_FORMAT.get().parse(dateStr).getTime() * 1_000L; // Convert millis back to micros
} catch (ParseException e) {
throw new IllegalArgumentException("Can't parse " + dateStr, e);
}
}
return hybridMicros;
}
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Same as earlier, consider keeping it as micros due to loss in precision while changing to millis.

@@ -162,6 +162,11 @@ public long toEpochMilli() {
return localDateTime.toInstant(ZoneOffset.UTC).toEpochMilli();
}

public long toEpochMicro() {
return localDateTime.toEpochSecond(ZoneOffset.UTC) * 1_000_000
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Can we use similar semantics as toEpochMilli()?

return localDateTime.toInstant(ZoneOffset.UTC).toEpochMilli() * 1000 + localDateTime.getNano() / 1000;

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Doing so might lead to loss of precision due to integer overflow. Hence we are parsing the value to seconds here.

Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

localDateTime.toInstant(ZoneOffset.UTC).toEpochMilli() returns a long value and due to addition of all these values the implicit datatype is long hence there should be no chance of integer overflow.

return localDateTime.toEpochSecond(ZoneOffset.UTC) * 1_000_000
+ localDateTime.getNano() / 1000;
}

public long toEpochMilli(ZoneId id) {
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Isn't zone conversion also applicable for micros?

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I could not find any code changes related to zone, and hence did not add extra unnecessary changes. Although, adding a similar method for future does make sense.

@@ -0,0 +1,3 @@
CREATE EXTERNAL TABLE hive_test(`dt` timestamp) STORED AS AVRO;
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

This needs to have more test cases. Consider atleast referring to existing test case from millis and extend it for micros.
We are handling a lot of cases, such as proleptic calendar, zone based conversions etc.

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Understandable, adding equivalent test cases to avro in milliseconds precision, and some more test classes as required

}
timestamp = TimestampTZUtil.convertTimestampToZone(
timestamp, ZoneOffset.UTC, convertToTimeZone, legacyConversion);
if (!skipProlepticConversion && logicalType != null && logicalType.getName().equals(AvroSerDe.TIMESTAMP_TYPE_NAME_MICROS)) {
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

This can be simplified into a nested if block -
if (!skipProlepticConversion) {
// Add other if checks
....
}

Also use equalsIgnoreCase

@armitage420
Copy link
Contributor Author

@SourabhBadhya Thank you so much for the review! I will fix the required changes as soon as possible :D

@armitage420
Copy link
Contributor Author

@SourabhBadhya I have made the required changes, I hope you could have a look at it, when you have time!
P.P.S- The test failures are unrelated to my changes and everyone recently running the pipeline having the same failures. I will re-trigger for green build, post this issue gets fixed :D

Copy link

@armitage420
Copy link
Contributor Author

@SourabhBadhya The build is green! Please look at the changes when you have the time! Thank you ✨

Comment on lines +235 to 245
LogicalType logicalType = schema.getLogicalType();
if (logicalType != null && logicalType.getName().equalsIgnoreCase(AvroSerDe.TIMESTAMP_TYPE_NAME_MICROS)) {
long micros = defaultProleptic ? timestamp.toEpochMicro() :
CalendarUtils.convertTimeToHybridMicros(timestamp.toEpochMicro());
timestamp = TimestampTZUtil.convertTimestampToZone(
Timestamp.ofEpochMicro(micros), TimeZone.getDefault().toZoneId(), ZoneOffset.UTC, legacyConversion);
return timestamp.toEpochMicro();
}
long millis = defaultProleptic ? timestamp.toEpochMilli() :
CalendarUtils.convertTimeToHybrid(timestamp.toEpochMilli());
timestamp = TimestampTZUtil.convertTimestampToZone(
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Consider creating a function which would pass the required fields and return timestamp object.

This is because of enhancing extensibility for timestamp-nanos.

Comment on lines +209 to +215
if (logicalType == null) {
return false;
}

// Supported timestamp logical types (extend this set as needed)
return logicalType.equalsIgnoreCase(AvroSerDe.TIMESTAMP_TYPE_NAME_MILLIS) ||
logicalType.equalsIgnoreCase(AvroSerDe.TIMESTAMP_TYPE_NAME_MICROS);
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Its better to replace this code with -
AvroSerDe.TIMESTAMP_TYPE_NAME_MILLIS.equalsIgnoreCase(logicalType) || AvroSerDe.TIMESTAMP_TYPE_NAME_MICROS.equalsIgnoreCase(logicalType);

@@ -227,7 +227,7 @@ public class TestAvroObjectInspectorGenerator {
" \"fields\" : [\n" +
" {\"name\":\"timestampField\", " +
" \"type\":\"" + AvroSerDe.AVRO_LONG_TYPE_NAME + "\", " +
" \"logicalType\":\"" + AvroSerDe.TIMESTAMP_TYPE_NAME + "\"}" +
" \"logicalType\":\"" + AvroSerDe.TIMESTAMP_TYPE_NAME_MILLIS + "\"}" +
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Remove this and add explicit test cases for micros.

@@ -125,7 +125,7 @@ public void canSerializeDoubles() throws SerDeException, IOException {
public void canSerializeTimestamps() throws SerDeException, IOException {
singleFieldTest("timestamp1", Timestamp.valueOf("2011-01-01 00:00:00").toEpochMilli(),
"\"" + AvroSerDe.AVRO_LONG_TYPE_NAME + "\"," +
"\"logicalType\":\"" + AvroSerDe.TIMESTAMP_TYPE_NAME + "\"");
"\"logicalType\":\"" + AvroSerDe.TIMESTAMP_TYPE_NAME_MILLIS + "\"");
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Remove this and add explicit test cases for micros.

@@ -261,7 +261,7 @@ public void createAvroDateSchema() {
public void createAvroTimestampSchema() {
final String specificSchema = "{" +
"\"type\":\"long\"," +
"\"logicalType\":\"timestamp-millis\"}";
"\"logicalType\":\"timestamp-micros\"}";
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Remove this and add explicit test cases for micros.

('0947-02-11 07:15:11.123'),
('0200-02-11 07:15:11.123');
('0200-02-11 07:15:11.1234');
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Always refrain from modifying existing tests. Considering adding new tests.

('2012-02-21 07:08:09.123'),
('1014-02-11 07:15:11.12345');

SELECT * FROM micros_table;
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Consider adding a test with timestamp-micros as a column and timestamp-millis as a column.

('0947-02-11 07:15:11.123'),
('0200-02-11 07:15:11.123');
('0947-02-11 07:15:11.12345'),
('0200-02-11 07:15:11.123456');
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Refrain from changing existing tests. Add a new test for timestamp-micros explicitly mentioning its column type.

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Projects
None yet
Development

Successfully merging this pull request may close these issues.

3 participants