-
Notifications
You must be signed in to change notification settings - Fork 4.7k
HIVE-20889: Support timestamp-micros in AvroSerDe #5779
New issue
Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.
By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.
Already on GitHub? Sign in to your account
base: master
Are you sure you want to change the base?
Conversation
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
@armitage420 This is nice contribution!
However, you might have to cover with some tests for a variety of cases to avoid any timestamp issues. I would highly advise to extend existing tests for timestamp-millis to timestamp-micros.
serde/src/java/org/apache/hadoop/hive/serde2/avro/AvroDeserializer.java
Outdated
Show resolved
Hide resolved
serde/src/java/org/apache/hadoop/hive/serde2/avro/AvroSerializer.java
Outdated
Show resolved
Hide resolved
(AvroSerDe.TIMESTAMP_TYPE_NAME_MILLIS.equals(schema.getProp(AvroSerDe.AVRO_PROP_LOGICAL_TYPE)) || | ||
AvroSerDe.TIMESTAMP_TYPE_NAME_MICROS.equals(schema.getProp(AvroSerDe.AVRO_PROP_LOGICAL_TYPE)))) { |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Consider creating a function to handle all these timestamp cases. Currently its only 2 cases, but looking at the extension itself, in the future it can be more.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
@armitage420 I am already seeing a Timestamp-nanos now in Avro specification.
https://avro.apache.org/docs/1.12.0/specification/#timestamps
@@ -159,7 +159,7 @@ private Schema createAvroPrimitive(TypeInfo typeInfo) { | |||
case TIMESTAMP: | |||
schema = AvroSerdeUtils.getSchemaFor("{" + | |||
"\"type\":\"" + AvroSerDe.AVRO_LONG_TYPE_NAME + "\"," + | |||
"\"logicalType\":\"" + AvroSerDe.TIMESTAMP_TYPE_NAME + "\"}"); | |||
"\"logicalType\":\"" + AvroSerDe.TIMESTAMP_TYPE_NAME_MICROS + "\"}"); |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Why this change? Does this have the effect of making the type name as timestamp-micros
as default?
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Yes, it does make parsing avro timestamps in microsecond precision in Hive as default. Before it was parsed in millisecond precision by default.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Consider keeping it as millis to make it backward compatible. We dont want to surprise users with this change.
I wonder how will users differentiate between millis and micros in their table schema. Is it only possible by providing a custom avro schema in avro.schema.url
which explicitly mentions the timestamp column as timestamp-micros
?
long millis = defaultProleptic ? timestamp.toEpochMilli() : | ||
CalendarUtils.convertTimeToHybrid(timestamp.toEpochMilli()); | ||
CalendarUtils.convertTimeToHybrid(timestamp.toEpochMilli()); |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
nit: Remove unnecessary whitespace changes to avoid rewriting git history
public static long convertTimeToHybridMicros(long prolepticMicros) { | ||
long hybridMicros = prolepticMicros; | ||
long prolepticMillis = prolepticMicros / 1_000L; // Convert micros to millis | ||
|
||
if (prolepticMillis < SWITCHOVER_MILLIS) { | ||
String dateStr = PROLEPTIC_TIME_FORMAT.get().format(new Date(prolepticMillis)); | ||
try { | ||
hybridMicros = HYBRID_TIME_FORMAT.get().parse(dateStr).getTime() * 1_000L; // Convert millis back to micros | ||
} catch (ParseException e) { | ||
throw new IllegalArgumentException("Can't parse " + dateStr, e); | ||
} | ||
} | ||
return hybridMicros; | ||
} |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Same as earlier, consider keeping it as micros due to loss in precision while changing to millis.
@@ -162,6 +162,11 @@ public long toEpochMilli() { | |||
return localDateTime.toInstant(ZoneOffset.UTC).toEpochMilli(); | |||
} | |||
|
|||
public long toEpochMicro() { | |||
return localDateTime.toEpochSecond(ZoneOffset.UTC) * 1_000_000 |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Can we use similar semantics as toEpochMilli()
?
return localDateTime.toInstant(ZoneOffset.UTC).toEpochMilli() * 1000 + localDateTime.getNano() / 1000;
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Doing so might lead to loss of precision due to integer overflow. Hence we are parsing the value to seconds here.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
localDateTime.toInstant(ZoneOffset.UTC).toEpochMilli()
returns a long value and due to addition of all these values the implicit datatype is long hence there should be no chance of integer overflow.
return localDateTime.toEpochSecond(ZoneOffset.UTC) * 1_000_000 | ||
+ localDateTime.getNano() / 1000; | ||
} | ||
|
||
public long toEpochMilli(ZoneId id) { |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Isn't zone conversion also applicable for micros?
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I could not find any code changes related to zone, and hence did not add extra unnecessary changes. Although, adding a similar method for future does make sense.
@@ -0,0 +1,3 @@ | |||
CREATE EXTERNAL TABLE hive_test(`dt` timestamp) STORED AS AVRO; |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
This needs to have more test cases. Consider atleast referring to existing test case from millis and extend it for micros.
We are handling a lot of cases, such as proleptic calendar, zone based conversions etc.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Understandable, adding equivalent test cases to avro in milliseconds precision, and some more test classes as required
} | ||
timestamp = TimestampTZUtil.convertTimestampToZone( | ||
timestamp, ZoneOffset.UTC, convertToTimeZone, legacyConversion); | ||
if (!skipProlepticConversion && logicalType != null && logicalType.getName().equals(AvroSerDe.TIMESTAMP_TYPE_NAME_MICROS)) { |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
This can be simplified into a nested if block -
if (!skipProlepticConversion) {
// Add other if checks
....
}
Also use equalsIgnoreCase
@SourabhBadhya Thank you so much for the review! I will fix the required changes as soon as possible :D |
@SourabhBadhya I have made the required changes, I hope you could have a look at it, when you have time! |
|
@SourabhBadhya The build is green! Please look at the changes when you have the time! Thank you ✨ |
LogicalType logicalType = schema.getLogicalType(); | ||
if (logicalType != null && logicalType.getName().equalsIgnoreCase(AvroSerDe.TIMESTAMP_TYPE_NAME_MICROS)) { | ||
long micros = defaultProleptic ? timestamp.toEpochMicro() : | ||
CalendarUtils.convertTimeToHybridMicros(timestamp.toEpochMicro()); | ||
timestamp = TimestampTZUtil.convertTimestampToZone( | ||
Timestamp.ofEpochMicro(micros), TimeZone.getDefault().toZoneId(), ZoneOffset.UTC, legacyConversion); | ||
return timestamp.toEpochMicro(); | ||
} | ||
long millis = defaultProleptic ? timestamp.toEpochMilli() : | ||
CalendarUtils.convertTimeToHybrid(timestamp.toEpochMilli()); | ||
timestamp = TimestampTZUtil.convertTimestampToZone( |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Consider creating a function which would pass the required fields and return timestamp object.
This is because of enhancing extensibility for timestamp-nanos
.
if (logicalType == null) { | ||
return false; | ||
} | ||
|
||
// Supported timestamp logical types (extend this set as needed) | ||
return logicalType.equalsIgnoreCase(AvroSerDe.TIMESTAMP_TYPE_NAME_MILLIS) || | ||
logicalType.equalsIgnoreCase(AvroSerDe.TIMESTAMP_TYPE_NAME_MICROS); |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Its better to replace this code with -
AvroSerDe.TIMESTAMP_TYPE_NAME_MILLIS.equalsIgnoreCase(logicalType) || AvroSerDe.TIMESTAMP_TYPE_NAME_MICROS.equalsIgnoreCase(logicalType);
@@ -227,7 +227,7 @@ public class TestAvroObjectInspectorGenerator { | |||
" \"fields\" : [\n" + | |||
" {\"name\":\"timestampField\", " + | |||
" \"type\":\"" + AvroSerDe.AVRO_LONG_TYPE_NAME + "\", " + | |||
" \"logicalType\":\"" + AvroSerDe.TIMESTAMP_TYPE_NAME + "\"}" + | |||
" \"logicalType\":\"" + AvroSerDe.TIMESTAMP_TYPE_NAME_MILLIS + "\"}" + |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Remove this and add explicit test cases for micros.
@@ -125,7 +125,7 @@ public void canSerializeDoubles() throws SerDeException, IOException { | |||
public void canSerializeTimestamps() throws SerDeException, IOException { | |||
singleFieldTest("timestamp1", Timestamp.valueOf("2011-01-01 00:00:00").toEpochMilli(), | |||
"\"" + AvroSerDe.AVRO_LONG_TYPE_NAME + "\"," + | |||
"\"logicalType\":\"" + AvroSerDe.TIMESTAMP_TYPE_NAME + "\""); | |||
"\"logicalType\":\"" + AvroSerDe.TIMESTAMP_TYPE_NAME_MILLIS + "\""); |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Remove this and add explicit test cases for micros.
@@ -261,7 +261,7 @@ public void createAvroDateSchema() { | |||
public void createAvroTimestampSchema() { | |||
final String specificSchema = "{" + | |||
"\"type\":\"long\"," + | |||
"\"logicalType\":\"timestamp-millis\"}"; | |||
"\"logicalType\":\"timestamp-micros\"}"; |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Remove this and add explicit test cases for micros.
('0947-02-11 07:15:11.123'), | ||
('0200-02-11 07:15:11.123'); | ||
('0200-02-11 07:15:11.1234'); |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Always refrain from modifying existing tests. Considering adding new tests.
('2012-02-21 07:08:09.123'), | ||
('1014-02-11 07:15:11.12345'); | ||
|
||
SELECT * FROM micros_table; |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Consider adding a test with timestamp-micros as a column and timestamp-millis as a column.
('0947-02-11 07:15:11.123'), | ||
('0200-02-11 07:15:11.123'); | ||
('0947-02-11 07:15:11.12345'), | ||
('0200-02-11 07:15:11.123456'); |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Refrain from changing existing tests. Add a new test for timestamp-micros explicitly mentioning its column type.
What changes were proposed in this pull request?
Timestamps for avro file format in Hive supports timestamp-millis. This PR brings in the feature where Hive can now support Timestamp-micros logicaltype for avro too.
Some details on Logical Types of Avro here: https://gist.github.com/armitage420/fa1432f4dbda56f67bdb08c252415623
More details on Jira description is already present: HIVE-20889
Why are the changes needed?
These changes will increase the reading of timestamps precision for users.
Does this PR introduce any user-facing change?
Yes
How was this patch tested?
Added new qfile and other existing tests