Better datatypes for S3 access logs? #28
Comments
That would be ideal, but per the documentation, certain fields (like I think |
Wouldn't a translation from |
Not really because in each field the It would probably be better to create new fields based off a case statement when we see hyphens. :/ |
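For illustration, the "new field based on a case statement" idea could look something like the sketch below. It is not code from this repo: `hyphen_to_int` is a hypothetical helper, the sample record values are made up, and it assumes a bare hyphen is the only placeholder that shows up in an otherwise numeric field.

```python
from typing import Optional


def hyphen_to_int(raw: str) -> Optional[int]:
    """Map the "-" placeholder (or an empty value) to None, else parse an int.

    Hypothetical helper: assumes "-" is the only non-numeric placeholder.
    """
    value = raw.strip()
    if value in ("-", ""):
        return None
    return int(value)


# Made-up sample values; only the field names come from the thread.
record = {"bytes_sent": "2662992", "object_size": "-", "turnaround_time": "70"}

# Build typed copies of the fields instead of changing the raw string columns.
typed = {name: hyphen_to_int(raw) for name, raw in record.items()}
```

Keeping the raw string columns and deriving typed ones alongside them means a surprise value in the logs fails at conversion time rather than at schema time.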
Makes sense, and you're right. *sigh* I'm especially annoyed by this datatype stuff b/c even in the official examples like this they just CAST turnaround_time to an INT in the SQL query anyway. 😡 I tried to create a grok-based Glue crawler (based loosely on this post) and was moderately successful. With the S3 access logs specifically, though, they're delivered as static flat files, so I can't actually partition them without a little ETL magic to move the files themselves around.
but it seems like Glue doesn't play nice with conditionals (or certain datatypes maybe?), so it ended up being this one that worked to create the schema correctly:
Custom Grok pattern definition (same as HTTPDATE, just keeps the values):
Does everything it needs to... except make partitions.
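As a rough sketch of the "ETL magic" part: the date can be pulled out of each log object's key and used to build a Hive-style partitioned key to copy the file to. This assumes the delivered keys end in `YYYY-MM-DD-hh-mm-ss-<unique string>`; `partitioned_key` and the `year=/month=/day=` layout are hypothetical choices, not something this repo does, and the actual copy step is left out.

```python
import re
from typing import Optional

# Assumption: log object names end in "YYYY-MM-DD-hh-mm-ss-<unique string>".
KEY_DATE = re.compile(r"(\d{4})-(\d{2})-(\d{2})-\d{2}-\d{2}-\d{2}-[^/]+$")


def partitioned_key(key: str, dest_prefix: str) -> Optional[str]:
    """Return a Hive-style partitioned key, or None if the name doesn't match."""
    match = KEY_DATE.search(key)
    if match is None:
        return None
    year, month, day = match.groups()
    # Keep the original file name; only the "directory" part changes.
    filename = key.rsplit("/", 1)[-1]
    return f"{dest_prefix}year={year}/month={month}/day={day}/{filename}"


# Made-up example key.
new_key = partitioned_key("logs/2019-03-01-12-30-45-ABCDEF123456", "partitioned/")
```

Each flat file would then be copied to its new key, so the table can be partitioned on `year`, `month`, and `day`.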
Yea, S3 access logs (given their age) are particularly challenging. re: conditionals, you can see something I did back in 2019 when there was an extra field briefly in the middle by searching the grok pattern for |
Just wondering if bytes_sent and object_size could be switched from type string to int or bigint for the optimized table. Is there a reason these are set the way they are?
https://github.com/awslabs/athena-glue-service-logs/blob/master/athena_glue_service_logs/s3_access.py#L112