ability to add partitions in Athena to resources #1349

Closed
wants to merge 1 commit into from

@Vitalii0-o Vitalii0-o commented May 13, 2024

dlt version
0.4.10

Describe the problem
In the Athena destination code there is no way to add partitions to Athena tables.

Expected behavior
The ability to partition Athena tables is added to the schema.

Steps to reproduce
Try to create a partitioned table in Athena.

Operating system
Linux

Runtime environment
Airflow

Python version
3.11

dlt data source
No response

dlt destination
AWS Athena / Glue Catalog

Other deployment details
No response

Additional information
No response

netlify bot commented May 13, 2024

Deploy Preview for dlt-hub-docs ready!

Name Link
🔨 Latest commit 700cefc
🔍 Latest deploy log https://app.netlify.com/sites/dlt-hub-docs/deploys/6641d32efb6c760008ab1796
😎 Deploy Preview https://deploy-preview-1349--dlt-hub-docs.netlify.app

Collaborator

@sh-rp sh-rp left a comment

Thanks for the contribution :)
We need:

  • Move to adapter (see comment)
  • A test for creating a table with partitions
  • A test for adding a partition to an existing table
  • A small update in the Athena docs about partitions (see how it is done for BigQuery)

@@ -91,6 +91,7 @@ class TColumnType(TypedDict, total=False):
data_type: Optional[TDataType]
precision: Optional[int]
scale: Optional[int]
partition: Optional[bool]
Collaborator

Please have a look at how we add partitions in BigQuery with the BigQuery adapter. It's very easy to do and you can probably copy most of the code from there.

nicor88 commented May 21, 2024

A few comments from my side, because there is something missing:

  • Adding partitions to external (Hive) tables is not enough unless dlt also writes the data to the matching partition layout in S3: the data in S3 must match the partition schema defined on the table. You should add an example showing how to leverage the S3 layout, e.g. "{table_name}/{test-params}/{YYYY}-{MM}-{DD}/{load_id}.{file_id}.{ext}"; in this case the partition could be called, for example, extraction_date. Also, after dlt writes to S3, you should add the partition to the table via a metadata call, for example ALTER TABLE elb_logs_raw_native_part ADD PARTITION (dt='2015-01-01') location 's3://athena-examples-us-west-1/elb/plaintext/2015/01/01/'. On the first run you can leverage MSCK REPAIR TABLE impressions, but for incremental runs this operation is too expensive.
  • You should also add partitioning for Iceberg tables. In that case the writer is Athena itself, not dlt, so what you proposed for Hive/external tables should be enough.
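
To make the metadata call from the first point concrete, here is a minimal sketch of how the ALTER TABLE ... ADD PARTITION statement could be derived from a date-based S3 layout. This is not dlt code: the helper name add_partition_sql, the table name, the bucket prefix and the partition column extraction_date are all hypothetical, and the sketch assumes the partition folder is the date in YYYY-MM-DD form, as in the layout above.

```python
from datetime import date


def add_partition_sql(
    table: str,
    bucket_prefix: str,
    day: date,
    partition_column: str = "extraction_date",  # hypothetical partition name
) -> str:
    """Build an Athena DDL statement that registers one daily partition.

    Assumes the files for `day` live under `<bucket_prefix>/<YYYY-MM-DD>/`,
    so the partition value and the S3 folder name are the same string.
    """
    value = day.isoformat()  # e.g. "2024-05-13"
    location = f"{bucket_prefix.rstrip('/')}/{value}/"
    # IF NOT EXISTS keeps the call safe to repeat on incremental runs.
    return (
        f"ALTER TABLE {table} ADD IF NOT EXISTS "
        f"PARTITION ({partition_column}='{value}') "
        f"LOCATION '{location}'"
    )


if __name__ == "__main__":
    # Hypothetical table and bucket, for illustration only.
    print(add_partition_sql("events", "s3://my-bucket/events/", date(2024, 5, 13)))
```

Running one such statement per newly written date folder avoids a full MSCK REPAIR TABLE scan on incremental loads. Because the partition value and the folder name come from the same string, the S3 layout and the partition definition stay consistent.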

Not sure if dlt has integration tests available to run, but you should add integration tests to verify that the result of adding data to a partitioned table is correct. @sh-rp can surely help you out here.

Collaborator

rudolfix commented May 22, 2024

@nicor88 yes, we have end-to-end tests and we can test the layout. Ideally we could somehow test this via some SQL command. We'll put you as a reviewer here. OK?

also when looking at the documentation: if I use ADD PARTITION and LOCATION I can have any file layout I want. I just need to provide a file name to LOCATION and that's it, right?

@rudolfix rudolfix mentioned this pull request May 22, 2024
14 tasks
@rudolfix
Collaborator

@Vitalii0-o we are moving to #1403

@rudolfix rudolfix closed this May 24, 2024

nicor88 commented May 24, 2024

@rudolfix

also when looking at the documentation: if I use ADD PARTITION and LOCATION I can have any file layout I want. I just need to provide a file name to LOCATION and that's it, right?

In theory yes, but in practice it's bad practice, because it introduces inconsistency between the S3 layout and the partition definition, and at scale it could slow the engine down.

Also, feel free to add me anywhere or ping me when necessary :) Happy to help.
