Skip to content
This repository has been archived by the owner on Mar 11, 2024. It is now read-only.

feat: add lazyframe support for parquet and delta #52

Merged

Conversation

ion-elgreco
Copy link
Contributor

@ion-elgreco ion-elgreco commented Jan 24, 2024

This PR exposes LazyFrame support to parquet and delta IO Managers:

  • Renamed dump_df_to_path to write_df_to_path
  • Added abstract method sink_df_to_path
  • Extended dump_to_path to handle LazyFrames
  • Removed get_storage_options since the UPathIOmanager already handles this with the storage_options property
  • Delta: sink_df_to_path always has to collect into memory, people can configure this with metadata to collect with streaming
  • Delta: write_df_to_path will first try to initialize a delta table, this way we can log the version better in case of concurrent runs, since the TableState gets updated, and we grab the version from there
  • Parquet: cloud sinks are not supported in polars yet so we need to collect and dispatch to write, in write it either uses pyarrow or polars native writer depending on the params.
  • Removed all experimental flags
  • Bumped polars>= 0.20.0 and deltalake>=0.15.0
  • Refactored some pieces of the metadata collection, we don't collect stats on lazyframes since we can't do that with cheap operations
  • Extended all tests for Lazy->Lazy
  • Added _process_env_vars to take the env variables and pass them as storage options to UPath (all won't be visible in UI)
  • Removed the legacy reader since we bump polars

@ion-elgreco ion-elgreco marked this pull request as ready for review January 24, 2024 15:50
@danielgafni danielgafni self-requested a review January 24, 2024 21:15
@danielgafni
Copy link
Owner

Hey @ion-elgreco, can we keep polars >= 0.18.0? Would it cause any harm?

@ion-elgreco
Copy link
Contributor Author

If you keep it at 0.18 you need to manually keep logic in to handle the different versions.

Also the polars write delta is tied to a delta lake version, so that's why we need at least 0.20 because I made various changes to polars 0.19

In general also Ritchie advises to keep everything up to date so I don't see much harm in locking it to >=0.20,

@danielgafni
Copy link
Owner

danielgafni commented Jan 27, 2024

Sometimes upgrading is not so easy (I personally can't do it at work), it should be the end user's decision to upgrade unless absolutely required by the library.

I don't think we can't support 0.18.0. Even 0.17.0 is working correctly with the latest deltalake. In fact, the CI in this repo will be run for all minor polars versions from 0.17 to 0.20, and this worked perfectly previously.

We can remove the legacy pyarrow logic, that's fine.

Is there any other version-dependent logic which would case any issues? Can you point out the specific issues with 0.18.0?

README.md Show resolved Hide resolved
dagster_polars/io_managers/parquet.py Outdated Show resolved Hide resolved
dagster_polars/io_managers/delta.py Outdated Show resolved Hide resolved
dagster_polars/io_managers/base.py Show resolved Hide resolved
dagster_polars/io_managers/utils.py Show resolved Hide resolved
dagster_polars/io_managers/utils.py Outdated Show resolved Hide resolved
dagster_polars/io_managers/parquet.py Outdated Show resolved Hide resolved
dagster_polars/io_managers/parquet.py Outdated Show resolved Hide resolved
dagster_polars/io_managers/delta.py Outdated Show resolved Hide resolved
dagster_polars/io_managers/delta.py Outdated Show resolved Hide resolved
@ion-elgreco
Copy link
Contributor Author

@danielgafni alright, I've added support back for older polars versions, it's up to the users then. But we should drop older polars support once v1.0 lands. Maintaining code for multiple versions is not very practical : )

@ion-elgreco
Copy link
Contributor Author

ion-elgreco commented Jan 28, 2024

Looking at the test failures, polars==0.18 will be incompatible. I still think we need to drop that one at least otherwise we cannot release the changes

@ion-elgreco ion-elgreco force-pushed the feat/refactor_to_lazyframe branch from bef18d3 to 44ae873 Compare January 28, 2024 20:49
@ion-elgreco ion-elgreco reopened this Jan 28, 2024
dagster_polars/io_managers/base.py Show resolved Hide resolved
dagster_polars/io_managers/parquet.py Outdated Show resolved Hide resolved
dagster_polars/io_managers/base.py Show resolved Hide resolved
dagster_polars/io_managers/base.py Outdated Show resolved Hide resolved
dagster_polars/io_managers/parquet.py Show resolved Hide resolved
dagster_polars/io_managers/parquet.py Show resolved Hide resolved
Copy link
Owner

@danielgafni danielgafni left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

LGTM, great work @ion-elgreco ! Thanks for this PR!

@danielgafni danielgafni merged commit fed5350 into danielgafni:master Jan 29, 2024
37 checks passed
Sign up for free to subscribe to this conversation on GitHub. Already have an account? Sign in.
Labels
None yet
Projects
None yet
Development

Successfully merging this pull request may close these issues.

2 participants