Feat: dagster-pyiceberg #25
Comments
Hey! Sounds really great, I think both additions are reasonable and welcome. The only slight problem we might have with I know we also have I would like to avoid this with Iceberg. Do you think it would be possible to add the Polars IOManager to In general, I think fewer packages are better than more packages; it will make the ecosystem easier to navigate. Perhaps the same could be done with |
Thanks for your response. I understand your point about the organization of this code. I only looked at the Adding the Polars IOManager to Ultimately, the polars IOManager is simply inheriting from the Arrow IOManager. All the helper functions (and there are quite a few) live in the If you do add it as a dependency, wouldn't it be strange that there's a dependency on a community-maintained library? |
Thanks for your reply. The pyiceberg Of course, Polars has the scan_iceberg() function that returns a LazyFrame, so you'd be able to just call pl.scan_iceberg(table), where table is a pyiceberg table. For writing, I don't think it matters much which library is used. Polars doesn't support writes, I think, and Daft calls 'collect' on the input DataFrame, which returns a pyarrow table. TBH I'm not married to the current implementation, so I'm happy to refactor or donate the code to another library if it fits there. I think the 'general' pieces of code (e.g. the mapping from Dagster partitions to Iceberg partition specs) can be reused by multiple libraries, reading tables can be handled by specific libraries (polars, dagster, duckdb, daft, etc.), and writes probably only have the pyarrow implementation at the moment. |
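To illustrate the 'general' piece mentioned above, here is a minimal, library-free sketch of mapping Dagster-style partition kinds to Iceberg partition transforms. All names in it (`PartitionField`, `to_partition_fields`, `diff_fields`, and the kind strings) are hypothetical and are not the dagster-pyiceberg API; only the transform names (`hour`, `day`, `month`, `identity`) come from Iceberg itself.

```python
from dataclasses import dataclass

# Assumption: partition kinds are plain strings; a real IO manager would
# inspect Dagster partition definitions instead.
KIND_TO_TRANSFORM = {
    "hourly": "hour",
    "daily": "day",
    "monthly": "month",
    "static": "identity",
}


@dataclass(frozen=True)
class PartitionField:
    """Hypothetical stand-in for one field of an Iceberg partition spec."""

    source_column: str
    transform: str
    name: str


def to_partition_fields(partition_columns: dict) -> list:
    """Map {column: partition kind} to a list of partition fields."""
    fields = []
    for column, kind in partition_columns.items():
        if kind not in KIND_TO_TRANSFORM:
            raise ValueError(f"unsupported partition kind: {kind!r}")
        transform = KIND_TO_TRANSFORM[kind]
        fields.append(PartitionField(column, transform, f"{column}_{transform}"))
    return fields


def diff_fields(current: list, desired: list) -> tuple:
    """Return (to_add, to_remove) needed to turn `current` into `desired`."""
    to_add = [f for f in desired if f not in current]
    to_remove = [f for f in current if f not in desired]
    return to_add, to_remove
```

A reader for Polars, Daft, or DuckDB could share this mapping and only differ in how the table is loaded. The diff step is one possible way to decide which fields to add or drop when the Dagster partitioning changes.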
FYI, I've refactored the code so that Polars and Daft are now supported as IO managers without depending on PyArrow, and both support lazy DataFrames. |
Hey @JasperHG90, that sounds great. I think it's going to be easier to proceed if we have a short call first. Do you mind reaching out to me on Dagster's community slack? |
Hey @danielgafni and @JasperHG90 - did this meeting take place? Would love to hear an update on this! |
Hey! Yes, we did have a call.
|
Hi 👋 ,
I have two possible additions that I'd like to add to this repository:
Dagster-pyiceberg
dagster-pyiceberg is an IO manager to read from and write to Iceberg tables using PyIceberg. I'm working on a couple of end-to-end examples. The example with a PostgreSQL catalog backend can be found here. Next I'll be working on an example using a Polaris catalog backend.
This library has alpha status and currently depends on a prerelease version of PyIceberg.
Most PyIceberg features are supported. Users can define partitions on assets, which will be mapped to a PyIceberg partition spec, and can update this partition spec by updating the Dagster partition mapping. Schema evolution is also supported (albeit a bit crudely implemented). Some features that are not yet supported in PyIceberg (but will be supported in future versions), such as commit retries, are also implemented.
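As a rough illustration of what commit retries can look like, here is a minimal sketch using exponential backoff with jitter. It is not the library's implementation: `CommitConflict` and `commit_with_retries` are hypothetical names, and the exception is a stand-in for whatever error the catalog raises when a concurrent writer wins the commit.

```python
import random
import time


class CommitConflict(Exception):
    """Hypothetical stand-in for the catalog's commit-failed error."""


def commit_with_retries(commit, max_attempts=4, base_delay=0.1, sleep=time.sleep):
    """Call `commit()` until it succeeds or `max_attempts` is reached.

    `commit` is any zero-argument callable. In a real writer it would
    likely refresh the table metadata before re-applying the change.
    """
    for attempt in range(1, max_attempts + 1):
        try:
            return commit()
        except CommitConflict:
            if attempt == max_attempts:
                raise  # out of attempts: surface the conflict to the caller
            # Exponential backoff plus jitter so that concurrent writers
            # do not retry in lockstep.
            delay = base_delay * 2 ** (attempt - 1)
            sleep(delay + random.uniform(0, delay))
```

The `sleep` parameter is only there so the backoff can be stubbed out in tests.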
I need to see what the best way is to port the code to this repository. I use UV with a workspace package layout. The setup is similar to dagster-deltalake, with pandas and polars support added to separate libraries that import the dagster-pyiceberg library.
Dagster-pipes-gcp
This library essentially ports the dagster-pipes AWS Lambda external execution functionality to GCP cloud functions. I'd like to expand this to support Cloud Run Jobs and GCP batch as execution environments.
This code needs tests (currently no tests are written).
Could you let me know if this is of interest to you?
Thanks!