This repository has been archived by the owner on Mar 11, 2024. It is now read-only.

Feat/add polars delta merge support #47

Open
edgBR wants to merge 9 commits into master

Conversation

@edgBR commented Dec 18, 2023

No description provided.

@edgBR (Author) commented Dec 18, 2023

Hi @danielgafni,

Placeholder for now: I'm new to Dagster and not good at writing tests, but I think the PR gives you the idea of what I'm trying to achieve. Now in Polars you can do:

[image: screenshot omitted]
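The screenshot presumably showed the merge API introduced in Polars 0.20; a minimal sketch of it, assuming an existing Delta table at a hypothetical path:

```python
import polars as pl

df = pl.DataFrame({"id": [1, 2], "value": ["a", "b"]})

# write_delta with mode="merge" returns a deltalake TableMerger,
# which is then configured and executed explicitly
(
    df.write_delta(
        "path/to/delta_table",  # hypothetical target table
        mode="merge",
        delta_merge_options={
            "predicate": "s.id = t.id",
            "source_alias": "s",
            "target_alias": "t",
        },
    )
    .when_matched_update_all()
    .when_not_matched_insert_all()
    .execute()
)
```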

Also, I added a pre-commit hook that upgrades Polars code to a target version.

@danielgafni (Owner) commented Dec 19, 2023

Hey @edgBR!

Thanks for the PR; it sounds like a neat feature to have. Let's work on getting this merged. We'll have to change a few things for that.

@danielgafni (Owner) left a comment

So the main thing here is keeping the lower version constraints to ensure backwards compatibility with existing code bases.

@@ -37,6 +37,7 @@ jobs:
 - "0.17.0"
 - "0.18.0"
 - "0.19.0"
+- "0.20.1"
@danielgafni (Owner):

nit: 0.20.0 would be more in line with the others

@edgBR (Author):

Hi, I understand, but Polars merge support was added in 0.20. Is it okay to change this to 0.20.0?

@danielgafni (Owner) Dec 19, 2023:

0.20.0 in this context means ">=0.20.0, <0.21.0", or "latest available before 0.21.0". That's how the CI is set up. So yes, it's okay to use 0.20.0 here.
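For illustration only, a hedged sketch of how such a matrix entry can be resolved to a bounded constraint at install time (the actual workflow step may differ):

```yaml
# hypothetical CI step: matrix entry "0.20.0" installs the latest
# polars release in the 0.20.x series
- name: Install polars
  run: pip install "polars>=0.20.0,<0.21.0"
```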

@@ -81,6 +82,7 @@ jobs:
 - "0.17.0"
 - "0.18.0"
 - "0.19.0"
+- "0.20.1"
@danielgafni (Owner):

Same as above.

.pre-commit-config.yaml: two outdated review threads (resolved)
@@ -28,11 +29,11 @@ license = "Apache-2.0"
 [tool.poetry.dependencies]
 python = "^3.8"
 dagster = "^1.4.0"
-polars = ">=0.17.0"
+polars = ">=0.20.1"
@danielgafni (Owner):

Let's not change the lower polars constraint. We don't want to force an update for users as it can break their code.

@danielgafni (Owner) Dec 19, 2023:

To be clear, we do want to update the dev Polars version pinned in poetry.lock. This can be done via the `poetry update polars` command.
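For reference, run from the repository root:

```shell
# refreshes the polars entry in poetry.lock within the pyproject.toml constraints
poetry update polars
```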

 pyarrow = ">=8.0.0"
 typing-extensions = "^4.7.1"

-deltalake = { version = ">=0.10.0", optional = true }
+deltalake = { version = ">=0.14.0", optional = true }
@danielgafni (Owner):

Same here; I don't think we need to change this.

@@ -101,7 +101,16 @@ def append_asset() -> pl.DataFrame:

     pl_testing.assert_frame_equal(pl.concat([df, df]), pl.read_delta(saved_path))


+def test_polars_delta_io_manager_merge(polars_delta_io_manager: PolarsDeltaIOManager):
@danielgafni (Owner):

Please add a test for the new merge functionality. You can take a look at the other tests for inspiration. Let me know if you need any help with this.
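In case it helps, a rough sketch of what such a test could look like, modeled on the append test above; the merge-mode metadata keys and the way the saved path is recovered are assumptions, not the final API:

```python
import polars as pl
import polars.testing as pl_testing
from dagster import asset, materialize
from dagster_polars import PolarsDeltaIOManager


def test_polars_delta_io_manager_merge(polars_delta_io_manager: PolarsDeltaIOManager):
    df = pl.DataFrame({"id": [1, 2], "value": ["a", "b"]})

    @asset(
        io_manager_def=polars_delta_io_manager,
        metadata={
            "mode": "merge",  # assumed: the mode introduced by this PR
            "delta_merge_options": {
                "predicate": "s.id = t.id",
                "source_alias": "s",
                "target_alias": "t",
            },
        },
    )
    def merge_asset() -> pl.DataFrame:
        return df

    result = materialize([merge_asset])
    saved_path = result.asset_materializations_for_node("merge_asset")[0].metadata["path"].value

    # merging the same keys a second time should upsert, not duplicate rows
    materialize([merge_asset])
    pl_testing.assert_frame_equal(df, pl.read_delta(saved_path), check_row_order=False)
```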

Removing polars upgrade pre-commit hooks to ensure backward compatibility of the code
@edgBR (Author) commented Dec 22, 2023

Hi @danielgafni,

We have been busy with a PR to scikit-learn. I will update this next week.

Thanks for the support.

BR
E

@danielgafni (Owner) commented

Hey @edgBR!

It has been decided to merge this repo into the main Dagster project.
I'm going to start on it soon.
Do you want me to step in here and complete the rest of this PR so we can merge it faster?
I would like to see this merged before we move this code to Dagster.

@edgBR (Author) commented Dec 29, 2023

Hi Daniel,

What is the timeline?

I'm on vacation right now. If you need to complete this before the 8th of January, then yes, you can go ahead.

If not, I will do it as soon as I'm back.

Our scikit-learn PR is done (yet to be merged, but their CI pipeline was broken for a day and a half due to an upstream bug in conda); it took most of my last few days.

BR
E

@danielgafni (Owner) commented

Hey @edgBR!

I'm hoping to finish this in around two weeks. The Dagster PR is already created.

storage_options=storage_options,
delta_merge_options=delta_merge_options,
)
.when_matched_update_all()
Contributor:

This needs to be configurable in some way. Basically this is a default upsert, but a MERGE can be a complex set of different update, delete, and insert operations.

I commonly use deduplicate-on-insert.
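For example, both strategies expressed directly against deltalake's TableMerger (a sketch; the table path and keys are hypothetical):

```python
import pyarrow as pa
from deltalake import DeltaTable

source = pa.table({"id": [1, 2], "value": ["a", "b"]})
dt = DeltaTable("path/to/delta_table")  # hypothetical existing table

# default upsert: update matched rows, insert unmatched ones
(
    dt.merge(source, predicate="s.id = t.id", source_alias="s", target_alias="t")
    .when_matched_update_all()
    .when_not_matched_insert_all()
    .execute()
)

# alternative, deduplicate on insert: only insert rows whose key is not already present
(
    dt.merge(source, predicate="s.id = t.id", source_alias="s", target_alias="t")
    .when_not_matched_insert_all()
    .execute()
)
```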

@edgBR (Author):

Hi Ion,

You are right, and if you look at my example you will find a deduplication strategy using a rank function over a primary key and then selecting the first row. However, for that the input dataset needs to have a "cdc" column (like a load timestamp).

Shouldn't this be the responsibility of the user?
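For concreteness, a hedged sketch of that deduplication (column names like load_dts are hypothetical):

```python
from datetime import datetime

import polars as pl

df = pl.DataFrame({
    "id": [1, 1, 2],
    "value": ["old", "new", "x"],
    "load_dts": [datetime(2023, 1, 1), datetime(2023, 1, 2), datetime(2023, 1, 1)],
})

# rank rows within each primary key by the CDC column and keep the latest one
deduped = (
    df.with_columns(
        pl.col("load_dts").rank(method="ordinal", descending=True).over("id").alias("rn")
    )
    .filter(pl.col("rn") == 1)
    .drop("rn")
)
```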

An alternative could be to modify:

    def get_metadata(self, context: OutputContext, obj: pl.DataFrame) -> Dict[str, MetadataValue]:
        assert context.metadata is not None

        metadata = super().get_metadata(context, obj)

        if context.has_asset_partitions:
            partition_by = context.metadata.get("partition_by")
            if partition_by is not None:
                metadata["partition_by"] = partition_by

        if context.metadata.get("mode") == "append":
            # modify the metadata to reflect the fact that we are appending to the table

            if context.has_asset_partitions:
                # paths = self._get_paths_for_partitions(context)
                # assert len(paths) == 1
                # path = list(paths.values())[0]

                # FIXME: what to do about row_count metadata if we are appending to a partitioned table?
                # we should not be using the full table length,
                # but it's unclear how to get the length of the partition we are appending to
                pass
            else:
                metadata["append_row_count"] = metadata["row_count"]

                path = self._get_path(context)
                # we need to get row_count from the full table
                metadata["row_count"] = MetadataValue.int(
                    DeltaTable(str(path), storage_options=self.get_storage_options(path))
                    .to_pyarrow_dataset()
                    .count_rows()
                )

        return metadata

To maybe do something like this:

        if context.metadata.get("mode") == "append":
            # modify the metadata to reflect the fact that we are appending to the table

            if context.has_asset_partitions:
                # FIXME: what to do about row_count metadata if we are appending to a partitioned table?
                # we should not be using the full table length,
                # but it's unclear how to get the length of the partition we are appending to
                pass
            else:
                metadata["append_row_count"] = metadata["row_count"]

        if context.metadata.get("mode") == "merge":
            # modify the metadata to reflect the fact that we are merging into the table
            metadata["primary_key"] = "something here that refers to this key"
            metadata["cdc_column"] = "something here that refers to this key"

            path = self._get_path(context)
            # we need to get row_count from the full table
            metadata["row_count"] = MetadataValue.int(
                DeltaTable(str(path), storage_options=self.get_storage_options(path))
                .to_pyarrow_dataset()
                .count_rows()
            )

        return metadata

Contributor:

Hey :)

Yeah, I need to go through the current implementation of dagster-polars a bit more. I've already pushed a PR for dagster-deltalake-polars as a first step.

Contributor:

@edgBR, for my own work I am planning to use dagster-deltalake-polars and then only the Parquet IO manager in dagster-polars.

So sometime next week, after my first PR gets merged in dagster-deltalake-polars, I will expand it there to cover a couple of common MERGE strategies.

@danielgafni (Owner) commented


Hey @edgBR! The DeltaLake IOManager is being reworked in #52. You might want to get back to this after that PR is merged.
