Upgrade to datafusion 38 #691

Michael-J-Ward · 2024-05-13T20:38:03Z

Which issue does this PR close?

Closes #690.

Are there any user-facing changes?

DFField and related methods were removed
PyScalarFunction and PyBuiltinScalarFunction were removed
null_count was fixed upstream so the behavior has changed

andygrove · 2024-05-13T21:43:08Z

src/expr.rs

-
-    pub fn column_name(&self, plan: PyLogicalPlan) -> PyResult<String> {
-        self._column_name(&plan.plan()).map_err(py_runtime_err)
-    }
 }

 impl PyExpr {


@jdye64 you may want to review this PR since it removes code that I believe you originally added

@jdye64 - I had removed the method because it relied on DFField which was removed in datafusion.

The last commit attempts to re-implement the method using arrow's Field.

I'd still much appreciate any feedback / context!

also cc @charlesbluca

It looks like Dask SQL is using a pinned version of this repo from more than six months ago, so we likely won't get a review from the team right away. The new functionality based on Field looks good to me, so I will go ahead and merge this PR.

Yeah, this is fine. Honestly, we need to come up with a better way to get the column name anyway and, as you mentioned, we are using a pinned older version for now.

Cargo.toml

andygrove · 2024-05-13T21:45:13Z

datafusion/tests/test_dataframe.py

+        "a": [3.0, 0.0, 2.0, 1.0, 1.0, 3.0, 2.0],
+        "b": [3.0, 0.0, 5.0, 1.0, 4.0, 6.0, 5.0],
+        "c": [3.0, 0.0, 7.0, 1.7320508075688772, 5.0, 8.0, 8.0],


Why are these changes needed?

null_count was fixed upstream in apache/datafusion#10260

The underlying data being described:

    >>> print(df)
    DataFrame()
    +---+---+---+
    | a | b | c |
    +---+---+---+
    | 1 | 4 | 8 |
    | 2 | 5 | 5 |
    | 3 | 6 | 8 |
    +---+---+---+
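For anyone who wants to check the new expected values by hand, here is a minimal sketch that uses only Python's standard library and does not call datafusion at all. It assumes the seven entries per column are, in order, count, null_count, mean, std (sample standard deviation), min, max, and median; that order is inferred from the numbers in the test, not stated in this PR. `DATA` and `describe_column` are hypothetical names introduced for illustration.

```python
import statistics

# Columns copied from the DataFrame printed in this thread.
DATA = {
    "a": [1, 2, 3],
    "b": [4, 5, 6],
    "c": [8, 5, 8],
}


def describe_column(values):
    """Return [count, null_count, mean, std, min, max, median] as floats.

    Assumption: this row order is inferred from the expected values in
    test_dataframe.py; it is not taken from the datafusion source.
    """
    non_null = [v for v in values if v is not None]
    return [
        float(len(non_null)),                # count of non-null values
        float(len(values) - len(non_null)),  # null_count
        float(statistics.mean(non_null)),
        float(statistics.stdev(non_null)),   # sample std, divides by n - 1
        float(min(non_null)),
        float(max(non_null)),
        float(statistics.median(non_null)),
    ]


for name, values in DATA.items():
    print(name, describe_column(values))
```

Since none of the three columns contains a null, the second entry comes out as 0.0 for each of them, which is consistent with the updated test data.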

andygrove

LGTM. Thank you @Michael-J-Ward. It is great to see this project keeping up with DataFusion core.

Michael-J-Ward added 9 commits May 13, 2024 13:41

chore: upgrade datafusion Deps

f266cd3

Ref apache#690

update concat and concat_ws to use datafusion_functions

2be45eb

Moved in apache/datafusion#10089

feat: upgrade functions.rs

4d89cd7

Upstream is continuing its migration to UDFs. Ref apache/datafusion#10098 Ref apache/datafusion#10372

fix ScalarUDF import

366c092

feat: remove deprecated supports_filter_pushdown and impl supports_filters_pushdown

f2519a0

Deprecated function removed in apache/datafusion#9923

use unnest_columns_with_options instead of deprecated `unnest_column_with_option`

7a7af87

remove ScalarFunction wrappers

fcb8bab

These relied on upstream BuiltinScalarFunction, which are now removed. Ref apache/datafusion#10098

update dataframe test_describe

ddcd58f

`null_count` was fixed upstream. Ref apache/datafusion#10260

remove PyDFField and related methods

abe09a2

DFField was removed upstream. Ref: apache/datafusion#9595

Michael-J-Ward mentioned this pull request May 13, 2024

Tracking Upgrade to Datafusion 38 #690

Closed

3 tasks

Michael-J-Ward force-pushed the upgrade-df-38 branch from f311d66 to abe09a2 Compare May 13, 2024 21:13

andygrove reviewed May 13, 2024

andygrove reviewed May 13, 2024

Michael-J-Ward added 2 commits May 13, 2024 16:54

bump datafusion-python package version to 38.0.0

cc0b4b2

re-implement PyExpr::column_name

6ae2007

The previous implementation relied on `DFField` which was removed upstream. Ref: apache/datafusion#9595

andygrove mentioned this pull request May 14, 2024

feat: add python bindings for ends_with function #693

Merged

andygrove approved these changes May 14, 2024

andygrove merged commit 01a370e into apache:main May 14, 2024
21 checks passed
