Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Upgrade to datafusion 38 #691

Merged
merged 11 commits into from
May 14, 2024
Merged

Conversation

Michael-J-Ward
Copy link
Contributor

Which issue does this PR close?

Closes #690.

Are there any user-facing changes?

  • DFField and related methods were removed
  • PyScalarFunction and PyBuiltinScalarFunction were removed
  • null_count was fixed upstream so the behavior has changed


pub fn column_name(&self, plan: PyLogicalPlan) -> PyResult<String> {
self._column_name(&plan.plan()).map_err(py_runtime_err)
}
}

impl PyExpr {
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

@jdye64 you may want to review this PR since it removes code that I believe you originally added

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

@jdye64 - I had removed the method because it relied on DFField which was removed in datafusion.

The last commit attempts to re-implement the method using arrow's Field.

I'd still much appreciate any feedback / context!

Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

also cc @charlesbluca

Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

It looks like Dask SQL is using a pinned version of this repo from more than six months ago, so we likely won't get a review from the team right away. The new functionality based on Field looks good to me, so I will go ahead and merge this PR.

Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Yeah this is fine. Honestly we need to come up with a better way to get the column name anyway and as you mentioned are using a pinned older version for now anyway.

Cargo.toml Show resolved Hide resolved
Comment on lines +733 to +735
"a": [3.0, 0.0, 2.0, 1.0, 1.0, 3.0, 2.0],
"b": [3.0, 0.0, 5.0, 1.0, 4.0, 6.0, 5.0],
"c": [3.0, 0.0, 7.0, 1.7320508075688772, 5.0, 8.0, 8.0],
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Why are these changes needed?

Copy link
Contributor Author

@Michael-J-Ward Michael-J-Ward May 13, 2024

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

null_count was fixed upstream in apache/datafusion#10260

The underlying data being described:

>>> print(df)
DataFrame()
+---+---+---+
| a | b | c |
+---+---+---+
| 1 | 4 | 8 |
| 2 | 5 | 5 |
| 3 | 6 | 8 |
+---+---+---+

The previous implementation relied on `DFField` which was removed upstream.

Ref: apache/datafusion#9595
Copy link
Member

@andygrove andygrove left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

LGTM. Thank you @Michael-J-Ward. It is great to see this project keeping up with DataFusion core.

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
None yet
Projects
None yet
Development

Successfully merging this pull request may close these issues.

Tracking Upgrade to Datafusion 38
3 participants