Make it easier to use rust DataFusion UDFs in datafusion-python #1017

Open
timsaucer opened this issue Feb 9, 2025 · 4 comments
Labels
enhancement New feature or request

Comments

@timsaucer
Contributor

Is your feature request related to a problem or challenge? Please describe what you are trying to do.

Suppose someone wants to build a library that is usable by both Rust and Python DataFusion users. They have written a UDF in Rust that implements the Rust DataFusion traits (whether scalar, aggregate, or window). Right now, if that user wants to use their UDF in datafusion-python, they need to expose a variety of methods that basically mimic the trait functions of the Rust code. For scalar UDFs, the interface also requires some wrangling to convert ColumnarValue to PyArrow objects.

While it is possible to do this, it is likely to be error-prone and tedious for implementers to write and maintain this code.

Describe the solution you'd like

We have an established pattern of adding foreign table providers via an FFI interface and PyCapsule. This makes adding a TableProvider a very easy operation. In our example code, the function to expose a table provider is only 6 lines of code and will likely require minimal maintenance.
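To illustrate the general shape of that capsule handshake, here is a minimal, pure-Python sketch. It is not the datafusion-python API: the method name `__example_table_provider__`, the capsule name, and the `c_int64` payload standing in for an FFI struct are all hypothetical. Only the `PyCapsule_New` and `PyCapsule_GetPointer` calls are real CPython C-API functions, reached here through `ctypes`.

```python
import ctypes

# Real CPython C-API functions for capsules, accessed via ctypes.
_capsule_new = ctypes.pythonapi.PyCapsule_New
_capsule_new.restype = ctypes.py_object
_capsule_new.argtypes = [ctypes.c_void_p, ctypes.c_char_p, ctypes.c_void_p]

_capsule_get = ctypes.pythonapi.PyCapsule_GetPointer
_capsule_get.restype = ctypes.c_void_p
_capsule_get.argtypes = [ctypes.py_object, ctypes.c_char_p]

# Hypothetical capsule name. The bytes object must outlive every capsule
# created with it, so it is kept at module level.
CAPSULE_NAME = b"example_table_provider"


class ExampleProvider:
    """Stands in for a Rust-backed object exported to Python."""

    def __init__(self, value):
        # A c_int64 plays the role of the FFI struct the provider owns.
        self._payload = ctypes.c_int64(value)

    def __example_table_provider__(self):  # hypothetical method name
        # Wrap a raw pointer to the payload in a named capsule (no destructor).
        return _capsule_new(ctypes.addressof(self._payload), CAPSULE_NAME, None)


def read_provider(obj):
    """Consumer side: ask for the capsule, check its name, read the payload."""
    capsule = obj.__example_table_provider__()
    # Raises ValueError if the capsule name does not match.
    ptr = _capsule_get(capsule, CAPSULE_NAME)
    return ctypes.c_int64.from_address(ptr).value


print(read_provider(ExampleProvider(7)))
```

The point of the pattern is that the consumer only needs to know the method name and the capsule name; everything behind the pointer is defined by the FFI struct layout, not by Python-level glue code.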

It would be nice to expose all varieties of user-defined functions via FFI, so that they follow the established pattern and users can easily reuse their code.

Describe alternatives you've considered

I did a brief proof of concept in which I used Python calls to the required functions. This did work, but it took quite a bit of code and I suspect it will be difficult to maintain.

Additional context

This may provide additional value in that it would get us much closer to being able to expose a SessionContext via FFI, which would benefit both the datafusion-ray and ballista projects.

@Spaarsh

Spaarsh commented Feb 9, 2025

@timsaucer so if I understood this correctly, we are trying to either create a single FFI_UDF that is capable of ingesting different types of UDF, or create a separate FFI struct for each kind of UDF, i.e., FFI_UDF_Scalar, FFI_UDF_Window, and so on. Right?

@timsaucer
Contributor Author

@Spaarsh I’m thinking the latter. I should have a draft PR for the scalar variant in the next few days to demonstrate.

@Spaarsh

Spaarsh commented Feb 9, 2025

@timsaucer okay! If no one is doing this already, I will try to understand how it can be done for aggregate or window UDFs. Though I am unclear on the approach, I am going through the TableProvider code to gain some clarity! Any tips would be helpful!

@timsaucer
Contributor Author

@Spaarsh I've put up a draft PR for the scalar UDF, but it still has a few points that need cleaning up: apache/datafusion#14579

One thing to be wary of when going down this path is that it is very important to clearly understand which side of the FFI code lives in the provider and which lives in the consumer. I've tried to follow a pattern of FFI_* being the provider side and Foreign* being the consumer side. Additionally, we want to be careful about how much of the DataFusion API we want and need to expose.
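As a rough illustration of that naming split, here is a toy sketch in Python using `ctypes`. It is not the code from the draft PR: the struct layout, the single `invoke` callback over one integer, and the names `FFI_ScalarFunction` and `ForeignScalarFunction` are all hypothetical and only mirror the FFI_* / Foreign* convention described above.

```python
import ctypes

# C-compatible callback signature: int64_t invoke(int64_t)
InvokeFn = ctypes.CFUNCTYPE(ctypes.c_int64, ctypes.c_int64)


class FFI_ScalarFunction(ctypes.Structure):
    """Provider side: a fixed C layout holding only a name and a function pointer.

    Hypothetical layout, for illustration only.
    """
    _fields_ = [("name", ctypes.c_char_p), ("invoke", InvokeFn)]


def _add_one(x):
    # The provider's actual implementation, hidden behind the function pointer.
    return x + 1


# Keep the callback object alive for as long as the struct may be used.
_ADD_ONE_CALLBACK = InvokeFn(_add_one)


def make_provider():
    """Build the struct that the provider hands across the boundary."""
    return FFI_ScalarFunction(b"add_one", _ADD_ONE_CALLBACK)


class ForeignScalarFunction:
    """Consumer side: wraps a received FFI struct behind a friendly interface."""

    def __init__(self, ffi):
        self._ffi = ffi

    @property
    def name(self):
        return self._ffi.name.decode()

    def __call__(self, x):
        # The consumer never sees _add_one directly; it only calls through
        # the function pointer stored in the struct.
        return self._ffi.invoke(x)


foreign = ForeignScalarFunction(make_provider())
print(foreign.name, foreign(41))
```

Keeping the struct small is the design point: anything added to the FFI_* layout becomes part of the boundary both sides must agree on, which is why it matters how much of the API gets exposed.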
