Make it easier to use rust DataFusion UDFs in datafusion-python #1017
Comments
@timsaucer so if I understood this correctly, we are trying to either create an
@Spaarsh I'm thinking the latter. I should have a draft PR for the scalar variant in the next few days to demonstrate.
@timsaucer okay! If no one is doing this already, I will try to understand how it can be done for Aggregator or Window UDFs. I am still unclear on the approach, so I am going through the TableProvider UDF code to gain some clarity. Any tips are helpful!
@Spaarsh I've put up a draft PR for the scalar UDF, but it still has a few points that need cleaning up: apache/datafusion#14579 One thing to be wary of when going down this direction is that it is very important to clearly understand which side of the FFI code lives in the provider and which in the consumer. I've tried to follow a pattern of
Is your feature request related to a problem or challenge? Please describe what you are trying to do.
Suppose someone wants to build a library that is usable by both Rust and Python DataFusion users. They have written a UDF in Rust that implements the Rust DataFusion traits (whether scalar, aggregate, or window). Right now, if that user wants to use their UDF in `datafusion-python`, they need to expose a variety of methods that basically mimic the trait functions of the Rust code. For scalar UDFs the interface requires a bit of wrangling from ColumnarValue to PyArrow objects. While it is possible to do this, it is likely error-prone and tedious for implementers to write and maintain this code.
Describe the solution you'd like
We have an established pattern of adding foreign table providers via the FFI interface and PyCapsule. This makes adding a TableProvider a very easy operation. In our example code, the function to expose a table provider is only 6 lines of code and will likely require minimal maintenance.
It would be nice to expose all of the varieties of user defined functions via FFI to make this follow the established pattern and also easy for users to reuse their code.
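To make the shape of that pattern concrete, here is a minimal pure-Python sketch of the capsule handshake, not the actual implementation. The table provider path uses an export method named `__datafusion_table_provider__`; the method name `__datafusion_scalar_udf__`, the capsule name `datafusion_scalar_udf`, the `RustBackedUdf` class, and the `register_udf` helper are all hypothetical names chosen for illustration. In a real library the capsule would be created on the Rust side and would wrap an FFI struct; here a dummy integer stands in for it, and `ctypes` is used only so the sketch runs without any compiled extension.

```python
import ctypes

# Hypothetical names, assumed for this sketch only.
EXPORT_METHOD = "__datafusion_scalar_udf__"
CAPSULE_NAME = b"datafusion_scalar_udf"  # must outlive every capsule using it

# PyCapsule_New and PyCapsule_IsValid are part of the CPython C API.
_api = ctypes.pythonapi
_api.PyCapsule_New.restype = ctypes.py_object
_api.PyCapsule_New.argtypes = [ctypes.c_void_p, ctypes.c_char_p, ctypes.c_void_p]
_api.PyCapsule_IsValid.restype = ctypes.c_int
_api.PyCapsule_IsValid.argtypes = [ctypes.py_object, ctypes.c_char_p]


class RustBackedUdf:
    """Provider side: stands in for a class exported from a Rust extension."""

    def __init__(self):
        # Placeholder payload; a real provider would hold an FFI struct here.
        self._payload = ctypes.c_int64(42)

    def __datafusion_scalar_udf__(self):
        # Wrap a pointer to the payload in a named capsule (no destructor).
        return _api.PyCapsule_New(
            ctypes.addressof(self._payload), CAPSULE_NAME, None
        )


def register_udf(obj):
    """Consumer side: accept any object that exports the expected capsule."""
    if not hasattr(obj, EXPORT_METHOD):
        raise TypeError(f"object does not implement {EXPORT_METHOD}")
    capsule = getattr(obj, EXPORT_METHOD)()
    # The capsule name acts as a type tag agreed on by provider and consumer.
    if not _api.PyCapsule_IsValid(capsule, CAPSULE_NAME):
        raise ValueError("capsule is missing or has an unexpected name")
    return capsule


udf_source = RustBackedUdf()
capsule = register_udf(udf_source)
print(type(capsule).__name__)
```

The point of the design is that the consumer never imports the provider's types: it only checks for a well-known method and a well-known capsule name, which is what keeps the provider-side code down to a few lines.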
Describe alternatives you've considered
I did a brief proof of concept where I used python calls to the required functions. This did work, but it took quite a bit of code and I suspect it will be difficult to maintain.
Additional context
This may provide additional value in that it would get us much closer to being able to expose a `SessionContext` via FFI, which would benefit both the datafusion-ray and ballista projects.