-
-
Notifications
You must be signed in to change notification settings - Fork 25
Add support for new pandas UDF engine #418
New issue
Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.
By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.
Already on GitHub? Sign in to your account
Conversation
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Good job. I just have a small comment about dimensionality of NumPy arrays, although I don't think this is going to be too important for pandas.
src/blosc2/proxy.py
Outdated
""" | ||
data = cls._ensure_numpy_data(data) | ||
func = decorator(func) | ||
if data.ndim in (1, 2): |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Why this check? I am pretty sure that Blosc2 can handle arrays up to 8 dims (can be made larger by recompiling the underlying C-Blosc2 library). If that is not the case, this is a bug.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Good point. pandas is the one that shouldn't be sending data with more than 2D. I had this check first as I had separate if branches for 1D and 2D, and I wanted to raise for unknown cases, but this is indeed not really needed. I removed it now.
BTW, do you have some preliminary benchmarks to check speed-ups? Thanks for your time! |
Not at this point. I'm not sure about implementing proper benchmarking, since from the pandas side we'll just allow anyone to run the |
Yes, Blosc2 is using numexpr behind the scenes, so it will work well with any mathematical function (sin, log, exp), conditionals, where (in the sense of numpy.where); see https://www.blosc.org/python-blosc2/reference/array_operations.html for the full list. In addition, it also supports most of the ufuncs implemented by NumPy, although the performance may not be as stellar as the ones supported by numexpr (listed above). |
I forgot to mention that |
Great, thanks for confirming. I'm wondering now if it'd be clearer for users if we just raise a NotImplementedError for map. I thought at first that it could make sense to run the vectorized function, but maybe that's confusing for users, if they try the same code with different engines. Maybe better to make engines always behave like the default engine, and raise if the exact behaviour is not supported. What do you think? |
I think |
I updated the PR so the blosc2 engine passes to the udf the data in the way is expected. And I raise |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
LGTM
Thank you Marc! |
BTW, I see that pandas-dev/pandas#61467 has been merged into main. Would that mean that we can start benchmarking with pandas main already? If so, can you suggest an existing benchmark that would suitable for this? If there is not an existing one, I can do some quick bench with some largish dataframe in combination with some function. Thanks again! |
I'm not sure about a particular benchmark. What I have in mind for the pandas docs is to find some use cases that are a good usage for the different JIT compilers, and benchmark those against not using a JIT compiler. I may be wrong, but I don't think it makes too much sense to have a benchmark suite to compare Blosc2 with Numba or with Bodo.ai, since they solve very different use cases for what I know. I'm not sure if it's easy, but personally I'd like to use real-world cases for that. Like, instead of using |
Closes #383
With these changes, it'll be possible to do things like this with pandas and blosc2:
In this PR I'm assuming that
blosc2.jit
is intended to be used with vectorized numpy operations. If that's not the case, and you want to support scalar functions and jit Python loops (see example below), let me know and we can implement this for map:Also,
apply
is designed in pandas to call the udf for each column or row in a dataframe. I'm passing the whole array since it's a numpy array and I think operations should be vectorized, but I can make it work by columns or rows.