feat: Implementation of udf and udaf decorator #1040

Open
wants to merge 4 commits into
base: main
from

Conversation

CrystalZhou0529

@CrystalZhou0529 CrystalZhou0529 commented Mar 2, 2025

Which issue does this PR close?

Closes #806

Rationale for this change

This PR implements decorators for udf and udaf to make UDF creation easier.

Idea was suggested by: apache/datafusion-site#17 (comment)

What changes are included in this PR?

  • Implemented two decorator methods and exposed them to users. The internal logic is simple because each decorator is a thin wrapper around the existing udf/udaf method.
  • Added test cases to validate that the decorator methods are equivalent to the original APIs.
  • Slightly modified tests/test_udf.py to make the is_null test cases more reliable. Previously, none of the input values were null, so the expected result was always [False, False, False], which is the same as the default empty vector. This caused confusion during development because the test didn't fail when my implementation was wrong. I changed one value to NULL so that the expected output becomes [False, False, True], which tests the functionality better.
  • To let udf serve as both a function and a decorator, we check whether the first argument is a Callable. If so, it's a function call. If not, it's a decorator call.
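
The callable check in the last bullet can be sketched roughly as follows. This is a minimal illustration, not the code in this PR: `udf_sketch` and `_build_udf` are hypothetical names, and `_build_udf` only records its arguments instead of constructing a real DataFusion UDF. Type names are plain strings here so the sketch runs without pyarrow.

```python
from typing import Any, Callable


def _build_udf(func: Callable, input_types: list, return_type: Any, volatility: str) -> dict:
    """Hypothetical stand-in for the real UDF constructor: it only records its arguments."""
    return {
        "name": func.__name__,
        "func": func,
        "input_types": input_types,
        "return_type": return_type,
        "volatility": volatility,
    }


def udf_sketch(*args: Any, **kwargs: Any):
    """Behave as a function or as a decorator, depending on the first positional argument."""
    if args and callable(args[0]):
        # Function call: udf_sketch(func, input_types, return_type, volatility)
        return _build_udf(*args, **kwargs)

    # Decorator call: @udf_sketch(input_types, return_type, volatility)
    def decorator(func: Callable) -> dict:
        return _build_udf(func, *args, **kwargs)

    return decorator


def double_impl(x: int) -> int:
    return x * 2


# Function form: the callable comes first.
double_a = udf_sketch(double_impl, ["int64"], "int64", "stable")


# Decorator form: the callable is supplied by the @ syntax.
@udf_sketch(["int64"], "int64", "stable")
def double_b(x: int) -> int:
    return x * 2
```

One consequence of this design is that the dispatch relies on the first decorator argument (the list of input types) never being callable itself.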

Are there any user-facing changes?

Yes, this PR provides a more straightforward way for users to create UDF and UDAF.

Old way to create UDF:

def is_of_interest_impl(
    partkey_arr: pa.Array,
    suppkey_arr: pa.Array,
    returnflag_arr: pa.Array,
) -> pa.Array:
    # Implementation skipped
    ...

is_of_interest = udf(
    is_of_interest_impl,
    [pa.int64(), pa.int64(), pa.utf8()],
    pa.bool_(),
    "stable",
)

df_udf_filter = df_lineitem.filter(
    is_of_interest(col("l_partkey"), col("l_suppkey"), col("l_returnflag"))
)

New way to create UDF:

@udf(
    [pa.int64(), pa.int64(), pa.utf8()],
    pa.bool_(),
    "stable")
def is_of_interest(
    partkey_arr: pa.Array,
    suppkey_arr: pa.Array,
    returnflag_arr: pa.Array,
) -> pa.Array:
    # Implementation skipped
    ...

df_udf_filter = df_lineitem.filter(
    is_of_interest(col("l_partkey"), col("l_suppkey"), col("l_returnflag"))
)

Old way to create UDAF:

def sum_bias_10_impl() -> Summarize:
    return Summarize(10.0)

sum_bias_10 = udaf(sum_bias_10_impl, pa.float64(), pa.float64(), [pa.float64()], "immutable")
sum_bias_10(...)

New way to create UDAF:

@udaf(pa.float64(), pa.float64(), [pa.float64()], "immutable")
def sum_bias_10() -> Summarize:
    return Summarize(10.0)

sum_bias_10(...)

@CrystalZhou0529 CrystalZhou0529 changed the title Implementation of udf and udaf decorator feat: Implementation of udf and udaf decorator Mar 2, 2025
@CrystalZhou0529 CrystalZhou0529 marked this pull request as ready for review March 2, 2025 22:46
Contributor

@timsaucer timsaucer left a comment

This is a very nice addition and I love that you've already got some good unit tests.

I think from an end user perspective it might be slightly nicer if we could call these just @udf instead of @udf_decorator. I think it can be done.

This isn't my strongest suit, so I got an LLM to generate this code:

import functools

class udf:
    """Acts both as a function and a decorator."""

    def __new__(cls, func_or_value):
        if callable(func_or_value):
            return cls._decorator(func_or_value)  # If used as a decorator
        else:
            return cls._function(func_or_value)  # If used as a function

    @staticmethod
    def _function(value):
        """Original function behavior."""
        return value * 2  # Example behavior

    @staticmethod
    def _decorator(func):
        """Decorator behavior."""
        @functools.wraps(func)
        def wrapper(*args, **kwargs):
            print(f"Calling {func.__name__} with {args}, {kwargs}")
            result = func(*args, **kwargs)
            print(f"Result: {result}")
            return result
        return wrapper

Obviously we would need to adapt that some for our use case. What do you think?
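
One possible adaptation of the class-based idea above, as a rough sketch only: `udf_cls_sketch` and the `udf_meta` attribute are hypothetical, and `_function` merely attaches the type information to a wrapper rather than building a real UDF. The main change from the generated snippet is that `__new__` accepts the extra type and volatility arguments, so the decorator path has to return a closure that waits for the function.

```python
import functools
from typing import Any, Callable


class udf_cls_sketch:
    """Hypothetical adaptation: acts as a function or as a parameterized decorator."""

    def __new__(cls, *args: Any, **kwargs: Any):
        if args and callable(args[0]):
            return cls._function(*args, **kwargs)  # udf_cls_sketch(func, ...)
        return cls._decorator(*args, **kwargs)  # @udf_cls_sketch(...)

    @staticmethod
    def _function(func: Callable, input_types: list, return_type: Any, volatility: str):
        """Stand-in for building the UDF: wraps func and attaches its metadata."""

        @functools.wraps(func)
        def wrapper(*call_args: Any, **call_kwargs: Any):
            return func(*call_args, **call_kwargs)

        wrapper.udf_meta = (input_types, return_type, volatility)
        return wrapper

    @staticmethod
    def _decorator(input_types: list, return_type: Any, volatility: str):
        """Return a decorator that remembers the type arguments until func arrives."""

        def decorate(func: Callable):
            return udf_cls_sketch._function(func, input_types, return_type, volatility)

        return decorate


# Decorator form
@udf_cls_sketch(["int64"], "int64", "stable")
def add_one(x: int) -> int:
    return x + 1


# Function form
times_two = udf_cls_sketch(lambda x: x * 2, ["int64"], "int64", "stable")
```

Because `__new__` returns a plain function rather than an instance of the class, `isinstance(add_one, udf_cls_sketch)` is false. That is one reason a plain function with the same callable check may be the simpler choice.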

@CrystalZhou0529
Author

Thanks for your suggestion! I totally agree that @udf is a better name. I'll experiment with it and provide an update soon!

@CrystalZhou0529
Author

@timsaucer Hi, I borrowed your suggested idea and managed to get it to work! I also used an LLM a bit to write the documentation. I hope it's not too confusing for users that the APIs for the function call and the decorator call are slightly different (one takes the callable as its first parameter and the other does not). Please let me know if you have any feedback on how to improve it!

@timsaucer
Contributor

This is looking very nice. Would you mind if I do some wordsmithing on the documentation? I'll also run the workflows now.

@CrystalZhou0529
Author

CrystalZhou0529 commented Mar 3, 2025

No problem! I will also fix the linting errors!

Development

Successfully merging this pull request may close these issues.

Add udf / udaf decorators
2 participants