[SPARK-53029][PYTHON] Support return type coercion for Arrow Python UDTFs #52140
Conversation
@@ -201,9 +201,26 @@ class ArrowStreamArrowUDTFSerializer(ArrowStreamUDTFSerializer):
    Serializer for PyArrow-native UDTFs that work directly with PyArrow RecordBatches and Arrays.
    """

-    def __init__(self, table_arg_offsets=None):
+    def __init__(self, table_arg_offsets=None, arrow_cast=False):
the default value should be True?
yep! I changed it to True and added a SQLConf to gate it
def test_arrow_udtf_with_empty_column_result(self):
    @arrow_udtf(returnType=StructType())
    class EmptyResultUDTF:
        def eval(self) -> Iterator["pa.Table"]:
            yield pa.Table.from_struct_array(pa.array([{}] * 3))

-    assertDataFrameEqual(EmptyResultUDTF(), [Row(), Row(), Row()])
+    assertDataFrameEqual(EmptyResultUDTF(), [None, None, None])
I guess this change is unexpected?
Good catch! I have reverted it and created an empty batch with the number of rows set.
Thanks for supporting this!
if batch.num_columns == 0:
    # When batch has no column, it should still create
    # an empty batch with the number of rows set.
    struct = pa.array([{}] * batch.num_rows)
    coerced_batch = pa.RecordBatch.from_arrays([struct], ["_0"])
I don't think we need to handle this case? cc @ueshin
This is to ensure the test case "test_arrow_udtf_with_empty_column_result" works. Please refer to #52140 (comment) for the unexpected behavior change.
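To illustrate the workaround, here is a minimal standalone PyArrow sketch (not the PR code itself; the "_0" column name mirrors the snippet above):

import pyarrow as pa

# An empty-struct array still carries a length, so wrapping it in a
# RecordBatch preserves the row count even though there are no real columns.
struct = pa.array([{}] * 3)
batch = pa.RecordBatch.from_arrays([struct], ["_0"])
print(batch.num_rows)  # 3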
@@ -201,9 +201,26 @@ class ArrowStreamArrowUDTFSerializer(ArrowStreamUDTFSerializer):
    Serializer for PyArrow-native UDTFs that work directly with PyArrow RecordBatches and Arrays.
    """

-    def __init__(self, table_arg_offsets=None):
+    def __init__(self, table_arg_offsets=None, arrow_cast=True):
Let's enable arrow_cast by default for ArrowUDTFs (it's a new feature) so we don't need a flag here.
if arr.type == arrow_type:
    return arr
elif self._arrow_cast:
    return arr.cast(target_type=arrow_type, safe=True)
What's the difference between safe=True vs safe=False?
It will only allow casts that are guaranteed not to lose information. Truncation (floats to ints), narrowing (int64 → int8), or precision loss are not allowed. Will add a comment
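For illustration, a standalone PyArrow snippet (not from the PR) showing the difference:

import pyarrow as pa

arr = pa.array([1.5, 2.7], type=pa.float64())

# safe=False silently truncates the fractional part
print(arr.cast(pa.int64(), safe=False))  # [1, 2]

# safe=True refuses any lossy conversion
try:
    arr.cast(pa.int64(), safe=True)
except pa.ArrowInvalid as e:
    print(e)  # e.g. "Float value 1.5 was truncated converting to int64"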
cc @zhengruifeng is this the same behavior as Arrow UDFs?
assert isinstance(
    batch, pa.RecordBatch
), f"Expected pa.RecordBatch, got {type(batch)}"
I think we already check this in worker.py so no need to duplicate this check :)
raise PySparkRuntimeError(
    errorClass="UDTF_RETURN_SCHEMA_MISMATCH",
    messageParameters={
        "expected": str(len(arrow_return_type)),
        "actual": str(batch.num_columns),
        "func": "ArrowUDTF",
    },
)
Ditto. I think we already check whether the returned columns mismatch the expected return schema in worker.py. Would you mind double-checking?
Do you mean verify_arrow_result in worker.py? I removed it since verify_arrow_result requires the return type to strictly match arrow_return_type in the conversion pa.Table.from_batches([result], schema=pa.schema(list(arrow_return_type))).
verify_arrow_result(
    pa.Table.from_batches([result], schema=pa.schema(list(arrow_return_type))),
    assign_cols_by_name=False,
    expected_cols_and_types=[
        (col.name, to_arrow_type(col.dataType)) for col in return_type.fields
    ],
)
The column length is checked before it. Please take a look at:
if result.num_columns != return_type_size:
    ...
in verify_result.
if arr.type == arrow_type:
    return arr
elif self._arrow_cast:
    return arr.cast(target_type=arrow_type, safe=True)
Also, it would be great to list the type coercion rules here!
added a comment
with self.assertRaisesRegex(PythonException, "Schema at index 0 was different"):
    result_df = MismatchedSchemaUDTF()
    result_df.collect()
if self.spark.conf.get("spark.sql.execution.pythonUDTF.typeCoercion.enabled").lower() == "false":
you can use with self.sql_conf("...")
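For example, a sketch of the suggested pattern (assuming the surrounding test class and the conf key used in this PR revision):

with self.sql_conf({"spark.sql.execution.pythonUDTF.typeCoercion.enabled": "false"}):
    with self.assertRaisesRegex(PythonException, "Schema at index 0 was different"):
        MismatchedSchemaUDTF().collect()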
val PYTHON_TABLE_UDF_TYPE_CORERION_ENABLED =
    buildConf("spark.sql.execution.pythonUDTF.typeCoercion.enabled")
Let's enable Arrow cast for Arrow Python UDTFs by default so we don't need this config :)
sure, on it
Update: done
class MismatchedSchemaUDTF:
    def eval(self) -> Iterator["pa.Table"]:
        result_table = pa.table(
            {
                "wrong_col": pa.array([1], type=pa.int32()),
                "another_wrong_col": pa.array([2.5], type=pa.float64()),
                "col_with_arrow_cast": pa.array([1], type=pa.int32()),
What if we have input to be int64 and output to be int32? Does arrow cast throw an exception in this case?
Yes, it will. We have a test case "test_return_type_coercion_overflow" covering this.
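A quick standalone illustration (not the test itself) of what the safe cast does on overflow:

import pyarrow as pa

arr = pa.array([2**40], type=pa.int64())
try:
    arr.cast(pa.int32(), safe=True)
except pa.ArrowInvalid as e:
    print(e)  # integer value out of range for int32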
result_df = MismatchedSchemaUDTF()
result_df.collect()
else:
    with self.assertRaisesRegex(PythonException, "Failed to parse string: 'wrong_col' as a scalar of type int32"):
Hmm looks like without arrow cast, the error message looks better.
I added a try-catch block to polish the error message with the arrow cast
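Something along these lines (a hypothetical sketch of the pattern, not the exact code that landed; the helper name and message wording are illustrative):

import pyarrow as pa
from pyspark.errors import PySparkRuntimeError

def _cast_to_return_type(arr: pa.Array, arrow_type: pa.DataType) -> pa.Array:
    try:
        return arr.cast(target_type=arrow_type, safe=True)
    except (pa.ArrowInvalid, pa.ArrowNotImplementedError) as e:
        # Surface a UDTF-specific message instead of the raw Arrow error.
        raise PySparkRuntimeError(
            f"ArrowUDTF failed to cast column of type {arr.type} "
            f"to the declared return type {arrow_type}: {e}"
        ) from e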
for packed in iterator:
    batch, arrow_return_type = packed
nit:
-    for packed in iterator:
-        batch, arrow_return_type = packed
+    for batch, arrow_return_type in iterator:
@@ -4003,6 +4003,7 @@ object SQLConf {
    .booleanConf
    .createWithDefault(false)
+
nit: revert this?
ser = ArrowStreamArrowUDTFSerializer(
    table_arg_offsets=table_arg_offsets
)
Looks like an unnecessary change?
@@ -612,6 +690,7 @@ class ArrowUDTFTests(ArrowUDTFTestsMixin, ReusedSQLTestCase):
    pass
+
Ditto.
Could you run ./dev/reformat-python to make the linter happy?
if should_write_start_length:
    write_int(SpecialLengths.START_ARROW_STREAM, stream)
    should_write_start_length = False

yield coerced_batch
These are done in super().dump_stream(). What we should do here is just type-casting.
I'm just wondering whether we can use RecordBatch.cast for this instead of casting each column?
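For reference, RecordBatch.cast (available in newer PyArrow releases) casts all columns against a target schema in one call; a minimal standalone sketch:

import pyarrow as pa

batch = pa.RecordBatch.from_arrays(
    [pa.array([1, 2], type=pa.int32()), pa.array([0.5, 1.5], type=pa.float32())],
    ["a", "b"],
)
target = pa.schema([pa.field("a", pa.int64()), pa.field("b", pa.float64())])
coerced = batch.cast(target, safe=True)  # casts every column to the declared schema
print(coerced.schema)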
What changes were proposed in this pull request?
Support return type coercion for Arrow Python UDTFs by doing arrow_cast by default.

Why are the changes needed?
Consistent behavior across Arrow UDFs and Arrow UDTFs.

Does this PR introduce any user-facing change?
No, Arrow UDTF is not a public API yet.

How was this patch tested?
New and existing UTs.

Was this patch authored or co-authored using generative AI tooling?
No.