feat: support struct_pack function if duckdb enabled #68

Closed
19 changes: 18 additions & 1 deletion src/gateway/converter/spark_functions.py
@@ -2,6 +2,7 @@
"""Provides the mapping of Spark functions to Substrait."""
import dataclasses

from backends.backend_options import BackendEngine
from gateway.converter.conversion_options import ConversionOptions
from substrait.gen.proto import algebra_pb2, type_pb2

@@ -460,11 +461,27 @@ def __lt__(self, obj) -> bool:
        i64=type_pb2.Type.I64(
            nullability=type_pb2.Type.Nullability.NULLABILITY_REQUIRED))),
}
SPARK_SUBSTRAIT_MAPPING_FOR_DUCKDB = {
    'struct': ExtensionFunction(
        '/functions_structs.yaml', 'struct_pack:any_str', type_pb2.Type(
Contributor:
This location isn't a standard Substrait extension as defined here: https://github.com/substrait-io/substrait/tree/main/extensions

The way to create structs on the fly in Substrait is with the Nested feature: https://github.com/substrait-io/substrait/blob/main/proto/substrait/algebra.proto#L915

The way I'd go about implementing this would be to catch an attempt to use this in spark_to_substrait (in convert_function) and then expand it to create the appropriate structure. That function would also be responsible for constructing the return type.

That said, I'm not sure how many backends implement the Nested expression feature. I will check with the DuckDB folks tomorrow to see if they have time to add it.

Contributor Author @pat70 (Aug 7, 2024):
  • Re:
    This location isn't a standard Substrait extension

This was attempted based on the "struct_extract" usage here, and it worked 🤞:

    'struct_extract': ExtensionFunction(
        '/functions_structs.yaml', 'struct_extract:any_str', type_pb2.Type(
  • Re:
The way I'd go about implementing this would be to catch an attempt to use this in spark_to_substrait (in convert_function) and then expand it to create the appropriate structure. That function would also be responsible for constructing the return type.

I think I understand.

  • Re:
    I'm not sure how many backends implement the Nested expression feature

I think DuckDB does not support STRUCT logical types. I've only tried the Substrait DuckDB extension so far, to produce Substrait from a DuckDB-supported query. It fails with a "Not implemented error" for queries with struct_pack usages, e.g.

CALL get_substrait_json("
SELECT 
struct_pack(cust_name:=c_name, cust_key:=c_custkey) as test_struct 
FROM 
read_parquet('<base_path>/third_party/tpch/parquet/customer/*.parquet') LIMIT 10
")

Is there another good way to test production and consumption of Substrait from DuckDB-queries?

  • Re:
    I will check with the DuckDB folks tomorrow to see if they have time to add it.

I'd be happy to subscribe to a thread, and maybe find time to help.

Contributor @pthatte1-bb (Aug 8, 2024):
That said, I'm not sure how many backends implement the Nested expression feature. I will check with the DuckDB folks tomorrow to see if they have time to add it.

In case it helps, here's a snippet showing "nested-expression"-like support in DuckDB (used via the polars LazyFrame API):

import duckdb
import polars as pl

parquet_path = "<basepath>/third_party/tpch/parquet/customer"
df = (pl.scan_parquet(parquet_path)
      .select(pl.col("c_custkey").alias("cust_key"), pl.col("c_name").alias("cust_name"))
      .select(pl.struct(pl.col("cust_key"), pl.col("cust_name")).alias("test_struct"))
      .select(pl.col("test_struct").struct.field("cust_key"), pl.col("test_struct").struct.field("cust_name"))
      )
duckdb.sql("SELECT * from df limit 10").show()

        i64=type_pb2.Type.I64(
            nullability=type_pb2.Type.Nullability.NULLABILITY_REQUIRED))),
    **SPARK_SUBSTRAIT_MAPPING
}


def _find_mapping(options: ConversionOptions) -> dict[str, ExtensionFunction]:
    match options.backend.backend:
        case BackendEngine.DUCKDB:
            return SPARK_SUBSTRAIT_MAPPING_FOR_DUCKDB
        case _:
            return SPARK_SUBSTRAIT_MAPPING
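
The merge pattern in `SPARK_SUBSTRAIT_MAPPING_FOR_DUCKDB` above can be illustrated with plain dicts (the values here are hypothetical stand-ins, not the real `ExtensionFunction` entries). Because `**SPARK_SUBSTRAIT_MAPPING` is splatted last, any key present in the base mapping would take precedence over a DuckDB-specific entry with the same name:

```python
# Stand-in mappings illustrating the override pattern: DuckDB-specific
# entries come first, and **BASE_MAPPING is splatted last, so a key that
# exists in both dicts resolves to the base entry.
BASE_MAPPING = {
    'struct_extract': 'struct_extract:any_str',
}
DUCKDB_MAPPING = {
    'struct': 'struct_pack:any_str',  # DuckDB-only addition
    **BASE_MAPPING,
}


def find_mapping(backend: str) -> dict[str, str]:
    # Mirrors _find_mapping: pick the backend-specific table, else the base.
    return DUCKDB_MAPPING if backend == 'duckdb' else BASE_MAPPING
```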


def lookup_spark_function(name: str, options: ConversionOptions) -> ExtensionFunction:
    """Return a Substrait function given a spark function name."""
-   definition = SPARK_SUBSTRAIT_MAPPING.get(name)
+   mapping = _find_mapping(options)
+   definition = mapping.get(name)
    if definition is None:
        raise ValueError(f'Function {name} not found in the Spark to Substrait mapping table.')
    if not options.return_names_with_types: