@SCHJonathan SCHJonathan commented Aug 28, 2025

What changes were proposed in this pull request?

Introduces a mechanism for lazy execution of Declarative Pipelines query functions. A query function is something like the mv1 in this example:

@materialized_view
def mv1():
    return spark.table("upstream_table").filter(some_condition)

Currently, query functions are always executed eagerly: the implementation of the materialized_view decorator immediately invokes the function it decorates and then registers the resulting DataFrame with the server.
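To make the eager behavior concrete, here is a toy pure-Python sketch of what "the decorator immediately invokes the function" means. The names (registered_flows, the string stand-in for a DataFrame plan) are illustrative, not the actual pipeline implementation:

```python
# Hypothetical sketch of the CURRENT eager behavior. In the real client the
# decorator would register the resulting DataFrame with the server; here a
# dict stands in for that registration.
registered_flows = {}

def materialized_view(fn):
    # Eager: the query function runs at decoration time, and its result
    # is registered right away.
    registered_flows[fn.__name__] = fn()
    return fn

@materialized_view
def mv1():
    return "plan-for-mv1"  # stands in for a DataFrame plan

print(registered_flows)  # mv1 has already been executed and registered
```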

This PR introduces Spark Connect proto changes that enable executing query functions later, initiated by the server during graph resolution. After all datasets and flows have been registered with the server, the server can tell the client to execute the query functions for flows that haven't yet been executed successfully. The way this works is that the client initiates an RPC with the server, and the server streams back responses indicating when it's time to execute the query function for one of the client's flows. Relevant changes:

  • New QueryFunctionFailure message
  • New QueryFunctionResult message
  • Replace relation field in DefineFlow with query_function_result field
  • New DefineFlowQueryFunctionResult message
  • New GetQueryFunctionExecutionSignalStream message
  • New PipelineQueryFunctionExecutionSignal message
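The register-then-drain-a-signal-stream flow described above can be sketched in plain Python. This is a toy simulation, not the real Spark Connect RPC: the signal stream is a generator, and all names (pending, register_flow, signal_stream, run_client) are hypothetical:

```python
# Hypothetical sketch of the LAZY protocol. The client registers query
# functions without calling them, then iterates over a server-driven
# stream of execution signals naming which flow to execute next.
pending = {}

def register_flow(name, fn):
    # Lazy: store the query function; do NOT call it yet.
    pending[name] = fn

def signal_stream():
    # Stands in for the streamed PipelineQueryFunctionExecutionSignal
    # responses. The server chooses the order during graph resolution;
    # here it knows mv1 must run before mv2.
    yield "mv1"
    yield "mv2"

def run_client():
    results = {}
    for flow_name in signal_stream():
        # Execute the query function only when the server signals it.
        results[flow_name] = pending[flow_name]()
    return results

# Registration order (mv2 first) no longer matters.
register_flow("mv2", lambda: "plan-for-mv2")
register_flow("mv1", lambda: "plan-for-mv1")
print(run_client())
```

The point of the sketch is that execution order is decoupled from registration order: the server, which sees the whole dataset graph, decides when each query function runs.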

Why are the changes needed?

There are some situations where we can't resolve the relation immediately at the time we're registering a flow.

E.g. consider this situation:
file 1:

@materialized_view
def mv1():
    data = [("Alice", 10), ("Bob", 15), ("Alice", 5)]
    return spark.createDataFrame(data, ["name", "amount"])

file 2:

@materialized_view
def mv2():
    return spark.table("mv1").groupBy("name").agg(sum("amount").alias("total_amount"))

Unlike many transformations, which are analyzed lazily, groupBy can trigger an AnalyzePlan Spark Connect request immediately. If the query function for mv2 is executed before mv1's, it hits an error, because mv1 doesn't exist yet. groupBy isn't the only operation that triggers eager analysis (df.schema, etc.).
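A toy illustration of the ordering problem, with no Spark involved: a dict stands in for the catalog, and an analyze helper stands in for the eager AnalyzePlan request that groupBy would trigger. All names here are hypothetical:

```python
# Toy model: mv2's query function "analyzes" mv1 immediately, the way a
# groupBy (or df.schema) would trigger an eager AnalyzePlan request.
catalog = {}

def analyze(table):
    # Stands in for an eager AnalyzePlan request; fails if the upstream
    # table hasn't been created yet.
    if table not in catalog:
        raise LookupError(f"table {table} not found")
    return catalog[table]

def mv1_query():
    catalog["mv1"] = ["name", "amount"]
    return catalog["mv1"]

def mv2_query():
    # Eager analysis of the upstream table happens inside the query function.
    return analyze("mv1") + ["total_amount"]

# Running mv2's query function first (file registration order) fails;
# running mv1's first, as the server would signal, succeeds.
try:
    mv2_query()
except LookupError as e:
    print("eager-order failure:", e)
mv1_query()
print("server-chosen order succeeds:", mv2_query())
```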

Other examples of these kinds of situations:

  • The set of columns for a downstream table is determined from the set of columns in an upstream table.
  • When spark.sql is used.

Does this PR introduce any user-facing change?

No

How was this patch tested?

These are proto-only changes. Unit tests and E2E tests will follow once the implementation is added.

Was this patch authored or co-authored using generative AI tooling?

No

@SCHJonathan SCHJonathan changed the title Jonathan chang data/proto changes [SPARK-52807][SDP] Proto changes to support analysis inside Declarative Pipelines query functions Aug 28, 2025