Conversation

JiaqiWang18 (Contributor) commented Aug 25, 2025

What changes were proposed in this pull request?

Propagate the partition columns specified on the destination table into the batch table write during flow execution.

This fixes the following exception during flow execution:

org.apache.spark.sql.AnalysisException: Specified partitioning does not match that of the existing table spark_catalog.default.mv.
Specified partition columns: [].
Existing partition columns: [id_mod].

Why are the changes needed?

In "append" mode with saveAsTable, the partition columns must be specified in the query, because the format and options of the existing table are reused, and that table may have been created with partition columns.
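The failing check and the fix can be sketched with a small standalone model (plain Python, not Spark's actual implementation; `AnalysisException` and `append_to_table` here are illustrative stand-ins for the real saveAsTable append path):

```python
class AnalysisException(Exception):
    """Stand-in for org.apache.spark.sql.AnalysisException."""
    pass


def append_to_table(existing_partition_cols, specified_partition_cols):
    """Model of the append-mode check: the partitioning specified on the
    write must match the partitioning of the existing table."""
    if list(specified_partition_cols) != list(existing_partition_cols):
        raise AnalysisException(
            "Specified partitioning does not match that of the existing table. "
            f"Specified partition columns: {list(specified_partition_cols)}. "
            f"Existing partition columns: {list(existing_partition_cols)}."
        )
    return "ok"


# Before the fix: flow execution passed no partition columns, so the
# check fails exactly as in the stack trace above.
try:
    append_to_table(["id_mod"], [])
except AnalysisException as e:
    print("fails:", e)

# After the fix: the partition columns declared on the destination table
# are propagated into the write, so the check passes.
print(append_to_table(["id_mod"], ["id_mod"]))
```

This only models the shape of the check; the actual change wires the destination table's partition columns into the writer used by the flow.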

Does this PR introduce any user-facing change?

No

How was this patch tested?

New unit tests

Was this patch authored or co-authored using generative AI tooling?

No

JiaqiWang18 (Contributor, Author) commented:

@anishm-db @gengliangwang

Comment on lines +450 to +452
val updateContext = new PipelineUpdateContextImpl(graph, eventCallback = _ => ())
updateContext.pipelineExecution.runPipeline()
updateContext.pipelineExecution.awaitCompletion()
Contributor:

nit: use PipelineTest.startPipelineAndWaitForCompletion

Contributor Author:

We actually don't extend PipelineTest in PythonPipelineSuite; adding it causes some conflicts in the inheritance hierarchy. PipelineTest does have a lot of helpful methods like checkAnswer, so it might be worth doing that refactor separately.


// check table is created with correct partitioning
Seq("mv", "st").foreach { tableName =>
val table = spark.sessionState.catalog.getTableMetadata(graphIdentifier(tableName))
Contributor:

nit: let's use the DSv2 API, i.e. spark.sessionState.catalogManager

Comment on lines +459 to +461
val rows = spark.table(tableName).collect().map(r => (r.getLong(0), r.getLong(1))).toSet
val expected = (0 until 5).map(id => (id.toLong, (id % 2).toLong)).toSet
assert(rows == expected)
Contributor:

nit: use checkAnswer

Comment on lines 295 to 298
val tableMeta = spark.sessionState.catalog.getTableMetadata(
fullyQualifiedIdentifier(tableName)
)
assert(tableMeta.partitionColumnNames == Seq("id_mod"))
Contributor:

Same nit as above: use DSv2 if possible.

@@ -434,6 +434,34 @@ class PythonPipelineSuite
.map(_.identifier) == Seq(graphIdentifier("a"), graphIdentifier("something")))
}

test("MV/ST with partition columns works") {
Contributor:

Should we also add a test for what happens when a user attempts to change partition columns between pipeline runs?

I'd expect that to either trigger a full refresh or throw an exception. Either is probably acceptable, but it'd be nice for this behavior to be tested.

JiaqiWang18 (Contributor, Author) commented Aug 29, 2025:

I think we already have a test for this: "specifying partition column different from existing partitioned table" in MaterializeTablesSuite; it throws a CANNOT_UPDATE_PARTITION_COLUMNS error.
