[SPARK-52346][SDP] Propagate partition columns from destination for BatchTableWrite #52119
base: master
Conversation
val updateContext = new PipelineUpdateContextImpl(graph, eventCallback = _ => ())
updateContext.pipelineExecution.runPipeline()
updateContext.pipelineExecution.awaitCompletion()
nit: use PipelineTest.startPipelineAndWaitForCompletion
We actually don't extend PipelineTest in PythonPipelineSuite; adding it causes conflicts in the inheritance hierarchy. PipelineTest does have a lot of helpful methods like checkAnswer, so it might be worth doing that refactor separately.
// check table is created with correct partitioning
Seq("mv", "st").foreach { tableName =>
  val table = spark.sessionState.catalog.getTableMetadata(graphIdentifier(tableName))
nit: let's use the DSv2 API, i.e. spark.sessionState.catalogManager
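The suggestion above could look roughly like the following sketch, which resolves the table through the DSv2 `CatalogManager` instead of the legacy session catalog. The catalog and identifier names are illustrative and assume a running SparkSession.

```scala
import org.apache.spark.sql.connector.catalog.{Identifier, TableCatalog}

// Resolve the catalog via the DSv2 CatalogManager rather than
// spark.sessionState.catalog (the legacy V1 session catalog).
val catalog = spark.sessionState.catalogManager
  .catalog("spark_catalog")
  .asInstanceOf[TableCatalog]

// loadTable returns a DSv2 Table; partitioning() exposes the
// Transform-based partition spec instead of V1 partitionColumnNames.
val table = catalog.loadTable(Identifier.of(Array("default"), "mv"))
val partitioning = table.partitioning()
```

With DSv2, partitioning is expressed as `Transform`s (e.g. an identity transform over `id_mod`), so assertions would compare against `Expressions.identity("id_mod")` rather than a `Seq[String]` of column names.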
val rows = spark.table(tableName).collect().map(r => (r.getLong(0), r.getLong(1))).toSet
val expected = (0 until 5).map(id => (id.toLong, (id % 2).toLong)).toSet
assert(rows == expected)
nit: use checkAnswer
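A sketch of the suggested rewrite, assuming the suite (or a refactored base) mixes in QueryTest so that `checkAnswer` is available:

```scala
import org.apache.spark.sql.Row

// checkAnswer compares a DataFrame's result set against expected Rows,
// producing a readable diff on mismatch instead of a bare assert failure.
val expected = (0 until 5).map(id => Row(id.toLong, (id % 2).toLong))
checkAnswer(spark.table(tableName), expected)
```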
val tableMeta = spark.sessionState.catalog.getTableMetadata(
  fullyQualifiedIdentifier(tableName)
)
assert(tableMeta.partitionColumnNames == Seq("id_mod"))
Same nit as above, use DSv2 if possible
@@ -434,6 +434,34 @@ class PythonPipelineSuite
      .map(_.identifier) == Seq(graphIdentifier("a"), graphIdentifier("something")))
  }

  test("MV/ST with partition columns works") {
Should we also add a test for what happens when a user attempts to change partition columns between pipeline runs?
I'd expect that to either trigger a full refresh or throw an exception. Either is probably acceptable, but it'd be nice for this behavior to be tested.
I think we already have a test for this: "specifying partition column different from existing partitioned table" in MaterializeTablesSuite; it throws a CANNOT_UPDATE_PARTITION_COLUMNS error.
What changes were proposed in this pull request?
Propagate the partition columns specified in the destination table into the batch table write during flow execution.
This fixes the following exception during flow execution:
Why are the changes needed?
In "append" mode with saveAsTable, partition columns must be specified in the query because the format and options of the existing table are used, and the table may have been created with partition columns.
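A minimal sketch of the scenario described above, with illustrative table and column names: appending to a table that was created partitioned requires the writer to carry the same partitioning, otherwise Spark rejects the write.

```scala
import org.apache.spark.sql.functions.col

// The table "st" is assumed to already exist, partitioned by "id_mod".
// Without .partitionBy("id_mod"), an append-mode saveAsTable against it
// fails because the write's partitioning does not match the table's.
spark.range(5)
  .withColumn("id_mod", col("id") % 2)
  .write
  .mode("append")
  .partitionBy("id_mod")
  .saveAsTable("st")
```

This is why the fix propagates the destination table's partition columns into the batch write rather than relying on the flow's query to restate them.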
Does this PR introduce any user-facing change?
No
How was this patch tested?
New unit tests
Was this patch authored or co-authored using generative AI tooling?
No