Skip to content

Conversation

pan3793
Copy link
Member

@pan3793 pan3793 commented Aug 28, 2025

What changes were proposed in this pull request?

I saw the test failure when trying to upgrade Parquet to 1.16.0, actually, this occurs many times in previous Parquet version upgrades, we should not assume that Parquet files contain the same records have a fixed size, as it might vary in each version.

[info] - SPARK-30269 failed to update partition stats if it's equal to table's old stats *** FAILED *** (374 milliseconds)
[info]   666 did not equal 690 (StatisticsSuite.scala:1623)

Here we get the expectedSize from the table stats.

Why are the changes needed?

Make the test robust.

Does this PR introduce any user-facing change?

No.

How was this patch tested?

$ build/sbt -Phive "hive/testOnly *StatisticsSuite -- -z SPARK-30269"
[info] StatisticsSuite:
[info] - SPARK-30269 failed to update partition stats if it's equal to table's old stats (9 seconds, 525 milliseconds)
[info] Run completed in 13 seconds, 519 milliseconds.
[info] Total number of tests run: 1
[info] Suites: completed 1, aborted 0
[info] Tests: succeeded 1, failed 0, canceled 0, ignored 0, pending 0
[info] All tests passed.
[success] Total time: 179 s (02:59), completed Aug 29, 2025, 2:58:44 AM

Was this patch authored or co-authored using generative AI tooling?

No.

@github-actions github-actions bot added the SQL label Aug 28, 2025
@dongjoon-hyun dongjoon-hyun changed the title [SPARK-53422][SQL][TEST] Make SPARK-30269 test robust [SPARK-53422][SQL][TEST] Make SPARK-30269 test case robust Aug 28, 2025
// analyze table
sql(s"ANALYZE TABLE $tblName COMPUTE STATISTICS NOSCAN")
var tableStats = getTableStats(tblName)
assert(tableStats.sizeInBytes == expectedSize)
val expectedSize = tableStats.sizeInBytes
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Well, this is a removal of test coverage technically, @pan3793 .

This test case is a known issue which fails due to the Parquet metadata (mostly version string) change. However, I'd like not to remove this test coverage.

Copy link
Member Author

@pan3793 pan3793 Aug 29, 2025

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I read the original PR, the intention of this test is to make sure partition stats get updated, even though it equals existing table stats. The number of table's sizeInBytes does not really matter here.

Generally, asserting the size of binary data files like Parquet/ORC does not make sense, it can vary due to metadata change, as you pointed out, this is likely caused by the version string change, and also might be affected by the compression codec, the compressed data length might be different in different snappy version or platform.

Copy link
Member

@dongjoon-hyun dongjoon-hyun left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I hope to keep this test coverage.

@@ -1616,11 +1616,10 @@ class StatisticsSuite extends StatisticsCollectionTestBase with TestHiveSingleto
Seq(tbl, ext_tbl).foreach { tblName =>
sql(s"INSERT INTO $tblName VALUES (1, 'a', '2019-12-13')")

val expectedSize = 690
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Can we compare with the size as unexpected before insertion instead?

Copy link
Member Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

when spark.sql.statistics.size.autoUpdate.enabled is false (it's the default value), table stats is None until executing ANALYZE TABLE ...

I update the test to reflect that.

@pan3793 pan3793 requested a review from yaooqinn August 29, 2025 09:23
| USING PARQUET
| PARTITIONED BY (ds)
| LOCATION '${dir.toURI}'
withSQLConf(SQLConf.AUTO_SIZE_UPDATE_ENABLED.key -> "false") {
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Thank you for ensuring this (although this is the default as you mentioned, @pan3793 )

Copy link
Member

@dongjoon-hyun dongjoon-hyun left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

+1, LGTM. I guess we need @yaooqinn 's final sign-off.
Thank you, @pan3793 .

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
Projects
None yet
Development

Successfully merging this pull request may close these issues.

3 participants