[SPARK-53422][SQL][TEST] Make SPARK-30269 test case robust #52168
base: master
Conversation
// analyze table
sql(s"ANALYZE TABLE $tblName COMPUTE STATISTICS NOSCAN")
var tableStats = getTableStats(tblName)
assert(tableStats.sizeInBytes == expectedSize)
val expectedSize = tableStats.sizeInBytes
Well, technically this is a removal of test coverage, @pan3793.
This test case is known to fail when the Parquet metadata (mostly the version string) changes. However, I'd prefer not to remove this test coverage.
I read the original PR; the intention of this test is to make sure the partition stats get updated, even when they equal the existing table stats. The exact value of the table's sizeInBytes does not really matter here.
Generally, asserting the exact size of binary data files like Parquet/ORC does not make sense, because it can vary with metadata changes. As you pointed out, this failure is likely caused by the version string change, and the size can also be affected by the compression codec: the compressed data length may differ across snappy versions or platforms.
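The fragility described above can be illustrated with a minimal sketch. The `encode` helper and the version strings below are made up for illustration and are not Parquet's real file layout; the point is only that identical records plus different writer metadata yield different byte counts:

```scala
// Hypothetical sketch: the same logical records produce different byte sizes
// once writer metadata such as a version string changes, which is why a
// hardcoded size assertion breaks across library upgrades.
object FileSizeFragility {
  // Stand-in for a columnar writer: record payload plus a writer-version footer.
  // (Not Parquet's actual format.)
  def encode(records: Seq[String], writerVersion: String): Array[Byte] =
    (records.mkString("\n") + s"\nwriter=$writerVersion").getBytes("UTF-8")

  def main(args: Array[String]): Unit = {
    val records = Seq("1,a,2019-12-13")
    val before = encode(records, "parquet-mr 1.15.2")
    val after  = encode(records, "parquet-mr 1.16.0-SNAPSHOT")
    // Identical records, different file sizes: asserting a fixed byte count fails.
    assert(before.length != after.length)
    // Robust pattern: capture the observed size as the baseline instead.
    val expectedSize = after.length
    assert(after.length == expectedSize)
  }
}
```

This mirrors the change in the diff: derive `expectedSize` from the observed stats rather than pinning it to a literal byte count.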
I hope to keep this test coverage.
@@ -1616,11 +1616,10 @@ class StatisticsSuite extends StatisticsCollectionTestBase with TestHiveSingleto
Seq(tbl, ext_tbl).foreach { tblName =>
sql(s"INSERT INTO $tblName VALUES (1, 'a', '2019-12-13')")
val expectedSize = 690
Can we compare against the size observed before the insertion instead?
When spark.sql.statistics.size.autoUpdate.enabled is false (the default value), the table stats are None until ANALYZE TABLE ... is executed. I updated the test to reflect that.
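The lifecycle described above can be sketched with a small model. The names here (TableStatsModel, analyzeNoscan, insert) are made up and merely stand in for Spark's catalog behavior; the point is that with auto-update disabled, stats stay absent until an explicit analyze:

```scala
// Hypothetical model of the stats lifecycle: with
// spark.sql.statistics.size.autoUpdate.enabled=false (the default), INSERT
// leaves table stats untouched, and stats only appear after ANALYZE TABLE.
class TableStatsModel(autoUpdateEnabled: Boolean) {
  private var stats: Option[Long] = None
  private var bytesOnDisk: Long = 0L

  // Models INSERT INTO: data lands on disk; stats update only on the auto path.
  def insert(rowBytes: Long): Unit = {
    bytesOnDisk += rowBytes
    if (autoUpdateEnabled) stats = Some(bytesOnDisk)
  }

  // Models ANALYZE TABLE ... COMPUTE STATISTICS NOSCAN: record current file size.
  def analyzeNoscan(): Unit = stats = Some(bytesOnDisk)

  def tableStats: Option[Long] = stats
}
```

With `autoUpdateEnabled = false`, `tableStats` is `None` after `insert` and becomes defined only after `analyzeNoscan()`, matching the behavior the comment describes.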
| USING PARQUET
| PARTITIONED BY (ds)
| LOCATION '${dir.toURI}'
withSQLConf(SQLConf.AUTO_SIZE_UPDATE_ENABLED.key -> "false")
Thank you for ensuring this (although it is the default, as you mentioned, @pan3793).
What changes were proposed in this pull request?
I saw this test failure while trying to upgrade Parquet to 1.16.0. It has actually occurred many times in previous Parquet version upgrades: we should not assume that Parquet files containing the same records have a fixed size, since the size can vary between versions.
Here we derive the expectedSize from the table stats instead.
Why are the changes needed?
Make the test robust.
Does this PR introduce any user-facing change?
No.
How was this patch tested?
Was this patch authored or co-authored using generative AI tooling?
No.