[spark], [infra] run spark integration tests in CI. #5590

Open · zhongyujiang wants to merge 1 commit into master
Conversation

zhongyujiang (Contributor)

Purpose

The current GitHub CI for the Spark module is missing integration tests, and some of Spark's integration tests are actually failing; they have simply been ignored by CI all along.

Tests

API and Format

Documentation

@zhongyujiang (Contributor, Author) left a comment

cc @Zouxxyy @YannByron, can you please take a look when you have time? Thanks!

@@ -58,6 +58,6 @@ jobs:
            test_modules+="org.apache.paimon:paimon-spark-${suffix},"
          done
          test_modules="${test_modules%,}"
-         mvn -T 2C -B test -pl "${test_modules}" -Duser.timezone=$jvm_timezone -Pspark4,flink1
+         mvn -T 2C -B verify -pl "${test_modules}" -Duser.timezone=$jvm_timezone -Pspark4,flink1
@zhongyujiang (Contributor, Author):

The `test` phase only covers unit tests, not integration tests.
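
For context: Maven's `test` phase runs Surefire, whose default includes only pick up unit-test class names such as `*Test`, while `verify` additionally executes the Failsafe-bound integration tests, which by default match `IT*`, `*IT`, and `*ITCase`. Assuming the build follows these default naming conventions, the minimal sketch below would be skipped by `mvn test` but run by `mvn verify`:

```scala
import org.scalatest.funsuite.AnyFunSuite

// Hypothetical example: the "ITCase" suffix matches Failsafe's default
// includes and none of Surefire's, so (under the default convention)
// this class is only executed in the integration-test/verify phases.
class ExampleITCase extends AnyFunSuite {
  test("only runs under mvn verify") {
    assert(1 + 1 == 2)
  }
}
```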

@zhongyujiang (Contributor, Author):

Some unit tests are failing; it looks like this is related to the changes in the time travel part. Let me take a look.

// Returns an empty Optional instead of failing when the time-travel
// snapshot cannot be found.
return snapshot.isPresent()
        ? Optional.of(
                schemaManager().schema(snapshot.get().schemaId()).copy(options.toMap()))
        : Optional.empty();
@zhongyujiang (Contributor, Author):

This was changed to throw an exception directly when time travel fails to find the snapshot.

But while investigating the time-travel-related test failures, I found that Paimon currently seems to allow querying a non-existent snapshot, returning an empty result.
Is this behavior reasonable? Shouldn't an exception be thrown when time travel fails to find the snapshot?

// set scan.snapshot-id = 4, this query will read data from the next commit.
val query2 = spark.readStream
  .format("paimon")
  .option("scan.snapshot-id", 4)
  .load(location)
  .writeStream
  .format("memory")
  .option("checkpointLocation", checkpointDir2.getCanonicalPath)
  .queryName("mem_table2")
  .outputMode("append")
  .start()

val currentResult1 = () => spark.sql("SELECT * FROM mem_table1")
val currentResult2 = () => spark.sql("SELECT * FROM mem_table2")
try {
  query1.processAllAvailable()
  query2.processAllAvailable()

  var totalStreamingData1 = latestChanges
  var totalStreamingData2 = Seq.empty[Row]
  checkAnswer(currentResult1(), totalStreamingData1)
  checkAnswer(currentResult2(), totalStreamingData2)

  spark.sql("INSERT INTO T VALUES (40, 'v_40'), (41, 'v_41'), (42, 'v_42')")
  query1.processAllAvailable()
  query2.processAllAvailable()

  totalStreamingData1 =
    totalStreamingData1 ++ (Row(40, "v_40") :: Row(41, "v_41") :: Row(42, "v_42") :: Nil)
  totalStreamingData2 =
    totalStreamingData2 ++ (Row(40, "v_40") :: Row(41, "v_41") :: Row(42, "v_42") :: Nil)
  checkAnswer(currentResult1(), totalStreamingData1)
  checkAnswer(currentResult2(), totalStreamingData2)
} finally {
  query1.stop()
  query2.stop()
}
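
The streaming test above exercises the question from the streaming side. For comparison, a minimal batch sketch of the behavior described in the comment, with an illustrative table path and an assumed non-existent snapshot id:

```scala
// Hypothetical batch time travel to a snapshot id that was never committed.
// Per the behavior described above, this currently yields an empty result;
// the alternative under discussion is to throw an exception instead.
val missing = spark.read
  .format("paimon")
  .option("scan.snapshot-id", 9999) // assumed non-existent snapshot
  .load("/tmp/paimon/T") // illustrative table location
missing.show() // prints an empty table today rather than failing
```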
