Improve strict schema string match with smarter matching #621

andyl-db · 2024-12-19T00:34:05Z

This PR improves the strict schema string match with smarter matching, i.e. subset matching.

Added a unit test.

Tested against DBR:

linzhou-db · 2025-01-07T00:09:47Z

client/src/main/scala/io/delta/sharing/spark/RemoteDeltaLog.scala

+    val newSchemaFieldNames = newSchemaFields.keySet
+    val currentSchemaFieldNames = currentSchemaFields.keySet
+
+    // Ensure that all the new schema field names are a subset of current schema field names


Is a subset check good enough?

I feel we could follow this Delta check in general. Right now I see a few differences:

For the Delta check, they ensure the current schema fields are a subset of the new schema fields, but we're doing the opposite.

They check nullability too.

I'm not familiar with the rationale behind the check either. It might be worth asking Ryan or the Delta folks to make sure the change is safe.

andyl-db · 2025-01-10

Reversed the check.

Nullability is checked within DataType.equalsStructurally.
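For readers following this thread, here is a minimal sketch of the reversed direction with a nullability comparison. It is not the PR's code: `Field`, `SchemaSubsetCheck`, and `isCompatible` are hypothetical names, data types are modelled as plain strings instead of Spark `DataType` values, and the reading that "current" fields must all survive in the "new" schema is an assumption drawn from the comments above.

```scala
// Hypothetical stand-in for a Spark StructField; not the types used in the PR.
final case class Field(name: String, dataType: String, nullable: Boolean)

object SchemaSubsetCheck {
  /** True when every field of `current` also appears in `updated`
    * with the same data type and the same nullability.
    * Extra fields in `updated` are tolerated.
    */
  def isCompatible(current: Seq[Field], updated: Seq[Field]): Boolean = {
    // Index the updated schema by field name for direct lookup.
    val updatedByName = updated.map(f => f.name -> f).toMap
    current.forall { f =>
      updatedByName.get(f.name).exists { u =>
        u.dataType == f.dataType && u.nullable == f.nullable
      }
    }
  }
}
```

With this shape, adding a column to the new schema passes, while dropping a column, changing its type, or changing its nullability fails.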

linzhou-db · 2025-01-07T00:11:09Z

client/src/main/scala/io/delta/sharing/spark/RemoteDeltaLog.scala

+      field.name -> field.dataType
+    }).toMap
+
+    val schemaChangedException = new SparkException(


nit: do we want to name the culprit field in the error message?
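One possible way to act on this nit is sketched below. It assumes the schemas are available as name-to-type maps (as the `field.name -> field.dataType` hunk suggests), but it uses plain strings for the types; `SchemaChangeMessage` and `describe` are hypothetical names, and the message wording is illustrative only.

```scala
object SchemaChangeMessage {
  /** Returns a message naming the fields of `current` that are missing from
    * `updated` or carry a different type there, or None when there are none.
    */
  def describe(current: Map[String, String], updated: Map[String, String]): Option[String] = {
    // Fields present in the current schema but absent from the updated one.
    val missing = (current.keySet -- updated.keySet).toSeq.sorted
    // Fields present in both schemas whose types differ.
    val changed = current.toSeq.collect {
      case (name, tpe) if updated.get(name).exists(_ != tpe) =>
        s"$name ($tpe -> ${updated(name)})"
    }.sorted
    if (missing.isEmpty && changed.isEmpty) None
    else Some(
      s"Schema changed. Missing fields: [${missing.mkString(", ")}]; " +
        s"fields with a different type: [${changed.mkString(", ")}]")
  }
}
```

A caller could then pass the returned text to whatever exception it throws, so the user sees which fields triggered the failure.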


charlenelyu-db · 2025-01-07T00:57:31Z

client/src/main/scala/io/delta/sharing/spark/RemoteDeltaLog.scala

+    val newSchemaFieldNames = newSchemaFields.keySet
+    val currentSchemaFieldNames = currentSchemaFields.keySet
+
+    // Ensure that all the new schema field names are a subset of current schema field names


I feel we could follow this Delta check in general. Right now I see a few differences:

For the Delta check, they ensure the current schema fields are a subset of the new schema fields, but we're doing the opposite.

They check nullability too.

I'm not familiar with the rationale behind the check either. It might be worth asking Ryan or the Delta folks to make sure the change is safe.

andyl-db requested review from linzhou-db and charlenelyu-db December 19, 2024 00:34

andyl-db force-pushed the tolerate-schema-subset branch 2 times, most recently from bb68ecb to ef53c76 Compare January 6, 2025 23:19

linzhou-db reviewed Jan 7, 2025


charlenelyu-db reviewed Jan 7, 2025


andyl-db force-pushed the tolerate-schema-subset branch from ef53c76 to 781b9c3 Compare January 10, 2025 18:52

Improve strict schema string match with smarter matching

0384302

andyl-db force-pushed the tolerate-schema-subset branch from 781b9c3 to 0384302 Compare January 10, 2025 19:04

andyl-db requested a review from zsxwing January 10, 2025 19:11
