
Issue #479: Error Control for Non-Deterministic Source Queries #512

Closed

Conversation

@osopardo1 (Member) commented Dec 11, 2024

Issue Description

#414 revealed a significant limitation in the writing process for Spark.

The current approach requires multiple traversals over the DataFrame, including the following steps:

  1. Collecting statistics from the data: Calculating stats such as min/max values for the columns to index and the count of elements.
  2. Estimating the index: Determining how the data should be indexed.
  3. Indexing rows: Assigning each row to a cube.
  4. Writing data: Grouping rows by cube and writing them to files.

This repeated loading of the DataFrame introduces potential inconsistencies if the input data source is:

  1. modified by another process, or
  2. built using non-deterministic functions that produce different results across executions.

If either of these situations occurs, the results from one of the writing steps described above will not be reliable for the next. For instance, if a change occurs between steps 1 and 2 or steps 1 and 3, we may be unable to correctly position the row within the space. If a change to the source occurs after step 2, the estimated index may not be 100% correct for the changed data.
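To make the failure mode concrete, here is a minimal Spark snippet (with a hypothetical column name) showing how a non-deterministic source yields different statistics on every pass:

```scala
import org.apache.spark.sql.SparkSession
import org.apache.spark.sql.functions.{max, min, rand}

val spark = SparkSession.builder().master("local[*]").getOrCreate()

// A source built with a non-deterministic function: each Spark action
// re-evaluates rand(), producing different values on every pass.
val df = spark.range(1000000L).withColumn("value", rand())

// Pass 1: collect statistics (step 1 of the writing process).
val firstPass = df.agg(min("value"), max("value")).first()

// Pass 2: any later traversal (steps 2-4) re-executes the plan,
// so the rows no longer match the statistics gathered above.
val secondPass = df.agg(min("value"), max("value")).first()
// firstPass and secondPass will almost certainly differ.
```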

Proposed Solution

To address part of this issue, we propose introducing a new agent, the SparkPlanAnalyzer, to improve error handling for non-deterministic queries before processing the data.
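The PR does not reproduce the implementation here, but a minimal sketch of what such an analyzer could look like (the object name and the exact plan patterns are assumptions) is:

```scala
import org.apache.spark.sql.DataFrame
import org.apache.spark.sql.catalyst.plans.logical.{Filter, GlobalLimit, LocalLimit, Sample}

// Hypothetical sketch of a plan analyzer: walk the logical plan and flag
// the non-deterministic shapes listed in the next section.
object SparkPlanAnalyzerSketch {

  def isSourceDeterministic(df: DataFrame): Boolean = {
    val plan = df.queryExecution.logical
    plan.find {
      case _: GlobalLimit | _: LocalLimit => true // LIMIT clause
      case _: Sample => true // SAMPLE command
      case Filter(condition, _) => !condition.deterministic // e.g. rand()
      case _ => false
    }.isEmpty
  }
}
```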

Unsupported Non-Deterministic Queries

To define precisely which types of queries are no longer supported, let's list the non-deterministic query plans that Spark can encounter (a snippet after the list illustrates each case):

  • LIMIT -> Applying a LIMIT clause twice (if the source is not sorted) would lead to different results.
  • SAMPLE -> Spark's SAMPLE command uses random sampling to extract the specified percentage of rows. Unless the source is a Qbeast table, we cannot guarantee deterministic results.
  • FILTERING with a Non-Deterministic Column -> Using rand() or other non-deterministic predicates would lead to different results on each execution.
  • Indexing a Non-Deterministic Column -> Computing statistics over a non-deterministic column would cause a mismatch between the Transformations computed in step 1 and the results of steps 2 and 3. No error should be raised if all indexing columns are deterministic.
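For illustration, here is how each of the listed cases looks in Spark code (assuming an active SparkSession `spark`; the source path is hypothetical):

```scala
import org.apache.spark.sql.functions.rand

val source = spark.read.parquet("/tmp/source") // hypothetical path

// LIMIT on an unsorted source: two evaluations may return different rows.
val limited = source.limit(1000)

// SAMPLE: random sampling can select different rows on each pass.
val sampled = source.sample(0.1)

// FILTER with a non-deterministic predicate.
val filtered = source.filter(rand() > 0.5)

// Non-deterministic indexing column: statistics computed in step 1
// will not match the values produced in steps 2 and 3.
val projected = source.withColumn("noise", rand())
```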

User Workaround

This approach provides users with four options (a sketch of options 2 and 3 follows the list):

  • Modify the query: Ensure that the query (for the indexing columns) is deterministic to avoid inconsistencies.
  • Materialize the query results before indexing: instead of indexing the DataFrame directly, first save the data in another format and index that copy.
  • Add columnStats using the .option method: By providing column statistics directly, users can mitigate these problems. However, this does not guarantee that the final written values will match those produced by the initial query.
  • Use more flexible Transformer Types: Changing the default LinearTransformation for numeric columns to Quantiles would map all values outside the min/max range to the extremes of the space.
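As an illustration, here is a hedged sketch of options 2 and 3 (the paths, the indexed column name, and the columnStats JSON keys are assumptions; check the Qbeast documentation for the exact format):

```scala
// Option 2: materialize first, then index the stable copy.
df.write.format("parquet").save("/tmp/materialized") // hypothetical path
val stable = spark.read.parquet("/tmp/materialized")

// Option 3: provide column statistics up front so the first pass over the
// (possibly changing) source is not needed to build the Revision.
stable.write
  .format("qbeast")
  .option("columnsToIndex", "value")
  .option("columnStats", """{"value_min":0.0,"value_max":1.0}""")
  .save("/tmp/qbeast-table") // hypothetical path
```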

Type of change

New feature.

Checklist:

Here is the list of things you should do before submitting this pull request:

  • New feature/bug fix has been committed following the Contribution guide.
  • Add logging to the code following the Contribution guide.
  • Add comments to the code (make it easier for the community!).
  • Change the documentation.
  • Add tests.
  • Your branch is updated to the main branch (dependent changes have been merged).

How Has This Been Tested? (Optional)

This is tested with QbeastInputSourcesTest. Added tests for error control when using:

  • LIMIT
  • FILTER BY Non-Deterministic Columns
  • SAMPLE
  • Non-deterministic expressions in the projection

@osopardo1 (Member, Author) commented:

This PR is blocked until we complete #520. Since the determinism of the DataFrame is part of the Revision Flow/DataFrame analysis, we want a proper logic component in place before introducing new changes.

@osopardo1 osopardo1 marked this pull request as ready for review December 20, 2024 08:00
@osopardo1 (Member, Author) commented Jan 10, 2025

After some discussion, we agreed to:

  • Log a Warning instead of throwing an Error when indexing Non-Deterministic Columns or Queries. The Warning should advise the user that the query is non-deterministic and that a failure can be expected during the process (if the data falls outside the Revision boundaries). It should also mention the two possible solutions:
    • Add columnStats
    • Materialize the data in a previous step
  • Group the Transformers that might be impacted by non-determinism (in this case, just the LinearTransformer) and apply the checks only to those.
  • Move the Determinism checks to the SparkRevisionFactory component.
  • Error control on LinearTransformation.transform -> values outside [0.0, 1.0) would cause the indexing process to stop. The error should also list the three possible solutions: columnStats, quantiles, and materialization. A sketch of this check follows the list.
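A minimal sketch of that last point (the method shape and message are assumptions, not the actual Qbeast code):

```scala
// Hypothetical sketch: a linear transformation maps values into [0.0, 1.0);
// a result outside that range means the value falls outside the Revision
// boundaries computed from the original statistics, so indexing must stop.
def transform(value: Double, minValue: Double, maxValue: Double): Double = {
  val scaled = (value - minValue) / (maxValue - minValue)
  if (scaled < 0.0 || scaled >= 1.0) {
    throw new IllegalArgumentException(
      s"Value $value is outside the Revision space [$minValue, $maxValue). " +
        "Possible solutions: provide columnStats, switch to quantiles, " +
        "or materialize the source before writing.")
  }
  scaled
}
```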

@osopardo1 osopardo1 requested a review from Jiaweihu08 January 14, 2025 06:49
@osopardo1 (Member, Author) commented:

UPDATE:

We weren't confident enough in the solution, and we recently found out that Delta Lake faces a similar issue when executing MERGE INTO.

They solved it by materializing the data involved in the operation before executing the second pass. We can introduce that change in a second step. I would advocate for the following:

  1. Add the logic to detect non-deterministic sources.
  2. Add the logic to materialize the data on demand (this includes configuring the storage level, enforcement, etc.). A sketch follows.
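A sketch of what on-demand materialization could look like (the function shape and configuration key are assumptions, not the final design):

```scala
import org.apache.spark.sql.DataFrame
import org.apache.spark.storage.StorageLevel

// Hypothetical sketch of step 2: pin non-deterministic sources before the
// multi-pass write, following the Delta Lake MERGE INTO approach.
def materializeIfNeeded(df: DataFrame, isDeterministic: Boolean): DataFrame = {
  if (isDeterministic) df
  else {
    // Configurable storage level; the key name is an assumption.
    val level = df.sparkSession.conf
      .get("spark.qbeast.index.materializationStorageLevel", "MEMORY_AND_DISK")
    val pinned = df.persist(StorageLevel.fromString(level))
    pinned.count() // force evaluation so later passes reuse the same rows
    // Note: a persisted plan can still be recomputed if executors are lost;
    // checkpointing would give stronger guarantees.
    pinned
  }
}
```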

@osopardo1 osopardo1 requested a review from Jiaweihu08 January 22, 2025 08:39
@osopardo1 (Member, Author) commented:

Closing due to inactivity. I will update the documentation to explain how to proceed in case of error.

@osopardo1 osopardo1 closed this Jan 28, 2025