Issue #479: Error Control for Non-Deterministic Source Queries #512
Conversation
This PR is blocked until we complete #520. Since the determinism of the DataFrame is part of the Revision Flow/DataFrame Analysis, we aim to have a proper logic component before introducing new changes.
# Conflicts:
#   core/src/main/scala/io/qbeast/spark/index/OTreeDataAnalyzer.scala
After some discussion, we agreed to:
UPDATE: We weren't confident enough with the solution, and we recently found out that Delta Lake has a similar issue when doing MERGE INTO. They solved it by materializing the data involved in the operation before executing the second pass. We can introduce that change in a second step. I would advocate to:
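A minimal sketch of the materialization idea mentioned above, using a toy model in plain Scala (not Spark, Delta, or Qbeast code): the non-deterministic source is simulated with a random generator, and "materializing" means evaluating it once and pinning the result so every later pass sees the same rows.

```scala
import scala.util.Random

object MaterializeBeforeMultiPass extends App {
  val rng = new Random()
  // Non-deterministic source: every re-read may yield different rows.
  def readSource(): Seq[Int] = Seq.fill(5)(rng.nextInt(100))

  // Materialize: evaluate the source once and reuse the pinned result,
  // so all subsequent passes operate over identical data.
  val materialized: Seq[Int] = readSource()

  val (lo, hi) = (materialized.min, materialized.max)           // pass 1: statistics
  val allInRange = materialized.forall(v => v >= lo && v <= hi) // pass 2: same rows
  println(s"all rows within [$lo, $hi]: $allInRange")           // true by construction
}
```

In Spark terms this would roughly correspond to persisting or checkpointing the source DataFrame before the multi-pass write; the exact integration point in Qbeast is left open here.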
Closing it due to inactivity. I would update the documentation to explain how to proceed in case of error.
Issue Description
#414 revealed a significant limitation in the writing process for Spark. The current approach requires multiple traversals over the DataFrame across several writing steps. This repeated loading of the DataFrame introduces potential inconsistencies if the input data source is non-deterministic or changes between traversals. If either of these situations occurs, the results from one of the writing steps will not be reliable for the next. For instance, if a change occurs between steps 1 and 2 or steps 1 and 3, we may be unable to correctly position the row within the space. If a change to the source occurs after step 2, the estimated index may not be 100% correct for the changed data.
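The inconsistency described above can be illustrated with a small, self-contained Scala sketch. This is a toy model, not Qbeast or Spark code: the non-deterministic source is simulated with a random generator, and the two "passes" stand in for the statistics and writing traversals.

```scala
import scala.util.Random

object MultiPassInconsistency extends App {
  // Toy stand-in for a non-deterministic source: every "traversal"
  // re-evaluates the query, so it may return different rows each time.
  val rng = new Random()
  def readSource(): Seq[Int] = Seq.fill(5)(rng.nextInt(100))

  // Pass 1: collect statistics (min/max) from one traversal.
  val firstPass = readSource()
  val (lo, hi) = (firstPass.min, firstPass.max)

  // Pass 2: a second traversal re-reads the source, so some rows may
  // now fall outside the [lo, hi] range computed in pass 1.
  val secondPass = readSource()
  val outOfRange = secondPass.filterNot(v => v >= lo && v <= hi)

  println(s"pass 1 range: [$lo, $hi], out-of-range rows in pass 2: $outOfRange")
}
```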
Proposed Solution
To address part of this issue, we propose introducing a new agent, the SparkPlanAnalyzer, to improve error handling for non-deterministic queries before processing the data.
Non-Deterministic Queries Unsupported
To properly define the types of queries that are no longer supported, let's list the non-deterministic query plans that Spark can encounter:
- LIMIT: applying a LIMIT clause twice (if the source is not sorted) would lead to different results.
- SAMPLE: the Spark sample command uses random sampling to extract the percentage of rows specified in the operation. Unless it is a Qbeast table, we cannot ensure the determinism of the results.
- FILTERING with a non-deterministic column: using rand() or other non-deterministic predicates would lead to different results depending on the execution.
User Workaround
This approach provides users with 4 options:
- Provide columnStats using the .option method: by providing column statistics directly, users can mitigate these problems. However, this does not guarantee that the final written values will match those produced by the initial query.
- Change the LinearTransformation for numeric columns to Quantiles, which would map all values outside the min/max range to the extremes of the space.
Type of change
New feature.
Checklist:
Here is the list of things you should do before submitting this pull request:
How Has This Been Tested? (Optional)
This is tested with QbeastInputSourcesTest.
Added tests for error control when using:
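As a rough illustration of the kind of error control these tests exercise, here is a hypothetical, self-contained sketch of a determinism check over a miniature plan ADT. The Plan types and the PlanCheck object are illustrative inventions, not the actual SparkPlanAnalyzer API, and the real implementation would inspect Spark's logical plan instead.

```scala
// Miniature stand-in for a query plan, covering the cases listed above.
sealed trait Plan
case class Scan(table: String, sorted: Boolean = false) extends Plan
case class Limit(n: Int, child: Plan) extends Plan
case class Sample(fraction: Double, child: Plan) extends Plan
case class Filter(deterministicPredicate: Boolean, child: Plan) extends Plan

object PlanCheck {

  /** Returns the reasons a plan is considered non-deterministic, if any. */
  def nonDeterministicReasons(plan: Plan): List[String] = plan match {
    case Scan(_, _) => Nil
    // A LIMIT over a sorted source is repeatable, so it is allowed.
    case Limit(_, c @ Scan(_, sorted)) if sorted => nonDeterministicReasons(c)
    case Limit(_, c) =>
      "LIMIT over an unsorted source" :: nonDeterministicReasons(c)
    case Sample(_, c) =>
      "SAMPLE uses random row selection" :: nonDeterministicReasons(c)
    case Filter(det, c) if det => nonDeterministicReasons(c)
    case Filter(_, c) =>
      "filter with a non-deterministic predicate" :: nonDeterministicReasons(c)
  }

}
```

An error-control layer could then refuse to write any plan for which nonDeterministicReasons is non-empty, reporting the collected reasons to the user.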