Added warming query stripping logic #837
Conversation
* Initial upgrade to lucene v9.7.0; address review comments
* Add support for vector search; address review feedback
* Add stat metrics collector; fix example-plugin test
* Update gradle and plugin versions, java 14->17; review feedback
* Merge upstream changes: fix node resolver for empty/missing files (#587), bump to v0.26.1 (#588), parallel explain for nrtsearch (#589), make empty boolean query result constant score 1 (#592), soft exception for fvh failures (#593), support unit in SortType for lat_lon distance sort (#590), upgrade dependencies for snyk check (#596), fix example-plugin test (#598), add page fault metrics (#599)
* Remove legacy Archiver and backup api; avoid remote data commit for local index; review comments
* Merge upstream changes: add additional merge metrics (#607), live setting for verbose index metrics (#608), bump to v0.29.0 (#609), live index settings override from config file (#610), ability to update local index live settings (#611), deadline cancellation for indexing (#612), geo polygon query type (#613), bump to v0.30.0 (#615), readthedocs config (#616, #617), log more info when fvh fails (#618), avoid calling query.toString() (#619), sts for web identity auth (#620), search diagnostics in deadline exceeded exceptions (#621); fixes and spotless apply; updated grpc-gateway
* Bump lucene version to 9.8; rename KnnCollector to NrtsearchKnnCollector; get max dimensions from codec; call leafCollector.finish after collection is successful
* Bump lucene version to 9.9.0; replace all Lucene95 codecs with Lucene99; replace rewrite(IndexReader) with rewrite(IndexSearcher); add hasBlocks=false to SegmentInfo constructor; switch to Lucene99 and Completion99PostingsFormat; fix VectorFieldDefTest and explain test; update postings format to Completion99 in ContextSuggestFieldDef
* Bump lucene version to 9.10; compute numHitsToCollect before creating Collector
* Add Term/TermInSet query support for DATE_TIME field; unify doc value query building
* Update to lucene 10.0.0, then to lucene 10.1.0
* Skip logging when warming; add unit test
* Implementation for index_prefixes
* Trigger logging with no hits; undo wild import; minor comment changes
* Add some documentation for vector search; address review feedback
* bootstrap metrics for v1
public SearchRequest stripWarmingQuery(SearchRequest.Builder builder, int warmingCount) {
  int probability = random.nextInt(99);
Not sure about your intention, but sharing the same random result across checks means the stripping decisions are not independent; multiple assertions become mutually inclusive.
You want to guarantee virtualField will be stripped when functionScore is stripped?
My understanding is that we don't know exactly how many queries will be backed up; we only define a maximum using maxWarmingQueries. Because of this, I wanted to introduce a percentage-based configuration (e.g., 50% of queries should skip the rescorer) instead of relying on a fixed count, since we don't have an exact number of backed-up queries.
Currently, we read the warming queries line by line, which means we would need to read the entire file to determine the exact number of queries. One option would be to load the whole file at once and count the lines, but I assumed the current approach was chosen to optimize memory usage.
To address this, I implemented a solution where queries are stripped probabilistically based on the new maxPerc configurations, keeping the existing readLine() loop. Let me know your thoughts.
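The streaming approach above can be sketched as follows. This is a minimal, hedged illustration of making an independent per-query stripping decision while reading line by line, without knowing the total count up front; the class and method names here (ProbabilisticStripper, shouldStrip) are hypothetical, not the PR's actual API:

```java
import java.util.Random;

public class ProbabilisticStripper {
    private final Random random = new Random();

    // Returns true roughly maxPerc% of the time; each call draws a fresh
    // value, so decisions for different queries stay independent.
    boolean shouldStrip(int maxPerc) {
        return random.nextInt(100) < maxPerc; // nextInt(100) yields 0..99 uniformly
    }

    public static void main(String[] args) {
        ProbabilisticStripper s = new ProbabilisticStripper();
        int stripped = 0;
        int total = 100_000;
        // Simulates streaming queries one at a time (as with readLine()).
        for (int i = 0; i < total; i++) {
            if (s.shouldStrip(50)) stripped++;
        }
        // Over 100k trials the observed rate should land very near 50%.
        System.out.println(Math.abs(stripped / (double) total - 0.5) < 0.02);
    }
}
```

The key property is that no pass over the whole file is needed: each query's fate is decided the moment it is read.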
Probability-based stripping makes sense. I am more curious about the logic of using the same int probability for multiple checks. For example, if we have virtualField at 50% and functionScore at 70%, then whenever probability < 50 it is guaranteed to also be < 70. In other words, a query that strips functionScore always strips virtualField; those are not independent events.
Oh got it, yeah, I agree with you. I will move it into the shouldStripQuery method so a fresh value is generated for every check, avoiding the shared probability.
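The independence issue discussed above can be demonstrated numerically. This sketch (names and thresholds illustrative, not the PR's code) compares reusing one draw for two checks against drawing per check, the fix proposed for shouldStripQuery:

```java
import java.util.Random;

public class IndependentDraws {
    public static void main(String[] args) {
        Random random = new Random(42);
        int trials = 200_000;
        int sharedJoint = 0;
        int freshJoint = 0;
        for (int i = 0; i < trials; i++) {
            // Shared draw: one probability reused for both checks.
            // p < 50 implies p < 70, so the joint rate collapses to 50%.
            int p = random.nextInt(100);
            if (p < 50 && p < 70) sharedJoint++;
            // Fresh draw per check: the joint rate is 0.5 * 0.7 = 35%.
            if (random.nextInt(100) < 50 && random.nextInt(100) < 70) freshJoint++;
        }
        System.out.println(Math.abs(sharedJoint / (double) trials - 0.50) < 0.01);
        System.out.println(Math.abs(freshJoint / (double) trials - 0.35) < 0.01);
    }
}
```

With a shared draw, stripping functionScore (70%) always implies stripping virtualField (50%); with fresh draws the two decisions multiply as independent events.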
private void stripFunctionScoreScript(SearchRequest.Builder builder) {
  if (builder.hasQuery() && builder.getQuery().hasFunctionScoreQuery()) {
hasFunctionScoreQuery looks too magical to me. I didn't look into the details, but I'm afraid this is a method generated by protoc, which only checks the top level. It will miss the case where the functionScoreQuery is wrapped in a booleanQuery.
Oh true, I need to cover the other cases. I will double check whether this gets too complex; if so, I could skip it for now.
I think we can recursively check all boolean clauses (neglecting other possibilities for now) and replace the functionScoreQuery's script with a dummy one: {"lang": "js", "source": "1.0"}.
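The recursive rewrite suggested above might look like the following. This is a hedged sketch only: the Query, BooleanQuery, and FunctionScoreQuery classes here are simplified stand-ins for the protobuf builders in the real code, not the actual nrtsearch API, and only boolean nesting is handled, matching the "neglect other possibility for now" scope:

```java
import java.util.ArrayList;
import java.util.List;

public class ScriptStripper {
    // Simplified stand-in query model (hypothetical, not the protobuf types).
    abstract static class Query {}

    static class BooleanQuery extends Query {
        final List<Query> clauses = new ArrayList<>();
    }

    static class FunctionScoreQuery extends Query {
        String scriptLang;
        String scriptSource;
        FunctionScoreQuery(String lang, String source) {
            this.scriptLang = lang;
            this.scriptSource = source;
        }
    }

    // Depth-first walk: replace every function-score script, at any
    // boolean nesting depth, with the constant dummy script.
    static void stripScripts(Query q) {
        if (q instanceof FunctionScoreQuery fsq) {
            fsq.scriptLang = "js";
            fsq.scriptSource = "1.0"; // dummy {"lang": "js", "source": "1.0"}
        } else if (q instanceof BooleanQuery bq) {
            for (Query clause : bq.clauses) {
                stripScripts(clause);
            }
        }
    }

    public static void main(String[] args) {
        // functionScoreQuery nested two boolean levels deep, the case a
        // top-level hasFunctionScoreQuery() check would miss.
        BooleanQuery outer = new BooleanQuery();
        BooleanQuery inner = new BooleanQuery();
        FunctionScoreQuery fsq =
            new FunctionScoreQuery("painless", "doc['rank'].value * 2");
        inner.clauses.add(fsq);
        outer.clauses.add(inner);
        stripScripts(outer);
        System.out.println(fsq.scriptLang + " " + fsq.scriptSource);
    }
}
```

In the real code the same traversal would run over the protobuf builder's boolean clauses rather than this toy hierarchy.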
    int maxFunctionScoreScriptStrippingPerc,
    int maxVirtualFieldsStrippingPerc,
    int maxFacetsStrippingPerc) {
  this.maxRescorerStrippingPerc = maxRescorerStrippingPerc;
I wonder what the main purpose of separating the stripping into multiple subtasks is. If it's used to analyze the impact of each one separately, I get it. But in the "real" PR, I think it makes sense to strip everything when probability < maxStrippedWarmingQueriesPerc.
Yeah, mainly to first experiment and understand which query stripping logic actually helps; but also, I believe each use case might need different query stripping.
I will reopen another PR, properly branched out from the profiling branch.
Please don't branch out from profiling itself; it is behind the latest v0. You may want to create a "profiling_backup" branch at the current head, and just force push to update profiling.