Added warming query stripping logic #837
Conversation
* Initial upgrade to lucene v9.7.0; address review comments
* Add support for vector search; address review feedback
* Add stat metrics collector; fix example-plugin test
* Update gradle and plugin versions, java 14->17; review feedback
* Merge upstream changes: fix node resolver for empty/missing files (#587), bump to v0.26.1 (#588), parallel explain for nrtsearch (#589), make empty boolean query result constant score 1 (#592), soft exception for fvh failures (#593), support unit in SortType for lat_lon distance sort (#590), upgrade dependencies for snyk check (#596), fix example-plugin test (#598), add page fault metrics (#599)
* Remove legacy Archiver and backup api; avoid remote data commit for local index; review comments
* Merge upstream changes: add additional merge metrics (#607), live setting for verbose index metrics (#608), bump to v0.29.0 (#609), live index settings override from config file (#610), ability to update local index live settings (#611), deadline cancellation for indexing (#612), geo polygon query type (#613), bump to v0.30.0 (#615), readthedocs config (#616, #617), log more info when fvh fails (#618), avoid calling query.toString() (#619), sts for web identity auth (#620), search diagnostics in deadline exceeded exceptions (#621); fixes and spotless apply; updated grpc-gateway
* Bump lucene version to 9.8; rename KnnCollector to NrtsearchKnnCollector; get max dimensions from codec; call leafCollector.finish after collection is successful
* Bump lucene version to 9.9.0; replace all Lucene95 codecs with Lucene99; replace rewrite(IndexReader) with rewrite(IndexSearcher); add hasBlocks=false to SegmentInfo constructor; switch to Lucene99 and Completion99PostingsFormat; fix VectorFieldDefTest and explain test; update postings format to Completion99 in ContextSuggestFieldDef
* Bump lucene version to 9.10; compute numHitsToCollect before creating Collector
* Add Term/TermInSet query support for DATE_TIME field; unify doc value query building
* Update to lucene 10.0.0, then to lucene 10.1.0
* Skip logging when warming; add unit test
* Implementation for index_prefixes
* Trigger logging with no hits; undo wild import; minor comment changes
* Add some documentation for vector search; address review feedback
* bootstrap metrics for v1
public SearchRequest stripWarmingQuery(SearchRequest.Builder builder, int warmingCount) {
  int probability = random.nextInt(99);
Not sure about your intention, but sharing the same random result across checks means the stripping decisions are not independent; multiple assertions become mutually inclusive.
You want to guarantee virtualField will be stripped when functionScore is stripped?
My understanding is that we don't know exactly how many queries will be backed up; we only define a maximum using maxWarmingQueries. Because of this, I wanted to introduce a percentage-based configuration (e.g., 50% of queries should skip the rescorer) instead of relying on a fixed count, since we don't have an exact number of backed-up queries.
Currently, we read the warming queries line by line, which means we would need to read the entire file to determine the exact number of queries. One option would be to load the whole file at once and count the lines, but I assumed the current approach was chosen to optimize memory usage.
To address this, I implemented a solution where queries are stripped probabilistically based on the new maxPerc configurations, keeping the existing readLine() loop. Let me know your thoughts.
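The streaming approach above can be sketched as follows. This is a minimal, hedged illustration of making an independent per-query stripping decision while reading line by line, without knowing the total count up front; the class and method names here (ProbabilisticStripper, shouldStrip) are hypothetical, not the PR's actual API:

```java
import java.util.Random;

public class ProbabilisticStripper {
    private final Random random = new Random();

    // Returns true roughly maxPerc% of the time; each call draws a fresh
    // value, so decisions for different queries stay independent.
    boolean shouldStrip(int maxPerc) {
        return random.nextInt(100) < maxPerc; // nextInt(100) yields 0..99 uniformly
    }

    public static void main(String[] args) {
        ProbabilisticStripper s = new ProbabilisticStripper();
        int stripped = 0;
        int total = 100_000;
        // Simulates streaming queries one at a time (as with readLine()).
        for (int i = 0; i < total; i++) {
            if (s.shouldStrip(50)) stripped++;
        }
        // Over 100k trials the observed rate should land very near 50%.
        System.out.println(Math.abs(stripped / (double) total - 0.5) < 0.02);
    }
}
```

The key property is that no pass over the whole file is needed: each query's fate is decided the moment it is read.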
Probability-based stripping makes sense. I am more curious about the logic of using the same int probability for multiple checks. For example, if we have virtualField at 50% and functionScore at 70%, then whenever probability < 50 it is guaranteed to also be < 70. In other words, a query that strips functionScore always strips virtualField; those are not independent events.
Oh got it, yeah, I agree with you. I will move it into the shouldStripQuery method so a fresh value is generated for every check, avoiding the shared probability.
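The independence issue discussed above can be demonstrated numerically. This sketch (names and thresholds illustrative, not the PR's code) compares reusing one draw for two checks against drawing per check, the fix proposed for shouldStripQuery:

```java
import java.util.Random;

public class IndependentDraws {
    public static void main(String[] args) {
        Random random = new Random(42);
        int trials = 200_000;
        int sharedJoint = 0;
        int freshJoint = 0;
        for (int i = 0; i < trials; i++) {
            // Shared draw: one probability reused for both checks.
            // p < 50 implies p < 70, so the joint rate collapses to 50%.
            int p = random.nextInt(100);
            if (p < 50 && p < 70) sharedJoint++;
            // Fresh draw per check: the joint rate is 0.5 * 0.7 = 35%.
            if (random.nextInt(100) < 50 && random.nextInt(100) < 70) freshJoint++;
        }
        System.out.println(Math.abs(sharedJoint / (double) trials - 0.50) < 0.01);
        System.out.println(Math.abs(freshJoint / (double) trials - 0.35) < 0.01);
    }
}
```

With a shared draw, stripping functionScore (70%) always implies stripping virtualField (50%); with fresh draws the two decisions multiply as independent events.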
private void stripFunctionScoreScript(SearchRequest.Builder builder) {
  if (builder.hasQuery() && builder.getQuery().hasFunctionScoreQuery()) {
hasFunctionScoreQuery looks too magical to me. I didn't look into the details, but I'm afraid this is a method generated by protoc, which only checks the top level. It will miss the case where the functionScoreQuery is wrapped in a booleanQuery.
Oh true, I need to cover the other cases. I will double check whether this gets too complex; if so, I could skip it for now.
I think we can recursively check all boolean clauses (neglecting other possibilities for now) and replace the functionScoreQuery's script with a dummy one: {"lang": "js", "source": "1.0"}.
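The recursive rewrite suggested above might look like the following. This is a hedged sketch only: the Query, BooleanQuery, and FunctionScoreQuery classes here are simplified stand-ins for the protobuf builders in the real code, not the actual nrtsearch API, and only boolean nesting is handled, matching the "neglect other possibility for now" scope:

```java
import java.util.ArrayList;
import java.util.List;

public class ScriptStripper {
    // Simplified stand-in query model (hypothetical, not the protobuf types).
    abstract static class Query {}

    static class BooleanQuery extends Query {
        final List<Query> clauses = new ArrayList<>();
    }

    static class FunctionScoreQuery extends Query {
        String scriptLang;
        String scriptSource;
        FunctionScoreQuery(String lang, String source) {
            this.scriptLang = lang;
            this.scriptSource = source;
        }
    }

    // Depth-first walk: replace every function-score script, at any
    // boolean nesting depth, with the constant dummy script.
    static void stripScripts(Query q) {
        if (q instanceof FunctionScoreQuery fsq) {
            fsq.scriptLang = "js";
            fsq.scriptSource = "1.0"; // dummy {"lang": "js", "source": "1.0"}
        } else if (q instanceof BooleanQuery bq) {
            for (Query clause : bq.clauses) {
                stripScripts(clause);
            }
        }
    }

    public static void main(String[] args) {
        // functionScoreQuery nested two boolean levels deep, the case a
        // top-level hasFunctionScoreQuery() check would miss.
        BooleanQuery outer = new BooleanQuery();
        BooleanQuery inner = new BooleanQuery();
        FunctionScoreQuery fsq =
            new FunctionScoreQuery("painless", "doc['rank'].value * 2");
        inner.clauses.add(fsq);
        outer.clauses.add(inner);
        stripScripts(outer);
        System.out.println(fsq.scriptLang + " " + fsq.scriptSource);
    }
}
```

In the real code the same traversal would run over the protobuf builder's boolean clauses rather than this toy hierarchy.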
    int maxFunctionScoreScriptStrippingPerc,
    int maxVirtualFieldsStrippingPerc,
    int maxFacetsStrippingPerc) {
  this.maxRescorerStrippingPerc = maxRescorerStrippingPerc;
I wonder what the main purpose of separating the stripping into multiple subtasks is. If it's used to analyze the impact of each one separately, I get it. But in the "real" PR, I think it makes sense to strip everything when probability < maxStrippedWarmingQueriesPerc.
Yeah, mainly to first experiment and understand which query stripping logic actually helps; but also, I believe each use case might need different query stripping.
I will reopen another PR, properly branched out from the profiling branch.
Please don't branch out from profiling itself; it is behind the latest v0. You may want to create a "profiling_backup" branch at the current head, and just force push to update profiling.