SOLR-17525: Text Embedder Query Parser #2809

alessandrobenedetti · 2024-10-29T11:23:29Z

https://issues.apache.org/jira/browse/SOLR-17525

Description

The scope of this issue is to integrate a new module that uses LLMs (through managed services) to enhance aspects of Apache Solr.
Specifically, this first Pull Request handles embedding models and automatic text vectorisation in Solr.

Solution

The functionality has been introduced through LangChain4J (https://docs.langchain4j.dev).
There are several aspects I would like feedback on:

I used inversion of control to integrate support for multiple embedding models with minimal impact on the code. It works and it's decently clean, but I would love your feedback on this
Embedding models are accessed through client APIs, and HTTP requests are made to embed text using external services.
To do that I added security exceptions in both 'solr/server/etc/security.policy' and 'gradle/testing/randomization/policies/solr-tests.policy'.
It works but I have no idea if it's acceptable or the best way to do it
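
For readers new to the approach, here is a minimal sketch of the inversion-of-control idea: the model class name and its parameters arrive as configuration, and the instance is built reflectively through a static builder() method. FakeEmbeddingModel and ReflectiveModelFactory are hypothetical stand-ins invented for illustration, not the classes in this PR, and the sketch assumes every builder setter takes a single String.

```java
import java.lang.reflect.Method;
import java.util.Map;

/** Hypothetical stand-in for a client-side embedding model with a fluent builder. */
class FakeEmbeddingModel {
  final String apiKey;
  final String modelName;

  private FakeEmbeddingModel(String apiKey, String modelName) {
    this.apiKey = apiKey;
    this.modelName = modelName;
  }

  public static Builder builder() {
    return new Builder();
  }

  public static class Builder {
    private String apiKey;
    private String modelName;

    public Builder apiKey(String apiKey) {
      this.apiKey = apiKey;
      return this;
    }

    public Builder modelName(String modelName) {
      this.modelName = modelName;
      return this;
    }

    public FakeEmbeddingModel build() {
      return new FakeEmbeddingModel(apiKey, modelName);
    }
  }
}

/** Hypothetical factory: no compile-time dependency on any concrete model class. */
class ReflectiveModelFactory {
  static Object create(String className, Map<String, String> params) {
    try {
      Class<?> modelClass = Class.forName(className);
      // Static builder() method, so the target of invoke is null.
      Object builder = modelClass.getMethod("builder").invoke(null);
      for (Map.Entry<String, String> param : params.entrySet()) {
        // Assumption: each parameter name matches a builder method taking one String.
        Method setter = builder.getClass().getMethod(param.getKey(), String.class);
        setter.invoke(builder, param.getValue());
      }
      return builder.getClass().getMethod("build").invoke(builder);
    } catch (ReflectiveOperationException e) {
      throw new IllegalArgumentException("Cannot build model " + className, e);
    }
  }
}
```

With this shape, supporting another model only requires its class to be on the classpath and to expose the same builder convention. The factory itself does not change.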

Tests

model storage tests added
query parsing tests added

Checklist

Please review the following and check all that apply:

I have reviewed the guidelines for How to Contribute and my code conforms to the standards described there to the best of my ability.
I have created a Jira issue and added the issue ID to my pull request title.
I have given Solr maintainers access to contribute to my PR branch. (optional but recommended, not available for branches on forks living under an organisation)
I have developed this patch against the main branch.
I have run ./gradlew check.
I have added tests for my changes.
I have added documentation for the Reference Guide

alessandrobenedetti · 2024-10-29T12:35:31Z

I'll keep polishing it and finalise the documentation, but I think it's ready for review!

alessandrobenedetti · 2024-10-29T12:44:53Z

Just as a reminder, currently the check fails with:
"* What went wrong:
Execution failed for task ':solr:modules:llm:analyzeClassesDependencies'.

Dependency analysis found issues.
usedUndeclaredArtifacts

dev.langchain4j:langchain4j-core:0.35.0@jar
unusedDeclaredArtifacts
dev.langchain4j:langchain4j-cohere:0.35.0@jar
dev.langchain4j:langchain4j-hugging-face:0.35.0@jar
dev.langchain4j:langchain4j-mistral-ai:0.35.0@jar
dev.langchain4j:langchain4j-open-ai:0.35.0@jar
dev.langchain4j:langchain4j:0.35.0@jar
"
I'll work on it in the next few days

dsmiley

Why does this module introduce a competing "model store" to Solr's existing "file store"? The "file store" was developed with "models" in mind, in addition to initially being developed for plugin packages.

gradle/testing/randomization/policies/solr-tests.policy

solr/solr-ref-guide/modules/query-guide/pages/embedding-text.adoc

alessandrobenedetti · 2024-10-30T05:41:14Z

Why does this module introduce a competing "model store" to Solr's existing "file store"? The "file store" was developed with "models" in mind, in addition to initially being developed for plugin packages.

Let me elaborate here:
I took inspiration from the Learning To Rank module (ltr) where I was familiar with the model store:
org.apache.solr.ltr.store.ModelStore
similar to the file store but specialised in learning to rank models, their management, and singleton instantiation.
The embeddingModel store does pretty much an equivalent job and grants direct access to embedding models so that the query parser or other components (like the update request processor in the works) can use them.

Given that, I am not that familiar with the file store, so if it can help in having a better and cleaner solution I'll be more than happy to take a look at it (after the 5th of November)

epugh · 2024-10-30T11:42:13Z

I would love to get a walk through on this exciting feature at the next community meetup...

epugh

I went through and mostly commented on the code. Having said that, I don't know that I grok the high level vision, since that's hard to convey with just a PR. I would like to learn more about what langchain4j brings to the party. Do we see it being used more widely in Solr? To expose more things in Solr to Langchain4j? Or just as an API to some embedding models?

solr/modules/llm/src/test-files/solr/collection1/conf/stopwords.txt

solr/modules/llm/src/test-files/solr/collection1/conf/synonyms.txt

solr/modules/llm/src/test/org/apache/solr/llm/TestLlmBase.java

epugh · 2024-10-30T11:46:59Z

solr/modules/llm/src/test/org/apache/solr/llm/TestLlmBase.java

+      }
+      Files.delete(mstore);
+    }
+    if (!solrconfig.equals("solrconfig-llm.xml")) {


this appears to be some magic? Aren't things set up so that you control which solrconfig you are using? Maybe I don't know the RestTestBase enough, but generally I just specify what configs etc. to use...

I cleaned up the tests a bit, before I resolve the comment, take a look if it's any better now!

solr/modules/llm/src/test/org/apache/solr/llm/TestLlmBase.java

epugh · 2024-10-30T11:49:32Z

solr/modules/llm/src/test/org/apache/solr/llm/TestLlmBase.java

+  }
+
+  /** produces a model encoded in json * */
+  public static String getModelInJson(String name, String className, String params) {


don't we use different approaches to building JSON elsewhere? One of my goals for Solr code bases is to have more consistency across all the areas... I know that isn't as helpful to the creator of the new code, like in this case, but in two years, when someone else has to come along and understand things, then it really pays off!

Do you have any pointer? In the meantime I keep this open and look up myself right now if I can find anything

I suggest not using StringBuilder here; it's needlessly verbose. I think IntelliJ will switch it for you.

epugh · 2024-10-30T11:51:23Z

solr/modules/llm/src/test/org/apache/solr/llm/TestLlmBase.java

+import org.slf4j.Logger;
+import org.slf4j.LoggerFactory;
+
+public class TestLlmBase extends RestTestBase {


Is RestTestBase our future? @dsmiley didn't you have somewhat of a vision for where we are going with our hierarchy of testing classes? I am intrigued to learn more about RestTestBase, it's new to me. However, if it isn't part of our vision for how we handle testing, then I am nervous about depending on it for another new module!

I leave this comment open for @dsmiley to comment.
I took inspiration from the Learning To Rank module and it made it possible, with a decent effort, to test the embedding model store, but I'm open to refactoring it!

It's used a ton so I sympathize with continuing to use it until there's renewed energy in more test renovation. Even then, it would likely stay. It's not deprecated even though its parent class is.

epugh · 2024-10-30T12:00:10Z

solr/modules/llm/src/test/org/apache/solr/llm/store/rest/TestModelManager.java

+    assertJPut(ManagedEmbeddingModelStore.REST_END_POINT, model, "/responseHeader/status==0");
+    // success
+    final String multipleModels =
+        "[{ name:\"testModel3\", class:\""


here we do string building for json, which I think is our dominant pattern... personally not having all the .appends makes it more readable, even with the escaping around the double quotes!

mmm I am missing the point here, can you tell me how you think this can be improved with an example? happy to do it then!

He was thinking out loud, not telling you to do anything differently, or even implying it. If I may do the same, I'll say I look forward to us using text blocks, now available on main due to Java 21. I suppose you will back-port this to 9.x, so I suppose you won't bother using Java 21 features yet.
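
To make the comparison concrete, here is a small sketch of the two styles discussed in this thread: escaped string concatenation versus a text block (a standard Java feature since Java 15, so usable only where the branch's minimum Java version allows it). ModelJsonExample and its JSON shape are illustrative assumptions, not code from this PR.

```java
/** Hypothetical helper showing two ways to build the same small JSON string. */
class ModelJsonExample {
  /** Concatenation style: works on any Java version, but needs escaped quotes. */
  static String withConcatenation(String name, String className) {
    return "{ \"name\": \"" + name + "\", \"class\": \"" + className + "\" }";
  }

  /** Text block style: no escaping; formatted() fills the %s placeholders. */
  static String withTextBlock(String name, String className) {
    return """
        { "name": "%s", "class": "%s" }""".formatted(name, className);
  }
}
```

Both methods return the same string. The closing delimiter sits on the last content line so that the text block does not end with a newline.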

epugh · 2024-10-30T12:01:41Z

solr/solr-ref-guide/modules/query-guide/pages/dense-vector-search.adoc

@@ -242,9 +242,9 @@ client.add(Arrays.asList(d1, d2));

 == Query Time

-Apache Solr provides two query parsers that work with dense vector fields, that each support different ways of matching documents based on vector similarity: The `knn` query parser, and the `vectorSimilarity` query parser.
+Apache Solr provides three query parsers that work with dense vector fields, that each support different ways of matching documents based on vector similarity: The `knn` query parser, the `vectorSimilarity` query parser and the `embed` query parser.


We should be thinking about adding some call out text of "Here is when you use knn, here is when you use vectorSimilarity, and here is when you use embed". I'd personally like to see our ref guide move beyond being just a reference to look things up and also explain more the "why".

Done, take a look and if enough feel free to resolve this conversation

solr/solr-ref-guide/modules/query-guide/pages/embedding-text.adoc

settings.gradle

solr/core/src/java/org/apache/solr/schema/DenseVectorField.java

solr/modules/llm/build.gradle

solr/modules/llm/src/java/org/apache/solr/llm/store/package-info.java

solr/modules/llm/src/java/org/apache/solr/llm/store/rest/package-info.java

dsmiley · 2024-11-01T13:28:12Z

Our codebase is vast; I sympathize with almost nobody knowing about the file-store (I'd count me, @noblepaul , @chatman as knowing). As a project maintainer, I'm just concerned about growing duplicative mechanisms. Heck, maybe the blob-store is yet another, albeit we've identified that one as one to go. An early design review might have elucidated such an option before your sunk cost here Alessandro. @cpoerschke I tagged you for your review because I believe you have a good familiarity of LTR for comparison to its model store to see if you also think it makes sense to clone that one, but you might not know about the file store that I am proposing for this use-case.

alessandrobenedetti · 2024-11-06T11:27:59Z

Thanks all for the feedback, in order:

@epugh I am finalising the review of your comments, I'll add considerations/resolve them one by one
@cpoerschke thanks for the commit suggestions, I merged all of them, tests are green
@dsmiley no problem David! The problem when you are not a team working full time on a project is that sometimes I end up with some limited time allocated to contributing and I can't wait async. I'll take a look today at the model store you propose and let you know my feedback here, super happy to switch to it and avoid duplication if it's fit for purpose!

epugh · 2024-11-06T16:54:19Z

When you look at the model store, it would be great to hear where it comes up short and we could look at improving it. What would make it an "easy" decision to adopt!??

alessandrobenedetti · 2024-11-06T17:43:06Z

I'm taking a look at the file-store from a code perspective and it's harder than expected, I'll keep digging but @dsmiley do you have any code references or tests that show the usage of this component?

Are we talking about: org.apache.solr.filestore ?

alessandrobenedetti · 2024-11-06T19:16:37Z

I was able to find:
org.apache.solr.filestore.ClusterFileStore -> implements ClusterFileStoreApis
org.apache.solr.filestore.NodeFileStore -> implements NodeFileStoreApis

org.apache.solr.filestore.DistribFileStore -> implements FileStore

it's not clear how they differ and there's no documentation I could easily find.
Given that, I ended up finding: req.getCoreContainer().getFileStore(), and it returns an org.apache.solr.filestore.DistribFileStore, so my best bet is that the file-store @dsmiley is referring to is: org.apache.solr.filestore.DistribFileStore
Now, that FileStore has a get method 'org.apache.solr.filestore.DistribFileStore#get' that surprisingly doesn't return anything (return type is void) but has a convoluted way of consuming the output rather than returning it.
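
As an illustration of what interacting with a void, consumer-based get looks like, here is a minimal sketch that adapts the callback style into a plain return value. CallbackStore is a hypothetical in-memory stand-in that only mimics the general shape of such a method. It is not Solr's FileStore, and the Consumer<byte[]> type is an assumption made for the example.

```java
import java.io.IOException;
import java.io.UncheckedIOException;
import java.nio.charset.StandardCharsets;
import java.util.HashMap;
import java.util.Map;
import java.util.concurrent.atomic.AtomicReference;
import java.util.function.Consumer;

/** Hypothetical store with a void, callback-style get. Not Solr's FileStore. */
class CallbackStore {
  private final Map<String, byte[]> files = new HashMap<>();

  void put(String path, String content) {
    files.put(path, content.getBytes(StandardCharsets.UTF_8));
  }

  /** Hands the content to the consumer; the consumer is not called for unknown paths. */
  void get(String path, Consumer<byte[]> fileContent) throws IOException {
    byte[] bytes = files.get(path);
    if (bytes != null) {
      fileContent.accept(bytes);
    }
  }
}

class CallbackStoreReader {
  /** Captures what the consumer receives and returns it; null means "not found". */
  static String read(CallbackStore store, String path) {
    AtomicReference<String> holder = new AtomicReference<>();
    try {
      store.get(path, bytes -> holder.set(new String(bytes, StandardCharsets.UTF_8)));
    } catch (IOException e) {
      throw new UncheckedIOException(e);
    }
    return holder.get();
  }
}
```

The holder is needed because a lambda cannot assign to a local variable of the enclosing method. That is part of why a callback-style get takes more ceremony than a method that simply returns the content.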

Anyway, I'm progressing this route and I should be able to have additional insights tomorrow.
My first feedback on this follows:

it took me a decent amount of time to implement the functionality taking inspiration from the LTR module, it's taking way more time to do it via the FileStore.
It's not easy to find information about it on the Reference guide, and it was not intuitive to understand the right code to look at.
If that's the route we want new developers to follow for storing models and similar managed resources in Solr, I suspect we need to make it way easier to understand and interact with.
JavaDocs were pretty much missing and code examples or tests not easy to be found.

N.B. these are just pragmatic observations to improve it, no offence to whoever designed and implemented this (I have not even checked)

alessandrobenedetti · 2024-11-07T10:48:11Z

As I was progressing on the route of the FileStore, a thought came to my mind:
With the current implementation the embedding model store guarantees a singleton life for each model uploaded, and what's available is the object org.apache.solr.llm.embedding.SolrEmbeddingModel,
an object able to encode text to a vector by calling a REST API behind the scenes.
It's currently a lightweight object, but nothing prevents a future contributor from contributing the in-process model, for example (https://docs.langchain4j.dev/integrations/embedding-models/in-process).

The embeddingModelStore currently handles the instantiation part, so when you access a model from a query parser or an update request processor (next on my to-do list), you get the object, ready to be used from the store, with no need to instantiate the object again (that could be expensive).

From what I'm seeing, if I understood correctly, the FileStore will only store the configuration file for the model, so we can easily access it, but if we want the singleton mechanism for the model object we need to implement it somewhere; otherwise, every time we use the query parser we need to instantiate the model from the JSON stored in the FileStore.
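
To make the "singleton mechanism" concrete, here is a minimal sketch of the kind of caching layer that would sit on top of a store that only holds configuration. CachedModelRegistry is a hypothetical class invented for this example, and the loader function stands in for "read the stored JSON and instantiate the model". Neither is code from this PR or from Solr.

```java
import java.util.Map;
import java.util.concurrent.ConcurrentHashMap;
import java.util.concurrent.atomic.AtomicInteger;
import java.util.function.Function;

/** Hypothetical cache: builds each model once from its stored config, then reuses it. */
class CachedModelRegistry<M> {
  private final Map<String, M> instances = new ConcurrentHashMap<>();
  private final Function<String, M> loader;
  /** Counts how many times a model was actually instantiated. */
  final AtomicInteger loads = new AtomicInteger();

  CachedModelRegistry(Function<String, M> loader) {
    this.loader = loader;
  }

  /** Returns the cached instance, instantiating it on first access only. */
  M get(String name) {
    return instances.computeIfAbsent(
        name,
        n -> {
          loads.incrementAndGet();
          return loader.apply(n);
        });
  }

  /** Drops the cached instance, e.g. after the stored config is updated or deleted. */
  void invalidate(String name) {
    instances.remove(name);
  }
}
```

The sketch leaves out the harder part: keeping such a cache consistent when the stored configuration changes on another node.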

So... I'm not convinced anymore that I should spend effort in that direction for this specific use case. Using the fileStore and deleting the embedding model store was tempting, but if I need to implement an additional mechanism on top of it to keep models singleton, I don't see much benefit, especially from a development time perspective.

Please correct me if I missed something or if you believe the current implementation won't work in certain scenarios.
I'm super happy to spend the effort to make a cleaner contribution, but if it's a lot of effort only for a "nice to have", I don't think I have that luxury right now.

Don't take it as provocative, it's just a genuine perspective of someone with limited time to dedicate to the project, if I was paid full-time on this I wouldn't have any problem in pursuing nice to haves.

alessandrobenedetti · 2024-11-07T11:03:55Z

When you look at the model store, it would be great to hear where it comes up short and we could look at improving it. What would make it an "easy" decision to adopt!??

My partial feedback after 3-4 hours working on that:

it's difficult to find the FileStore, there are multiple with very similar names, it's not easy to identify which one to use
JavaDocs explaining how the store works are missing (or very difficult to find, I couldn't)
there are plenty of elaborated comments in the code that make code readability even more difficult (comments with observations, suggestions, exclamation marks etc)
the get method, to get a file from the store is void and appears unnecessarily complicated to interact with
void get(String path, Consumer filecontent, boolean getMissing) throws IOException;
it's not clear what getMissing does, and it's actually ignored in the only implementation I found (and also misspelled, not in camelCase)
The implementation is not readable and seems to require way too much time to be understood for just a 'get' method.
Not sure if null is returned for example
it's very hard to find code examples that show how to upload and get files from it, also exhaustive tests were missing or hard to find, I couldn't find much
there are APIs to interact with it, but they seem oriented to an external usage rather than to use the store from within Solr?
Reference Guide talks about packages and modules but doesn't elaborate much on the FileStore

This is genuine, not provocative, feedback from someone with decent experience with Solr and the Solr codebase; I assume that a new developer would struggle even more in using it?
@epugh hope this is helpful

solr/modules/llm/src/java/org/apache/solr/llm/search/TextEmbedderQParserPlugin.java

alessandrobenedetti · 2024-11-19T08:58:16Z

hi @malliaridis, first of all nice to meet you!
When rebasing on upstream I ended up with the new dependency approach, given it's not released I had to dig a bit in the Pull Request for documentation and ended up with my latest commit (that doesn't work).
I am no Gradle expert. I can definitely spend some time on this in the next few days, but given you worked on https://issues.apache.org/jira/browse/SOLR-17406, maybe you can fix what I didn't do correctly?
Thanks in advance!

malliaridis · 2024-11-19T17:38:19Z

It's nice meeting new people working on this project :D

I have no push permission to push the fixes I've created. There are two issues in the referencing, one the libs.langchain4j and the other one was libs.apache.solr.testframework.

The first one is actually a tricky one and I will have to provide a note in the documentation in case this happens again. When you have in libs.versions.toml langchain4j and langchain4j-cohere (for example), then you have to reference the first dependency as libs.langchain4j.asProvider(), otherwise it is just a group of dependencies, not the dependency itself. An alternative solution in this case is to simply add another keyword and extend langchain4j in libs.versions.toml, like I did. Usually for "core" libraries that have simple names like in this case, you can use core or the module name itself (langchain4j-langchain4j).

The second dependency libs.apache.solr.testframework could not be resolved, either because you mean libs.apache.lucene.testframework or project(':solr:test-framework'). I assumed the second one, but if wrong, please update it.

If you update the permissions of this PR, I can push these changes together with a cleanup of the newly added license files (there is another issue there) and update locks / checksums.

alessandrobenedetti · 2024-11-19T19:21:37Z

hi @malliaridis, my pleasure to know you!
Thanks for the detailed explanation!
I just added you to the branch, you should be able to write!
If you are willing to help with that, it would be much appreciated!

thanks!

malliaridis · 2024-11-19T20:03:47Z

@alessandrobenedetti If you need further assistance with the rest of the issues, just let me know. :)

alessandrobenedetti · 2024-11-21T08:09:46Z

Thanks @malliaridis , your help has been gold. Tests are green again and the deps are in line with the new expectations!
I'll wait until next Thursday (After Search Solutions London 24) max for some additional feedback and then merge and move on to the next steps!
Excited to have this first milestone in! :D

dsmiley

Is there a non-web-API way to configure a model? If not, why not? The vast majority of Solr's configuration is in solrconfig.xml, schema.xml, and some related config files linked from either, or in solr.xml for node level. Much of it can be manipulated with an API but nobody is forced to use such APIs to use the features, except for a few odd exceptions that I don't think we should emulate. Using an API is awkward for users that want to version-control their complete configuration and/or want to completely configure their Solr service in Docker, possibly overlaid with simple name-value pair config. We do all that where I work; no run-time manipulation post-deployment. Not even config in ZK (thanks to FileSystemConfigSetService). "Immutable Deployment" is the philosophy.

solr/modules/llm/src/java/org/apache/solr/llm/embedding/SolrEmbeddingModel.java

solr/modules/llm/src/java/org/apache/solr/llm/store/EmbeddingModelStore.java

solr/solr-ref-guide/modules/query-guide/pages/embedding-text.adoc

dsmiley · 2024-11-25T05:19:55Z

solr/solr-ref-guide/modules/query-guide/pages/embedding-text.adoc

+To run a query that embeds your query text, using a model you previously uploaded is simple:
+
+[source,text]
+?q={!text_to_vector model=a-model f=vector topK=10}hello world query


Did you consider adding capabilities to KnnQParser instead of defining a new one? Subclassing seems to lose the fact that we are doing KNN. I suppose the issue is modularization, as KNN is in Solr-Core but what you are adding is elsewhere. Nonetheless, some interfaces could be in solr-core.

I considered that, but proceeded this way for simplicity, happy for others to evolve this later!

Co-authored-by: David Smiley <[email protected]>

alessandrobenedetti · 2024-11-29T17:26:53Z

Is there a non-web-API way to configure a model? If not, why not? The vast majority of Solr's configuration is in solrconfig.xml, schema.xml, and some related config files linked from either, or in solr.xml for node level. Much of it can be manipulated with an API but nobody is forced to use such APIs to use the features, except for a few odd exceptions that I don't think we should emulate. Using an API is awkward for users that want to version-control their complete configuration and/or want to completely configure their Solr service in Docker, possibly overlaid with simple name-value pair config. We do all that where I work; no run-time manipulation post-deployment. Not even config in ZK (thanks to FileSystemConfigSetService). "Immutable Deployment" is the philosophy.

Done everything except this bit. I agree that it takes an extra step to source control this as you'll need to source control it and then push it to the model store (like it happens with Learning To Rank).
I'm not sure it's a blocker though, I need to be cautious about the time I have left to work on this.
Can't this be a future improvement? Contributed by others?

dsmiley · 2024-11-29T17:40:50Z

I agree that it takes an extra step to source control this as you'll need to source control it and then push it to the model store (like it happens with Learning To Rank).
I'm not sure it's a blocker though, I need to be cautious about the time I have left to work on this.
Can't this be a future improvement? Contributed by others?

Ok. I am curious why you did it this way though? I know you saw LTR doing it but it adds significant scope/complexity to this PR that otherwise wouldn't be there; it would have been merged by now and with less code and less of your limited time. I view that as the MVP and model CRUD as a bonus -- come later; more to debate. Put differently, do you see runtime manipulation as an important feature/approach for its viability vs editing solrconfig.xml and sending a RELOAD command?

dsmiley

So that we don't lose sight of the "knn" aspect of query execution, maybe the query parser should be registered at the name "knn_text_to_vector"

dsmiley · 2024-11-29T17:42:51Z

solr/modules/llm/src/java/org/apache/solr/llm/texttovector/model/SolrTextToVectorModel.java

@@ -63,12 +64,21 @@ public static SolrEmbeddingModel getInstance(
      var builder = modelClass.getMethod("builder").invoke(null);
      if (params != null) {
        /**


This isn't javadoc; don't use double-asterisk

dsmiley · 2024-11-29T17:53:17Z

solr/solr-ref-guide/modules/query-guide/pages/text-to-vector.adoc

@@ -258,7 +259,7 @@ http://localhost:8983/solr/techproducts/schema/embedding-model-store

 ----

-=== Running an embedding Query
+=== Running an Text to Vector Query


Suggested change

=== Running an Text to Vector Query

=== Running a Text-to-Vector Query

alessandrobenedetti · 2024-11-29T18:00:30Z

Sure, I thought that it was handy to have the vectorisation models as a managed resource, that can be added/deleted/viewed on the fly via REST.
I also thought managed resources were a natural evolution of the old "let's put it in the solrconfig.xml or separate files" approach.
Those were my motivations, not much more

github-actions bot added the documentation, jetty-server, tool:build, tests, cat:search, cat:schema, and dependencies labels Oct 29, 2024

alessandrobenedetti requested review from epugh and anshumg October 29, 2024 12:34

dsmiley reviewed Oct 29, 2024


epugh reviewed Oct 30, 2024


dsmiley requested a review from cpoerschke October 30, 2024 18:38