SOLR-17525: Text Embedder Query Parser #2809
base: main
I'll keep polishing it and finalise the documentation, but I think it's ready for review!
Just as a reminder, currently the check fails with:
Why does this module introduce a competing "model store" to Solr's existing "file store"? The "file store" was developed with "models" in mind, in addition to initially being developed for plugin packages.
Let me elaborate here: Given that, I am not that familiar with the file store, so if it offers a better and cleaner solution I'll be more than happy to take a look at it (after the 5th of November).
I would love to get a walkthrough of this exciting feature at the next community meetup...
I went through and mostly commented on the code. Having said that, I don't know that I grok the high-level vision, since that's hard to convey with just a PR. I would like to learn more about what langchain4j brings to the party. Do we see it being used more widely in Solr? To expose more things in Solr to Langchain4j? Or just as an API to some embedding models?
}
Files.delete(mstore);
}
if (!solrconfig.equals("solrconfig-llm.xml")) {
This appears to be some magic? Aren't things set up so that you control which solrconfig you are using? Maybe I don't know RestTestBase well enough, but generally I just specify which configs etc. to use...
I cleaned up the tests a bit. Before I resolve the comment, take a look and see whether it's any better now!
}

/** produces a model encoded in json * */
public static String getModelInJson(String name, String className, String params) {
don't we use different approaches to building JSON elsewhere? One of my goals for Solr code bases is to have more consistency across all the areas... I know that isn't as helpful to the creator of the new code, like in this case, but in two years, when someone else has to come along and understand things, then it really pays off!
Do you have any pointers? In the meantime I'll keep this open and look myself to see if I can find anything.
I suggest not using StringBuilder here; it's needlessly verbose. I think IntelliJ will switch it for you.
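To picture the suggestion above, here is a minimal sketch of `getModelInJson` written with plain string concatenation rather than a StringBuilder. Only the method signature comes from the PR; the class name `ModelJsonSketch`, the JSON field names, and the assumption that `params` is already a JSON object literal are illustrative guesses, not the PR's actual code.

```java
/** Hypothetical holder class; only the method signature below appears in the PR. */
final class ModelJsonSketch {
  private ModelJsonSketch() {}

  /**
   * Produces a model encoded in JSON.
   * Assumes name and className need no escaping and params is a JSON object literal.
   */
  static String getModelInJson(String name, String className, String params) {
    return "{\n"
        + "  \"name\": \"" + name + "\",\n"
        + "  \"class\": \"" + className + "\",\n"
        + "  \"params\": " + params + "\n"
        + "}";
  }
}
```

The compiler already turns a single concatenation expression like this into efficient code, so an explicit StringBuilder mainly adds noise here.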
import org.slf4j.Logger;
import org.slf4j.LoggerFactory;

public class TestLlmBase extends RestTestBase {
Is RestTestBase our future? @dsmiley didn't you have somewhat of a vision for where we are going with our hierarchy of testing classes? I am intrigued to learn more about RestTestBase, it's new to me. However, if it isn't part of our vision for how we handle testing, then I am nervous about depending on it for another new module!
I'll leave this comment open for @dsmiley to comment.
I took inspiration from the Learning To Rank module, and it made it possible, with a decent effort, to test the embedding model store, but I'm open to refactoring it!
It's used a ton so I sympathize with continuing to use it until there's renewed energy in more test renovation. Even then, it would likely stay. It's not deprecated even though its parent class is.
assertJPut(ManagedEmbeddingModelStore.REST_END_POINT, model, "/responseHeader/status==0");
// success
final String multipleModels =
"[{ name:\"testModel3\", class:\""
Here we do string building for JSON, which I think is our dominant pattern... personally, not having all the .append calls makes it more readable, even with the escaping around the double quotes!
Hmm, I'm missing the point here. Can you show me, with an example, how you think this can be improved? Happy to do it then!
He was thinking out loud, not telling you to do it differently, or even implying it. If I may do the same, I'll say I look forward to us using text blocks, now available on main due to Java 21. I suppose you will back-port this to 9.x, so I suppose you won't bother using Java 21 features yet.
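For readers unfamiliar with the feature mentioned above, the following sketch contrasts escaped concatenation with a text block (a standard Java language feature since Java 15). The class and method names are hypothetical, and the JSON uses quoted keys for illustration, whereas the PR's test string uses unquoted ones.

```java
/** Hypothetical comparison of two ways to write the same JSON test payload. */
final class TextBlockSketch {
  private TextBlockSketch() {}

  /** Concatenation with escaped quotes; compiles on older Java versions. */
  static String concatenated(String className) {
    return "[{ \"name\": \"testModel3\", \"class\": \"" + className + "\" }]";
  }

  /** Text block: quotes need no escaping. Requires Java 15 or newer. */
  static String textBlock(String className) {
    // The closing delimiter sits on the content line, so no trailing newline is added.
    return """
        [{ "name": "testModel3", "class": "%s" }]"""
        .formatted(className);
  }
}
```

Both methods return the same string, so a test could switch from one form to the other without changing its assertions.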
@@ -242,9 +242,9 @@ client.add(Arrays.asList(d1, d2));

== Query Time

-Apache Solr provides two query parsers that work with dense vector fields, that each support different ways of matching documents based on vector similarity: The `knn` query parser, and the `vectorSimilarity` query parser.
+Apache Solr provides three query parsers that work with dense vector fields, that each support different ways of matching documents based on vector similarity: The `knn` query parser, the `vectorSimilarity` query parser and the `embed` query parser.
We should be thinking about adding some call out text of "Here is when you use knn, here is when you use vectorSimilarity, and here is when you use embed". I'd personally like to see our ref guide move beyond being just a reference to look things up and also explain more the "why".
Done. Take a look and, if it's enough, feel free to resolve this conversation.
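As a rough illustration of the "when to use which" question raised above, this sketch builds the three kinds of query string side by side. The `knn` and `vectorSimilarity` forms follow Solr's documented local-params syntax, and the text-to-vector form mirrors the PR's documentation snippet; the class name, the field name `vector`, and all values are placeholders.

```java
/** Hypothetical helper that only assembles query strings; it does not talk to Solr. */
final class VectorQuerySketch {
  private VectorQuerySketch() {}

  /** knn: you already have a vector and want the topK nearest documents. */
  static String knn(String field, int topK, String vector) {
    return "{!knn f=" + field + " topK=" + topK + "}" + vector;
  }

  /** vectorSimilarity: you already have a vector and want every document scoring at least minReturn. */
  static String vectorSimilarity(String field, double minReturn, String vector) {
    return "{!vectorSimilarity f=" + field + " minReturn=" + minReturn + "}" + vector;
  }

  /** text_to_vector: you only have text, and an uploaded model turns it into the vector. */
  static String textToVector(String model, String field, int topK, String text) {
    return "{!text_to_vector model=" + model + " f=" + field + " topK=" + topK + "}" + text;
  }
}
```

For example, `VectorQuerySketch.textToVector("a-model", "vector", 10, "hello world query")` yields the query shown in the PR's reference-guide example.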
Our codebase is vast; I sympathize with almost nobody knowing about the file-store (I'd count me, @noblepaul, and @chatman as knowing). As a project maintainer, I'm just concerned about growing duplicative mechanisms. Heck, maybe the blob-store is yet another, although we've identified that one as one to go. An early design review might have surfaced such an option before your sunk cost here, Alessandro. @cpoerschke I tagged you for review because I believe you know LTR well enough to compare against its model store and say whether you also think it makes sense to clone that one, though you might not know about the file store that I am proposing for this use case.
Thanks all for the feedback, in order:
When you look at the model store, it would be great to hear where it comes up short so we could look at improving it. What would make it an "easy" decision to adopt?
I'm taking a look at the file-store from a code perspective and it's harder than expected. I'll keep digging, but @dsmiley do you have any code references or tests that show how this component is used? Are we talking about org.apache.solr.filestore?
I was able to find org.apache.solr.filestore.DistribFileStore, which implements FileStore. It's not clear how they differ, and there's no documentation I could easily find. Anyway, I'm progressing down this route and I should have additional insights tomorrow.
N.B. these are just pragmatic observations to improve it, no offence to whoever designed and implemented this (I have not even checked).
As I was progressing down the FileStore route, a thought came to mind. The embeddingModelStore currently handles instantiation, so when you access a model from a query parser or an update request processor (next on my to-do list), you get the object from the store ready to be used, with no need to instantiate it again (which could be expensive). If I understood correctly, the FileStore would only store the configuration file for the model. It would be easy to access, but if we want the singleton mechanism for the model object we need to implement it somewhere; otherwise, every time we use the query parser we need to instantiate the model from the JSON stored in the FileStore. So I'm no longer convinced that I should spend effort in that direction for this specific use case. Using the FileStore and deleting the embedding model store was tempting, but if I need to implement an additional mechanism on top of it to keep models as singletons, I don't see much benefit, especially from a development-time perspective. Please correct me if I missed something or if you believe the current implementation won't work in certain scenarios. Don't take this as provocative; it's a genuine perspective from someone with limited time to dedicate to the project. If I were paid full-time on this, I wouldn't have any problem pursuing nice-to-haves.
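The "singleton mechanism" described above could look roughly like the following cache, which builds each model once and hands back the same instance afterwards. This is a sketch under stated assumptions, not the PR's store: `EmbeddingModelSketch`, `ModelCacheSketch`, and the `loader` function (standing in for "read the JSON config and instantiate the model") are all hypothetical names.

```java
import java.util.Map;
import java.util.concurrent.ConcurrentHashMap;
import java.util.function.Function;

/** Hypothetical stand-in for an expensive-to-build embedding model. */
interface EmbeddingModelSketch {
  float[] embed(String text);
}

/** Keeps one model instance per name so callers never rebuild it per query. */
final class ModelCacheSketch {
  private final Map<String, EmbeddingModelSketch> models = new ConcurrentHashMap<>();
  private final Function<String, EmbeddingModelSketch> loader;

  /** The loader turns a model name into an instance, e.g. from stored JSON config. */
  ModelCacheSketch(Function<String, EmbeddingModelSketch> loader) {
    this.loader = loader;
  }

  /** Returns the cached instance, invoking the loader only on first access. */
  EmbeddingModelSketch get(String name) {
    return models.computeIfAbsent(name, loader);
  }

  /** Drops a cached instance, e.g. after its configuration changes. */
  void invalidate(String name) {
    models.remove(name);
  }
}
```

Whichever component stores the configuration, something of this shape would have to sit between it and the query parser to avoid re-instantiating the model on every request.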
My partial feedback after 3-4 hours working on that:
This is genuine, not provocative, feedback from someone with decent experience with Solr and its codebase. I assume that a new developer would struggle even more in using it?
hi @malliaridis, first of all nice to meet you!
It's nice meeting new people working on this project :D I don't have permission to push the fixes I've created. There are two issues in the referencing, one the libs.langchain4j and the other one was The first one is actually a tricky one and I will have to provide a note in the documentation in case this happens again. When you have in The second dependency If you update the permissions of this PR, I can push these changes together with a cleanup of the newly added license files (there is another issue there) and update locks / checksums.
hi @malliaridis, my pleasure to know you! thanks!
@alessandrobenedetti If you need further assistance with the rest of the issues, just let me know. :)
Thanks @malliaridis, your help has been gold. Tests are green again and the deps are in line with the new expectations!
Is there a non-web-API way to configure a model? If not, why not? The vast majority of Solr's configuration is in solrconfig.xml, schema.xml, and some related config files linked from either, or in solr.xml for node level. Much of it can be manipulated with an API, but nobody is forced to use such APIs to use the features, except for a few odd exceptions that I don't think we should emulate. Using an API is awkward for users who want to version-control their complete configuration and/or want to completely configure their Solr service in Docker, possibly overlaid with simple name-value pair config. We do all that where I work; no run-time manipulation post-deployment. Not even config in ZK (thanks to FileSystemConfigSetService). "Immutable Deployment" is the philosophy.
To run a query that embeds your query text, using a model you previously uploaded is simple:

[source,text]
?q={!text_to_vector model=a-model f=vector topK=10}hello world query
Did you consider adding capabilities to KnnQParser instead of defining a new one? Subclassing seems to lose the fact that we are doing KNN. I suppose the issue is modularization, as KNN is in Solr-Core but what you are adding is elsewhere. Nonetheless, some interfaces could be in solr-core.
I considered that, but proceeded this way for simplicity. Happy for others to evolve this later!
Co-authored-by: David Smiley <[email protected]>
Done everything except this bit. I agree that it takes an extra step to source-control this, as you'll need to source-control it and then push it to the model store (as happens with Learning To Rank).
Ok. I am curious why you did it this way though? I know you saw LTR doing it, but it adds significant scope/complexity to this PR that otherwise wouldn't be there; it would have been merged by now, with less code and less of your limited time. I view that as the MVP and model CRUD as a bonus -- to come later; more to debate. Put differently, do you see runtime manipulation as an important feature/approach for its viability vs editing solrconfig.xml and sending a RELOAD command?
So that we don't lose sight of the "knn" aspect of query execution, maybe the query parser should be registered under the name "knn_text_to_vector".
@@ -63,12 +64,21 @@ public static SolrEmbeddingModel getInstance(
var builder = modelClass.getMethod("builder").invoke(null);
if (params != null) {
/**
This isn't javadoc; don't use double-asterisk
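For context on the hunk above, here is a self-contained sketch of the reflective builder pattern it starts with: obtain a builder through a static `builder()` method, apply each parameter by calling the builder method of the same name, then call `build()`. Only the `getMethod("builder").invoke(null)` call is taken from the PR; `DemoModel`, `ReflectiveBuilderSketch`, and the exact-type method lookup are assumptions made for illustration.

```java
import java.lang.reflect.Method;
import java.util.Map;

/** Hypothetical model with a static builder(), standing in for a real model class. */
final class DemoModel {
  final String apiKey;
  final String modelName;

  private DemoModel(Builder b) {
    this.apiKey = b.apiKey;
    this.modelName = b.modelName;
  }

  public static Builder builder() {
    return new Builder();
  }

  public static final class Builder {
    private String apiKey;
    private String modelName;

    public Builder apiKey(String v) { this.apiKey = v; return this; }
    public Builder modelName(String v) { this.modelName = v; return this; }
    public DemoModel build() { return new DemoModel(this); }
  }
}

final class ReflectiveBuilderSketch {
  private ReflectiveBuilderSketch() {}

  /** Builds an instance of modelClass, applying each param via the same-named builder method. */
  static Object getInstance(Class<?> modelClass, Map<String, Object> params) {
    try {
      Object builder = modelClass.getMethod("builder").invoke(null);
      if (params != null) {
        /*
         * Plain block comment (single asterisk), since this is not Javadoc.
         * Assumption: each builder method takes exactly the runtime type of its value,
         * so primitive-typed setters would need extra handling.
         */
        for (Map.Entry<String, Object> e : params.entrySet()) {
          Method setter = builder.getClass().getMethod(e.getKey(), e.getValue().getClass());
          setter.invoke(builder, e.getValue());
        }
      }
      return builder.getClass().getMethod("build").invoke(builder);
    } catch (ReflectiveOperationException ex) {
      throw new IllegalStateException("Cannot build model " + modelClass.getName(), ex);
    }
  }
}
```

The block comment inside the loop also shows the style the review comment asks for: a single asterisk for implementation notes, with `/**` reserved for Javadoc on declarations.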
@@ -258,7 +259,7 @@ http://localhost:8983/solr/techproducts/schema/embedding-model-store

----

-=== Running an embedding Query
+=== Running an Text to Vector Query
-=== Running an Text to Vector Query
+=== Running a Text-to-Vector Query
Sure, I thought that it was handy to have the vectorisation models as a managed resource that can be added/deleted/viewed on the fly via REST.
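To make the managed-resource idea above concrete, this sketch builds (without sending) an HTTP PUT that uploads a model definition. The endpoint URL is the one visible in the PR's reference-guide diff, and PUT matches the `assertJPut` call in the tests; the class name and the shape of the JSON body are assumptions.

```java
import java.net.URI;
import java.net.http.HttpRequest;

/** Hypothetical client-side helper; it only constructs the request. */
final class ModelStoreRequestSketch {
  private ModelStoreRequestSketch() {}

  /** Endpoint as it appears in the PR's documentation diff. */
  static final String ENDPOINT =
      "http://localhost:8983/solr/techproducts/schema/embedding-model-store";

  /** Builds a PUT request carrying the model definition as JSON. */
  static HttpRequest upload(String modelJson) {
    return HttpRequest.newBuilder(URI.create(ENDPOINT))
        .header("Content-Type", "application/json")
        .PUT(HttpRequest.BodyPublishers.ofString(modelJson))
        .build();
  }
}
```

Sending it would be a matter of passing the request to `java.net.http.HttpClient.send` with a string body handler against a running Solr instance.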
https://issues.apache.org/jira/browse/SOLR-17525
Description
The scope of this issue is to integrate a new module that can use LLMs (through managed services) to enhance aspects of Apache Solr.
Specifically, this first pull request handles embedding models and automatic text vectorisation in Solr.
Solution
The functionality has been introduced through LangChain4J (https://docs.langchain4j.dev).
There are several aspects I would like feedback on:
To do that, I added security exceptions in both 'solr/server/etc/security.policy' and 'gradle/testing/randomization/policies/solr-tests.policy'.
It works, but I have no idea if it's acceptable or the best way to do it.
Tests