Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

SOLR-17525: Text Embedder Query Parser #2809

Open
wants to merge 65 commits into
base: main
Choose a base branch
from

Conversation

alessandrobenedetti
Copy link
Contributor

@alessandrobenedetti alessandrobenedetti commented Oct 29, 2024

https://issues.apache.org/jira/browse/SOLR-17525

Description

Scope of this issue is to integrate a new module able to use LLM (through managed services) to enhance aspect of Apache Solr.
Specifically this first Pull Request relates to handle embedding models and automatic text vectorisation in Solr.

Solution

The functionality has been introduced through LangChain4J (https://docs.langchain4j.dev).
The are several aspects I would like feedback on:

  • I used inversion of control to integrate support for multiple embedding models with minimal impact on the code. It works and it's decently clean, but I would love your feedback on this
  • Embedding models are accessed through client APIS and http requests are made to embed text using external services.
    To do that I added security exceptions in both 'solr/server/etc/security.policy' and 'gradle/testing/randomization/policies/solr-tests.policy'.
    It works but I have no idea if it's acceptable or the best way to do it

Tests

  • model storage tests added
  • query parsing tests added

Checklist

Please review the following and check all that apply:

  • I have reviewed the guidelines for How to Contribute and my code conforms to the standards described there to the best of my ability.
  • I have created a Jira issue and added the issue ID to my pull request title.
  • I have given Solr maintainers access to contribute to my PR branch. (optional but recommended, not available for branches on forks living under an organisation)
  • I have developed this patch against the main branch.
  • I have run ./gradlew check.
  • I have added tests for my changes.
  • I have added documentation for the Reference Guide

@alessandrobenedetti
Copy link
Contributor Author

I'll keep polishing it and finalise the documentation, but I think it's ready for review!

@alessandrobenedetti
Copy link
Contributor Author

Just as a reminder, currently the check fails with:
"* What went wrong:
Execution failed for task ':solr:modules:llm:analyzeClassesDependencies'.

Dependency analysis found issues.
usedUndeclaredArtifacts

  • dev.langchain4j:langchain4j-core:0.35.0@jar
    unusedDeclaredArtifacts
  • dev.langchain4j:langchain4j-cohere:0.35.0@jar
  • dev.langchain4j:langchain4j-hugging-face:0.35.0@jar
  • dev.langchain4j:langchain4j-mistral-ai:0.35.0@jar
  • dev.langchain4j:langchain4j-open-ai:0.35.0@jar
  • dev.langchain4j:langchain4j:0.35.0@jar
    "
    I'll work on it in the next few days

Copy link
Contributor

@dsmiley dsmiley left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Why does this module introduce a competing "model store" to Solr's existing "file store"? The "file store" was developed with "models" in mind, in addition to initially being developed for plugin packages.

@alessandrobenedetti
Copy link
Contributor Author

Why does this module introduce a competing "model store" to Solr's existing "file store"? The "file store" was developed with "models" in mind, in addition to initially being developed for plugin packages.

Let me elaborate here:
I took inspiration from the Learning To Rank module (ltr) where I was familiar with the model store:
org.apache.solr.ltr.store.ModelStore
similar to the file store but specialised in learning to rank models and their management and singleton instantiation .
The embeddingModel store does pretty much an equivalent job and grants direct access to embedding models so that the query parser or other components (like the update request processor in the works).

Given that, I am not that familiar with the file store, so if it can help in having a better and cleaner solution I'll be more than happy to take a look at it (after the 5th of November)

@epugh
Copy link
Contributor

epugh commented Oct 30, 2024

I would love to get a walk through on this exciting feature at the next community meetup...

Copy link
Contributor

@epugh epugh left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I went through and mostly commented on the code. Having said that, I don't know that I grok the high level vision, since that's hard to convey with just a PR. I would like to learn more about what langchain4j brings to the party.. Do we see it being used more widely in Solr? To expose more thigns in Solr to Langchain4j? Or just as an api to some embedding models?

}
Files.delete(mstore);
}
if (!solrconfig.equals("solrconfig-llm.xml")) {
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

this appears to be some magic? isn't the way we set things up in a way so that you control which solrconfig you are using? Maybe I don't know the RestTestBase enough, but genrally I just specify what configs etc to use...

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I cleaned up the tests a bit, before I resolve the comment, take a look if it's any better now!

}

/** produces a model encoded in json * */
public static String getModelInJson(String name, String className, String params) {
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

don't we use different approaches to building JSON elsewhere? One of my goals for Solr code bases is to have more consistency across all the areas... I know that isn't as helpful to the creator of the new code, like in this case, but in two years, when someone else has to come along and understand things, then it really pays off!

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Do you have any pointer? In the meantime I keep this open and look up myself right now if I can find anything

Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I suggest not using StringBuilder here; it's needlessly verbose. I think IntelliJ will switch it for you.

import org.slf4j.Logger;
import org.slf4j.LoggerFactory;

public class TestLlmBase extends RestTestBase {
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Is RestTestBase our future? @dsmiley didn't you have somewhat of a vision for where we are going with our hierarchy of testing classes? I am intrigued to learn more abut RestTestBase, it's new to me. However, if it isn't part of our vision for how we handle testing, then I am nervous about depending on it for another new module!

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I leave this comment open for @dsmiley to comment.
I took inspiration from the Learning To Rank module and it made possible with a decent effort to test the embedding model store, but I'm open to refactor it!

Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

It's used a ton so I sympathize with continuing to use it until there's renewed energy in more test renovation. Even then, it would likely stay. It's not deprecated even though its parent class is.

assertJPut(ManagedEmbeddingModelStore.REST_END_POINT, model, "/responseHeader/status==0");
// success
final String multipleModels =
"[{ name:\"testModel3\", class:\""
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

here we do string building for json, which I think is our dominant pattern... personally not having all the .appends makes it more readable, even with the escaping around the double quotes!

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

mmm I am missing the point here, can you tell me how you think this can be improved with an example? happy to do it then!

Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

He was thinking-out-loud, not telling you to do differently or even implied. If I may do the same, I'll say I look forward to us using text blocks, now available on main due to Java 21. I suppose you will back-port this to 9.x so I suppose you won't bother using Java 21 features yet.

@@ -242,9 +242,9 @@ client.add(Arrays.asList(d1, d2));

== Query Time

Apache Solr provides two query parsers that work with dense vector fields, that each support different ways of matching documents based on vector similarity: The `knn` query parser, and the `vectorSimilarity` query parser.
Apache Solr provides three query parsers that work with dense vector fields, that each support different ways of matching documents based on vector similarity: The `knn` query parser, the `vectorSimilarity` query parser and the `embed` query parser.
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

We should be thinking about adding some call out text of "Here is when you use knn, here is when you use vectorSimilarity, and here is when you use embed". I'd personally like to see our ref guide move beyond being just a reference to look things up and also explain more the "why".

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Done, take a look and if enough feel free to resolve this conversation

settings.gradle Outdated Show resolved Hide resolved
@dsmiley
Copy link
Contributor

dsmiley commented Nov 1, 2024

Our codebase is vast; I sympathize with almost nobody knowing about the file-store (I'd count me, @noblepaul , @chatman as knowing). As a project maintainer, I'm just concerned about growing duplicative mechanisms. Heck, maybe the blob-store is yet another, albeit we've identified that one as one to go. An early design review might have elucidated such an option before your sunk cost here Alessandro. @cpoerschke I tagged you for your review because I believe you have a good familiarity of LTR for comparison to its model store to see if you also think it makes sense to clone that one, but you might not know about the file store that I am proposing for this use-case.

@alessandrobenedetti
Copy link
Contributor Author

Thanks all for the feedback, in order:

  1. @epugh I am finalising the review of your comments, I'll add considerations/resolve them one by one
  2. @cpoerschke thanks for the commit suggestions, I merged all of them, tests are green
  3. @dsmiley no problem David! The problem when you are not a team working full time on a project is that sometimes I end up with some limited time allocated to contributing and I can't wait async. I'll take a look today to the model store you propose and let you know my feedback here, super happy to switch to it and avoid duplication if it's fit for purpose!

@epugh
Copy link
Contributor

epugh commented Nov 6, 2024

When you look at the model store, it would be great to hear where it comes up short and we could look at improving it. What would make it an "easy" decision to adopt!??

@alessandrobenedetti
Copy link
Contributor Author

alessandrobenedetti commented Nov 6, 2024

I'm taking a look at the file-store from a code perspective and it's harder than expected, I'll keep digging but @dsmiley do you have any code references or tests that show the usage of this component?

Are we talking about: org.apache.solr.filestore ?

@alessandrobenedetti
Copy link
Contributor Author

I was able to find:
org.apache.solr.filestore.ClusterFileStore -> implements ClusterFileStoreApis
org.apache.solr.filestore.NodeFileStore -> implements NodeFileStoreApis

org.apache.solr.filestore.DistribFileStore -> implements FileStore

it's not clear how they differ and there's no documentation I could easily find.
Given that I ended up finding : req.getCoreContainer().getFileStore() and it returns a org.apache.solr.filestore.DistribFileStore, so by best bet is that the file-store @dsmiley is referring is: org.apache.solr.filestore.DistribFileStore
Now, that FileStore has a get method 'org.apache.solr.filestore.DistribFileStore#get' that surprisingly doesn't return anything (return type is void) but has a convoluted way of consuming the output rather than returning it.

Anyway, I'm progressing this route and I should be able to have additional insights tomorrow.
My first feedback on this follows:

  1. it took me a decent amount of time to implement the functionality taking inspiration from the LTR module, it's taking way more time to do it via the FileStore.
    It's not easy to find information about it on the Reference guide, and it was not intuitive to understand the right code to look at.
    If that's the route we want new developers to follow for storing models and similar managed resources in Solr, I suspect we need to make it way easier to understand and interact with.
    JavaDocs were pretty much missing and code examples or tests not easy to be found.

N.B. these are just pragmatic observations to improve it, no offence to whoever designed and implemented this (I have not even checked)

@alessandrobenedetti
Copy link
Contributor Author

alessandrobenedetti commented Nov 7, 2024

As I was progressing on the route of the FileStore, a thought came to my mind:
With the current implementation the embedding model store guarantees a singleton life for each model uploaded, and what's available is the object: org.apache.solr.llm.embedding.SolrEmbeddingModel.
an object able to encode text to vector calling behind the scenes a REST API.
It's currently a lightweight object, but nothing prevents a future contributor to contribute the in-process model for example (https://docs.langchain4j.dev/integrations/embedding-models/in-process).

The embeddingModelStore currently handles the instantiation part, so when you access a model from a query parser or an update request processor (next on my to-do list), you get the object, ready to be used from the store, with no need to instantiate the object again (that could be expensive).

From what I'm seeing If I understood correctly, the FileStore will only store the configuration file for the model, so can easily access it but if we want the singleton mechanism for the model object we need to implement it somewhere, if not every time we use the query parser we need to instantiate the model, from the Json stored in the FileStore.

So... I'm not convinced anymore that I should spend effort in that direction for this specific use case, using the fileStore and deleting the embedding model store was seducing, but if I need to implement an additional mechanism to handle models to be singleton on top of it, I don't see much benefit especially from development time perspective.

Please correct me if I missed something or if you believe the current implementation won't work in certain scenarios.
I'm super happy to spend the effort to make a cleaner contribution, but if it's a lot of effort only for a "nice to have", I don't think I have that luxury right now.

Don't take it as provocative, it's just a genuine perspective of someone with limited time to dedicate to the project, if I was paid full-time on this I wouldn't have any problem in pursuing nice to haves.

@alessandrobenedetti
Copy link
Contributor Author

alessandrobenedetti commented Nov 7, 2024

When you look at the model store, it would be great to hear where it comes up short and we could look at improving it. What would make it an "easy" decision to adopt!??

My partial feedback after 3-4 hours working on that:

  1. it's difficult to find the FileStore, there are multiple with very similar names, it's not easy to identify which one to use
  2. JavaDocs explaining the store way of functioning are missing ( or very difficult to find, I couldn't)
  3. there are plenty of elaborated comments in the code that make code readability even more difficult (comments with observations, suggestions, exclamation marks etc)
  4. the get method, to get a file from the store is void and appears unnecessarily complicated to interact with
    void get(String path, Consumer filecontent, boolean getMissing) throws IOException;
    getMissing is not clear what it does and actually it's ignored in the only implementation I found (and also mispelled not in camelCase)
    The implementation is not readable and seems to require way too much time to be understood for just a 'get' method.
    Not sure if null is returned for example
  5. it's very hard to find code examples that show how to upload and get files from it, also exhaustive tests were missing or hard to find, I couldn't find much
  6. there are APIs to interact with it, but they seem oriented to an external usage rather than to use the store from within Solr?
  7. Reference Guide talks about packages and modules but doesn't elaborate much on the FileStore

This is a genuine not provocative feedback from someone with a decent experience with Solr and Solr codebase, I assume that a new developer would struggle even more in using it?
@epugh hope this is helpful

@alessandrobenedetti
Copy link
Contributor Author

hi @malliaridis, first of all nice to meet you!
When rebasing on upstream I ended up with the new dependency approach, given it's not released I had to dig a bit in the Pull Request for documentation and ended up with my latest commit (that doesn't work).
I am no Gradle expert, I can definitely spend some time on this in the next few days but given you worked on https://issues.apache.org/jira/browse/SOLR-17406, maybe can fix what I didn't do correctly?
Thanks in advance!

@malliaridis
Copy link
Contributor

It's nice meeting new people working on this project :D

I have no push permission to push the fixes I've created. There are two issues in the referencing, one the libs.langchain4j and the other one was libs.apache.solr.testframework.

The first one is actually a tricky one and I will have to provide a note in the documentation in case this happens again. When you have in libs.versions.toml langchain4j and langchain4j-cohere (for example), then you have to reference the first dependency as libs.langchain4j.asProvider(), otherwise it is just a group of dependencies, not the dependency itself. An alternative solution in this case is to simply add another keyword and extend langchain4j in libs.version.toml, like I did. Usually for "core" libraries that have simple names like in this case, you can use core or the module name itself (langchain4j-langchain4j).

The second dependency libs.apache.solr.testframework could not be resolved, either because you mean libs.apache.lucene.testframework or project(':solr:test-framework'). I assumed the second one, but if wrong, please update it.

If you update the permissions of this PR, I can push these changes together with a cleanup of the newly added license files (there is another issue there) and and update locks / checksums.

@alessandrobenedetti
Copy link
Contributor Author

hi @malliaridis, my pleasure to know you!
Thanks for the detailed explanation!
I just added you to the branch, you should be able to write!
If you are willing to help with that, it would be much appreciated!

thanks!

@malliaridis
Copy link
Contributor

@alessandrobenedetti If you need further assistance with the rest of the issues, just let me know. :)

@alessandrobenedetti
Copy link
Contributor Author

Thanks @malliaridis , your help has been gold. Tests are green again and the deps are in line with the new expectations!
I'll wait until next Thursday (After Search Solutions London 24) max for some additional feedback and then merge and move on to the next steps!
Excited to have this first milestone in! :D

Copy link
Contributor

@dsmiley dsmiley left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Is there a non-web-API way to configure a model? If not, why not? The vast majority of Solr's configuration is in solrconfig.xml, schema.xml, and some related config files linked from either, or in solr.xml for node level. Much of it can be manipulated with an API but nobody is forced to use such APIs to use the features, except for a few odd exceptions that I don't think we should emulate. Using an API is awkward for users that want to version-control their complete configuration and/or want to completely configure their Solr service in Docker, possibly overlayed with simple name-value apir config. We do all that where I work; no run-time manipulation post-deployment. Not even config in ZK (thanks to FileSystemConfigSetService). "Immutable Deployment" is the philosophy.

To run a query that embeds your query text, using a model you previously uploaded is simple:

[source,text]
?q={!text_to_vector model=a-model f=vector topK=10}hello world query
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Did you consider adding capabilities to KnnQParser instead of defining a new one? Subclassing seems to lose the fact that we are doing KNN. I suppose the issue is modularization, as KNN is in Solr-Core but what you are adding is elsewhere. Nonetheless, some interfaces could be in solr-core.

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I considered that, but proceeded this way for simplicity, happy for others to evlove this later!

@alessandrobenedetti
Copy link
Contributor Author

Is there a non-web-API way to configure a model? If not, why not? The vast majority of Solr's configuration is in solrconfig.xml, schema.xml, and some related config files linked from either, or in solr.xml for node level. Much of it can be manipulated with an API but nobody is forced to use such APIs to use the features, except for a few odd exceptions that I don't think we should emulate. Using an API is awkward for users that want to version-control their complete configuration and/or want to completely configure their Solr service in Docker, possibly overlayed with simple name-value apir config. We do all that where I work; no run-time manipulation post-deployment. Not even config in ZK (thanks to FileSystemConfigSetService). "Immutable Deployment" is the philosophy.

Done everything except this bit, I agree that it takes an extra step to sopurce control this as you'll need to source control it and then push it to the model store (like it happens with Learning To Rank).
I'm not sure it's a blocket though, I need to be cautious about the time I have left to work on this.
Can't this be a future improvement? Contributed by others?

@dsmiley
Copy link
Contributor

dsmiley commented Nov 29, 2024

I agree that it takes an extra step to sopurce control this as you'll need to source control it and then push it to the model store (like it happens with Learning To Rank).
I'm not sure it's a blocket though, I need to be cautious about the time I have left to work on this.
Can't this be a future improvement? Contributed by others?

Ok. I am curious why you did it this way though? I know you saw LTR doing it but it adds significant scope/complexity to this PR that otherwise wouldn't be there; it would have been merged by now and with less code and less of your limited time. I view that as the MVP and model CRUD as a bonus -- come later; more to debate. Put differently, do you see runtime manipulation as an important feature/approach for its viability vs editing solrconfig.xml and sending a RELOAD command?

Copy link
Contributor

@dsmiley dsmiley left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

So that we don't loose sight of the "knn" aspect of query execution, maybe the query parser should be registered at the name "knn_text_to_vector"

@@ -63,12 +64,21 @@ public static SolrEmbeddingModel getInstance(
var builder = modelClass.getMethod("builder").invoke(null);
if (params != null) {
/**
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

This isn't javadoc; don't use double-asterisk

@@ -258,7 +259,7 @@ http://localhost:8983/solr/techproducts/schema/embedding-model-store

----

=== Running an embedding Query
=== Running an Text to Vector Query
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Suggested change
=== Running an Text to Vector Query
=== Running a Text-to-Vector Query

@alessandrobenedetti
Copy link
Contributor Author

Sure, I thought that it was handy to have the vectorisation models as a managed resource, that can be added/deleted/viewed on the fly via REST.
Also thought managed resources were a natural evolution of the old "let's put it in the solrconfig.xml or separate files approach".
Those were my motivations, not much more

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
cat:search dependencies Dependency upgrades documentation Improvements or additions to documentation jetty-server tests tool:build
Projects
None yet
Development

Successfully merging this pull request may close these issues.

5 participants