-
It's done and works beautifully!
Documented everything on Huggingface. Check it out!
-
I had an idea: with a simple new mode, "avoid new chunking", one could prepare SemanticFinder for a "universal index" setup.
Idea
The main idea is to have one very large (but highly compressed) index covering the most common English words and phrases:
global
global warming
green house gas
and so on. Maybe also with very common expressions or similar. Might need a linguist here.
I don't know how much memory it would take to include, e.g., the 99% most common words in English (and duplets and triplets). I think it's feasible, especially considering the drastic compression methods currently being discussed (binary embeddings, Matryoshka, etc.). Memory-wise, though, it probably only makes sense if the index stays smaller than the actual model: it's pointless to keep a 1000 MB index for a 100 MB model analyzing a 1000-word text. Speed-wise it might still make sense, however.
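For a rough feel of the numbers (my own back-of-envelope assumptions, not measured): with a 384-dimensional model such as all-MiniLM-L6-v2 and a hypothetical index of 100,000 entries covering words, duplets and triplets:

```ts
// Rough index-size estimate; entry count and dimensions are assumptions.
const entries = 100_000; // unigrams + duplets + triplets
const dims = 384;        // e.g. all-MiniLM-L6-v2

const float32Bytes = entries * dims * 4;  // full-precision vectors
const binaryBytes = (entries * dims) / 8; // 1 bit per dimension

console.log(`float32 index: ${(float32Bytes / 1024 ** 2).toFixed(1)} MB`); // ~146.5 MB
console.log(`binary index:  ${(binaryBytes / 1024 ** 2).toFixed(1)} MB`);  // ~4.6 MB
```

So full float32 vectors would already rival a ~100 MB model, while binary quantization keeps the index far below it, which is exactly the trade-off described above.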
Application
The advantage would be that very large texts would not need to be chunked at all, and no chunk embeddings would need to be computed either. Instead, search becomes a substring lookup: the user query embedding is only compared against the words/duplets/triplets that are already in the "Universal Index", allowing extremely large docs (or any docs) to be indexed on the fly. A rough sketch of that lookup follows below.
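Here is a minimal sketch, under my own assumptions about the data layout (a plain Map from n-gram to precomputed vector; the query vector would come from whatever embedding call SemanticFinder already uses):

```ts
// Universal index: precomputed embeddings for common words/duplets/triplets.
type UniversalIndex = Map<string, Float32Array>;

function cosine(a: Float32Array, b: Float32Array): number {
  let dot = 0, na = 0, nb = 0;
  for (let i = 0; i < a.length; i++) {
    dot += a[i] * b[i];
    na += a[i] * a[i];
    nb += b[i] * b[i];
  }
  return dot / (Math.sqrt(na) * Math.sqrt(nb));
}

// Collect the 1/2/3-grams of the document that exist in the index.
// No model inference on the document itself, just string lookups.
function indexedNgrams(text: string, index: UniversalIndex): Set<string> {
  const words = text.toLowerCase().split(/\s+/).filter(Boolean);
  const found = new Set<string>();
  for (let n = 1; n <= 3; n++) {
    for (let i = 0; i + n <= words.length; i++) {
      const gram = words.slice(i, i + n).join(" ");
      if (index.has(gram)) found.add(gram);
    }
  }
  return found;
}

// Only the user query is embedded; it is compared against the
// precomputed vectors of the n-grams actually present in the document.
function search(queryVec: Float32Array, doc: string, index: UniversalIndex, topK = 10) {
  const scored = [...indexedNgrams(doc, index)].map(gram => ({
    gram,
    score: cosine(queryVec, index.get(gram)!),
  }));
  return scored.sort((a, b) => b.score - a.score).slice(0, topK);
}
```

Mapping the matched n-grams back to their positions in the text is then an ordinary substring search, which is why document size stops being a bottleneck.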
Steps