-
It's done and works beautifully!
Documented everything on Huggingface. Check it out!
-
I had an idea: with a simple new mode, "avoid new chunking", one could prepare SemanticFinder for a "universal index" setup.
Idea
The main idea is to have one very large (but highly compressed) index covering the most common English words and phrases:
global
global warming
green house gas
and so on. Maybe also with very common expressions or similar. Might need a linguist here.
I don't know how much memory it would take to include, e.g., the 99% most common words in English (and duplets and triplets). I think it's feasible, especially considering the drastic compression methods currently being discussed (binary embeddings, Matryoshka, etc.). Memory-wise, though, it probably only makes sense if the index stays smaller than the actual model: it's pointless to keep a 1000 MB index for a 100 MB model analyzing a 1000-word text. Speed-wise it might still make sense, however.
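For a rough feel of the numbers (my own back-of-envelope assumptions, not measured): with a 384-dimensional model such as all-MiniLM-L6-v2 and a hypothetical index of 100,000 entries covering words, duplets and triplets:

```ts
// Rough index-size estimate; entry count and dimensions are assumptions.
const entries = 100_000; // unigrams + duplets + triplets
const dims = 384;        // e.g. all-MiniLM-L6-v2

const float32Bytes = entries * dims * 4;  // full-precision vectors
const binaryBytes = (entries * dims) / 8; // 1 bit per dimension

console.log(`float32 index: ${(float32Bytes / 1024 ** 2).toFixed(1)} MB`); // ~146.5 MB
console.log(`binary index:  ${(binaryBytes / 1024 ** 2).toFixed(1)} MB`);  // ~4.6 MB
```

So full float32 vectors would already rival a ~100 MB model, while binary quantization keeps the index far below it, which is exactly the trade-off described above.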
Application
The advantage would be that very large texts would not need to be chunked at all, and no chunk embeddings would need to be computed either. Instead, search becomes a substring lookup: the user query embedding is only compared against the words/duplets/triplets that are already in the "Universal Index", allowing extremely large docs (or any docs) to be indexed on the fly. A rough sketch of that lookup follows below.
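Here is a minimal sketch, under my own assumptions about the data layout (a plain Map from n-gram to precomputed vector; the query vector would come from whatever embedding call SemanticFinder already uses):

```ts
// Universal index: precomputed embeddings for common words/duplets/triplets.
type UniversalIndex = Map<string, Float32Array>;

function cosine(a: Float32Array, b: Float32Array): number {
  let dot = 0, na = 0, nb = 0;
  for (let i = 0; i < a.length; i++) {
    dot += a[i] * b[i];
    na += a[i] * a[i];
    nb += b[i] * b[i];
  }
  return dot / (Math.sqrt(na) * Math.sqrt(nb));
}

// Collect the 1/2/3-grams of the document that exist in the index.
// No model inference on the document itself, just string lookups.
function indexedNgrams(text: string, index: UniversalIndex): Set<string> {
  const words = text.toLowerCase().split(/\s+/).filter(Boolean);
  const found = new Set<string>();
  for (let n = 1; n <= 3; n++) {
    for (let i = 0; i + n <= words.length; i++) {
      const gram = words.slice(i, i + n).join(" ");
      if (index.has(gram)) found.add(gram);
    }
  }
  return found;
}

// Only the user query is embedded; it is compared against the
// precomputed vectors of the n-grams actually present in the document.
function search(queryVec: Float32Array, doc: string, index: UniversalIndex, topK = 10) {
  const scored = [...indexedNgrams(doc, index)].map(gram => ({
    gram,
    score: cosine(queryVec, index.get(gram)!),
  }));
  return scored.sort((a, b) => b.score - a.score).slice(0, topK);
}
```

Mapping the matched n-grams back to their positions in the text is then an ordinary substring search, which is why document size stops being a bottleneck.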
Steps