Replies: 4 comments 1 reply
-
RAPTOR might also be a good choice. The idea is to build a tree-like structure from chunks by using dimensionality reduction and e.g. k-means for clustering. These clusters are then clustered again, and so on. Copying from LinkedIn:
It might also be a good choice to combine both ideas, e.g. use semantic chunking (which seems better than relying on error-prone dimensionality reduction) and build a tree-like structure from those chunks instead.
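The recursive clustering idea above can be sketched roughly as follows. This is only a toy illustration of the tree-building loop, not the actual RAPTOR implementation: real RAPTOR uses UMAP and Gaussian mixture models plus LLM-generated cluster summaries, while this sketch stands in with PCA, k-means, and mean embeddings, and the `build_tree` function and its parameters are made up for the example.

```python
# Toy sketch of RAPTOR-style recursive clustering (assumed names/parameters).
import numpy as np
from sklearn.decomposition import PCA
from sklearn.cluster import KMeans

def build_tree(embeddings, k=2, min_nodes=3):
    """Recursively cluster embeddings; each level's cluster means stand in
    for the LLM summaries that real RAPTOR would generate per cluster."""
    levels = [embeddings]
    current = embeddings
    while len(current) > min_nodes:
        # Dimensionality reduction before clustering (the step the comment
        # above calls error-prone on small or noisy data).
        n_comp = min(2, current.shape[1], len(current) - 1)
        reduced = PCA(n_components=n_comp).fit_transform(current)
        n_clusters = max(1, len(current) // k)
        labels = KMeans(n_clusters=n_clusters, n_init=10,
                        random_state=0).fit_predict(reduced)
        # Mean embedding per cluster becomes a node on the next level up.
        current = np.stack([current[labels == c].mean(axis=0)
                            for c in range(n_clusters)])
        levels.append(current)
    return levels

rng = np.random.default_rng(0)
chunks = rng.normal(size=(8, 16))      # 8 toy chunk embeddings
tree = build_tree(chunks)
print([len(level) for level in tree])  # levels shrink toward the root
```

At query time you would then retrieve against all levels of the tree, so both fine-grained chunks and higher-level summaries can match.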
-
@VarunNSrivastava do you maybe have other ideas, or have you seen any promising demos or similar?
-
Interesting! I'll dig into these demos. I spent a lot of time with a very similar approach and found it somewhat ineffective and computationally expensive. I found that breaking at punctuation was a bit more effective...
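For reference, a minimal sketch of the punctuation-based splitting mentioned above, using a simple regex. The function name and `min_len` parameter are invented for the example; real sentence splitters (e.g. nltk or spaCy) handle abbreviations and other edge cases this does not.

```python
import re

def split_at_punctuation(text, min_len=20):
    # Split after sentence-ending punctuation followed by whitespace.
    parts = re.split(r'(?<=[.!?])\s+', text.strip())
    # Merge fragments into the previous chunk while it is shorter than min_len.
    chunks = []
    for part in parts:
        if chunks and len(chunks[-1]) < min_len:
            chunks[-1] += " " + part
        else:
            chunks.append(part)
    return chunks

print(split_at_punctuation("First sentence here. Second one! A third?"))
```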
-
Linking current plans:
-
I saw this post about semantic chunking using sliding chunk windows and think it's a cool approach.
In a nutshell, you slide a window over the text, compute an embedding for each window, and measure the distance between adjacent windows. Using this score, you identify semantic "break points", i.e. the positions with the largest delta.
Combined with a minimum and maximum chunk size, that would make a great alternative to the current chunking functions!
LangChain has already implemented this and gives a good explanation in their docs.
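The break-point idea can be sketched as below, in the spirit of LangChain's SemanticChunker but not its actual API. Everything here is an assumption for illustration: the window is simplified to single sentences, `embed()` is a toy bag-of-letters stand-in for a real embedding model, and the `threshold` and `max_size` parameters are made up (LangChain derives its threshold from percentiles of the distance distribution instead).

```python
import numpy as np

def embed(sentence):
    # Toy stand-in for a real sentence-embedding model:
    # a normalized letter-frequency vector.
    vec = np.zeros(26)
    for ch in sentence.lower():
        if ch.isalpha():
            vec[ord(ch) - ord('a')] += 1
    return vec / (np.linalg.norm(vec) or 1.0)

def semantic_chunks(sentences, threshold=0.3, max_size=5):
    embs = [embed(s) for s in sentences]
    # Cosine distance between each adjacent pair of windows.
    dists = [1 - float(a @ b) for a, b in zip(embs, embs[1:])]
    chunks, current = [], [sentences[0]]
    for sent, d in zip(sentences[1:], dists):
        # Break where the semantic delta is large, or the chunk is full.
        if d > threshold or len(current) >= max_size:
            chunks.append(" ".join(current))
            current = []
        current.append(sent)
    chunks.append(" ".join(current))
    return chunks

# The jump in letter distribution between "aab" and "zzy" is the break point.
print(semantic_chunks(["aaa aaa", "aaa aab", "zzz zzy", "zzz zzz"]))
```

With a real embedding model the same loop applies; only `embed()` and the threshold selection change.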