
Commit: update
Sako M authored and Sako M committed Nov 11, 2024
1 parent 80b05df commit 24ecf36
Showing 7 changed files with 195 additions and 281 deletions.
6 changes: 6 additions & 0 deletions .flox/env.json
@@ -0,0 +1,6 @@
{
"owner": "sakomws",
"name": "01_intro_workshop",
"floxhub_url": "https://hub.flox.dev/",
"version": 1
}
5 changes: 5 additions & 0 deletions .flox/env.lock
@@ -0,0 +1,5 @@
{
"rev": "3b334d83592c53a2d702702e763a3970f42e35aa",
"local_rev": null,
"version": 1
}
1 change: 1 addition & 0 deletions .gitignore
@@ -122,6 +122,7 @@ celerybeat.pid
*.sage.py

# Environments
**.env
.env
.venv
env/
4 changes: 2 additions & 2 deletions README.md
@@ -22,9 +22,9 @@

# Sessions

### Name: [Sako M](https://linkedin.com/in/sakom)
- **Company**: Gladly
- **Bio**: Sako attended 45+ hackathons in just 9 months and will be setting the stage by diving into Hackathon Building Blocks, showing how to set yourself up for hackathon success! From ideation to execution, learn the strategies that make ideas fly!

#### Session
- **Title**: Intro to workshop / Building blocks for hackathons
245 changes: 181 additions & 64 deletions sessions/05_chunking_strategies/README.md
@@ -1,98 +1,215 @@

# Session Title: Chunking Strategies for Retrieval Applications with Weaviate

In this workshop, we're going to use Weaviate to explore the power of vector search. The challenges below will get you started with the basics of retrieval for LLM applications that leverage vector search for RAG.

---

## Challenge 0: Generate a vector embedding using the OpenAI API

Try out the `get_embedding()` function below. You can create a vector embedding for any text with it.

```python
from openai import OpenAI
client = OpenAI(api_key="INSERT YOUR API KEY")

def get_embedding(text, model="text-embedding-3-small"):
    text = text.replace("\n", " ")
    return client.embeddings.create(input=[text], model=model).data[0].embedding

vector = get_embedding("A piece of text that we're going to vectorize")

print(vector)
```

This embedding is a 1,536-dimensional vector (by default) that semantically represents your input text. When you vectorize all of your chunks and store them in Weaviate, they are inserted into an HNSW vector index, which lets us run vector search over all of our data and find content that is semantically related to a query.

This is the magic behind RAG, and ultimately agentic RAG.
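
To get a feel for what "semantically related" means in practice, here's a minimal sketch that reuses the `get_embedding()` function from above and compares texts by cosine similarity; texts about similar topics should score higher:

```python
import math

def cosine_similarity(a, b):
    # Dot product divided by the product of the vector magnitudes
    dot = sum(x * y for x, y in zip(a, b))
    return dot / (math.sqrt(sum(x * x for x in a)) * math.sqrt(sum(x * x for x in b)))

v1 = get_embedding("How do I split a long document into chunks?")
v2 = get_embedding("Text splitting strategies for retrieval")
v3 = get_embedding("My favorite pasta recipe uses fresh basil")

print(cosine_similarity(v1, v2))  # Related topics, expect a higher score
print(cosine_similarity(v1, v3))  # Unrelated topics, expect a lower score
```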

The great thing about Weaviate is that it has direct integrations with various model providers and inference platforms, such as OpenAI, Google, FriendliAI, and many more.

When you store a chunk in Weaviate, the client library can automatically generate a vector for it before it is stored in your Weaviate cluster.

## Challenge 1: Create a Weaviate Cluster and a Chunk Collection

Alright, we're in a chunking workshop today to learn about text splitting strategies, but first, we'll need a place to store those chunks and then do vector search over them.

* Weaviate Quickstart
* [Weaviate Cloud Console](https://console.weaviate.cloud/)
* Create a free Weaviate Cluster and use the Weaviate Cloud Console to create a `chunk` Collection
![create a collection](collection_creation.png)

**Note:** This is a naive collection object with a single attribute. As you get comfortable with collections in Weaviate, you can create collections with multiple attributes and create vectors for individual attributes, called [named vectors](https://weaviate.io/developers/weaviate/config-refs/schema/multi-vector), to enhance searchability.

For today, we'll keep things simple and create a collection with a single `content` attribute, which will hold our chunks.
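
If you'd rather create the collection in code instead of the Cloud Console, here's a rough sketch using the `weaviate-client` v4 library and the same environment variables used in Challenge 3 below (the console route above works just as well). Configuring an OpenAI vectorizer is what lets Weaviate generate vectors automatically when chunks are inserted, as described earlier:

```python
import os
import weaviate
from weaviate.classes.init import Auth
from weaviate.classes.config import Configure, Property, DataType

client = weaviate.connect_to_weaviate_cloud(
    cluster_url=os.environ["WCD_URL"],
    auth_credentials=Auth.api_key(os.environ["WCD_API_KEY"]),
    headers={"X-OpenAI-Api-Key": os.environ["OPENAI_APIKEY"]},
)

# A minimal Chunk collection: one text property, vectorized with OpenAI embeddings
client.collections.create(
    name="Chunk",
    vectorizer_config=Configure.Vectorizer.text2vec_openai(),
    properties=[Property(name="content", data_type=DataType.TEXT)],
)

client.close()  # Free up resources
```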

## Challenge 2: Let's chunk!

```python
text = '''🛠️ What to Expect:
Inspiring Keynotes: Hear from some of the most influential voices in the tech world, sharing insights on the latest in AI, machine learning, mobile development, cloud computing, and more.
Hands-On Workshops: Roll up your sleeves and dive into hands-on sessions designed to enhance your skills with Google technologies, such as Flutter, TensorFlow, Google Cloud, and Android development.
Tech Talks: Get an inside look at the latest trends, innovations, and best practices from industry experts and Google Developer Experts (GDEs).
Networking Opportunities: Connect with like-minded developers, industry leaders, and hiring managers. Whether you're looking for collaboration opportunities, career advice, or just want to share your passion for technology, this is the place to be!'''
# Create a list that will hold your chunks
chunks = []

chunk_size = 35 # Characters

# Step through the text in increments of chunk_size and slice out each chunk
for i in range(0, len(text), chunk_size):
    chunk = text[i:i + chunk_size]
    chunks.append(chunk)

for chunk in chunks:
    print(chunk)
```

Congrats! You've just chunked the "What to Expect" text from the GDG DevFest intro. This is just the beginning. You can always stay in control of your chunking strategy this way; however, just like elsewhere in software, there are abstractions so you don't have to write all of this yourself.

Notice that the printed chunks are split at an exact character count and have no overlap? This is a fairly naive chunking strategy and shouldn't be used in production. It's good for quick testing, but usually not very good for recall. Remember, the purpose of chunking is to improve our ability to retrieve chunks that are highly relevant to our queries.

When you add some overlap, you might capture semantic meaning from the prior or following chunk within the chunk itself, but character-based text splitting doesn't add much sophistication or improvement to your retrievals.
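
As a quick illustration, here's a sketch that builds on the character splitter above (the overlap value is chosen arbitrarily); adding overlap just means stepping through the text by less than the chunk size:

```python
# Reuses the `text` string from the character splitting example above
chunk_size = 35  # Characters per chunk
overlap = 10     # Characters shared with the previous chunk

overlapping_chunks = []
step = chunk_size - overlap  # Advance less than a full chunk so neighbors overlap
for i in range(0, len(text), step):
    overlapping_chunks.append(text[i:i + chunk_size])

for chunk in overlapping_chunks:
    print(chunk)
```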

### Recursive Chunking with LangChain

After character-level text splitting, recursive chunking is a great next strategy to evaluate for retrieval. It's a good place to start chunking for real, and potentially in production.

The idea behind recursive text splitting is that we have a list of (definable) separators. For simplicity, we'll pick the following, which is also outlined in Greg Kamradt's 5 Levels of Chunking Jupyter notebook:

- "\n\n" - Double new line, or most commonly paragraph breaks
- "\n" - New lines
- " " - Spaces
- "" - Characters

This approach first splits the text on the first separator, "\n\n". It then checks each resulting chunk against the defined chunk size; any chunk that is still too large gets split again on the next separator, "\n", and so on down the list, until every chunk is an appropriate size.
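
To make the idea concrete before reaching for a library, here's a simplified sketch of that recursion in plain Python. Treat it as illustration only; LangChain's real splitter also merges small pieces back up toward the chunk size and supports overlap:

```python
def recursive_split(text, chunk_size, separators=("\n\n", "\n", " ", "")):
    # Base case: the text already fits, or there are no separators left to try
    if len(text) <= chunk_size or not separators:
        return [text]

    sep, *rest = separators
    pieces = text.split(sep) if sep else list(text)

    chunks = []
    for piece in pieces:
        if len(piece) <= chunk_size:
            chunks.append(piece)
        else:
            # Still too large: recurse with the next, finer-grained separator
            chunks.extend(recursive_split(piece, chunk_size, rest))
    return chunks

sample = "First paragraph.\n\nA much longer second paragraph that will need further splitting."
for chunk in recursive_split(sample, chunk_size=30):
    print(chunk)
```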

Let's execute recursive chunking with LangChain. First we need to install some dependencies, ideally in a Python virtual environment.

```bash
python3 -m venv venv
source venv/bin/activate
pip install langchain
```

```python

from langchain.text_splitter import RecursiveCharacterTextSplitter

text = """Google has its origins in "BackRub", a research project that was begun in 1996 by Larry Page and Sergey Brin when they were both PhD students at Stanford University in Stanford, California.[2] The project initially involved an unofficial "third founder", Scott Hassan, the lead programmer who wrote much of the code for the original Google Search engine, but he left before Google was officially founded as a company;[3][4] Hassan went on to pursue a career in robotics and founded the company Willow Garage in 2006.[5][6] Craig Nevill-Manning was also invited to join Google at its formation but declined and then joined a little later on.[7]
In the search of a dissertation theme, Larry Page had been considering among other things exploring the mathematical properties of the World Wide Web, understanding its link structure as a huge graph.[8] His supervisor, Terry Winograd, encouraged him to pick this idea (which Larry Page later recalled as "the best advice I ever got"[9]) and Larry Page focused on the problem of finding out which web pages link to a given page, based on the consideration that the number and nature of such backlinks was valuable information about that page (with the role of citations in academic publishing in mind).[8] Page told his ideas to Hassan, who began writing the code to implement Page's ideas.[3]
The research project was nicknamed "BackRub", and it was soon joined by Brin, who was supported by a National Science Foundation Graduate Fellowship.[10] The two had first met in the summer of 1995, when Page was part of a group of potential new students that Brin had volunteered to give a tour around the campus and nearby San Francisco.[8] Both Brin and Page were working on the Stanford Digital Library Project (SDLP). The SDLP's goal was "to develop the enabling technologies for a single, integrated and universal digital library" and it was funded through the National Science Foundation, among other federal agencies.[10][11][12][13] Brin and Page were also part of a computer science research team at Stanford University that received funding from Massive Digital Data Systems (MDDS), a program managed for the Central Intelligence Agency (CIA) and the National Security Agency (NSA) by large intelligence and military contractors.[14]
Page's web crawler began exploring the web in March 1996, with Page's own Stanford home page serving as the only starting point.[8] To convert the backlink data that is gathered for a given web page into a measure of importance, Brin and Page developed the PageRank algorithm.[8] While analyzing BackRub's output which, for a given URL, consisted of a list of backlinks ranked by importance, the pair realized that a search engine based on PageRank would produce better results than existing techniques (existing search engines at the time essentially ranked results according to how many times the search term appeared on a page).[8][15]
Convinced that the pages with the most links to them from other highly relevant Web pages must be the most relevant pages associated with the search, Page and Brin tested their thesis as part of their studies and laid the foundation for their search engine.[16] The first version of Google was released in August 1996 on the Stanford website. It used nearly half of Stanford's entire network bandwidth.[17]
"""


# Larger chunk sizes keep more context in each chunk; here we use 450 characters with no overlap
text_splitter = RecursiveCharacterTextSplitter(chunk_size=450, chunk_overlap=0)


chunked_text = text_splitter.create_documents([text])

for chunk in chunked_text:
    print(chunk)

```

---
We'll be using these chunks in Weaviate for RAG; however, know that there are many other forms of chunking beyond recursive chunking that better fit other formats of data. For example, document-level splitting takes the structure of the document into account, such as tables in a PDF. There are tools like Unstructured, which can extract data from a PDF, and embedding models like ColPali, which leverage a vision model to evaluate the shape and layout of the document to help generate embeddings.

More recently, the industry has been exploring semantic chunking, which evaluates the embedding distance between pieces of text to determine similarity and ensures that pieces that are similar to one another are chunked together. Looking further ahead, agentic chunking has become an interesting topic as well.
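
As a rough illustration of the semantic chunking idea (reusing the `get_embedding()` function from Challenge 0; the example document, sentence splitting, and similarity threshold are all simplified for demonstration), consecutive sentences stay in the same chunk as long as their embeddings remain similar:

```python
import math

def cosine(a, b):
    dot = sum(x * y for x, y in zip(a, b))
    return dot / (math.sqrt(sum(x * x for x in a)) * math.sqrt(sum(x * x for x in b)))

document = "Weaviate stores vectors. It supports hybrid search. My cat likes naps. She sleeps all day."
sentences = [s.strip() for s in document.split(".") if s.strip()]  # naive sentence split
embeddings = [get_embedding(s) for s in sentences]

semantic_chunks = [[sentences[0]]]
for prev_emb, curr_emb, sentence in zip(embeddings, embeddings[1:], sentences[1:]):
    if cosine(prev_emb, curr_emb) > 0.5:      # threshold picked arbitrarily for illustration
        semantic_chunks[-1].append(sentence)  # similar to the previous sentence: same chunk
    else:
        semantic_chunks.append([sentence])    # similarity dropped: start a new chunk

for chunk in semantic_chunks:
    print(". ".join(chunk) + ".")
```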


---
## Challenge 3: Store the chunks in Weaviate and apply RAG

Great, let's store the chunks in Weaviate and execute a simple RAG, or generative search, query.

Let's store each of the chunks created above in our `Chunk` collection in Weaviate.

Follow the Weaviate Quickstart to install the Weaviate Python client library, `weaviate-client`. Then grab the cluster URL and the cluster API key of the cluster you created in Challenge 1.


### Add your chunks

```python

import weaviate
from weaviate.classes.init import Auth
import os

# Best practice: store your credentials in environment variables
wcd_url = os.environ["WCD_URL"]
wcd_api_key = os.environ["WCD_API_KEY"]
openai_api_key = os.environ["OPENAI_APIKEY"]

client = weaviate.connect_to_weaviate_cloud(
    cluster_url=wcd_url,                           # Replace with your Weaviate Cloud URL
    auth_credentials=Auth.api_key(wcd_api_key),    # Replace with your Weaviate Cloud key
    headers={"X-OpenAI-Api-Key": openai_api_key},  # Replace with your OpenAI API key
)



chunk_collection = client.collections.get("Chunk")

with chunk_collection.batch.dynamic() as batch:
    for chunk in chunks:  # Note: these chunks come from the recursive chunking step in Challenge 2
        batch.add_object({
            "content": chunk,
        })

client.close() # Free up resources
```


Congrats, you've loaded the vector database and generated vectors automatically for each of the chunks. Let's do a generative search, or RAG query.


```python


import weaviate
from weaviate.classes.init import Auth
import os

# Best practice: store your credentials in environment variables
wcd_url = os.environ["WCD_URL"]
wcd_api_key = os.environ["WCD_API_KEY"]
openai_api_key = os.environ["OPENAI_APIKEY"]

client = weaviate.connect_to_weaviate_cloud(
    cluster_url=wcd_url,                           # Replace with your Weaviate Cloud URL
    auth_credentials=Auth.api_key(wcd_api_key),    # Replace with your Weaviate Cloud key
    headers={"X-OpenAI-Api-Key": openai_api_key},  # Replace with your OpenAI API key
)

chunk_collection = client.collections.get("Chunk")

response = chunk_collection.generate.near_text(
    query="google history",
    limit=3,
    # grouped_task applies one prompt across all retrieved chunks, so response.generated is populated
    grouped_task="Summarize google's history based on the context.",
)

print(response.generated) # Inspect the generated text

client.close() # Free up resources

```
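
If you want to inspect what the retrieval step returns before any generation happens, a plain vector search over the same collection looks like this. It's a small sketch that assumes the `client` and `chunk_collection` objects from the snippet above and should run before `client.close()`:

```python
response = chunk_collection.query.near_text(
    query="google history",
    limit=3,
)

for obj in response.objects:
    print(obj.properties["content"])  # The raw chunk text retrieved for the query
```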

If you need help with any part of the session, refer to the [solution file](../solutions/session_5_solution.ipynb) in the `solutions` folder.

---
## Congrats!

You've just executed a RAG query over your recursively chunked text. To go further, try loading any document into your Weaviate cluster using the recursive chunking strategy and see how good the summaries are.

At the end of the day, we need to ensure our chunks are optimized for the best possible retrievals. This is why evaluations are important when we chunk our documents. Evaluation isn't part of today's workshop, but if you'd like to learn more, connect with me on LinkedIn: https://linkedin.com/in/itsajchan.

Thanks for building with me today!
File renamed without changes
