Add LiteLLM as a representation model #2213

Merged
merged 5 commits into from
Dec 10, 2024
Changes from 3 commits
2 changes: 1 addition & 1 deletion bertopic/cluster/_base.py
Original file line number Diff line number Diff line change
Expand Up @@ -14,7 +14,7 @@ class BaseCluster:

```python
from bertopic import BERTopic
from bertopic.dimensionality import BaseCluster
from bertopic.cluster import BaseCluster

empty_cluster_model = BaseCluster()

Expand Down
8 changes: 8 additions & 0 deletions bertopic/representation/__init__.py
Original file line number Diff line number Diff line change
Expand Up @@ -33,6 +33,13 @@
    msg = "`pip install openai` \n\n"
    OpenAI = NotInstalled("OpenAI", "openai", custom_msg=msg)

# LiteLLM Generator
try:
    from bertopic.representation._litellm import LiteLLM
except ModuleNotFoundError:
    msg = "`pip install litellm` \n\n"
    LiteLLM = NotInstalled("LiteLLM", "litellm", custom_msg=msg)

# LangChain Generator
try:
    from bertopic.representation._langchain import LangChain
Expand Down Expand Up @@ -63,6 +70,7 @@
"Cohere",
"OpenAI",
"LangChain",
"LiteLLM",
"LlamaCPP",
"VisualRepresentation",
]
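
The `try`/`except ModuleNotFoundError` block added above follows the same optional-dependency pattern as the other generators: if `litellm` is not installed, the name `LiteLLM` is bound to a placeholder that carries an install hint. A minimal, self-contained sketch of that idea follows. `MissingDependency` and `load_optional` are hypothetical stand-ins written for illustration; they are not BERTopic's actual `NotInstalled` implementation, which is not shown in this diff.

```python
import importlib


class MissingDependency:
    """Hypothetical placeholder that raises a helpful error when used."""

    def __init__(self, tool: str, dep: str, custom_msg: str = ""):
        self.tool = tool
        self.dep = dep
        self.custom_msg = custom_msg

    def __call__(self, *args, **kwargs):
        # Fail only when someone actually tries to instantiate the tool
        raise ModuleNotFoundError(
            f"In order to use {self.tool} you need to install: {self.custom_msg or self.dep}"
        )


def load_optional(module_name: str, attr: str, tool: str, install_hint: str):
    """Return `attr` from `module_name`, or a placeholder if the import fails."""
    try:
        module = importlib.import_module(module_name)
        return getattr(module, attr)
    except ModuleNotFoundError:
        return MissingDependency(tool, module_name, custom_msg=install_hint)


# A module that exists resolves to the real object
OrderedDict = load_optional("collections", "OrderedDict", "OrderedDict", "")

# A module that does not exist resolves to the placeholder
Missing = load_optional("a_module_that_does_not_exist", "Thing", "Thing", "`pip install thing`")
```

With this shape, importing the package never fails because of a missing extra; the error surfaces only when the unavailable class is called.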
176 changes: 176 additions & 0 deletions bertopic/representation/_litellm.py
Original file line number Diff line number Diff line change
@@ -0,0 +1,176 @@
import time
from litellm import completion
import pandas as pd
from scipy.sparse import csr_matrix
from typing import Mapping, List, Tuple, Any
from bertopic.representation._base import BaseRepresentation
from bertopic.representation._utils import retry_with_exponential_backoff


DEFAULT_PROMPT = """
I have a topic that contains the following documents:
[DOCUMENTS]
The topic is described by the following keywords: [KEYWORDS]
Based on the information above, extract a short topic label in the following format:
topic: <topic label>
"""


class LiteLLM(BaseRepresentation):
    """Using the LiteLLM API to generate topic labels.

    For an overview of models see:
    https://docs.litellm.ai/docs/providers

    Arguments:
        model: Model to use. Defaults to OpenAI's "gpt-3.5-turbo".
        generator_kwargs: Kwargs passed to `litellm.completion`.
        prompt: The prompt to be used in the model. If no prompt is given,
                `self.default_prompt_` is used instead.
                NOTE: Use `"[KEYWORDS]"` and `"[DOCUMENTS]"` in the prompt
                to decide where the keywords and documents need to be
                inserted.
        delay_in_seconds: The delay in seconds between consecutive prompts
                          in order to prevent RateLimitErrors.
        exponential_backoff: Retry requests with a random exponential backoff.
                             A short sleep is used when a rate limit error is hit,
                             then the request is retried. The sleep length is
                             increased each time an error is hit, up to 10
                             unsuccessful requests.
                             If True, overrides `delay_in_seconds`.
        nr_docs: The number of documents to pass to LiteLLM if a prompt
                 with the `"[DOCUMENTS]"` tag is used.
        diversity: The diversity of documents to pass to LiteLLM.
                   Accepts values between 0 and 1. A higher
                   value results in passing more diverse documents,
                   whereas a lower value passes more similar documents.

    Usage:

    To use this, you will need to install the litellm package first:

    `pip install litellm`

    Then, get yourself an API key of any provider (for instance OpenAI) and use it as follows:

    ```python
    import os
    from bertopic.representation import LiteLLM
    from bertopic import BERTopic

    # set ENV variables
    os.environ["OPENAI_API_KEY"] = "your-openai-key"

    # Create your representation model
    representation_model = LiteLLM(model="gpt-3.5-turbo")

    # Use the representation model in BERTopic on top of the default pipeline
    topic_model = BERTopic(representation_model=representation_model)
    ```

    You can also use a custom prompt:

    ```python
    prompt = "I have the following documents: [DOCUMENTS] \nThese documents are about the following topic: '"
    representation_model = LiteLLM(model="gpt-3.5-turbo", prompt=prompt)
    ```
    """  # noqa: D301

    def __init__(
        self,
        model: str = "gpt-3.5-turbo",
        prompt: str = None,
        generator_kwargs: Mapping[str, Any] = None,
        delay_in_seconds: float = None,
        exponential_backoff: bool = False,
        nr_docs: int = 4,
        diversity: float = None,
    ):
        self.model = model
        self.prompt = prompt if prompt else DEFAULT_PROMPT
        self.default_prompt_ = DEFAULT_PROMPT
        self.delay_in_seconds = delay_in_seconds
        self.exponential_backoff = exponential_backoff
        self.nr_docs = nr_docs
        self.diversity = diversity

        # Copy the kwargs so that neither a shared default nor the caller's
        # dictionary is mutated when the "prompt" key is removed below
        self.generator_kwargs = dict(generator_kwargs) if generator_kwargs else {}
        if self.generator_kwargs.get("model"):
            self.model = self.generator_kwargs.get("model")
        if self.generator_kwargs.get("prompt"):
            del self.generator_kwargs["prompt"]

    def extract_topics(
        self, topic_model, documents: pd.DataFrame, c_tf_idf: csr_matrix, topics: Mapping[str, List[Tuple[str, float]]]
    ) -> Mapping[str, List[Tuple[str, float]]]:
        """Extract topics.

        Arguments:
            topic_model: A BERTopic model
            documents: All input documents
            c_tf_idf: The topic c-TF-IDF representation
            topics: The candidate topics as calculated with c-TF-IDF

        Returns:
            updated_topics: Updated topic representations
        """
        # Extract the top n representative documents per topic
        repr_docs_mappings, _, _, _ = topic_model._extract_representative_docs(
            c_tf_idf, documents, topics, 500, self.nr_docs, self.diversity
        )

        # Generate using a (Large) Language Model
        updated_topics = {}
        for topic, docs in repr_docs_mappings.items():
            prompt = self._create_prompt(docs, topic, topics)

            # Delay
            if self.delay_in_seconds:
                time.sleep(self.delay_in_seconds)

            messages = [
                {"role": "system", "content": "You are a helpful assistant."},
                {"role": "user", "content": prompt},
            ]
            kwargs = {"model": self.model, "messages": messages, **self.generator_kwargs}
            if self.exponential_backoff:
                response = chat_completions_with_backoff(**kwargs)
            else:
                response = completion(**kwargs)
            label = response["choices"][0]["message"]["content"].strip().replace("topic: ", "")

            updated_topics[topic] = [(label, 1)]

        return updated_topics

    def _create_prompt(self, docs, topic, topics):
        keywords = list(zip(*topics[topic]))[0]

        # Use the Default Chat Prompt
        if self.prompt == DEFAULT_PROMPT:
            prompt = self.prompt.replace("[KEYWORDS]", " ".join(keywords))
            prompt = self._replace_documents(prompt, docs)

        # Use a custom prompt that leverages keywords, documents or both using
        # custom tags, namely [KEYWORDS] and [DOCUMENTS] respectively
        else:
            prompt = self.prompt
            if "[KEYWORDS]" in prompt:
                prompt = prompt.replace("[KEYWORDS]", " ".join(keywords))
            if "[DOCUMENTS]" in prompt:
                prompt = self._replace_documents(prompt, docs)

        return prompt

    @staticmethod
    def _replace_documents(prompt, docs):
        to_replace = ""
        for doc in docs:
            to_replace += f"- {doc[:255]}\n"
        prompt = prompt.replace("[DOCUMENTS]", to_replace)
        return prompt


def chat_completions_with_backoff(**kwargs):
    return retry_with_exponential_backoff(
        completion,
    )(**kwargs)
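
`chat_completions_with_backoff` wraps `completion` in `retry_with_exponential_backoff`, whose implementation lives in `bertopic.representation._utils` and is not part of this diff. The sketch below shows one common way such a wrapper can be written. The function name `retry_with_backoff`, its parameters (`initial_delay`, `factor`, `jitter`, `max_retries`, `errors`) and the injectable `sleep` argument are assumptions made for illustration, not BERTopic's actual signature.

```python
import random
import time
from typing import Callable, Tuple, Type


def retry_with_backoff(
    func: Callable,
    initial_delay: float = 1.0,
    factor: float = 2.0,
    jitter: bool = True,
    max_retries: int = 10,
    errors: Tuple[Type[BaseException], ...] = (Exception,),
    sleep: Callable[[float], None] = time.sleep,
) -> Callable:
    """Retry `func`, multiplying the wait after every failed attempt."""

    def wrapper(*args, **kwargs):
        delay = initial_delay
        for attempt in range(max_retries + 1):
            try:
                return func(*args, **kwargs)
            except errors:
                if attempt == max_retries:
                    raise  # give up after the final retry
                # Grow the delay; jitter spreads out concurrent clients
                delay *= factor * (1 + (random.random() if jitter else 0))
                sleep(delay)

    return wrapper


# Demo with a fake endpoint that fails twice before succeeding
calls = {"n": 0}
waits = []


def flaky_completion(**kwargs):
    calls["n"] += 1
    if calls["n"] < 3:
        raise RuntimeError("rate limited")
    return {"choices": [{"message": {"content": "topic: cats"}}]}


safe_completion = retry_with_backoff(flaky_completion, jitter=False, sleep=waits.append)
response = safe_completion(model="any-model", messages=[])
```

In the demo, `sleep` is replaced by `waits.append` so the delays are recorded instead of actually waited for; in real use the default `time.sleep` would apply, and `errors` would typically be narrowed to the provider's rate-limit exception.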
7 changes: 6 additions & 1 deletion docs/algorithm/algorithm.md
Original file line number Diff line number Diff line change
Expand Up @@ -164,7 +164,12 @@ The following models are implemented in `bertopic.representation`:
* `PartOfSpeech`
* `KeyBERTInspired`
* `ZeroShotClassification`
* `TextGeneration`
* `TextGeneration` (HuggingFace)
* `Cohere`
* `OpenAI`
* `LangChain`
* `LiteLLM`
* `LlamaCPP`

!!! tip Models
There are roughly two sets of models. **First** are the non-generative set of models that you can find [here](https://maartengr.github.io/BERTopic/getting_started/representation/representation.html). These include models that focus on enhancing the keywords in the topic representations. **Second** are the generative models that attempt to label or summarize the topics instead. You can find an overview of [implemented LLMs here](https://maartengr.github.io/BERTopic/getting_started/representation/llm).
3 changes: 3 additions & 0 deletions docs/api/backends.md
Original file line number Diff line number Diff line change
@@ -0,0 +1,3 @@
# `Backends`

::: bertopic.backend
3 changes: 0 additions & 3 deletions docs/api/backends/base.md

This file was deleted.

3 changes: 0 additions & 3 deletions docs/api/backends/cohere.md

This file was deleted.

3 changes: 0 additions & 3 deletions docs/api/backends/openai.md

This file was deleted.

3 changes: 0 additions & 3 deletions docs/api/backends/word_doc.md

This file was deleted.

File renamed without changes.
3 changes: 3 additions & 0 deletions docs/api/cluster.md
Original file line number Diff line number Diff line change
@@ -0,0 +1,3 @@
# `BaseCluster`

::: bertopic.cluster._base.BaseCluster
File renamed without changes.
3 changes: 0 additions & 3 deletions docs/api/onlinecv.md

This file was deleted.

3 changes: 3 additions & 0 deletions docs/api/plotting.md
Original file line number Diff line number Diff line change
@@ -0,0 +1,3 @@
# `Plotting`

::: bertopic.plotting
3 changes: 0 additions & 3 deletions docs/api/representation/base.md

This file was deleted.

3 changes: 0 additions & 3 deletions docs/api/representation/cohere.md

This file was deleted.

3 changes: 0 additions & 3 deletions docs/api/representation/generation.md

This file was deleted.

3 changes: 0 additions & 3 deletions docs/api/representation/keybert.md

This file was deleted.

3 changes: 0 additions & 3 deletions docs/api/representation/langchain.md

This file was deleted.

3 changes: 0 additions & 3 deletions docs/api/representation/mmr.md

This file was deleted.

3 changes: 0 additions & 3 deletions docs/api/representation/openai.md

This file was deleted.

3 changes: 0 additions & 3 deletions docs/api/representation/pos.md

This file was deleted.

3 changes: 0 additions & 3 deletions docs/api/representation/zeroshot.md

This file was deleted.

3 changes: 3 additions & 0 deletions docs/api/representations.md
Original file line number Diff line number Diff line change
@@ -0,0 +1,3 @@
# `Representations`

::: bertopic.representation
3 changes: 3 additions & 0 deletions docs/api/vectorizers.md
Original file line number Diff line number Diff line change
@@ -0,0 +1,3 @@
# `Vectorizers`

::: bertopic.vectorizers._online_cv.OnlineCountVectorizer
29 changes: 29 additions & 0 deletions docs/getting_started/representation/llm.md
Original file line number Diff line number Diff line change
Expand Up @@ -377,6 +377,7 @@ topic_model = BERTopic(representation_model=representation_model, verbose=True)
"""
```


## **OpenAI**

Instead of using a language model from 🤗 transformers, we can use external APIs instead that
Expand Down Expand Up @@ -469,6 +470,34 @@ The above is not constrained to just creating a short description or summary of
If you want to have multiple representations of a single topic, it might be worthwhile to also check out [**multi-aspect**](https://maartengr.github.io/BERTopic/getting_started/multiaspect/multiaspect.html) topic modeling with BERTopic.


## **LiteLLM**

[LiteLLM](https://docs.litellm.ai) is an amazing framework that simplifies connecting to external LLMs. It allows you to connect to OpenAI, Cohere, Anthropic, and other providers, all within one package. This makes iterating on and testing out different models a breeze!

To start with, we first need to install `litellm`:

```bash
pip install litellm
```

After installation, usage is straightforward and you can select any model found in their [docs](https://docs.litellm.ai/docs/providers).
Let's show an example with OpenAI:

```python
import os
from bertopic import BERTopic
from bertopic.representation import LiteLLM

# set ENV variables
os.environ["OPENAI_API_KEY"] = "MY_KEY"

# Create your representation model
representation_model = LiteLLM(model="gpt-4o-mini")

# Create our BERTopic model
topic_model = BERTopic(representation_model=representation_model, verbose=True)
```
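
`LiteLLM` also accepts a custom `prompt` in which the `[KEYWORDS]` and `[DOCUMENTS]` tags mark where the topic keywords and representative documents are inserted. The standalone sketch below mimics that substitution, as well as the removal of the `topic: ` prefix from the reply, based on the `_litellm.py` code in this PR. `fill_prompt` and `parse_label` are hypothetical helper names, and the response is a hand-written dictionary rather than a real API reply.

```python
from typing import List, Sequence


def fill_prompt(prompt: str, keywords: Sequence[str], docs: List[str], max_chars: int = 255) -> str:
    """Substitute the [KEYWORDS] and [DOCUMENTS] tags in a prompt template."""
    if "[KEYWORDS]" in prompt:
        prompt = prompt.replace("[KEYWORDS]", " ".join(keywords))
    if "[DOCUMENTS]" in prompt:
        # One bullet per document, truncated to keep the prompt short
        bullets = "".join(f"- {doc[:max_chars]}\n" for doc in docs)
        prompt = prompt.replace("[DOCUMENTS]", bullets)
    return prompt


def parse_label(response: dict) -> str:
    """Pull the label out of a chat-completion style response."""
    content = response["choices"][0]["message"]["content"]
    return content.strip().replace("topic: ", "")


template = "Documents:\n[DOCUMENTS]Keywords: [KEYWORDS]\nReturn: topic: <topic label>"
prompt = fill_prompt(template, ["cat", "dog"], ["Cats purr.", "Dogs bark."])

# Hand-written stand-in for what an LLM might return
fake_response = {"choices": [{"message": {"content": " topic: Household pets\n"}}]}
label = parse_label(fake_response)
```

Because each document is cut to 255 characters before insertion, long documents contribute only their opening text to the prompt, which is worth keeping in mind when choosing `nr_docs`.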

## **LangChain**

[Langchain](https://github.com/hwchase17/langchain) is a package that helps users with chaining large language models.
Expand Down
29 changes: 6 additions & 23 deletions mkdocs.yml
Original file line number Diff line number Diff line change
Expand Up @@ -57,29 +57,12 @@ nav:
- API:
- BERTopic: api/bertopic.md
- Sub-models:
- Backends:
- Base: api/backends/base.md
- Word Doc: api/backends/word_doc.md
- OpenAI: api/backends/openai.md
- Cohere: api/backends/cohere.md
- Dimensionality Reduction:
- Base: api/dimensionality/base.md
- Clustering:
- Base: api/cluster/base.md
- Vectorizers:
- cTFIDF: api/ctfidf.md
- OnlineCountVectorizer: api/onlinecv.md
- Topic Representation:
- Base: api/representation/base.md
- MaximalMarginalRelevance: api/representation/mmr.md
- KeyBERT: api/representation/keybert.md
- PartOfSpeech: api/representation/pos.md
- Text Generation:
- 🤗 Transformers: api/representation/generation.md
- LangChain: api/representation/langchain.md
- Cohere: api/representation/cohere.md
- OpenAI: api/representation/openai.md
- Zero-shot Classification: api/representation/zeroshot.md
- 1. Backends: api/backends.md
- 2. Dimensionality Reduction: api/dimensionality.md
- 3. Clustering: api/cluster.md
- 4. Vectorizers: api/vectorizers.md
- 5. c-TF-IDF: api/ctfidf.md
- 6. Fine-Tune Topic Representation: api/representations.md
- Plotting:
- Barchart: api/plotting/barchart.md
- Documents: api/plotting/documents.md
Expand Down