Semantics

Requirements

The semantics functionality requires scikit-learn and NumPy. Install them with:

pip install scikit-learn numpy

Clustering

UralicNLP can cluster documents into semantically meaningful categories using LLM embeddings.

from uralicNLP.llm import get_llm
from uralicNLP import semantics

# Load the LLM whose embeddings are used for clustering
llm = get_llm("roneneldan/TinyStories-33M")
# The documents to cluster
texts = ["dogs are funny", "cats play around", "cars go fast", "planes fly around", "parrots like to eat", "eagles soar in the skies", "moon is big", "saturn is a planet"]
# Returns a list of clusters, each of which is a list of texts
semantics.cluster(texts, llm)
>>[['dogs are funny', 'parrots like to eat', 'moon is big'], ['cats play around', 'cars go fast', 'planes fly around', 'eagles soar in the skies'], ['saturn is a planet']]

This method groups the texts into clusters of semantically similar items. You can use any LLM you want (see the LLM documentation for details).
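To illustrate the general idea of grouping items by the similarity of their embedding vectors, here is a minimal, self-contained sketch. It is not UralicNLP's implementation: the toy vectors are hand-made stand-ins for LLM embeddings, and the greedy threshold grouping on cosine similarity (with the hypothetical helper names `cosine` and `greedy_cluster` and an arbitrary threshold of 0.8) is just one simple technique chosen for illustration.

```python
import math


def cosine(a, b):
    """Cosine similarity between two equal-length vectors."""
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    return dot / (norm_a * norm_b)


def greedy_cluster(vectors, threshold=0.8):
    """Assign each vector to the first cluster whose first member is
    at least `threshold` similar; otherwise start a new cluster.
    Returns a list of clusters, each a list of indices into `vectors`."""
    clusters = []
    for i, vec in enumerate(vectors):
        for cluster in clusters:
            if cosine(vec, vectors[cluster[0]]) >= threshold:
                cluster.append(i)
                break
        else:
            # No existing cluster was similar enough
            clusters.append([i])
    return clusters


# Hand-made toy vectors standing in for embeddings (assumption)
toy_vectors = [
    [1.0, 0.1, 0.0],
    [0.9, 0.2, 0.0],
    [0.0, 1.0, 0.1],
    [0.1, 0.9, 0.0],
    [0.0, 0.0, 1.0],
]

groups = greedy_cluster(toy_vectors)
print(groups)
```

With real embeddings, the vectors would come from the LLM rather than being written by hand, and a library clustering routine would normally replace the greedy loop.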

If you need to get the indices instead of the actual texts, you can pass return_ids=True.

semantics.cluster(texts, llm, return_ids=True)
>>[[0, 4, 6], [1, 2, 3, 5], [7]]

These indices are relative to the texts list that is passed to the method.
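As a small illustration of working with the indices, the sketch below maps them back to the texts and builds a lookup from each text's index to its cluster number. To stay self-contained, the index lists are copied from the example output above rather than produced by calling semantics.cluster; the variable names `cluster_ids`, `clusters` and `cluster_of` are chosen here for illustration only.

```python
texts = ["dogs are funny", "cats play around", "cars go fast",
         "planes fly around", "parrots like to eat",
         "eagles soar in the skies", "moon is big", "saturn is a planet"]

# Copied from the return_ids=True example output above
cluster_ids = [[0, 4, 6], [1, 2, 3, 5], [7]]

# Recover the texts of each cluster from the indices
clusters = [[texts[i] for i in ids] for ids in cluster_ids]

# Map each text index to the number of the cluster it belongs to
cluster_of = {i: c for c, ids in enumerate(cluster_ids) for i in ids}

for c, members in enumerate(clusters):
    print(c, members)
```

The same indices can be used to look up any other data that is stored in a list parallel to texts, such as document titles or file names.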

UralicNLP is an open-source Python library by Mika Hämäläinen
