Semantic text clustering using sentence embeddings and agglomerative clustering.

cobanov/semaclust

semaclust

semaclust (semantic + clustering) is a lightweight Python package for semantic text clustering using sentence embeddings and agglomerative clustering.

Features

  • SentenceTransformer-based text encoding
  • Agglomerative clustering with configurable thresholds
  • Easily map or replace similar text values
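
The clustering step listed above can be illustrated with a minimal, dependency-free sketch. It is not the package's implementation: the helper names `cosine_distance` and `agglomerative_clusters` are hypothetical, average linkage is an assumed choice, and the hand-written 2-D vectors stand in for the embeddings a SentenceTransformer model would normally produce.

```python
import math
from itertools import combinations


def cosine_distance(u, v):
    """Return 1 - cosine similarity of two non-zero vectors."""
    dot = sum(a * b for a, b in zip(u, v))
    norm_u = math.sqrt(sum(a * a for a in u))
    norm_v = math.sqrt(sum(b * b for b in v))
    return 1.0 - dot / (norm_u * norm_v)


def agglomerative_clusters(vectors, distance_threshold):
    """Average-linkage agglomerative clustering (illustrative sketch).

    Starts with one cluster per vector and repeatedly merges the two
    closest clusters. Merging stops once the closest pair is at or
    above distance_threshold. Returns lists of indices into vectors.
    """
    clusters = [[i] for i in range(len(vectors))]

    def linkage(a, b):
        # Average pairwise cosine distance between two clusters.
        total = sum(cosine_distance(vectors[i], vectors[j]) for i in a for j in b)
        return total / (len(a) * len(b))

    while len(clusters) > 1:
        # Find the pair of clusters with the smallest linkage distance.
        x, y = min(
            combinations(range(len(clusters)), 2),
            key=lambda pair: linkage(clusters[pair[0]], clusters[pair[1]]),
        )
        if linkage(clusters[x], clusters[y]) >= distance_threshold:
            break
        # x < y, so deleting index y leaves index x valid.
        clusters[x] = clusters[x] + clusters[y]
        del clusters[y]
    return clusters


# Toy stand-ins for sentence embeddings (assumed values, not model output).
texts = ["New York", "new york city", "Los Angeles", "LA"]
vectors = [(1.0, 0.1), (0.9, 0.2), (0.1, 1.0), (0.2, 0.9)]

groups = agglomerative_clusters(vectors, distance_threshold=0.3)
print([[texts[i] for i in group] for group in groups])
```

In this sketch a lower threshold yields more, tighter clusters, and a higher threshold merges more aggressively, which is the trade-off a configurable threshold exposes.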

Installation

pip install git+https://github.com/cobanov/semaclust.git

Usage

# Assumes TextClusterer is exported at the package top level
from semaclust import TextClusterer

# Create a clusterer with default settings
clusterer = TextClusterer()

texts = ["New York", "Los Angeles", "San Francisco", "new york city", "LA", "San Fran"]

# Group semantically similar texts into clusters
clusters = clusterer.cluster(texts)
print("Clusters:", clusters)
# Clusters: {1: ['New York', 'new york city'], 2: ['Los Angeles', 'LA'], 0: ['San Francisco', 'San Fran']}

# Map each text to the value that represents its cluster
replacement_map = clusterer.get_replacement_map(texts)
print("\nReplacement map:", replacement_map)
# Replacement map: {'New York': 'New York', 'new york city': 'New York', 'Los Angeles': 'Los Angeles', 'LA': 'Los Angeles', 'San Francisco': 'San Francisco', 'San Fran': 'San Francisco'}

# Replace each text with its cluster's representative value
replaced_texts = clusterer.replace_values(texts)
print("\nReplaced texts:", replaced_texts)
# Replaced texts: ['New York', 'Los Angeles', 'San Francisco', 'New York', 'Los Angeles', 'San Francisco']
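
To make the relationship between the three outputs above concrete, here is a small sketch of how a replacement map could be derived from a cluster dictionary. The helper `build_replacement_map` is hypothetical, and the rule of using the first member of each cluster as its representative is an assumption inferred from the example output, not a documented behavior of the package.

```python
def build_replacement_map(clusters):
    """Map every text to the first member of its cluster.

    Choosing the first member as the representative is an assumed
    rule for illustration only.
    """
    return {
        text: members[0]
        for members in clusters.values()
        for text in members
    }


# Cluster dictionary copied from the example output above.
clusters = {
    1: ["New York", "new york city"],
    2: ["Los Angeles", "LA"],
    0: ["San Francisco", "San Fran"],
}
texts = ["New York", "Los Angeles", "San Francisco", "new york city", "LA", "San Fran"]

replacement_map = build_replacement_map(clusters)
# Texts missing from the map are left unchanged.
replaced_texts = [replacement_map.get(text, text) for text in texts]
print(replaced_texts)
```

Falling back to the original value for unknown texts keeps the replacement step safe to apply to data that was not part of the clustering input.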
