(Stuff I've encountered so far)
- BAAI/bge-m3 - In my testing, this is currently the best sentence embedding model available for Georgian.
- Davit6174/georgian-distilbert-mlm · Hugging Face
- jnz/electra-ka · Hugging Face
- More on jnz's profile
- My models:
- XLM-Roberta-Base fine-tuned on WikiANN for NER
- floret (a fastText adaptation) word embedding model trained on part of the mC4 dataset
- XLM-Roberta-Base for masked language modeling, fine-tuned on a private dataset of Georgian news titles.
Some projects working on Georgian NLP tools:
- ანბანი ჻ Anbani Georgia · GitHub
- "Collection of Web and AI tools designed to equip Georgian Language and Alphabet with the challenges of digital age ჻ ᲐᲜᲑᲐᲜᲘ"
- nano - bare-bones Georgian script converter
- anbani.py - Georgian Python toolkit for NLP, Transliteration and more
- TextArt - Georgian Text Art Generator
- anbani.js - Multifunctional javascript toolkit for Georgian Alphabet - Anbani
- word-embeddings
- anbani.db - Various Georgian datasets
- https://qartnlp.iliauni.edu.ge/
- screeve · GitHub
- lemmatizer - Lemmatization functionality for Georgian language.
- postagger - Part-of-speech tagging functionality for Georgian language.
- embeddings - pre-trained embeddings for Georgian language
- WikiANN - NER dataset.
- mC4 is a cleaned version of Common Crawl and contains 15+ GB of Georgian text. I've found it fairly useful in my experiments.
- Georgian is a highly inflected language (each word has many forms, see the Sparklis link above), which has implications for text embedding. FastText is probably the best non-transformer embedding model for inflected languages because it builds word vectors from subword (character n-gram) embeddings, so different forms of the same word share most of their representation - see Comprehensive Evaluation of Word Embeddings for Highly Inflectional Language.
- I'm seeking input from other researchers and practitioners on best practices and useful resources for doing NLP in Georgian. Please contribute what you can, especially general wisdom.
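To make the point about subword embeddings concrete, here is a minimal sketch of fastText-style character n-gram extraction in plain Python. It is an illustration only, not fastText's actual implementation: the real library also hashes n-grams into buckets and keeps a separate vector for the whole word, both of which are omitted here. The function names `char_ngrams` and `jaccard` are my own, the n-gram range 3 to 6 is assumed to match fastText's usual defaults, and the example words are სახლი ("house") and სახლში ("in the house").

```python
def char_ngrams(word, min_n=3, max_n=6):
    """Return the set of character n-grams of a word, fastText-style.

    The word is wrapped in '<' and '>' so that prefixes and suffixes
    get n-grams distinct from word-internal ones.
    """
    wrapped = f"<{word}>"
    grams = set()
    for n in range(min_n, max_n + 1):
        for i in range(len(wrapped) - n + 1):
            grams.add(wrapped[i:i + n])
    return grams


def jaccard(a, b):
    """Jaccard similarity between two sets (shared items / all items)."""
    return len(a & b) / len(a | b)


# Two inflected forms of the same Georgian noun.
nominative = "სახლი"   # "house"
locative = "სახლში"    # "in the house"

grams_a = char_ngrams(nominative)
grams_b = char_ngrams(locative)
shared = grams_a & grams_b
overlap = jaccard(grams_a, grams_b)

print("shared n-grams:", sorted(shared))
print(f"Jaccard overlap: {overlap:.2f}")
```

The two forms share every n-gram that covers the stem სახლ, so a model that sums n-gram vectors places them close together even if one form is rare or was never seen in training. A whole-word model would treat them as two unrelated vocabulary entries.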