NLP resources for the Georgian language

(Stuff I've encountered so far)

Models

BAAI/bge-m3 - In my testing, this is currently the best sentence embedding model available for Georgian.
Davit6174/georgian-distilbert-mlm · Hugging Face
jnz/electra-ka · Hugging Face
- More on jnz's profile
My models:
- XLM-Roberta-Base fine-tuned on WikiANN for NER
- Floret (a fastText adaptation) word embedding model trained on part of the mC4 dataset
- XLM-Roberta-Base for masked language modeling, fine-tuned on a private dataset of Georgian news titles.
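A minimal sketch of how a sentence embedding model such as BAAI/bge-m3 could be used for similarity search over Georgian text. It assumes the `sentence-transformers` package is installed and that the model loads through it; the helper names (`embed`, `cosine`, `rank`) are hypothetical and only for illustration. The ranking helpers are plain Python, so they can be tried with any vectors.

```python
import math
from typing import List, Sequence


def embed(texts: Sequence[str]) -> List[List[float]]:
    """Dense embeddings from BAAI/bge-m3 (downloads the model on first call)."""
    # Imported lazily so the ranking helpers below work without the dependency.
    from sentence_transformers import SentenceTransformer

    model = SentenceTransformer("BAAI/bge-m3")
    return model.encode(list(texts), normalize_embeddings=True).tolist()


def cosine(a: Sequence[float], b: Sequence[float]) -> float:
    """Cosine similarity of two vectors; 0.0 if either has zero length."""
    dot = sum(x * y for x, y in zip(a, b))
    norm = math.sqrt(sum(x * x for x in a)) * math.sqrt(sum(y * y for y in b))
    return dot / norm if norm else 0.0


def rank(query_vec: Sequence[float], doc_vecs: Sequence[Sequence[float]]) -> List[int]:
    """Indices of doc_vecs sorted from most to least similar to query_vec."""
    scores = [cosine(query_vec, d) for d in doc_vecs]
    return sorted(range(len(doc_vecs)), key=lambda i: scores[i], reverse=True)


if __name__ == "__main__":
    # Illustrative sentences; running this part needs the model download.
    docs = ["თბილისი საქართველოს დედაქალაქია", "დღეს წვიმს"]
    vecs = embed(["საქართველოს დედაქალაქი"] + docs)
    print([docs[i] for i in rank(vecs[0], vecs[1:])])
```

Keeping the model call separate from the ranking makes it easy to swap in another embedding model from the list above and compare results on the same documents.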

Tools

Some projects working on Georgian NLP tools:

ანბანი ჻ Anbani Georgia · GitHub
- "Collection of Web and AI tools designed to equip Georgian Language and Alphabet with the challenges of digital age ჻ ᲐᲜᲑᲐᲜᲘ"
- nano - bare-bones Georgian script converter
- anbani.py - Georgian Python toolkit for NLP, Transliteration and more
- TextArt - Georgian Text Art Generator
- anbani.js - Multifunctional javascript toolkit for Georgian Alphabet - Anbani
- word-embeddings
- anbani.db - Various Georgian datasets
https://qartnlp.iliauni.edu.ge/
screeve · GitHub
- lemmatizer - Lemmatization functionality for Georgian language.
- postagger - Part-of-speech tagging functionality for Georgian language.
- embeddings - pre-trained embeddings for Georgian language
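To give a feel for what a bare-bones script converter does, here is a small sketch that maps Mkhedruli (the everyday Georgian script) to Mtavruli (the capital-style letters seen in ᲐᲜᲑᲐᲜᲘ) and back. It relies only on the Unicode code point layout, where the two blocks sit at a fixed offset from each other. This is not the API of nano, anbani.py or anbani.js; the function names are hypothetical.

```python
# Mkhedruli letters occupy U+10D0..U+10FA and U+10FD..U+10FF;
# their Mtavruli counterparts sit at the same offset in U+1C90..U+1CBF.
OFFSET = 0x1C90 - 0x10D0

_MKHEDRULI = list(range(0x10D0, 0x10FB)) + list(range(0x10FD, 0x1100))
_TO_MTAVRULI = {cp: cp + OFFSET for cp in _MKHEDRULI}
_TO_MKHEDRULI = {cp + OFFSET: cp for cp in _MKHEDRULI}


def to_mtavruli(text: str) -> str:
    """Replace Mkhedruli letters with Mtavruli; other characters pass through."""
    return text.translate(_TO_MTAVRULI)


def to_mkhedruli(text: str) -> str:
    """Replace Mtavruli letters with Mkhedruli; other characters pass through."""
    return text.translate(_TO_MKHEDRULI)


if __name__ == "__main__":
    print(to_mtavruli("ანბანი"))
```

Normalizing Mtavruli to Mkhedruli before tokenization can be useful when a corpus mixes both, since most models only see the latter in training data.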

Datasets

WikiANN - NER dataset.
mC4 is a cleaned, multilingual version of Common Crawl and contains 15+ GB of Georgian text. I've found it fairly useful in my experiments.
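When working with an NER dataset such as WikiANN, the per-token tags usually need to be turned into entity spans for evaluation or display. The sketch below assumes IOB2-style string tags with PER, ORG and LOC entity types; the function name and the example sentence with its tags are hypothetical illustrations, not data taken from the dataset.

```python
from typing import List, Tuple


def iob2_to_spans(tags: List[str]) -> List[Tuple[str, int, int]]:
    """Return (entity_type, start, end) spans; end is exclusive."""
    spans: List[Tuple[str, int, int]] = []
    start, etype = None, None
    # A trailing "O" sentinel closes an entity that runs to the end.
    for i, tag in enumerate(list(tags) + ["O"]):
        continues = tag.startswith("I-") and etype == tag[2:]
        if not continues:
            if etype is not None:
                spans.append((etype, start, i))
                start, etype = None, None
            # A stray I- tag with no matching B- is leniently treated as a start.
            if tag.startswith(("B-", "I-")):
                start, etype = i, tag[2:]
    return spans


if __name__ == "__main__":
    tokens = ["ილია", "ჭავჭავაძე", "თბილისში", "ცხოვრობდა"]
    tags = ["B-PER", "I-PER", "B-LOC", "O"]
    for etype, s, e in iob2_to_spans(tags):
        print(etype, " ".join(tokens[s:e]))
```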

Linguistic resources

[Sparklis] Kartu-Verbs: A Semantic Web Base of Inflected Georgian Verb Forms to Bypass Georgian Verb Lemmatization Issues
- Paper

Research notes and questions

Georgian is a highly inflected language (a single word can take many forms; see the Kartu-Verbs entry above), which has implications for text embedding: a model that treats every surface form as a separate token sees each form only rarely. FastText is probably the best non-transformer embedding model for highly inflected languages thanks to its subword embeddings - see Comprehensive Evaluation of Word Embeddings for Highly Inflectional Language.
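To make the subword argument concrete, here is a rough sketch of fastText-style character n-grams. It wraps a word in `<` and `>` boundary markers and extracts all n-grams of length 3 to 6 (fastText's default range, as far as I know). The real library additionally keeps the whole word as its own unit and hashes n-grams into buckets, which this sketch leaves out; the function names are hypothetical.

```python
from typing import List


def char_ngrams(word: str, min_n: int = 3, max_n: int = 6) -> List[str]:
    """Character n-grams of the word wrapped in boundary markers."""
    wrapped = f"<{word}>"
    return [
        wrapped[i:i + n]
        for n in range(min_n, max_n + 1)
        for i in range(len(wrapped) - n + 1)
    ]


def ngram_overlap(a: str, b: str) -> float:
    """Jaccard overlap between the n-gram sets of two words."""
    ga, gb = set(char_ngrams(a)), set(char_ngrams(b))
    return len(ga & gb) / len(ga | gb)


if __name__ == "__main__":
    # "სახლი" (house) and "სახლში" (in the house) share their stem n-grams.
    print(sorted(set(char_ngrams("სახლი")) & set(char_ngrams("სახლში"))))
    print(ngram_overlap("სახლი", "სახლში"))
```

Because inflected forms of the same stem share many n-grams, their vectors are built from overlapping pieces, so even a form never seen in training gets a reasonable embedding.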

Contributions

I'm seeking input from other researchers and practitioners on best practices and useful resources for doing NLP in Georgian. Please contribute what you can, especially general wisdom.

alexamirejibi/awesome-geo-nlp
