Babylonian Finite-State Morphology v. 2.0
For lemmatization and POS-tagging, use BabyLemmatizer 2.0. It can produce unambiguous lemmatization and POS-tagging directly from transliteration.
To use the transducer you need Foma (https://fomafst.github.io/). For evaluation you also need the HFST package from PyPI (https://pypi.org/project/hfst/).
One way to run BabyFST is to use Foma's flookup. For example, create a file input.txt that contains transcribed words, one word per line:
šarru
kaspam
iddin
Now run
cat input.txt | ./flookup -x akkadian.foma > output.txt
where akkadian.foma is a compiled transducer file. The results can be read from output.txt. Analyses for different words are separated by an empty line. See the example script in /eval/evaluate-data.sh.
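The blank-line-separated output can be matched back to the input words with a few lines of Python. The following is a minimal sketch, not part of BabyFST: the function names are hypothetical, and it assumes that flookup writes one group of analyses per input word, in input order, and marks a word with no analysis as +?.

```python
from pathlib import Path

# Assumption: flookup prints this marker for a word it cannot analyze.
UNKNOWN = "+?"


def parse_flookup_output(text):
    """Split flookup output into one list of analysis lines per input word.

    Groups are delimited by empty lines; consecutive empty lines are ignored.
    """
    groups = []
    current = []
    for line in text.splitlines():
        line = line.strip()
        if line:
            current.append(line)
        elif current:
            groups.append(current)
            current = []
    if current:  # output that does not end with an empty line
        groups.append(current)
    return groups


def pair_with_inputs(words, groups):
    """Pair each input word with its analyses, dropping the unknown marker."""
    if len(words) != len(groups):
        raise ValueError(
            f"{len(words)} input words but {len(groups)} analysis groups"
        )
    return [
        (word, [a for a in analyses if a != UNKNOWN])
        for word, analyses in zip(words, groups)
    ]


if __name__ == "__main__":
    input_path = Path("input.txt")
    output_path = Path("output.txt")
    if input_path.exists() and output_path.exists():
        words = input_path.read_text(encoding="utf-8").split()
        groups = parse_flookup_output(output_path.read_text(encoding="utf-8"))
        for word, analyses in pair_with_inputs(words, groups):
            print(word, len(analyses), "analyses")
```

Run in the directory that holds input.txt and output.txt, the script prints each word with the number of analyses it received. A word with zero analyses was not recognized by the transducer.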
Use the files in the src folder.
Disambiguation, unify similar verb classes, get rid of unnecessary meta-symbols, fix Assyrian vowel harmony, split lexicon into dialects
If you use, redistribute, or modify BabyFST, please cite the paper below.
@inproceedings{sahala-etal-2020-babyfst,
title = "{B}aby{FST} - Towards a Finite-State Based Computational Model of Ancient Babylonian",
author = "Sahala, Aleksi and
Silfverberg, Miikka and
Arppe, Antti and
Lind{\'e}n, Krister",
booktitle = "Proceedings of the 12th Language Resources and Evaluation Conference",
month = may,
year = "2020",
address = "Marseille, France",
publisher = "European Language Resources Association",
url = "https://aclanthology.org/2020.lrec-1.479",
pages = "3886--3894",
abstract = "Akkadian is a fairly well resourced extinct language that does not yet have a comprehensive morphological analyzer available. In this paper we describe a general finite-state based morphological model for Babylonian, a southern dialect of the Akkadian language, that can achieve a coverage up to 97.3{\%} and recall up to 93.7{\%} on lemmatization and POS-tagging task on token level from a transcribed input. Since Akkadian word forms exhibit a high degree of morphological ambiguity, in that only 20.1{\%} of running word tokens receive a single unambiguous analysis, we attempt a first pass at weighting our finite-state transducer, using existing extensive Akkadian corpora which have been partially validated for their lemmas and parts-of-speech but not the entire morphological analyses. The resultant weighted finite-state transducer yields a moderate improvement so that for 57.4{\%} of the word tokens the highest ranked analysis is the correct one. We conclude with a short discussion on how morphological ambiguity in the analysis of Akkadian could be further reduced with improvements in the training data used in weighting the finite-state transducer as well as through other, context-based techniques.",
language = "English",
ISBN = "979-10-95546-34-4",
}
In Section 2.4, the paper should read: 1.4 million words occurring in texts labeled to contain Akkadian have been lemmatized or POS-tagged. This figure therefore also includes languages other than Akkadian in multilingual texts, as well as words that have not been given a lemma because they are too broken, are numbers, etc. In reality about 1.3 million words in lemmatized/POS-tagged texts are explicitly labeled as Akkadian, and not all of these have been given lemmas, for the aforementioned reasons.
This is an unfortunate error that occurs in several word count statements about Korp-Oracc due to my mistake in a Google Docs sheet about the 2019 Korp-Oracc data.
See also ./eval for revised results with explicitly labeled data without overlapping inputs.
This piece of software would not have been possible without the hard work of dozens of Assyriologists lemmatizing the Akkadian texts in the Oracc corpus: Jamie Novotny, Laurie Pearce, John Carnahan, Philip Jones, Alexa Bartelmus, Cristopher Bravo, Frauke Weierhäuser, Giulia Lentini, Jay Cristostomo, Joshua Jeffers, Melanie Groß, Mikko Luukko, Nathan Morello, Poppy Tushingham, Talia Prussin (and many others whose names I do not know).