This repository accompanies the paper "Adopting a computer-assisted approach to historical language comparison: defining early steps in a Dogon languages comparative work" by Promise Dodzi Kpoglu. It contains both the data and the source code used in the paper's experiments. The code, files, and illustrations are available on the `master` branch of this repository.
All data used for the experiments are stored in the `files` folder.
The original data, as curated by the field linguist, is stored in `original_data.tsv`. This data is also available on the Dogon and Bangime Linguistics project site, accessible via this link. The data has been curated in CLDF (Cross-Linguistic Data Format) and is publicly available via this link.
The data obtained after manual processing is stored in `data.tsv`. Each row represents a word, and the columns are as follows:
Column | Info |
---|---|
ID | Unique identifier |
VARID | Variant form identifier |
DOCULECT | Language name |
GLOSS | Meaning of the form as used by language users |
FRENCH | Gloss translation in French |
ENGLISH_SHORT | Reduced gloss in English |
FRENCH_SHORT | Reduced gloss in French |
ENGLISH_CATEGORY | Categorization of reduced gloss into designated categories |
FRENCH_CATEGORY | Categorization of reduced gloss in French into designated categories |
VALUE_ORG | Original form noted by the field linguist |
SINGULAR | Singular form of the word, where necessary |
PLURAL | Plural form of the word, where necessary |
FORM | 'Consensus' form chosen for verbs |
PARSED_FORM | Proposed segmentation of 'consensus' form |
RECONSTRUCTION | Proposed reconstruction |
CONCEPT | Standardized reference of gloss |
POS | Part of speech of the word |
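As a quick sanity check on the column layout described above, a tab-separated wordlist can be read with Python's standard `csv` module and compared against the expected header. This is only a minimal sketch and not part of the repository's scripts: the helper name `read_wordlist` is hypothetical, and the sample rows are made-up placeholders rather than real Dogon data.

```python
import csv
import io

# Column names taken from the table above.
EXPECTED_COLUMNS = [
    "ID", "VARID", "DOCULECT", "GLOSS", "FRENCH", "ENGLISH_SHORT",
    "FRENCH_SHORT", "ENGLISH_CATEGORY", "FRENCH_CATEGORY", "VALUE_ORG",
    "SINGULAR", "PLURAL", "FORM", "PARSED_FORM", "RECONSTRUCTION",
    "CONCEPT", "POS",
]


def read_wordlist(handle):
    """Read a tab-separated wordlist (hypothetical helper).

    Returns the rows as dictionaries plus the list of expected
    columns that are absent from the header.
    """
    reader = csv.DictReader(handle, delimiter="\t")
    header = reader.fieldnames or []
    missing = [name for name in EXPECTED_COLUMNS if name not in header]
    rows = list(reader)
    return rows, missing


# Tiny in-memory sample with placeholder values (not real data).
sample = io.StringIO(
    "ID\tDOCULECT\tGLOSS\tFORM\n"
    "1\tLanguageA\twater\tform1\n"
    "2\tLanguageB\twater\tform2\n"
)
rows, missing = read_wordlist(sample)
print(len(rows), "rows;", len(missing), "expected columns missing from the sample")
```

To apply the same check to the real file, the `StringIO` object could be swapped for an open file handle, for example `open("files/data.tsv", encoding="utf-8", newline="")`, assuming the file sits in the `files` folder relative to the working directory.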
The cleaned data, `cleaned_data.tsv`, is obtained after semi-manual processing. Each row represents a word, and the columns are as follows:
Column | Info |
---|---|
DOCULECT | Language name |
GLOSS | Meaning of the form as used by language users |
IPA | Standardized representation of the word in IPA |
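With only the `DOCULECT` and `GLOSS` columns, it is already possible to get a rough picture of how well each language covers the gloss list and how many glosses two languages share. The sketch below illustrates one simple way to do this; it is not the calculation performed by the repository's scripts. The function names are hypothetical, "coverage" is assumed here to mean the share of all attested glosses that a language has at least one form for, and the rows are made-up placeholders.

```python
from collections import defaultdict
from itertools import combinations


def concepts_by_doculect(rows):
    """Map each doculect to the set of glosses it has a form for."""
    concepts = defaultdict(set)
    for row in rows:
        concepts[row["DOCULECT"]].add(row["GLOSS"])
    return dict(concepts)


def coverage(concepts):
    """Share of all attested glosses covered by each doculect (assumed definition)."""
    all_glosses = set().union(*concepts.values())
    return {doc: len(glosses) / len(all_glosses) for doc, glosses in concepts.items()}


def shared_concepts(concepts):
    """Number of glosses attested in both doculects, for every pair."""
    return {
        (a, b): len(concepts[a] & concepts[b])
        for a, b in combinations(sorted(concepts), 2)
    }


# Placeholder rows, not real data.
rows = [
    {"DOCULECT": "A", "GLOSS": "water"},
    {"DOCULECT": "A", "GLOSS": "fire"},
    {"DOCULECT": "A", "GLOSS": "tree"},
    {"DOCULECT": "B", "GLOSS": "water"},
    {"DOCULECT": "B", "GLOSS": "fire"},
    {"DOCULECT": "C", "GLOSS": "water"},
]
concepts = concepts_by_doculect(rows)
per_language = coverage(concepts)
pairwise = shared_concepts(concepts)
print(per_language)
print(pairwise)
```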
The `scripts` folder contains all the Python scripts needed to obtain the results reported in the paper.
- `utils.py`: Contains the classes used to clean and segment words.
- `functions.py`: Uses the classes defined in `utils.py` and defines the functions that clean the original data.
- `cleaning_data.py`: Calls the functions in `functions.py` to clean the original data and outputs `cleaned_data.tsv`.
- `data_statistics.py`: Analyzes various components of the data and writes the results to the `illustrations` folder.
- `cognates_alignments.py`: Automatically detects cognates in the data and performs alignment analysis. Writes its output files to the `files` folder.
- `clustering.py`: Takes the results of `cognates_alignments.py` and performs clustering and analysis. Writes the results to the `illustrations` folder.
The `illustrations` folder contains:
- `coverage_plot.png`: A graph of each language's coverage in the data.
- `mutual_coverage.png`: The result of the mutual coverage analysis of the data.
- `heatmap.png`: A heatmap illustrating the weighted distances between languages.
- `tree.png`: A phylogenetic tree illustrating the relationships between the various Dogon languages.
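To give an intuition for what a distance between two languages can mean in this setting, the sketch below derives a simple pairwise distance from cognate-set assignments: the proportion of shared concepts for which two languages have no cognate set in common. This is a deliberately simplified, unweighted stand-in and should not be read as the computation behind `heatmap.png`. The function name, the tuple layout, and the cognate identifiers are all hypothetical placeholders.

```python
from itertools import combinations


def cognate_distances(entries):
    """Pairwise distances from (doculect, concept, cognate_id) tuples.

    Distance = 1 - (concepts with a shared cognate set / concepts
    attested in both doculects). Pairs with no concept in common
    get None.
    """
    cogs = {}
    for doculect, concept, cogid in entries:
        cogs.setdefault(doculect, {}).setdefault(concept, set()).add(cogid)

    distances = {}
    for a, b in combinations(sorted(cogs), 2):
        common = set(cogs[a]) & set(cogs[b])
        if not common:
            distances[(a, b)] = None
            continue
        matches = sum(1 for concept in common if cogs[a][concept] & cogs[b][concept])
        distances[(a, b)] = 1 - matches / len(common)
    return distances


# Placeholder cognate assignments, not real results.
entries = [
    ("A", "water", 1), ("A", "fire", 2),
    ("B", "water", 1), ("B", "fire", 3),
    ("C", "water", 4), ("C", "fire", 3),
]
distances = cognate_distances(entries)
print(distances)
```

A matrix of such values is the kind of input that a heatmap or a distance-based tree-building method would typically take.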
To obtain the same results reported in the paper:
- Clone this repository and run `pip install -r requirements.txt`.
- Switch to the `master` branch by running `git checkout master`.
There are two ways to obtain the results:
- Run `make all` on the command line to automatically run all scripts.
- Run the scripts manually:
  - `python cleaning_data.py`: Runs the segmentation rules in `utils.py` on the manually processed data `data.tsv` by calling various functions in `functions.py`.
  - `python data_statistics.py`: Analyzes `cleaned_data.tsv` and outputs `coverage_plot.png`, a graph of every language's coverage, and `mutual_coverage.png`, which gives an idea of length and breadth coverage in the data. It also prints the number of items on the command line.
  - `python cognates_alignments.py`: Outputs `lexstat.tsv` and `alignment_2.html`, which are the cognate clustering results and the alignment results, respectively.
  - `python clustering.py`: Takes `lexstat.tsv` as input and outputs `tree.png`, a tree of phylogenetic relationships based on cognacy, and `heatmap.png`, a heatmap of aggregated pairwise distances between languages.
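For anyone who prefers to drive the manual steps from Python rather than typing each command, the order above can be captured in a small runner. This is a sketch under stated assumptions, not a replacement for the repository's Makefile, whose contents are not reproduced here. The names `build_commands` and `run_pipeline` are hypothetical, and the working directory `scripts` is an assumption about where the scripts expect to be run from.

```python
import subprocess
import sys

# Script order as listed in the manual steps above.
PIPELINE = [
    "cleaning_data.py",
    "data_statistics.py",
    "cognates_alignments.py",
    "clustering.py",
]


def build_commands(python=sys.executable):
    """Return one command (as an argument list) per pipeline script."""
    return [[python, script] for script in PIPELINE]


def run_pipeline(cwd, dry_run=True):
    """Print each command and, unless dry_run is set, execute it.

    check=True stops the pipeline at the first failing script, so a
    later step never runs on stale or missing input.
    """
    for command in build_commands():
        print(" ".join(command))
        if not dry_run:
            subprocess.run(command, cwd=cwd, check=True)


# Dry run: only prints the commands. The "scripts" directory is an assumption.
run_pipeline(cwd="scripts")
```

Calling `run_pipeline(cwd="scripts", dry_run=False)` would actually execute the scripts in sequence, provided the dependencies from `requirements.txt` are installed.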
This work is based on data from Heath et al.'s "Dogon Comparative Wordlist" (2016).
Special thanks to all BANG project members for their invaluable contributions to this project.
This paper is part of the ERC-funded project: BANG - The Mysterious Bang: A Language and Population Isolate Unlocks the Secrets of Interior West Africa's Lost Ethnolinguistic Diversity.
- CORDIS Number: 101045195
- Project ID: 101045195