MorphBPE: Morphologically-Aware Arabic Tokenizer

Implementation of the MorphBPE tokenizer from the paper "Fanar: An Arabic-Centric Multimodal Generative AI Platform". MorphBPE combines morphological segmentation with byte-pair encoding (BPE) to create a tokenizer that respects Arabic morphological structure.

Key Features

🔍 Morpheme-boundary aware BPE merges
🚀 Optimized vocabulary size (75 × 1024 = 76,800 tokens)
🔄 Arabic text preprocessing pipeline
📊 Built-in evaluation metrics:
- Morphological Alignment Score
- Fertility Score
💾 Model serialization support

Installation

pip install -r requirements.txt
mkdir -p ~/.ssh
curl https://curl.se/ca/cacert.pem -o ~/.ssh/cert.pem

Quick Start

from morphbpe import MorphBPE, preprocess_arabic

# Initialize tokenizer
tokenizer = MorphBPE(
    vocab_size=76800,  # 75 * 1024 as per paper
    farasa_api_key="YOUR_API_KEY"  # Get from https://farasa.qcri.org
)

# Train on Arabic corpus
corpus = [
    "الرحمن الرحيم مالك يوم الدين",
    "بسم الله الرحمن الرحيم"
]

# Preprocess and train
preprocessed_corpus = [preprocess_arabic(text) for text in corpus]
tokenizer.train(preprocessed_corpus)

# Tokenize text
text = "الرحمن الرحيم"
tokens = tokenizer.tokenize(text)

# Evaluate
alignment_score = tokenizer.morphological_alignment_score(text)
print(f"Morphological Alignment: {alignment_score:.2f}")

Algorithm

MorphBPE follows Algorithm 1 from the paper:

Initialize vocabulary with individual Arabic characters
Segment training corpus using morphological segmentation
While the target vocabulary size has not been reached:
- Compute the frequencies of adjacent symbol pairs
- Select the most frequent pair that does not cross a morpheme boundary
- Merge the pair into a new symbol
- Add the new symbol to the vocabulary
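
The loop above can be sketched in a few lines of Python. This is a minimal illustration, not the code in the morphbpe package: the function names `train_morph_bpe_sketch` and `merge_pair` are hypothetical, and the input is assumed to be already segmented (each word given as a list of morpheme strings, for example from Farasa). Keeping every morpheme as its own symbol list is what stops a merge from crossing a morpheme boundary.

```python
from collections import Counter


def merge_pair(symbols, pair):
    """Replace each non-overlapping occurrence of `pair` in `symbols` with the merged symbol."""
    merged = []
    i = 0
    while i < len(symbols):
        if i + 1 < len(symbols) and (symbols[i], symbols[i + 1]) == pair:
            merged.append(symbols[i] + symbols[i + 1])
            i += 2
        else:
            merged.append(symbols[i])
            i += 1
    return merged


def train_morph_bpe_sketch(segmented_words, vocab_size):
    """Hypothetical sketch: BPE training where merges stay inside one morpheme.

    segmented_words: list of words, each word a list of morpheme strings
                     (assumed to come from a morphological segmenter).
    """
    # Every morpheme starts as a list of single characters.
    words = [[list(morpheme) for morpheme in word] for word in segmented_words]
    # Initial vocabulary: the individual characters seen in the corpus.
    vocab = {ch for word in words for morpheme in word for ch in morpheme}
    merges = []

    while len(vocab) < vocab_size:
        # Count adjacent pairs within each morpheme only.
        pairs = Counter()
        for word in words:
            for morpheme in word:
                pairs.update(zip(morpheme, morpheme[1:]))
        if not pairs:
            break  # every morpheme is already a single symbol

        # Ties are broken by first occurrence in the corpus.
        best = max(pairs, key=pairs.get)
        merges.append(best)
        vocab.add(best[0] + best[1])
        words = [[merge_pair(morpheme, best) for morpheme in word] for word in words]

    return merges, vocab, words


# Toy corpus, segmented by hand for illustration.
corpus = [["ال", "رحمن"], ["ال", "رحيم"]]
merges, vocab, words = train_morph_bpe_sketch(corpus, vocab_size=9)
print(merges)
```

Because pairs are counted per morpheme, a symbol such as the definite article can never fuse with the stem that follows it, however frequent that combination is in the corpus.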

Evaluation Metrics

The implementation includes three key evaluation metrics from the paper:

Morphological Alignment Score: Measures how closely the tokenization matches the morphological segmentation, computed with dynamic programming.
Fertility: Ratio of the number of tokens produced to the number of whitespace-delimited words.
Perplexity: Available only when the tokenizer is integrated with a language model.
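
As a rough illustration of the first two metrics, the sketch below computes fertility as tokens per whitespace-delimited word, and an alignment score based on the edit distance between the token sequence and the morpheme sequence. Both function names are hypothetical, and the edit-distance formulation is an assumption chosen because it is a standard dynamic-programming alignment; the exact formula used by the paper and by this package may differ.

```python
def fertility_sketch(texts, tokenize):
    """Average number of tokens per whitespace-delimited word.

    tokenize: any callable mapping a string to a list of tokens.
    """
    total_words = sum(len(text.split()) for text in texts)
    total_tokens = sum(len(tokenize(text)) for text in texts)
    return total_tokens / total_words if total_words else 0.0


def edit_distance(a, b):
    """Levenshtein distance between two sequences, via dynamic programming."""
    prev = list(range(len(b) + 1))
    for i, x in enumerate(a, 1):
        curr = [i]
        for j, y in enumerate(b, 1):
            cost = 0 if x == y else 1
            curr.append(min(prev[j] + 1,          # delete x
                            curr[j - 1] + 1,      # insert y
                            prev[j - 1] + cost))  # match or substitute
        prev = curr
    return prev[-1]


def alignment_score_sketch(tokens, morphemes):
    """Assumed definition: 1 minus the normalized edit distance (1.0 = identical)."""
    longest = max(len(tokens), len(morphemes))
    if longest == 0:
        return 1.0
    return 1.0 - edit_distance(tokens, morphemes) / longest


# Tokens that coincide with the morphemes score 1.0.
print(alignment_score_sketch(["ال", "رحمن"], ["ال", "رحمن"]))
```

Under this definition, a lower fertility means fewer tokens per word, and an alignment score near 1.0 means the tokens largely coincide with morphemes.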

Preprocessing

The tokenizer implements the paper's preprocessing pipeline for Arabic text:

Diacritic removal (diacritics are still preserved in the vocabulary)
Script normalization (alef, teh marbuta, etc.)
Morphological segmentation using Farasa
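
The first two steps can be sketched with the standard library alone. This is an illustration under stated assumptions, not the package's `preprocess_arabic`: the function name `normalize_arabic_sketch` is hypothetical, and the specific mappings (alef variants to bare alef, teh marbuta to heh) are one common normalization convention that may not match the paper's exact rules. Farasa segmentation is left out because it depends on an external service.

```python
import re

# Arabic diacritics (tashkeel): U+064B (fathatan) through U+0652 (sukun),
# plus the superscript alef U+0670.
DIACRITICS = re.compile(r"[\u064B-\u0652\u0670]")

# Alef with madda, hamza above, or hamza below (assumed normalization targets).
ALEF_VARIANTS = re.compile(r"[\u0622\u0623\u0625]")


def normalize_arabic_sketch(text):
    """Strip diacritics, then apply a simple script normalization."""
    text = DIACRITICS.sub("", text)
    text = ALEF_VARIANTS.sub("\u0627", text)   # alef variants -> bare alef
    text = text.replace("\u0629", "\u0647")    # teh marbuta -> heh (assumption)
    return text


print(normalize_arabic_sketch("بِسْمِ"))
```

Normalizing before training reduces the number of surface forms that map to the same word, which keeps the pair statistics from being split across spelling variants.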

Acknowledgments

Farasa Segmenter for Arabic morphological analysis
The Fanar team for the algorithm specification
