Implementation of the MorphBPE tokenizer from the paper "Fanar: An Arabic-Centric Multimodal Generative AI Platform". MorphBPE combines morphological segmentation with byte-pair encoding (BPE) to create a tokenizer that respects Arabic morphological structure.
- 🔍 Morpheme-boundary aware BPE merges
- 🚀 Optimized vocabulary size (75 × 1024 = 76,800 tokens)
- 🔄 Arabic text preprocessing pipeline
- 📊 Built-in evaluation metrics:
  - Morphological Alignment Score
  - Fertility Score
- 💾 Model serialization support
Installation:

pip install -r requirements.txt
mkdir -p ~/.ssh
curl https://curl.se/ca/cacert.pem -o ~/.ssh/cert.pem
Usage:

from morphbpe import MorphBPE, preprocess_arabic

# Initialize tokenizer
tokenizer = MorphBPE(
    vocab_size=76800,  # 75 * 1024 as per paper
    farasa_api_key="YOUR_API_KEY"  # Get from https://farasa.qcri.org
)

# Train on Arabic corpus
corpus = [
    "الرحمن الرحيم مالك يوم الدين",
    "بسم الله الرحمن الرحيم"
]

# Preprocess and train
preprocessed_corpus = [preprocess_arabic(text) for text in corpus]
tokenizer.train(preprocessed_corpus)

# Tokenize text
text = "الرحمن الرحيم"
tokens = tokenizer.tokenize(text)

# Evaluate
alignment_score = tokenizer.morphological_alignment_score(text)
print(f"Morphological Alignment: {alignment_score:.2f}")
MorphBPE follows Algorithm 1 from the paper:
- Initialize vocabulary with individual Arabic characters
- Segment training corpus using morphological segmentation
- While target vocabulary size not reached:
  - Compute byte-pair frequencies
  - Find most frequent pair that respects morpheme boundaries
  - Merge pair into new symbol
  - Update vocabulary
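
The loop above can be sketched in a few lines of plain Python. This is an illustration, not the code in this repository: `train_morph_bpe_sketch` and `merge_pair` are hypothetical names, the input is assumed to be already segmented into morphemes (standing in for Farasa output), and the boundary constraint is enforced by counting pairs only inside a single morpheme, which is one possible reading of Algorithm 1.

```python
from collections import Counter


def merge_pair(symbols, pair):
    """Replace every adjacent occurrence of `pair` in `symbols` with one merged symbol."""
    merged = []
    i = 0
    while i < len(symbols):
        if i + 1 < len(symbols) and (symbols[i], symbols[i + 1]) == pair:
            merged.append(symbols[i] + symbols[i + 1])
            i += 2
        else:
            merged.append(symbols[i])
            i += 1
    return tuple(merged)


def train_morph_bpe_sketch(segmented_words, vocab_size):
    """Learn merges from words given as lists of morpheme strings (hypothetical helper)."""
    # Each morpheme is its own symbol sequence, so a counted pair can
    # never straddle a morpheme boundary.
    morphemes = Counter(tuple(m) for word in segmented_words for m in word)
    vocab = {ch for seq in morphemes for ch in seq}
    merges = []

    while len(vocab) < vocab_size:
        pair_counts = Counter()
        for seq, freq in morphemes.items():
            for pair in zip(seq, seq[1:]):
                pair_counts[pair] += freq
        if not pair_counts:
            break  # every morpheme is already a single symbol

        # Most frequent pair; ties are broken by comparing the pairs themselves.
        best = max(pair_counts, key=lambda p: (pair_counts[p], p))
        merges.append(best)
        vocab.add(best[0] + best[1])

        updated = Counter()
        for seq, freq in morphemes.items():
            updated[merge_pair(seq, best)] += freq
        morphemes = updated

    return vocab, merges


corpus = [["ال", "رحمن"], ["ال", "رحيم"]]  # toy pre-segmented input
vocab, merges = train_morph_bpe_sketch(corpus, vocab_size=12)
print(len(vocab), merges)
```

Because pairs are only counted within a morpheme, no learned token can span the boundary between the prefix "ال" and the stem that follows it, however frequent that character pair is in the raw text.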
The implementation covers the three evaluation metrics used in the paper. The first two are built in; perplexity requires a language model:
- Morphological Alignment Score: Measures how closely the tokenization matches the morphological segmentation, computed with dynamic programming
- Fertility: Ratio of the number of tokens produced to the number of whitespace-delimited words
- Perplexity: Available only when the tokenizer is integrated with a language model
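
To make the first two metrics concrete, here is a minimal sketch. The function names are hypothetical and are not this package's API. Fertility follows the definition above directly. The alignment score shown here is an assumption: it aligns the token sequence against the morpheme sequence with a Levenshtein-style dynamic program and normalizes by the longer sequence, which may differ from the exact formulation in the paper.

```python
def fertility_sketch(texts, tokenize):
    """Tokens produced per whitespace-delimited word, over a list of texts."""
    total_tokens = sum(len(tokenize(text)) for text in texts)
    total_words = sum(len(text.split()) for text in texts)
    return total_tokens / total_words if total_words else 0.0


def sequence_edit_distance(a, b):
    """Levenshtein distance between two sequences of strings (row-by-row DP)."""
    prev = list(range(len(b) + 1))
    for i, x in enumerate(a, 1):
        curr = [i]
        for j, y in enumerate(b, 1):
            curr.append(min(
                prev[j] + 1,             # delete x
                curr[j - 1] + 1,         # insert y
                prev[j - 1] + (x != y),  # match or substitute
            ))
        prev = curr
    return prev[-1]


def alignment_score_sketch(tokens, morphemes):
    """1.0 when tokens equal the morphemes; lower as more edits are needed."""
    longest = max(len(tokens), len(morphemes))
    if longest == 0:
        return 1.0
    return 1.0 - sequence_edit_distance(tokens, morphemes) / longest


morphemes = ["ال", "رحمن"]
print(alignment_score_sketch(["ال", "رحمن"], morphemes))  # 1.0: tokens match morphemes
print(alignment_score_sketch(["الر", "حمن"], morphemes))  # 0.0: both tokens cross the boundary
print(fertility_sketch(["الرحمن الرحيم"], str.split))     # 1.0: one token per word
```

Any callable that maps a string to a list of tokens can be passed as `tokenize`, so the same helper can compare a trained tokenizer against a whitespace baseline.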
The tokenizer implements the paper's preprocessing pipeline for Arabic text:
- Diacritic removal from the input text (diacritic characters are still preserved in the vocabulary)
- Script normalization of variant letter forms (alef, teh marbuta, etc.)
- Morphological segmentation using Farasa
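
The first two steps can be approximated with the standard library alone. The sketch below is not the package's `preprocess_arabic`: `normalize_arabic_sketch` is a hypothetical name, and the specific mappings (alef variants to bare alef, teh marbuta to heh, alef maksura to yeh) are common Arabic NLP conventions assumed here, not taken from the paper. Farasa segmentation is left out because it depends on an external service.

```python
import re

# Arabic short-vowel and related marks (U+064B to U+0652) plus superscript alef (U+0670).
DIACRITICS = re.compile(r"[\u064B-\u0652\u0670]")

# Assumed normalization table; the paper's exact mapping may differ.
NORMALIZATION_MAP = str.maketrans({
    "\u0623": "\u0627",  # alef with hamza above -> bare alef
    "\u0625": "\u0627",  # alef with hamza below -> bare alef
    "\u0622": "\u0627",  # alef with madda -> bare alef
    "\u0629": "\u0647",  # teh marbuta -> heh
    "\u0649": "\u064A",  # alef maksura -> yeh
})


def normalize_arabic_sketch(text):
    """Strip diacritics, unify letter variants, and collapse whitespace."""
    text = DIACRITICS.sub("", text)
    text = text.translate(NORMALIZATION_MAP)
    return re.sub(r"\s+", " ", text).strip()


print(normalize_arabic_sketch("بِسْمِ اللَّهِ"))
```

Running normalization before segmentation keeps the segmenter's input consistent, so the same surface word is not split into differently spelled morphemes.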
Acknowledgments:
- Farasa Segmenter for Arabic morphological analysis
- The Fanar team for the algorithm specification