h9-tec/MorphBPE

MorphBPE: Morphologically-Aware Arabic Tokenizer

Implementation of the MorphBPE tokenizer from the paper "Fanar: An Arabic-Centric Multimodal Generative AI Platform". MorphBPE combines morphological segmentation with byte-pair encoding (BPE) to create a tokenizer that respects Arabic morphological structure.

Key Features

  • 🔍 Morpheme-boundary aware BPE merges
  • 🚀 Optimized vocabulary size (75 × 1024 = 76,800 tokens)
  • 🔄 Arabic text preprocessing pipeline
  • 📊 Built-in evaluation metrics:
    • Morphological Alignment Score
    • Fertility Score
  • 💾 Model serialization support

Installation

pip install -r requirements.txt
mkdir -p ~/.ssh
curl https://curl.se/ca/cacert.pem -o ~/.ssh/cert.pem

Quick Start

from morphbpe import MorphBPE, preprocess_arabic

# Initialize tokenizer
tokenizer = MorphBPE(
    vocab_size=76800,  # 75 * 1024 as per paper
    farasa_api_key="YOUR_API_KEY"  # Get from https://farasa.qcri.org
)

# Train on Arabic corpus
corpus = [
    "الرحمن الرحيم مالك يوم الدين",
    "بسم الله الرحمن الرحيم"
]

# Preprocess and train
preprocessed_corpus = [preprocess_arabic(text) for text in corpus]
tokenizer.train(preprocessed_corpus)

# Tokenize text and inspect the result
text = "الرحمن الرحيم"
tokens = tokenizer.tokenize(text)
print(tokens)

# Evaluate
alignment_score = tokenizer.morphological_alignment_score(text)
print(f"Morphological Alignment: {alignment_score:.2f}")

Algorithm

MorphBPE follows Algorithm 1 from the paper:

  1. Initialize vocabulary with individual Arabic characters
  2. Segment training corpus using morphological segmentation
  3. Until the target vocabulary size is reached:
    • Count the frequency of each adjacent symbol pair
    • Select the most frequent pair that does not cross a morpheme boundary
    • Merge that pair into a new symbol
    • Add the new symbol to the vocabulary
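
The loop above can be sketched in a few lines of Python. This is a minimal illustration under stated assumptions, not the repository's code: the function name `train_morph_bpe_sketch` is hypothetical, the morpheme segmentation is written by hand instead of coming from Farasa, and boundaries are respected by storing each morpheme as its own symbol sequence, so no cross-boundary pair is ever counted.

```python
from collections import Counter

def train_morph_bpe_sketch(segmented_words, vocab_size):
    """Toy morpheme-aware BPE trainer (hypothetical helper, not the repo API).

    segmented_words: list of words, each given as a list of morpheme strings.
    """
    # Frequency of each distinct morpheme in the corpus.
    freqs = Counter(m for word in segmented_words for m in word)
    # Each morpheme starts as a sequence of single characters.
    symbols = {m: list(m) for m in freqs}
    vocab = {ch for m in freqs for ch in m}
    merges = []

    while len(vocab) < vocab_size:
        # Count adjacent pairs inside morphemes only, so a merge can
        # never cross a morpheme boundary.
        pairs = Counter()
        for m, seq in symbols.items():
            for a, b in zip(seq, seq[1:]):
                pairs[(a, b)] += freqs[m]
        if not pairs:
            break  # every morpheme is already a single symbol

        # Most frequent pair; ties are broken deterministically by the pair itself.
        best = max(pairs, key=lambda p: (pairs[p], p))
        merged = best[0] + best[1]
        merges.append(best)
        vocab.add(merged)

        # Apply the merge to every morpheme.
        updated = {}
        for m, seq in symbols.items():
            out, i = [], 0
            while i < len(seq):
                if i + 1 < len(seq) and (seq[i], seq[i + 1]) == best:
                    out.append(merged)
                    i += 2
                else:
                    out.append(seq[i])
                    i += 1
            updated[m] = out
        symbols = updated

    return vocab, merges

# Hand-segmented toy corpus (assumed segmentation: definite article + stem).
toy_corpus = [["ال", "رحمن"], ["ال", "رحيم"]]
toy_vocab, toy_merges = train_morph_bpe_sketch(toy_corpus, vocab_size=100)
```

Because pairs are only counted within a morpheme, no learned symbol in `toy_vocab` spans the boundary between the article and the stem.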

Evaluation Metrics

The paper describes three evaluation metrics. The first two are built into this implementation; the third requires a language model:

  1. Morphological Alignment Score: Measures how closely token boundaries align with the morphological segmentation, computed with dynamic programming
  2. Fertility: Average number of tokens produced per whitespace-delimited word
  3. Perplexity: Available only when the tokenizer is integrated with a language model
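
As a rough illustration of the fertility metric, the sketch below divides the total token count by the number of whitespace-delimited words. The helper name `fertility_sketch` and its `tokenize` callback are assumptions made for this example; the repository's own method may use a different name, signature, or averaging scheme.

```python
def fertility_sketch(texts, tokenize):
    """Average tokens per whitespace-delimited word (hypothetical helper).

    texts: iterable of strings.
    tokenize: any callable mapping a string to a list of tokens.
    """
    total_tokens = 0
    total_words = 0
    for text in texts:
        total_words += len(text.split())
        total_tokens += len(tokenize(text))
    # Avoid division by zero on an empty corpus.
    return total_tokens / total_words if total_words else 0.0
```

A fertility of 1.0 means the tokenizer emits one token per word on average; higher values mean words are split into more pieces.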

Preprocessing

The tokenizer implements the paper's preprocessing pipeline for Arabic text:

  • Diacritic removal from the input text (diacritic characters are still kept in the vocabulary)
  • Script normalization (alef, teh marbuta, etc.)
  • Morphological segmentation using Farasa
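
The first two steps can be approximated with the standard library alone. The sketch below is an assumption-laden illustration, not the repository's `preprocess_arabic`: the function name `normalize_arabic_sketch` is hypothetical, and the exact character sets and the direction of the teh marbuta mapping are choices made here that may differ from the paper's pipeline. Farasa segmentation is deliberately left out.

```python
import re

# Tashkeel marks U+064B..U+0652 plus the superscript (dagger) alef U+0670.
DIACRITICS = re.compile("[\u064B-\u0652\u0670]")
# Alef with madda, hamza above, and hamza below.
ALEF_VARIANTS = re.compile("[\u0622\u0623\u0625]")

def normalize_arabic_sketch(text):
    """Strip diacritics and unify common letter variants (hypothetical helper)."""
    text = DIACRITICS.sub("", text)
    # Map alef variants to bare alef (U+0627).
    text = ALEF_VARIANTS.sub("\u0627", text)
    # Assumption: teh marbuta (U+0629) is mapped to heh (U+0647).
    text = text.replace("\u0629", "\u0647")
    return text
```

For example, a fully vocalized word loses its short-vowel marks, and a word starting with a hamza-carrying alef is rewritten to start with a bare alef.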

Acknowledgments

  • Farasa Segmenter for Arabic morphological analysis
  • The Fanar team for the algorithm specification
