Awesome Persian NLP

A curated list of tools and research related to Persian NLP.

Tools

Part-of-Speech Tagger

farsiNLPTools - Open-source dependency parser, part-of-speech tagger, and text normalizer for Farsi (Persian).
Hazm - Persian NLP Toolkit.
Persian Language Model for HunPoS - HunPoS (Halacsy et al., 2007) is an open-source reimplementation of the statistical part-of-speech tagger Trigrams'n'Tags, also called TnT (Brants, 2000), that allows the user to tune the tagger by using different feature settings.
Maryam Tavafi POS Tagger - This software includes an implementation of a Persian part-of-speech tagger based on Structured Support Vector Machines.
Perstem - Perstem is a Persian (Farsi) stemmer, morphological analyzer, transliterator, and partial part-of-speech tagger. Inflexional morphemes are separated or removed from their stems. Perstem can also tokenize and transliterate between various character set encodings and romanizations.
Persianp Toolbox - Multi-purpose Persian NLP toolbox.
UM-wtlab pos tagger - This software is a C# implementation of the Viterbi and Brill part-of-speech taggers.
RDRPOSTagger - Provides a pre-trained part-of-speech (POS) tagging model for Persian. This POS tagging toolkit is implemented in both Python and Java.
jPTDP - Provides a pre-trained model for joint POS tagging and dependency parsing for Persian.
Parsivar - A Language Processing Toolkit for Persian.

Language Detection

Google language detect (Python port) - Lightweight language detector; its performance on Persian is excellent.

Tokenization & Segmentation

Hazm - Persian NLP Toolkit.
polyglot - Natural language pipeline that supports massive multilingual applications (like tokenization (165 languages), language detection (196 languages), named entity recognition (40 languages), part of speech tagging (16 languages), sentiment analysis (136 languages), word embeddings (137 languages), morphological analysis (135 languages), transliteration (69 languages)).
tok-tok - Tok-tok is a fast, simple, multilingual tokenizer (single .pl file).
segmental - Lets you train a text segmentation model on a plain-text corpus using a deep learning platform.
Persian Sentence Segmenter and Tokenizer: SeTPer - Regex-based sentence segmenter.
Farsi-Verb-Tokenizer - Tokenizes Farsi verbs.
Parsivar - A Language Processing Toolkit for Persian.
ParsiAnalyzer - Persian Analyzer For Elasticsearch.
ParsiNorm - Persian text pre-processing tool.
Persian Tools - An anthology of a variety of tools for the Persian language in Python.

Normalizer And Text Cleaner

Hazm - Persian NLP Toolkit.
Persian Pre-processor: PrePer - Another single .pl tool that normalizes your Persian text.
virastar - Cleans up Persian text: replaces double dashes with an en dash and triple dashes with an em dash, replaces English numbers with their Persian equivalents, corrects spacing around :;,.?! (one space after and no space before), replaces the English percent sign with its Persian equivalent, and applies many other normalizations. Virastar is written in Ruby and has a Python port.
Virastyar - A collection of C# libraries for Persian text processing (spell checking, purification, punctuation correction, Persian character standardization, Pinglish conversion, and more).
Parsivar - A Language Processing Toolkit for Persian (Has Half-Space Normalizer and Pinglish Conversion).
ParsiAnalyzer - Persian Analyzer For Elasticsearch.
ParsiNorm - Persian text pre-processing tool.
Persian Tools - An anthology of a variety of tools for the Persian language in Python.
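
The kind of clean-up rules described for virastar above can be illustrated with a minimal Python sketch. This is not virastar's implementation or API; the function name `normalize`, the choice of U+066A as the Persian percent sign, and the exact regular expressions are assumptions made for illustration only.

```python
import re

# Map ASCII digits 0-9 to Persian digits (U+06F0..U+06F9).
PERSIAN_DIGITS = {ord(str(i)): chr(0x06F0 + i) for i in range(10)}

PUNCT = r":;,.?!"


def normalize(text: str) -> str:
    """Apply a few illustrative clean-up rules (hypothetical helper)."""
    # Handle the triple dash first so the double-dash rule does not consume it.
    text = text.replace("---", "\u2014")  # em dash
    text = text.replace("--", "\u2013")   # en dash
    # Replace ASCII digits with Persian digits.
    text = text.translate(PERSIAN_DIGITS)
    # Replace the ASCII percent sign with the Arabic percent sign (assumed target).
    text = text.replace("%", "\u066a")
    # Remove spaces before punctuation.
    text = re.sub(rf"[ \t]+([{PUNCT}])", r"\1", text)
    # Insert one space after punctuation when the next character is not
    # whitespace, punctuation, or a digit (so decimals stay intact).
    text = re.sub(rf"([{PUNCT}])(?=[^\s{PUNCT}\d])", r"\1 ", text)
    # Collapse runs of spaces after punctuation to a single space.
    text = re.sub(rf"([{PUNCT}])[ \t]{{2,}}", r"\1 ", text)
    return text


if __name__ == "__main__":
    print(normalize("hi ,there !"))
    print(normalize("12% -- 3.5"))
```

Real normalizers cover many more cases (half-spaces, Arabic versus Persian letter forms, quotation marks), so a sketch like this is only a starting point for understanding what such tools do.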

Translator

SPL - Semantic Parser Localizer toolkit can be used to translate text between any language pairs for which an NMT model exists. We currently support Marian models and Google Translate. In general, for translations to or from Persian, Google Translate has higher quality.

Transliterator

Perstem - Perstem is a Persian (Farsi) stemmer, morphological analyzer, transliterator, and partial part-of-speech tagger. Inflexional morphemes are separated or removed from their stems. Perstem can also tokenize and transliterate between various character set encodings and romanizations.

Morphological Analysis

polyglot - Natural language pipeline that supports massive multilingual applications (like tokenization (165 languages), language detection (196 languages), named entity recognition (40 languages), part of speech tagging (16 languages), sentiment analysis (136 languages), word embeddings (137 languages), morphological analysis (135 languages), transliteration (69 languages)).

Stemmer

Hazm - Persian NLP Toolkit.
PersianStemmer (Java, Delphi, C#, and Python) - PersianStemmer is a longest-match stemming algorithm based on pattern matching. It uses a knowledge base that consists of a collection of rules named "patterns". Furthermore, the exceptions and problems in Persian morphology have been studied, and a solution is presented for each of them. The stemmer was evaluated, and its results were much better than those of previous stemmers.
Perstem - Perstem is a Persian (Farsi) stemmer, morphological analyzer, transliterator, and partial part-of-speech tagger. Inflexional morphemes are separated or removed from their stems. Perstem can also tokenize and transliterate between various character set encodings and romanizations.
polyglot - Natural language pipeline that supports massive multilingual applications (like tokenization (165 languages), language detection (196 languages), named entity recognition (40 languages), part of speech tagging (16 languages), sentiment analysis (136 languages), word embeddings (137 languages), morphological analysis (135 languages), transliteration (69 languages)).
Parsivar - A Language Processing Toolkit for Persian.
ParsiAnalyzer - Persian Analyzer For Elasticsearch.

Sentiment Analysis

polyglot (polarity) - Natural language pipeline that supports massive multilingual applications (like tokenization (165 languages), language detection (196 languages), named entity recognition (40 languages), part of speech tagging (16 languages), sentiment analysis (136 languages), word embeddings (137 languages), morphological analysis (135 languages), transliteration (69 languages)).

Spell Checking

async_faspell - Persian spellchecker. An algorithm that suggests words for misspelled words.

Dependency Parser

Hazm - Persian NLP Toolkit.

Shallow Parser

Hazm - Persian NLP Toolkit.
Parsivar - A Language Processing Toolkit for Persian.

Information Extraction

Baaz - Open information extraction from Persian web.

Text To Speech Preprocessing

ParsiNorm - Persian text pre-processing tool.
Persian Tools - An anthology of a variety of tools for the Persian language in Python.

Text To Speech

AlisterTA TTS - A convolutional sequence to sequence model for Persian text to speech based on Tachibana et al with a few modifications.

Persian Phonemizer

persian_phonemizer - A tool for translating Persian text to IPA (International Phonetic Alphabet).

MISC

petit - Convert alphabet-written numbers to digit-form.

Keyphrase Extractor

Perke - Perke is a Python keyphrase extraction package for Persian language. It provides an end-to-end keyphrase extraction pipeline in which each component can be easily modified or extended to develop new models.

Speech Recognition

Vosk - Vosk is an offline open source speech recognition toolkit. It enables speech recognition for 20+ languages and dialects. Supports Persian.
m3hrdadfi/wav2vec - Persian speech recognition model based on XLS-R.

Metrics

Rouge - Full Python ROUGE Score Implementation (not a wrapper).

Datasets

Part-of-Speech Tagger

Bijankhan Corpus - Bijankhan corpus is a tagged corpus that is suitable for natural language processing research on the Persian (Farsi) language. This collection is gathered from daily news and common texts. In this collection all documents are categorized into different subjects such as political, cultural and so on. In total, there are 4300 different subjects. The Bijankhan collection contains about 2.6 million manually tagged words with a tag set that contains 40 Persian POS tags.
Mojgan Seraji Corpus - Uppsala Persian Corpus (UPC) is a large, freely available Persian corpus. The corpus is a modified version of the Bijankhan corpus with additional sentence segmentation and consistent tokenization containing 2,704,028 tokens and annotated with 31 part-of-speech tags. The part-of-speech tags are listed with explanations in this table.
Large-Scale Colloquial Persian - Large Scale Colloquial Persian Dataset (LSCP) is hierarchically organized in a semantic taxonomy that focuses on multi-task informal Persian language understanding as a comprehensive problem. LSCP includes 120M sentences from 27M casual Persian tweets with its dependency relations in syntactic annotation, part-of-speech tags, sentiment polarity, and automatic translation of the original Persian sentences into English (EN), German (DE), Czech (CS), Italian (IT), and Hindi (HI). Learn more about this project at the LSCP webpage.

Named Entity Recognition

ArmanPersoNERCorpus - The dataset includes 250,015 tokens and 7,682 Persian sentences in total. It is available in 3 folds to be used in turn as training and test sets. Each file contains one token, along with its manually annotated named-entity tag, per line. Each sentence is separated with a newline. The NER tags are in IOB format.
FarsiYar PersianNER - The dataset includes about 25,000,000 tokens and about 1,000,000 Persian sentences in total based on Persian Wikipedia Corpus. The NER tags are in IOB format. More than 1000 volunteers contributed tag improvements to this dataset via web panel or android app. They release updated tags every two weeks.
Workshop on NLP Solutions for Under Resourced Languages (NSURL) 2019 - Task 7 dataset - Contains a medium-sized NER corpus with 7 classes of named entities (person, location, organization, money, percent, date, and time). This corpus contains more than 700 news documents.
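
Several of the NER datasets above are described as one token and its IOB tag per line, with sentences separated by a blank line. A minimal reader for that layout might look like the following sketch. It assumes the token and tag are separated by whitespace and that tags look like `B-X`, `I-X`, and `O`; the function names and the sample tags (`PER`, `LOC`) are placeholders, not the label set of any particular dataset.

```python
from typing import Iterable, List, Tuple

Sentence = List[Tuple[str, str]]


def read_iob(lines: Iterable[str]) -> List[Sentence]:
    """Group 'token tag' lines into sentences; blank lines end a sentence."""
    sentences: List[Sentence] = []
    current: Sentence = []
    for raw in lines:
        line = raw.strip()
        if not line:
            if current:
                sentences.append(current)
                current = []
            continue
        # Assumption: the tag is the last whitespace-separated field.
        token, tag = line.rsplit(maxsplit=1)
        current.append((token, tag))
    if current:
        sentences.append(current)
    return sentences


def entity_spans(sentence: Sentence) -> List[Tuple[str, str]]:
    """Collect (entity text, label) pairs from one IOB-tagged sentence."""
    spans: List[Tuple[str, str]] = []
    tokens: List[str] = []
    label = None
    for token, tag in sentence:
        starts_new = tag.startswith("B-") or (
            tag.startswith("I-") and tag[2:] != label
        )
        if starts_new:
            if tokens:
                spans.append((" ".join(tokens), label))
            tokens, label = [token], tag[2:]
        elif tag.startswith("I-"):
            tokens.append(token)
        else:  # an "O" tag closes any open entity
            if tokens:
                spans.append((" ".join(tokens), label))
            tokens, label = [], None
    if tokens:
        spans.append((" ".join(tokens), label))
    return spans


if __name__ == "__main__":
    sample = ["Ali B-PER", "Daei I-PER", "in O", "Tehran B-LOC", "", "hello O"]
    for sent in read_iob(sample):
        print(entity_spans(sent))
```

Checking the actual separator and tag inventory in each dataset's documentation before reusing a reader like this is advisable, since those details vary between corpora.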

Relation Extraction

PERLEX - The first Persian dataset for relation extraction, which is an expert translated version of the “Semeval-2010-Task-8” dataset. Link to the relevant publication.

Dependency Parsing

Persian Syntactic Dependency Treebank - This treebank is supplied for free noncommercial use. For commercial uses feel free to contact us. The number of annotated sentences is 29,982 sentences including samples from almost all verbs of the Persian valency lexicon.
Uppsala Persian Dependency Treebank: UPDT - Dependency-based syntactically annotated corpus.
Pretrained model.
Universal Dependencies 1.3 - Multilingual corpus that holds IOB gold data for dependency parsing.
HamleDT 3.0 - HArmonized Multi-LanguagE Dependency Treebank is a compilation of existing dependency treebanks (or dependency conversions of other treebanks), transformed so that they all conform to the same annotation style. This version uses Universal Dependencies as the common annotation style.
Large-Scale Colloquial Persian - Large Scale Colloquial Persian Dataset (LSCP) is hierarchically organized in a semantic taxonomy that focuses on multi-task informal Persian language understanding as a comprehensive problem. LSCP includes 120M sentences from 27M casual Persian tweets with its dependency relations in syntactic annotation, part-of-speech tags, sentiment polarity, and automatic translation of the original Persian sentences into English (EN), German (DE), Czech (CS), Italian (IT), and Hindi (HI). Learn more about this project at the LSCP webpage.
MULTEXT-East - The MULTEXT-East morphosyntactic lexicons have a simple structure, where each line is a lexical entry with three tab-separated fields: (1) the word-form, the inflected form of the word; (2) the lemma, the base-form of the word; (3) the MSD, the morphosyntactic description of the word-form, i.e., its fine-grained PoS tag, as defined in the MULTEXT-East morphosyntactic specifications.
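
The three-field lexicon layout described for MULTEXT-East (word-form, lemma, MSD, separated by tabs) can be read with a few lines of Python. The sketch below is an assumption-laden illustration: the names `LexEntry`, `parse_entry`, and `group_by_lemma` are hypothetical, and the sample entries use English placeholder words and made-up MSD strings rather than real lexicon data.

```python
from collections import defaultdict
from typing import Dict, Iterable, List, NamedTuple


class LexEntry(NamedTuple):
    word_form: str  # the inflected form of the word
    lemma: str      # the base form of the word
    msd: str        # the morphosyntactic description (fine-grained PoS tag)


def parse_entry(line: str) -> LexEntry:
    """Parse one tab-separated lexicon line into a LexEntry."""
    fields = line.rstrip("\n").split("\t")
    if len(fields) != 3:
        raise ValueError(f"expected 3 tab-separated fields, got {len(fields)}")
    return LexEntry(*fields)


def group_by_lemma(lines: Iterable[str]) -> Dict[str, List[LexEntry]]:
    """Index entries by lemma so all inflected forms can be looked up together."""
    index: Dict[str, List[LexEntry]] = defaultdict(list)
    for line in lines:
        if line.strip():  # skip empty lines
            entry = parse_entry(line)
            index[entry.lemma].append(entry)
    return dict(index)


if __name__ == "__main__":
    # Placeholder entries; the MSD strings here are invented for illustration.
    sample = ["walked\twalk\tTAG1\n", "walks\twalk\tTAG2\n"]
    print(group_by_lemma(sample))
```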

Text Categorization and Classification

Hamshahri - Hamshahri collection is a standard reliable Persian text collection that was used at Cross Language Evaluation Forum (CLEF) during years 2008 and 2009 for evaluation of Persian information retrieval systems.
Bijankhan Corpus - Bijankhan corpus is a tagged corpus that is suitable for natural language processing research on the Persian (Farsi) language. This collection is gathered from daily news and common texts. In this collection all documents are categorized into different subjects such as political, cultural and so on. In total, there are 4300 different subjects. The Bijankhan collection contains about 2.6 million manually tagged words with a tag set that contains 40 Persian POS tags.
Digikala Magazine (DigiMag) - A total of 8,515 articles scraped from Digikala Online Magazine. This dataset includes seven different classes: Video Games, Shopping Guide, Health Beauty, Science Technology, General, Art Cinema, and Books Literature.
Persian News - A dataset of various news articles scraped from different online news agencies’ websites. The total number of articles is 16,438, spread over eight different classes, including Economic, International, Political, Science Technology, Cultural Art, Sport, and Medical.

Spell Checking

FAspell - FASpell dataset was developed for the evaluation of spell checking algorithms. It contains a set of pairs of misspelled Persian words and their corresponding corrected forms similar to the ASpell dataset used for English.
Persian-Spell-checker - A dictionary of Persian words (verbs, nouns, etc.) being collected for a Persian spell checker.
Grammar and context sensitive spell checker - A real-world test set of grammatical errors and context-sensitive spelling errors for the Persian language. The test set contains 1100 context-sensitive errors collected from Persian blogs.
Spell Checker - Test set of spelling errors for the Persian language.
CPG: Corpus of Persian Grammatical Errors - A fully annotated corpus of grammatical errors collected from 700 essays written by learners of the Persian language at Dehkhoda Lexicon Institute & International Centre for Persian Studies and Imam Khomeini International University.
PerSpellData - Comprehensive parallel dataset for Persian non-word and real-word errors.
HeKasre - Detect and correct misspelled "e" sound in Persian (aka Farsi) writing (especially in an informal setting).
Lilak - Persian Spell Checking Dictionary.

Textual Entailment

FarsTail - FarsTail is a dataset of textual entailment (also known as natural language inference, NLI) and it includes 10,367 samples in the Persian language. Here is the relevant paper.

Textual Thematic Similarity

Wikipedia Section Sentences - It implements the idea from Dor et al., 2018, Learning Thematic Similarity Metric Using Triplet Networks for Persian. The dataset includes 205,768 examples that covered 21,515 articles and 34,298 sections.
Wiki Triplet - A triplet-objective dataset extracted from Wikipedia Section Sentences into a triplet-form of anchor ( $a$ ), positive ( $p$ ) and negative ( $n$ ) examples. It covers 191,929 samples.
Wiki D-Similar - Wiki D-Similar is another form of thematic similarity dataset with 137,402 records that tags pairs of sentences into a form of similar or dissimilar.
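
Triplet datasets such as Wiki Triplet are typically used with a triplet objective, which pulls the anchor $a$ toward the positive $p$ and pushes it away from the negative $n$. The sketch below shows one common formulation, a margin-based loss over Euclidean distances on plain Python lists. It is a generic illustration, not the training code or the exact loss used for these datasets; the margin value and function names are assumptions.

```python
import math
from typing import Sequence


def euclidean(u: Sequence[float], v: Sequence[float]) -> float:
    """Euclidean distance between two equal-length vectors."""
    return math.sqrt(sum((x - y) ** 2 for x, y in zip(u, v)))


def triplet_margin_loss(
    anchor: Sequence[float],
    positive: Sequence[float],
    negative: Sequence[float],
    margin: float = 1.0,  # assumed default, chosen for illustration
) -> float:
    """max(0, d(a, p) - d(a, n) + margin)."""
    return max(
        0.0,
        euclidean(anchor, positive) - euclidean(anchor, negative) + margin,
    )


if __name__ == "__main__":
    a, p = [0.0, 0.0], [0.0, 1.0]
    # The negative is far from the anchor, so the loss is zero.
    print(triplet_margin_loss(a, p, [3.0, 4.0]))
    # The negative is almost as close as the positive, so the loss is positive.
    print(triplet_margin_loss(a, p, [0.0, 1.5]))
```

In practice the vectors would be sentence embeddings produced by an encoder, and the loss would be computed in a deep learning framework over batches of triplets.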

Persian Poems And Classic Texts

Farsi Poem Corpus - This corpus consists of text documents for 48 Persian poets. The corpus comes in three formats: original, normalized (only the 32 main letters of the Farsi alphabet), and stop words removed. The corpus consists of 1,216,286 mesras of Farsi poems and 8,102,119 words, of which 148,588 are unique.

Sentiment Analysis

NRC-Persian-Lexicon - NRC Word-Emotion Association Lexicon useful for Persian sentiment analysis.
Digikala OpenData Sentiment - Contains 100,000 samples of users' comments on products, labeled as no_idea, not_recommended, and recommended.
Pars-ABSA - Manually annotated Persian dataset, verified by 3 native Persian speakers. The dataset consists of 5,114 positive, 3,061 negative and 1,827 neutral data samples from 5,602 unique reviews.
PerSent - This dataset presents real-valued polarity labels, in the range from -1 to 1, for thousands of Persian words and expressions.
DeepSentiPers - Novel deep learning models trained over a proposed augmented Persian sentiment corpus.
SentiPers - Documents in SentiPers are manually annotated at different levels.
LexiPers - An ontology based sentiment lexicon for Persian.
SnappFood - Snappfood (an online food delivery company) user comments containing 70,000 comments with two labels (i.e. polarity classification), Happy and Sad.
MirasOpinion - This repository contains information about MirasOpinion dataset, which is the largest Persian sentiment analysis dataset up to this date, alongside a demo file which contains 20 documents with their corresponding labels.
EmoPars - A collection of 30K emotion-annotated Persian social media texts.
ArmanEmo - A non-commercial version of ArmanEmo, a high-quality human-labeled emotion dataset of more than 7000 Persian sentences labeled for seven categories. The labels are based on Ekman's six basic emotions (Anger, Fear, Happiness, Hatred, Sadness, Wonder) plus another category (Other).
Persian tweets emotional dataset.

Summarization

Wiki Summary - Wiki Summary is a summarization dataset extracted from Persian Wikipedia into the form of articles and highlights.

Question Answering

PersianQA - Persian Question Answering (PersianQA) Dataset is a reading comprehension dataset on Persian Wikipedia. The crowd-sourced dataset consists of more than 9,000 entries. Each entry can be either an impossible-to-answer or a question with one or more answers spanning in the passage (the context) from which the questioner proposed the question. Much like the SQuAD2.0 dataset, the impossible or unanswerable questions can be utilized to create a system which "knows that it doesn't know the answer".
ParsVQA-Caps - A Benchmark for Visual Question Answering and Image Captioning in Persian.
MeDiaPQA - A Question-Answering Dataset on Persian Medical Dialogues.
Persian QA Wikipedia - A Question Answering Dataset on Wikipedia paragraphs.

Irony - Insult

MirasIrony - The irony dataset is constructed from Persian tweets. 2942 tweets are labeled in total.

Corpus

TEP: Tehran English-Persian Parallel Corpus - First free English-Persian corpus.
OPUS: the open parallel corpus - OPUS is a growing collection of translated texts from the web. In the OPUS project we try to convert and align free online data, to add linguistic annotation, and to provide the community with a publicly available parallel corpus. OPUS is based on open source products and the corpus is also delivered as an open content package. We used several tools to compile the current collection. All pre-processing is done automatically. No manual corrections have been carried out.
Tanzil project - Tanzil project is a collection of Quran translations to many languages, including Persian.
Large-Scale Colloquial Persian - Large Scale Colloquial Persian Dataset (LSCP) is hierarchically organized in a semantic taxonomy that focuses on multi-task informal Persian language understanding as a comprehensive problem. LSCP includes 120M sentences from 27M casual Persian tweets with its dependency relations in syntactic annotation, part-of-speech tags, sentiment polarity, and automatic translation of the original Persian sentences into English (EN), German (DE), Czech (CS), Italian (IT), and Hindi (HI). Learn more about this project at the LSCP webpage.
MIZAN - a Persian-English parallel corpus with about 1 million sentence pairs collected from masterpieces of literature. Here is the relevant paper.
PEPC: Parallel English-Persian Corpus Extracted from Wikipedia - a collection of parallel sentences in English and Persian languages extracted from Wikipedia documents using a bidirectional translation method.
Bible corpus - A multilingual parallel corpus created from translations of the Bible. The corpus also contains a Persian Bible.
Transliteration - Transliteration data extracted from a Persian novel that is written in both Arabic and Dabire. The sentences of the book were manually aligned according to the first and last words of each sentence. Then, to check for and eliminate errors, the sentence lengths on both sides were compared. The resulting parallel corpus contains 13933 sentence pairs, 155623 words in the Persian text, and 170702 Dabire words.
TMC: Tehran Monolingual Corpus - The Tehran Monolingual Corpus (TMC) is a large-scale Persian monolingual corpus. TMC is suited for Language Modeling and relevant research areas in Natural Language Processing. The corpus is extracted from Hamshahri Corpus and ISNA news agency website. The quality of Hamshahri corpus is improved for language modeling purpose by a series of tokenization and spell-checking steps.
VOA Persian Corpus - A medium-sized corpus of 7.9 million words, 2003-2008. The corpus is in the public domain, so no copyright restrictions.
MirasText: Automatically Extracted Text Persian Corpus (about 12GB).
A large collection of Persian raw text - About 80GB Persian raw text, collected from a variety of sources, particularly CommonCrawl.
W2C – Web to Corpus – Corpora - A set of corpora for 120 languages automatically collected from Wikipedia and the web.
dotIR Collection - dotIR is a standard Persian test collection that is suitable for evaluating web information retrieval algorithms on the Iranian web. dotIR contains many Persian web pages, including their text, links, metadata, etc., stored in XML format. It is prepared so as to be a good representative of the Iranian web. It is a good test bed for evaluating link-based information retrieval algorithms. It includes enough queries and relevance judgments for a valid evaluation. It is not very large, so it does not require high processing resources.
Hamshahri - Hamshahri collection is a standard reliable Persian text collection that was used at Cross Language Evaluation Forum (CLEF) during years 2008 and 2009 for evaluation of Persian information retrieval systems.
Parallel Gold Data from Wikipedia - This dataset contains parallel sentences, which are tagged from 33 Wikipedia pages.
PREDICT - Persian REverse DICTionary.
Iranian politicians twitter dataset persian.
iPerUDT - Informal Persian Universal Dependency Treebank.
Tasnim News - Tasnim news dataset.
Asriran News - Asriran news dataset.
Isna News Agency - Isna news agency dataset.
Ensani.ir Abstracts - Abstracts extracted from ensani.ir.
Tarjoman Articles - Extracted Tarjoman Articles.

Stop Word Lists

Persian stopwords collection - A collection of Persian stop word lists.
Hazm stop words - Stop words list, good for IR.
mhbashari stopword list - Experimental list of stopwords that is suitable for topic modelling and word embedding.
Persian, Arabic and English stopwords - A collection of stopwords on three languages.
Pers_Word - Persian stopwords generated with fastText.
Persian Swear Words - This is a to-be-complete list of Persian Swears you can use in your production to filter unwanted content. Wordlist is available in JSON format.

Knowledge Bases

FarsBase - FarsBase is the first Persian multi-source knowledge graph, which is specifically designed for semantic search engines to support Persian knowledge. FarsBase uses a diverse set of hybrid and flexible techniques to extract and integrate knowledge from various sources, such as Wikipedia, Web tables and unstructured texts. Here is the relevant paper.

Intent Detection & Slot filling

Persian-ATIS - A Persian Benchmark for Joint Intent Detection and Slot Filling.

Paraphrasing

ExaPPC - A Large-Scale Persian Paraphrase Detection Corpus.

Text-to-Speech(TTS)

A Farsi to Finglish dataset.

MISC

ParsiNLU - A collection of natural language understanding datasets for Persian. Here is the relevant paper.
PersianStemmingDataset - PersianStemmingDataset consists of two manually stemmed Persian corpora and an evaluation tool for computing stemming evaluation metrics.
PersPred - PersPred is the first online multilingual syntactic and semantic database of Persian compound verbs (complex predicates), developed by the members of the research unit Mondes iranien et indien (CNRS, Sorbonne Nouvelle, Inalco, EPHE) within the ANR-DFG project PERGRAM (2008-2012) and the LR4.1 work package of the Strand 6 of the Labex Empirical Foundations of Linguistics (EFL).
Popularity Prediction - The Tabnak and Alef datasets, collected from the most famous online news agencies in Iran. The dataset includes the content, title, date, category, and number of comments for each news item. Besides their popularity, these websites cover a wide range of news categories and have a multilevel commenting structure.
Conversation Threads Prediction - It consists of five datasets crawled from five websites: Thestandard, Alef, ENENews, Russianblog, and Courantblogs (XML format).
Iran and COVID-19 on Social Media - Content analysis of Persian tweets during the COVID-19 pandemic in Iran using NLP.
SBU-WSD-Corpus - A Sense Annotated Corpus for Persian All-Words Word Sense Disambiguation.

Models

Named Entity Recognition

ParsBERT-NER - A model based on ParsBERT (a monolingual Persian language model), fine-tuned on the PEYMA, ARMAN, and PEYMA+ARMAN datasets.
ALBERT-NER - A model based on the ALBERT language model, fine-tuned on the PEYMA and ARMAN datasets.

Text Classification

Sentiment Analysis

Summarization

BERT2BERT - BERT2BERT is the first pre-trained summarization model trained on Wiki Summary based on ParsBERT.

Question Answering

Embeddings

Farsi Poem word2vec model - This is a word2vec model developed based on a corpus of 48 Persian poets. The corpus consists of 1,216,286 mesras of Farsi poems and 8,102,119 words, of which 148,588 are unique.
Sentence Transformers - ST is a collection of vector representations for sentences and paragraphs (also known as sentence embeddings). ST models are based on transformer networks like ParsBERT, ALBERT (soon). They are tuned based on Textual Thematic Similarity datasets such that sentences with similar meanings are close in vector space.

Language Model

ParsBERT: Transformer-based Model for Persian Language Understanding - It is a monolingual language model based on Google’s BERT architecture for the Persian language only. This model is pre-trained on a large Persian corpus with various writing styles from numerous subjects (e.g., scientific, novels, news) with more than 2M documents. A large subset of this corpus was crawled manually.
ALBERT: A Lite BERT for Self-supervised Learning of Language Representations for the Persian Language - This is the first attempt at ALBERT for the Persian language. The model was trained based on Google's ALBERT BASE Version 2.0 over various writing styles from numerous subjects (e.g., scientific, novels, news) with more than 3.9M documents, 73M sentences, and 1.3B words, as was done for ParsBERT.

Grapheme to Phoneme

g2p_fa - A Persian Grapheme to Phoneme model using LSTM, implemented in PyTorch.
Persian_g2p - A seq-to-seq model for Persian (Farsi) Grapheme To Phoneme mapping.
G2P - Attention Based Grapheme To Phoneme.

Repositories

Summarization

Persian-Summarization - Statistical and semantical text summarizer in Persian language.

Sentiment

Persian Sentiment Analysis - Persian sentiment analysis ( آناکاوی سهش های فارسی | تحلیل احساسات فارسی ) is a simple, ready-to-use project that uses Python to create the model. It also includes a very good IPython tutorial.

sir-kokabi/awesome-persian-nlp-ir

