How do I tokenize Persian text in Python?

Shekar provides WordTokenizer and SentenceTokenizer. Example: from shekar import WordTokenizer; tokenizer = WordTokenizer(); tokens = list(tokenizer(text)).

Shekar Python Library

Simplifying Persian NLP for Modern Applications

Shekar is an open-source Python library for Persian (Farsi) Natural Language Processing. Its name, which means "sugar" in Persian, is inspired by the satirical short story فارسی شکر است (Persian is Sugar) by Mohammad Ali Jamalzadeh. It provides fast, modular, production-ready tools for Persian text processing in a single lightweight package: normalization, tokenization, POS tagging, NER, embeddings, spell checking, sentiment analysis, and dependency parsing.

Installation

Install Shekar via pip. Works on Windows, Linux, and macOS (including Apple Silicon M1/M2/M3).

$ pip install shekar

For NVIDIA GPU acceleration:

$ pip install shekar && pip uninstall -y onnxruntime && pip install onnxruntime-gpu
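
After swapping packages, it can be useful to confirm which execution providers ONNX Runtime actually sees. The helper below is a minimal sketch and not part of Shekar: `available_providers` and `gpu_available` are hypothetical names, and the check assumes Shekar's models run through the standard `onnxruntime` package installed above.

```python
import importlib.util


def available_providers():
    """Return the ONNX Runtime execution providers visible in this environment.

    Returns an empty list when onnxruntime is not installed at all.
    """
    if importlib.util.find_spec("onnxruntime") is None:
        return []
    import onnxruntime as ort

    return list(ort.get_available_providers())


def gpu_available():
    """True when the CUDA execution provider is listed (hypothetical helper)."""
    return "CUDAExecutionProvider" in available_providers()


if __name__ == "__main__":
    print(available_providers())
    print("GPU acceleration available:", gpu_available())
```

If only `CPUExecutionProvider` is listed after installing `onnxruntime-gpu`, the usual suspects are a missing CUDA runtime or a leftover CPU-only `onnxruntime` install.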

Why Shekar?

Advanced Normalization

Follows official guidelines from the Academy of Persian Language and Literature. Handles Arabic Unicode variants, diacritics, spacing, punctuation, and informal writing.

Blazing Fast

Optimized for large-scale batch processing and real-time use. Lightweight ONNX models enable fast CPU inference with minimal dependencies.

Modular Pipelines

Compose independent preprocessing operators using the | operator. Build custom pipelines for any task without touching what you don't need.

Production-Ready

Backed by hundreds of test cases and 95%+ code coverage. Used in both research and real-world applications.

Transformer-Based Models

POS tagging, NER, dependency parsing, and classification are powered by fine-tuned ALBERT models, quantized for efficient CPU inference.

Built-In Web Interface

Explore all NLP capabilities interactively with shekar serve -p 8080. No coding required.

Capabilities

Shekar covers the full Persian NLP pipeline in one package:

Text Normalization
Word Tokenization
Sentence Tokenization
POS Tagging
Named Entity Recognition
Word Embeddings (FastText)
Contextual Embeddings (ALBERT)
Stemming
Lemmatization
Spell Checking
Dependency Parsing
Sentiment Analysis
Offensive Language Detection
Informal Language Classification
Keyword Extraction
Transliteration (Farsi ↔ Tajik)
Text Augmentation
Word Cloud

Examples

Text Normalization

Normalizes Persian text following official Academy of Persian Language guidelines. Handles Arabic Unicode variants, spacing, diacritics, and informal writing.

from shekar import Normalizer

normalizer = Normalizer()
text = "«فارسی شِکَر است » نام داستان ڪوتاه طنز    آمێزی از محمد علی جمالــــزاده می   باشد."
print(normalizer(text))
# «فارسی شکر است» نام داستان کوتاه طنزآمیزی از محمد‌علی جمالزاده می‌باشد.
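
To make the kinds of rules involved more concrete, here is a small standard-library sketch of three of them: mapping Arabic letter variants to their Persian forms, stripping short-vowel diacritics, and collapsing whitespace. It is not Shekar's implementation, `simple_normalize` is a hypothetical name, and the real `Normalizer` covers far more cases (for example the spacing and half-space fixes visible in the output above).

```python
import re

# Arabic code points that often appear in place of their Persian equivalents.
ARABIC_TO_PERSIAN = str.maketrans({
    "\u064a": "\u06cc",  # ARABIC LETTER YEH -> ARABIC LETTER FARSI YEH
    "\u0643": "\u06a9",  # ARABIC LETTER KAF -> ARABIC LETTER KEHEH
})

TATWEEL = "\u0640"  # kashida, used only to stretch a word visually

# Fathatan through sukun: the short-vowel and related marks (harakat).
DIACRITICS = re.compile("[\u064b-\u0652]")


def simple_normalize(text):
    """Illustrative normalizer covering three rules; NOT Shekar's Normalizer."""
    text = text.translate(ARABIC_TO_PERSIAN)
    text = text.replace(TATWEEL, "")
    text = DIACRITICS.sub("", text)
    # Collapse any run of whitespace into a single space.
    return re.sub(r"\s+", " ", text).strip()


print(simple_normalize("شِکَر   است"))
```

Keeping the character mapping in a translation table means new variants can be added as data rather than as extra code paths.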

Tokenization

Word and sentence tokenizers using Unicode-aware rules for Persian text.

from shekar import WordTokenizer

tokenizer = WordTokenizer()
tokens = list(tokenizer("چه سیب‌های قشنگی! حیات نشئهٔ تنهایی است."))
print(tokens)
# ['چه', 'سیب‌های', 'قشنگی', '!', 'حیات', 'نشئهٔ', 'تنهایی', 'است', '.']
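
"Unicode-aware" matters here because Persian words can contain a zero-width non-joiner (U+200C) and combining marks that a naive split on whitespace or on `\w` alone would mishandle. The regex below is a rough sketch of that idea under simplifying assumptions; `simple_word_tokenize` is a hypothetical name and this is not how Shekar's `WordTokenizer` is implemented.

```python
import re

# A token is either:
#   - a run of word characters, plus ZWNJ (U+200C) and Arabic combining marks
#     (U+064B..U+065F), neither of which is matched by \w on its own, or
#   - a single punctuation character (anything that is not a word char or space).
TOKEN_PATTERN = re.compile(r"[\w\u200c\u064b-\u065f]+|[^\w\s]")


def simple_word_tokenize(text):
    """Split text into word and punctuation tokens (illustrative only)."""
    return TOKEN_PATTERN.findall(text)


print(simple_word_tokenize("چه سیب‌های قشنگی!"))
```

Because the ZWNJ is part of the word class, a form such as سیب‌های stays a single token instead of being split at the half-space.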

Part-of-Speech Tagging

Transformer-based POS tagger (ALBERT) following Universal Dependencies tags.

from shekar import POSTagger

pos_tagger = POSTagger()
result = pos_tagger("نوروز جشن سال نو ایرانی است.")
for word, tag in result:
    print(f"{word}: {tag}")
# نوروز: PROPN  |  جشن: NOUN  |  سال: NOUN  |  نو: ADJ  |  ایرانی: ADJ  |  است: AUX

Named Entity Recognition (NER)

Detects persons, locations, organizations, and dates using a quantized ALBERT model.

from shekar import NER

ner = NER()
entities = ner("شاهرخ مسکوب به سال ۱۳۰۴ در بابل زاده شد و در دانشگاه تهران تحصیل کرد.")
for text, label in entities:
    print(f"{text} → {label}")
# شاهرخ مسکوب → PER  |  سال ۱۳۰۴ → DAT  |  بابل → LOC  |  دانشگاه تهران → ORG
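
Since the tagger yields plain (text, label) pairs, downstream code often just needs them grouped by label. The helper below is a small sketch of that post-processing step; `group_by_label` is a hypothetical name, and the sample pairs are copied from the example output above so that no model is needed to run it.

```python
from collections import defaultdict


def group_by_label(entities):
    """Group (text, label) pairs into {label: [text, ...]}, preserving order."""
    grouped = defaultdict(list)
    for text, label in entities:
        grouped[label].append(text)
    return dict(grouped)


# Pairs taken from the NER example output; stands in for ner(...) here.
sample = [
    ("شاهرخ مسکوب", "PER"),
    ("سال ۱۳۰۴", "DAT"),
    ("بابل", "LOC"),
    ("دانشگاه تهران", "ORG"),
]

for label, texts in group_by_label(sample).items():
    print(label, texts)
```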

Word Embeddings

Static FastText embeddings (100d or 300d) and contextual ALBERT embeddings (768d).

from shekar.embeddings import WordEmbedder

embedder = WordEmbedder(model="fasttext-d100")
similar = embedder.most_similar("کتاب", top_n=5)
print(similar)
# [('کتابچه', 0.91), ('نشریه', 0.89), ('رمان', 0.88), ...]
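
Nearest-neighbour lookups over static embeddings are commonly ranked by cosine similarity. Whether `most_similar` in Shekar does exactly this is an assumption; the sketch below only illustrates the general technique, using invented three-dimensional toy vectors and hypothetical function names.

```python
import math


def cosine_similarity(u, v):
    """Cosine of the angle between two equal-length vectors."""
    dot = sum(a * b for a, b in zip(u, v))
    norm_u = math.sqrt(sum(a * a for a in u))
    norm_v = math.sqrt(sum(b * b for b in v))
    if norm_u == 0 or norm_v == 0:
        return 0.0  # avoid division by zero for all-zero vectors
    return dot / (norm_u * norm_v)


def rank_neighbours(query, vectors, top_n=5):
    """Rank every other word in `vectors` by cosine similarity to `query`."""
    target = vectors[query]
    scored = [
        (word, cosine_similarity(target, vec))
        for word, vec in vectors.items()
        if word != query
    ]
    return sorted(scored, key=lambda pair: pair[1], reverse=True)[:top_n]


# Toy vectors invented for illustration; real embeddings have 100+ dimensions.
toy_vectors = {
    "book": [1.0, 0.9, 0.0],
    "novel": [0.9, 1.0, 0.1],
    "river": [0.0, 0.1, 1.0],
}
print(rank_neighbours("book", toy_vectors, top_n=2))
```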

Spell Checking

Detects and corrects misspelled Persian words with ranked suggestions.

from shekar import SpellChecker

spell_checker = SpellChecker()
print(spell_checker("سسلام بر ششما ددوست من"))
# سلام بر شما دوست من
print(spell_checker.suggest("درود"))
# ['درود', 'درصد', 'ورود', 'درد', 'درون']
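
"Ranked suggestions" means candidate words are ordered by how close they are to the input. As a rough illustration of that idea only, the sketch below ranks candidates with the standard library's `difflib`; this is a swapped-in technique, not Shekar's algorithm, and `rank_candidates` and the tiny vocabulary are hypothetical.

```python
import difflib


def rank_candidates(word, vocabulary, top_n=3):
    """Return up to top_n vocabulary entries most similar to `word`, best first."""
    return difflib.get_close_matches(word, vocabulary, n=top_n, cutoff=0.6)


# A tiny hand-picked vocabulary, for illustration only.
vocabulary = ["سلام", "سلامت", "کلام", "شما", "دوست"]
print(rank_candidates("سسلام", vocabulary))
```

A real spell checker would also weigh word frequency and context, which pure string similarity ignores.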

Sentiment Analysis

Binary sentiment classifier (positive / negative) using a fine-tuned ALBERT model.

from shekar.classification import SentimentClassifier

classifier = SentimentClassifier()
print(classifier("سریال قصه‌های مجید عالی بود!"))
# ('positive', 0.992)
print(classifier("فیلم ۳۰۰ افتضاح بود!"))
# ('negative', 0.933)

Dependency Parsing

Syntactic dependency parsing following Universal Dependencies, with a printable tree visualization.

from shekar import DependencyParser

parser = DependencyParser()
result = parser("ما با آنچه می‌سازیم ایرانی هستیم.")
parser.print_tree(result)
# ROOT
# └── [root] هستیم
#     ├── [nsubj] ما
#     ├── [obl] آنچه
#     │   ├── [case] با
#     │   └── [acl] می‌سازیم
#     ├── [xcomp] ایرانی
#     └── [punct] .

Custom Preprocessing Pipelines

Compose any preprocessing operators with | to build lightweight, task-specific pipelines.

from shekar.preprocessing import EmojiRemover, PunctuationRemover, StopWordRemover

pipeline = EmojiRemover() | PunctuationRemover() | StopWordRemover()
print(pipeline("ز ایران دلش یاد کرد و بسوخت! 🌍🇮🇷"))
# ایران دلش یاد کرد بسوخت
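
For readers curious how `|` composition can work in Python at all, the sketch below shows one common way to get it: overloading `__or__` so that `a | b` returns a new callable that runs `a` and then `b`. The `Step` class and the two toy operators are hypothetical and are not Shekar's classes; how Shekar implements its operators internally is not shown here.

```python
class Step:
    """Minimal callable text operator that can be chained with `|`."""

    def __init__(self, func):
        self.func = func

    def __call__(self, text):
        return self.func(text)

    def __or__(self, other):
        # a | b builds a new Step that runs a, then feeds its output to b.
        return Step(lambda text: other(self(text)))


# Two toy operators, defined only for this illustration.
strip_exclamations = Step(lambda s: s.replace("!", ""))
collapse_spaces = Step(lambda s: " ".join(s.split()))

pipeline = strip_exclamations | collapse_spaces
print(pipeline("hello   world!"))  # hello world
```

Because each `|` returns another `Step`, chains of any length compose left to right without a separate pipeline class.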

Frequently Asked Questions

What is the Shekar Python library?

Shekar is an open-source Python library for Persian (Farsi) NLP. It covers the full pipeline from text normalization and tokenization to transformer-based POS tagging, NER, dependency parsing, and sentiment analysis, all in a single lightweight package.

How do I install Shekar?

Run pip install shekar in your terminal. It works on Windows, Linux, and macOS, including Apple Silicon. Requires Python 3.8+.

How do I normalize Persian text in Python?

Use from shekar import Normalizer; n = Normalizer(); result = n(text). The normalizer applies rules from the Academy of Persian Language and Literature, handling Arabic Unicode variants, diacritics, spacing, punctuation, and informal register.

Is there a Persian NLP library that supports POS tagging and NER?

Yes. Shekar includes transformer-based (ALBERT) POS tagging following Universal Dependencies tags, and a NER module detecting persons, locations, organizations, and dates in Persian text.

Is Shekar free and open source?

Yes. Shekar is released under the MIT License and is free for both research and commercial use. Source code: github.com/amirivojdan/shekar.

How do I cite Shekar in a research paper?

Cite the JOSS paper: Amirivojdan, A. (2025). Shekar: A Python Toolkit for Persian Natural Language Processing. Journal of Open Source Software, 10(114), 9128. DOI: 10.21105/joss.09128.

Citation

If you use Shekar in your research, please cite:

@article{Amirivojdan_Shekar,
  author  = {Amirivojdan, Ahmad},
  doi     = {10.21105/joss.09128},
  journal = {Journal of Open Source Software},
  month   = oct,
  number  = {114},
  pages   = {9128},
  title   = {{Shekar: A Python Toolkit for Persian Natural Language Processing}},
  url     = {https://joss.theoj.org/papers/10.21105/joss.09128},
  volume  = {10},
  year    = {2025}
}

Links

GitHub Repository
PyPI Package
Documentation
JOSS Paper