Shekar Python Library
Simplifying Persian NLP for Modern Applications
Shekar is an open-source Python library for Persian (Farsi) Natural Language Processing, inspired by Mohammad Ali Jamalzadeh's short story فارسی شکر است (Persian is Sugar). It provides fast, modular, and production-ready tools for Persian text processing in a single lightweight package: normalization, tokenization, POS tagging, NER, embeddings, spell checking, sentiment analysis, and dependency parsing.
Installation
Install Shekar with pip by running pip install shekar. It works on Windows, Linux, and macOS (including Apple Silicon M1/M2/M3).
For NVIDIA GPU acceleration:
Why Shekar?
Advanced Normalization
Follows official guidelines from the Academy of Persian Language and Literature. Handles Arabic Unicode variants, diacritics, spacing, punctuation, and informal writing.
Blazing Fast
Optimized for large-scale batch processing and real-time use. Lightweight ONNX models enable fast CPU inference with minimal dependencies.
Modular Pipelines
Compose independent preprocessing operators using the | operator. Build custom pipelines for any task without touching what you don't need.
Production-Ready
Backed by hundreds of test cases and 95%+ code coverage. Used in both research and real-world applications.
Transformer-Based Models
POS tagging, NER, dependency parsing, and classification are powered by fine-tuned ALBERT models, quantized for efficient CPU inference.
Built-In Web Interface
Explore all NLP capabilities interactively with shekar serve -p 8080. No coding required.
Capabilities
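Shekar covers the full Persian NLP pipeline in one package: normalization, tokenization, POS tagging, NER, embeddings, spell checking, sentiment analysis, and dependency parsing.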
Examples
Text Normalization
Normalizes Persian text following the official guidelines of the Academy of Persian Language and Literature. Handles Arabic Unicode variants, spacing, diacritics, and informal writing.
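To make "Arabic Unicode variants" and "diacritics" concrete, here is a minimal standard-library sketch of that kind of cleanup. It is not Shekar's implementation and does not reproduce the Academy's rules; the function name simple_normalize and the small mapping table are assumptions made for illustration only.

```python
import re

# Arabic code points that often appear in Persian text, mapped to Persian forms.
# This table is illustrative and far from complete.
ARABIC_TO_PERSIAN = str.maketrans({
    "\u064A": "\u06CC",  # ARABIC LETTER YEH -> ARABIC LETTER FARSI YEH
    "\u0643": "\u06A9",  # ARABIC LETTER KAF -> ARABIC LETTER KEHEH
})

# Arabic short-vowel and related marks, U+064B (fathatan) through U+0652 (sukun).
DIACRITICS = re.compile("[\u064B-\u0652]")


def simple_normalize(text: str) -> str:
    """Map Arabic letter variants, strip diacritics, and collapse spaces."""
    text = text.translate(ARABIC_TO_PERSIAN)
    text = DIACRITICS.sub("", text)
    text = re.sub(r"[ \t]+", " ", text)
    return text.strip()


# A word typed with an Arabic kaf (U+0643) comes back with a Persian kaf (U+06A9).
print(simple_normalize("\u0643\u062A\u0627\u0628"))
```

A full normalizer also has to deal with zero-width non-joiners, punctuation, and informal spellings, which this sketch leaves out.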
Tokenization
Word and sentence tokenizers using Unicode-aware rules for Persian text.
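As a rough picture of what "Unicode-aware rules" can mean for Persian, the sketch below splits sentences on Latin and Persian end marks and keeps the zero-width non-joiner (U+200C) inside words. The regular expressions and the names split_sentences and split_words are assumptions for illustration, not Shekar's tokenizers.

```python
import re

# Split after ".", "!", "?" or the Arabic question mark (U+061F) when followed by whitespace.
SENTENCE_END = re.compile(r"(?<=[.!?\u061F])\s+")

# A word is a run of Unicode word characters, with the zero-width
# non-joiner (U+200C) allowed inside, as in "می‌روم".
WORD = re.compile(r"[\w\u200C]+")


def split_sentences(text: str) -> list:
    """Return the sentences of text, keeping their end punctuation."""
    return [s for s in SENTENCE_END.split(text.strip()) if s]


def split_words(sentence: str) -> list:
    """Return the words of a sentence, dropping punctuation."""
    return WORD.findall(sentence)


text = "سلام دنیا. حال شما چطور است؟ من می‌روم!"
for sentence in split_sentences(text):
    print(split_words(sentence))
```

Rules this simple break on abbreviations, decimal numbers, and text that carries diacritics, which is where a dedicated tokenizer earns its keep.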
Part-of-Speech Tagging
Transformer-based POS tagger (ALBERT) following Universal Dependencies tags.
Named Entity Recognition (NER)
Detects persons, locations, organizations, and dates using a quantized ALBERT model.
Word Embeddings
Static FastText embeddings (100d or 300d) and contextual ALBERT embeddings (768d).
Spell Checking
Detects and corrects misspelled Persian words with ranked suggestions.
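The idea of "ranked suggestions" can be sketched with Python's standard difflib module, which orders candidates by string similarity. This swaps in a generic similarity ranking and says nothing about how Shekar's spell checker works internally; the suggest function and the tiny vocabulary are hypothetical.

```python
from difflib import get_close_matches


def suggest(word, vocabulary, n=3, cutoff=0.6):
    """Return up to n vocabulary entries, most similar to word first."""
    return get_close_matches(word, vocabulary, n=n, cutoff=cutoff)


# Hypothetical mini-vocabulary; a real checker would use a large lexicon.
VOCABULARY = ["کتاب", "حساب", "کتابخانه", "مدرسه"]

# "کتاپ" is a misspelling of "کتاب" (final letter swapped).
print(suggest("کتاپ", VOCABULARY))
```

The cutoff parameter controls how dissimilar a candidate may be before it is dropped; raising it gives fewer, safer suggestions.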
Sentiment Analysis
Binary sentiment classifier (positive / negative) using a fine-tuned ALBERT model.
Dependency Parsing
Syntactic dependency parsing following Universal Dependencies, with a printable tree visualization.
Custom Preprocessing Pipelines
Compose any preprocessing operators with | to build lightweight, task-specific pipelines.
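For readers curious how composition with | can work in Python, the following sketch overloads __or__ so that two callable operators join into a pipeline. The classes Operator, Pipeline, RemovePunctuation, and CollapseSpaces are hypothetical stand-ins written for this example; they are not Shekar's classes and may differ from its design.

```python
import re


class Operator:
    """Hypothetical base class: a text transform that composes with |."""

    def __call__(self, text: str) -> str:
        raise NotImplementedError

    def __or__(self, other: "Operator") -> "Pipeline":
        return Pipeline([self, other])


class Pipeline(Operator):
    """Runs its steps in order, feeding each output to the next step."""

    def __init__(self, steps):
        self.steps = list(steps)

    def __call__(self, text: str) -> str:
        for step in self.steps:
            text = step(text)
        return text

    def __or__(self, other: Operator) -> "Pipeline":
        # Extending an existing pipeline keeps the step list flat.
        return Pipeline(self.steps + [other])


class RemovePunctuation(Operator):
    def __call__(self, text: str) -> str:
        # Latin marks plus the Arabic comma, semicolon, and question mark.
        return re.sub(r"[!?.,\u060C\u061B\u061F]", "", text)


class CollapseSpaces(Operator):
    def __call__(self, text: str) -> str:
        return re.sub(r"\s+", " ", text).strip()


clean = RemovePunctuation() | CollapseSpaces()
print(clean("سلام ،  دنیا !"))
```

Because each operator is an independent callable, a pipeline built this way runs only the steps that were explicitly chained.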
Frequently Asked Questions
What is the Shekar Python library?
Shekar is an open-source Python library for Persian (Farsi) NLP. It covers the full pipeline from text normalization and tokenization to transformer-based POS tagging, NER, dependency parsing, and sentiment analysis, all in a single lightweight package.
How do I install Shekar?
Run pip install shekar in your terminal. It works on Windows, Linux, and macOS, including Apple Silicon, and requires Python 3.8 or later.
How do I normalize Persian text in Python?
Import Normalizer from shekar, create an instance, and call it on your text: from shekar import Normalizer; n = Normalizer(); result = n(text). The normalizer applies rules from the Academy of Persian Language and Literature, handling Arabic Unicode variants, diacritics, spacing, punctuation, and informal writing.
Is there a Persian NLP library that supports POS tagging and NER?
Yes. Shekar includes transformer-based (ALBERT) POS tagging following Universal Dependencies tags, and a NER module detecting persons, locations, organizations, and dates in Persian text.
Is Shekar free and open source?
Yes. Shekar is released under the MIT License and is free for both research and commercial use. Source code: github.com/amirivojdan/shekar.
How do I cite Shekar in a research paper?
Cite the JOSS paper: Amirivojdan, A. (2025). Shekar: A Python Toolkit for Persian Natural Language Processing. Journal of Open Source Software, 10(114), 9128. DOI: 10.21105/joss.09128.
Citation
If you use Shekar in your research, please cite:
@article{Amirivojdan_Shekar,
author = {Amirivojdan, Ahmad},
doi = {10.21105/joss.09128},
journal = {Journal of Open Source Software},
month = oct,
number = {114},
pages = {9128},
title = {{Shekar: A Python Toolkit for Persian Natural Language Processing}},
url = {https://joss.theoj.org/papers/10.21105/joss.09128},
volume = {10},
year = {2025}
}