Neyshekar
A Large-Scale Open Persian Speech Dataset
Neyshekar is an open, community-driven Persian speech dataset collected via a web-based crowdsourcing platform at ney.shekar.io. Recordings are provided by volunteer contributors and paid voice actors, all native Persian speakers. Each release is a stable snapshot, enabling reproducible research and consistent benchmarking. The dataset is released entirely under CC0 1.0 Universal and is free for any use, including commercial use.
Dataset Statistics: v4 (Latest)
Speaker gender distribution: 21,613 female recordings and 17,845 male recordings.
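As a quick check on these counts, the sketch below derives the female/male shares and compares the sum against the v4 total of 40,008 recordings quoted in the FAQ. The variable names are illustrative, and the page does not explain why the two counts do not add up to the total, so the remainder is reported without interpretation.

```python
# Counts as stated on this page for v4.
FEMALE_RECORDINGS = 21_613
MALE_RECORDINGS = 17_845
TOTAL_RECORDINGS = 40_008  # v4 total quoted in the FAQ

# Recordings covered by the two gender counts.
labeled = FEMALE_RECORDINGS + MALE_RECORDINGS

# Shares are computed over the gender-counted recordings only.
female_share = FEMALE_RECORDINGS / labeled
male_share = MALE_RECORDINGS / labeled

# Recordings in the total that neither count covers (reason not stated on the page).
remainder = TOTAL_RECORDINGS - labeled

print(f"gender-counted recordings: {labeled}")
print(f"female: {female_share:.1%}, male: {male_share:.1%}")
print(f"not covered by the two counts: {remainder}")
```

The split works out to roughly 55% female and 45% male among the gender-counted recordings.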
Named Entity Distribution
Entities were identified automatically using the Shekar NER model, providing rich linguistic metadata for downstream tasks.
Why Neyshekar?
Open & CC0 Licensed
Released under CC0 1.0 Universal, a public domain dedication and one of the most permissive options available. Use it for any purpose, including commercial products and proprietary models, without attribution requirements.
Native Speakers Only
All recordings are from native Persian speakers: a mix of volunteer contributors and professional voice actors, ensuring natural prosody and authentic pronunciation.
Reproducible Releases
Each version is a stable, versioned snapshot on Zenodo with a DOI, enabling reproducible experiments and consistent benchmarks across publications.
Rich Metadata
Every release includes transcriptions, speaker gender, duration statistics, vocabulary counts, and NER-tagged entity annotations from the Shekar NLP pipeline.
Community-Driven Growth
The dataset grows continuously through community contributions at ney.shekar.io. Each new release captures the latest snapshot with incremental improvements.
Balanced Gender Coverage
v4 includes 21,613 female and 17,845 male recordings, roughly a 55/45 split among these recordings, which supports building gender-robust speech models.
Use Cases
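Neyshekar is designed for a broad range of Persian speech research and engineering tasks, including ASR, TTS, speech representation learning, speaker identification, and voice activity detection.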
Version History
Neyshekar is released incrementally. Each version is archived on Zenodo with a permanent DOI.
License & Terms of Use
CC0 1.0 Universal
The Neyshekar dataset is released under the CC0 1.0 Universal (Public Domain Dedication) license. It may be used, modified, and redistributed for any purpose, including commercial use, without restriction or attribution.
Note: Any attempt to identify or uncover the identity of individual speakers in the dataset is strictly prohibited.
Frequently Asked Questions
What is Neyshekar?
Neyshekar is a large-scale open Persian speech dataset built through community crowdsourcing. It contains 40,000+ recordings totalling 63+ hours of native Persian speech, along with transcriptions, gender labels, and NER-tagged entity metadata.
How many hours of speech does Neyshekar contain?
Version 4 (the latest, released May 2026) contains 63.03 hours of audio across 40,008 recordings averaging 5.67 seconds each. The dataset has grown from 14.42 hours in v1 to 63+ hours in v4.
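The figures in this answer can be cross-checked with a few lines of arithmetic. The sketch below uses only the numbers quoted above; the variable names are illustrative.

```python
# Figures as quoted in the answer above.
V4_HOURS = 63.03
V4_RECORDINGS = 40_008
V1_HOURS = 14.42

# Mean clip length: total audio in seconds divided by the number of recordings.
total_seconds = V4_HOURS * 3600
mean_seconds = total_seconds / V4_RECORDINGS

# Growth in total audio from v1 to v4.
growth_factor = V4_HOURS / V1_HOURS

print(f"mean recording length: {mean_seconds:.2f} s")
print(f"v1 to v4 growth: {growth_factor:.1f}x")
```

The mean comes out at about 5.67 seconds, matching the stated average, and the total audio has grown by a factor of roughly 4.4 since v1.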
Where can I download the Neyshekar dataset?
Download it from Zenodo: doi.org/10.5281/zenodo.18073632. The collection platform is at ney.shekar.io.
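For scripted downloads, one option is to query Zenodo's REST API for the record's file list. The sketch below is a minimal, unofficial example under several assumptions: that the numeric suffix of the DOI is a valid record id for the `https://zenodo.org/api/records/<id>` endpoint, that the returned JSON has a `files` list whose entries carry a `key` (file name) and a `links.self` download URL, and that the DOI points at the version you want (it may instead resolve to the latest version). The helper names are hypothetical and not part of any Neyshekar tooling.

```python
import json
import urllib.request

DOI = "10.5281/zenodo.18073632"  # DOI given on this page


def zenodo_record_id(doi: str) -> str:
    """Extract the numeric record id from a Zenodo DOI."""
    prefix = "10.5281/zenodo."
    if not doi.startswith(prefix):
        raise ValueError(f"not a Zenodo DOI: {doi}")
    return doi[len(prefix):]


def record_api_url(doi: str) -> str:
    """Build the metadata URL (assumed endpoint of Zenodo's REST API)."""
    return f"https://zenodo.org/api/records/{zenodo_record_id(doi)}"


def list_files(record: dict) -> list[tuple[str, str]]:
    """Return (file name, download URL) pairs from record metadata.

    Assumes each entry in record["files"] has "key" and "links"["self"].
    """
    return [(f["key"], f["links"]["self"]) for f in record.get("files", [])]


def fetch_record(doi: str) -> dict:
    """Fetch the record metadata as a dict (performs a network request)."""
    with urllib.request.urlopen(record_api_url(doi), timeout=30) as resp:
        return json.load(resp)


def main() -> None:
    """Print the files attached to the record; not called automatically."""
    for name, url in list_files(fetch_record(DOI)):
        print(name, url)
```

Calling `main()` prints one line per file; each URL can then be fetched with `urllib.request.urlretrieve` or any download tool. If the response layout differs from the assumptions above, only `list_files` should need adjusting.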
Can I use Neyshekar for commercial products?
Yes. The CC0 1.0 Universal license places the dataset in the public domain. You can use, modify, and redistribute it for any purpose, including building commercial ASR or TTS products, without restriction.
Who recorded the dataset?
Recordings were contributed by volunteer community members at ney.shekar.io and by paid professional voice actors. All speakers are native Persian speakers.
What tasks is Neyshekar designed for?
Neyshekar targets ASR, TTS, speech representation learning, speaker identification, voice activity detection, and other downstream Persian speech applications.
How do I cite Neyshekar?
Cite the Zenodo record: Amirivojdan, A. (2026). Neyshekar: A Large-Scale Open Persian Speech Dataset (v4). Zenodo. DOI: 10.5281/zenodo.18073632.
Citation
If you use Neyshekar in your research, please cite:
@dataset{amirivojdan_neyshekar_2026,
author = {Amirivojdan, Ahmad},
title = {{Neyshekar: A Large-Scale Open Persian Speech Dataset}},
year = {2026},
version = {v4},
publisher = {Zenodo},
doi = {10.5281/zenodo.18073632},
url = {https://doi.org/10.5281/zenodo.18073632}
}