Neyshekar

A Large-Scale Open Persian Speech Dataset

Neyshekar is an open, community-driven Persian speech dataset collected through a web-based crowdsourcing platform at ney.shekar.io. Recordings are provided by volunteer contributors and paid voice actors, all native Persian speakers. Each release is a stable snapshot that enables reproducible research and consistent benchmarking. The entire dataset is released under CC0 1.0 Universal and is free for any use, including commercial use.

Dataset Statistics: v4 (Latest)

Total audio duration: 63.03 h
Total recordings: 40,008
Average clip duration: 5.67 s
Total tokens: 456K
Vocabulary size: 26,758
Named entities (NER): 12,443

Speaker gender distribution: 21,613 female recordings and 17,845 male recordings.
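
The derived figures above can be cross-checked from the raw totals. The following is a minimal sketch that uses only the numbers published on this page; the function and constant names are illustrative, and the token count is the rounded "456K" value, so the per-recording token figure is approximate. The page does not say why the two gender counts sum to fewer than the total number of recordings, so the script only reports the difference.

```python
# Figures copied from the v4 statistics on this page.
TOTAL_HOURS = 63.03
TOTAL_RECORDINGS = 40_008
TOTAL_TOKENS = 456_000  # published as "456K", so this value is rounded
FEMALE_RECORDINGS = 21_613
MALE_RECORDINGS = 17_845


def average_clip_seconds(hours: float, recordings: int) -> float:
    """Mean clip length in seconds, from total hours and clip count."""
    return hours * 3600 / recordings


def gender_shares(female: int, male: int) -> tuple[float, float]:
    """Female and male shares among the recordings counted by gender."""
    counted = female + male
    return female / counted, male / counted


avg_seconds = average_clip_seconds(TOTAL_HOURS, TOTAL_RECORDINGS)
female_share, male_share = gender_shares(FEMALE_RECORDINGS, MALE_RECORDINGS)
not_counted = TOTAL_RECORDINGS - (FEMALE_RECORDINGS + MALE_RECORDINGS)

print(f"average clip duration: {avg_seconds:.2f} s")
print(f"tokens per recording (approx.): {TOTAL_TOKENS / TOTAL_RECORDINGS:.1f}")
print(f"female share: {female_share:.1%}, male share: {male_share:.1%}")
print(f"recordings outside the two gender counts: {not_counted}")
```

The computed average agrees with the published 5.67 s, which is a quick way to confirm that a downloaded snapshot matches the release it claims to be.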

Named Entity Distribution

Entities were identified automatically using the Shekar NER model, providing rich linguistic metadata for downstream tasks.

LOC (Locations): 4,817
DAT (Dates): 3,616
PER (Persons): 2,156
ORG (Organizations): 1,588
EVE (Events): 266
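
To see how the entity types are distributed relative to one another, the counts above can be turned into shares. This is a small sketch using only the published counts; the dictionary and variable names are illustrative and are not part of the Shekar NER output format, which this page does not describe.

```python
# Entity counts per tag, copied from the distribution above.
ENTITY_COUNTS = {
    "LOC": 4_817,  # locations
    "DAT": 3_616,  # dates
    "PER": 2_156,  # persons
    "ORG": 1_588,  # organizations
    "EVE": 266,    # events
}

total_entities = sum(ENTITY_COUNTS.values())
entity_shares = {tag: count / total_entities for tag, count in ENTITY_COUNTS.items()}

# Print tags from most to least frequent.
for tag, share in sorted(entity_shares.items(), key=lambda item: item[1], reverse=True):
    print(f"{tag}: {ENTITY_COUNTS[tag]:>5} ({share:.1%})")

print(f"total: {total_entities}")
```

The per-tag counts sum to the 12,443 named entities reported in the v4 statistics.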

Why Neyshekar?

Open & CC0 Licensed

Released under CC0 1.0 Universal, a public-domain dedication. Use it for any purpose, including commercial products and proprietary models, with no attribution requirement.

Native Speakers Only

All recordings are from native Persian speakers: a mix of volunteer contributors and professional voice actors, ensuring natural prosody and authentic pronunciation.

Reproducible Releases

Each version is a stable, versioned snapshot on Zenodo with a DOI, enabling reproducible experiments and consistent benchmarks across publications.

Rich Metadata

Every release includes transcriptions, speaker gender, duration statistics, vocabulary counts, and NER-tagged entity annotations from the Shekar NLP pipeline.

Community-Driven Growth

The dataset grows continuously through community contributions at ney.shekar.io. Each new release captures the latest snapshot with incremental improvements.

Balanced Gender Coverage

v4 includes 21,613 female and 17,845 male recordings, roughly a 55/45 split, which supports building gender-robust speech models.

Use Cases

Neyshekar is designed for a broad range of Persian speech research and engineering tasks:

Automatic Speech Recognition (ASR)
Text-to-Speech (TTS)
Speech Representation Learning
Speaker Identification
Voice Activity Detection
Low-Resource Language Modeling
Acoustic Model Training
End-to-End Speech Models
Pronunciation Modeling
Multilingual Speech Research

Version History

Neyshekar is released incrementally. Each version is archived on Zenodo with a permanent DOI.

v4 (May 14, 2026): 40,008 recordings, 63.03 h audio, 456K tokens, 26,758 vocab, 12,443 entities
v3 (March 23, 2026): 30,019 recordings, 45.71 h audio, 331K tokens, 23,972 vocab
v2 (January 15, 2026): 20,020 recordings, 29.08 h audio, 208K tokens, 20,853 vocab
v1 (December 29, 2025): 10,044 recordings, 14.42 h audio, 103K tokens, 15,224 vocab
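
The release-over-release growth can be computed directly from the figures in this history. The sketch below assumes nothing beyond the listed numbers; the `Release` class and `growth` function are illustrative names, not part of any Neyshekar tooling.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Release:
    version: str
    recordings: int
    hours: float


# Oldest to newest, copied from the version history above.
RELEASES = [
    Release("v1", 10_044, 14.42),
    Release("v2", 20_020, 29.08),
    Release("v3", 30_019, 45.71),
    Release("v4", 40_008, 63.03),
]


def growth(releases: list[Release]) -> list[tuple[str, int, float, float]]:
    """For each release after the first, return
    (version, added recordings, added hours, mean clip seconds in that release)."""
    rows = []
    for previous, current in zip(releases, releases[1:]):
        rows.append((
            current.version,
            current.recordings - previous.recordings,
            round(current.hours - previous.hours, 2),
            current.hours * 3600 / current.recordings,
        ))
    return rows


growth_rows = growth(RELEASES)
for version, added_recordings, added_hours, mean_seconds in growth_rows:
    print(f"{version}: +{added_recordings} recordings, +{added_hours:.2f} h, "
          f"mean clip {mean_seconds:.2f} s")
```

Each release adds close to 10,000 recordings, so the hours added per release give a rough view of how the average clip length changes between snapshots.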

License & Terms of Use

CC0 1.0 Universal

The Neyshekar dataset is released under the CC0 1.0 Universal (Public Domain Dedication) license. It may be used, modified, and redistributed for any purpose, including commercial use, without restriction or attribution.

Note: Any attempt to identify or uncover the identity of individual speakers in the dataset is strictly prohibited.

Frequently Asked Questions

What is Neyshekar?

Neyshekar is a large-scale open Persian speech dataset built through community crowdsourcing. It contains 40,000+ recordings totalling 63+ hours of native Persian speech, along with transcriptions, gender labels, and NER-tagged entity metadata.

How many hours of speech does Neyshekar contain?

Version 4 (the latest, released May 2026) contains 63.03 hours of audio across 40,008 recordings averaging 5.67 seconds each. The dataset has grown from 14.42 hours in v1 to 63+ hours in v4.

Where can I download the Neyshekar dataset?

Download it from Zenodo: doi.org/10.5281/zenodo.18073632. The collection platform is at ney.shekar.io.
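
For scripted downloads, the file list can be read from Zenodo's public REST API. The following is a hedged sketch, not an official Neyshekar client: it assumes the numeric suffix of the DOI is the Zenodo record id, that `https://zenodo.org/api/records/<id>` returns the record as JSON, and that each entry under `files` carries `key`, `size`, and `links.self` fields. The archive's file names and layout are not described on this page, so the code only lists what the record reports.

```python
import json
import urllib.request

# Public Zenodo REST endpoint for a single record (assumed URL pattern).
ZENODO_API = "https://zenodo.org/api/records/{record_id}"


def record_id_from_doi(doi: str) -> str:
    """Extract the numeric record id from a Zenodo DOI such as 10.5281/zenodo.123."""
    prefix = "10.5281/zenodo."
    if not doi.startswith(prefix) or not doi[len(prefix):].isdigit():
        raise ValueError(f"not a Zenodo DOI: {doi!r}")
    return doi[len(prefix):]


def fetch_record(record_id: str, timeout: float = 30.0) -> dict:
    """Download the record's JSON metadata (requires network access)."""
    url = ZENODO_API.format(record_id=record_id)
    with urllib.request.urlopen(url, timeout=timeout) as response:
        return json.load(response)


def file_links(record: dict) -> list[tuple[str, int, str]]:
    """Return (file name, size in bytes, download URL) for each listed file.

    Assumes each file entry has "key", "size", and links["self"].
    """
    return [
        (entry["key"], entry["size"], entry["links"]["self"])
        for entry in record.get("files", [])
    ]
```

A typical call would pass `record_id_from_doi("10.5281/zenodo.18073632")` to `fetch_record` and then hand the result to `file_links`. If the DOI is a concept DOI that covers all versions, the id may resolve to the latest release rather than a fixed one, so pin the version-specific DOI when reproducibility matters.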

Can I use Neyshekar for commercial products?

Yes. The CC0 1.0 Universal license places the dataset in the public domain. You can use, modify, and redistribute it for any purpose, including building commercial ASR or TTS products, without restriction.

Who recorded the dataset?

Recordings were contributed by volunteer community members at ney.shekar.io and by paid professional voice actors. All speakers are native Persian speakers.

What tasks is Neyshekar designed for?

Neyshekar targets ASR, TTS, speech representation learning, speaker identification, voice activity detection, and other downstream Persian speech applications.

How do I cite Neyshekar?

Cite the Zenodo record: Amirivojdan, A. (2026). Neyshekar: A Large-Scale Open Persian Speech Dataset (v4). Zenodo. DOI: 10.5281/zenodo.18073632.

Citation

If you use Neyshekar in your research, please cite:

@dataset{amirivojdan_neyshekar_2026,
  author    = {Amirivojdan, Ahmad},
  title     = {{Neyshekar: A Large-Scale Open Persian Speech Dataset}},
  year      = {2026},
  version   = {v4},
  publisher = {Zenodo},
  doi       = {10.5281/zenodo.18073632},
  url       = {https://doi.org/10.5281/zenodo.18073632}
}
