Neyshekar

A Large-Scale Open Persian Speech Dataset

Neyshekar is an open, community-driven Persian speech dataset collected via a web-based crowdsourcing platform at ney.shekar.io. Recordings are provided by volunteer contributors and paid voice actors, all native Persian speakers. Each release is a stable snapshot enabling reproducible research and consistent benchmarking. The dataset is released entirely under CC0 1.0 Universal and is free for any use, including commercial use.

Dataset Statistics: v4 (Latest)

Total audio duration: 63.03 h
Total recordings: 40,008
Average clip duration: 5.67 s
Total tokens: 456K
Vocabulary size: 26,758
Named entities (NER): 12,443

Speaker gender distribution: 21,613 female recordings and 17,845 male recordings.
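
The headline figures can be cross-checked with a few lines of arithmetic. The sketch below uses only the numbers reported for v4; the variable names are illustrative and not part of any Neyshekar tooling. It also shows that the two gender counts together cover 39,458 of the 40,008 recordings; the page does not say how the remaining 550 are labelled.

```python
# Figures copied from the v4 statistics on this page.
total_hours = 63.03
total_recordings = 40_008
female = 21_613
male = 17_845

# Average clip duration in seconds: total audio time divided by clip count.
avg_clip_seconds = total_hours * 3600 / total_recordings

# Gender shares among the recordings that carry one of the two reported counts.
counted = female + male
female_share = female / counted
male_share = male / counted

# Recordings not covered by either count (the page does not explain these).
not_counted = total_recordings - counted

print(f"Average clip duration: {avg_clip_seconds:.2f} s")
print(f"Female share: {female_share:.1%}, male share: {male_share:.1%}")
print(f"Recordings outside the two gender counts: {not_counted}")
```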

Named Entity Distribution

Entities were identified automatically using the Shekar NER model, providing rich linguistic metadata for downstream tasks.

LOC (Locations): 4,817
DAT (Dates): 3,616
PER (Persons): 2,156
ORG (Organizations): 1,588
EVE (Events): 266
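
As a quick consistency check, the per-type entity counts should add up to the 12,443 named entities reported for v4. The following sketch verifies that and derives each type's share; the dictionary is typed in by hand from the figures on this page and does not reflect any file format shipped with the dataset.

```python
# Entity counts per type, copied from the v4 figures on this page.
ner_counts = {"LOC": 4817, "DAT": 3616, "PER": 2156, "ORG": 1588, "EVE": 266}
reported_total = 12_443

# The per-type counts should sum to the reported total.
total = sum(ner_counts.values())
assert total == reported_total, f"expected {reported_total}, got {total}"

# Share of each entity type as a percentage of all entities.
shares = {tag: round(100 * count / total, 1) for tag, count in ner_counts.items()}

# Print types from most to least frequent.
for tag, count in sorted(ner_counts.items(), key=lambda item: item[1], reverse=True):
    print(f"{tag}: {count:>5} ({shares[tag]}%)")
```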

Why Neyshekar?

Open & CC0 Licensed

Released under CC0 1.0 Universal, a public-domain dedication and one of the most permissive options available. Use it for any purpose, including commercial products and proprietary models, without attribution requirements.

Native Speakers Only

All recordings are from native Persian speakers: a mix of volunteer contributors and professional voice actors, ensuring natural prosody and authentic pronunciation.

Reproducible Releases

Each version is a stable, versioned snapshot on Zenodo with a DOI, enabling reproducible experiments and consistent benchmarks across publications.

Rich Metadata

Every release includes transcriptions, speaker gender, duration statistics, vocabulary counts, and NER-tagged entity annotations from the Shekar NLP pipeline.

Community-Driven Growth

The dataset grows continuously through community contributions at ney.shekar.io. Each new release captures the latest snapshot with incremental improvements.

Balanced Gender Coverage

v4 includes 21,613 female and 17,845 male recordings, roughly a 55/45 split across those 39,458 recordings. This near-balanced coverage helps when building gender-robust speech models.

Use Cases

Neyshekar is designed for a broad range of Persian speech research and engineering tasks:

Automatic Speech Recognition (ASR)
Text-to-Speech (TTS)
Speech Representation Learning
Speaker Identification
Voice Activity Detection
Low-Resource Language Modeling
Acoustic Model Training
End-to-End Speech Models
Pronunciation Modeling
Multilingual Speech Research

Version History

Neyshekar is released incrementally. Each version is archived on Zenodo with a permanent DOI.

v4 (May 14, 2026): 40,008 recordings, 63.03 h audio, 456K tokens, 26,758 vocab, 12,443 entities

v3 (March 23, 2026): 30,019 recordings, 45.71 h audio, 331K tokens, 23,972 vocab

v2 (January 15, 2026): 20,020 recordings, 29.08 h audio, 208K tokens, 20,853 vocab

v1 (December 29, 2025): 10,044 recordings, 14.42 h audio, 103K tokens, 15,224 vocab
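
To make the release-to-release growth explicit, the sketch below computes how many recordings and hours each version added over the previous one. The `Release` class and `growth` function are hypothetical helpers written for this illustration; the numbers are the ones listed in the version history.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Release:
    version: str
    recordings: int
    hours: float


# Figures copied from the version history, oldest first.
releases = [
    Release("v1", 10_044, 14.42),
    Release("v2", 20_020, 29.08),
    Release("v3", 30_019, 45.71),
    Release("v4", 40_008, 63.03),
]


def growth(history: list[Release]) -> list[tuple[str, int, float]]:
    """Return (version, added recordings, added hours) for each release after the first."""
    rows = []
    for previous, current in zip(history, history[1:]):
        added_recordings = current.recordings - previous.recordings
        added_hours = round(current.hours - previous.hours, 2)
        rows.append((current.version, added_recordings, added_hours))
    return rows


for version, added_recordings, added_hours in growth(releases):
    print(f"{version}: +{added_recordings} recordings, +{added_hours} h")
```

Each release adds close to 10,000 recordings, while the hours added per release rise from 14.66 to 17.32, consistent with a slightly longer average clip in later versions.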

License & Terms of Use

CC0 1.0 Universal

The Neyshekar dataset is released under the CC0 1.0 Universal (Public Domain Dedication) license. It may be used, modified, and redistributed for any purpose, including commercial use, without restriction or attribution.

Note: Any attempt to identify or uncover the identity of individual speakers in the dataset is strictly prohibited.

Frequently Asked Questions

What is Neyshekar?

Neyshekar is a large-scale open Persian speech dataset built through community crowdsourcing. It contains 40,000+ recordings totalling 63+ hours of native Persian speech, along with transcriptions, gender labels, and NER-tagged entity metadata.

How many hours of speech does Neyshekar contain?

Version 4 (the latest, released May 2026) contains 63.03 hours of audio across 40,008 recordings averaging 5.67 seconds each. The dataset has grown from 14.42 hours in v1 to 63+ hours in v4.

Where can I download the Neyshekar dataset?

Download it from Zenodo: doi.org/10.5281/zenodo.18073632. The collection platform is at ney.shekar.io.
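
For scripted access, one option is to look up the record through Zenodo's public REST API and list its files before downloading. This is a minimal sketch, not an official Neyshekar client: the function names are hypothetical, and it assumes the record JSON exposes a `files` list whose entries carry `key` and `links.self`, which should be checked against the live response. The archive's file names are not stated on this page, so none are assumed here.

```python
import json
import re
import urllib.request

# Zenodo record endpoint; the record ID is the numeric suffix of the DOI.
ZENODO_API = "https://zenodo.org/api/records/{record_id}"


def record_id_from_doi(doi: str) -> str:
    """Extract the numeric Zenodo record ID from a DOI or doi.org URL."""
    pattern = r"(?:https?://)?(?:doi\.org/)?10\.5281/zenodo\.(\d+)"
    match = re.fullmatch(pattern, doi.strip())
    if match is None:
        raise ValueError(f"not a Zenodo DOI: {doi!r}")
    return match.group(1)


def list_files(doi: str, timeout: float = 30.0) -> list[tuple[str, str]]:
    """Return (file name, download URL) pairs for a Zenodo record.

    Assumes the response layout described in the lead-in; requires network access.
    """
    url = ZENODO_API.format(record_id=record_id_from_doi(doi))
    with urllib.request.urlopen(url, timeout=timeout) as response:
        record = json.load(response)
    return [(entry["key"], entry["links"]["self"]) for entry in record.get("files", [])]


if __name__ == "__main__":
    # DOI as given on this page.
    for name, link in list_files("10.5281/zenodo.18073632"):
        print(name, link)
```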

Can I use Neyshekar for commercial products?

Yes. The CC0 1.0 Universal license places the dataset in the public domain. You can use, modify, and redistribute it for any purpose, including building commercial ASR or TTS products, without restriction.

Who recorded the dataset?

Recordings were contributed by volunteer community members at ney.shekar.io and by paid professional voice actors. All speakers are native Persian speakers.

What tasks is Neyshekar designed for?

Neyshekar targets ASR, TTS, speech representation learning, speaker identification, voice activity detection, and other downstream Persian speech applications.

How do I cite Neyshekar?

Cite the Zenodo record: Amirivojdan, A. (2026). Neyshekar: A Large-Scale Open Persian Speech Dataset (v4). Zenodo. DOI: 10.5281/zenodo.18073632.

Citation

If you use Neyshekar in your research, please cite:

@dataset{amirivojdan_neyshekar_2026,
  author    = {Amirivojdan, Ahmad},
  title     = {{Neyshekar: A Large-Scale Open Persian Speech Dataset}},
  year      = {2026},
  version   = {v4},
  publisher = {Zenodo},
  doi       = {10.5281/zenodo.18073632},
  url       = {https://doi.org/10.5281/zenodo.18073632}
}
