Persian STT & TTS Data Collection

This project provides the code used to collect the Farsi ASR Dataset🤗.

Current Resources

Crawls specified YouTube channels.
Downloads high-quality audio along with the corresponding manually created subtitles.
Batches the downloaded files into tar.gz archives and stores them in a HuggingFace🤗 repository.
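
The batching step above can be sketched roughly as follows. This is a minimal illustration, not the project's actual script: the function name `batch_into_archives`, the `batch_size` default, and the archive naming scheme are all assumptions made for the example. It uses only Python's standard `tarfile` module.

```python
import tarfile
from pathlib import Path


def batch_into_archives(src_dir, out_dir, batch_size=100, prefix="batch"):
    """Pack the files in src_dir into numbered tar.gz archives.

    Hypothetical helper: the name, batch size, and naming scheme are
    illustrative assumptions, not taken from the project's code.
    """
    src = Path(src_dir)
    out = Path(out_dir)
    out.mkdir(parents=True, exist_ok=True)

    # Sort so that the batch contents are deterministic across runs.
    files = sorted(p for p in src.iterdir() if p.is_file())

    archives = []
    for start in range(0, len(files), batch_size):
        archive_path = out / f"{prefix}_{start // batch_size:05d}.tar.gz"
        with tarfile.open(archive_path, "w:gz") as tar:
            for f in files[start:start + batch_size]:
                # Store only the file name, not the full local path.
                tar.add(f, arcname=f.name)
        archives.append(archive_path)
    return archives
```

Each resulting archive could then be uploaded to a dataset repository, for example with the `huggingface_hub` client; that step is not shown here because the repository layout is not described in this README.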

Contributions that improve and expand this dataset are welcome! Here’s how you can help:

Run the existing scripts: Execute the Colab notebooks to expand the dataset and submit a pull request.
Suggest new data sources: Open an issue and mention YouTube channels with high-quality subtitles or other sources of transcribed Persian speech data.
Improve data processing: Help refine cleaning, filtering, and segmentation methods.
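
As a starting point for contributors interested in the cleaning step, here is a minimal sketch of one kind of subtitle normalization. It is an assumption about what such cleaning might look like, not the method used in `subtitles_cleanup`: the function name `clean_subtitle_line` and the specific rules (mapping Arabic Yeh and Kaf to their Persian forms, dropping square-bracketed annotations, collapsing whitespace) are illustrative only.

```python
import re
import unicodedata

# Arabic code points commonly replaced by their Persian equivalents:
# U+064A (Arabic Yeh) -> U+06CC (Farsi Yeh), U+0643 (Arabic Kaf) -> U+06A9 (Keheh).
ARABIC_TO_PERSIAN = str.maketrans({"\u064a": "\u06cc", "\u0643": "\u06a9"})

# Square-bracketed annotations such as sound or music cues.
BRACKETED = re.compile(r"\[[^\]]*\]")
WHITESPACE = re.compile(r"\s+")


def clean_subtitle_line(text):
    """Normalize one subtitle line (hypothetical rules, for illustration)."""
    text = unicodedata.normalize("NFC", text)
    text = text.translate(ARABIC_TO_PERSIAN)
    text = BRACKETED.sub(" ", text)
    # Collapse runs of whitespace. The zero-width non-joiner (U+200C) is not
    # whitespace, so it is preserved, which matters for Persian spelling.
    return WHITESPACE.sub(" ", text).strip()
```

Lines that come back empty after cleaning could simply be dropped before segmentation. Which rules are appropriate depends on the subtitle sources, so treat the ones above as examples to adapt.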
