A collection of scripts and code used to collect the largest ASR datasets for the Farsi language.

srezasm/farsi-asr-dataset

Persian STT & TTS Data Collection

This project provides the code used to collect the Farsi ASR Dataset🤗.

Current Resources

1. YouTube Data Collection Notebook

  • Crawls specified YouTube channels.
  • Downloads high-quality audio along with the corresponding manually created subtitles.
  • Batches the downloaded files into tar.gz archives and stores them in a HuggingFace🤗 repository.
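
The batching step can be illustrated with a minimal sketch that uses only the Python standard library. It is not the notebook's actual code: the function name `batch_into_archives`, the `batch_size` parameter, the archive naming scheme, and the assumption that each download is stored as `<video_id>.<ext>` (for example `abc123.wav` next to `abc123.fa.vtt`) are all hypothetical. Downloading from YouTube and uploading to HuggingFace are left out.

```python
import tarfile
from collections import defaultdict
from pathlib import Path


def batch_into_archives(src_dir, out_dir, batch_size=100):
    """Pack downloaded files into numbered tar.gz archives.

    Files are grouped by the part of the name before the first dot
    (assumed here to be the video ID), so an audio file and its
    subtitle file always land in the same archive. Each archive
    holds at most `batch_size` such groups.
    """
    src, out = Path(src_dir), Path(out_dir)
    out.mkdir(parents=True, exist_ok=True)

    # Group audio and subtitle files that belong to the same video.
    groups = defaultdict(list)
    for path in sorted(src.iterdir()):
        if path.is_file():
            groups[path.name.split(".")[0]].append(path)

    video_ids = sorted(groups)
    archives = []
    for start in range(0, len(video_ids), batch_size):
        archive_path = out / f"batch_{start // batch_size:05d}.tar.gz"
        with tarfile.open(archive_path, "w:gz") as tar:
            for video_id in video_ids[start:start + batch_size]:
                for path in groups[video_id]:
                    # Store only the file name, not the full local path.
                    tar.add(path, arcname=path.name)
        archives.append(archive_path)
    return archives
```

Grouping by video ID before batching keeps each audio file together with its subtitles, so every archive can be processed on its own after download.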

🚀 Contribution

Contributions that improve and expand this dataset are welcome! Here’s how you can help:

  • Run the existing scripts: Execute the Colab notebooks to expand the dataset and submit a pull request.
  • Suggest new data sources: Open an issue and mention YouTube channels with high-quality subtitles or other sources of transcribed Persian speech data.
  • Improve data processing: Help refine cleaning, filtering, and segmentation methods.
