This repository has been archived by the owner on Oct 10, 2022. It is now read-only.

Releases: snakers4/open_stt

First release

25 Apr 14:47
Pre-release

Yeah, we are building the largest open STT dataset for the Russian language)
Because we believe that life is not a zero-sum game.

This release mostly consists of our attempts to:

  • See what is available;
  • Gather and document work done before us in one place.

Historical composition and downloads

(Old download links will be discarded every iteration or two.)

| Type | Utterances | Hours | GB | Speaker sets | Characters | Mean length, seconds | Mean chars |
|---|---|---|---|---|---|---|---|
| Lecture | 6,803 | 6.3 | 1.9 | 29 | 316,953 | 3.36 | 46.6 |
| Narration | 67,052 | 80.3 | 27.5 | 584 | 3,075,827 | 4.31 | 45.9 |
| Phone_calls | 233,868 | 211.2 | 45.9 | 8,175 | 6,706,717 | 3.25 | 28.7 |
| Series | 20,243 | 17.5 | 5.2 | 51 | 759,433 | 3.10 | 37.5 |
| Total | 327,966 | 315 | 80 | 8,839 | 10,858,930 | | |

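
The two "mean" columns in the table above appear to be derived from the others: mean chars looks like characters divided by utterances, and mean length looks like hours converted to seconds and divided by utterances. The sketch below recomputes both under that assumption. The function names `stats` and `derive` are made up for this illustration, and because the hours column is rounded, the recomputed seconds can differ slightly from the published values.

```shell
#!/bin/sh
# Columns: type, utterances, hours, characters (copied from the table above).
stats() {
  cat <<'EOF'
Lecture 6803 6.3 316953
Narration 67052 80.3 3075827
Phone_calls 233868 211.2 6706717
Series 20243 17.5 759433
EOF
}

# Assumed relationships (not stated in the release notes):
#   mean chars   = characters / utterances
#   mean seconds = hours * 3600 / utterances
derive() {
  stats | awk '{ printf "%s %.1f %.2f\n", $1, $4 / $2, $3 * 3600 / $2 }'
}

# Prints one line per type: name, mean chars, mean seconds.
derive
```

Each output line can be compared against the last two columns of the corresponding table row.
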
  1. Download the chunks:
wget https://ru-open-stt-v01.ams3.digitaloceanspaces.com/ru_open_stt_v01.tar.gz_aa
wget https://ru-open-stt-v01.ams3.digitaloceanspaces.com/ru_open_stt_v01.tar.gz_ab
wget https://ru-open-stt-v01.ams3.digitaloceanspaces.com/ru_open_stt_v01.tar.gz_ac
wget https://ru-open-stt-v01.ams3.digitaloceanspaces.com/ru_open_stt_v01.tar.gz_ad
wget https://ru-open-stt-v01.ams3.digitaloceanspaces.com/ru_open_stt_v01.tar.gz_ae
wget https://ru-open-stt-v01.ams3.digitaloceanspaces.com/ru_open_stt_v01.tar.gz_af

For multi-threaded downloads, use aria2 with the -x flag.
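
As a rough sketch of the aria2 route, the script below writes the six chunk URLs to a list file and passes it to `aria2c` with `-i`, using `-x` to set the maximum connections per server and `-s` to set how many pieces each file is split into. The file name `chunks.txt`, the value 8, and the `RUN_DOWNLOAD` switch are assumptions made for this example, not part of the release. By default the script only prints the command it would run.

```shell
#!/bin/sh
BASE_URL="https://ru-open-stt-v01.ams3.digitaloceanspaces.com"

# Print one chunk URL per line (suffixes aa..af, as in the wget list above).
chunk_urls() {
  for suffix in aa ab ac ad ae af; do
    printf '%s/ru_open_stt_v01.tar.gz_%s\n' "$BASE_URL" "$suffix"
  done
}

# aria2c -i reads URIs from a file; each line is a separate download.
chunk_urls > chunks.txt

# Hypothetical switch: set RUN_DOWNLOAD=1 to actually start the transfer.
if [ "${RUN_DOWNLOAD:-0}" = "1" ]; then
  aria2c -x 8 -s 8 -i chunks.txt
else
  echo "aria2c -x 8 -s 8 -i chunks.txt"
fi
```

aria2 accepts values from 1 to 16 for `-x`, so 8 is only a starting point. Higher values may or may not help, depending on the server and the local link.
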

  2. Download the metadata:
wget https://ru-open-stt-v01.ams3.digitaloceanspaces.com/ru_open_stt_v01_public.csv
  3. Put the chunks together:
    cat ru_open_stt_v01.tar.gz_* > ru_open_stt_v01.tar.gz

  4. Unpack
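
Before unpacking an archive of this size, it may be worth checking that the reassembled file is intact. The sketch below runs the same `cat` reassembly on a small stand-in archive and then tests it with `gzip -t`. It assumes the `_aa` to `_af` chunks were produced by something like `split`, which the release does not state. All file names in the demo (`demo.tar.gz`, `rebuilt.tar.gz`, `payload`) are hypothetical.

```shell
#!/bin/sh
set -eu

# Work in a throwaway directory so nothing here touches real data.
workdir=$(mktemp -d)
cd "$workdir"

# Build a small stand-in archive; random bytes keep it larger than one piece.
mkdir payload
printf 'hello\n' > payload/a.txt
head -c 20000 /dev/urandom > payload/b.bin
tar -czf demo.tar.gz payload

# Split into 4096-byte pieces named demo.tar.gz_aa, demo.tar.gz_ab, ...
split -b 4096 demo.tar.gz demo.tar.gz_

# Reassemble: the shell expands the glob in sorted order, so pieces line up.
cat demo.tar.gz_* > rebuilt.tar.gz

# gzip -t exits non-zero if the compressed stream is truncated or corrupt.
gzip -t rebuilt.tar.gz && echo "archive OK"

# Byte-for-byte comparison against the original archive.
cmp demo.tar.gz rebuilt.tar.gz && echo "identical"
```

On the real data, the equivalent check would be `gzip -t ru_open_stt_v01.tar.gz`. It reads the whole file, so expect it to take a while on an archive of roughly 80 GB.
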