This repository has been archived by the owner on Oct 10, 2022. It is now read-only.

First release

Pre-release
snakers4 released this 25 Apr 14:47
· 76 commits to master since this release

Yeah, we are building the largest open STT dataset for the Russian language,
because we believe that life is not a zero-sum game.

This release mostly consists of our attempts to:

  • See what is available;
  • Gather and document the work done before us in one place.

Historical composition and downloads

(Old download links will be discarded every iteration or two.)

Type         Utterances  Hours    GB  Speaker sets  Characters  Mean length (s)  Mean chars
Lecture           6,803    6.3   1.9            29     316,953             3.36        46.6
Narration        67,052   80.3  27.5           584   3,075,827             4.31        45.9
Phone_calls     233,868  211.2  45.9         8,175   6,706,717             3.25        28.7
Series           20,243   17.5   5.2            51     759,433             3.10        37.5
Total           327,966    315    80         8,839  10,858,930

  1. Download the chunks:
wget https://ru-open-stt-v01.ams3.digitaloceanspaces.com/ru_open_stt_v01.tar.gz_aa
wget https://ru-open-stt-v01.ams3.digitaloceanspaces.com/ru_open_stt_v01.tar.gz_ab
wget https://ru-open-stt-v01.ams3.digitaloceanspaces.com/ru_open_stt_v01.tar.gz_ac
wget https://ru-open-stt-v01.ams3.digitaloceanspaces.com/ru_open_stt_v01.tar.gz_ad
wget https://ru-open-stt-v01.ams3.digitaloceanspaces.com/ru_open_stt_v01.tar.gz_ae
wget https://ru-open-stt-v01.ams3.digitaloceanspaces.com/ru_open_stt_v01.tar.gz_af

For multi-threaded downloads, use aria2 with the -x flag.
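
As a rough sketch of the aria2 route, the snippet below builds the six chunk URLs from the wget commands above and passes each one to aria2c. The function names (chunk_urls, download_chunks) and the connection count of 8 are assumptions made for illustration, not part of the release. In aria2c, -x sets the maximum number of connections per server (1 to 16), -s sets how many connections are used per file, and -c resumes a partially downloaded file.

```shell
# Base URL and chunk suffixes are copied from the wget commands above.
base_url="https://ru-open-stt-v01.ams3.digitaloceanspaces.com"

# Hypothetical helper: print one chunk URL per line.
chunk_urls() {
  for suffix in aa ab ac ad ae af; do
    printf '%s/ru_open_stt_v01.tar.gz_%s\n' "$base_url" "$suffix"
  done
}

# Hypothetical helper: fetch every chunk with aria2c.
# The value 8 is an assumed connection count; tune it to your link.
download_chunks() {
  for url in $(chunk_urls); do
    aria2c -c -x 8 -s 8 "$url" || return 1
  done
}
```

Nothing is downloaded until download_chunks is called. Since -c resumes partial files, calling it again after an interrupted run should pick up where it stopped.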

  2. Download the metadata:
wget https://ru-open-stt-v01.ams3.digitaloceanspaces.com/ru_open_stt_v01_public.csv
  3. Put the chunks together:
    cat ru_open_stt_v01.tar.gz_* > ru_open_stt_v01.tar.gz

  4. Unpack