This repository has been archived by the owner on Oct 10, 2022. It is now read-only.

Release v0.4, add mp3 download links, update status
snakers4 committed May 10, 2019
1 parent dbf3415 commit 328915c
Showing 2 changed files with 204 additions and 48 deletions.
204 changes: 181 additions & 23 deletions README.md
@@ -1,12 +1,14 @@
# **Russian Open Speech To Text (STT/ASR) Dataset**

Arguably the largest public Russian STT dataset to date:
- (**new!**) Now in `.mp3` to reduce download time 7-8x;
- ~4.6m utterances;
- ~4000 hours;
- 431 GB;
- 431 GB (in `.wav` format in `int16`);
- Additional 1,500 hours ... and more ... to be released soon!;
- And then maybe even more hours to be released!;


Prove [us](mailto:[email protected]) wrong!
Open issues, collaborate, submit a PR, contribute, share your datasets!
Let's make STT in Russian (and more) as open and available as CV models.
@@ -50,14 +52,20 @@ Let's make STT in Russian (and more) as open and available as CV models.
This alignment was performed using Yuri's alignment tool.
[Contact him](mailto:[email protected]) if you need alignment for your own dataset.

# **_update 2019-05-07_ Help needed!**
## **_Update 2019-05-10_**

Quickly converted the dataset to MP3 thanks to the community!
Waiting for our account for academic torrents to be approved.
v0.4 will boast MP3 download links.

## **_Update 2019-05-07_ Help needed!**

**If you want to support the project, you can:**
- Help us with hosting (create a mirror) / provide a reliable node for torrent;
- Help us with writing some [helper](https://github.com/snakers4/open_stt/issues/2) functions;
- [Donate](https://buymeacoff.ee/8oneCIN) (each coffee pays for several full downloads) / use our DO referral [link](https://sohabr.net/habr/post/357748/) to help;

We are converting the dataset to MP3 now.
~~We are converting the dataset to MP3 now.~~
Please contact us using the contacts below if you would like to help.

# **Downloads**
@@ -66,22 +74,22 @@ Please contact us using the below contacts, if you would like to help.

Meta data [file](https://ru-open-stt.ams3.digitaloceanspaces.com/public_meta_data_v03.csv).

| Dataset | GB, wav | GB, mp3 | Wav | Mp3 | Source | Manifest |
|---------------------------------------|------|----------------|-------|-----| -------| ----------|
| audiobook_2 | 166 | 21.0 | down | [part1](https://ru-open-stt.ams3.digitaloceanspaces.com/private_buriy_audiobooks_2_mp3.tar.gz) | Sources from the Internet + alignment | [link](https://ru-open-stt.ams3.digitaloceanspaces.com/private_buriy_audiobooks_2.csv) |
| asr_public_phone_calls_2 | 66 | 7.5 | down | [part1](https://ru-open-stt.ams3.digitaloceanspaces.com/asr_public_phone_calls_2_mp3.tar.gz) | Sources from the Internet + ASR | [link](https://ru-open-stt.ams3.digitaloceanspaces.com/asr_public_phone_calls_2.csv) |
| asr_public_stories_2 | 9 (7.5) | NA | [part1](https://ru-open-stt.ams3.digitaloceanspaces.com/asr_public_stories_2.tar.gz) | NA | Sources from the Internet + alignment | [link](https://ru-open-stt.ams3.digitaloceanspaces.com/asr_public_stories_2.csv) |
| tts_russian_addresses_rhvoice_4voices | 80.9 | 9.9 | down | [part1](https://ru-open-stt.ams3.digitaloceanspaces.com/tts_russian_addresses_rhvoice_4voices_mp3.tar.gz) | TTS | [link](https://ru-open-stt.ams3.digitaloceanspaces.com/tts_russian_addresses_rhvoice_4voices.csv) |
| public_youtube700 | 75.0 | 9.6 | down | [part1](https://ru-open-stt.ams3.digitaloceanspaces.com/public_youtube700_mp3.tar.gz) | YouTube videos | [link](https://ru-open-stt.ams3.digitaloceanspaces.com/public_youtube700.csv) |
| asr_public_phone_calls_1 | 22.7 | 2.6 | down | [part1](https://ru-open-stt.ams3.digitaloceanspaces.com/asr_public_phone_calls_1_mp3.tar.gz) | Sources from the Internet + ASR | [link](https://ru-open-stt.ams3.digitaloceanspaces.com/asr_public_phone_calls_1.csv) |
| asr_public_stories_1 | 4.1 | 0.5 | down | [part1](https://ru-open-stt.ams3.digitaloceanspaces.com/asr_public_stories_1_mp3.tar.gz) | Public stories | [link](https://ru-open-stt.ams3.digitaloceanspaces.com/asr_public_stories_1.csv) |
| public_series_1 | 1.9 | 0.2 | down | [part1](https://ru-open-stt.ams3.digitaloceanspaces.com/public_series_1_mp3.tar.gz) | Public series | [link](https://ru-open-stt.ams3.digitaloceanspaces.com/public_series_1.csv) |
| ru_RU | 1.9 | 0.2 | down | [part1](https://ru-open-stt.ams3.digitaloceanspaces.com/ru_ru.tar.gz) | Caito.de [dataset](https://www.caito.de/data/Training/stt_tts/) | [link](https://ru-open-stt.ams3.digitaloceanspaces.com/ru_RU.csv) |
| voxforge_ru | 1.9 | 0.2 | down | [part1](https://ru-open-stt.ams3.digitaloceanspaces.com/voxforge_ru_mp3.tar.gz) | Voxforge [dataset](https://www.repository.voxforge1.org/downloads/) | [link](https://ru-open-stt.ams3.digitaloceanspaces.com/voxforge_ru.csv) |
| russian_single | 0.9 | 0.1 | down | [part1](https://ru-open-stt.ams3.digitaloceanspaces.com/russian_single_mp3.tar.gz) | Russian single speaker [dataset](https://www.kaggle.com/bryanpark/russian-single-speaker-speech-dataset) | [link](https://ru-open-stt.ams3.digitaloceanspaces.com/russian_single.csv) |
| public_lecture_1 | 0.7 | 0.1 | down | [part1](https://ru-open-stt.ams3.digitaloceanspaces.com/public_lecture_1_mp3.tar.gz) | Sources from the Internet | [link](https://ru-open-stt.ams3.digitaloceanspaces.com/public_lecture_1.csv) |
| Total | 431 | 52 | | | | |

| Dataset | GB | GB, compressed | Audio | Source | Manifest |
|---------------------------------------|------|----------------|-------| -------| ----------|
| audiobook_2 | 166 | 131.7 | [part1](https://ru-open-stt.ams3.digitaloceanspaces.com/audiobooks_2.tar.gz_aa), [part2](https://ru-open-stt.ams3.digitaloceanspaces.com/audiobooks_2.tar.gz_ab), [part3](https://ru-open-stt.ams3.digitaloceanspaces.com/audiobooks_2.tar.gz_ac), [part4](https://ru-open-stt.ams3.digitaloceanspaces.com/audiobooks_2.tar.gz_ad), [part5](https://ru-open-stt.ams3.digitaloceanspaces.com/audiobooks_2.tar.gz_ae), [part6](https://ru-open-stt.ams3.digitaloceanspaces.com/audiobooks_2.tar.gz_af), [part7](https://ru-open-stt.ams3.digitaloceanspaces.com/audiobooks_2.tar.gz_ag) | Sources from the Internet + alignment | [link](https://ru-open-stt.ams3.digitaloceanspaces.com/private_buriy_audiobooks_2.csv) |
| asr_public_phone_calls_2 | 66 | 51.7 | [part1](https://ru-open-stt.ams3.digitaloceanspaces.com/asr_public_phone_calls_2.tar.gz_aa), [part2](https://ru-open-stt.ams3.digitaloceanspaces.com/asr_public_phone_calls_2.tar.gz_ab), [part3](https://ru-open-stt.ams3.digitaloceanspaces.com/asr_public_phone_calls_2.tar.gz_ac) | Sources from the Internet + ASR | [link](https://ru-open-stt.ams3.digitaloceanspaces.com/asr_public_phone_calls_2.csv) |
| asr_public_stories_2 | 9 | 7.5 | [part1](https://ru-open-stt.ams3.digitaloceanspaces.com/asr_public_stories_2.tar.gz) | Sources from the Internet + alignment | [link](https://ru-open-stt.ams3.digitaloceanspaces.com/asr_public_stories_2.csv) |
| tts_russian_addresses_rhvoice_4voices | 80.9 | 67.0 | [part1](https://ru-open-stt.ams3.digitaloceanspaces.com/tts_russian_addresses_rhvoice_4voices.tar.gz_aa), [part2](https://ru-open-stt.ams3.digitaloceanspaces.com/tts_russian_addresses_rhvoice_4voices.tar.gz_ab), [part3](https://ru-open-stt.ams3.digitaloceanspaces.com/tts_russian_addresses_rhvoice_4voices.tar.gz_ac), [part4](https://ru-open-stt.ams3.digitaloceanspaces.com/tts_russian_addresses_rhvoice_4voices.tar.gz_ad) | TTS | [link](https://ru-open-stt.ams3.digitaloceanspaces.com/tts_russian_addresses_rhvoice_4voices.csv) |
| public_youtube700 | 75.0 | 67.0 | [part1](https://ru-open-stt.ams3.digitaloceanspaces.com/public_youtube700.tar.gz_aa), [part2](https://ru-open-stt.ams3.digitaloceanspaces.com/public_youtube700.tar.gz_ab), [part3](https://ru-open-stt.ams3.digitaloceanspaces.com/public_youtube700.tar.gz_ac), [part4](https://ru-open-stt.ams3.digitaloceanspaces.com/public_youtube700.tar.gz_ad) | YouTube videos | [link](https://ru-open-stt.ams3.digitaloceanspaces.com/public_youtube700.csv) |
| asr_public_phone_calls_1 | 22.7 | 19.0 | [part1](https://ru-open-stt.ams3.digitaloceanspaces.com/asr_public_phone_calls_1.tar.gz) | Sources from the Internet + ASR | [link](https://ru-open-stt.ams3.digitaloceanspaces.com/asr_public_phone_calls_1.csv) |
| asr_public_stories_1 | 4.1 | 3.8 | [part1](https://ru-open-stt.ams3.digitaloceanspaces.com/asr_public_stories_1.tar.gz) | Public stories | [link](https://ru-open-stt.ams3.digitaloceanspaces.com/asr_public_stories_1.csv) |
| public_series_1 | 1.9 | 1.7 | [part1](https://ru-open-stt.ams3.digitaloceanspaces.com/public_series_1.tar.gz) | Public series | [link](https://ru-open-stt.ams3.digitaloceanspaces.com/public_series_1.csv) |
| ru_RU | 1.9 | 1.4 | [part1](https://ru-open-stt.ams3.digitaloceanspaces.com/ru_ru.tar.gz) | Caito.de [dataset](https://www.caito.de/data/Training/stt_tts/) | [link](https://ru-open-stt.ams3.digitaloceanspaces.com/ru_RU.csv) |
| voxforge_ru | 1.9 | 1.5 | [part1](https://ru-open-stt.ams3.digitaloceanspaces.com/voxforge_ru.tar.gz) | Voxforge [dataset](https://www.repository.voxforge1.org/downloads/) | [link](https://ru-open-stt.ams3.digitaloceanspaces.com/voxforge_ru.csv) |
| russian_single | 0.9 | 0.7 | [part1](https://ru-open-stt.ams3.digitaloceanspaces.com/russian_single.tar.gz) | Russian single speaker [dataset](https://www.kaggle.com/bryanpark/russian-single-speaker-speech-dataset) | [link](https://ru-open-stt.ams3.digitaloceanspaces.com/russian_single.csv) |
| public_lecture_1 | 0.7 | 0.6 | [part1](https://ru-open-stt.ams3.digitaloceanspaces.com/public_lecture_1.tar.gz) | Sources from the Internet | [link](https://ru-open-stt.ams3.digitaloceanspaces.com/public_lecture_1.csv) |
| Total | 190 | 163 | | | | |


## **Download instructions**
@@ -108,6 +116,7 @@ Meta data [file](https://ru-open-stt.ams3.digitaloceanspaces.com/public_meta_dat

## **Check md5sum**

The table below also includes checksums for deprecated files.
`md5sum /path/to/downloaded/file`

<details>
@@ -118,6 +127,62 @@ Meta data [file](https://ru-open-stt.ams3.digitaloceanspaces.com/public_meta_dat
<th>md5sum</th>
<th>file</th>
</tr>
<tr>
<td>audio</td>
<td>c356e279fe65530a14079b952a3374e1</td>
<td>asr_public_phone_calls_1_mp3.tar.gz</td>
</tr>
<tr>
<td>audio</td>
<td>a9c6c721d5c8cbbf683fae325fbc20e9</td>
<td>asr_public_phone_calls_2_mp3.tar.gz</td>
</tr>
<tr>
<td>audio</td>
<td>dee17aea8d0ba197e5636508bb2ac6a9</td>
<td>asr_public_stories_1_mp3.tar.gz</td>
</tr>
<tr>
<td>audio</td>
<td>be5cec0a66f44e77adacc8fb09142bbd</td>
<td>private_buriy_audiobooks_2_mp3.tar.gz</td>
</tr>
<tr>
<td>audio</td>
<td>e1abff84b5318007ae17d293dcc24783</td>
<td>public_lecture_1_mp3.tar.gz</td>
</tr>
<tr>
<td>audio</td>
<td>3d954ffdc65693fb4caf0bca61171b34</td>
<td>public_series_1_mp3.tar.gz</td>
</tr>
<tr>
<td>audio</td>
<td>501f16dc4bf529a99315beb2d31e76ef</td>
<td>public_youtube700_mp3.tar.gz</td>
</tr>
<tr>
<td>audio</td>
<td>ba9e68fdeb5e60fc9292cbeb24c09eb5</td>
<td>ru_ru_mp3.tar.gz</td>
</tr>
<tr>
<td>audio</td>
<td>d79f85cc8c70cb36255f1cce4d0eddd1</td>
<td>russian_single_mp3.tar.gz</td>
</tr>
<tr>
<td>audio</td>
<td>d6213dc7930591a99a6dd495bc2eda6a</td>
<td>tts_russian_addresses_rhvoice_4voices_mp3.tar.gz</td>
</tr>
<tr>
<td>audio</td>
<td>dd5704a9f0c695ccd333dea807a0cd87</td>
<td>voxforge_ru_mp3.tar.gz</td>
</tr>

<tr>
<td>manifest</td>
<td>b0ce7564ba90b121aeb13aada73a6e30</td>
@@ -316,9 +381,11 @@ Meta data [file](https://ru-open-stt.ams3.digitaloceanspaces.com/public_meta_dat
</table>
</details>


## **End to end download scripts**

You can use this [script](https://github.com/snakers4/open_stt/blob/master/download.sh) with this config [file](https://github.com/snakers4/open_stt/blob/master/md5sum.lst).
Please check the config first.
You can also [contribute](https://github.com/snakers4/open_stt/issues/2) a similar script in Python.
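
As a starting point for such a script, here is a minimal sketch of the checksum step only. It assumes `md5sum.lst` contains one `<md5> <file name>` pair per line, as in the listing in this repository; the function names are illustrative and are not part of the existing `download.sh`.

```python
import hashlib
from pathlib import Path


def md5_of_file(path, chunk_size=1 << 20):
    """Compute the md5 hex digest of a file, reading it in 1 MB chunks."""
    digest = hashlib.md5()
    with open(path, "rb") as f:
        for chunk in iter(lambda: f.read(chunk_size), b""):
            digest.update(chunk)
    return digest.hexdigest()


def parse_md5_list(text):
    """Parse lines of the form '<md5> <file name>' into {file name: md5}."""
    expected = {}
    for line in text.splitlines():
        parts = line.split()
        if len(parts) == 2:  # skip blank or malformed lines
            md5, name = parts
            expected[name] = md5
    return expected


def verify_folder(folder, expected):
    """Return {file name: bool} for every listed file that exists in folder."""
    results = {}
    for name, md5 in expected.items():
        path = Path(folder, name)
        if path.exists():
            results[name] = md5_of_file(path) == md5
    return results
```

A possible call would be `verify_folder('/path/to/downloads', parse_md5_list(Path('md5sum.lst').read_text()))`; files that are listed but not downloaded are simply skipped.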

# **Annotation methodology**
@@ -404,11 +471,102 @@ Please contact us [here](mailto:[email protected]) or just create a GitH

# **FAQ**

## **0. Why not MP3?**
## **0. ~~Why not MP3?~~ MP3 encoding / decoding**

#### **Encoding**

We mostly used `pydub` (which calls ffmpeg) to convert the files to MP3.
We omitted blank files (mostly from YouTube).
We used the following parameters:
- 16 kHz;
- 32 kbps;
- Mono;

Usually 128-192 kbps is enough for music with a sampling rate of 44 kHz, and 64-96 kbps is enough for speech.
But here we have mono, 16 kHz audio, usually with only one speaker, so 32 kbps was a good choice.
We did not use other formats like `.ogg`, because `.mp3` is much more popular.
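
As a rough sanity check on these parameters (a back-of-the-envelope sketch, not a measurement by the authors): uncompressed 16 kHz, 16-bit mono PCM takes 256 kbps, so a constant 32 kbps MP3 stream should be about 8x smaller. That is in line with the 7-8x reduction quoted at the top of the README and with the 431 GB vs 52 GB totals in the downloads table. The helper name is illustrative.

```python
def pcm_bitrate_kbps(sample_rate_hz, bits_per_sample, channels):
    """Bitrate of uncompressed PCM audio in kilobits per second."""
    return sample_rate_hz * bits_per_sample * channels / 1000


wav_kbps = pcm_bitrate_kbps(16000, 16, 1)  # int16, mono, 16 kHz
mp3_kbps = 32                              # bitrate used for the MP3 archives
compression_ratio = wav_kbps / mp3_kbps

print(wav_kbps, compression_ratio)
```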

<details><summary>See example</summary>
<p>

```python
from pydub import AudioSegment

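# temp_path is the source .wav file, store_mp3_path is the target .mp3 file
sound = AudioSegment.from_file(temp_path,
                               format="wav")

# resample to 16 kHz, downmix to mono and encode at 32 kbps
file_handle = sound.export(store_mp3_path,
                           format="mp3",
                           parameters=["-ar", "16000", "-ac", "1"],
                           bitrate="32k")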
```

</p>
</details>

#### **Decoding**

It is up to you, but to save space and spare CPU during training, I would suggest the following pipeline to extract the files:

<details><summary>See example</summary>
<p>

```python
# you can also use pydub, torchaudio, sox or whatever
# we ended up using scipy for speed
# this example also includes hashing step which is not necessary
import librosa
import hashlib
import numpy as np
from pathlib import Path
from scipy.io import wavfile

def save_wav_diskdb(wav,
                    root_folder='../data/ru_open_stt/',
                    target_sr=16000):
    # expect a 1-D (mono) int16 numpy array
    assert isinstance(wav, np.ndarray)
    assert wav.dtype == np.dtype('int16')
    assert len(wav.shape) == 1

    target_format = 'wav'
    wavb = wav.tobytes()

    # name the file by the sha1 hash of its samples
    # f_path = Path(audio_path)
    f_hash = hashlib.sha1(wavb).hexdigest()

    # shard files into nested folders by hash prefix
    store_path = Path(root_folder,
                      f_hash[0],
                      f_hash[1:3],
                      f_hash[3:15]+'.'+target_format)

    # Old FAQ answer deleted by this commit (the diff view shows it at this position):
    # We were planning to make an MP3 version (around 64 kb/s), and probably we were too quick to publish the dataset - it grew out of control.
    # Despite having ample free DO credits, we incurred some charges for data transfer.
    # We are making / will soon make an MP3 version and replace the links with the new ones.
    store_path.parent.mkdir(parents=True,
                            exist_ok=True)

    wavfile.write(filename=str(store_path),
                  rate=target_sr,
                  data=wav)

    return str(store_path)

root_folder = '../data/'
# save to int16, mono, 16 kHz to save space
target_dtype = np.dtype('int16')
target_sr = 16000
# librosa reads mp3; source_mp3_path is the path to an extracted .mp3 file
wav, sr = librosa.load(source_mp3_path,
                       mono=True,
                       sr=target_sr)

# librosa converts to float32 by default
wav = (wav * 32767).astype(target_dtype)  # cast to int

wav_path = save_wav_diskdb(wav,
                           root_folder=root_folder,
                           target_sr=target_sr)
```

</p>
</details>
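
One possible refinement of the `(wav * 32767).astype(...)` cast in the example: decoded or resampled float audio may occasionally contain samples slightly outside `[-1, 1]`, and a plain cast to `int16` would wrap those around instead of saturating. The sketch below clips first. This is an assumption about a potential edge case, not something reported for this dataset, and `float_to_int16` is a hypothetical helper name.

```python
import numpy as np


def float_to_int16(wav):
    """Convert float audio in [-1, 1] to int16, clipping out-of-range samples."""
    clipped = np.clip(wav, -1.0, 1.0)
    # astype truncates toward zero, same as the cast in the example above
    return (clipped * 32767).astype(np.int16)


samples = np.array([0.0, 0.5, 1.0, -1.0, 1.2], dtype=np.float32)
converted = float_to_int16(samples)
```

The result can be passed to `save_wav_diskdb` in place of the array produced by the plain cast.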

## **1. Issues with reading files**

@@ -444,7 +602,7 @@ TTS does not enjoy the same attention by ML community because it is data hungry
Ultimately it leads to worse-off situation for the general community.

## **3. Known issues with the dataset to be fixed**
- Blank files in Youtube dataset. Just filter them out using meta-data. Will be fixed in future;
- ~~Blank files in YouTube dataset~~. Removed in the MP3 archives. Meta-data not cleaned;
- Some files have low values / crash with torchaudio;
- Looks like scipy does not always write meta-data when saving wavs (or you should save an (N,1)-shaped array) - this can be fixed as shown above;
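
Until the meta-data is cleaned, blank or near-silent files can be flagged on the user side. The following is a minimal sketch under stated assumptions: the waveform is already loaded as an `int16` numpy array, and the peak threshold of 100 is an arbitrary illustrative value, not a criterion used by the authors.

```python
import numpy as np


def is_blank(wav, peak_threshold=100):
    """Return True if an int16 waveform is empty or never reaches peak_threshold."""
    if wav.size == 0:
        return True
    # widen to int32 first so that abs(-32768) does not overflow
    peak = int(np.abs(wav.astype(np.int32)).max())
    return peak < peak_threshold
```

For example, `is_blank(np.zeros(16000, dtype=np.int16))` flags one second of digital silence.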

48 changes: 23 additions & 25 deletions md5sum.lst
@@ -9,29 +9,27 @@ a81dfb018c88d0ecd5194ab3d8ff6c95 public_youtube700.csv
c858f020729c34ba0ab525bbb8950d0c ru_RU.csv
0275525914825dec663fd53390fdc9a0 russian_single.csv
52f406f4e30fcc8c634f992befd91beb tts_russian_addresses_rhvoice_4voices.csv
a5496898ee78654bf398ec6df71540d7 asr_public_phone_calls_1.tar.gz
97594127a922df8a7bcc2eecd2470805 asr_public_phone_calls_2.tar.gz_aa
f9b6475f0f2898b16d9e6e0e648fb531 asr_public_phone_calls_2.tar.gz_ab
b19977c889cda639f621195251e6bb6f asr_public_phone_calls_2.tar.gz_ac
657a31b544b10295f909ef4b2ca5c156 asr_public_stories_1.tar.gz
c356e279fe65530a14079b952a3374e1 asr_public_phone_calls_1_mp3.tar.gz
a9c6c721d5c8cbbf683fae325fbc20e9 asr_public_phone_calls_2_mp3.tar.gz
dee17aea8d0ba197e5636508bb2ac6a9 asr_public_stories_1_mp3.tar.gz
be5cec0a66f44e77adacc8fb09142bbd private_buriy_audiobooks_2_mp3.tar.gz
e1abff84b5318007ae17d293dcc24783 public_lecture_1_mp3.tar.gz
3d954ffdc65693fb4caf0bca61171b34 public_series_1_mp3.tar.gz
501f16dc4bf529a99315beb2d31e76ef public_youtube700_mp3.tar.gz
ba9e68fdeb5e60fc9292cbeb24c09eb5 ru_ru_mp3.tar.gz
d79f85cc8c70cb36255f1cce4d0eddd1 russian_single_mp3.tar.gz
d6213dc7930591a99a6dd495bc2eda6a tts_russian_addresses_rhvoice_4voices_mp3.tar.gz
dd5704a9f0c695ccd333dea807a0cd87 voxforge_ru_mp3.tar.gz
7533581bb26975212817bcacb25546d0 asr_public_stories_2.tar.gz
3955616cd89761bf2d54d0e992f7eae5 audiobooks_2.tar.gz_aa
81b6ec147c0c43bdd56002c41e0288b8 audiobooks_2.tar.gz_ab
15d4cf99171c2db3f375619f4bd2b6d9 audiobooks_2.tar.gz_ac
50635b0f4bdf44fae96e5a65f4738e19 audiobooks_2.tar.gz_ad
f1103be39ffc2da4a98d8f6ddeb50aa0 audiobooks_2.tar.gz_ae
8b45d2bd8b1fa1d906e36b9fabd9fe4c audiobooks_2.tar.gz_af
5104df44933b612b3c1bfc06f6376654 audiobooks_2.tar.gz_ag
e6b9e5f46811d33ea34ce50f6067a762 public_lecture_1.tar.gz
86ebf7e30986b8ee8df11f85b35588a0 public_series_1.tar.gz
dc260dd8151b4fce6cde6d80af13146d public_youtube700.tar.gz_aa
04706ef0f98841ec8d2f20a83aca3cf1 public_youtube700.tar.gz_ab
e11d5b118bf71425e4915e61277a06a9 public_youtube700.tar.gz_ac
d9a93157263eb9d8078c0e0b88c271de public_youtube700.tar.gz_ad
1bbba5eb2f4911c9ed20ec69cbd292cb ru_ru.tar.gz
6f79a9c514ad48a5763e3142919fc765 russian_single.tar.gz
31d515e0bdfc467c3fe63088b817c15c tts_russian_addresses_rhvoice_4voices.tar.gz_aa
4ca15694a8d8a638bbdc5e90832eadb4 tts_russian_addresses_rhvoice_4voices.tar.gz_ab
447559a38cd8bf61c5de64e602f06da3 tts_russian_addresses_rhvoice_4voices.tar.gz_ac
9131347a97c2e794d7c6d5a265083e83 tts_russian_addresses_rhvoice_4voices.tar.gz_ad
91e2115b17b1ad08649f428d2caa643b voxforge_ru.tar.gz











