This repository has been archived by the owner on Oct 10, 2022. It is now read-only.

Add v0.3-alpha desciption
snakers4 committed Apr 30, 2019
1 parent 6ed89f9 commit dd6ac59
Showing 1 changed file with 286 additions and 33 deletions.
319 changes: 286 additions & 33 deletions README.md
@@ -1,42 +1,51 @@
# **Russian Open STT Dataset**

Arguably the largest public Russian STT dataset to date:
- ~3m utterances;
- 1,771+ hours;
- 190GB;
- Additional 3,000 hours ... and more ... to be released soon!;
- ~5m utterances;
- ~4,200 hours;
- 457GB;
- Additional 1,500 hours ... and more ... to be released soon!;
- And then maybe even more hours to be released!;


Prove [me](https://t.me/snakers41) wrong!
Prove [us](https://t.me/snakers41) wrong!
Open issues, collaborate, submit a PR, contribute, share your datasets!
Let's make STT in Russian (and more) as open and available as CV models.


# **Dataset composition**

| Dataset | Utterances | Hours | GB | Av len/chars | Comment | Annotation | Quality/noise |
|-------------------------------|------------|-------|-----|--------------|------------------|---------------|---------------|
| asr_public_phone_calls_2 (*) | | 1,500 | | | * Coming soon | | |
| public_youtube1500 (*) | | 1,500 | | | * Coming soon | | |
| tts_russian_addresses | 1,741,838 | 754 | 81 | 1.6s / 20 | Russian addresses| TTS, 4 voices | 100% / crisp |
| public_youtube700 | 759,483 | 701 | 75 | 3.3s / 43 | Youtube videos | Subtitles | >95% / ~crisp |
| asr_public_phone_calls_1 | 233,868 | 211 | 23 | 3.3s / 29 | Phone calls | ASR | 70% / noisy |
| asr_public_stories_1 | 46,142 | 38 | 4 | 3.0s / 30 | Books | ASR | 70% / crisp |
| public_series_1 | 20,243 | 17 | 2 | 3.1s / 38 | Youtube videos | Subtitles | 95% / ~crisp |
| ru_RU | 5,826 | 17 | 2 | 10.8s / 12 | Public dataset | Alignment | 99% / crisp |
| voxforge_ru | 8,344 | 17 | 2 | 7.5s / 77 | Public dataset | Reading | 100% / crisp |
| russian_single | 3,357 | 9 | 1 | 9.3s / 102 | Public dataset | Alignment | 99% / crisp |
| public_lecture_1 | 6,803 | 6 | 1 | 3.4s / 47 | Lectures | Subtitles | >95% / crisp |
| Total | 2,825,904 | 1,771 | 190 | | | | |
| Dataset | Utterances | Hours | GB | Av s/chars | Comment | Annotation | Quality/noise |
|---------------------------|------------|-------|-----|------------|------------------|-------------|---------------|
| public_youtube1500 (*) | | 1,500 | | | * Coming soon | | |
| audiobook_2 | 1,149,404 | 1,511 | 166 | 4.7s / 56 | Books | Alignment | 99% / crisp |
| audiobook_1 | 196,666 | 237 | 26 | 4.3s / 50 | Books | Alignment | 99% / crisp |
| public_youtube700 | 759,483 | 701 | 75 | 3.3s / 43 | Youtube videos | Subtitles | 95% / ~crisp |
| tts_russian_addresses | 1,741,838 | 754 | 81 | 1.6s / 20 | Russian addresses| TTS 4 voices| 100% / crisp |
| asr_public_phone_calls_2 | 603,797 | 601 | 66 | 3.6s / 37 | Phone calls | ASR | 70% / noisy |
| asr_public_phone_calls_1 | 233,868 | 211 | 23 | 3.3s / 29 | Phone calls | ASR | 70% / noisy |
| asr_public_stories_2 | 78,186 | 78 | 9 | 3.5s / 43 | Books | ASR | 80% / crisp |
| asr_public_stories_1 | 46,142 | 38 | 4 | 3.0s / 30 | Books | ASR | 80% / crisp |
| public_series_1 | 20,243 | 17 | 2 | 3.1s / 38 | Youtube videos | Subtitles | 95% / ~crisp |
| ru_RU | 5,826 | 17 | 2 | 11s / 12 | Public dataset | Alignment | 99% / crisp |
| voxforge_ru | 8,344 | 17 | 2 | 7.5s / 77 | Public dataset | Reading | 100% / crisp |
| russian_single | 3,357 | 9 | 1 | 9.3s / 102 | Public dataset | Alignment | 99% / crisp |
| public_lecture_1 | 6,803 | 6 | 1 | 3.4s / 47 | Lectures | Subtitles | 95% / crisp |
| Total | 4,853,957 | 4,198 | 457 | | | | |

# **Downloads**

## **Links**

Meta data [file](https://ru-open-stt.ams3.digitaloceanspaces.com/public_meta_data_v02.csv).
Meta data [file](https://ru-open-stt.ams3.digitaloceanspaces.com/public_meta_data_v03.csv).


| Dataset | GB | GB, compressed | Audio | Source | Manifest |
|---------------------------------------|------|----------------|-------| -------| ----------|
| audiobook_1 | 26 | 20.8 | [part1](https://ru-open-stt.ams3.digitaloceanspaces.com/audiobooks_1.tar.gz) | Public books + alignment | [link](https://ru-open-stt.ams3.digitaloceanspaces.com/private_buriy_audiobooks_1.csv) |
| audiobook_2 | 166 | 131.7 | [part1](https://ru-open-stt.ams3.digitaloceanspaces.com/audiobooks_2.tar.gz_aa), [part2](https://ru-open-stt.ams3.digitaloceanspaces.com/audiobooks_2.tar.gz_ab), [part3](https://ru-open-stt.ams3.digitaloceanspaces.com/audiobooks_2.tar.gz_ac), [part4](https://ru-open-stt.ams3.digitaloceanspaces.com/audiobooks_2.tar.gz_ad), [part5](https://ru-open-stt.ams3.digitaloceanspaces.com/audiobooks_2.tar.gz_ae), [part6](https://ru-open-stt.ams3.digitaloceanspaces.com/audiobooks_2.tar.gz_af), [part7](https://ru-open-stt.ams3.digitaloceanspaces.com/audiobooks_2.tar.gz_ag) | Public books + alignment | [link](https://ru-open-stt.ams3.digitaloceanspaces.com/private_buriy_audiobooks_2.csv) |
| asr_public_phone_calls_2 | 66 | 51.7 | [part1](https://ru-open-stt.ams3.digitaloceanspaces.com/asr_public_phone_calls_2.tar.gz_aa), [part2](https://ru-open-stt.ams3.digitaloceanspaces.com/asr_public_phone_calls_2.tar.gz_ab), [part3](https://ru-open-stt.ams3.digitaloceanspaces.com/asr_public_phone_calls_2.tar.gz_ac) | ASR + public phone calls | [link](https://ru-open-stt.ams3.digitaloceanspaces.com/asr_public_phone_calls_2.csv) |
| asr_public_stories_2 | 9 | 7.5 | [part1](https://ru-open-stt.ams3.digitaloceanspaces.com/asr_public_stories_2.tar.gz) | Public books + alignment | [link](https://ru-open-stt.ams3.digitaloceanspaces.com/asr_public_stories_2.csv) |
| tts_russian_addresses_rhvoice_4voices | 80.9 | 67.0 | [part1](https://ru-open-stt.ams3.digitaloceanspaces.com/tts_russian_addresses_rhvoice_4voices.tar.gz_aa), [part2](https://ru-open-stt.ams3.digitaloceanspaces.com/tts_russian_addresses_rhvoice_4voices.tar.gz_ab), [part3](https://ru-open-stt.ams3.digitaloceanspaces.com/tts_russian_addresses_rhvoice_4voices.tar.gz_ac), [part4](https://ru-open-stt.ams3.digitaloceanspaces.com/tts_russian_addresses_rhvoice_4voices.tar.gz_ad) | TTS | [link](https://ru-open-stt.ams3.digitaloceanspaces.com/tts_russian_addresses_rhvoice_4voices.csv) |
| public_youtube700 | 75.0 | 67.0 | [part1](https://ru-open-stt.ams3.digitaloceanspaces.com/public_youtube700.tar.gz_aa), [part2](https://ru-open-stt.ams3.digitaloceanspaces.com/public_youtube700.tar.gz_ab), [part3](https://ru-open-stt.ams3.digitaloceanspaces.com/public_youtube700.tar.gz_ac), [part4](https://ru-open-stt.ams3.digitaloceanspaces.com/public_youtube700.tar.gz_ad) | YouTube videos | [link](https://ru-open-stt.ams3.digitaloceanspaces.com/public_youtube700.csv) |
| asr_public_phone_calls_1 | 22.7 | 19.0 | [part1](https://ru-open-stt.ams3.digitaloceanspaces.com/asr_public_phone_calls_1.tar.gz) | ASR + public phone calls | [link](https://ru-open-stt.ams3.digitaloceanspaces.com/asr_public_phone_calls_1.csv) |
@@ -71,6 +80,226 @@ Meta data [file](https://ru-open-stt.ams3.digitaloceanspaces.com/public_meta_dat
2. Download the meta data and manifests for each dataset:
3. Merge files (where applicable), unpack and enjoy!
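
The merge-and-unpack step can be sketched in Python as follows. This is a minimal sketch, assuming the `_aa`, `_ab`, ... parts are plain byte-wise splits of a single `.tar.gz` archive (which the suffixes suggest, but which is not stated here). The helper names `merge_parts` and `unpack` are hypothetical and not part of this repository.

```python
import shutil
import tarfile
from pathlib import Path


def merge_parts(part_paths, merged_path):
    """Concatenate split archive parts (e.g. *.tar.gz_aa, *.tar.gz_ab) in order."""
    with open(merged_path, "wb") as merged:
        # The _aa, _ab, ... suffixes sort lexicographically in the right order.
        for part in sorted(part_paths):
            with open(part, "rb") as chunk:
                shutil.copyfileobj(chunk, merged)
    return Path(merged_path)


def unpack(archive_path, target_dir):
    """Extract a .tar.gz archive into target_dir."""
    with tarfile.open(archive_path, "r:gz") as archive:
        archive.extractall(target_dir)
```

For example, after downloading all four `public_youtube700.tar.gz_*` parts into one folder, pass their paths to `merge_parts` with `public_youtube700.tar.gz` as the output path, then call `unpack` on the result. Datasets shipped as a single `.tar.gz` only need `unpack`.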

## **Check md5sum**

`md5sum /path/to/downloaded/file`

<details>
<summary>Click to expand</summary>
<table>
<tr>
<th>type</th>
<th>md5sum</th>
<th>file</th>
</tr>
<tr>
<td>manifest</td>
<td>b0ce7564ba90b121aeb13aada73a6e30</td>
<td>asr_public_phone_calls_1.csv</td>
</tr>
<tr>
<td>manifest</td>
<td>6867d14dfdec1f9e9b8ca2f1de9ceda6</td>
<td>asr_public_phone_calls_2.csv</td>
</tr>
<tr>
<td>manifest</td>
<td>0bdd77e15172e654d9a1999a86e92c7f</td>
<td>asr_public_stories_1.csv</td>
</tr>
<tr>
<td>manifest</td>
<td>f388013039d94dc36970547944db51c7</td>
<td>asr_public_stories_2.csv</td>
</tr>
<tr>
<td>manifest</td>
<td>697738331b6021890c29a0d415d0f22d</td>
<td>private_buriy_audiobooks_1.csv</td>
</tr>
<tr>
<td>manifest</td>
<td>3b67e27c1429593cccbf7c516c4b582d</td>
<td>private_buriy_audiobooks_2.csv</td>
</tr>
<tr>
<td>manifest</td>
<td>04027c20eb3aff05f6067957ecff856b</td>
<td>public_lecture_1.csv</td>
</tr>
<tr>
<td>manifest</td>
<td>89da3f1b6afcd4d4936662ceabf3033e</td>
<td>public_series_1.csv</td>
</tr>
<tr>
<td>manifest</td>
<td>a81dfb018c88d0ecd5194ab3d8ff6c95</td>
<td>public_youtube700.csv</td>
</tr>
<tr>
<td>manifest</td>
<td>c858f020729c34ba0ab525bbb8950d0c</td>
<td>ru_RU.csv</td>
</tr>
<tr>
<td>manifest</td>
<td>0275525914825dec663fd53390fdc9a0</td>
<td>russian_single.csv</td>
</tr>
<tr>
<td>manifest</td>
<td>52f406f4e30fcc8c634f992befd91beb</td>
<td>tts_russian_addresses_rhvoice_4voices.csv</td>
</tr>
<tr>
<td>audio</td>
<td>a5496898ee78654bf398ec6df71540d7</td>
<td>asr_public_phone_calls_1.tar.gz</td>
</tr>
<tr>
<td>audio</td>
<td>e4df5ef50787384648b59f5a87edc0c6</td>
<td>asr_public_phone_calls_2.tar.gz</td>
</tr>
<tr>
<td>audio</td>
<td>97594127a922df8a7bcc2eecd2470805</td>
<td>asr_public_phone_calls_2.tar.gz_aa</td>
</tr>
<tr>
<td>audio</td>
<td>f9b6475f0f2898b16d9e6e0e648fb531</td>
<td>asr_public_phone_calls_2.tar.gz_ab</td>
</tr>
<tr>
<td>audio</td>
<td>b19977c889cda639f621195251e6bb6f</td>
<td>asr_public_phone_calls_2.tar.gz_ac</td>
</tr>
<tr>
<td>audio</td>
<td>657a31b544b10295f909ef4b2ca5c156</td>
<td>asr_public_stories_1.tar.gz</td>
</tr>
<tr>
<td>audio</td>
<td>7533581bb26975212817bcacb25546d0</td>
<td>asr_public_stories_2.tar.gz</td>
</tr>
<tr>
<td>audio</td>
<td>d7d374025c56ca556d9cde86b9fdffda</td>
<td>audiobooks_1.tar.gz</td>
</tr>
<tr>
<td>audio</td>
<td>3955616cd89761bf2d54d0e992f7eae5</td>
<td>audiobooks_2.tar.gz_aa</td>
</tr>
<tr>
<td>audio</td>
<td>81b6ec147c0c43bdd56002c41e0288b8</td>
<td>audiobooks_2.tar.gz_ab</td>
</tr>
<tr>
<td>audio</td>
<td>15d4cf99171c2db3f375619f4bd2b6d9</td>
<td>audiobooks_2.tar.gz_ac</td>
</tr>
<tr>
<td>audio</td>
<td>50635b0f4bdf44fae96e5a65f4738e19</td>
<td>audiobooks_2.tar.gz_ad</td>
</tr>
<tr>
<td>audio</td>
<td>f1103be39ffc2da4a98d8f6ddeb50aa0</td>
<td>audiobooks_2.tar.gz_ae</td>
</tr>
<tr>
<td>audio</td>
<td>8b45d2bd8b1fa1d906e36b9fabd9fe4c</td>
<td>audiobooks_2.tar.gz_af</td>
</tr>
<tr>
<td>audio</td>
<td>5104df44933b612b3c1bfc06f6376654</td>
<td>audiobooks_2.tar.gz_ag</td>
</tr>
<tr>
<td>audio</td>
<td>e6b9e5f46811d33ea34ce50f6067a762</td>
<td>public_lecture_1.tar.gz</td>
</tr>
<tr>
<td>audio</td>
<td>86ebf7e30986b8ee8df11f85b35588a0</td>
<td>public_series_1.tar.gz</td>
</tr>
<tr>
<td>audio</td>
<td>dc260dd8151b4fce6cde6d80af13146d</td>
<td>public_youtube700.tar.gz_aa</td>
</tr>
<tr>
<td>audio</td>
<td>04706ef0f98841ec8d2f20a83aca3cf1</td>
<td>public_youtube700.tar.gz_ab</td>
</tr>
<tr>
<td>audio</td>
<td>e11d5b118bf71425e4915e61277a06a9</td>
<td>public_youtube700.tar.gz_ac</td>
</tr>
<tr>
<td>audio</td>
<td>d9a93157263eb9d8078c0e0b88c271de</td>
<td>public_youtube700.tar.gz_ad</td>
</tr>
<tr>
<td>audio</td>
<td>1bbba5eb2f4911c9ed20ec69cbd292cb</td>
<td>ru_ru.tar.gz</td>
</tr>
<tr>
<td>audio</td>
<td>6f79a9c514ad48a5763e3142919fc765</td>
<td>russian_single.tar.gz</td>
</tr>
<tr>
<td>audio</td>
<td>c926df1068218eb9cc8103c94003fcc6</td>
<td>tts_russian_addresses_rhvoice_4voices.tar</td>
</tr>
<tr>
<td>audio</td>
<td>31d515e0bdfc467c3fe63088b817c15c</td>
<td>tts_russian_addresses_rhvoice_4voices.tar.gz_aa</td>
</tr>
<tr>
<td>audio</td>
<td>4ca15694a8d8a638bbdc5e90832eadb4</td>
<td>tts_russian_addresses_rhvoice_4voices.tar.gz_ab</td>
</tr>
<tr>
<td>audio</td>
<td>447559a38cd8bf61c5de64e602f06da3</td>
<td>tts_russian_addresses_rhvoice_4voices.tar.gz_ac</td>
</tr>
<tr>
<td>audio</td>
<td>9131347a97c2e794d7c6d5a265083e83</td>
<td>tts_russian_addresses_rhvoice_4voices.tar.gz_ad</td>
</tr>
<tr>
<td>audio</td>
<td>91e2115b17b1ad08649f428d2caa643b</td>
<td>voxforge_ru.tar.gz</td>
</tr>
</table>
</details>
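
If `md5sum` is not available (for example on Windows), the checksums in the table above can also be checked from Python. A minimal sketch using only the standard library; the function names `md5_of_file` and `verify` are hypothetical helpers, not part of this repository:

```python
import hashlib


def md5_of_file(path, chunk_size=1 << 20):
    """Return the hex MD5 digest of a file, reading it in 1 MiB chunks."""
    digest = hashlib.md5()
    with open(path, "rb") as handle:
        # Chunked reads keep memory usage flat even for multi-GB archives.
        for chunk in iter(lambda: handle.read(chunk_size), b""):
            digest.update(chunk)
    return digest.hexdigest()


def verify(path, expected_md5):
    """Return True if the file's MD5 matches the expected hex digest."""
    return md5_of_file(path) == expected_md5.strip().lower()
```

For instance, `verify('asr_public_phone_calls_1.csv', 'b0ce7564ba90b121aeb13aada73a6e30')` should return `True` for an intact download of that manifest.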

# **Annotation methodology**

The dataset is compiled using open domain sources.
@@ -105,43 +334,55 @@ store_path = Path(root_folder,

Use helper functions from here for easier work with manifest files.

Read manifests:
```
#### **Read manifests**
<details><summary>See example</summary>
<p>

```python
from utils.open_stt_utils import read_manifest

manifest_df = read_manifest('path/to/manifest.csv')
```

Merge, check and save manifests:
```
</p>
</details>

#### **Merge, check and save manifests**
<details><summary>See example</summary>
<p>

```python
from utils.open_stt_utils import (plain_merge_manifests,
check_files,
save_manifest)
train_manifests = [
'path/to/manifest1.csv',
'path/to/manifest2.csv',
'path/to/manifest1.csv',
'path/to/manifest2.csv',
]
train_manifest = plain_merge_manifests(train_manifests,
train_manifest = plain_merge_manifests(train_manifests,
MIN_DURATION=0.1,
MAX_DURATION=100)
check_files(train_manifest)
save_manifest(train_manifest,
'my_manifest.csv')
```

</p>
</details>
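
To give an idea of what a merge with duration bounds might do, here is a rough standard-library sketch. It is not the implementation of `plain_merge_manifests`. It assumes each manifest is a header-less CSV whose last column is the utterance duration in seconds, and that the bounds are inclusive; both are assumptions, so check them against the actual manifest files. The names `filter_by_duration` and `merge_manifests` are hypothetical.

```python
import csv


def filter_by_duration(rows, min_duration=0.1, max_duration=100.0):
    """Keep rows whose duration (assumed: last column, in seconds) is in range."""
    kept = []
    for row in rows:
        if not row:  # skip blank lines
            continue
        duration = float(row[-1])
        if min_duration <= duration <= max_duration:
            kept.append(row)
    return kept


def merge_manifests(paths, min_duration=0.1, max_duration=100.0):
    """Concatenate several manifest CSV files and drop out-of-range utterances."""
    merged = []
    for path in paths:
        with open(path, newline="", encoding="utf-8") as handle:
            merged.extend(
                filter_by_duration(csv.reader(handle), min_duration, max_duration)
            )
    return merged
```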

# **Contacts**

Please contact me [here](https://t.me/snakers41) or just create a GitHub issue!
Please contact us [here](https://t.me/snakers41) or just create a GitHub issue!

# **FAQ**

## **1. Issues with reading files**

Maybe try this approach:
```
#### **Maybe try this approach:**
<details><summary>See example</summary>
<p>

```python
from scipy.io import wavfile

sample_rate, sound = wavfile.read(path)
@@ -151,6 +392,10 @@ sound = sound.astype('float32')
if abs_max>0:
    sound *= 1/abs_max
```

</p>
</details>
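
If `scipy` itself has trouble with a file, a fallback using only the standard library may help narrow the problem down. The sketch below assumes 16-bit PCM WAV files (an assumption about the audio format, not something stated here) and applies the same idea of scaling by the peak absolute value. The function name `read_wav_normalized` is hypothetical.

```python
import array
import sys
import wave


def read_wav_normalized(path):
    """Read a 16-bit PCM WAV file and return (sample_rate, peak-normalized floats)."""
    with wave.open(str(path), "rb") as wav:
        if wav.getsampwidth() != 2:
            raise ValueError("this sketch only handles 16-bit PCM")
        sample_rate = wav.getframerate()
        frames = wav.readframes(wav.getnframes())

    samples = array.array("h")  # signed 16-bit integers
    samples.frombytes(frames)
    if sys.byteorder == "big":
        samples.byteswap()  # WAV sample data is little-endian

    abs_max = max((abs(s) for s in samples), default=0)
    if abs_max == 0:
        # Silent (blank) file: nothing to scale.
        return sample_rate, [0.0] * len(samples)
    return sample_rate, [s / abs_max for s in samples]
```

For multi-channel files the returned list is interleaved by channel. A file that comes back as all zeros is blank and can be dropped from the manifest.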

## **2. Why share such a dataset?**

We are not altruists; life just is **not a zero-sum game**.
@@ -163,3 +408,11 @@ Consider the progress in computer vision, that was made possible by:

STT does not enjoy the same attention from the ML community because it is data hungry and public datasets are lacking, especially for languages other than English.
Ultimately this leaves the general community worse off.

## **3. Known issues with the dataset to be fixed**
- Blank files in the YouTube dataset. Just filter them out using the meta data. Will be fixed in the future;
- Some files have low values / crash with torchaudio;
- It looks like scipy does not always write meta data when saving wavs (or you should save an (N,1)-shaped file); this can be fixed as shown above;

# **License**
Dual license: CC BY-NC, with commercial usage available after agreement with the dataset authors.
