From dd6ac595e83d30cde8f805b8df7bf15d0f56ed29 Mon Sep 17 00:00:00 2001
From: snakers41
Date: Tue, 30 Apr 2019 08:37:32 +0000
Subject: [PATCH] Add v0.3-alpha description

---
 README.md | 319 ++++++++++++++++++++++++++++++++++++++++++++++++------
 1 file changed, 286 insertions(+), 33 deletions(-)

diff --git a/README.md b/README.md
index 099c585..15056b5 100644
--- a/README.md
+++ b/README.md
@@ -1,42 +1,51 @@
 # **Russian Open STT Dataset**

 Arguably the largest public Russian STT dataset up to date:
-- ~3m utterances;
-- 1,771+ hours;
-- 190GB;
-- Additional 3,000 hours ... and more ... to be released soon!;
+- ~5m utterances;
+- ~4,200 hours;
+- 457GB;
+- Additional 1,500 hours ... and more ... to be released soon!;
+- And then maybe even more hours to be released!;

-Prove [me](https://t.me/snakers41) wrong!
+Prove [us](https://t.me/snakers41) wrong!
 Open issues, collaborate, submit a PR, contribute, share your datasets!
 Let's make STT in Russian (and more) as open and available as CV models.

 # **Dataset composition**

-| Dataset | Utterances | Hours | GB | Av len/chars | Comment | Annotation | Quality/noise |
-|-------------------------------|------------|-------|-----|--------------|------------------|---------------|---------------|
-| asr_public_phone_calls_2 (*) | | 1,500 | | | * Coming soon | | |
-| public_youtube1500 (*) | | 1,500 | | | * Coming soon | | |
-| tts_russian_addresses | 1,741,838 | 754 | 81 | 1.6s / 20 | Russian addresses| TTS, 4 voices | 100% / crisp |
-| public_youtube700 | 759,483 | 701 | 75 | 3.3s / 43 | Youtube videos | Subtitles | >95% / ~crisp |
-| asr_public_phone_calls_1 | 233,868 | 211 | 23 | 3.3s / 29 | Phone calls | ASR | 70% / noisy |
-| asr_public_stories_1 | 46,142 | 38 | 4 | 3.0s / 30 | Books | ASR | 70% / crisp |
-| public_series_1 | 20,243 | 17 | 2 | 3.1s / 38 | Youtube videos | Subtitles | 95% / ~crisp |
-| ru_RU | 5,826 | 17 | 2 | 10.8s / 12 | Public dataset | Alignment | 99% / crisp |
-| voxforge_ru | 8,344 | 17 | 2 | 7.5s / 77 | Public dataset | Reading | 100% / crisp |
-| russian_single | 3,357 | 9 | 1 | 9.3s / 102 | Public dataset | Alignment | 99% / crisp |
-| public_lecture_1 | 6,803 | 6 | 1 | 3.4s / 47 | Lectures | Subtitles | >95% / crisp |
-| Total | 2,825,904 | 1,771 | 190 | | | | |
+| Dataset | Utterances | Hours | GB | Av s/chars | Comment | Annotation | Quality/noise |
+|---------------------------|------------|-------|-----|------------|------------------|-------------|---------------|
+| public_youtube1500 (*) | | 1,500 | | | * Coming soon | | |
+| audiobook_2 | 1,149,404 | 1,511 | 166 | 4.7s / 56 | Books | Alignment | 99% / crisp |
+| audiobook_1 | 196,666 | 237 | 26 | 4.3s / 50 | Books | Alignment | 99% / crisp |
+| public_youtube700 | 759,483 | 701 | 75 | 3.3s / 43 | Youtube videos | Subtitles | 95% / ~crisp |
+| tts_russian_addresses | 1,741,838 | 754 | 81 | 1.6s / 20 | Russian addresses| TTS 4 voices| 100% / crisp |
+| asr_public_phone_calls_2 | 603,797 | 601 | 66 | 3.6s / 37 | Phone calls | ASR | 70% / noisy |
+| asr_public_phone_calls_1 | 233,868 | 211 | 23 | 3.3s / 29 | Phone calls | ASR | 70% / noisy |
+| asr_public_stories_2 | 78,186 | 78 | 9 | 3.5s / 43 | Books | ASR | 80% / crisp |
+| asr_public_stories_1 | 46,142 | 38 | 4 | 3.0s / 30 | Books | ASR | 80% / crisp |
+| public_series_1 | 20,243 | 17 | 2 | 3.1s / 38 | Youtube videos | Subtitles | 95% / ~crisp |
+| ru_RU | 5,826 | 17 | 2 | 11s / 12 | Public dataset | Alignment | 99% / crisp |
+| voxforge_ru | 8,344 | 17 | 2 | 7.5s / 77 | Public dataset | Reading | 100% / crisp |
+| russian_single | 3,357 | 9 | 1 | 9.3s / 102 | Public dataset | Alignment | 99% / crisp |
+| public_lecture_1 | 6,803 | 6 | 1 | 3.4s / 47 | Lectures | Subtitles | 95% / crisp |
+| Total | 4,853,957 | 4,198 | 457 | | | | |

 # **Downloads**

 ## **Links**

-Meta data [file](https://ru-open-stt.ams3.digitaloceanspaces.com/public_meta_data_v02.csv).
+Meta data [file](https://ru-open-stt.ams3.digitaloceanspaces.com/public_meta_data_v03.csv).
+
 | Dataset | GB | GB, compressed | Audio | Source | Manifest |
 |---------------------------------------|------|----------------|-------| -------| ----------|
+| audiobook_1 | 26 | 20.8 | [part1](https://ru-open-stt.ams3.digitaloceanspaces.com/audiobooks_1.tar.gz) | Public books + alignment | [link](https://ru-open-stt.ams3.digitaloceanspaces.com/private_buriy_audiobooks_1.csv) |
+| audiobook_2 | 166 | 131.7 | [part1](https://ru-open-stt.ams3.digitaloceanspaces.com/audiobooks_2.tar.gz_aa), [part2](https://ru-open-stt.ams3.digitaloceanspaces.com/audiobooks_2.tar.gz_ab), [part3](https://ru-open-stt.ams3.digitaloceanspaces.com/audiobooks_2.tar.gz_ac), [part4](https://ru-open-stt.ams3.digitaloceanspaces.com/audiobooks_2.tar.gz_ad), [part5](https://ru-open-stt.ams3.digitaloceanspaces.com/audiobooks_2.tar.gz_ae), [part6](https://ru-open-stt.ams3.digitaloceanspaces.com/audiobooks_2.tar.gz_af), [part7](https://ru-open-stt.ams3.digitaloceanspaces.com/audiobooks_2.tar.gz_ag) | Public books + alignment | [link](https://ru-open-stt.ams3.digitaloceanspaces.com/private_buriy_audiobooks_2.csv) |
+| asr_public_phone_calls_2 | 66 | 51.7 | [part1](https://ru-open-stt.ams3.digitaloceanspaces.com/asr_public_phone_calls_2.tar.gz_aa), [part2](https://ru-open-stt.ams3.digitaloceanspaces.com/asr_public_phone_calls_2.tar.gz_ab), [part3](https://ru-open-stt.ams3.digitaloceanspaces.com/asr_public_phone_calls_2.tar.gz_ac) | ASR + public phone calls | [link](https://ru-open-stt.ams3.digitaloceanspaces.com/asr_public_phone_calls_2.csv) |
+| asr_public_stories_2 | 9 | 7.5 | [part1](https://ru-open-stt.ams3.digitaloceanspaces.com/asr_public_stories_2.tar.gz) | Public books + alignment | [link](https://ru-open-stt.ams3.digitaloceanspaces.com/asr_public_stories_2.csv) |
 | tts_russian_addresses_rhvoice_4voices | 80.9 | 67.0 | [part1](https://ru-open-stt.ams3.digitaloceanspaces.com/tts_russian_addresses_rhvoice_4voices.tar.gz_aa), [part2](https://ru-open-stt.ams3.digitaloceanspaces.com/tts_russian_addresses_rhvoice_4voices.tar.gz_ab), [part3](https://ru-open-stt.ams3.digitaloceanspaces.com/tts_russian_addresses_rhvoice_4voices.tar.gz_ac), [part4](https://ru-open-stt.ams3.digitaloceanspaces.com/tts_russian_addresses_rhvoice_4voices.tar.gz_ad) | TTS | [link](https://ru-open-stt.ams3.digitaloceanspaces.com/tts_russian_addresses_rhvoice_4voices.csv) |
 | public_youtube700 | 75.0 | 67.0 | [part1](https://ru-open-stt.ams3.digitaloceanspaces.com/public_youtube700.tar.gz_aa), [part2](https://ru-open-stt.ams3.digitaloceanspaces.com/public_youtube700.tar.gz_ab), [part3](https://ru-open-stt.ams3.digitaloceanspaces.com/public_youtube700.tar.gz_ac), [part4](https://ru-open-stt.ams3.digitaloceanspaces.com/public_youtube700.tar.gz_ad) | YouTube videos | [link](https://ru-open-stt.ams3.digitaloceanspaces.com/public_youtube700.csv) |
 | asr_public_phone_calls_1 | 22.7 | 19.0 | [part1](https://ru-open-stt.ams3.digitaloceanspaces.com/asr_public_phone_calls_1.tar.gz) | ASR + public phone calls | [link](https://ru-open-stt.ams3.digitaloceanspaces.com/asr_public_phone_calls_1.csv) |
@@ -71,6 +80,226 @@ Meta data [file](https://ru-open-stt.ams3.digitaloceanspaces.com/public_meta_dat
 2. Download the meta data and manifests for each dataset:
 3. Merge files (where applicable), unpack and enjoy!

+## **Check md5sum**
+
+`md5sum /path/to/downloaded/file`
+
+| type | md5sum | file |
+|------|--------|------|
+| manifest | b0ce7564ba90b121aeb13aada73a6e30 | asr_public_phone_calls_1.csv |
+| manifest | 6867d14dfdec1f9e9b8ca2f1de9ceda6 | asr_public_phone_calls_2.csv |
+| manifest | 0bdd77e15172e654d9a1999a86e92c7f | asr_public_stories_1.csv |
+| manifest | f388013039d94dc36970547944db51c7 | asr_public_stories_2.csv |
+| manifest | 697738331b6021890c29a0d415d0f22d | private_buriy_audiobooks_1.csv |
+| manifest | 3b67e27c1429593cccbf7c516c4b582d | private_buriy_audiobooks_2.csv |
+| manifest | 04027c20eb3aff05f6067957ecff856b | public_lecture_1.csv |
+| manifest | 89da3f1b6afcd4d4936662ceabf3033e | public_series_1.csv |
+| manifest | a81dfb018c88d0ecd5194ab3d8ff6c95 | public_youtube700.csv |
+| manifest | c858f020729c34ba0ab525bbb8950d0c | ru_RU.csv |
+| manifest | 0275525914825dec663fd53390fdc9a0 | russian_single.csv |
+| manifest | 52f406f4e30fcc8c634f992befd91beb | tts_russian_addresses_rhvoice_4voices.csv |
+| audio | a5496898ee78654bf398ec6df71540d7 | asr_public_phone_calls_1.tar.gz |
+| audio | e4df5ef50787384648b59f5a87edc0c6 | asr_public_phone_calls_2.tar.gz |
+| audio | 97594127a922df8a7bcc2eecd2470805 | asr_public_phone_calls_2.tar.gz_aa |
+| audio | f9b6475f0f2898b16d9e6e0e648fb531 | asr_public_phone_calls_2.tar.gz_ab |
+| audio | b19977c889cda639f621195251e6bb6f | asr_public_phone_calls_2.tar.gz_ac |
+| audio | 657a31b544b10295f909ef4b2ca5c156 | asr_public_stories_1.tar.gz |
+| audio | 7533581bb26975212817bcacb25546d0 | asr_public_stories_2.tar.gz |
+| audio | d7d374025c56ca556d9cde86b9fdffda | audiobooks_1.tar.gz |
+| audio | 3955616cd89761bf2d54d0e992f7eae5 | audiobooks_2.tar.gz_aa |
+| audio | 81b6ec147c0c43bdd56002c41e0288b8 | audiobooks_2.tar.gz_ab |
+| audio | 15d4cf99171c2db3f375619f4bd2b6d9 | audiobooks_2.tar.gz_ac |
+| audio | 50635b0f4bdf44fae96e5a65f4738e19 | audiobooks_2.tar.gz_ad |
+| audio | f1103be39ffc2da4a98d8f6ddeb50aa0 | audiobooks_2.tar.gz_ae |
+| audio | 8b45d2bd8b1fa1d906e36b9fabd9fe4c | audiobooks_2.tar.gz_af |
+| audio | 5104df44933b612b3c1bfc06f6376654 | audiobooks_2.tar.gz_ag |
+| audio | e6b9e5f46811d33ea34ce50f6067a762 | public_lecture_1.tar.gz |
+| audio | 86ebf7e30986b8ee8df11f85b35588a0 | public_series_1.tar.gz |
+| audio | dc260dd8151b4fce6cde6d80af13146d | public_youtube700.tar.gz_aa |
+| audio | 04706ef0f98841ec8d2f20a83aca3cf1 | public_youtube700.tar.gz_ab |
+| audio | e11d5b118bf71425e4915e61277a06a9 | public_youtube700.tar.gz_ac |
+| audio | d9a93157263eb9d8078c0e0b88c271de | public_youtube700.tar.gz_ad |
+| audio | 1bbba5eb2f4911c9ed20ec69cbd292cb | ru_ru.tar.gz |
+| audio | 6f79a9c514ad48a5763e3142919fc765 | russian_single.tar.gz |
+| audio | c926df1068218eb9cc8103c94003fcc6 | tts_russian_addresses_rhvoice_4voices.tar |
+| audio | 31d515e0bdfc467c3fe63088b817c15c | tts_russian_addresses_rhvoice_4voices.tar.gz_aa |
+| audio | 4ca15694a8d8a638bbdc5e90832eadb4 | tts_russian_addresses_rhvoice_4voices.tar.gz_ab |
+| audio | 447559a38cd8bf61c5de64e602f06da3 | tts_russian_addresses_rhvoice_4voices.tar.gz_ac |
+| audio | 9131347a97c2e794d7c6d5a265083e83 | tts_russian_addresses_rhvoice_4voices.tar.gz_ad |
+| audio | 91e2115b17b1ad08649f428d2caa643b | voxforge_ru.tar.gz |
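The checksum step can also be scripted instead of running `md5sum` file by file. The block below is a minimal sketch, not part of the repository: `file_md5`, `find_mismatches` and `merge_parts` are hypothetical helper names, and the merge step assumes the `_aa`, `_ab`, ... parts are plain byte-wise splits of one archive that can be concatenated in order.

```python
import hashlib
import shutil
from pathlib import Path


def file_md5(path, chunk_size=1 << 20):
    """Return the hex md5 digest of a file, reading it in chunks."""
    digest = hashlib.md5()
    with open(path, "rb") as handle:
        for chunk in iter(lambda: handle.read(chunk_size), b""):
            digest.update(chunk)
    return digest.hexdigest()


def find_mismatches(expected, folder):
    """Return the names that are missing in `folder` or whose md5 differs.

    `expected` maps a file name to the md5sum published for it.
    """
    bad = []
    for name, md5 in expected.items():
        path = Path(folder) / name
        if not path.is_file() or file_md5(path) != md5.lower():
            bad.append(name)
    return bad


def merge_parts(parts, target):
    """Concatenate archive parts, in the given order, into `target`.

    Assumption: the parts are byte-wise splits of a single archive.
    """
    with open(target, "wb") as out:
        for part in parts:
            with open(part, "rb") as src:
                shutil.copyfileobj(src, out)
    return target
```

For instance, `find_mismatches({"public_lecture_1.csv": "04027c20eb3aff05f6067957ecff856b"}, "downloads")` would return an empty list if that manifest was downloaded intact into a (hypothetical) `downloads` folder. Since the checksum list contains both `asr_public_phone_calls_2.tar.gz` and its three parts, the merged archive can be checked as well as the individual parts.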
+
 # **Annotation methodology**

 The dataset is compiled using open domain sources.
@@ -105,43 +334,55 @@ store_path = Path(root_folder,

 Use helper functions from here for easier work with manifest files.

-Read manifests:
-```
+#### **Read manifests**
+
+```python
 from utils.open_stt_utils import read_manifest

 manifest_df = read_manifest('path/to/manifest.csv')
 ```

-Merge, check and save manifests:
-```
+
+#### **Merge, check and save manifests**
+
+```python
 from utils.open_stt_utils import (plain_merge_manifests,
                                   check_files,
                                   save_manifest)
-
 train_manifests = [
-    'path/to/manifest1.csv',
-    'path/to/manifest2.csv',
+    'path/to/manifest1.csv',
+    'path/to/manifest2.csv',
 ]
-
-train_manifest = plain_merge_manifests(train_manifests,
+train_manifest = plain_merge_manifests(train_manifests,
                                        MIN_DURATION=0.1,
                                        MAX_DURATION=100)
 check_files(train_manifest)
-
 save_manifest(train_manifest,
               'my_manifest.csv')
 ```
+
+
 # **Contacts**

-Please contact me [here](https://t.me/snakers41) or just create a GitHub issue!
+Please contact us [here](https://t.me/snakers41) or just create a GitHub issue!

 # **FAQ**

 ## **1. Issues with reading files**

-Maybe try this approach:
-```
+#### **Maybe try this approach:**
+
+```python
 from scipy.io import wavfile

 sample_rate, sound = wavfile.read(path)
@@ -151,6 +392,10 @@ sound = sound.astype('float32')
 if abs_max>0:
     sound *= 1/abs_max
 ```
+
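The reading approach from FAQ 1 can be wrapped into one small helper. This is only a sketch: the patch context skips the lines that compute `abs_max`, so the version below assumes it is the maximum absolute sample value, and `read_normalized_wav` is a hypothetical name that does not come from the repository's utilities.

```python
import numpy as np
from scipy.io import wavfile


def read_normalized_wav(path):
    """Read a wav file and peak-normalize it to float32.

    Returns (sample_rate, samples). Non-silent audio is scaled so that its
    largest absolute sample is 1.0; silent audio is returned unchanged.
    """
    sample_rate, sound = wavfile.read(path)
    sound = sound.astype('float32')
    # Assumption: abs_max is the peak absolute amplitude of the signal.
    abs_max = float(np.abs(sound).max()) if sound.size else 0.0
    if abs_max > 0:
        sound *= 1 / abs_max
    return sample_rate, sound
```

Guarding on `abs_max > 0` keeps blank files from triggering a division by zero, which matters because some recordings in the dataset are reported to be empty.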
+
 ## **2. Why share such dataset?**

 We are not altruists, life just is **not a zero sum game**.
@@ -163,3 +408,11 @@ Consider the progress in computer vision, that was made possible by:
 TTS does not enjoy the same attention by ML community because it is data hungry and public datasets are lacking, especially for languages other than English.
 Ultimately it leads to worse-off situation for the general community.
+
+## **3. Known issues with the dataset to be fixed**
+- Blank files in Youtube dataset. Just filter them out using meta-data. Will be fixed in future;
+- Some files that have low values / crash with torchaudio;
+- Looks like scipy does not always write meta-data when saving wavs (or you should save (N,1) shaped file) - this can be fixed as shown above;
+
+# **License**
+Dual license, cc-by-nc and commercial usage available after agreement with dataset authors