Multi-Language Dataset Cleaner/Combiner for Mozilla's DeepSpeech Framework
- de - German = 9.84%
- pl - Polish = 13.7%
- es - Spanish = 13.9%
- it - Italian = 18.4%
- fr - French = 22.7%
- uk - Ukrainian = 29.9%
- ru - Russian = 36.9%
- nl - Dutch = 39.6%
- pt - Portuguese = 50.7%
- en - English
Install KenLM
Install DeepSpeech
git clone https://github.com/silenterus/deepspeech-cleaner
cd deepspeech-cleaner
pip install -r requirements.txt
python3 deepspeech-cleaner.py download --lang fr
python3 deepspeech-cleaner.py insert /path/to/corpora/
python3 deepspeech-cleaner.py create
python3 deepspeech-cleaner.py create --noclean --notrie
bash languages/fr/training/standard/start_train.sh
Download, extract, and clean articles from wiki dumps
python3 deepspeech-cleaner.py crawl
python3 deepspeech-cleaner.py test 1 2 3 is not for me
python3 deepspeech-cleaner.py test /path/to/textfile.txt
python3 deepspeech-cleaner.py convert
python3 deepspeech-cleaner.py autosave
languages/fr/replacer/..
'@> '
' Sat > Saturday '
- only files with a number attached to their name are used
- files numbered <0 are applied before number translation
- files numbered >=0 are applied after number translation
- replace a word/symbol with '�' and the whole sentence gets rejected
- spaces at the start/end of a rule matter: they restrict it to whole words
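The rules above can be sketched in a few lines of Python. This is only an illustration of the idea, not the tool's actual code: the function names, the padding of the sentence with spaces, and the assumption that each rule line is split at the first '>' are all hypothetical. The reject marker is taken to be the character shown in the list above.

```python
# Hypothetical sketch of applying replacer rules; not taken from deepspeech-cleaner.
REJECT_MARK = "\ufffd"  # assumption: the marker character shown in the rule list


def parse_rule(line):
    """Split 'pattern>replacement' at the first '>'; spaces are kept on purpose."""
    pattern, sep, replacement = line.partition(">")
    if not sep or not pattern:
        return None  # not a rule line, skip it
    return pattern, replacement


def apply_rules(sentence, rule_lines, reject_mark=REJECT_MARK):
    """Return the cleaned sentence, or None if a rule marked it for rejection."""
    # Pad with spaces so whole-word rules such as ' Sat ' also match at the edges
    # (assumption about how whole-word matching could be achieved).
    text = f" {sentence} "
    for line in rule_lines:
        rule = parse_rule(line)
        if rule is None:
            continue
        pattern, replacement = rule
        text = text.replace(pattern, replacement)
    if reject_mark in text:
        return None  # the whole sentence is dropped
    return " ".join(text.split())  # collapse the padding and double spaces


rules = ["@> ", " Sat > Saturday "]
print(apply_rules("see you Sat", rules))
print(apply_rules("Saturn is far away", rules))
```

Because ' Sat ' carries its surrounding spaces, "Saturn" is left untouched while a standalone "Sat" is expanded.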
languages/fr/sql_query/..
- files are named like the tables in your "audio.db"
- '!' at the end of a line functions as NOT
python3 deepspeech-cleaner.py help
- Mozilla's DeepSpeech: incredible work, with the potential to gain some autonomy back
- Mozilla's Common Voice: help out with your voice
- TU-Darmstadt Kaldi: a great dataset for lower-quality training sets
- deepspeech-ger: I got many ideas from this repository
- WikiExtractor: part of it is used to extract the wiki articles
- Open Speech and Language Resources: all sorts of datasets
Test - WER: 0.098498, CER: 3.228931, loss: 23.721140
WER: 3.500000, CER: 37.000000, loss: 326.320953
src: “eine neue”
res: “einem neuen leben und neuen pflichten entgegen”
WER: 3.000000, CER: 6.000000, loss: 7.963222
src: “ausverkauft”
res: “aus der fast”
WER: 3.000000, CER: 5.000000, loss: 11.577581
src: “riesengebirge”
res: “riesen der berge”
WER: 3.000000, CER: 6.000000, loss: 11.873451
src: “beerdigung”
res: “wer die un”
WER: 3.000000, CER: 8.000000, loss: 17.944910
src: “besuchstermin”
res: “es wuchs der”
WER: 3.000000, CER: 6.000000, loss: 22.410923
src: “beerdigung”
res: “wer die un”
WER: 3.000000, CER: 4.000000, loss: 25.310646
src: “weitermachen”
res: “bei der machen”
WER: 3.000000, CER: 34.000000, loss: 237.857559
src: “misses dent”
res: “es ist mein wunsch vergessen vernachlässigt”
WER: 3.000000, CER: 74.000000, loss: 484.282074
src: “es endigte mit einem”
res: “es endigte mit einem lauten schall welcher in jedem einsamen zimmer in echo zu wecken schienen”
WER: 2.800000, CER: 69.000000, loss: 650.892578
src: “computer alarm in neun minuten”
res: “per definition handelt es sich bei diesen geräten im engeren sinn um personal computer”
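
The per-sample WER values in the report above are larger than 1.0 because the recognized text contains many more words than the source, and every extra word counts as an error. The following is a minimal sketch, assuming WER is the word-level Levenshtein distance divided by the number of source words. It is not DeepSpeech's own code, and the function names are made up for illustration. The per-sample CER column appears to be the raw character-level edit distance; that is inferred from the numbers, not from documentation.

```python
# Hypothetical sketch of the metrics; not DeepSpeech's implementation.

def edit_distance(ref, hyp):
    """Levenshtein distance between two sequences (lists of words or strings)."""
    prev = list(range(len(hyp) + 1))
    for i, r in enumerate(ref, start=1):
        curr = [i]
        for j, h in enumerate(hyp, start=1):
            cost = 0 if r == h else 1
            curr.append(min(prev[j] + 1,          # deletion
                            curr[j - 1] + 1,      # insertion
                            prev[j - 1] + cost))  # substitution or match
        prev = curr
    return prev[-1]


def wer(src, res):
    """Word error rate: word-level edit distance / number of source words."""
    ref_words = src.split()
    return edit_distance(ref_words, res.split()) / len(ref_words)


# Two samples from the report above
print(wer("eine neue", "einem neuen leben und neuen pflichten entgegen"))  # 3.5
print(wer("ausverkauft", "aus der fast"))                                  # 3.0
print(edit_distance("ausverkauft", "aus der fast"))                        # 6
```

For "eine neue", two substitutions plus five inserted words give 7 errors over 2 source words, hence 3.5.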