This is the accompanying repository of the thesis work *Deep Noise Suppression for Real Time Speech Enhancement in a Single Channel Wide Band Scenario*, developed as part of the requirements of the Master in Sound and Music Computing at Universitat Pompeu Fabra. It provides the code needed to train and evaluate all the models studied throughout the aforementioned research project. In order to use this repository, some external dependencies addressed in the following sections are needed. Additionally, a selection of pretrained models and audio examples is provided. In case of any inquiries, feel free to contact the author through his website https://www.estebangomez.me.
Speech enhancement can be regarded as a dual task that addresses two important issues of degraded speech: speech quality and speech intelligibility. This work focuses on speech quality in a real time context. Algorithms that improve speech quality are sometimes referred to as noise suppression algorithms, since they enhance quality by suppressing the background noise of the degraded speech. Real time capable algorithms are especially important for devices with limited processing power and physical constraints that cannot make use of large architectures, such as hearing aids or wearables. This work uses a deep learning based approach to expand on two previously proposed architectures in the context of the Deep Noise Suppression Challenge carried out by Microsoft Corporation. This challenge has provided datasets and resources to teams of researchers with the common goal of fostering research on the aforementioned topic. The outcome of this thesis can be divided into three main contributions: first, an extended comparison between six variants of the two selected models, considering performance, computational complexity and real time efficiency analyses; second, an open source implementation of one of the proposed architectures, as well as a framework translation of an existing implementation; finally, proposed variants that outperform the previously defined models in terms of denoising performance, complexity and real time efficiency.
The content of this repository can be summarized as follows:
- `docs`: Folder containing the written thesis report.
- `src`: Folder that contains the actual code used in this project.
- `src/dataset`: Contains an `example_dataset` (only provided as an example, since it only has 50 clean/noisy speech pairs) and a `sampling` folder containing examples to be predicted at intermediate steps of a model training.
- `src/logs`: Folder where training `tensorboard` logs and checkpoints are saved.
- `src/predicted`: Place to store the folders with predicted audios when using `predict.py`.
- `src/pretrained_models`: As its name implies, it has a selection of pretrained models ready to use.
- `src/reports`: Folder where `.csv` files containing the scores for each predicted file using the different provided metrics (SI-SDR, SNR, ViSQOL, etc.) are stored.
- `src/utils`: Folder containing a collection of utilities used to implement different classes and functions needed throughout the project.
- `src/visqol`: Expected location for ViSQOL's content (see External dependencies for more details).
- `src/cruse.py`, `src/dtln.py`: Implementation of the actual model classes.
- `src/train_cruse.py`, `src/train_dtln.py`: Scripts for training each model variant.
- `src/dns_dataset.py`: Dataloader implementation.
- `src/predict.py`: Script to predict clean speech given a noisy speech folder.
- `src/profiler.py`: Script to analyze a given model in terms of real time performance and complexity.
- `src/requirements.txt`: Python dependencies.
- `src/score.py`: Script for obtaining the performance metric scores, in `.csv` format, of a specified folder containing predicted files.
This repository requires Python `3.7.10` or higher. It may work with older versions, although this has not been tested. To install the Python dependencies, run the following command:
pip install -r requirements.txt
ViSQOL (Virtual Speech Quality Objective Listener) is an objective, full-reference metric for perceived audio quality, implemented in C++ by Google. In this project it is used to score the predicted audio files and is called by `score.py`. This script assumes that ViSQOL is available in a folder called `visqol` inside the `src` folder of the project, so the path to the executable would be `/src/visqol/bazel-bin/visqol`. The following instructions were copied from the Build section of the original repository, which can be found at https://github.com/google/visqol.
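For orientation, once the binary is built (see the instructions below), calling it from Python boils down to something like the sketch that follows. This is not necessarily the exact code inside `score.py`; the flag names follow ViSQOL's command line documentation and the default path matches the location expected above:

```python
import subprocess

def run_visqol(reference_wav: str, degraded_wav: str,
               binary: str = "visqol/bazel-bin/visqol") -> str:
    """Call the ViSQOL binary and return its raw stdout (which contains the MOS-LQO score)."""
    result = subprocess.run(
        [binary,
         "--reference_file", reference_wav,
         "--degraded_file", degraded_wav,
         "--use_speech_mode"],  # speech mode targets 16 kHz wide band speech
        capture_output=True, text=True, check=True)
    return result.stdout
```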
1. Install Bazel
- Bazel can be installed following the instructions for Linux or Mac.
- Tested with Bazel version `3.4.1`.
2. Build ViSQOL
- Change directory to the root of the ViSQOL project (i.e. where the WORKSPACE file is) and run the following command:
bazel build :visqol -c opt
1. Install Bazel
- Bazel can be installed for Windows from here.
- Tested with Bazel version `3.5.0`.
2. Install git
- `git` for Windows can be obtained from the official git website.
- When installing, select the option that allows `git` to be accessed from the system shells.
3. Build ViSQOL
- Change directory to the root of the ViSQOL project (i.e. where the WORKSPACE file is) and run the following command:
bazel build :visqol -c opt
DNSMOS is a non-intrusive deep learning based metric developed by Microsoft and provided to researchers as a web API upon request as part of the Deep Noise Suppression Challenge. To use DNSMOS, you need to enter your corresponding `SCORING_URI` and `AUTH_KEY` in the body of the `run_dnsmos()` function inside `/src/utils/evaluation_process.py`. More information about this metric and how to request access to it can be found here.
There are three pretrained models provided that can be used directly for inference. They are located inside the `pretrained_models` folder. In order to use one of these models for prediction, `cd` into the `src` folder and issue the following command:
python predict.py <input_dir> <output_dir> <checkpoint> -m <model>
For example, to clean the noisy speech found inside `dataset/example_dataset/noisy_speech` using `pretrained_models/DTLN_BiLSTM_500h.tar`, the command would be the following:
python predict.py dataset/example_dataset/noisy_speech predicted/DTLN_BiLSTM_500h pretrained_models/DTLN_BiLSTM_500h.tar -m dtln_bilstm
This will create a `DTLN_BiLSTM_500h` folder inside the `predicted` folder that will contain all the predicted audio files.
The command line help is always available via the `-h` argument. It also lists the identifiers for each model so that they can be instantiated correctly. If there is a mismatch between the selected checkpoint and the model instance, you will get an error because the structure stored in the checkpoint differs from that of the instantiated model.
python predict.py -h
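The mismatch error mentioned above comes from PyTorch refusing to load a state dict whose keys and tensor shapes do not match the instantiated model. A minimal sketch of what happens under the hood, using hypothetical names (the class actually defined in `dtln.py` and the key layout of the `.tar` checkpoints may differ):

```python
import torch

# Hypothetical import; use the class that matches the checkpoint.
from dtln import DTLN_BiLSTM

model = DTLN_BiLSTM()
checkpoint = torch.load("pretrained_models/DTLN_BiLSTM_500h.tar", map_location="cpu")

# If the checkpoint belongs to a different variant, load_state_dict() raises a
# RuntimeError listing the missing and unexpected keys.
model.load_state_dict(checkpoint["model_state_dict"])  # key name is an assumption
model.eval()
```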
After a set of predictions is computed, it can be evaluated using some or all of the provided metrics (STOI, SI-SDR, PESQ, ViSQOL, DNSMOS, WARP-Q) with the following command:
python score.py <reference_dir> <estimates_dir>
As an example, let's assume the predictions for `dataset/example_dataset` are stored in `predicted/example_predictions`. Then, the command would be:
python score.py dataset/example_dataset/clean_speech predicted/example_predictions
Please note that DNSMOS and ViSQOL are not included by default, since they require external resources to run. If you want to include them, you can do so by typing:
python score.py dataset/example_dataset/clean_speech predicted/example_predictions -m stoi si-sdr pesq visqol dnsmos warpq
This will compute all the available metrics. Once the metrics are computed, a `.csv` file will be automatically created inside the `reports` folder. It will contain one column per metric with the respective results, as well as two additional columns showing the reference and estimate paths used for each prediction.
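Once a report exists, it can be summarized in a few lines, for example with pandas. The file name below is only an example; use whatever `score.py` wrote into `reports`:

```python
import pandas as pd

# Load a report produced by score.py; the file name is hypothetical.
report = pd.read_csv("reports/example_predictions.csv")

# Average every numeric (metric) column, skipping the reference/estimate path columns.
metric_columns = report.select_dtypes(include="number").columns
print(report[metric_columns].mean())
```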
It is possible to inspect the number of parameters and FLOPs of each layer, as well as the inference time on a given machine, by issuing the following command:
python profiler.py -m <model>
For example, to see the information about `CRUSEx4GRU`, you must issue the following command:
python profiler.py -m crusex4gru
This will print the inference statistics over 1000 prediction cycles, along with a table showing the parameters of each layer as well as the FLOPs needed to perform each calculation. Again, further options can be displayed by typing:
python profiler.py -h
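If you want to reproduce a similar measurement outside `profiler.py`, the timing part boils down to something like the sketch below. The model and input shape are placeholders; the actual profiler additionally reports per-layer parameters and FLOPs:

```python
import time

import torch

def measure_inference(model: torch.nn.Module, example_input: torch.Tensor,
                      cycles: int = 1000) -> float:
    """Return the mean inference time in milliseconds over `cycles` forward passes."""
    model.eval()
    with torch.no_grad():
        start = time.perf_counter()
        for _ in range(cycles):
            model(example_input)
        elapsed = time.perf_counter() - start
    return 1000.0 * elapsed / cycles

# Placeholder network and input; substitute the model variant under test.
placeholder = torch.nn.Linear(257, 257)
frame = torch.zeros(1, 257)  # one spectral frame, as processed in a real time loop
print(f"{sum(p.numel() for p in placeholder.parameters())} parameters")
print(f"{measure_inference(placeholder, frame):.4f} ms per frame")
```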
Two files are provided to train a model from scratch: `train_dtln.py` and `train_cruse.py` can be used to train a model of their respective classes. The six possible models to be trained are `dtln`, `dtln_gru`, `dtln_bigru`, `dtln_bilstm`, `cruse` and `crusex4gru`. Their description can be found in the written thesis report inside the `docs` folder. The dataset used to train these models is not provided because of its size, but it can be synthesized using the scripts provided by the DNS-Challenge; please refer to their repository for further details. The `dns_dataset.py` file already contains a data loader to handle the DNS-Challenge dataset (a toy sketch of such a paired loader is given after the training example below). The command to train a model is:
python train_dtln.py <input_dir> <output_dir> -m <model> -d <device> -b <batch_size>
or
python train_cruse.py <input_dir> <output_dir> -m <model> -d <device> -b <batch_size>
For example, to train a `DTLN_BiGRU` model on a GPU using the `example_dataset`, the command would be as follows:
python train_dtln.py dataset/example_dataset/noisy_speech dataset/example_dataset/clean_speech -m dtln_bigru -d cuda:0 -b 10
By doing this, the training process will start, and information about it will be displayed on the screen along with a progress bar showing the status of the latest epoch.
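As mentioned above, the repository ships its own loader in `dns_dataset.py`. Purely for orientation, a paired noisy/clean dataset along the lines below is what such a loader has to provide. The pairing rule (identical file names in both folders) is an assumption; the actual implementation may match DNS-Challenge files differently and apply additional processing:

```python
import os

import soundfile as sf
import torch
from torch.utils.data import Dataset

class PairedSpeechDataset(Dataset):
    """Toy loader that pairs same-named .wav files from a noisy and a clean folder."""

    def __init__(self, noisy_dir: str, clean_dir: str):
        self.noisy_dir, self.clean_dir = noisy_dir, clean_dir
        self.files = sorted(f for f in os.listdir(noisy_dir) if f.endswith(".wav"))

    def __len__(self):
        return len(self.files)

    def __getitem__(self, index):
        name = self.files[index]
        noisy, _ = sf.read(os.path.join(self.noisy_dir, name), dtype="float32")
        clean, _ = sf.read(os.path.join(self.clean_dir, name), dtype="float32")
        return torch.from_numpy(noisy), torch.from_numpy(clean)
```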
A few things to consider:
- Some parameters already have a default value and therefore may not need to be specified, depending on your setup; the details can be checked using the help command:
python train_dtln.py -h
or
python train_cruse.py -h
- Conversely, several other options are available to tweak the training procedure; these can be explored using the same command.
- Both `train_dtln.py` and `train_cruse.py` have a default batch size as specified in their respective papers referenced in the thesis document. `-b 10` is used as an example because bigger batch sizes could cause an error with the example dataset, since it only contains 50 audio files and the train/validation split is set to 80/20 by default. With a bigger dataset, bigger batch sizes will work as well.
- During the training process, a folder will be created inside the `logs` folder. It will contain information that can be visualized in the browser using `tensorboard`, in order to facilitate the tracking of the training process. Additionally, checkpoints are saved at intermediate steps, and the best three checkpoints in terms of validation loss are kept. The routines called during the training process can be inspected and further modified by looking into `train_dtln.py` and `train_cruse.py`. Moreover, the implementation of the callbacks is inside `utils/callbacks.py`. New callbacks can be added by creating new classes that inherit from `TrainingProcessCallback` (a toy sketch is given after this list). If you have used `pytorch_lightning`, you may already be familiar with this concept. This repository is implemented using `PyTorch`, although the structure followed is heavily inspired by `pytorch_lightning`.
- To launch `tensorboard`, use the following command:
tensorboard --logdir <logs_dir>
This will display a URL that can be clicked or copied and pasted, depending on your terminal of choice. As soon as you do, you will see a locally hosted page displaying information about your training process, such as the current epoch, training loss, validation loss, spectrogram and waveform plots, predicted audio examples, and histograms of the weights and biases of each layer.
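As a toy illustration of the callback idea mentioned above, a custom callback might look like the sketch below. The hook name `on_epoch_end` and its arguments are hypothetical; match them to the hooks actually defined by `TrainingProcessCallback` in `utils/callbacks.py`:

```python
# The base class exists in this repository; the hook name and signature below
# are hypothetical and must be matched to what TrainingProcessCallback defines.
from utils.callbacks import TrainingProcessCallback

class PrintValidationLoss(TrainingProcessCallback):
    """Toy callback that prints the validation loss after every epoch."""

    def on_epoch_end(self, epoch, logs):  # hypothetical hook name/arguments
        print(f"epoch {epoch}: val_loss = {logs.get('val_loss')}")
```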
I would like to express my deepest gratitude to Andrés Pérez, PhD and Pritish Chandna, PhD, the supervisor and co-supervisor of this project, respectively. Additionally, I want to thank Voicemod for providing the necessary hardware to carry out the experiments needed throughout the project. Last but not least, I would like to thank Microsoft Corporation for allowing me to use their proprietary tools that are provided to researchers as part of their Deep Noise Suppression Challenge.
If this work turns out to be useful in your own research, you can cite it using this BibTeX code:
@mastersthesis{GomezSMCSE2021,
author = {Esteban Gómez},
title = {{Deep Noise Suppression for Real Time Speech Enhancement in a Single Channel Wide Band Scenario}},
school = {Universitat Pompeu Fabra},
year = 2021,
}
This work is licensed under a Creative Commons Attribution 4.0 International License.