Denoising Diffusion Probabilistic Model to generate spectrograms of the human voice

This project was developed for the Audio Data Science module at the University of Applied Sciences in Düsseldorf. Based on dome272's implementation of the DDPM paper, it trains a diffusion model on spectrograms of spoken sentences in an effort to produce gibberish, i.e. nonsense speech. The project is intended as baseline research into whether diffusion models can generate spectrograms of human speech at all, so that later work can build on it for text-to-speech approaches that feed text vectors of the spoken sentences into the training as additional input.

How to use

Preparing the data

The GCPSpectrogramSet class in data_set.py expects a NumPy array saved as a .npy file in a GCP bucket. To generate this array, open batch_gen_data.py and adjust the ORIGIN_FOLDER_PATH and DESTINATION_FOLDER_PATH variables as required. Running the script generates a separate NumPy array for each processed audio file, so in a subsequent step you need to run combine_data.py to combine them into a single file. Once that file is created, it can be uploaded to a GCP bucket of your choice.
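batch_gen_data.py and combine_data.py already handle these steps; the sketch below only illustrates the general shape of such a pipeline, assuming librosa for the mel spectrograms and the google-cloud-storage client for the upload. All function names and spectrogram parameters here are illustrative, not the project's actual values.

```python
import glob
import os

import librosa
import numpy as np
from google.cloud import storage

ORIGIN_FOLDER_PATH = "audio/"        # source audio files (placeholder path)
DESTINATION_FOLDER_PATH = "arrays/"  # per-file .npy output (placeholder path)

def batch_generate():
    """Turn each audio file into a fixed-size mel spectrogram saved as .npy."""
    for path in glob.glob(os.path.join(ORIGIN_FOLDER_PATH, "*.wav")):
        signal, sr = librosa.load(path, sr=22050)
        # Spectrogram parameters are illustrative, not the project's values.
        mel = librosa.feature.melspectrogram(y=signal, sr=sr, n_mels=64)
        mel_db = librosa.power_to_db(mel, ref=np.max)
        # Pad or crop to a fixed width so the arrays can be stacked later.
        mel_db = librosa.util.fix_length(mel_db, size=256, axis=1)
        name = os.path.splitext(os.path.basename(path))[0]
        np.save(os.path.join(DESTINATION_FOLDER_PATH, name + ".npy"), mel_db)

def combine(output_file="data.npy"):
    """Stack the per-file arrays into one array, as combine_data.py does."""
    files = sorted(glob.glob(os.path.join(DESTINATION_FOLDER_PATH, "*.npy")))
    np.save(output_file, np.stack([np.load(f) for f in files]))

def upload(bucket_name="diffusion-project-data", source="data.npy"):
    """Upload the combined array to the GCP bucket."""
    bucket = storage.Client().bucket(bucket_name)
    bucket.blob(os.path.basename(source)).upload_from_filename(source)
```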

Modifying the model

The underlying U-Net model is defined in modules.py, where the parameters of each layer can be modified as needed. Note that each layer's size depends on the previous one, so local testing after modifying the values is advised. This can be done by running modules.py directly, which performs a test run with randomly generated data to check that the model is valid. Hyperparameters are parsed from the command line with argparse. The following flags were set up to customize the training (a sketch of the corresponding parser follows the table):

| Name | Short form | Description | Default |
| --- | --- | --- | --- |
| `--epochs` | `-e` | The number of epochs the model should train for. | `100` |
| `--batch-size` | `-bs` | The batch size to use for training. | `5` |
| `--image_size` | `-is` | The image size as a (height, width) tuple. | `(64, 256)` |
| `--learning_rate` | `-lr` | The desired learning rate. | `3e-4` |
| `--device` | `-d` | The device to use for training. | `cuda` |
| `--dataset` | `-d` | The name of the dataset uploaded to the GCP bucket. | `data256_test.py` |
| `--bucket_name` | `-bn` | The name of the GCP bucket. | `diffusion-project-data` |
| `--result_bucket` | `-rb` | The name of the GCP bucket where the samples from each epoch, as well as the resulting model, are saved. | `diffusion-project-results-na` |
| `--name` | `-n` | The name of the diffusion model used when running the project. | `DDPM_Unconditional256x` |
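As an illustration, a minimal argparse setup matching the table might look like the following sketch; the project's actual parser wiring may differ. Because the table lists `-d` twice and argparse rejects duplicate option strings, the short form for `--dataset` is omitted here.

```python
import argparse

def parse_args():
    """A minimal sketch of the flags in the table above."""
    parser = argparse.ArgumentParser(description="DDPM spectrogram training")
    parser.add_argument("--epochs", "-e", type=int, default=100,
                        help="Number of epochs to train for.")
    parser.add_argument("--batch-size", "-bs", type=int, default=5,
                        help="Batch size to use for training.")
    parser.add_argument("--image_size", "-is", type=int, nargs=2,
                        default=(64, 256), metavar=("HEIGHT", "WIDTH"),
                        help="Image size as height and width.")
    parser.add_argument("--learning_rate", "-lr", type=float, default=3e-4,
                        help="Learning rate.")
    parser.add_argument("--device", "-d", type=str, default="cuda",
                        help="Device to train on.")
    # Short form omitted: "-d" is already taken by --device.
    parser.add_argument("--dataset", type=str, default="data256_test.py",
                        help="Name of the dataset in the GCP bucket.")
    parser.add_argument("--bucket_name", "-bn", type=str,
                        default="diffusion-project-data",
                        help="Name of the GCP data bucket.")
    parser.add_argument("--result_bucket", "-rb", type=str,
                        default="diffusion-project-results-na",
                        help="Bucket for per-epoch samples and the final model.")
    parser.add_argument("--name", "-n", type=str,
                        default="DDPM_Unconditional256x",
                        help="Name of the diffusion model run.")
    return parser.parse_args()
```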

Training the model

The project is set up to train on Google Cloud Platform, so the only local dependency required to run it is the Google Cloud CLI. After setup, the publish.sh script can be modified and run to build a custom Docker container for the training and to create a custom job on Vertex AI, GCP's platform for training AI models.
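publish.sh drives this through the gcloud CLI and Docker. Purely as an illustration of what the resulting custom job amounts to, here is a sketch using the Vertex AI Python SDK; the project ID, region, image URI, and machine settings are placeholders, not values from the repository.

```python
from google.cloud import aiplatform

# All identifiers below are placeholders; the real values live in publish.sh.
aiplatform.init(project="my-gcp-project", location="europe-west4")

job = aiplatform.CustomContainerTrainingJob(
    display_name="ddpm-spectrogram-training",
    container_uri="europe-west4-docker.pkg.dev/my-gcp-project/diffusion/train:latest",
)

# Hyperparameters are passed through to the container's argparse entry point.
job.run(
    args=["--epochs", "100", "--batch-size", "5"],
    replica_count=1,
    machine_type="n1-standard-8",
    accelerator_type="NVIDIA_TESLA_T4",
    accelerator_count=1,
)
```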

Testing results

The project includes an image_to_audio function in audio_utils.py to convert generated images back to audio.
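The exact conversion lives in audio_utils.py. As a sketch of the general technique, a mel spectrogram in decibels can be inverted back to a waveform with librosa's Griffin-Lim based inversion; the sample rate and mel parameters must match those used when the spectrograms were generated.

```python
import librosa
import numpy as np
import soundfile as sf

def image_to_audio_sketch(mel_db: np.ndarray, sr: int = 22050) -> np.ndarray:
    """Invert a mel spectrogram (in dB) back to a waveform.

    A sketch of the general approach, not the project's exact implementation:
    undo the dB scaling, then estimate the waveform with Griffin-Lim.
    """
    mel_power = librosa.db_to_power(mel_db)
    # mel_to_audio runs Griffin-Lim internally to reconstruct the phase.
    return librosa.feature.inverse.mel_to_audio(mel_power, sr=sr)

# Example: load a generated spectrogram and write the result to disk.
# audio = image_to_audio_sketch(np.load("sample.npy"))
# sf.write("sample.wav", audio, 22050)
```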
