Minor Adjustments for HiFi-GAN Compatibility with Chasing Waterfalls Acoustic Model Output

This HiFi-GAN implementation is compatible with output from the Chasing Waterfalls acoustic model. Mel spectrograms generated by the acoustic model can be read from pickle files for inference and fine-tuning. The vocoder was pretrained on LJ Speech upsampled to 44kHz (using upsample.sh). The segment_length was doubled so that each training segment covers the same duration as in the original HiFi-GAN paper.
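The doubling follows directly from the sample rates. A quick check, assuming the original configuration's segment size of 8192 samples at 22,050 Hz, and assuming that "44kHz" here means 44,100 Hz with a doubled segment of 16,384 samples (the two fork values are assumptions, not read from config_am.json):

```python
# Segment duration in seconds = samples per segment / sample rate.
ORIG_SEGMENT, ORIG_RATE = 8192, 22050    # original HiFi-GAN config_v1.json
FORK_SEGMENT, FORK_RATE = 16384, 44100   # ASSUMED values for this fork


def segment_seconds(segment_size: int, sampling_rate: int) -> float:
    """Return the duration of one training segment in seconds."""
    return segment_size / sampling_rate


orig_duration = segment_seconds(ORIG_SEGMENT, ORIG_RATE)
fork_duration = segment_seconds(FORK_SEGMENT, FORK_RATE)
print(f"original: {orig_duration:.4f} s, fork: {fork_duration:.4f} s")
```

Under these assumptions both segments last about 0.37 s, so the model sees the same amount of audio per training example at either sample rate.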

Mel Compatibility

The acoustic model was trained on mels computed from 44kHz, 32-bit audio files with a specific hop and window length. The mels it generates are in the same format as those produced by librosa.feature.melspectrogram() and were normalized (using meldataset.norm_mel()) before being saved as pickle files. HiFi-GAN's own mel generation is a torch.Tensor implementation of the same method as librosa, with dynamic range compression added as a final step (see meldataset.spectral_normalize_torch). For inference and fine-tuning, the acoustic model's mels are therefore denormalized and passed through meldataset.dynamic_range_compression() to match the format of the original HiFi-GAN implementation.
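A rough sketch of that conversion is given here. The compression step mirrors the original HiFi-GAN formula, log(clamp(x, min=clip_val) * C) with C=1 and clip_val=1e-5. The inverse of meldataset.norm_mel() is not described in this README, so denorm_mel below is a hypothetical min-max inverse with made-up bounds, and plain Python lists stand in for tensors:

```python
import math


def dynamic_range_compression(x, C=1.0, clip_val=1e-5):
    """Log-compress one magnitude value, clipping very small values first."""
    return math.log(max(x, clip_val) * C)


def denorm_mel(value, mel_min=0.0, mel_max=100.0):
    """HYPOTHETICAL inverse of a min-max normalisation to [0, 1].

    The real meldataset.norm_mel() may use a different scheme and bounds.
    """
    return value * (mel_max - mel_min) + mel_min


def to_hifigan_format(normalised_mel):
    """Denormalise every bin, then apply dynamic range compression."""
    return [[dynamic_range_compression(denorm_mel(v)) for v in frame]
            for frame in normalised_mel]


mel = [[0.0, 0.5], [1.0, 0.25]]  # toy 2x2 "mel" of normalised values
converted = to_hifigan_format(mel)
```

The clipping matters for silent bins: a denormalised value of 0 would otherwise produce log(0), so it is floored at clip_val before the logarithm is taken.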

Dataset preparation

The training and validation split is defined when the dataset for the acoustic model is generated. For those files, a training.txt and a validation.txt have to be generated, each listing the audio filenames of the corresponding split. prepare_dataset.py can be used for this as follows:

python3 prepare_dataset.py --dataset_folder ../fastspeech_fork/data/K3_processed/snippets_test/wav_mono/ 
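For orientation, the kind of file list this step produces can be sketched as below. This is not the code of prepare_dataset.py: write_filelist is a hypothetical helper, and writing names without the .wav extension is an assumption about the list format.

```python
import tempfile
from pathlib import Path


def write_filelist(wav_dir: Path, out_file: Path) -> list:
    """Write one wav file name (without extension) per line; return the names.

    HYPOTHETICAL helper: the list format used by prepare_dataset.py may differ.
    """
    names = sorted(p.stem for p in wav_dir.glob("*.wav"))
    out_file.write_text("\n".join(names) + "\n", encoding="utf-8")
    return names


# Demo on a throwaway directory with empty placeholder files.
with tempfile.TemporaryDirectory() as tmp:
    wav_dir = Path(tmp)
    for name in ("b_002.wav", "a_001.wav", "notes.txt"):
        (wav_dir / name).touch()
    names = write_filelist(wav_dir, wav_dir / "training.txt")
    written = (wav_dir / "training.txt").read_text(encoding="utf-8")
```

Sorting keeps the list reproducible across runs, and non-wav files in the folder are ignored.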

Training

The model was pretrained using LJSpeech for 220k steps and then trained again on the destination dataset for another 45k steps using the commands below:

# pretraining on 44khz upsampled LJSpeech
python3 train.py --config config_am.json --input_wavs_dir LJSpeech-1.1/wavs_44khz --training_epochs 1000000 --input_training_file LJSpeech-1.1/training.txt --input_validation_file LJSpeech-1.1/validation.txt --checkpoint_interval 20000 --validation_interval 50 --stdout_interval 10

# train on real dataset at 44khz starting from ljspeech pretrained
python3 train.py --config config_am.json --input_wavs_dir /opt/waterfalls/data/vocoder/097/wav_mono --training_epochs 50000 --input_training_file /opt/waterfalls/data/vocoder/097/training.txt --input_validation_file /opt/waterfalls/data/vocoder/097/validation.txt --input_mels_dir /opt/waterfalls/data/vocoder/097/mels_diff --stdout_interval 10 --fine_tuning False --validation_interval 50

The --test_pickle flag can be used to evaluate the model during training against a chosen mel spectrogram. See am_output.pkl for a sample pickle file.

Inference

The vocoder can be used to generate 44kHz, 32-bit audio files from mel spectrograms saved in pickle files, as shown below:

python3 inference_e2e.py --checkpoint_file g_00140000 --input_mels_dir data/test/ --output_dir data/test/ --file_suffix _ljs_44khz_140k
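Reading such a directory of pickled mels can be sketched as follows. load_mel_pickles is a hypothetical helper, not part of inference_e2e.py, and the object stored in each pickle is assumed to be the spectrogram itself; a nested list stands in for the array in the demo.

```python
import pickle
import tempfile
from pathlib import Path


def load_mel_pickles(mel_dir: Path) -> dict:
    """Map each .pkl file stem to the object stored in that file."""
    mels = {}
    for path in sorted(mel_dir.glob("*.pkl")):
        with path.open("rb") as handle:
            mels[path.stem] = pickle.load(handle)
    return mels


# Demo: a nested list stands in for an (n_mels x frames) spectrogram.
with tempfile.TemporaryDirectory() as tmp:
    sample = [[0.1, 0.2, 0.3], [0.4, 0.5, 0.6]]
    with (Path(tmp) / "am_output.pkl").open("wb") as handle:
        pickle.dump(sample, handle)
    loaded = load_mel_pickles(Path(tmp))
```

Note that pickle.load can execute arbitrary code while unpickling, so only pickle files from a trusted source, such as your own acoustic model run, should be loaded this way.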

The additional inference_app.py script is used to integrate the vocoder into a Svelte frontend.

See below for the original HiFi-GAN Readme.

HiFi-GAN: Generative Adversarial Networks for Efficient and High Fidelity Speech Synthesis

Jungil Kong, Jaehyeon Kim, Jaekyoung Bae

In our paper, we proposed HiFi-GAN: a GAN-based model capable of generating high fidelity speech efficiently.
We provide our implementation and pretrained models as open source in this repository.

Abstract : Several recent work on speech synthesis have employed generative adversarial networks (GANs) to produce raw waveforms. Although such methods improve the sampling efficiency and memory usage, their sample quality has not yet reached that of autoregressive and flow-based generative models. In this work, we propose HiFi-GAN, which achieves both efficient and high-fidelity speech synthesis. As speech audio consists of sinusoidal signals with various periods, we demonstrate that modeling periodic patterns of an audio is crucial for enhancing sample quality. A subjective human evaluation (mean opinion score, MOS) of a single speaker dataset indicates that our proposed method demonstrates similarity to human quality while generating 22.05 kHz high-fidelity audio 167.9 times faster than real-time on a single V100 GPU. We further show the generality of HiFi-GAN to the mel-spectrogram inversion of unseen speakers and end-to-end speech synthesis. Finally, a small footprint version of HiFi-GAN generates samples 13.4 times faster than real-time on CPU with comparable quality to an autoregressive counterpart.

Visit our demo website for audio samples.

Pre-requisites

  1. Python >= 3.6
  2. Clone this repository.
  3. Install python requirements. Please refer to requirements.txt
  4. Download and extract the LJ Speech dataset, and move all wav files to LJSpeech-1.1/wavs

Training

python train.py --config config_v1.json

To train V2 or V3 Generator, replace config_v1.json with config_v2.json or config_v3.json.
Checkpoints and a copy of the configuration file are saved in the cp_hifigan directory by default.
You can change the path by adding the --checkpoint_path option.

Validation loss during training with V1 generator.
[Figure: validation loss]

Pretrained Model

You can also use pretrained models we provide.
Download pretrained models
Details of each folder are as follows:

| Folder Name | Generator | Dataset | Fine-Tuned |
| --- | --- | --- | --- |
| LJ_V1 | V1 | LJSpeech | No |
| LJ_V2 | V2 | LJSpeech | No |
| LJ_V3 | V3 | LJSpeech | No |
| LJ_FT_T2_V1 | V1 | LJSpeech | Yes (Tacotron2) |
| LJ_FT_T2_V2 | V2 | LJSpeech | Yes (Tacotron2) |
| LJ_FT_T2_V3 | V3 | LJSpeech | Yes (Tacotron2) |
| VCTK_V1 | V1 | VCTK | No |
| VCTK_V2 | V2 | VCTK | No |
| VCTK_V3 | V3 | VCTK | No |
| UNIVERSAL_V1 | V1 | Universal | No |

We provide the universal model with discriminator weights that can be used as a base for transfer learning to other datasets.

Fine-Tuning

  1. Generate mel-spectrograms in numpy format using Tacotron2 with teacher-forcing.
    The file name of the generated mel-spectrogram should match the audio file and the extension should be .npy.
    Example:
    Audio File : LJ001-0001.wav
    Mel-Spectrogram File : LJ001-0001.npy
    
  2. Create ft_dataset folder and copy the generated mel-spectrogram files into it.
  3. Run the following command.
    python train.py --fine_tuning True --config config_v1.json
    
    For other command line options, please refer to the training section.
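Because fine-tuning pairs each audio file with a mel file of the same name, a quick check for unmatched files can save a failed run. The sketch below is a hypothetical helper, not part of the repository; the directory names in the demo are only examples.

```python
import tempfile
from pathlib import Path


def missing_mels(wav_dir: Path, mel_dir: Path) -> list:
    """Return stems of wav files that have no matching .npy mel file."""
    mel_stems = {p.stem for p in mel_dir.glob("*.npy")}
    return sorted(p.stem for p in wav_dir.glob("*.wav")
                  if p.stem not in mel_stems)


# Demo with empty placeholder files: one wav has a mel, one does not.
with tempfile.TemporaryDirectory() as tmp:
    wav_dir = Path(tmp) / "wavs"
    mel_dir = Path(tmp) / "ft_dataset"
    wav_dir.mkdir()
    mel_dir.mkdir()
    (wav_dir / "LJ001-0001.wav").touch()
    (wav_dir / "LJ001-0002.wav").touch()
    (mel_dir / "LJ001-0001.npy").touch()
    unmatched = missing_mels(wav_dir, mel_dir)
```

An empty result means every audio file has a mel-spectrogram with the same base name.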

Inference from wav file

  1. Make test_files directory and copy wav files into the directory.
  2. Run the following command.
    python inference.py --checkpoint_file [generator checkpoint file path]
    

Generated wav files are saved in generated_files by default.
You can change the path by adding the --output_dir option.

Inference for end-to-end speech synthesis

  1. Make test_mel_files directory and copy generated mel-spectrogram files into the directory.
    You can generate mel-spectrograms using Tacotron2, Glow-TTS and so forth.
  2. Run the following command.
    python inference_e2e.py --checkpoint_file [generator checkpoint file path]
    

Generated wav files are saved in generated_files_from_mel by default.
You can change the path by adding --output_dir option.

Acknowledgements

We referred to WaveGlow, MelGAN and Tacotron2 to implement this.