Updated README
bshall committed Jul 28, 2022
1 parent e05702e commit 374a456
Showing 6 changed files with 121 additions and 55 deletions.
113 changes: 80 additions & 33 deletions README.md
@@ -1,15 +1,27 @@
<p align="center">
<a target="_blank" href="https://colab.research.google.com/github/bshall/soft-vc/blob/main/soft-vc-demo.ipynb">
<img src="https://colab.research.google.com/assets/colab-badge.svg" alt="Open In Colab"/>
</a>
</p>

# HiFi-GAN

An 16kHz implementation of HiFi-GAN for [soft-vc](https://github.com/bshall/soft-vc).
Training and inference scripts for the vocoder models in [A Comparison of Discrete and Soft Speech Units for Improved Voice Conversion](https://ieeexplore.ieee.org/abstract/document/9746484). For more details see [soft-vc](https://github.com/bshall/soft-vc). Audio samples can be found [here](https://bshall.github.io/soft-vc/). Colab demo can be found [here](https://colab.research.google.com/github/bshall/soft-vc/blob/main/soft-vc-demo.ipynb).

Relevant links:
- [Official HiFi-GAN repo](https://github.com/jik876/hifi-gan)
- [HiFi-GAN paper](https://arxiv.org/abs/2010.05646)
- [Soft-VC repo](https://github.com/bshall/soft-vc)
- [Soft-VC paper]()
<div align="center">
<img width="100%" alt="Soft-VC"
src="https://raw.githubusercontent.com/bshall/hifigan/main/vocoder.png">
</div>
<div>
<sup>
<strong>Fig 1:</strong> Architecture of the voice conversion system. a) The <strong>discrete</strong> content encoder clusters audio features to produce a sequence of discrete speech units. b) The <strong>soft</strong> content encoder is trained to predict the discrete units. The acoustic model transforms the discrete/soft speech units into a target spectrogram. The vocoder converts the spectrogram into an audio waveform.
</sup>
</div>

## Example Usage

### Programmatic Usage

```python
import torch
import numpy as np
@@ -22,39 +34,71 @@ mel = torch.from_numpy(np.load("path/to/mel")).unsqueeze(0).cuda()
wav, sr = hifigan.generate(mel)
```

## Train
### Script-Based Usage

```
usage: generate.py [-h] {soft,discrete,base} in-dir out-dir
Generate audio for a directory of mel-spectrograms using HiFi-GAN.
positional arguments:
{soft,discrete,base} available models (HuBERT-Soft, HuBERT-Discrete, or
Base).
in-dir path to input directory containing the mel-
spectrograms.
out-dir path to output directory.
optional arguments:
-h, --help show this help message and exit
```

## Training

### Step 1: Dataset Preparation

Download and extract the [LJSpeech](https://keithito.com/LJ-Speech-Dataset/) dataset. The training script expects the following tree structure for the dataset directory:

```
└───wavs
├───dev
│ ├───LJ001-0001.wav
│ ├───...
│ └───LJ050-0278.wav
└───train
├───LJ002-0332.wav
├───...
└───LJ047-0007.wav
```

The `train` and `dev` directories should contain the training and validation splits respectively. The splits used for the paper can be found [here](https://github.com/bshall/hifigan/releases/tag/v0.1).
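
Because the files are found by directory rather than by split list, it can help to check the layout before training. The following is a minimal sketch and not part of the repository: `count_split_files` is a hypothetical helper, and it assumes the `wavs/train` and `wavs/dev` layout shown above.

```python
from pathlib import Path


def count_split_files(dataset_dir: Path, suffix: str = ".wav") -> dict:
    """Count audio files under wavs/train and wavs/dev (hypothetical helper)."""
    counts = {}
    for split in ("train", "dev"):
        split_dir = Path(dataset_dir) / "wavs" / split
        # A missing split directory is reported as zero files rather than raising.
        counts[split] = (
            sum(1 for _ in split_dir.rglob(f"*{suffix}")) if split_dir.is_dir() else 0
        )
    return counts
```

A count of zero for either split suggests the files were extracted to a different location than the training script expects.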

### Step 2: Resample the Audio

**Step 1**: Download and extract the [LJ-Speech dataset](https://keithito.com/LJ-Speech-Dataset/)
Resample the audio to 16kHz using the `resample.py` script:

**Step 2**: Resample the audio to 16kHz:
```
usage: resample.py [-h] [--sample-rate SAMPLE_RATE] in-dir out-dir
Resample an audio dataset.
positional arguments:
in-dir path to the dataset directory
out-dir path to the output directory
in-dir path to the dataset directory.
out-dir path to the output directory.
optional arguments:
-h, --help show this help message and exit
--sample-rate SAMPLE_RATE
target sample rate (default 16kHz)
```

**Step 3**: Download the dataset splits and move them into the root of the dataset directory.
After steps 2 and 3 your dataset directory should look like this:
for example:

```
LJSpeech-1.1
│ test.txt
│ train.txt
│ validation.txt
├───mels
└───wavs
python resample.py path/to/LJSpeech-1.1/ path/to/LJSpeech-Resampled/
```
Note: the mels directory is optional. If you want to fine-tune HiFi-GAN the mels directory should contain ground-truth aligned spectrograms from an acoustic model.

**Step 4**: Train HiFi-GAN:
### Step 3: Train HiFi-GAN

```
usage: train.py [-h] [--resume RESUME] [--finetune] dataset-dir checkpoint-dir
@@ -70,22 +114,25 @@ optional arguments:
--finetune whether to finetune (note that a resume path must be given)
```

## Generate
To generate using the trained HiFi-GAN models, see [Example Usage](#example-usage) or use the `generate.py` script:
## Links

```
usage: generate.py [-h] [--model-name {hifigan,hifigan-hubert-soft,hifigan-hubert-discrete}] in-dir out-dir
- [Soft-VC repo](https://github.com/bshall/soft-vc)
- [Soft-VC paper](https://ieeexplore.ieee.org/abstract/document/9746484)
- [HuBERT content encoders](https://github.com/bshall/hubert)
- [Acoustic models](https://github.com/bshall/acoustic-model)

Generate audio for a directory of mel-spectrograms using HiFi-GAN.
## Citation

positional arguments:
in-dir path to directory containing the mel-spectrograms
out-dir path to output directory
If you found this work helpful, please consider citing our paper:

optional arguments:
-h, --help show this help message and exit
--model-name {hifigan,hifigan-hubert-soft,hifigan-hubert-discrete}
available models
```
@inproceedings{
soft-vc-2022,
author={van Niekerk, Benjamin and Carbonneau, Marc-André and Zaïdi, Julian and Baas, Matthew and Seuté, Hugo and Kamper, Herman},
booktitle={ICASSP},
title={A Comparison of Discrete and Soft Speech Units for Improved Voice Conversion},
year={2022}
}
```

## Acknowledgements
20 changes: 9 additions & 11 deletions generate.py
@@ -7,10 +7,9 @@


def generate(args):
args.out_dir.mkdir(exist_ok=True, parents=True)

print("Loading checkpoint")
hifigan = torch.hub.load("bshall/hifigan:main", args.model_name).cuda()
model_name = f"hifigan_hubert_{args.model}" if args.model != "base" else "hifigan"
hifigan = torch.hub.load("bshall/hifigan:main", model_name).cuda()

print(f"Generating audio from {args.in_dir}")
for path in tqdm(list(args.in_dir.rglob("*.npy"))):
@@ -29,24 +28,23 @@ def generate(args):
parser = argparse.ArgumentParser(
description="Generate audio for a directory of mel-spectrograms using HiFi-GAN."
)
parser.add_argument(
"model",
help="available models (HuBERT-Soft, HuBERT-Discrete, or Base).",
choices=["soft", "discrete", "base"],
)
parser.add_argument(
"in_dir",
metavar="in-dir",
help="path to directory containing the mel-spectrograms",
help="path to input directory containing the mel-spectrograms.",
type=Path,
)
parser.add_argument(
"out_dir",
metavar="out-dir",
help="path to output directory",
help="path to output directory.",
type=Path,
)
parser.add_argument(
"--model-name",
help="available models",
choices=["hifigan", "hifigan_hubert_soft", "hifigan_hubert_discrete"],
default="hifigan_hubert_soft",
)
args = parser.parse_args()

generate(args)
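
The mapping from the new positional `model` argument to the `torch.hub` entrypoint name can be isolated in a small function. This is a sketch based on the expression in the diff above; `resolve_entrypoint` is a hypothetical name and does not exist in the repository.

```python
def resolve_entrypoint(model: str) -> str:
    """Map a CLI model choice to the torch.hub entrypoint name (hypothetical helper)."""
    choices = ("soft", "discrete", "base")
    if model not in choices:
        raise ValueError(f"model must be one of {choices}, got {model!r}")
    # "base" selects the Base model; the other choices select the HuBERT variants.
    return "hifigan" if model == "base" else f"hifigan_hubert_{model}"
```

Keeping the mapping in one place makes it easy to check that every CLI choice corresponds to one of the entrypoint names previously accepted by `--model-name`.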
27 changes: 20 additions & 7 deletions hifigan/dataset.py
@@ -36,18 +36,31 @@ def forward(self, wav):

class MelDataset(Dataset):
def __init__(
self, root, segment_length, sample_rate, hop_length, train=True, finetune=False
self,
root: Path,
segment_length: int,
sample_rate: int,
hop_length: int,
train: bool = True,
finetune: bool = False,
):
self.root = Path(root)
self.wavs_dir = self.root / "wavs"
self.mels_dir = self.root / "mels"
self.data_dir = self.wavs_dir if not finetune else self.mels_dir

self.segment_length = segment_length
self.sample_rate = sample_rate
self.hop_length = hop_length
self.train = train
self.finetune = finetune

split = "train.txt" if train else "validation.txt"
with open(self.root / split) as file:
self.metadata = [line.strip() for line in file]
suffix = ".wav" if not finetune else ".npy"
pattern = f"train/**/*{suffix}" if train else f"dev/**/*{suffix}"

self.metadata = [
path.relative_to(self.data_dir).with_suffix("")
for path in self.data_dir.rglob(pattern)
]

self.logmel = LogMelSpectrogram()

@@ -56,7 +69,7 @@ def __len__(self):

def __getitem__(self, index):
path = self.metadata[index]
wav_path = self.root / "wavs" / path
wav_path = self.wavs_dir / path

info = torchaudio.info(wav_path.with_suffix(".wav"))
if info.sample_rate != self.sample_rate:
@@ -65,7 +78,7 @@ def __getitem__(self, index):
)

if self.finetune:
mel_path = self.root / "mels" / path
mel_path = self.mels_dir / path
src_logmel = torch.from_numpy(np.load(mel_path.with_suffix(".npy")))
src_logmel = src_logmel.unsqueeze(0)
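
The glob-based discovery that replaces the `train.txt` and `validation.txt` split files can be tried in isolation. This is a minimal sketch, assuming the `wavs/train` and `wavs/dev` layout described in the README; `list_items` is a hypothetical stand-in for the list comprehension in `MelDataset.__init__`, and it interpolates the suffix for both splits.

```python
from pathlib import Path


def list_items(data_dir: Path, train: bool = True, finetune: bool = False) -> list:
    """Return suffix-free paths, relative to data_dir, for one split (hypothetical helper)."""
    data_dir = Path(data_dir)
    suffix = ".npy" if finetune else ".wav"
    split = "train" if train else "dev"
    # The suffix must be interpolated for both splits, so the pattern is built once.
    pattern = f"{split}/**/*{suffix}"
    return sorted(
        path.relative_to(data_dir).with_suffix("")
        for path in data_dir.rglob(pattern)
    )
```

Each returned entry, such as `train/LJ002-0332`, can be joined to either the `wavs` or the `mels` directory and given the matching suffix, which is how `__getitem__` builds `wav_path` and `mel_path`.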

8 changes: 6 additions & 2 deletions resample.py
@@ -40,8 +40,12 @@ def preprocess_dataset(args):

if __name__ == "__main__":
parser = argparse.ArgumentParser(description="Resample an audio dataset.")
parser.add_argument("in-dir", help="path to the dataset directory", type=Path)
parser.add_argument("out-dir", help="path to the output directory", type=Path)
parser.add_argument(
"in_dir", metavar="in-dir", help="path to the dataset directory.", type=Path
)
parser.add_argument(
"out_dir", metavar="out-dir", help="path to the output directory.", type=Path
)
parser.add_argument(
"--sample-rate",
help="target sample rate (default 16kHz)",
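
The switch from `"in-dir"` to `"in_dir"` with `metavar="in-dir"` matters because argparse uses a positional argument's name verbatim as the attribute name, so a hyphenated positional can only be read with `getattr`. The pattern can be sketched with the standard library alone; `build_parser` is a name chosen for illustration.

```python
import argparse
from pathlib import Path


def build_parser() -> argparse.ArgumentParser:
    parser = argparse.ArgumentParser(description="Resample an audio dataset.")
    # dest is "in_dir" (a valid attribute name); the help output still shows "in-dir".
    parser.add_argument("in_dir", metavar="in-dir", type=Path)
    parser.add_argument("out_dir", metavar="out-dir", type=Path)
    return parser


args = build_parser().parse_args(["path/to/in", "path/to/out"])
print(args.in_dir, args.out_dir)
```

The same pattern is applied to `dataset_dir` and `checkpoint_dir` in `train.py`.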
8 changes: 6 additions & 2 deletions train.py
@@ -144,6 +144,8 @@ def train_model(rank, world_size, args):
logger=logger,
finetune=args.finetune,
)
else:
global_step, best_loss = 0, float("inf")

if args.finetune:
global_step, best_loss = 0, float("inf")
@@ -301,12 +303,14 @@ def train_model(rank, world_size, args):
if __name__ == "__main__":
parser = argparse.ArgumentParser(description="Train or finetune HiFi-GAN.")
parser.add_argument(
"dataset-dir",
"dataset_dir",
metavar="dataset-dir",
help="path to the preprocessed data directory",
type=Path,
)
parser.add_argument(
"checkpoint-dir",
"checkpoint_dir",
metavar="checkpoint-dir",
help="path to the checkpoint directory",
type=Path,
)
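
The new `else` branch in the first `train.py` hunk ensures that `global_step` and `best_loss` are defined when training starts from scratch. The sketch below shows the resulting logic under one assumption: that the truncated call above the `else` is a checkpoint load, guarded by a resume check, that returns both values. `initial_state` and its arguments are hypothetical names, not code from `train.py`.

```python
def initial_state(resumed=None, finetune=False):
    """Return (global_step, best_loss) at the start of training (hypothetical helper).

    resumed: a (global_step, best_loss) pair restored from a checkpoint, or None.
    """
    if resumed is not None:
        global_step, best_loss = resumed
    else:
        # Fresh run: there is nothing to restore.
        global_step, best_loss = 0, float("inf")
    if finetune:
        # Fine-tuning restarts the counters even when a checkpoint was loaded.
        global_step, best_loss = 0, float("inf")
    return global_step, best_loss
```

Without the `else` branch, a run started without a resume path would reach the training loop with both names undefined.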
Binary file added vocoder.png