Updated README

bshall · Jul 28, 2022 · 374a456
1 parent e05702e

Showing 6 changed files with 121 additions and 55 deletions.
diff --git a/README.md b/README.md
@@ -1,15 +1,27 @@
+<p align="center">
+    <a target="_blank" href="https://colab.research.google.com/github/bshall/soft-vc/blob/main/soft-vc-demo.ipynb">
+        <img src="https://colab.research.google.com/assets/colab-badge.svg" alt="Open In Colab"/>
+    </a>
+</p>
+
 # HiFi-GAN
 
-An 16kHz implementation of HiFi-GAN for [soft-vc](https://github.com/bshall/soft-vc).
+Training and inference scripts for the vocoder models in [A Comparison of Discrete and Soft Speech Units for Improved Voice Conversion](https://ieeexplore.ieee.org/abstract/document/9746484). For more details, see [soft-vc](https://github.com/bshall/soft-vc). Audio samples can be found [here](https://bshall.github.io/soft-vc/). A Colab demo can be found [here](https://colab.research.google.com/github/bshall/soft-vc/blob/main/soft-vc-demo.ipynb).
 
-Relevant links:
-- [Official HiFi-GAN repo](https://github.com/jik876/hifi-gan)
-- [HiFi-GAN paper](https://arxiv.org/abs/2010.05646)
-- [Soft-VC repo](https://github.com/bshall/soft-vc)
-- [Soft-VC paper]()
+<div align="center">
+    <img width="100%" alt="Soft-VC"
+      src="https://raw.githubusercontent.com/bshall/hifigan/main/vocoder.png">
+</div>
+<div>
+  <sup>
+    <strong>Fig 1:</strong> Architecture of the voice conversion system. a) The <strong>discrete</strong> content encoder clusters audio features to produce a sequence of discrete speech units. b) The <strong>soft</strong> content encoder is trained to predict the discrete units. The acoustic model transforms the discrete/soft speech units into a target spectrogram. The vocoder converts the spectrogram into an audio waveform.
+  </sup>
+</div>
 
 ## Example Usage
 
+### Programmatic Usage
+
 ```python
 import torch
 import numpy as np
@@ -22,39 +34,71 @@ mel = torch.from_numpy(np.load("path/to/mel")).unsqueeze(0).cuda()
 wav, sr = hifigan.generate(mel)
 ```
 
-## Train
+### Script-Based Usage
+
+```
+usage: generate.py [-h] {soft,discrete,base} in-dir out-dir
+
+Generate audio for a directory of mel-spectrograms using HiFi-GAN.
+
+positional arguments:
+  {soft,discrete,base}  available models (HuBERT-Soft, HuBERT-Discrete, or
+                        Base).
+  in-dir                path to input directory containing the mel-
+                        spectrograms.
+  out-dir               path to output directory.
+
+optional arguments:
+  -h, --help            show this help message and exit
+```
+
+## Training
+
+### Step 1: Dataset Preparation
+
+Download and extract the [LJSpeech](https://keithito.com/LJ-Speech-Dataset/) dataset. The training script expects the following tree structure for the dataset directory:
+
+```
+└───wavs
+    ├───dev
+    │   ├───LJ001-0001.wav
+    │   ├───...
+    │   └───LJ050-0278.wav
+    └───train
+        ├───LJ002-0332.wav
+        ├───...
+        └───LJ047-0007.wav
+```
+
+The `train` and `dev` directories should contain the training and validation splits respectively. The splits used for the paper can be found [here](https://github.com/bshall/hifigan/releases/tag/v0.1).
+
+### Step 2: Resample the Audio
 
-**Step 1**: Download and extract the [LJ-Speech dataset](https://keithito.com/LJ-Speech-Dataset/)
+Resample the audio to 16kHz using the `resample.py` script:
 
-**Step 2**: Resample the audio to 16kHz:
 ```
 usage: resample.py [-h] [--sample-rate SAMPLE_RATE] in-dir out-dir
 
 Resample an audio dataset.
 
 positional arguments:
-  in-dir                path to the dataset directory
-  out-dir               path to the output directory
+  in-dir                path to the dataset directory.
+  out-dir               path to the output directory.
 
 optional arguments:
   -h, --help            show this help message and exit
   --sample-rate SAMPLE_RATE
                         target sample rate (default 16kHz)
 ```
 
-**Step 3**: Download the dataset splits and move them into the root of the dataset directory.
-After steps 2 and 3 your dataset directory should look like this:
+For example:
+
 ```
-LJSpeech-1.1
-│   test.txt
-│   train.txt
-│   validation.txt
-├───mels
-└───wavs
+python resample.py path/to/LJSpeech-1.1/ path/to/LJSpeech-Resampled/
 ```
-Note: the mels directory is optional. If you want to fine-tune HiFi-GAN the mels directory should contain ground-truth aligned spectrograms from an acoustic model.
 
-**Step 4**: Train HiFi-GAN:
+### Step 3: Train HiFi-GAN
+
 ```
 usage: train.py [-h] [--resume RESUME] [--finetune] dataset-dir checkpoint-dir
 
@@ -70,22 +114,25 @@ optional arguments:
   --finetune       whether to finetune (note that a resume path must be given)
 ```
 
-## Generate
-To generate using the trained HiFi-GAN models, see [Example Usage](#example-usage) or use the `generate.py` script:
+## Links
 
-```
-usage: generate.py [-h] [--model-name {hifigan,hifigan-hubert-soft,hifigan-hubert-discrete}] in-dir out-dir
+- [Soft-VC repo](https://github.com/bshall/soft-vc)
+- [Soft-VC paper](https://ieeexplore.ieee.org/abstract/document/9746484)
+- [HuBERT content encoders](https://github.com/bshall/hubert)
+- [Acoustic models](https://github.com/bshall/acoustic-model)
 
-Generate audio for a directory of mel-spectrogams using HiFi-GAN.
+## Citation
 
-positional arguments:
-  in-dir                path to directory containing the mel-spectrograms
-  out-dir               path to output directory
+If you found this work helpful, please consider citing our paper:
 
-optional arguments:
-  -h, --help            show this help message and exit
-  --model-name {hifigan,hifigan-hubert-soft,hifigan-hubert-discrete}
-                        available models
+```
+@inproceedings{
+    soft-vc-2022,
+    author={van Niekerk, Benjamin and Carbonneau, Marc-André and Zaïdi, Julian and Baas, Matthew and Seuté, Hugo and Kamper, Herman},
+    booktitle={ICASSP}, 
+    title={A Comparison of Discrete and Soft Speech Units for Improved Voice Conversion}, 
+    year={2022}
+}
 ```
 
 ## Acknowledgements
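
Note on the dataset layout: the `wavs/train` and `wavs/dev` tree described in Step 1 could be produced with a short script along these lines. This is only a sketch and not part of the commit. The helper `arrange_split` is hypothetical, and it assumes split files with one utterance stem per line (modeled on the `train.txt` / `validation.txt` lists this commit removes); the format of the released splits may differ.

```python
import shutil
from pathlib import Path


def arrange_split(wav_dir: Path, split_file: Path, out_dir: Path) -> int:
    """Copy the wavs named in split_file (one stem per line) into out_dir.

    Hypothetical helper: returns the number of files copied and silently
    skips stems that have no matching .wav file in wav_dir.
    """
    out_dir.mkdir(parents=True, exist_ok=True)
    copied = 0
    for line in split_file.read_text().splitlines():
        stem = line.strip()
        if not stem:
            continue  # ignore blank lines in the split file
        src = wav_dir / f"{stem}.wav"
        if src.exists():
            shutil.copy2(src, out_dir / src.name)
            copied += 1
    return copied
```

Calling it once per split, with `out_dir` set to `<dataset-dir>/wavs/train` and then `<dataset-dir>/wavs/dev`, would give the tree that `MelDataset` globs over.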

diff --git a/generate.py b/generate.py
@@ -7,10 +7,9 @@
 
 
 def generate(args):
-    args.out_dir.mkdir(exist_ok=True, parents=True)
-
     print("Loading checkpoint")
-    hifigan = torch.hub.load("bshall/hifigan:main", args.model_name).cuda()
+    model_name = f"hifigan_hubert_{args.model}" if args.model != "base" else "hifigan"
+    hifigan = torch.hub.load("bshall/hifigan:main", model_name).cuda()
 
     print(f"Generating audio from {args.in_dir}")
     for path in tqdm(list(args.in_dir.rglob("*.npy"))):
@@ -29,24 +28,23 @@ def generate(args):
     parser = argparse.ArgumentParser(
         description="Generate audio for a directory of mel-spectrograms using HiFi-GAN."
     )
+    parser.add_argument(
+        "model",
+        help="available models (HuBERT-Soft, HuBERT-Discrete, or Base).",
+        choices=["soft", "discrete", "base"],
+    )
     parser.add_argument(
         "in_dir",
         metavar="in-dir",
-        help="path to directory containing the mel-spectrograms",
+        help="path to input directory containing the mel-spectrograms.",
         type=Path,
     )
     parser.add_argument(
         "out_dir",
         metavar="out-dir",
-        help="path to output directory",
+        help="path to output directory.",
         type=Path,
     )
-    parser.add_argument(
-        "--model-name",
-        help="available models",
-        choices=["hifigan", "hifigan_hubert_soft", "hifigan_hubert_discrete"],
-        default="hifigan_hubert_soft",
-    )
     args = parser.parse_args()
 
     generate(args)
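
Note on the new `model` argument: the mapping from the short CLI choice to the `torch.hub` entry point can be checked in isolation. The sketch below reuses the expression from the diff; the function name `resolve_model_name` and the stripped-down parser are illustrative assumptions, not code from the repository.

```python
import argparse


def resolve_model_name(model: str) -> str:
    """Map the CLI choice to the hub entry point name, as in generate.py."""
    return f"hifigan_hubert_{model}" if model != "base" else "hifigan"


def build_parser() -> argparse.ArgumentParser:
    # Reduced parser: only the new required positional is reproduced here.
    parser = argparse.ArgumentParser(description="model-selection sketch")
    parser.add_argument("model", choices=["soft", "discrete", "base"])
    return parser


args = build_parser().parse_args(["soft"])
print(resolve_model_name(args.model))  # prints "hifigan_hubert_soft"
```

One consequence worth noting: `--model-name` previously defaulted to `hifigan_hubert_soft`, whereas `model` is now a required positional, so existing invocations that relied on the default need the extra argument.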
diff --git a/hifigan/dataset.py b/hifigan/dataset.py
@@ -36,18 +36,31 @@ def forward(self, wav):
 
 class MelDataset(Dataset):
     def __init__(
-        self, root, segment_length, sample_rate, hop_length, train=True, finetune=False
+        self,
+        root: Path,
+        segment_length: int,
+        sample_rate: int,
+        hop_length: int,
+        train: bool = True,
+        finetune: bool = False,
     ):
-        self.root = Path(root)
+        self.wavs_dir = root / "wavs"
+        self.mels_dir = root / "mels"
+        self.data_dir = self.wavs_dir if not finetune else self.mels_dir
+
         self.segment_length = segment_length
         self.sample_rate = sample_rate
         self.hop_length = hop_length
         self.train = train
         self.finetune = finetune
 
-        split = "train.txt" if train else "validation.txt"
-        with open(self.root / split) as file:
-            self.metadata = [line.strip() for line in file]
+        suffix = ".wav" if not finetune else ".npy"
+        pattern = f"train/**/*{suffix}" if train else f"dev/**/*{suffix}"
+
+        self.metadata = [
+            path.relative_to(self.data_dir).with_suffix("")
+            for path in self.data_dir.rglob(pattern)
+        ]
 
         self.logmel = LogMelSpectrogram()
 
@@ -56,7 +69,7 @@ def __len__(self):
 
     def __getitem__(self, index):
         path = self.metadata[index]
-        wav_path = self.root / "wavs" / path
+        wav_path = self.wavs_dir / path
 
         info = torchaudio.info(wav_path.with_suffix(".wav"))
         if info.sample_rate != self.sample_rate:
@@ -65,7 +78,7 @@ def __getitem__(self, index):
             )
 
         if self.finetune:
-            mel_path = self.root / "mels" / path
+            mel_path = self.mels_dir / path
             src_logmel = torch.from_numpy(np.load(mel_path.with_suffix(".npy")))
             src_logmel = src_logmel.unsqueeze(0)
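
Note on the split discovery: the glob-based lookup that replaces the split files can be exercised without torch. The sketch below is a standalone approximation of the logic in `MelDataset.__init__`; the function name `list_utterances` is hypothetical, and both branches of the pattern are f-strings so that the suffix is interpolated for the `dev` split as well.

```python
from pathlib import Path
from typing import List


def list_utterances(data_dir: Path, train: bool, finetune: bool = False) -> List[Path]:
    """Return split-relative utterance paths without a file suffix.

    Mirrors the dataset logic: look for .wav files normally and for .npy
    files when fine-tuning, under train/ or dev/.
    """
    suffix = ".npy" if finetune else ".wav"
    split = "train" if train else "dev"
    pattern = f"{split}/**/*{suffix}"
    # A set guards against duplicate matches; sorting gives a stable order.
    return sorted(
        {
            path.relative_to(data_dir).with_suffix("")
            for path in data_dir.rglob(pattern)
        }
    )
```

Each returned path, such as `train/LJ002-0332`, can then be joined onto `wavs_dir` or `mels_dir` and given the appropriate suffix, which is what `__getitem__` does.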
 

diff --git a/resample.py b/resample.py
@@ -40,8 +40,12 @@ def preprocess_dataset(args):
 
 if __name__ == "__main__":
     parser = argparse.ArgumentParser(description="Resample an audio dataset.")
-    parser.add_argument("in-dir", help="path to the dataset directory", type=Path)
-    parser.add_argument("out-dir", help="path to the output directory", type=Path)
+    parser.add_argument(
+        "in_dir", metavar="in-dir", help="path to the dataset directory.", type=Path
+    )
+    parser.add_argument(
+        "out_dir", metavar="out-dir", help="path to the output directory.", type=Path
+    )
     parser.add_argument(
         "--sample-rate",
         help="target sample rate (default 16kHz)",
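
Note on the `metavar` change: for positional arguments, argparse uses the name as the attribute name unchanged, so a hyphenated positional cannot be read with ordinary attribute access. The sketch below contrasts the two declarations using throwaway parsers; it is an illustration, not the repository's code.

```python
import argparse
from pathlib import Path

# Before: the positional's name doubles as its dest, hyphen included,
# so the value is only reachable through getattr(args, "in-dir").
old = argparse.ArgumentParser()
old.add_argument("in-dir", type=Path)
old_args = old.parse_args(["data"])
print(hasattr(old_args, "in_dir"))
print(getattr(old_args, "in-dir"))

# After: an underscore dest for the code, a hyphenated metavar for --help.
new = argparse.ArgumentParser()
new.add_argument("in_dir", metavar="in-dir", type=Path)
new_args = new.parse_args(["data"])
print(new_args.in_dir)
```

The same pattern is applied to `dataset-dir` and `checkpoint-dir` in `train.py`.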

diff --git a/train.py b/train.py
@@ -144,6 +144,8 @@ def train_model(rank, world_size, args):
             logger=logger,
             finetune=args.finetune,
         )
+    else:
+        global_step, best_loss = 0, float("inf")
 
     if args.finetune:
         global_step, best_loss = 0, float("inf")
@@ -301,12 +303,14 @@ def train_model(rank, world_size, args):
 if __name__ == "__main__":
     parser = argparse.ArgumentParser(description="Train or finetune HiFi-GAN.")
     parser.add_argument(
-        "dataset-dir",
+        "dataset_dir",
+        metavar="dataset-dir",
         help="path to the preprocessed data directory",
         type=Path,
     )
     parser.add_argument(
-        "checkpoint-dir",
+        "checkpoint_dir",
+        metavar="checkpoint-dir",
         help="path to the checkpoint directory",
         type=Path,
     )
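
Note on the new `else` branch: without it, a run that neither resumes nor fine-tunes would reach the training loop with `global_step` and `best_loss` unbound. The sketch below reproduces the control flow under two assumptions that the diff does not show: that the preceding condition tests whether a resume path was given, and that the loader returns a step and a loss. `load_checkpoint_stub` and `init_training_state` are hypothetical names.

```python
from typing import Optional, Tuple


def load_checkpoint_stub(path: str) -> Tuple[int, float]:
    """Hypothetical stand-in for the repository's checkpoint loader."""
    return 1000, 0.5


def init_training_state(resume: Optional[str], finetune: bool) -> Tuple[int, float]:
    if resume is not None:
        global_step, best_loss = load_checkpoint_stub(resume)
    else:
        # Added by this commit: a fresh run starts from step 0.
        global_step, best_loss = 0, float("inf")

    if finetune:
        # Fine-tuning restarts the counters even when a checkpoint is loaded.
        global_step, best_loss = 0, float("inf")

    return global_step, best_loss
```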

diff --git a/vocoder.png b/vocoder.png