GenSE: Generative Speech Enhancement via Language Models using Hierarchical Modeling
The official implementation of GenSE (ICLR 2025)
We propose a comprehensive framework tailored for language model-based speech enhancement, called GenSE. Speech enhancement is treated as a conditional language modeling task rather than the continuous signal regression problem defined in existing works. This is achieved by tokenizing speech signals into semantic tokens using a pre-trained self-supervised model, and into acoustic tokens using a custom-designed single-quantizer neural codec model.
GenSE employs a hierarchical modeling framework with a two-stage process: an N2S transformation front-end, which converts noisy speech into clean semantic tokens, and an S2S generation back-end, which synthesizes clean speech from both the semantic tokens and the noisy acoustic tokens.
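As a rough illustration of this two-stage flow, the sketch below uses dummy token sequences and placeholder functions (`n2s`, `s2s` are illustrative names, not the repository's API); the real stages are autoregressive language models.

```python
# Illustrative sketch of GenSE's hierarchical pipeline.
# Token values and model logic are dummies standing in for the actual LMs.

def n2s(noisy_semantic_tokens):
    """Stage 1 (N2S): map noisy semantic tokens to clean semantic tokens."""
    # A real model would run autoregressive LM decoding here.
    return list(noisy_semantic_tokens)

def s2s(clean_semantic_tokens, noisy_acoustic_tokens):
    """Stage 2 (S2S): generate clean acoustic tokens conditioned on both
    the clean semantic tokens and the noisy acoustic tokens."""
    # Dummy: emits one acoustic token per semantic token.
    return [hash((s, a)) % 1024
            for s, a in zip(clean_semantic_tokens, noisy_acoustic_tokens)]

# Dummy token sequences "extracted" from a noisy utterance.
noisy_semantic = [12, 7, 43, 99]
noisy_acoustic = [501, 88, 230, 17]

clean_semantic = n2s(noisy_semantic)
clean_acoustic = s2s(clean_semantic, noisy_acoustic)
print(len(clean_acoustic))  # one clean acoustic token per input frame
```

The point of the hierarchy is that stage 1 operates purely in the (noise-robust) semantic space, while stage 2 recovers speaker and acoustic detail from the noisy acoustic tokens.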
- Release inference pipeline
- Release pre-trained model
- Support running in Colab
- More to be added
- PyTorch >= 1.13 and torchaudio >= 0.13
- Install requirements
```shell
conda create -n gense python=3.8
conda activate gense
pip install -r requirements.txt
```
Download the XLSR model and move it to the ckpts dir.
or
Download WavLM Large to run a variant that uses WavLM in place of XLSR.
Download the pre-trained models from Hugging Face; all checkpoints should be stored in the ckpts dir.
```shell
python infer.py run \
    --noisy_path noisy.wav \
    --out_path ./enhanced.wav \
    --config_path configs/gense.yaml
```
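To enhance many files, the same invocation can be assembled programmatically. This helper is a sketch that simply mirrors the flags of the command above:

```python
import subprocess

def build_infer_cmd(noisy_path, out_path, config_path="configs/gense.yaml"):
    """Build the `infer.py run` command line shown above."""
    return [
        "python", "infer.py", "run",
        "--noisy_path", noisy_path,
        "--out_path", out_path,
        "--config_path", config_path,
    ]

cmd = build_infer_cmd("noisy.wav", "./enhanced.wav")
print(" ".join(cmd))
# To actually run it from the repo root:
# subprocess.run(cmd, check=True)
```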
```python
import torchaudio
from components.simcodec.model import SimCodec

# Load the single-quantizer codec from its config and checkpoint.
codec = SimCodec('config.json')
codec.load_ckpt('g_00100000')
codec = codec.eval()
codec = codec.to('cuda')

wav, sr = torchaudio.load('input.wav')  # example path; expects 16 kHz audio
wav = wav.to('cuda')

code = codec(wav)         # encode waveform to acoustic tokens
print(code.shape)         # [B, L1, 1]
syn = codec.decode(code)  # decode tokens back to a waveform
print(syn.shape)          # [B, 1, L2]
torchaudio.save('copy.wav', syn.detach().cpu().squeeze(0), 16000)
```