AuraVox is a virtual instrument built in C++ that performs real-time Timbre Transfer.
This project is part of the thesis From Voice to Virtuosity: DDSP-based Timbre Transfer, presented at Universitat Pompeu Fabra, Barcelona, in July 2024.
More information can be found at:
- Original Thesis Publication: From Voice to Virtuosity: DDSP-based Timbre Transfer
- Online Supplement: Reki Sounds [Password: auravox]
AuraVox comprises two main sections:
- Load Section: Users can load files or drag and drop the input audio into the target audio player. Upon loading a valid file, the waveform is displayed automatically along with the file name, and the audio can be played back using the target audio Play/Pause button.
- Studio Section: After selecting the desired audio for timbre transfer, users can choose from 7 different TensorFlow Lite instrument models by clicking on a model in the Studio Section. Once an instrument model is selected, AuraVox runs the TensorFlow Lite inference pipeline internally on a separate thread (see the sketch after this list). The Synthesized Audio Player then displays the converted output file, which can be played using the synthesized audio Play/Pause button. Finally, users can drag and drop the timbre-transferred output file directly into their Digital Audio Workstation to continue working on their session.
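The off-thread inference pattern might look roughly like the minimal sketch below. The class name `InferenceRunner`, the `std::thread` usage, and the `runModel` callback are illustrative placeholders, not AuraVox's actual implementation:

```cpp
#include <atomic>
#include <thread>
#include <vector>

// Illustrative off-thread inference: the UI/audio thread hands the target audio
// to a worker thread, which runs the instrument model and publishes the result.
class InferenceRunner
{
public:
    // Hypothetical model call; in practice this would drive the TensorFlow Lite
    // interpreter for the selected instrument model.
    using InferenceFn = std::vector<float> (*)(const std::vector<float>& input);

    ~InferenceRunner()
    {
        if (worker.joinable())
            worker.join();
    }

    void start(InferenceFn runModel, std::vector<float> targetAudio)
    {
        finished.store(false);
        worker = std::thread([this, runModel, audio = std::move(targetAudio)]()
        {
            result = runModel(audio);  // heavy work happens off the UI/audio thread
            finished.store(true);      // the UI polls this flag to update the synthesized player
        });
    }

    bool isFinished() const { return finished.load(); }

    std::vector<float> takeResult()
    {
        if (worker.joinable())
            worker.join();
        return std::move(result);
    }

private:
    std::thread worker;
    std::atomic<bool> finished { false };
    std::vector<float> result;
};
```

Keeping inference on its own thread means the interface stays responsive while the model processes the file.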
Demo video: TFG.-.AuraVox.Final.Video.mp4
The following pipeline architecture diagram illustrates the system's structure.
Integrating the decoder posed an additional challenge due to the limited selection of built-in operators in TensorFlow Lite. Consequently, a much smaller GRU recurrent neural network was implemented, with its state stored natively in C++. This optimization significantly reduced the TensorFlow Lite binary size from 150 MB to 7 MB. Loudness computation relies on the Root Mean Square (RMS) of the input signal. During the synthesis phase, the instrument model generates the parameters for the harmonic and filtered-noise synthesizers: 60 harmonic components for additive synthesis and 65 noise magnitudes for filtering white noise.
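For illustration, the RMS loudness feature and a naive version of the additive (harmonic) synthesis could be written as follows. This is a simplified sketch, not AuraVox's actual code: the real synthesizer interpolates parameters between frames and also filters white noise with the 65 predicted noise magnitudes, which is omitted here:

```cpp
#include <algorithm>
#include <cmath>
#include <cstddef>
#include <vector>

// RMS of one analysis frame, used as the loudness feature fed to the model.
float computeRms(const std::vector<float>& frame)
{
    double sumSquares = 0.0;
    for (float s : frame)
        sumSquares += static_cast<double>(s) * s;
    const std::size_t n = std::max<std::size_t>(frame.size(), 1);
    return static_cast<float>(std::sqrt(sumSquares / static_cast<double>(n)));
}

// Naive additive synthesis of one frame: a sum of 60 sinusoidal harmonics of f0,
// each scaled by the amplitude predicted by the instrument model. Harmonics at or
// above the Nyquist frequency are skipped to avoid aliasing.
std::vector<float> synthesizeHarmonicFrame(const std::vector<float>& harmonicAmps, // 60 values
                                           float f0Hz, float sampleRate, int numSamples)
{
    constexpr float twoPi = 6.283185307179586f;
    std::vector<float> out(static_cast<std::size_t>(numSamples), 0.0f);

    for (std::size_t k = 0; k < harmonicAmps.size(); ++k)
    {
        const float freq = f0Hz * static_cast<float>(k + 1);
        if (freq >= 0.5f * sampleRate)
            break;
        for (int n = 0; n < numSamples; ++n)
            out[static_cast<std::size_t>(n)] +=
                harmonicAmps[k] * std::sin(twoPi * freq * static_cast<float>(n) / sampleRate);
    }
    return out;
}
```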
One of the primary implementation challenges has been the variability in frame rates, stemming from differing user block sizes and sample rates, alongside the model's fixed input size (64 ms), hop size (20 ms), and sample rate (16 kHz), i.e. 1024-sample frames advancing by 320 samples at 16 kHz. This disparity is addressed through resampling, FIFOs at the input and output stages, and running inference on a separate thread. Given that the pretrained models are trained at 16 kHz, input audio is downsampled to this rate before inference. The synthesized audio is then upsampled back to the original user sample rate.
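A minimal sketch of the input-side buffering is shown below. The class `InputFrameFifo` is hypothetical; the real plugin additionally resamples to 16 kHz before this stage and mirrors the buffering on the output side:

```cpp
#include <cstddef>
#include <deque>
#include <vector>

// Collects audio (already resampled to 16 kHz) and emits fixed-size model frames:
// 1024 samples (64 ms) per frame, advancing by 320 samples (20 ms), regardless of
// the host's block size.
class InputFrameFifo
{
public:
    static constexpr std::size_t frameSize = 1024; // 64 ms at 16 kHz
    static constexpr std::size_t hopSize   = 320;  // 20 ms at 16 kHz

    void push(const float* samples, std::size_t numSamples)
    {
        fifo.insert(fifo.end(), samples, samples + numSamples);
    }

    // Returns true and fills 'frame' when a full 64 ms frame is available, then
    // discards one hop so successive frames overlap as the model expects.
    bool popFrame(std::vector<float>& frame)
    {
        if (fifo.size() < frameSize)
            return false;

        frame.assign(fifo.begin(), fifo.begin() + frameSize);
        fifo.erase(fifo.begin(), fifo.begin() + hopSize);
        return true;
    }

private:
    std::deque<float> fifo;
};
```

Because frames are produced only when enough samples have accumulated, the host can deliver any block size it likes while the model always sees the frame and hop lengths it was trained with.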