
AURAVOX

AuraVox is a virtual instrument built in C++ that performs real-time Timbre Transfer.

The aim of AuraVox is to elevate vocal expressiveness by merging the organic qualities of the human voice with the acoustic properties of instruments. This fusion opens up new possibilities, preserving the organic quality of human expression and articulation in the resulting instrument sound.

This project is part of the thesis From Voice to Virtuosity: DDSP-based Timbre Transfer, presented at Universitat Pompeu Fabra, Barcelona, in July 2024.

More information:

  • Original thesis publication: From Voice to Virtuosity: DDSP-based Timbre Transfer

  • Online supplement: Reki Sounds [Password: auravox]

GUI Overview

[Figure: annotated overview of the AuraVox GUI]

AuraVox comprises two main sections:

  • Load Section: Users can load a file or drag and drop input audio into the target audio player. Once a valid file is loaded, its waveform and name are displayed automatically, and it can be played back using the target audio Play/Pause button.

  • Studio Section: After selecting the desired audio for timbre transfer, users can choose from seven TensorFlow Lite instrument models by clicking on a model in the Studio Section. Once an instrument model is selected, AuraVox runs the TensorFlow inference pipeline internally on a separate thread (a minimal sketch of this hand-off follows this list). The Synthesized Audio Player then displays the converted output file, which can be played using the synthesized audio Play/Pause button. Finally, users can drag and drop the timbre-transferred output file directly into their Digital Audio Workstation to continue working on their session.
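For illustration, here is a minimal sketch of how model selection might hand inference off to a worker thread so the GUI and audio threads stay responsive. The class and function names (`InferenceEngine`, `runTimbreTransfer`, `publishResult`) are hypothetical and not taken from the AuraVox codebase:

```cpp
#include <atomic>
#include <string>
#include <thread>
#include <utility>
#include <vector>

// Hypothetical sketch: hand timbre-transfer inference off to a worker
// thread when the user selects an instrument model, so the GUI and the
// real-time audio callback never block on TensorFlow.
class InferenceEngine {
public:
    void onModelSelected(const std::string& modelPath,
                         std::vector<float> inputAudio) {
        if (running.exchange(true))
            return;  // ignore clicks while a transfer is already in flight
        // Note: real code must ensure `this` outlives the detached thread.
        std::thread([this, modelPath,
                     input = std::move(inputAudio)]() {
            // Placeholder for the real TensorFlow Lite pipeline.
            std::vector<float> output = runTimbreTransfer(modelPath, input);
            publishResult(std::move(output));  // e.g. update the player UI
            running = false;
        }).detach();
    }

private:
    std::vector<float> runTimbreTransfer(const std::string&,
                                         const std::vector<float>& in) {
        return in;  // stub: the actual model inference would go here
    }
    void publishResult(std::vector<float>) {}  // stub

    std::atomic<bool> running{false};
};
```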

AuraVox is designed with minimalism in mind, featuring the fewest possible controls to avoid distracting users with parameter tweaking. This simplicity ensures seamless integration into users' workflows, allowing quick and effortless use of the plugin.

AuraVox Demonstration:

[Video: AuraVox final demonstration (TFG.-.AuraVox.Final.Video.mp4)]

Methodology

Initially, the model architecture developed during the prototyping phase had to be translated into an audio plugin architecture. This was achieved by integrating the models into the C++ codebase using the TensorFlow C API, together with the corresponding CUDA kernels and backward-pass implementations. For model inference, all TensorFlow computations are executed on a separate thread using TensorFlow Lite, whose optimizations help prevent buffer underruns in the main audio processing thread.
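As a rough sketch of what this looks like with the TensorFlow Lite C++ API, the snippet below loads a .tflite model and runs a single inference pass. In the actual plugin the model and interpreter would be built once and reused across frames; `runModel` and `outputSize` are illustrative names rather than AuraVox code:

```cpp
#include <algorithm>
#include <memory>
#include <vector>

#include "tensorflow/lite/interpreter.h"
#include "tensorflow/lite/kernels/register.h"
#include "tensorflow/lite/model.h"

// Load a .tflite instrument model and run one inference pass.
// All of this happens on the dedicated inference thread, never inside
// the real-time audio callback.
std::vector<float> runModel(const char* modelPath,
                            const std::vector<float>& input,
                            int outputSize) {
    auto model = tflite::FlatBufferModel::BuildFromFile(modelPath);
    tflite::ops::builtin::BuiltinOpResolver resolver;
    std::unique_ptr<tflite::Interpreter> interpreter;
    tflite::InterpreterBuilder(*model, resolver)(&interpreter);
    interpreter->AllocateTensors();

    // Copy the fixed-size 16 kHz frame into the input tensor.
    float* in = interpreter->typed_input_tensor<float>(0);
    std::copy(input.begin(), input.end(), in);

    interpreter->Invoke();

    const float* out = interpreter->typed_output_tensor<float>(0);
    return std::vector<float>(out, out + outputSize);
}
```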

The following pipeline architecture diagram illustrates the system's structure.

[Figure: AuraVox pipeline block diagram]

As outlined in the proposed timbre transfer model architecture, the CREPE Large model serves as the pitch-tracking network used to extract the ground-truth fundamental frequency. However, due to constraints within the audio plugin architecture, a smaller CREPE model, known as CREPE Micro, with approximately 160k parameters, is employed in this implementation. It is trained to predict the logits of CREPE Large, which contains roughly 137 times more parameters.
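For reference, converting a frame of CREPE logits to a fundamental frequency is straightforward. The sketch below uses the bin-to-cents mapping from the public CREPE reference implementation (360 bins in 20-cent steps, offset 1997.3794 cents relative to 10 Hz) and takes a simple argmax; CREPE itself refines the estimate with a weighted average around the peak, which is omitted here:

```cpp
#include <algorithm>
#include <cmath>
#include <iterator>
#include <vector>

// Convert one frame of CREPE pitch logits (360 bins) to an f0 estimate
// in Hz. The bin-to-cents constants follow the public CREPE reference
// implementation: 20-cent bins starting 1997.3794 cents above 10 Hz.
float logitsToF0(const std::vector<float>& logits) {
    auto peak = std::max_element(logits.begin(), logits.end());
    int bin = static_cast<int>(std::distance(logits.begin(), peak));
    double cents = 1997.3794084376191 + 20.0 * bin;
    return static_cast<float>(10.0 * std::pow(2.0, cents / 1200.0));
}
```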

Integrating the decoder posed an additional challenge due to the limited selection of built-in operators in TensorFlow Lite. Consequently, a much smaller GRU recurrent neural network was implemented, with its state stored natively in C++. This optimization reduced the TensorFlow Lite binary size from 150 MB to 7 MB. Loudness is computed as the Root Mean Square (RMS) of the input signal. During the synthesis phase, the instrument model generates the parameters for the harmonic and filtered-noise synthesizers: 60 harmonic amplitudes for additive synthesis and 65 noise magnitudes for filtering white noise.
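A minimal sketch of the harmonic half of such a DDSP-style synthesizer is shown below: it renders one audio block as a sum of 60 sinusoids at integer multiples of f0, muting harmonics above Nyquist. It assumes the per-harmonic amplitudes are constant over the block, whereas the real synthesizer would interpolate parameters per sample; the function and parameter names are illustrative:

```cpp
#include <cmath>
#include <vector>

// DDSP-style additive synthesis sketch: render one block of audio from a
// fundamental frequency and per-harmonic amplitudes (e.g. 60 of them),
// accumulating phase across blocks so they join smoothly.
void renderHarmonics(float f0Hz, const std::vector<float>& harmonicAmps,
                     float sampleRate, std::vector<float>& outBlock,
                     double& phase /* radians, persists across blocks */) {
    constexpr double twoPi = 6.283185307179586;
    const double phaseInc = twoPi * f0Hz / sampleRate;
    for (float& sample : outBlock) {
        double s = 0.0;
        for (size_t k = 0; k < harmonicAmps.size(); ++k) {
            double harmonicFreq = f0Hz * (k + 1);
            if (harmonicFreq < sampleRate * 0.5)  // mute aliasing partials
                s += harmonicAmps[k] * std::sin(phase * (k + 1));
        }
        sample = static_cast<float>(s);
        phase = std::fmod(phase + phaseInc, twoPi);
    }
}
```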

One of the primary implementation challenges has been the variability in frame rates, stemming from differing user block sizes and sample rates, alongside a fixed model input size (64 ms), hop size (20 ms), and sample rate (16 kHz). This disparity is addressed through resampling, FIFOs at the input and output stages, and running inference on a separate thread. Because the pretrained models operate at 16 kHz, input audio is downsampled to this rate before inference, and the synthesized audio is then upsampled back to the user's original sample rate.
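At 16 kHz, the fixed model geometry works out to 1024-sample frames (64 ms) advanced in 320-sample hops (20 ms). The sketch below illustrates the idea behind the input-stage FIFO; `FrameFifo` is a hypothetical name, not the plugin's actual class, and the output side would mirror the same buffering scheme:

```cpp
#include <cstddef>
#include <deque>
#include <vector>

// Sketch of the input-stage FIFO: 16 kHz samples accumulate until a full
// 64 ms model frame (1024 samples) is available, then the window advances
// by the 20 ms hop (320 samples), leaving the overlap in the buffer.
class FrameFifo {
public:
    static constexpr size_t kFrameSize = 1024;  // 64 ms at 16 kHz
    static constexpr size_t kHopSize   = 320;   // 20 ms at 16 kHz

    void push(const float* samples, size_t count) {
        fifo.insert(fifo.end(), samples, samples + count);
    }

    // Returns true and fills `frame` once enough samples have accumulated.
    bool popFrame(std::vector<float>& frame) {
        if (fifo.size() < kFrameSize)
            return false;
        frame.assign(fifo.begin(), fifo.begin() + kFrameSize);
        fifo.erase(fifo.begin(), fifo.begin() + kHopSize);  // advance by hop
        return true;
    }

private:
    std::deque<float> fifo;
};
```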

Distribution

We are actively working on making the plugin globally available. This is a complex task because of the number of third-party libraries required to perform timbre transfer effectively, such as the TensorFlow APIs, Ruy, Pasta, FlatBuffers, TFRT, and Protobuf. All of these dependencies are essential for the performance and accuracy of the plugin, and integrating and distributing them across different environments and platforms presents significant challenges. We appreciate your patience as we work towards a solution that enables seamless installation and use of AuraVox everywhere.
If you have any suggestions or would like to contribute to this effort, please feel free to reach out or contribute to the repository.
