
voicera_demo.mp4
"Voicera" is a text-to-speech (TTS) model designed for generating speech from written text. It uses a GPT-2 type architecture, which helps in creating natural and expressive speech. The model converts audio into tokens using the "Multi-Scale Neural Audio Codec (SNAC)" model, allowing it to understand and produce speech sounds. Voicera aims to provide clear and understandable speech, focusing on natural pronunciation and intonation. It's a project to explore TTS technology and improve audio output quality.
- Data Preparation: We use a dataset containing text paired with corresponding audio. The audio is tokenized using the Multi-Scale Neural Audio Codec (SNAC) model, which converts it into a sequence of tokens that the model can process (see the first sketch after this list).
- Model Architecture: Voicera uses a transformer-based architecture similar to GPT-2, which is adept at handling sequential data. This architecture allows the model to capture the nuances of language and generate coherent speech (see the second sketch after this list).
- Training: The model is trained on a large dataset of paired text and audio tokens. After each epoch, the model's performance is evaluated to ensure the generated audio improves over time (see the third sketch after this list).
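
The first sketch below shows how the audio tokenization step might look, assuming the publicly available `snac` package and the `hubertsiuzdak/snac_24khz` checkpoint. The frame-wise interleaving used to flatten the multi-scale codes into one token stream is an illustrative assumption, not necessarily the exact layout Voicera uses.

```python
import torch
from snac import SNAC

# Load a pretrained SNAC codec (checkpoint name is an assumption; pick the
# variant that matches your dataset's sample rate).
codec = SNAC.from_pretrained("hubertsiuzdak/snac_24khz").eval()

# Dummy 1-second mono clip at 24 kHz, shape (batch, channels, samples).
audio = torch.randn(1, 1, 24000)

with torch.inference_mode():
    codes = codec.encode(audio)   # list of code tensors, one per time scale
    recon = codec.decode(codes)   # reconstruct audio from the codes

# Flatten the multi-scale codes into a single token sequence for the language
# model. This simple per-frame interleave is an assumption for illustration.
def flatten_codes(codes):
    c1, c2, c3 = [c[0] for c in codes]   # the 24 kHz codec has 3 scales
    seq = []
    for t in range(c1.shape[0]):
        seq += [c1[t], c2[2 * t], c2[2 * t + 1],
                c3[4 * t], c3[4 * t + 1], c3[4 * t + 2], c3[4 * t + 3]]
    return torch.stack(seq)

tokens = flatten_codes(codes)
print(tokens.shape)   # one flat audio-token sequence per clip
```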
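
The second sketch treats the model as a GPT-2 language model whose vocabulary is extended with the SNAC audio tokens plus a few special markers. The sizes and special tokens here are assumptions for illustration, not Voicera's actual configuration.

```python
from transformers import GPT2Config, GPT2LMHeadModel

# Illustrative sizes only; the real hyperparameters may differ.
TEXT_VOCAB = 50257        # standard GPT-2 BPE vocabulary
AUDIO_VOCAB = 3 * 4096    # SNAC codebook entries, one offset block per scale (assumed)
SPECIAL = 3               # e.g. <start_of_audio>, <end_of_audio>, <pad> (assumed)

config = GPT2Config(
    vocab_size=TEXT_VOCAB + AUDIO_VOCAB + SPECIAL,
    n_positions=2048,     # long enough for a text prompt plus its audio tokens
    n_embd=768,
    n_layer=12,
    n_head=12,
)
model = GPT2LMHeadModel(config)

# Text tokens and offset audio tokens share one embedding table, so the model
# predicts audio tokens autoregressively, conditioned on the text prompt.
```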
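
The third sketch shows how training reduces to next-token prediction over the concatenated text + audio token sequence. The dataset class, batch size, and learning rate are placeholders; this is an outline of the loop, not the project's actual training script.

```python
import torch
from torch.utils.data import DataLoader

# Assumes each batch provides "input_ids" laid out as
# [text tokens ... <start_of_audio> audio tokens ... <end_of_audio> pads]
# together with an "attention_mask". The dataset itself is hypothetical.
def train(model, dataset, epochs=10, lr=3e-4, device="cuda"):
    model.to(device).train()
    loader = DataLoader(dataset, batch_size=8, shuffle=True)
    optim = torch.optim.AdamW(model.parameters(), lr=lr)

    for epoch in range(epochs):
        for batch in loader:
            input_ids = batch["input_ids"].to(device)
            mask = batch["attention_mask"].to(device)

            # Ignore padding positions in the loss; GPT2LMHeadModel shifts the
            # labels internally, so this is plain next-token cross-entropy.
            labels = input_ids.masked_fill(mask == 0, -100)
            out = model(input_ids=input_ids, attention_mask=mask, labels=labels)
            out.loss.backward()
            optim.step()
            optim.zero_grad()

        # After each epoch, a sample can be generated and decoded with SNAC
        # to check that the audio is improving (evaluation details omitted).
        print(f"epoch {epoch}: loss {out.loss.item():.4f}")
```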
The video above shows the model's capabilities.
There are three models: the base model and two finetunes on the Jenny and Expresso datasets. The best of the three is currently the Jenny finetune. Here are the Colab links to all three, respectively: