Huge thanks to RunDiffusion for supporting this project!
AudioLab is an open-source powerhouse for voice-cloning and audio separation, built with modularity and extensibility in mind. Whether you're an audio engineer, researcher, or just a curious tinkerer, AudioLab has you covered.
- Music Generation: Create music from scratch or remix existing tracks using YuE.
- Song Generation: Create full-length songs with vocals and instrumentals using DiffRhythm.
- Zonos Text-to-Speech: High-quality TTS with deep learning.
- Orpheus TTS: Real-time natural-sounding speech powered by large language models.
- Text-to-Speech: Clone voices and generate natural-sounding speech with Coqui TTS.
- Text-to-Audio: Generate sound effects and ambient audio from text descriptions using Stable Audio.
- Audio Separation: Isolate vocals, drums, bass, and other components from a track.
- Vocal Isolation: Distinguish lead vocals from background.
- Noise Removal: Get rid of echo, crowd noise, and unwanted sounds.
- Voice Cloning: Train high-quality voice models with just 30-60 minutes of data.
- Audio Super Resolution: Enhance and clean up audio.
- Remastering: Apply spectral characteristics from a reference track.
- Audio Conversion: Convert between popular formats effortlessly.
- Export to DAW: Easily create Ableton Live and Reaper projects from separated stems.
- Auto-preprocessing for voice model training.
- Merge separated sources back into a single file with ease.
Before you dive in, make sure you have:
- Python 3.10 – Because match statements exist, and fairseq is allergic to 3.11.
- CUDA 12.4 – Other versions? Maybe fine. Maybe not. Do you like surprises?
- Virtual Environment – Strongly recommended to avoid dependency chaos.
- Windows Users – You're in for an adventure! Zonos/Triton can be a pain. Make sure to install MSVC and add these paths to your environment variables:
  ```
  C:\Program Files (x86)\Microsoft Visual Studio\2022\BuildTools\VC\Tools\MSVC\14.42.34433\bin\Hostx64\x64
  C:\Program Files (x86)\Microsoft Visual Studio\2022\BuildTools\VC\Tools\MSVC\14.42.34433\bin\Hostx86\x86
  ```
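As a sketch, you can append those directories to PATH for the current cmd.exe session as shown below (the 14.42.34433 folder name depends on which MSVC version your Build Tools installed, so adjust it; for a permanent change, use the System Environment Variables dialog instead):

```
rem Extend PATH for the current cmd.exe session only
set "PATH=%PATH%;C:\Program Files (x86)\Microsoft Visual Studio\2022\BuildTools\VC\Tools\MSVC\14.42.34433\bin\Hostx64\x64"
set "PATH=%PATH%;C:\Program Files (x86)\Microsoft Visual Studio\2022\BuildTools\VC\Tools\MSVC\14.42.34433\bin\Hostx86\x86"
```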
Note: This project assumes basic Python knowledge. If you've never set up a virtual environment before... now's the time to learn!
If dependencies refuse to install on Windows, try the following:
- Install MSVC Build Tools:
- Ensure CUDA is correctly installed:
  - Check the version:
    ```
    nvcc --version
    ```
  - Download CUDA 12.4 if it is missing or the wrong version.
- DLL Errors? Try moving the necessary DLLs from `/libs` to:
  ```
  .venv\lib\site-packages\pandas\_libs\window
  .venv\lib\site-packages\sklearn\.libs
  C:\Program Files\Python310\
  ```
  (or wherever your Python is installed)
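As a rough sketch of that DLL fix on Windows (run from the repository root; which DLLs you actually need depends on the specific import error, and whether your environment folder is `venv` or `.venv` depends on how you created it):

```
rem cmd.exe: copy the bundled DLLs next to the modules that fail to import
copy libs\*.dll .venv\lib\site-packages\pandas\_libs\window\
copy libs\*.dll .venv\lib\site-packages\sklearn\.libs\
```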
Heads up! The `requirements.txt` is not complete on purpose. Use the setup scripts instead!
- Clone the repository:
  ```
  git clone https://github.com/yourusername/audiolab.git
  cd audiolab
  ```
- Set up a virtual environment:
  ```
  python -m venv venv
  source venv/bin/activate   # Windows: venv\Scripts\activate
  ```
- Run the setup script:
  ```
  ./setup.sh   # Windows: setup.bat
  ```
Common Issues & Fixes:
- Downgrade `pip` if installation fails:
  ```
  python -m pip install pip==24.0
  ```
- Install older CUDA drivers if needed: CUDA Toolkit Archive
- Install `fairseq` manually if necessary (quoted so the shell doesn't treat `>` as a redirect):
  ```
  pip install "fairseq>=0.12.2" --no-deps
  ```
- Activate your virtual environment:
  ```
  source venv/bin/activate   # Windows: venv\Scripts\activate.bat
  ```
- Run the application:
  ```
  python main.py
  ```
- Optional flags:
  - `--listen` – Bind to `0.0.0.0` for remote access.
  - `--port PORT` – Specify a custom port.
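For example, to expose the UI to other machines on a specific port (7860 is just an illustrative value here, not a documented default):

```
python main.py --listen --port 7860
```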
Generate high-quality sound effects, ambient audio, and musical samples from text descriptions:
- Text Prompting: Create sounds by describing them in natural language.
- Variable Duration: Generate audio up to 47 seconds long.
- Full Control: Adjust parameters like inference steps and guidance scale.
- Negative Prompts: Specify what to avoid in your generated audio.
- Multiple Variations: Generate different versions of the same prompt.
Example prompts:
- "A peaceful forest ambience with birds chirping and leaves rustling"
- "An electronic beat with pulsing bass at 120 BPM"
- "A sci-fi spaceship engine humming"
Create complete songs with vocals and instrumentals using state-of-the-art latent diffusion:
- Complete Songs: Generate full-length songs up to 4m45s.
- Lyrics Support: Add lyrics using LRC format with timestamps (see the example after this list).
- Style Control: Define the musical style using text prompts or reference audio.
- Blazingly Fast: Efficient generation compared to other music models.
- Memory Efficient: Chunked decoding option for consumer GPUs.
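If you haven't used LRC before, it's a plain-text lyrics format that pairs [mm:ss.xx] timestamps with lyric lines; a minimal made-up sketch:

```
[00:10.00] First line of the opening verse
[00:14.50] Second line lands on the next bar
[00:19.25] And the chorus starts right here
```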
Example use cases:
- Create original songs in any genre with your own lyrics
- Generate background music for videos with specific moods
- Experiment with unique musical styles and vocal characteristics
Generate natural-sounding speech with LLM-powered text-to-speech capabilities:
- Real-time Processing: Instantaneous speech generation.
- Voice Cloning: Create custom voice models from your recordings.
- Emotion Control: Adjust speaking style for more expressive speech.
- Multilingual Support: Generate speech in multiple languages.
- Style Variety: Create different styles from a single voice model.
Example applications:
- Create audiobooks with natural narration
- Develop voice assistants with your own voice
- Generate voiceovers for videos and presentations
- Create accessible content for those with reading difficulties
Convert audio recordings to text with speaker identification and precise timing:
- Speaker Diarization: Automatically identify and label different speakers.
- Word-Level Timestamps: Create perfectly aligned text with audio timing.
- Multilingual Support: Transcribe content in multiple languages.
- Batch Processing: Process multiple audio files in sequence.
- Multiple Output Formats: Generate both JSON metadata and readable text (see the sketch after this list).
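The exact schema isn't documented here, but diarized, word-aligned transcription output typically looks something like this hypothetical excerpt (field names and values are illustrative, not a guaranteed AudioLab format):

```json
{
  "segments": [
    {
      "start": 0.42,
      "end": 2.96,
      "speaker": "SPEAKER_00",
      "text": "Welcome back to the show.",
      "words": [
        {"word": "Welcome", "start": 0.42, "end": 0.80},
        {"word": "back", "start": 0.81, "end": 1.02},
        {"word": "to", "start": 1.03, "end": 1.12},
        {"word": "the", "start": 1.13, "end": 1.24},
        {"word": "show.", "start": 1.25, "end": 1.70}
      ]
    }
  ]
}
```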
Example applications:
- Create subtitles for videos with speaker labels
- Transcribe interviews and meetings with speaker attribution
- Generate searchable archives of audio content
- Create training data for voice and speech models
The heart of AudioLab: modular audio processing through a chain of wrappers:
- Separate: Split audio into vocals, drums, bass, and other instruments.
- Clone: Apply voice conversion with trained models.
- Remaster: Enhance audio based on reference tracks.
- Super Resolution: Improve audio detail and clarity.
- Merge: Mix separate audio tracks with complete control.
- Convert: Change audio formats with customizable settings.
Example workflows:
- Extract vocals → Apply voice clone → Merge with original instruments
- Split song → Enhance each component → Remix with new levels
- Remaster old recordings using modern reference tracks
Train custom voice models for voice conversion and cloning:
- One-Click Process: Simplified training with automatic preprocessing.
- Advanced Options: Fine-tune training for specific voice characteristics.
- Training Visualization: Monitor progress in real-time.
- Model Management: Organize and share your trained voice models.
Example applications:
- Create virtual versions of your own voice
- Develop character voices for games or animations
- Restore or enhance historical recordings
AudioLab is powered by some fantastic open-source projects:
- python-audio-separator – Core for audio separation.
- matchering – Professional-grade remastering.
- versatile-audio-super-resolution – High-quality audio enhancement.
- Real-Time-Voice-Cloning – Voice cloning.
- MVSEP-MDX23 – Music separation.
- WhisperX – Audio transcription.
- Coqui TTS – State-of-the-art TTS.
- YuE – Music generation.
- Zonos – High-quality TTS.
- Stable Audio – Text-to-audio generation.
- DiffRhythm – Full-length song generation with latent diffusion.
- Orpheus-TTS – Real-time high-quality text-to-speech.
Want to help? Check out the Contributing Guide!
Licensed under MIT. See LICENSE for details.
Made with ❤️ by the AudioLab team (AKA D8ahazard).