- {Akarsh}
- {Rakesh}
- 26-03-23
- 27-03-23
- 28-03-23
- 29-03-23
- 30-03-23
- 01-04-23
- 02-04-23
- 05-04-23
- 08-04-23
- 09-04-23
- 11-04-23
- 12-04-23
- 13-04-23
- 18-04-23
- 19-04-23
- 23-04-23
- 25-04-23
- 26-04-23
- 28-04-23
- 02-05-23
- 05-05-23
- 22-05-23
- 12-06-23
{Akarsh}
- HindiTTS for dataset creating, annotating and cleaning.
- Aeneas_Extended for alignment using aeneas.
- ESPNET, Coqui for format, pipeline and code reference.
- ReadtheDocs for documentation preparation.
- First understand `logging`, `argparse` and settle on a defined `<input, output, log>` format for any code.
- Follow `pip` packaging standards, docstrings, and consistent tab spacing for coding.
- Define the config file format (separate class -> `yaml` wrapper?); a sketch of the layout is given right after this outline:
  - project details
    - name
    - tag
    - language
  - dataset details (for extracting text: `local/data.sh`)
  - preparing details
    - text:
      - token type
      - cleaner
      - g2p (if any)
    - audio:
      - sampling_rate (fs)
      - trim_silence? (trim_db)
      - signal_norm
  - preprocessing details (TODO)
  - model details
    - name
    - params
  - trainer details (TODO: logging, batch, etc.)
  - output details (path for `output_dir/`)
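A minimal sketch of what such a config file could look like; the field names and values below are illustrative assumptions, not the final GenVox schema:

```yaml
# Illustrative layout only; keys mirror the outline above.
project:
  name: ljspeech_tacotron2
  tag: baseline
  language: en
dataset:
  data_script: local/data.sh
preparing:
  text:
    token_type: char
    cleaner: basic
    g2p: null
  audio:
    fs: 22050
    trim_silence: true
    trim_db: 60
    signal_norm: false
model:
  name: tacotron2
  params: {}
trainer:
  batch_size: 32
output:
  output_dir: exp/
```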
- Learn git branching.
- Decide whether bash is used or not: I prefer using it because it is very quick; inclined to write both bash and Python variants.
- readme md links:
- files:
  - `run.py` (prepare_data, preprocess_data, train_tts)
  - `prepare_data.py`
  - `preprocess_data.py`
  - `make_dataset.py` (dataset creation using raw audio, transcripts or YouTube links)
  - `train_tts.py`
- Guiding idea: run the code with a config path (a sketch follows this list).
- Added base_config, dataset_config (using this and this for value and docstring reference).
- Added dataset_processor for making i2t and i2w (created LJSpeech_{small, sample} for experimentation and testing).
- Writing the entire runner script in `run.py` and will later divide it into individual components.
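A minimal sketch of the "run everything from a config path" idea; `load_yaml` and the commented stage functions are assumptions, not the final `run.py` code:

```python
# run.py -- sketch of running the whole pipeline from a yaml config path.
import argparse
import yaml

def load_yaml(path):
    with open(path) as f:
        return yaml.safe_load(f)

def main():
    parser = argparse.ArgumentParser(description="GenVox-style runner (sketch)")
    parser.add_argument("--config", required=True, help="path to the yaml config")
    args = parser.parse_args()

    config = load_yaml(args.config)
    # The stages referenced in the notes; each would consume its slice of the config.
    # prepare_data(config["dataset"])
    # preprocess_data(config["preparing"])
    # train_tts(config["model"], config["trainer"])
    print(f"Loaded project: {config['project']['name']}")

if __name__ == "__main__":
    main()
```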
{Akarsh}
- Using coqui for coding format, espnet for data format and NVIDIA for model code.
- We are using `nn.ModuleList` instead of `list` because `model.parameters()` won't have the necessary parameters, as it cannot read a regular Python list (source); see the sketch after this list.
- squeeze and unsqueeze in torch tensor (source).
- Assumption: single speaker, single language, single GPU (not distributed).
- For distributed training, read about: workers, jobs, nodes, tasks.
- collate: "collect and combine (texts, information, or data)" (source, CREPE-ref)
- MUST-C dataset (source)
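A quick sketch of why `nn.ModuleList` matters for parameter registration (toy layers, illustrative only):

```python
import torch.nn as nn

class WithList(nn.Module):
    def __init__(self):
        super().__init__()
        # Plain Python list: submodules are NOT registered, so
        # model.parameters() misses them.
        self.layers = [nn.Linear(4, 4) for _ in range(3)]

class WithModuleList(nn.Module):
    def __init__(self):
        super().__init__()
        # nn.ModuleList registers each submodule properly.
        self.layers = nn.ModuleList(nn.Linear(4, 4) for _ in range(3))

print(len(list(WithList().parameters())))        # 0
print(len(list(WithModuleList().parameters())))  # 6 (weight + bias per layer)
```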
{Rakesh}
- Using youtube-dl to download the audio (wav) and transcript (.vtt file) of the video. (Used `pip install git+https://github.com/ytdl-org/youtube-dl.git@master#egg=youtube_dl` as there was a change in YouTube metadata.)
- Converted the data in the vtt file into a csv file formatted similarly to the LJSpeech dataset format.
- The audio is split into smaller segments using the timestamps in the vtt file, and the output wav files are written into a directory (see the sketch below).
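A rough sketch of splitting a wav by .vtt cue timestamps; this is not the actual script, and the ffmpeg invocation and output naming are illustrative assumptions:

```python
import subprocess

def parse_vtt(vtt_path):
    """Yield (start, end, text) tuples from a simple WEBVTT file."""
    cues = []
    with open(vtt_path, encoding="utf-8") as f:
        lines = [line.strip() for line in f]
    for i, line in enumerate(lines):
        if "-->" in line:
            start, end = [t.strip() for t in line.split("-->")]
            # Take the (up to two) caption lines that follow the timestamp line.
            text = " ".join(t for t in lines[i + 1:i + 3] if t and "-->" not in t)
            cues.append((start, end, text))
    return cues

def split_audio(wav_path, cues, out_dir="wavs"):
    for idx, (start, end, _) in enumerate(cues):
        out_path = f"{out_dir}/segment_{idx:05d}.wav"
        # ffmpeg cannot edit in place, so each segment goes to a new file.
        subprocess.run(
            ["ffmpeg", "-y", "-i", wav_path, "-ss", start, "-to", end, out_path],
            check=True,
        )
```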
{Akarsh}
- `self.training` in `nn.Module` is explained here (source).
- Gate prediction is basically stop-token prediction. It is used to stop inference in the decoding stage (source).
- To prevent `Markdown All in One` from auto-creating a `Table of Contents`, add a `<!-- no toc -->` comment above the list.
- `subprocess` in Python to handle `bash` commands (source); see the sketch after this list.
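A small sketch of running a shell command from Python with `subprocess` (the command shown is just an example):

```python
import subprocess

# Capture output and raise if the command fails.
result = subprocess.run(
    ["ffmpeg", "-version"],   # list form avoids shell quoting issues
    capture_output=True, text=True, check=True,
)
print(result.stdout.splitlines()[0])
```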
{Rakesh}
- Input is taken to format the naming of the wav files to a particular id, padding the file number to match the number of digits using `rjust` (see the sketch below).
- Same id format is used for the id in the csv file.
- Trim silence at the left and right ends of the audio segment using a silence threshold (ref).
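A tiny sketch of the `rjust`-based id padding; the id format itself is an illustrative assumption:

```python
speaker_id = "SPK1"
for file_number in (1, 23, 456):
    utt_id = f"{speaker_id}_{str(file_number).rjust(5, '0')}"
    print(utt_id)  # SPK1_00001, SPK1_00023, SPK1_00456
```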
{Akarsh}
{Rakesh}
- Learn about STFT and Mel Spectrogram.
- Understand the process in the code (NVIDIA stft, NVIDIA train).
{Akarsh}
- Added `DownloadConfig` and `DownloadProcessor`.
- Added extraction support and changed the pipeline -> `i2at, i2aw` contain all values before processing and `i2t, i2w` contain only the values that are valid.
- Added wav dump creation with length filtering and sampling rate (and bitrate) modification in the `dump/wavs/` directory.
- NOTE: Need to add `trimAudio` for the wavs before checking the length for filtering.
- `i2w` format -> `<utt_id> <wav_path> <wav_shape>`.
- You can now download `wav` and `vtt` files from YouTube (`yt_dlp`) and use the alignment to segment the audio file into the `wavs/` directory, and also choose a `speaker_id` for `utt_id`. Then we can prepare `transcript.txt` using the standard delimiter `|`, and finally we have a dataset.
- Added `secToFormattedTime` in `utils` for printing time from seconds in a standard format (a sketch is given after this list).
- Check `trimAudio` in general and format `create_dataset`.
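A sketch of a seconds-to-string formatter; the exact output format used by `secToFormattedTime` is an assumption:

```python
def sec_to_formatted_time(seconds: float) -> str:
    hours, rem = divmod(seconds, 3600)
    minutes, secs = divmod(rem, 60)
    return f"{int(hours):02d}:{int(minutes):02d}:{secs:06.3f}"

print(sec_to_formatted_time(3725.5))  # 01:02:05.500
```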
{Rakesh}
- Added the `create_dataset` file to download the wav and transcript of the given YouTube link and divide them into multiple segments and their respective text data.
- Switched from `youtube_dl` to `yt-dlp`, a forked version of youtube_dl, as there were issues with youtube_dl (yt-dlp).
- Added a function to check whether the given link is a YouTube link or not (a sketch follows this list).
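A sketch of a YouTube-link check using only the standard library; the real helper may differ:

```python
from urllib.parse import urlparse

def is_youtube_link(url: str) -> bool:
    host = urlparse(url).netloc.lower()
    return host == "youtu.be" or host.endswith("youtube.com")

print(is_youtube_link("https://www.youtube.com/watch?v=dQw4w9WgXcQ"))  # True
print(is_youtube_link("https://example.com/audio.wav"))                # False
```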
{Akarsh}
- Added own code for `trim_audio_silence` and removed the `pydub` dependency (a dBFS-based sketch is given after this list).
- References for calculating dBFS: (wiki, src, and ChatGPT).
- Changed `None` default values to `str` type (`typeguard` gives issues for different versions).
- Fix global pip installation issue (src). Not able to fix!
- Fixed the global pip installation issue. The issue arises because the machine has both `pip3` and `conda` installed. So when we do `/path/to/conda/env/python3 -m pip list` with Python already installed in the env, it looks for the normal pip installation. Hence the simplest solution is to just remove the `pip3` installation, and then `conda` will only look in its own envs for pip.
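A sketch of dBFS-based leading/trailing silence trimming on a float signal in [-1, 1]; the frame size, threshold, and exact dBFS convention here are assumptions, not the actual `trim_audio_silence` code:

```python
import numpy as np

def trim_silence(signal: np.ndarray, threshold_db: float = -40.0,
                 frame_length: int = 1024) -> np.ndarray:
    # RMS per frame converted to dBFS (0 dBFS == full-scale RMS of 1.0).
    n_frames = len(signal) // frame_length
    keep = []
    for i in range(n_frames):
        frame = signal[i * frame_length:(i + 1) * frame_length]
        rms = np.sqrt(np.mean(frame ** 2) + 1e-12)
        keep.append(20 * np.log10(rms) > threshold_db)
    if not any(keep):
        return signal[:0]                     # everything below threshold
    first = keep.index(True)
    last = len(keep) - 1 - keep[::-1].index(True)
    return signal[first * frame_length:(last + 1) * frame_length]
```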
{Rakesh}
- Changed the `vtt_to_csv` file to remove the dependency on regex. Also simplified the function to remove redundancy.
- Removed the `parse_vtt` function and added the same functionality into the `vtt_to_csv` function.
- Added verbose and quiet options for the `Download_YT` function.
- Using `os.path.join()` for all directory paths.
{Akarsh}
- Changed the pipeline. `DatasetProcessor` takes in `AudioConfig` and `TextConfig` and handles preprocessing using `TextProcessor` and `AudioProcessor` inside it.
- NOTE: Need to add code for handling multi channel wav audio files (convert to single channel before checking for silence).
- Added simple text tokenization and updated the tokens with their calculated indices (a sketch is given after this list).
- We also need to add `<SOS>`/`<EOS>` tokens, along with some standard tokenizers and cleaners.
- Use `random.sample()` as it samples without replacement; `random.choices()` samples with replacement.
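A sketch of char-level tokenization with an index lookup and SOS/EOS markers; the symbol set and special-token names are illustrative assumptions:

```python
SOS, EOS, PAD = "<sos>", "<eos>", "<pad>"

def build_vocab(texts):
    symbols = sorted({ch for text in texts for ch in text})
    tokens = [PAD, SOS, EOS] + symbols
    return {tok: idx for idx, tok in enumerate(tokens)}

def tokenize(text, token2id):
    ids = [token2id[SOS]]
    ids += [token2id[ch] for ch in text if ch in token2id]  # drop OOV chars
    ids.append(token2id[EOS])
    return ids

vocab = build_vocab(["hello world", "printing tokens"])
print(tokenize("hello", vocab))
```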
{Rakesh}
- Check how the text is processed (tokenizers, cleaners) in (NVIDIA) and (espnet).
- Learn how the text and mel are processed in `TextMelLoader` and `TextMelCollate`.
{Akarsh}
- Look into `model.eval()` and `torch.no_grad()` use cases (src).
- `np.float32` datatype: 1 sign bit, 23 bits mantissa, 8 bits exponent (single-precision float) (src).
- `g2p`: we need to add OOV handling (for char level we just ignore those characters).
- Trainable Fourier kernels (src1, src2).
- STFT simple code idea in python (src).
- We zero-pad to the length of the frame in order to get the first half of the FFT to have the same length as the frame.
- `np.fft.fft` has even (Hermitian) symmetry for real input (src).
- `ffmpeg` cannot edit existing files in-place; we need to make a duplicate file.
- Replaced the `pcm_u8` codec with the default `pcm_s16le` codec in `ffmpeg`. In the unsigned type the mean is positive (the signal is shifted up into the positive axis), hence the mean won't be zero and it is therefore not ideal.
{Akarsh}
- `tensor.half()` or `model.half()` will convert all model weights to half precision. This is done to speed up calculations (src).
- Why the computed frame count and the `librosa.stft` frame count do not match: because of center padding. Read (src1, src2). With `center=False` it is the normal calculation, that is $$ \textrm{total length} = \textrm{window length} + (\lambda - 1) \cdot \textrm{hop length} $$ where $\lambda$ is the number of frames; this also assumes that the signal fits perfectly (if extra samples remain, we don't consider them). Also, if `center=False` in `librosa.stft()` then `pad_mode` is ignored. If `center=True`, then the default `pad_mode="constant"` is used to pad zeros to the input signal on both sides for framing.
- Nyquist theorem and calculating a signal with a particular frequency (a sketch at the end of this entry ties this together with the STFT/mel pipeline below):
  - Nyquist theorem states: sampling frequency >= 2 * max frequency to be preserved, i.e. #points per second >= 2 * #waves per second.
  - If the coefficient increases, the wavelength decreases (inversely related). coeff -> #values for one wave (continuous): coeff 1 -> one wave per 2 * pi values, coeff x -> one wave per 1 / f values, so 1 * (2 * pi) = x * (1 / f) => x = 2 * pi * f. Use sin(x * points) for an f-frequency signal.
- Important concepts for STFT and mel spectrogram (the way `librosa` does `librosa.amplitude_to_db()`, `librosa.stft()`); a sketch is given at the end of this entry:
  - signal -> [filter_length, hop_length, window_length] -> frames
  - frames (real) -> [stacking fft(frame) => stft] -> spectrogram (complex)
  - spectrogram (complex) -> [np.abs] -> amplitude_spectrogram -> [np.power, usually ^2] -> power_spectrogram
  - power_spectrogram -> [10 * log10(max(amin, sig)) - 10 * log10(max(amin, ref))] -> dB_scale_spectrogram
  - power_spectrogram -> [np.matmul(mel_transform_filter, spectrogram)] -> mel_spectrogram
  - `np.abs()` of the spectrogram gives magnitude while `np.angle()` gives phase
- Reading about mel and its use (src).
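A small sketch tying the Nyquist note and the STFT -> power -> mel -> dB pipeline above together; it uses `librosa` only for brevity (GenVox later re-implements stft in numpy), and all parameter values and the synthetic signal are illustrative:

```python
import numpy as np
import librosa

fs, f = 22050, 440.0                         # fs >= 2 * f (Nyquist)
t = np.arange(fs) / fs                       # one second of sample instants
signal = np.sin(2 * np.pi * f * t)           # coefficient x = 2 * pi * f

n_fft, hop_length, win_length = 1024, 256, 1024
stft = librosa.stft(signal, n_fft=n_fft, hop_length=hop_length,
                    win_length=win_length)   # complex spectrogram
amplitude = np.abs(stft)                     # magnitude (np.angle gives phase)
power = amplitude ** 2                       # power spectrogram

mel_filter = librosa.filters.mel(sr=fs, n_fft=n_fft, n_mels=80)
mel = np.matmul(mel_filter, power)           # mel spectrogram

amin, ref = 1e-10, 1.0                       # dB conversion as in the notes above
mel_db = 10 * np.log10(np.maximum(amin, mel)) - 10 * np.log10(np.maximum(amin, ref))
print(stft.shape, mel_db.shape)              # (1 + n_fft // 2, n_frames), (80, n_frames)
```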
{Akarsh}
- `librosa.stft()` uses `np.fft.rfft()`, hence we get `1 + (n_fft // 2)` values as the output (src).
- `np.fft.fft()` for real input gives Hermitian-symmetric output, where the negative frequencies are the complex conjugates of the positive frequencies and are hence redundant.
- Completely implemented `stft()` and `istft()` from scratch using `numpy`. Heavily used the `librosa` docs and other resources, but greatly simplified the code by making it quite task-specific.
- Added the above explored code along with other audio helper functions in `audio.py`.
- Removed the `librosa` dependency.
- Added `AudioProcessor.convert2mel()` that takes in a wav_path, extracts features (mel spectrogram) from it, and saves them as a `.npy` file in the `dump/feats` directory.
- A `.npy` file is the `numpy` format for saving arrays as data (src).
{Akarsh}
- Explore the idea of calculating features on the fly. We can reduce load and increase parallelization during loading. This is handled natively by `torch.utils.data.DataLoader`, which can be accessed via the `num_workers` value. The only upside is saving memory.
- Assumption: single speaker, single language on a single GPU.
- Added data splitting into training and validation inside `DatasetProcessor`.
- Check `pin_memory=False` and `drop_last=True` along with `shuffle=True/False` in `torch.utils.data.DataLoader`.
- Add a collate function for the DataLoader.
{Akarsh}
- `torch.nn` vs `torch.nn.functional` (src).
- `torch.utils.data.DataLoader` attributes (src):
  - pin_memory (bool, optional) – If True, the data loader will copy Tensors into device/CUDA pinned memory before returning them. If your data elements are a custom type, or your collate_fn returns a batch that is a custom type, see the example below. (default: False)
  - drop_last (bool, optional) – set to True to drop the last incomplete batch, if the dataset size is not divisible by the batch size. If False and the size of the dataset is not divisible by the batch size, then the last batch will be smaller. (default: False)
- Added `TextMelCollate()` for `torch.utils.data.DataLoader.collate_fn` in `Trainer` (a padding sketch follows this list).
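A sketch of a text/mel collate function that zero-pads a batch to its longest item; this is a simplified stand-in, not the exact `TextMelCollate` implementation:

```python
import torch

def text_mel_collate(batch):
    """batch: list of (token_ids LongTensor [T_text], mel FloatTensor [n_mels, T_mel])."""
    texts, mels = zip(*batch)
    text_lengths = torch.tensor([t.size(0) for t in texts])
    mel_lengths = torch.tensor([m.size(1) for m in mels])

    text_padded = torch.zeros(len(batch), int(text_lengths.max()), dtype=torch.long)
    mel_padded = torch.zeros(len(batch), mels[0].size(0), int(mel_lengths.max()))
    for i, (t, m) in enumerate(batch):
        text_padded[i, : t.size(0)] = t
        mel_padded[i, :, : m.size(1)] = m
    return text_padded, text_lengths, mel_padded, mel_lengths
```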
{Akarsh}
- Added `CheckpointManager()` in `Trainer` that will handle file saving for the top n checkpoints based on the loss value (a sketch is given after this list).
- Added `WandbLogger()` in `Trainer` that will handle `wandb` logging and syncing.
- Added `use_g2p` and updated the `TextProcessor` pipeline.
- Added `g2p_en` by Park Kyu Byong (src).
- NOTE:
  - Consider on-the-fly loading of tokens and mel-feature generation (along with resampling and trimming). This will save space, but we won't have control over the size of the audio.
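A sketch of a top-n checkpoint manager keyed on the loss value; the file naming and saved dict layout are illustrative, not the exact `Trainer` code:

```python
import os
import torch

class CheckpointManager:
    def __init__(self, exp_dir: str, max_best: int = 3):
        self.exp_dir = exp_dir
        self.max_best = max_best
        self.best = []  # list of (loss, path), kept sorted ascending

    def save(self, model, iteration: int, loss: float):
        path = os.path.join(self.exp_dir, f"checkpoint_{iteration}.pt")
        torch.save({"model_state_dict": model.state_dict(),
                    "iteration": iteration, "loss": loss}, path)
        self.best.append((loss, path))
        self.best.sort(key=lambda x: x[0])
        # Drop checkpoints beyond the best n.
        for _, stale_path in self.best[self.max_best:]:
            if os.path.exists(stale_path):
                os.remove(stale_path)
        self.best = self.best[:self.max_best]
```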
{Akarsh}
- Omitting the optimizer param-group update before every iteration. Later on, add it and experiment (src).
- `optimizer.zero_grad()` vs `model.zero_grad()` (src). They are the same (src2).
- Decide which variables to log (gpu, etc.) and config values.
- Add timers for the training loop and estimators.
{Akarsh}
- Added evaluation in training loop with timers and estimators.
- Integrated wandb support for images. Updated `WandbLogger` to abstract the direct `wandb` API. `commit=True` in `wandb.log()` also increments the iteration value (src).
- Added `inference.py` that contains the `TTSModel()` class for handling inference.
- Added time formatting in `utils.py` and log printing.
- Difference between the numpy saving formats `.npy` and `.npz`, and where to use them (src); a small sketch follows this list.
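A small sketch of the two formats; file names and arrays are illustrative:

```python
import numpy as np

mel = np.random.rand(80, 400).astype(np.float32)
np.save("utt_0001.npy", mel)                      # one array per file
loaded = np.load("utt_0001.npy")

np.savez_compressed("utt_0001.npz", mel=mel,      # several named arrays in one file
                    gate=np.zeros(400, dtype=np.float32))
bundle = np.load("utt_0001.npz")
print(loaded.shape, bundle["gate"].shape)
```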
{Akarsh} (cumulative)
- Added validation functionality inside the `Trainer()` class to handle `validation_dataloader`, logging, and saving plots in `exp/validation_runs` and wandb.
  - Added wandb Image plotting along with ground truth and prediction plotting.
- Added checkpoint resuming for training.
- Changed the model saving format to `{'model_state_dict': model.state_dict(), 'iteration': iteration}` so that during checkpoint resuming for training, we have access to the iteration value for correct saving and logging.
- Added time formatting and center printing using functions in `utils.py`: `current_formatted_time()`, `log_print()`, `center_print()`.
  - Epoch start, end and duration.
  - Iteration start, end and duration.
  - Validation start, end and duration.
  - Estimated time to finish training.
- Added plotters in `utils.py` for plotting the mel spectrogram (`saveplot_mel()`), alignments (`saveplot_alignment()`), gates (`saveplot_gate()` with double purpose), and the raw wav signal (`saveplot_signal()`).
- Added `config.yaml` support. Automatically save and load based on the extension [json, yaml].
  - Added autosaving of params to `exp/config.yaml`, which is handled by the `Trainer()` class.
  - Added `load_yaml()` and `dump_yaml()` in `utils.py`.
- Added `db_to_amplitude()` in `audio.py` for converting decibel-scale mel back to its proper magnitude, along with testing code.
- Added `mel2audio()` support for `TTSModel()` in `inference.py`, along with saving of mel, gate, alignment and signal plots.
- Added the option `remove_wav_dump` in `DatasetConfig` to remove the `dump/wavs` directory once the features have been calculated and saved in the `dump/feats` directory.
- Added extra values to the config dict in `wandb.init()` for project details.
- Added gradient clipping for preventing gradient explosion (a sketch follows this list).
- Other fixes for gradient issues src.
- `torch.backends.cudnn.enabled=True` is better to speed up conv and RNN layers src. `torch.backends.cudnn.benchmark=True` allows the cudnn autotuner to optimize the algorithm for the hardware. But this only helps if the input size is always the same; since our input size changes every iteration, we keep it as `False` to prevent it from decreasing performance src.
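A sketch of a training step with gradient clipping and checkpoint resuming as described above; the function and argument names are illustrative, not the actual `Trainer` code:

```python
import torch

def load_checkpoint(path, model):
    checkpoint = torch.load(path, map_location="cpu")
    model.load_state_dict(checkpoint["model_state_dict"])
    return checkpoint["iteration"]  # resume counters from here

def training_step(model, optimizer, batch, loss_fn, grad_clip_thresh=1.0):
    optimizer.zero_grad()
    outputs = model(*batch[:-1])
    loss = loss_fn(outputs, batch[-1])
    loss.backward()
    # Clip the global gradient norm to avoid gradient explosion.
    torch.nn.utils.clip_grad_norm_(model.parameters(), grad_clip_thresh)
    optimizer.step()
    return loss.item()
```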
{Akarsh}
- Added scale and power options in `amplitude_to_db()` in `audio.py` to handle special feature extraction (`power=False, scale=1`) which we will be using from now on (even though it defies the normal dB definition, we use it because it gives better results). Also updated `db_to_amplitude()` to handle this.
- Added an epoch start offset for when resuming training.
{Akarsh}
- NOTE: ALIGNMENTS!! Changing the mel dB scale does the trick. The idea is that the original scale was quite high and hard to learn. Decreasing the scale to an almost 0-1 range (giving it pseudo normalization) makes it ideal for the network to learn (check `exp_run_7`). Even changed `ref_level_db=1`.
- Added `griffin_lim()` in `audio.py` for signal reconstruction (a sketch of the algorithm follows this list).
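A sketch of the Griffin-Lim iteration for reconstructing a signal from a magnitude spectrogram; `stft`/`istft` are assumed to be the numpy helpers in `audio.py`, so treat this as pseudocode for the algorithm rather than the exact implementation:

```python
import numpy as np

def griffin_lim(magnitude, stft, istft, n_iters: int = 60):
    # Start from random phase and iteratively enforce the known magnitude.
    angles = np.exp(2j * np.pi * np.random.rand(*magnitude.shape))
    signal = istft(magnitude * angles)
    for _ in range(n_iters):
        angles = np.exp(1j * np.angle(stft(signal)))
        signal = istft(magnitude * angles)
    return signal
```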
{Akarsh}
- Added `seaborn` style plotting.
- Added `reduce_noise()` based on a low-pass Butterworth filter in `audio.py` (a sketch is given after this list).
- NOTE: Created a new git branch `dev` for the development of a model-independent codebase, with support for both TTS and Vocoder.
- Git Branching
  - `git remote show origin` for full branch details.
  - src
  - src
  - src
  - src
  - src
  - If you want to switch to a remote branch that does not exist as a local branch in your local working directory, you can simply execute `git switch remoteBranch`. When Git is unable to find this branch in your local repository, it will assume that you want to check out the respective remote branch with the same name. It will then create a local branch with the same name. It will also set up a tracking relationship between your remote and local branch so that `git pull` and `git push` work as intended src.
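A sketch of a low-pass Butterworth noise-reduction step using `scipy`; the cutoff frequency and filter order are illustrative, not the values used in `audio.py`:

```python
import numpy as np
from scipy.signal import butter, filtfilt

def reduce_noise(signal: np.ndarray, fs: int, cutoff_hz: float = 8000.0,
                 order: int = 5) -> np.ndarray:
    # Normalized cutoff (Nyquist = fs / 2); filtfilt applies the filter
    # forward and backward for zero phase distortion.
    b, a = butter(order, cutoff_hz / (fs / 2), btype="low")
    return filtfilt(b, a, signal)
```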
{Akarsh}
- Python Decorators (src)
- My implementation src.
- Added a `tools/` directory to host helper scripts that run independent tasks. The code is directly taken from the existing GenVox code. The motivation is to use the existing code for other purposes, hence the need to give separate access.
  - `tools/resample.py` to resample audio wav files based on `AudioProcessor()`.
  - `tools/trim_audio.py` to trim audio based on `utils.trim_audio_silence()`.
{Akarsh}
- Added `vocoder/utils.py` that houses the `SigMelDataset()` dataset class for audio and mel data.
  - Added `max_frames` to clip or pad to a specific number of frames (and, equivalently, in audio signal length) for faster training and cudnn benchmark acceleration.
  - NOTE: Need to experiment with adding noise as given here src.
- Made changes in `trainer` to handle vocoder input.
- Added MelGAN vocoder.
  - Fixed the training bug: `.detach()` for the generator output during training of the discriminator (a sketch of this step is given at the end of this entry).
  - Used `torch.backends.cudnn.benchmark = True` for optimization. Fixed the input lengths for every batch.
- Main fix for GAN vocoder training: `optimizer_config = OptimizerConfig(learning_rate=0.0001, beta1=0.5, beta2=0.9, weight_decay=0)`
- NOTE: Don't use two package managers (`pip`, `conda`) while installing packages.
  - If you installed `torch` with `pip` (`pip install torch`) and then install `accelerate` using `conda` (`conda install accelerate -c conda-forge`), then it will also install `pytorch` (mostly the cpu version) and replace the already installed `torch` package.
  - This is because `conda` is not aware of what `pip` installed and will act independently, hence corrupting the environment.
  - We can avoid this by using `pip` to install (`pip install accelerate`).
- Using HuggingFace Accelerate Python package for distributed training (huggingface, github).
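A sketch of the GAN vocoder discriminator step showing why `.detach()` on the generator output matters; the model/loss names are illustrative, not the exact MelGAN training code:

```python
import torch

def discriminator_step(generator, discriminator, optimizer_d, mel, real_audio):
    fake_audio = generator(mel)
    # detach() stops gradients of the discriminator loss from flowing back
    # into the generator; without it the generator is updated with the wrong loss.
    d_real = discriminator(real_audio)
    d_fake = discriminator(fake_audio.detach())
    loss_d = torch.mean((d_real - 1) ** 2) + torch.mean(d_fake ** 2)  # LSGAN-style
    optimizer_d.zero_grad()
    loss_d.backward()
    optimizer_d.step()
    return loss_d.item()
```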