Development Log


Log Dates



  • HindiTTS for dataset creating, annotating and cleaning.
  • Aeneas_Extended for alignment using aeneas.
  • ESPNET, Coqui for format, pipeline and code reference.
  • ReadtheDocs for documentation preparation.
  • First understand logging, argparse and get a defined <input, output, log> format for any code.
  • Follow pip standards, docstrings, and tabspaces for coding.
  • Define config file format (separate class -> yaml wrapper?):
    • project details
      • name
      • tag
      • language
    • dataset details (for extracting text: local/
    • preparing details
      • text:
        • token type
        • cleaner
        • g2p (if any)
      • audio:
        • sampling_rate(fs)
        • trim_silence? (trim_db)
        • signal_norm
    • preprocessing details (TODO)
    • model details
      • name
      • params
    • trainer details (TODO: logging, batch, etc)
    • output details (path for output_dir/)
  • Learn git branching.
  • Decide whether to use bash: I prefer it because it is quick to write, and I am inclined to write both bash and Python variants.
  • readme md links:
  • files:
    • (prepare_data, preprocess_data, train_tts)
    • (dataset creating using raw audio, transcripts or youtube links)

Ideology: run code with config path
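A minimal sketch of the "run code with config path" idea, assuming a JSON config file. The names `parse_args`, `load_config`, `--config`, the section keys, and all values are hypothetical; they only mirror the outline in the list above and are not the project's actual schema.

```python
import argparse
import json
import os
import tempfile

# Hypothetical top-level sections, mirroring the config outline in this log.
REQUIRED_SECTIONS = ("project", "dataset", "preparing", "model", "trainer", "output")


def parse_args(argv=None):
    """Parse the single command-line argument: the path to the config file."""
    parser = argparse.ArgumentParser(description="Run a pipeline step from a config file.")
    parser.add_argument("--config", required=True, help="path to a JSON config file")
    return parser.parse_args(argv)


def load_config(path):
    """Load the config and check that every expected section is present."""
    with open(path, "r", encoding="utf-8") as f:
        config = json.load(f)
    missing = [name for name in REQUIRED_SECTIONS if name not in config]
    if missing:
        raise KeyError(f"config is missing sections: {missing}")
    return config


# Example: write a small config to a temporary file and load it back.
example = {
    "project": {"name": "demo", "tag": "v0", "language": "en"},
    "dataset": {},
    "preparing": {
        "text": {"token_type": "char", "cleaner": None, "g2p": None},
        "audio": {"sampling_rate": 22050, "trim_silence": True, "trim_db": 60},
    },
    "model": {"name": "some_model", "params": {}},
    "trainer": {},
    "output": {"output_dir": "exp/"},
}
with tempfile.TemporaryDirectory() as tmp_dir:
    config_path = os.path.join(tmp_dir, "config.json")
    with open(config_path, "w", encoding="utf-8") as f:
        json.dump(example, f)
    args = parse_args(["--config", config_path])
    config = load_config(args.config)
```

Every script (prepare, preprocess, train) could then share the same entry-point shape and differ only in which sections of the config it reads.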

  • Added base_config, dataset_config (using this and this for value and docstring reference).
  • Added dataset_processor for making i2t and i2w (created LJSpeech_{small, sample} for experimentation and testing).
  • Writing entire runner script in and will later divide it into individual components.



  • Using coqui for coding format, espnet for data format and NVIDIA for model code.
  • We are using nn.ModuleList instead of a plain Python list because layers stored in a regular list are not registered as submodules, so model.parameters() won't include their parameters (source).
  • squeeze and unsqueeze in torch tensor (source).
  • Assumption: single speaker, single language, single GPU (not distributed).
  • For distributed read about: workers, jobs, nodes, tasks.
  • collate: "collect and combine (texts, information, or data)" (source, CREPE-ref)
  • MUST-C dataset (source)


  • Using youtube-dl to download the audio (wav) and transcript (.vtt file) of a video. (Used pip install git+ as there was a change in YouTube metadata.)
  • Converted the data in the vtt file into a csv file formatted like the LJSpeech dataset.
  • The audio is split into smaller segments using the timestamps in the vtt file, and the segments are written as wav files into a directory.
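The vtt-to-segments step above can be sketched roughly as follows. This is an illustration, not the project's code: `timestamp_to_seconds`, `vtt_to_cues`, `to_metadata_rows`, and the `spk` prefix are hypothetical, and the pipe-delimited rows only follow the spirit of the LJSpeech metadata layout.

```python
def timestamp_to_seconds(stamp):
    """Convert a WebVTT timestamp like '00:01:02.500' (or '01:02.500') to seconds."""
    parts = stamp.strip().split(":")
    seconds = float(parts[-1])
    if len(parts) >= 2:
        seconds += int(parts[-2]) * 60
    if len(parts) == 3:
        seconds += int(parts[0]) * 3600
    return seconds


def vtt_to_cues(text):
    """Return a list of (start_sec, end_sec, caption) tuples from WebVTT text."""
    cues = []
    lines = text.splitlines()
    i = 0
    while i < len(lines):
        line = lines[i]
        if "-->" in line:
            start, rest = line.split("-->", 1)
            end = rest.split()[0]  # drop cue settings such as 'align:start'
            i += 1
            caption = []
            # Caption text runs until the next blank line.
            while i < len(lines) and lines[i].strip():
                caption.append(lines[i].strip())
                i += 1
            cues.append((timestamp_to_seconds(start), timestamp_to_seconds(end), " ".join(caption)))
        else:
            i += 1
    return cues


def to_metadata_rows(cues, speaker_id="spk"):
    """Format cues as 'utt_id|text' rows, one per audio segment."""
    return [f"{speaker_id}_{n + 1:04d}|{text}" for n, (_, _, text) in enumerate(cues)]


sample = """WEBVTT

00:00:01.000 --> 00:00:04.250
Hello world

00:01:02.500 --> 00:01:05.000 align:start
second line
continues here
"""
cues = vtt_to_cues(sample)
rows = to_metadata_rows(cues)
```

The (start, end) pairs are what a segmenter would use to cut the long wav into per-utterance files.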



  • in nn.Module is explained here (source).
  • Gate Prediction is basically stop token prediction. It is used to stop inference in decoding stage (source).
  • To prevent Markdown All in One from auto creating Table of Contents, add a <!-- no toc --> comment above the list.
  • subprocess in python to handle bash commands (source).


  • An id is taken as input to name the wav files; the file number is left-padded with rjust so that every file name has the same number of digits.
  • The same id format is used for the id in the csv file.
  • Trim silence from the left and right ends of each audio segment using a silence threshold (ref).
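A small sketch of the rjust-based naming described above. The helper name `make_utt_id` and the `spk` prefix are hypothetical.

```python
def make_utt_id(prefix, index, total):
    """Build an id like 'spk_007' whose number is padded to the width of `total`."""
    width = len(str(total))
    return f"{prefix}_{str(index).rjust(width, '0')}"


# With 120 segments, every file number is padded to three digits.
ids = [make_utt_id("spk", i, 120) for i in (1, 42, 120)]
```

Fixed-width numbers keep the files in the right order when sorted by name.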






  • Added DownloadConfig and DownloadProcessor.
  • Added extraction support and changed pipeline -> i2at, i2aw contain all values before processing and i2t, i2w contain only the values that are valid.
  • Added wav dump creation with length filtering and sampling rate modification (and bitrate) in dump/wavs/ directory.
  • NOTE: Need to apply trimAudio to the wavs before checking their length for filtering.
  • i2w format -> <utt_id> <wav_path> <wav_shape>.
  • You can now download wav and vtt files from Youtube (yt_dlp) and use the alignment to segment the audio file into wavs/ directory and also choose a speaker_id for utt_id. Then we can prepare transcript.txt using standard delimiter | and finally we have a dataset.
  • Added secToFormattedTime in utils for printing time from seconds to a standard format.
  • Check trimAudio in general and format create_dataset.
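A possible shape for a helper like secToFormattedTime. The HH:MM:SS output format is an assumption, since the log does not say which "standard format" is used, and the snake_case name is only for illustration.

```python
def sec_to_formatted_time(seconds):
    """Format a duration in seconds as 'HH:MM:SS' (hours may exceed 24)."""
    total = int(round(seconds))
    hours, remainder = divmod(total, 3600)
    minutes, secs = divmod(remainder, 60)
    return f"{hours:02d}:{minutes:02d}:{secs:02d}"


formatted = sec_to_formatted_time(3725.4)
```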


  • Added create_dataset file to download wav and transcript of the given youtube link and divide it into multiple segments and their respective text data.
  • Switched from youtube_dl to yt-dlp, a fork of youtube_dl, as there were issues with youtube_dl. (yt-dlp).
  • Added a function to check whether the given link is youtube link or not.
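One way the link check above could look, using only the standard library. The function name `is_youtube_link` and the host list are assumptions and not necessarily what the project does.

```python
from urllib.parse import urlparse

# Assumed set of accepted hosts; extend as needed.
YOUTUBE_HOSTS = {"youtube.com", "www.youtube.com", "m.youtube.com", "youtu.be"}


def is_youtube_link(url):
    """Return True if url is an http(s) URL whose host is a known YouTube host."""
    parsed = urlparse(url)
    return parsed.scheme in ("http", "https") and parsed.hostname in YOUTUBE_HOSTS
```

Checking the parsed hostname avoids false positives from URLs that merely contain the word "youtube" somewhere in the path.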



  • Added own code for trim_audio_silence and removed pydub dependency.
  • References for calculating dBFS: (wiki, src, and ChatGPT).
  • Changed None for default value str type. (typeguard gives issues for different versions).
  • Fix global pip installation issue (src). Not able to fix!
  • Fixed global pip installation issue. The issue arises because the machine has both pip3 and conda installed. So when we do /path/to/conda/env/python3 -m pip list with python already installed in the env, it is looking for the normal pip installation. Hence the simplest solution would be to just remove the pip3 installation and then conda will only look in its own envs for pip.
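A rough, dependency-free sketch of dBFS-based silence trimming in the spirit of trim_audio_silence. It assumes float samples in [-1, 1]; the function names, the chunk size, and the threshold are illustrative choices, not the project's values.

```python
import math


def rms_dbfs(samples):
    """RMS level of float samples in [-1, 1], in dB relative to full scale (1.0)."""
    if not samples:
        return float("-inf")
    rms = math.sqrt(sum(s * s for s in samples) / len(samples))
    return 20.0 * math.log10(rms) if rms > 0 else float("-inf")


def trim_silence(samples, threshold_db=-40.0, chunk=4):
    """Drop leading and trailing chunks whose RMS level is below threshold_db."""
    start = 0
    while start < len(samples) and rms_dbfs(samples[start:start + chunk]) < threshold_db:
        start += chunk
    end = len(samples)
    while end > start and rms_dbfs(samples[max(start, end - chunk):end]) < threshold_db:
        end -= chunk
    return samples[start:max(start, end)]


signal = [0.0] * 8 + [0.5, -0.5, 0.5, -0.5] + [0.0] * 8
trimmed = trim_silence(signal)
```

In practice the chunk would span some milliseconds of audio rather than four samples.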


  • Changed the vtt_to_csv file to remove the dependency on regex. Also simplified the function to remove redundancy.
  • Removed parse_vtt function and added the same functionality into vtt_to_csv function.
  • Added verbose and quiet options for Download_YT function.
  • Using os.path.join() for all directory paths.



  • Changed the pipeline.
    • DatasetProcessor takes in AudioConfig and TextConfig and handles preprocessing using TextProcessor and AudioProcessor inside it.
  • NOTE: Need to add code for handling multi channel wav audio files (convert to single channel before checking for silence).
  • Added simple text tokenization and updated them with calculated index.
  • We also need to add <SOS/EOS> tokens, along with some standard tokenizers and cleaners.
  • Use random.sample() as it samples without replacement. random.choices() samples with replacement.
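The difference between the two sampling functions can be seen in a small train/validation split. The utterance ids and split sizes here are made up for illustration.

```python
import random

rng = random.Random(0)  # fixed seed so the split is reproducible
utt_ids = [f"utt_{i:03d}" for i in range(10)]

# random.sample draws without replacement: every validation id is unique.
val_ids = rng.sample(utt_ids, k=3)
train_ids = [u for u in utt_ids if u not in val_ids]

# random.choices draws with replacement: duplicates are possible,
# and k may even exceed the population size.
with_replacement = rng.choices(utt_ids, k=15)
```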


  • Check how the text is processed (tokenizers, cleaners) in (NVIDIA) and (espnet).
  • Learn how the text and mel are processed in TextMelLoader and TextMelCollate.



  • Look into model.eval() and torch.no_grad() use cases (src).
  • np.float32 datatype: 1 sign bit, 8 exponent bits, 23 mantissa bits (IEEE 754 single-precision float) (src).
  • g2p: we need to add oov (for char level we just ignore those characters).
  • Trainable Fourier kernels (src1, src2).
  • STFT simple code idea in python (src).
    • We zero-pad each frame by its own length, so that the first half of the FFT output has the same length as the frame.
    • np.fft.fft of a real-valued input has Hermitian (conjugate) symmetry (src).
  • ffmpeg cannot edit existing files in-place; we need to write to a new file.
  • Replaced the pcm_u8 codec with the default pcm_s16le codec in ffmpeg. With an unsigned type the signal is shifted up onto the positive axis, so its mean is positive rather than zero, which is not ideal.
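The float32 bit layout noted above can be checked with the standard struct module. The helper name `float32_fields` is only for illustration.

```python
import struct


def float32_fields(value):
    """Split a number's IEEE 754 single-precision encoding into its three fields."""
    (bits,) = struct.unpack(">I", struct.pack(">f", value))
    sign = bits >> 31                  # 1 bit
    exponent = (bits >> 23) & 0xFF     # 8 bits, biased by 127
    mantissa = bits & 0x7FFFFF         # 23 bits, implicit leading 1 not stored
    return sign, exponent, mantissa


one = float32_fields(1.0)
neg = float32_fields(-2.5)  # -2.5 = -1.25 * 2**1
```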



  • tensor.half() or model.half() will convert all model weights to half precision. This is done to speed up calculations (src).
  • Why the computed frame count and the librosa.stft frame count do not match: center padding. Read (src1, src2). With center=False it is the normal calculation, that is $$ \textrm{total length} = \textrm{window length} + (\lambda - 1) \times \textrm{hop length} $$ where $\lambda$ is the number of frames. This also assumes that the signal fits perfectly (any leftover samples are not considered).
    Also, if center=False in librosa.stft() then pad_mode is ignored. If center=True, then we use the default value of pad_mode="constant" to pad zeros to the input signal on both sides for framing.
  • Nyquist theorem and generating a signal with a particular frequency:
    The Nyquist theorem states:
    sampling frequency >= 2 * max_frequency to be preserved
    #points per sec >= 2 * #waves per sec
    In sin(coeff * t), if coeff increases the wavelength decreases (inversely related)
    coeff -> length of one wave (continuous)
    1 -> 2 * pi
    x -> 1 / f
    1 * 2 * pi = x * 1 / f
    => x = (2 * pi * f)
    use sin(x * t), with t the sample times in seconds, for a signal of frequency f
  • Important concepts for STFT and mel spectrogram (the way librosa does librosa.amplitude_to_db(), librosa.stft()):
    • signal -> [filter_length, hop_length, window_length] -> frames
    • frames (real) -> [stacking fft(frame) => stft] -> spectrogram (complex)
    • spectrogram (complex) -> [np.abs] -> amplitude_spectrogram -> [np.power, usually ^2] -> power_spectrogram
    • power_spectrogram -> [10 * log10(max(amin, sig)) - 10 * log10(max(amin, ref))] -> dB_scale_spectrogram
    • power_spectrogram -> [np.matmul(mel_transform_filter, power_spectrogram)] -> mel_spectrogram
    • np.abs() of spectrogram gives magnitude while np.angle() gives phase
  • Reading about mel and its use (src).
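A few of the formulas in the notes above, written out as plain Python. This is a sketch under assumptions: `num_frames` treats the window length as equal to the FFT size when modelling center padding of window_length // 2 on each side, and all function names are illustrative.

```python
import math


def num_frames(signal_length, window_length, hop_length, center=False):
    """Number of full frames; center=True first pads window_length // 2 on both sides."""
    if center:
        signal_length += 2 * (window_length // 2)
    if signal_length < window_length:
        return 0
    # Inverse of: total length = window length + (frames - 1) * hop length
    return 1 + (signal_length - window_length) // hop_length


def sine_wave(frequency, sampling_rate, duration):
    """Samples of sin(2 * pi * f * t) at t = n / sampling_rate."""
    count = int(sampling_rate * duration)
    return [math.sin(2 * math.pi * frequency * n / sampling_rate) for n in range(count)]


def power_to_db(power, ref=1.0, amin=1e-10):
    """10 * log10(max(amin, power)) - 10 * log10(max(amin, ref))."""
    return 10.0 * math.log10(max(amin, power)) - 10.0 * math.log10(max(amin, ref))


frames_plain = num_frames(22050, 1024, 256)
frames_centered = num_frames(22050, 1024, 256, center=True)
wave = sine_wave(frequency=1, sampling_rate=8, duration=1)
```

The gap between `frames_plain` and `frames_centered` is the kind of mismatch described in the frame-count note.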



  • librosa.stft() uses np.fft.rfft(), hence we get 1 + (n_fft // 2) values as the output (src).
    np.fft.fft() for real input gives Hermitian-symmetric output, where the negative frequencies are the complex conjugates of positive frequencies and are hence redundant.
  • Completely implemented stft() and istft() from scratch using numpy. Heavily used librosa docs and other resources, but greatly simplified the code by making it quite task specific.
  • Added above explored code along with other audio helper functions in
  • Removed librosa dependency.
  • Added AudioProcessor.convert2mel() that takes in a wav_path and extracts features (mel spectrogram) from it and saves it as a .npy file in dump/feats directory.
    .npy file is numpy format for saving arrays as data (src).
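The redundancy that makes 1 + (n_fft // 2) bins sufficient can be shown with a naive DFT in pure Python. This is for illustration only (a real implementation would use np.fft.rfft); the names `dft` and `rdft` are hypothetical.

```python
import cmath


def dft(signal):
    """Naive O(N^2) discrete Fourier transform."""
    n = len(signal)
    return [
        sum(x * cmath.exp(-2j * cmath.pi * k * t / n) for t, x in enumerate(signal))
        for k in range(n)
    ]


def rdft(signal):
    """Keep only the non-redundant bins 0 .. n // 2 of a real signal's DFT."""
    return dft(signal)[: 1 + len(signal) // 2]


frame = [0.0, 1.0, 0.5, -0.25, -1.0, 0.75, 0.3, -0.6]
full = dft(frame)
half = rdft(frame)

# For real input, bin N - k is the complex conjugate of bin k.
is_hermitian = all(
    abs(full[len(frame) - k] - full[k].conjugate()) < 1e-9 for k in range(1, len(frame))
)
```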



  • Explore the idea of calculating features on the go. We can reduce load and increase parallelization during loading. This is handled natively by which can be accessed by num_workers value. The only upside is to save memory.
  • Assumption: single speaker, single language on single GPU
  • Added data splitting into training and validation inside DatasetProcessor.
  • Check pin_memory=False and drop_last=True along with shuffle=True/False in
  • Add collate function for DataLoader.



  • torch.nn vs torch.nn.functional (src).
  • attributes: (src)
    • pin_memory (bool, optional) – If True, the data loader will copy Tensors into device/CUDA pinned memory before returning them. If your data elements are a custom type, or your collate_fn returns a batch that is a custom type, see the example below. (default: False)
    • drop_last (bool, optional) – set to True to drop the last incomplete batch, if the dataset size is not divisible by the batch size. If False and the size of dataset is not divisible by the batch size, then the last batch will be smaller. (default: False)
  • Added TextMelCollate() for in Trainer.
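What a text/mel collate function does can be sketched without any framework, using plain lists in place of tensors. The function name, the padding values, and the longest-first sorting (a common convention for packed RNN inputs) are assumptions, not the project's TextMelCollate().

```python
def collate_text_mel(batch, pad_id=0, pad_value=0.0):
    """Pad a batch of (token_ids, mel_frames) pairs to common lengths.

    batch: list of (list[int], list[list[float]]); each mel frame has n_mels values.
    Returns padded tokens, padded mels, and the original text and mel lengths,
    ordered by text length (longest first).
    """
    batch = sorted(batch, key=lambda item: len(item[0]), reverse=True)
    max_text = len(batch[0][0])
    max_mel = max(len(mel) for _, mel in batch)
    n_mels = len(batch[0][1][0])
    tokens, mels, text_lengths, mel_lengths = [], [], [], []
    for token_ids, mel in batch:
        text_lengths.append(len(token_ids))
        mel_lengths.append(len(mel))
        tokens.append(token_ids + [pad_id] * (max_text - len(token_ids)))
        padding = [[pad_value] * n_mels for _ in range(max_mel - len(mel))]
        mels.append([list(frame) for frame in mel] + padding)
    return tokens, mels, text_lengths, mel_lengths


batch = [
    ([1, 2], [[0.1, 0.2]]),
    ([3, 4, 5], [[0.3, 0.4], [0.5, 0.6]]),
]
tokens, mels, text_lengths, mel_lengths = collate_text_mel(batch)
```

Returning the original lengths lets the model mask out the padded positions later.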



  • Added CheckpointManager() in Trainer that will handle file saving for top n checkpoints based on the loss value.
  • Added WandbLogger() in Trainer that will handle wandb logging and syncing.
  • Added use_g2p and updated TextProcessor pipeline.
  • Added g2p_en by Park Kyu Byong (src).
  • NOTE:
    • Consider on-the-fly loading of tokens and mel feats generation (along with resampling and trimming). This will save space, but we won't have control over the size of the audio.
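The "keep the top n checkpoints by loss" idea can be sketched with a heap. The class `TopNCheckpoints` is a hypothetical stand-in that only tracks paths; the actual CheckpointManager() also handles file saving.

```python
import heapq


class TopNCheckpoints:
    """Track the n checkpoints with the lowest loss (hypothetical helper)."""

    def __init__(self, n):
        self.n = n
        self._heap = []  # stores (-loss, path), so the worst kept checkpoint is on top

    def update(self, loss, path):
        """Register a checkpoint; return the path that should be deleted, or None."""
        if len(self._heap) < self.n:
            heapq.heappush(self._heap, (-loss, path))
            return None
        worst_loss = -self._heap[0][0]
        if loss >= worst_loss:
            return path  # the new checkpoint is not among the best n
        _, evicted = heapq.heapreplace(self._heap, (-loss, path))
        return evicted

    def best(self):
        """Kept checkpoints as (loss, path), best first."""
        return sorted((-neg, path) for neg, path in self._heap)


manager = TopNCheckpoints(2)
first = manager.update(0.9, "a")
second = manager.update(0.5, "b")
third = manager.update(0.7, "c")   # evicts "a" (loss 0.9)
fourth = manager.update(0.8, "d")  # rejected: worse than both kept checkpoints
```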



  • Omitting param groups of optimizer update before every iteration. Later on add and experiment. (src).
  • optimizer.zero_grad() vs model.zero_grad() (src). They are equivalent when the optimizer holds all of the model's parameters (src2).
  • Decide variable to log (gpu, etc) and config values.
  • Add timers for training loop and estimators.
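A simple timer with a remaining-time estimate could look like this. The class `ProgressTimer` and its injectable clock are hypothetical; the estimate just assumes every remaining iteration takes the average time so far.

```python
import time


class ProgressTimer:
    """Measure elapsed time and estimate the time left for a fixed number of iterations."""

    def __init__(self, total_iterations, clock=time.perf_counter):
        self.total = total_iterations
        self.clock = clock  # injectable so the timer can be tested deterministically
        self.start = clock()
        self.done = 0

    def step(self):
        """Mark one iteration as finished."""
        self.done += 1

    def elapsed(self):
        return self.clock() - self.start

    def eta(self):
        """Average time per finished iteration times the iterations left."""
        if self.done == 0:
            return float("inf")
        return self.elapsed() / self.done * (self.total - self.done)


# Deterministic example with a fake clock: 4 iterations took 8 seconds.
now = [0.0]
timer = ProgressTimer(100, clock=lambda: now[0])
for _ in range(4):
    timer.step()
now[0] = 8.0
remaining = timer.eta()
```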



  • Added evaluation in training loop with timers and estimators.
  • Integrated wandb that supports images. Updated WandbLogger to abstract direct wandb API.
    • commit=True in wandb.log() also increments the iteration value (src).
  • Added that contains the TTSModel() class for handling the inference.
  • Added time formatting in and log printing.
  • Difference between numpy saving formats .npy and .npz and where to use them (src).


  • Added validation functionality inside the Trainer() class to handle validation_dataloader, logging and saving plots in exp/validation_runs and wandb.
    • Added wandb Image plotting along with ground truth and prediction plotting.
  • Added checkpoint resuming for training.
  • Changed model saving format to
      'model_state_dict': model.state_dict(),
      'iteration': iteration
    so that during checkpoint resuming for training, we have access to the iteration value for correct saving and logging.
  • Added time formatting and center printing using functions in current_formatted_time(), log_print(), center_print().
    • Epoch start, end and duration.
    • Iteration start, end and duration.
    • Validation start, end and duration.
    • Estimated time to finish training.
  • Added plotters in for plotting mel spectrogram (saveplot_mel()), alignments (saveplot_alignment()), gates (saveplot_gate() with double purpose), and the raw wav signal (saveplot_signal()).
  • Added config.yaml support. Automatically save and load based on extension [json, yaml].
    • Added autosaving params to exp/config.yaml which is handled by the Trainer() class.
    • Added load_yaml() and dump_yaml() in
  • Added db_to_amplitude() in for converting decibel scale mel to its proper magnitude, along with testing code.
  • Added mel2audio() support for TTSModel() in along with saving of mel, gate, alignment and signal plots.
  • Added option remove_wav_dump in DatasetConfig to remove the dump/wavs directory once the features have been calculated and saved in the dump/feats directory.
  • Added extra values for config dict in wandb.init() for project details.
  • Added gradient clipping for preventing gradient explosion.
    • Other fixes for gradient issues src.
  • torch.backends.cudnn.enabled=True is better to speed up conv and RNN layers src.
  • torch.backends.cudnn.benchmark=True lets the cudnn autotuner pick the fastest algorithm for the hardware, but this only helps when the input size stays the same. Since our input size changes every iteration, we keep it as False to avoid degrading performance src.
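To make the gradient clipping note concrete, here is the usual global-norm rule written in plain Python on a flat list of gradient values. It illustrates the arithmetic only; in the training loop this would be done by the framework's own utility, and `clip_by_global_norm` is a hypothetical name.

```python
import math


def clip_by_global_norm(grads, max_norm, eps=1e-6):
    """Rescale all gradients so their combined L2 norm is at most max_norm.

    Returns (clipped_grads, total_norm_before_clipping).
    """
    total_norm = math.sqrt(sum(g * g for g in grads))
    scale = min(1.0, max_norm / (total_norm + eps))
    return [g * scale for g in grads], total_norm


clipped, norm_before = clip_by_global_norm([3.0, 4.0], max_norm=1.0)
norm_after = math.sqrt(sum(g * g for g in clipped))
```

Gradients whose norm is already below the limit pass through unchanged, and the direction of the update is preserved in both cases.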



  • Added scale and power option in amplitude_to_db() in to handle special feature extraction (power=False, scale=1) which we will be using from now on (even though it defies normal db definition we use that because it gives better results). Also updated db_to_amplitude() to handle this.
  • Added epoch start offset for when resuming training.
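The log does not give the exact formula behind the scale and power options, so the following is only a guess at their shape: `scale` replaces the usual factor of 20 (amplitude) or 10 (power), and `power=True` squares the input first. The inverse shows why db_to_amplitude() needs the same options.

```python
import math


def amplitude_to_db(amplitude, scale=20.0, power=False, amin=1e-5):
    """scale * log10(max(amin, x)), where x is amplitude or amplitude ** 2 (assumed form)."""
    value = amplitude ** 2 if power else amplitude
    return scale * math.log10(max(amin, value))


def db_to_amplitude(db, scale=20.0, power=False):
    """Inverse of amplitude_to_db for values above amin."""
    value = 10.0 ** (db / scale)
    return math.sqrt(value) if power else value


# The "special" setting from the note: power=False, scale=1.
compressed = amplitude_to_db(0.5, scale=1.0, power=False)
restored = db_to_amplitude(compressed, scale=1.0, power=False)
```

With scale=1 the output is a plain log10 of the amplitude, which gives a much smaller numeric range than the standard dB definition.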



  • NOTE: ALIGNMENTS!! Changing the mel dB scale does the trick. The original scale had a large range that was hard to learn; decreasing the scale to roughly a 0-1 range (a pseudo normalization) makes it much easier for the network to learn (check exp_run_7). Also changed ref_level_db=1.
  • Added griffin_lim() in for signal reconstruction.



  • Added seaborn style plotting.
  • Added reduce_noise() based on low pass butter filter in
  • NOTE: Created new git branch dev for the development of a model-independent codebase, with support for both TTS and Vocoder.
  • Git Branching
    • git remote show origin for full branch details.
    • src
    • If you want to switch to a remote branch that does not exist as local branch in your local working directory, you can simply execute git switch remoteBranch. When Git is unable to find this branch in your local repository, it will assume that you want to checkout the respective remote branch with the same name. It will then create a local branch with the same name. It will also set up a tracking relationship between your remote and local branch so that git pull and git push will work as intended src.



  • Python Decorators (src)
    • My implementation src.
  • Added tools/ directory to host helper scripts to run independent tasks. The code is directly taken from existing GenVox code. The motivation is to use the existing code for other purposes, hence the need to give separate access.
    • tools/ to resample audio wav files based on AudioProcessor().
    • tools/ to trim audio based on utils.trim_audio_silence().
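As a reminder of how decorators work, a small timing decorator. This is a generic example and not the implementation linked above.

```python
import functools
import time


def timed(func):
    """Decorator that records how long the most recent call to func took."""

    @functools.wraps(func)  # keep the wrapped function's name and docstring
    def wrapper(*args, **kwargs):
        start = time.perf_counter()
        try:
            return func(*args, **kwargs)
        finally:
            wrapper.last_duration = time.perf_counter() - start

    wrapper.last_duration = None
    return wrapper


@timed
def add(a, b):
    """Add two numbers."""
    return a + b


result = add(2, 3)
```

`@timed` is just shorthand for `add = timed(add)`; the wrapper runs in place of the original function on every call.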



  • Added vocoder/ that houses the SigMelDataset() dataset class for audio and mel data.
    • Added max_frames to clip or pad to a specific number of frames (and correspondingly in audio signal length) for faster training and cudnn benchmark acceleration.
    • NOTE: Need to experiment adding noise as given here src.
  • Made changes in trainer to handle vocoder input.
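The max_frames idea can be sketched with plain lists: clip or pad the mel to a fixed number of frames and the audio to the matching number of samples. The name `fix_length`, the zero padding values, and the assumption that one frame corresponds to hop_length samples are illustrative.

```python
def fix_length(mel, audio, max_frames, hop_length, pad_value=0.0):
    """Clip or pad mel (list of frames) to max_frames and audio to max_frames * hop_length."""
    n_mels = len(mel[0])
    target_samples = max_frames * hop_length
    # A negative range or repeat count yields nothing, so clipping needs no special case.
    mel = mel[:max_frames] + [[pad_value] * n_mels for _ in range(max_frames - len(mel))]
    audio = audio[:target_samples] + [0.0] * (target_samples - len(audio))
    return mel, audio


mel = [[0.1, 0.2], [0.3, 0.4], [0.5, 0.6]]  # 3 frames, 2 mel bins
audio = [0.01] * 12                          # 3 frames * hop_length of 4
clipped_mel, clipped_audio = fix_length(mel, audio, max_frames=2, hop_length=4)
padded_mel, padded_audio = fix_length(mel, audio, max_frames=5, hop_length=4)
```

Fixed shapes per batch are what allow the cudnn autotuner to be switched on safely.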


  • Added MelGAN vocoder.
    • Fixed the training bug: .detach() for generator output during training of discriminator.
    • Used torch.backends.cudnn.benchmark = True for optimization. Fixed the input lengths for every batch.
  • Main fix for GAN vocoder training:
        optimizer_config = OptimizerConfig(


  • NOTE: Don't use two package managers (pip, conda) while installing packages.
    • If you installed torch with pip (pip install torch) and then install accelerate using conda (conda install accelerate -c conda-forge), then it will also install pytorch (mostly cpu version) and replace the already installed torch package.
    • This is because conda is not aware of what pip installed and will resolve dependencies independently, hence corrupting the environment.
    • We can avoid this by using pip to install (pip install accelerate).
  • Using HuggingFace Accelerate Python package for distributed training (huggingface, github).