Commit

add paper link
SWivid committed Oct 10, 2024
1 parent 9b37661 commit a56fa46
Showing 1 changed file with 2 additions and 2 deletions.
4 changes: 2 additions & 2 deletions F5-TTS/index.html
@@ -30,9 +30,9 @@ <h2 style="display: flex; align-items: center; justify-content: center">
     <object type="image/svg+xml" data="pics/square-f5.svg" style="width: 72px; height: 72px;"></object>-TTS: <span style="margin-left: 10px;"></span>
 </h2>
 <h3>A Fairytaler that Fakes Fluent and Faithful Speech with Flow Matching</h3>
-<h5> <a href="https://github.com/SWivid/F5-TTS">Code</a> </h5>
+<h5> <a href="https://github.com/SWivid/F5-TTS" target="_blank" rel="noopener noreferrer">Code</a>; <a href="https://arxiv.org/abs/2410.06885" target="_blank" rel="noopener noreferrer">Paper</a> </h5>
 </div>
-<p><b>Abstract</b> This paper introduces F5-TTS, a fully non-autoregressive text-to-speech system based on flow matching with Diffusion Transformer (DiT). Without requiring complex designs such as duration model, text encoder, and phoneme alignment, the text input is simply padded with filler tokens to the same length as input speech, and then the denoising is performed for speech generation, which was originally proved feasible by E2 TTS. However, the original design of E2 TTS makes it hard to follow due to its slow convergence and low robustness. To address these issues, we first model the input with ConvNeXt to refine the text representation, making it easy to align with the speech. We further propose an inference-time Sway Sampling strategy, which significantly improves our model’s performance and efficiency. This sampling strategy for flow step can be easily applied to existing flow matching based models without retraining. Our design allows faster training and achieves an inference RTF of 0.15, which is greatly improved compared to state-of-the-art diffusion-based TTS models. Trained on a public 100K hours multilingual dataset, our Fairytaler Fakes Fluent and Faithful speech with Flow matching (F5-TTS) exhibits highly natural and expressive zero-shot ability, seamless code-switching capability, and speed control efficiency. Demo samples can be found at <a href="https://SWivid.github.io/F5-TTS">https://SWivid.github.io/F5-TTS</a>. We will release all code and checkpoints to promote community development.
+<p><b>Abstract</b> This paper introduces F5-TTS, a fully non-autoregressive text-to-speech system based on flow matching with Diffusion Transformer (DiT). Without requiring complex designs such as duration model, text encoder, and phoneme alignment, the text input is simply padded with filler tokens to the same length as input speech, and then the denoising is performed for speech generation, which was originally proved feasible by E2 TTS. However, the original design of E2 TTS makes it hard to follow due to its slow convergence and low robustness. To address these issues, we first model the input with ConvNeXt to refine the text representation, making it easy to align with the speech. We further propose an inference-time Sway Sampling strategy, which significantly improves our model’s performance and efficiency. This sampling strategy for flow step can be easily applied to existing flow matching based models without retraining. Our design allows faster training and achieves an inference RTF of 0.15, which is greatly improved compared to state-of-the-art diffusion-based TTS models. Trained on a public 100K hours multilingual dataset, our Fairytaler Fakes Fluent and Faithful speech with Flow matching (F5-TTS) exhibits highly natural and expressive zero-shot ability, seamless code-switching capability, and speed control efficiency. Demo samples can be found at <a href="https://SWivid.github.io/F5-TTS">https://SWivid.github.io/F5-TTS</a>. We release all code and checkpoints to promote community development.
 </p>

 <p>
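The Sway Sampling strategy named in the abstract warps the inference-time flow-step schedule. A minimal sketch of the idea in Python, assuming the sway function t = u + s·(cos(πu/2) − 1 + u) described in the paper; the helper name sway_sample and the 32-step schedule below are illustrative, not the repository's actual API:

import numpy as np

def sway_sample(u: np.ndarray, s: float = -1.0) -> np.ndarray:
    """Warp uniform flow steps u in [0, 1] via Sway Sampling.

    Assumed formulation from the paper: t = u + s * (cos(pi/2 * u) - 1 + u).
    Endpoints are preserved (t(0) = 0, t(1) = 1) for any coefficient s.
    """
    return u + s * (np.cos(np.pi / 2 * u) - 1 + u)

# Example: warp a uniform 32-step schedule before running the ODE solver.
u = np.linspace(0.0, 1.0, 32)
t = sway_sample(u, s=-1.0)  # with s < 0, steps cluster near t = 0 (the noise end)

Because the warp only changes where the solver evaluates the pretrained velocity field, not the field itself, it can be dropped into existing flow-matching models without retraining, which is the property the abstract claims.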
