TTS Algorithms & Theory
About 2378 words · About 8 min
2026-03-21
Understanding how modern neural TTS systems work — from raw text through phonemes, mel-spectrograms, vocoders, and all the way to a waveform — is essential for debugging quality issues, tuning inference, and building your own systems.
TTS Pipeline Anatomy
Raw Text
↓ Text Normalization
↓ Grapheme-to-Phoneme (G2P)
Phoneme Sequence
↓ Acoustic Model
Mel-Spectrogram
↓ Vocoder
Waveform (PCM)

Each stage has specific failure modes and tuning knobs.
Stage 1 — Text Normalization
Before phonemization, text must be normalized into speakable form:
| Input | Normalized |
|---|---|
| $14.99 | "fourteen dollars and ninety-nine cents" |
| 2026-03-21 | "March twenty-first, twenty twenty-six" |
| Dr. Smith | "Doctor Smith" |
| GPU | "G P U" (or "Graphics Processing Unit" with context) |
| 3.14 | "three point one four" |
| 1,000 | "one thousand" |
Libraries: num2words, inflect, custom regex pipelines. XTTS-v2 and Kokoro handle this internally via their text normalizers.
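A minimal normalization sketch with num2words and two regexes — this assumes num2words' currency mode and covers only a fraction of what production normalizers (including the ones built into XTTS-v2 and Kokoro) handle:

```python
# Sketch only: assumes num2words' to="currency" mode; real normalizers also
# handle dates, ordinals, abbreviations, and locale-specific formats.
import re
from num2words import num2words

def normalize(text: str) -> str:
    # "$14.99" -> "fourteen dollars, ninety-nine cents"
    text = re.sub(
        r"\$(\d+(?:\.\d{2})?)",
        lambda m: num2words(float(m.group(1)), to="currency", currency="USD"),
        text,
    )
    # Remaining numbers: "3.14" -> "three point one four", "1000" -> "one thousand"
    text = re.sub(
        r"\d+\.\d+|\d+",
        lambda m: num2words(float(m.group(0)) if "." in m.group(0) else int(m.group(0))),
        text,
    )
    return text

print(normalize("Pi is 3.14 and the GPU costs $14.99"))
```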
Stage 2 — Grapheme-to-Phoneme (G2P)
Converts written characters to phonemes (IPA or ARPAbet symbols). Critical for correct pronunciation, especially:
- Heteronyms: "lead" /liːd/ vs /lɛd/, "read" /riːd/ vs /rɛd/
- Abbreviations, proper nouns, loan words
ARPAbet (used by CMU Pronouncing Dictionary)
"hello" → HH AH0 L OW1
"world" → W ER1 L DIPA (International Phonetic Alphabet)
"hello" → həˈloʊ
"world" → wɜːldTools
| Library | Approach | Languages |
|---|---|---|
| phonemizer (eSpeak backend) | Rule-based + dict | 100+ |
| g2p-en | Seq2seq model | English |
| gruut | Dictionary + neural fallback | 10+ |
| Kokoro built-in | misaki G2P library | EN, JA, ZH, FR, KO, PT |
| XTTS-v2 | Integrated per-language | 17 languages |
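For a quick G2P check, the phonemizer library with its eSpeak backend can be called directly (assumes phonemizer and an espeak-ng system package are installed):

```python
# Minimal G2P check with the phonemizer library and its eSpeak backend.
from phonemizer import phonemize

ipa = phonemize(["hello", "world"], language="en-us", backend="espeak", strip=True)
print(ipa)  # e.g. ['həloʊ', 'wɜːld'] — exact symbols depend on the espeak-ng version
```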
Stage 3 — Acoustic Models
Tacotron 2 (2018)
Seq2seq encoder-decoder with location-sensitive attention. Encodes phonemes to hidden states, then autoregressively decodes mel-spectrogram frames.
Encoder: CNN + BiLSTM → context vectors
Attention: location-sensitive (previous weights + conv features)
Decoder: LSTM → mel-spectrogram (2 frames per step)
Stop Token: sigmoid predicts end of utterance
Post-net: 5-layer CNN residual refinement of mel output

Tacotron 2 Attention Formula
The location-sensitive attention mechanism attends over encoder states $h_i$ at decoder step $t$:

$$e_{t,i} = w^\top \tanh(W_1 h_i + W_2 d_t + W_3 * \alpha_{t-1} + b)$$

$$\alpha_{t,i} = \frac{\exp(e_{t,i})}{\sum_j \exp(e_{t,j})}$$

where $d_t$ is the decoder state, $\alpha_{t-1}$ is the previous attention weight vector, and $*$ denotes convolution. The convolution over previous attention weights provides "where to look next" — preventing the common Tacotron failure mode of getting stuck.
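A rough PyTorch sketch of one attention step — layer sizes and the convolution kernel width are illustrative, not Tacotron 2's exact hyperparameters:

```python
# Illustrative location-sensitive attention step; sizes are sketch values.
import torch
import torch.nn as nn
import torch.nn.functional as F

dim, T_enc = 128, 50
W1 = nn.Linear(dim, dim, bias=False)                 # projects encoder states h_i
W2 = nn.Linear(dim, dim, bias=False)                 # projects decoder state d_t
W3 = nn.Conv1d(1, dim, kernel_size=31, padding=15)   # conv over previous weights
w = nn.Linear(dim, 1)                                # final energy projection (adds b)

h = torch.randn(T_enc, dim)                          # encoder states
d_t = torch.randn(1, dim)                            # current decoder state
alpha_prev = torch.full((1, 1, T_enc), 1.0 / T_enc)  # previous attention weights

loc = W3(alpha_prev).squeeze(0).transpose(0, 1)      # (T_enc, dim) location features
e_t = w(torch.tanh(W1(h) + W2(d_t) + loc)).squeeze(-1)  # energies e_{t,i}
alpha_t = F.softmax(e_t, dim=0)                      # attention weights alpha_{t,i}
print(alpha_t.shape)                                 # torch.Size([50])
```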
Weakness: Attention failures cause repeated words, skipped syllables, or early cutoff. Slow — autoregressive (sequential).
FastSpeech 2 (2020)
Non-autoregressive (fully parallel). Uses a duration predictor to expand phoneme hidden states to match the target mel length, then generates all frames in parallel.
Encoder: Transformer (phonemes → hidden)
↓
Duration Predictor → length regulator (repeat each phoneme hidden state)
Pitch Predictor → add pitch contour (F0)
Energy Predictor → add energy contour
↓
Decoder: Transformer (parallel) → mel-spectrogram

Advantage: 30–50× faster than Tacotron 2. No attention collapse. Explicit control of duration, pitch, energy.
$$\mathcal{L} = \mathcal{L}_{\text{mel}} + \lambda_d \mathcal{L}_{\text{dur}} + \lambda_p \mathcal{L}_{\text{pitch}} + \lambda_e \mathcal{L}_{\text{energy}}$$
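The length regulator at the heart of this design is simple enough to sketch directly; the snippet below assumes integer frame durations and PyTorch tensors:

```python
# Length-regulator sketch: repeat each phoneme hidden state by its predicted
# integer duration so the expanded sequence matches the target mel length.
import torch

def length_regulate(hidden: torch.Tensor, durations: torch.Tensor) -> torch.Tensor:
    """hidden: (num_phonemes, dim); durations: (num_phonemes,) integer frame counts."""
    return torch.repeat_interleave(hidden, durations, dim=0)

hidden = torch.randn(4, 256)              # 4 phonemes, 256-dim encoder states
durations = torch.tensor([3, 5, 2, 7])    # predicted frames per phoneme
mel_input = length_regulate(hidden, durations)
print(mel_input.shape)                    # torch.Size([17, 256]) -> 17 mel frames
```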
VITS (2021) — Variational Inference with adversarial learning for TTS
End-to-end model — no separate vocoder. Combines:
- VAE: latent variable z captures prosody/style
- Normalizing flow: f:z→mel, invertible for training
- HiFi-GAN discriminators: adversarial training for waveform realism
Training:
Text encoder → μ, σ (prior)
Audio encoder → posterior q(z|x)
KL divergence: align prior and posterior
Flow: z → mel → waveform via HiFi-GAN generator
Discriminators: multi-period + multi-scale waveform discrimination
Inference:
Text encoder → prior → sample z → flow → waveform

$$\mathcal{L}_{\text{VITS}} = \mathcal{L}_{\text{recon}} + \mathcal{L}_{\text{KL}} + \mathcal{L}_{\text{adv}} + \mathcal{L}_{\text{fm}}$$
VITS is the backbone of Coqui TTS, MeloTTS, and OpenVoice.
Stage 4 — Vocoders
Converts mel-spectrograms to waveforms. The vocoder determines audio quality more than the acoustic model in many cases.
WaveNet (2016)
Autoregressive dilated causal convolutions. Excellent quality, but sample-by-sample generation is extremely slow — far slower than real time on CPU.
WaveGlow (2018)
Normalizing flow-based. Parallel inference. Faster but large model.
HiFi-GAN (2020) — Current Standard
GAN-based. Generator: transposed convolutions with multi-receptive field fusion (MRF). Discriminators: multi-period (MPD) + multi-scale (MSD).
Generator: Conv → [TransposedConv → MRF] × 4 → Conv → Tanh
MRF: parallel dilated convolutions at different dilation rates
MPD: discriminate at periods {2, 3, 5, 7, 11}
MSD: discriminate at scales {1×, 2×, 4×}

HiFi-GAN Training Losses:
$$\mathcal{L}_G = \mathcal{L}_{\text{adv}}(G) + \lambda_{\text{fm}} \mathcal{L}_{\text{fm}}(G) + \lambda_{\text{mel}} \mathcal{L}_{\text{mel}}(G)$$

- Adversarial loss (least-squares GAN):

$$\mathcal{L}_{\text{adv}}(G) = \mathbb{E}_{z}\big[(D(G(z)) - 1)^2\big]$$

$$\mathcal{L}_{\text{adv}}(D) = \mathbb{E}_{(x,z)}\big[(D(x) - 1)^2 + D(G(z))^2\big]$$

- Feature matching loss (L1 distance between intermediate discriminator features):

$$\mathcal{L}_{\text{fm}}(G) = \mathbb{E}_{(x,z)}\left[\sum_{i=1}^{T} \frac{1}{N_i} \big\| D_i(x) - D_i(G(z)) \big\|_1\right]$$

- Mel-spectrogram reconstruction loss:

$$\mathcal{L}_{\text{mel}}(G) = \mathbb{E}_{(x,z)}\big[\|\phi(x) - \phi(G(z))\|_1\big]$$

where $\phi(\cdot)$ computes the log-mel spectrogram.
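As a rough PyTorch sketch of the generator-side terms — the discriminator outputs and feature lists are placeholders, and the loss weights are the values commonly used in HiFi-GAN implementations:

```python
# Toy generator-side losses; d_fake_out is D(G(z)), feats_* are lists of
# intermediate discriminator feature maps, mel_* are log-mel spectrograms.
import torch
import torch.nn.functional as F

def generator_loss(d_fake_out, feats_real, feats_fake, mel_real, mel_fake,
                   lambda_fm=2.0, lambda_mel=45.0):
    adv = torch.mean((d_fake_out - 1.0) ** 2)                          # least-squares GAN
    fm = sum(F.l1_loss(r, f) for r, f in zip(feats_real, feats_fake))  # feature matching
    mel = F.l1_loss(mel_fake, mel_real)                                # mel reconstruction
    return adv + lambda_fm * fm + lambda_mel * mel
```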
Runs faster than real time on CPU and at RTF < 0.01 on GPU; the smaller V2/V3 variants trade a little quality for extra speed. Default vocoder in FastSpeech 2, Kokoro, StyleTTS2.
BigVGAN (2022)
Extends HiFi-GAN with periodic (Snake) activations and anti-aliased multi-periodicity composition modules. Better generalization to unseen speakers.
Neural Audio Codecs (EnCodec / DAC)
Used by codec-LM TTS (Bark, XTTS-v2, VoiceCraft). Compress audio into discrete token streams at multiple bitrate levels.
Audio waveform
↓ Encoder (strided convolutions)
↓ Residual Vector Quantization (RVQ)
Level 1: coarse tokens (semantics, prosody)
Level 2: fine tokens (timbre)
Level 3-8: detail tokens (waveform fidelity)
↓ Decoder (mirrored architecture)
Reconstructed waveform

At 24 kHz, EnCodec at 6 kbps uses 8 codebooks × 75 frames/second = 600 tokens/s. Generating 1 second of speech requires generating ~600 tokens autoregressively — hence Bark's slowness.
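A toy sketch of the RVQ encoding loop — the codebooks below are random placeholders, whereas EnCodec and DAC learn theirs end to end:

```python
# Toy residual vector quantization (RVQ): each level quantizes the residual
# left over from the previous level.
import torch

def rvq_encode(x: torch.Tensor, codebooks: list) -> list:
    """x: (dim,) frame embedding; codebooks: list of (K, dim) tensors."""
    residual, codes = x.clone(), []
    for cb in codebooks:
        idx = torch.cdist(residual[None], cb).argmin()  # nearest codebook entry
        codes.append(int(idx))
        residual = residual - cb[idx]                   # quantize what is left next
    return codes

codebooks = [torch.randn(1024, 128) for _ in range(8)]  # 8 levels × 1024 entries
codes = rvq_encode(torch.randn(128), codebooks)
print(codes)  # 8 integer tokens for one frame (75 frames/s at 24 kHz)
```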
Flow Matching (F5-TTS / E2-TTS)
Flow matching is a continuous normalizing flow trained with a simpler, more stable loss than diffusion. Instead of predicting noise, it predicts the velocity field that transports a noise distribution to the data distribution.
During training:
x₀ ~ noise, x₁ ~ data (mel-spectrogram)
xₜ = (1-t)x₀ + tx₁ (straight-path interpolation)
vθ(xₜ, t, condition) ≈ x₁ - x₀ (model learns velocity)
During inference:
Start from x₀ ~ N(0,I)
Integrate ODE: dx/dt = vθ(xₜ, t, text)
Arrive at x₁ = mel-spectrogram

F5-TTS adds a flat-start process (no separate text encoder — text tokens are directly injected at positions matching aligned audio tokens). Voice cloning is achieved by providing a reference spectrogram as conditioning.
Advantage over diffusion: Straight trajectory → fewer solver steps → faster inference than DDPM while matching quality.
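A minimal conditional flow-matching training step following the straight-path interpolation above, with `model` standing in for any network that predicts the velocity field:

```python
# Minimal conditional flow-matching training step.
import torch
import torch.nn.functional as F

def flow_matching_loss(model, x1, cond):
    """x1: (batch, frames, mel_bins) target mel; cond: text/speaker conditioning."""
    x0 = torch.randn_like(x1)            # noise sample
    t = torch.rand(x1.shape[0], 1, 1)    # uniform time in [0, 1]
    xt = (1 - t) * x0 + t * x1           # straight-path interpolation
    target_velocity = x1 - x0            # ground-truth velocity field
    return F.mse_loss(model(xt, t, cond), target_velocity)
```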
Speaker Embeddings & Voice Cloning
Speaker Embedding (Fixed Voices)
A fixed vector $e_s \in \mathbb{R}^{256}$ extracted from a speaker encoder (e.g., d-vector, x-vector, ECAPA-TDNN) is injected into the acoustic model as a conditioning signal. The model learns to match the voice of that speaker.
Zero-Shot Voice Cloning
The speaker encoder takes a reference audio clip (3–10s) at inference time and produces an embedding that conditions the model on an unseen voice.
Reference audio (3s of target voice)
↓ Speaker Encoder (ECAPA-TDNN or d-vector network)
Speaker embedding vector ∈ ℝ²⁵⁶
↓ Injected into acoustic model (adds to decoder hidden states)
Model generates speech in target voice

Used by XTTS-v2, F5-TTS, OpenVoice V2.
Tone Color Transfer (OpenVoice V2)
Decouples base tone (speaker timbre, accent) from style (speed, pitch, emotion). A converter network applies a reference speaker's tone color to speech generated in any base voice, enabling flexible cross-speaker style mixing.
Prosody Modeling
Prosody covers rhythm, stress, intonation, and pacing — the part that makes speech sound human rather than robotic.
| Feature | What controls it | Model component |
|---|---|---|
| Pitch (F0) | Intonation, questions, emotion | Pitch predictor (FastSpeech 2), VAE latent (VITS) |
| Duration | Syllable timing, speech rate | Duration predictor, length regulator |
| Energy | Loudness, stress | Energy predictor |
| Pause | Sentence breaks | Silence tokens, punctuation rules |
Modern end-to-end models (VITS, F5-TTS) learn prosody implicitly from data rather than explicit labels.
Diffusion-Based TTS (GradTTS / DiffSpeech)
Before flow matching (F5-TTS), diffusion models (DDPM-based) were the state-of-the-art for high-quality neural TTS. Understanding them is important because many open-source tools (DiffSpeech, Grad-TTS, NaturalSpeech 2) still use this approach.
Forward and Reverse Process
Diffusion models define a forward noising process that gradually destroys the data (mel-spectrogram $x_0$) by adding Gaussian noise:

$$q(x_t \mid x_{t-1}) = \mathcal{N}\big(x_t;\ \sqrt{1-\beta_t}\, x_{t-1},\ \beta_t I\big)$$

Over $T$ steps (typically $T = 1000$), $x_T \approx \mathcal{N}(0, I)$. The model learns the reverse process — denoising step by step:

$$p_\theta(x_{t-1} \mid x_t, \text{text}) = \mathcal{N}\big(x_{t-1};\ \mu_\theta(x_t, t, \text{text}),\ \Sigma_\theta\big)$$
Training Objective (simplified DDPM)
Instead of predicting $x_{t-1}$ directly, the model predicts the noise $\epsilon$ that was added:

$$\mathcal{L}_{\text{DDPM}} = \mathbb{E}_{x_0, t, \epsilon}\big[\|\epsilon - \epsilon_\theta(x_t, t, \text{text})\|^2\big]$$

where $x_t = \sqrt{\bar{\alpha}_t}\, x_0 + \sqrt{1-\bar{\alpha}_t}\, \epsilon$ and $\bar{\alpha}_t = \prod_{s=1}^{t}(1-\beta_s)$.
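A toy training step for this objective, assuming a placeholder `model` and a simple linear beta schedule:

```python
# Toy DDPM training step: sample t, noise x_0 to x_t with the closed form above,
# and regress the model's output onto the injected noise.
import torch
import torch.nn.functional as F

betas = torch.linspace(1e-4, 0.02, 1000)
alpha_bar = torch.cumprod(1.0 - betas, dim=0)

def ddpm_loss(model, x0, text_cond):
    """x0: (batch, frames, mel_bins) clean mel-spectrograms."""
    t = torch.randint(0, len(betas), (x0.shape[0],))
    a = alpha_bar[t].view(-1, 1, 1)
    eps = torch.randn_like(x0)
    xt = a.sqrt() * x0 + (1 - a).sqrt() * eps        # forward process in one step
    return F.mse_loss(model(xt, t, text_cond), eps)  # predict the added noise
```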
GradTTS Architecture
Text phonemes
↓ Text encoder (Transformer)
Prior μ_enc, σ_enc ← used as the "clean" signal for diffusion
↓ Forward process: add noise → x_T
↓ Reverse (U-Net score estimator conditioned on text encoder output):
x_{T} → x_{T-1} → ... → x_0 = mel-spectrogram
↓ HiFi-GAN vocoder
Waveform

Key difference from VITS: GradTTS uses diffusion in the mel-spectrogram space (not waveform). It achieves higher quality than VITS in perceptual tests but is 10–50× slower at inference due to the many denoising steps.
Accelerated Sampling
Practical diffusion TTS uses far fewer than 1000 steps:
| Sampler | Steps | Speed | Quality |
|---|---|---|---|
| DDPM (original) | 1000 | Slow | Best |
| DDIM | 50–100 | 10× faster | Very good |
| DPM-Solver | 10–20 | 50× faster | Good |
| Flow matching (F5-TTS) | 16–32 NFE | Fast | Best-in-class |
Flow matching (covered above) supersedes diffusion for TTS precisely because it achieves similar or better quality with straight trajectories and far fewer solver steps.
Multi-Speaker Training Theory
Multi-speaker TTS extends single-speaker models to produce speech in many different voices from a single model.
Speaker Embedding Injection
A speaker embedding $e_s \in \mathbb{R}^d$ (typically $d = 256$) is injected into the acoustic model as a conditioning signal. The embedding can come from:
- Lookup table (closed-set): $e_s = E[s]$, where $E \in \mathbb{R}^{N \times d}$ is a learned embedding matrix for $N$ known speakers.
- Speaker encoder (open-set): $e_s = f_\phi(\text{reference audio})$ — run a trained encoder on a reference clip to get the embedding of an unseen speaker at inference time.
The embedding is injected at the decoder input:
$$\hat{h}_i = h_i + W_s e_s, \qquad W_s \in \mathbb{R}^{d_{\text{hidden}} \times d}$$
This conditions every decoder step on the speaker without modifying the phoneme encoder.
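A minimal sketch of that injection with illustrative dimensions:

```python
# Speaker-embedding injection sketch: project e_s and add it to every decoder
# hidden state. Dimensions are illustrative.
import torch
import torch.nn as nn

d_hidden, d_spk = 512, 256
W_s = nn.Linear(d_spk, d_hidden, bias=False)

decoder_hidden = torch.randn(120, d_hidden)  # 120 decoder steps
e_s = torch.randn(d_spk)                     # speaker embedding
conditioned = decoder_hidden + W_s(e_s)      # broadcast add over all steps
print(conditioned.shape)                     # torch.Size([120, 512])
```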
Speaker Encoder Architectures
| Architecture | Output dim | Key Feature |
|---|---|---|
| d-vector (GE2E) | 256 | Generalized End-to-End loss; trained to separate speakers |
| x-vector (TDNN) | 512 | Time-delay neural network; standard in speaker verification |
| ECAPA-TDNN | 192 | Squeeze-excitation + channel-attention; state-of-the-art |
Generalized End-to-End (GE2E) Loss for training the speaker encoder:
$$\mathcal{L}_{\text{GE2E}} = -\frac{1}{NM} \sum_{i,j} \log \frac{e^{w \cos(e_{ij},\, c_i) + b}}{\sum_{k=1}^{N} e^{w \cos(e_{ij},\, c_k) + b}}$$

where $e_{ij}$ is the $j$-th utterance embedding of speaker $i$, $c_i$ is the centroid of speaker $i$'s embeddings, and $w, b$ are learnable scalar parameters. The loss pulls utterances of the same speaker toward each other and pushes different speakers apart — producing a speaker space where nearby vectors sound similar.
Adapter Layers (Low-Data Fine-Tuning)
Rather than training a large multi-speaker model, adapter layers add a small number of trainable parameters to a frozen backbone:
```python
import torch.nn as nn

class SpeakerAdapter(nn.Module):
    """
    Lightweight bottleneck adapter for each speaker.
    Inserted after each Transformer block in the decoder.
    """
    def __init__(self, hidden_dim: int, bottleneck: int = 64):
        super().__init__()
        self.down = nn.Linear(hidden_dim, bottleneck)
        self.act = nn.ReLU()
        self.up = nn.Linear(bottleneck, hidden_dim)
        # Zero-init the up-projection so the adapter starts as an identity mapping
        nn.init.zeros_(self.up.weight)
        nn.init.zeros_(self.up.bias)

    def forward(self, x):
        return x + self.up(self.act(self.down(x)))  # residual connection
```

During training, the backbone is frozen and only the adapters are trained — requiring as little as 15 minutes of speech per new speaker.
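An illustrative setup for this, reusing the SpeakerAdapter above — the TransformerEncoderLayer backbone, dimensions, and learning rate are assumptions for the sketch, not any library's actual training recipe:

```python
# Toy setup: a frozen backbone block plus one trainable SpeakerAdapter.
import torch
import torch.nn as nn

backbone = nn.TransformerEncoderLayer(d_model=512, nhead=8, batch_first=True)
adapter = SpeakerAdapter(hidden_dim=512)

for p in backbone.parameters():
    p.requires_grad = False                  # freeze the pretrained weights

optimizer = torch.optim.AdamW(adapter.parameters(), lr=1e-4)
x = torch.randn(2, 100, 512)                 # (batch, frames, hidden)
out = adapter(backbone(x))                   # only the adapter receives gradients
```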
TTS Evaluation Metrics
Mean Opinion Score (MOS)
Human listeners rate speech naturalness on a 1–5 scale:
| Score | Description |
|---|---|
| 5 | Excellent — indistinguishable from natural speech |
| 4 | Good — natural with minor flaws |
| 3 | Fair — noticeable artifacts, acceptable |
| 2 | Poor — unnatural, difficult to understand |
| 1 | Bad — completely unnatural |
Automated MOS (UTMOS, MOSNet): predict MOS without human listeners using a neural regression model trained on human ratings.
MUSHRA (MUltiple Stimuli with Hidden Reference and Anchor)
ITU-R BS.1534 standard. Listeners hear several systems + a hidden reference simultaneously and rate each on 0–100. Anchors (degraded references) establish a lower bound. MUSHRA is more sensitive than MOS for differentiating systems that are already near human quality.
Objective Metrics
| Metric | What it measures |
|---|---|
| MCD (Mel Cepstral Distortion) | Spectral distance between generated and reference mel-cepstrum |
| F0 RMSE | Pitch accuracy vs. reference speaker |
| RTF (Real-Time Factor) | synthesis_time / audio_duration — must be < 1.0 for real-time |
| Round-trip WER | Run ASR on TTS output; compare text to original. Low WER = high intelligibility |
$$\text{MCD} = \frac{10}{\ln 10} \sqrt{2 \sum_{k=1}^{K} (c_k - \hat{c}_k)^2}$$

where $c_k$ and $\hat{c}_k$ are the $k$-th MFCC coefficients of the reference and generated speech.
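A direct implementation of this formula for one pair of time-aligned frames (full evaluations align reference and synthesis first, e.g., with DTW, and average over frames):

```python
# Per-frame MCD, matching the formula above.
import numpy as np

def mcd_frame(c_ref: np.ndarray, c_syn: np.ndarray) -> float:
    """c_ref, c_syn: (K,) mel-cepstral coefficients of one aligned frame pair."""
    return (10.0 / np.log(10.0)) * np.sqrt(2.0 * np.sum((c_ref - c_syn) ** 2))

print(mcd_frame(np.random.randn(13), np.random.randn(13)))  # distance in dB
```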
See Also
- TTS Libraries Comparison
- TTS Implementation
- TTS Training from Scratch — HiFi-GAN training losses, multi-speaker setup, UTMOS evaluation code
- ASR Algorithms & Theory — Mel-spectrograms, BPE tokenization, and acoustic representations are used by both TTS (to generate) and ASR (to recognize) speech
- VAD Algorithms & Theory — Pitch (F0), energy, and spectral features in TTS prosody modeling are the same acoustic cues VAD uses for speech/non-speech classification