TTS Introduction
2026-03-21
Text-to-Speech (TTS) converts written text into natural-sounding audio. In an AI voice pipeline, TTS is the final output stage: it takes the text response produced by a language model and renders it as speech that a user hears.
Position in the Voice Pipeline
Microphone
↓
VAD (detect speech boundaries)
↓
ASR (speech → text)
↓
LLM (text → response text)
↓
TTS (text → audio) ← you are here
↓
Speaker

Each stage has its own latency budget. TTS is often the largest contributor to perceived delay because audio must be generated before (or while) it plays. Streaming TTS, which synthesizes and plays audio sentence by sentence, is the key technique for reducing that latency.
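The sentence-by-sentence streaming idea can be sketched as a small buffer that releases complete sentences from an LLM token stream. This is a minimal illustration; real pipelines use smarter sentence splitting, and the TTS call itself is left to the surrounding code:

```python
import re

def stream_sentences(token_iter):
    """Yield complete sentences from a stream of LLM text chunks.

    Each sentence can be handed to the TTS engine immediately, so
    playback of sentence 1 overlaps with synthesis of sentence 2.
    """
    buf = ""
    for chunk in token_iter:
        buf += chunk
        while True:
            # A sentence ends at ., ! or ? followed by whitespace.
            m = re.search(r"(?<=[.!?])\s+", buf)
            if not m:
                break
            yield buf[:m.start()]
            buf = buf[m.end():]
    if buf.strip():
        yield buf.strip()  # flush whatever remains at end of stream
```

Typical use: `for sentence in stream_sentences(llm_stream): play(tts(sentence))`.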
TTS Generation Approaches
1. Concatenative / Unit Selection (Legacy)
Stitches pre-recorded phoneme/diphone segments from a database. Fast, no GPU needed, but robotic-sounding and voice-locked.
- Examples: Festival, MaryTTS, eSpeak
2. Statistical Parametric (HMM/DNN Era)
Models speech as a sequence of acoustic parameters generated by a statistical model (HMM, then DNN). Smoother than concatenative but still clearly synthetic.
- Examples: HTS, Merlin
3. Neural End-to-End (Tacotron/FastSpeech Era)
A neural network maps text (or phonemes) directly to mel-spectrograms, then a vocoder converts the spectrogram to a waveform. This was the dominant architecture from 2017–2022.
- Tacotron 2: seq2seq with attention, excellent quality, slow autoregressive decoding
- FastSpeech 2: non-autoregressive (parallel), fast, duration/pitch/energy predictors
- VITS: end-to-end VAE + normalizing flow + GAN — no separate vocoder needed
4. Codec Language Models (Bark / AudioLM Era)
Encodes audio as discrete tokens using a neural audio codec (EnCodec, DAC), then trains an LLM-style autoregressive model over those tokens. Produces extremely natural speech, laughter, and non-verbal sounds, but inference is slow: the model must generate hundreds of codec tokens for every second of audio.
- Examples: Bark, AudioLM, VoiceCraft
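The token counts explain the slowness. As an illustration (assuming EnCodec's roughly 75 frames per second at 24 kHz, with one token per residual codebook per frame):

```python
# Autoregressive token budget for a codec language model.
# Illustrative numbers: EnCodec at 24 kHz emits ~75 codec frames per
# second, and each frame carries one token per residual codebook.
FRAME_RATE_HZ = 75
NUM_CODEBOOKS = 8

tokens_per_second = FRAME_RATE_HZ * NUM_CODEBOOKS
print(tokens_per_second)  # 600 tokens generated per second of audio
```

A 10-second reply therefore needs thousands of sequential decoding steps, which is why these models trail flow-matching approaches on latency.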
5. Flow Matching / Diffusion
Uses continuous normalizing flows or score-based diffusion to generate mel-spectrograms or waveforms directly from noise. Fast inference, high quality, excellent voice cloning.
- Examples: F5-TTS, E2-TTS, Voicebox, Matcha-TTS
Library Landscape
| Library | Approach | Quality | Speed | Voice Cloning | Offline | License |
|---|---|---|---|---|---|---|
| Kokoro | Flow + HiFi-GAN | ★★★★☆ | ★★★★★ | ✗ | ✓ | Apache 2.0 |
| Coqui XTTS-v2 | Codec LM | ★★★★★ | ★★★☆☆ | ✓ (3s ref) | ✓ | CPML |
| F5-TTS | Flow matching | ★★★★★ | ★★★★☆ | ✓ (ref audio) | ✓ | MIT |
| StyleTTS2 | Diffusion + style | ★★★★★ | ★★★☆☆ | ✓ | ✓ | MIT |
| Bark | Codec LM (GPT) | ★★★★☆ | ★★☆☆☆ | ✗ (voice presets) | ✓ | MIT |
| edge-tts | Cloud (Azure) | ★★★★☆ | ★★★★★ | ✗ | ✗ | MIT (client) |
| OpenVoice V2 | VITS + tone color | ★★★★☆ | ★★★★☆ | ✓ (any voice) | ✓ | MIT |
| pyttsx3 | OS TTS engine | ★★☆☆☆ | ★★★★★ | ✗ | ✓ | MIT |
| MeloTTS | VITS-based | ★★★★☆ | ★★★★☆ | ✗ | ✓ | MIT |
Voice Quality Metrics
MOS (Mean Opinion Score)
Subjective listening test rated 1–5 by human listeners. Standard TTS quality benchmark.
| Score | Description |
|---|---|
| 5.0 | Excellent — indistinguishable from human |
| 4.0–4.5 | Good — natural, minor artifacts |
| 3.0–4.0 | Fair — clearly synthetic but intelligible |
| < 3.0 | Poor — noticeable robotic quality |
Human speech typically scores ~4.5 MOS. Modern neural TTS (XTTS-v2, F5-TTS) achieves 4.0–4.3.
WER on TTS Output
Round-trip metric: synthesize → transcribe with ASR → compare to input. Lower WER = better intelligibility.
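A minimal WER implementation for this round-trip check, using word-level Levenshtein distance (production setups usually reach for a library such as jiwer, which also normalizes punctuation and casing):

```python
def wer(reference: str, hypothesis: str) -> float:
    """Word error rate: edit distance over words / reference length."""
    ref = reference.lower().split()
    hyp = hypothesis.lower().split()
    # d[i][j] = minimum edits to turn ref[:i] into hyp[:j]
    d = [[0] * (len(hyp) + 1) for _ in range(len(ref) + 1)]
    for i in range(len(ref) + 1):
        d[i][0] = i  # i deletions
    for j in range(len(hyp) + 1):
        d[0][j] = j  # j insertions
    for i in range(1, len(ref) + 1):
        for j in range(1, len(hyp) + 1):
            sub = d[i - 1][j - 1] + (ref[i - 1] != hyp[j - 1])
            d[i][j] = min(sub, d[i - 1][j] + 1, d[i][j - 1] + 1)
    return d[len(ref)][len(hyp)] / max(len(ref), 1)
```

Feed it the original TTS input text as `reference` and the ASR transcript of the synthesized audio as `hypothesis`.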
RTF (Real-Time Factor)
RTF = synthesis time (s) / audio duration (s)
RTF < 1.0 means faster than real-time (required for streaming use). RTF < 0.1 is excellent.
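Measuring RTF only takes a timer around the synthesis call. A small sketch, where `tts_fn` stands in for whichever engine you are benchmarking (it just needs to return a sequence of audio samples):

```python
import time

def measure_rtf(tts_fn, text: str, sample_rate: int):
    """Time a TTS callable and return (samples, real-time factor)."""
    start = time.perf_counter()
    samples = tts_fn(text)
    synthesis_time = time.perf_counter() - start
    audio_duration = len(samples) / sample_rate  # seconds of audio produced
    return samples, synthesis_time / audio_duration
```

Run it a few times and discard the first call, since model warm-up (weight loading, CUDA kernels) inflates the initial measurement.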
Additional Objective Metrics
| Metric | Full Name | Measures | Range | Better |
|---|---|---|---|---|
| PESQ | Perceptual Evaluation of Speech Quality (ITU-T P.862) | Narrowband/wideband quality vs. reference | -0.5 – 4.5 | Higher |
| STOI | Short-Time Objective Intelligibility | Intelligibility — fraction of speech correctly understood | 0 – 1 | Higher |
| ViSQOL | Virtual Speech Quality Objective Listener | Perceptual quality using neurogram similarity | 1 – 5 | Higher |
| UTMOS | Unified TTS MOS predictor (neural) | Predicted MOS without human listeners | 1 – 5 | Higher |
| F0 RMSE | Root-mean-square error of fundamental frequency | Pitch accuracy vs. reference | Hz | Lower |
| V/UV error | Voiced/unvoiced classification error | Voicing correctness | % | Lower |
# PESQ and STOI require a reference (ground-truth) waveform
# pip install pesq pystoi
from pesq import pesq
from pystoi import stoi
import numpy as np

def evaluate_tts_quality(reference: np.ndarray, synthesized: np.ndarray, sr: int = 16000):
    """Compute objective TTS quality metrics vs a reference waveform."""
    pesq_score = pesq(sr, reference, synthesized, "wb")  # wideband mode
    stoi_score = stoi(reference, synthesized, sr, extended=False)
    print(f"PESQ: {pesq_score:.3f}  STOI: {stoi_score:.3f}")
    return pesq_score, stoi_score

UTMOS (neural MOS predictor) does not need a reference; it scores any audio:
pip install utmos
utmos score audio.wav   # outputs predicted MOS
When to Use Which Library
Need offline + fastest possible?
→ Kokoro (Apache 2.0, ONNX, CPU ~0.05 RTF)
Need voice cloning from a short reference?
→ F5-TTS (MIT, best quality/speed balance for cloning)
→ Coqui XTTS-v2 (CPML, excellent multilingual cloning)
Need zero setup, cloud quality?
→ edge-tts (free Azure Neural TTS, 400+ voices)
Need expressive/emotional speech, laughter, music?
→ Bark (slow but uniquely expressive)
Need your own voice or any voice style transfer?
→ OpenVoice V2 (MIT, real-time style mixing)
Need simple system TTS (no ML)?
→ pyttsx3 (wraps OS engine, zero deps)

Audio Output Formats
| Format | Sample Rate | Bit Depth | Use for |
|---|---|---|---|
| WAV (PCM) | 22050 / 24000 Hz | 16-bit | Storage, further processing |
| WAV (PCM) | 16000 Hz | 16-bit | ASR round-trip, voice pipeline |
| MP3 | any | compressed | Web delivery |
| OGG Vorbis | any | compressed | Web streaming |
Most neural TTS models output at 22050 Hz or 24000 Hz. If feeding back to ASR, resample to 16000 Hz.
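A quick way to do that resampling, sketched here with plain linear interpolation. This is fine for an ASR round-trip check, but note it does not band-limit before downsampling; prefer a polyphase resampler (scipy.signal.resample_poly, or librosa) when quality matters:

```python
import numpy as np

def resample_linear(audio: np.ndarray, sr_in: int, sr_out: int) -> np.ndarray:
    """Resample a 1-D waveform by linear interpolation (e.g. 24000 -> 16000 Hz)."""
    n_out = int(round(len(audio) * sr_out / sr_in))
    # Sample the original signal at n_out evenly spaced fractional indices.
    t_out = np.linspace(0.0, len(audio) - 1.0, num=n_out)
    return np.interp(t_out, np.arange(len(audio)), audio)
```

For example, `resample_linear(audio_24k, 24000, 16000)` yields a waveform two-thirds the original length, ready to feed back into a 16 kHz ASR model.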