TTS Introduction
2026-03-21
Text-to-Speech (TTS) converts written text into natural-sounding audio. In an AI voice pipeline, TTS is the final output stage: it takes the text response produced by a language model and renders it as speech that a user hears.
Position in the Voice Pipeline
Microphone
↓
VAD (detect speech boundaries)
↓
ASR (speech → text)
↓
LLM (text → response text)
↓
TTS (text → audio) ← you are here
↓
Speaker

Each stage has its own latency budget. TTS is often the largest contributor to perceived delay because audio must be generated before (or while) it plays. Streaming TTS, which synthesizes and plays audio sentence by sentence, is the key technique for reducing that latency.
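The sentence-by-sentence streaming idea can be sketched as a small buffer that releases complete sentences from an LLM token stream. This is a minimal illustration; real pipelines use smarter sentence splitting, and the TTS call itself is left to the surrounding code:

```python
import re

def stream_sentences(token_iter):
    """Yield complete sentences from a stream of LLM text chunks.

    Each sentence can be handed to the TTS engine immediately, so
    playback of sentence 1 overlaps with synthesis of sentence 2.
    """
    buf = ""
    for chunk in token_iter:
        buf += chunk
        while True:
            # A sentence ends at ., ! or ? followed by whitespace.
            m = re.search(r"(?<=[.!?])\s+", buf)
            if not m:
                break
            yield buf[:m.start()]
            buf = buf[m.end():]
    if buf.strip():
        yield buf.strip()  # flush whatever remains at end of stream
```

Typical use: `for sentence in stream_sentences(llm_stream): play(tts(sentence))`.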
TTS Generation Approaches
1. Concatenative / Unit Selection (Legacy)
Stitches pre-recorded phoneme/diphone segments from a database. Fast, no GPU needed, but robotic-sounding and voice-locked.
- Examples: Festival, MaryTTS, eSpeak
2. Statistical Parametric (HMM/DNN Era)
Models speech as a sequence of acoustic parameters generated by a statistical model (HMM, then DNN). Smoother than concatenative but still clearly synthetic.
- Examples: HTS, Merlin
3. Neural End-to-End (Tacotron/FastSpeech Era)
A neural network maps text (or phonemes) directly to mel-spectrograms, then a vocoder converts the spectrogram to a waveform. This was the dominant architecture from 2017–2022.
- Tacotron 2: seq2seq with attention, excellent quality, slow autoregressive decoding
- FastSpeech 2: non-autoregressive (parallel), fast, duration/pitch/energy predictors
- VITS: end-to-end VAE + normalizing flow + GAN — no separate vocoder needed
4. Codec Language Models (Bark / AudioLM Era)
Encodes audio as discrete tokens using a neural audio codec (EnCodec, DAC), then trains an LLM-style autoregressive model over those tokens. Produces extremely natural speech, laughter, and non-verbal sounds, but inference is slow: the model must generate hundreds of codec tokens for every second of audio.
- Examples: Bark, AudioLM, VoiceCraft
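The token counts explain the slowness. As an illustration (assuming EnCodec's roughly 75 frames per second at 24 kHz, with one token per residual codebook per frame):

```python
# Autoregressive token budget for a codec language model.
# Illustrative numbers: EnCodec at 24 kHz emits ~75 codec frames per
# second, and each frame carries one token per residual codebook.
FRAME_RATE_HZ = 75
NUM_CODEBOOKS = 8

tokens_per_second = FRAME_RATE_HZ * NUM_CODEBOOKS
print(tokens_per_second)  # 600 tokens generated per second of audio
```

A 10-second reply therefore needs thousands of sequential decoding steps, which is why these models trail flow-matching approaches on latency.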
5. Flow Matching / Diffusion
Uses continuous normalizing flows or score-based diffusion to generate mel-spectrograms or waveforms directly from noise. Fast inference, high quality, excellent voice cloning.
- Examples: F5-TTS, E2-TTS, Voicebox, Matcha-TTS
Library Landscape
| Library | Approach | Quality | Speed | Voice Cloning | Offline | License |
|---|---|---|---|---|---|---|
| Kokoro | Flow + HiFi-GAN | ★★★★☆ | ★★★★★ | ✗ | ✓ | Apache 2.0 |
| Coqui XTTS-v2 | Codec LM | ★★★★★ | ★★★☆☆ | ✓ (3s ref) | ✓ | CPML |
| F5-TTS | Flow matching | ★★★★★ | ★★★★☆ | ✓ (ref audio) | ✓ | MIT |
| StyleTTS2 | Diffusion + style | ★★★★★ | ★★★☆☆ | ✓ | ✓ | MIT |
| Bark | Codec LM (GPT) | ★★★★☆ | ★★☆☆☆ | ✗ (voice presets) | ✓ | MIT |
| edge-tts | Cloud (Azure) | ★★★★☆ | ★★★★★ | ✗ | ✗ | MIT (client) |
| OpenVoice V2 | VITS + tone color | ★★★★☆ | ★★★★☆ | ✓ (any voice) | ✓ | MIT |
| pyttsx3 | OS TTS engine | ★★☆☆☆ | ★★★★★ | ✗ | ✓ | MIT |
| MeloTTS | VITS-based | ★★★★☆ | ★★★★☆ | ✗ | ✓ | MIT |
Voice Quality Metrics
MOS (Mean Opinion Score)
Subjective listening test rated 1–5 by human listeners. Standard TTS quality benchmark.
| Score | Description |
|---|---|
| 5.0 | Excellent — indistinguishable from human |
| 4.0–4.5 | Good — natural, minor artifacts |
| 3.0–4.0 | Fair — clearly synthetic but intelligible |
| < 3.0 | Poor — noticeable robotic quality |
Human speech typically scores ~4.5 MOS. Modern neural TTS (XTTS-v2, F5-TTS) achieves 4.0–4.3.
WER on TTS Output
Round-trip metric: synthesize → transcribe with ASR → compare to input. Lower WER = better intelligibility.
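A minimal WER implementation for this round-trip check, using word-level Levenshtein distance (production setups usually reach for a library such as jiwer, which also normalizes punctuation and casing):

```python
def wer(reference: str, hypothesis: str) -> float:
    """Word error rate: edit distance over words / reference length."""
    ref = reference.lower().split()
    hyp = hypothesis.lower().split()
    # d[i][j] = minimum edits to turn ref[:i] into hyp[:j]
    d = [[0] * (len(hyp) + 1) for _ in range(len(ref) + 1)]
    for i in range(len(ref) + 1):
        d[i][0] = i  # i deletions
    for j in range(len(hyp) + 1):
        d[0][j] = j  # j insertions
    for i in range(1, len(ref) + 1):
        for j in range(1, len(hyp) + 1):
            sub = d[i - 1][j - 1] + (ref[i - 1] != hyp[j - 1])
            d[i][j] = min(sub, d[i - 1][j] + 1, d[i][j - 1] + 1)
    return d[len(ref)][len(hyp)] / max(len(ref), 1)
```

Feed it the original TTS input text as `reference` and the ASR transcript of the synthesized audio as `hypothesis`.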
RTF (Real-Time Factor)
RTF = synthesis time (s) / audio duration (s)
RTF < 1.0 means faster than real-time (required for streaming use). RTF < 0.1 is excellent.
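Measuring RTF only takes a timer around the synthesis call. A small sketch, where `tts_fn` stands in for whichever engine you are benchmarking (it just needs to return a sequence of audio samples):

```python
import time

def measure_rtf(tts_fn, text: str, sample_rate: int):
    """Time a TTS callable and return (samples, real-time factor)."""
    start = time.perf_counter()
    samples = tts_fn(text)
    synthesis_time = time.perf_counter() - start
    audio_duration = len(samples) / sample_rate  # seconds of audio produced
    return samples, synthesis_time / audio_duration
```

Run it a few times and discard the first call, since model warm-up (weight loading, CUDA kernels) inflates the initial measurement.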
Additional Objective Metrics
| Metric | Full Name | Measures | Range | Better |
|---|---|---|---|---|
| PESQ | Perceptual Evaluation of Speech Quality (ITU-T P.862) | Narrowband/wideband quality vs. reference | -0.5 – 4.5 | Higher |
| STOI | Short-Time Objective Intelligibility | Intelligibility — fraction of speech correctly understood | 0 – 1 | Higher |
| ViSQOL | Virtual Speech Quality Objective Listener | Perceptual quality using neurogram similarity | 1 – 5 | Higher |
| UTMOS | Unified TTS MOS predictor (neural) | Predicted MOS without human listeners | 1 – 5 | Higher |
| F0 RMSE | Root-mean-square error of fundamental frequency | Pitch accuracy vs. reference | Hz | Lower |
| V/UV error | Voiced/unvoiced classification error | Voicing correctness | % | Lower |
# PESQ and STOI require a reference (ground-truth) waveform
# pip install pesq pystoi
from pesq import pesq
from pystoi import stoi
import numpy as np

def evaluate_tts_quality(reference: np.ndarray, synthesized: np.ndarray, sr: int = 16000):
    """Compute objective TTS quality metrics vs a reference waveform."""
    pesq_score = pesq(sr, reference, synthesized, "wb")  # wideband mode
    stoi_score = stoi(reference, synthesized, sr, extended=False)
    print(f"PESQ: {pesq_score:.3f}  STOI: {stoi_score:.3f}")
    return pesq_score, stoi_score

UTMOS (neural MOS predictor) does not need a reference; it scores any audio:
pip install utmos
utmos score audio.wav   # outputs predicted MOS
When to Use Which Library
Need offline + fastest possible?
→ Kokoro (Apache 2.0, ONNX, CPU ~0.05 RTF)
Need voice cloning from a short reference?
→ F5-TTS (MIT, best quality/speed balance for cloning)
→ Coqui XTTS-v2 (CPML, excellent multilingual cloning)
Need zero setup, cloud quality?
→ edge-tts (free Azure Neural TTS, 400+ voices)
Need expressive/emotional speech, laughter, music?
→ Bark (slow but uniquely expressive)
Need your own voice or any voice style transfer?
→ OpenVoice V2 (MIT, real-time style mixing)
Need simple system TTS (no ML)?
→ pyttsx3 (wraps OS engine, zero deps)

Audio Output Formats
| Format | Sample Rate | Bit Depth | Use for |
|---|---|---|---|
| WAV (PCM) | 22050 / 24000 Hz | 16-bit | Storage, further processing |
| WAV (PCM) | 16000 Hz | 16-bit | ASR round-trip, voice pipeline |
| MP3 | any | compressed | Web delivery |
| OGG Vorbis | any | compressed | Web streaming |
Most neural TTS models output at 22050 Hz or 24000 Hz. If feeding back to ASR, resample to 16000 Hz.
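A quick way to do that resampling, sketched here with plain linear interpolation. This is fine for an ASR round-trip check, but note it does not band-limit before downsampling; prefer a polyphase resampler (scipy.signal.resample_poly, or librosa) when quality matters:

```python
import numpy as np

def resample_linear(audio: np.ndarray, sr_in: int, sr_out: int) -> np.ndarray:
    """Resample a 1-D waveform by linear interpolation (e.g. 24000 -> 16000 Hz)."""
    n_out = int(round(len(audio) * sr_out / sr_in))
    # Sample the original signal at n_out evenly spaced fractional indices.
    t_out = np.linspace(0.0, len(audio) - 1.0, num=n_out)
    return np.interp(t_out, np.arange(len(audio)), audio)
```

For example, `resample_linear(audio_24k, 24000, 16000)` yields a waveform two-thirds the original length, ready to feed back into a 16 kHz ASR model.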