---
url: /kb/ai/tts/introduction/index.md
---
# TTS Introduction

Text-to-Speech (TTS) converts written text into natural-sounding audio. In an AI voice pipeline, TTS is the **final output stage**: it takes the text response produced by a language model and renders it as speech that a user hears.

***

## Position in the Voice Pipeline

```
Microphone
    ↓
VAD  (detect speech boundaries)
    ↓
ASR  (speech → text)
    ↓
LLM  (text → response text)
    ↓
TTS  (text → audio)            ← you are here
    ↓
Speaker
```

Each stage has its own latency budget. TTS is often the **largest contributor to perceived delay** because audio must be generated before (or while) it plays. Streaming TTS — synthesizing and playing audio sentence-by-sentence instead of waiting for the full response — is the key technique for reducing that latency.
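A minimal sketch of the sentence-chunking idea. The `synthesize` callback is a hypothetical stand-in for any real TTS backend; the point is that sentence *N+1* can be synthesized while sentence *N* is playing:

```python
import re
from typing import Callable, Iterator

def stream_sentences(text: str) -> Iterator[str]:
    """Yield sentence-sized chunks so TTS can start before the full text is ready."""
    for sentence in re.split(r"(?<=[.!?])\s+", text.strip()):
        if sentence:
            yield sentence

def streaming_tts(text: str, synthesize: Callable[[str], bytes]) -> Iterator[bytes]:
    """Synthesize chunk-by-chunk; the caller plays each chunk as it arrives."""
    for sentence in stream_sentences(text):
        yield synthesize(sentence)  # sentence N+1 is generated while N plays

# Example with a stub backend (a real backend would return PCM audio bytes):
chunks = list(streaming_tts("Hello there. How are you? Fine.", lambda s: s.encode()))
```

Time-to-first-audio then depends only on the first sentence's length, not the whole response.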

***

## TTS Generation Approaches

### 1. Concatenative / Unit Selection (Legacy)

Stitches pre-recorded phoneme/diphone segments from a database. Fast, no GPU needed, but robotic-sounding and voice-locked.

* Examples: Festival, MaryTTS, eSpeak

### 2. Statistical Parametric (HMM/DNN Era)

Models speech as a sequence of acoustic parameters generated by a statistical model (HMM, then DNN). Smoother than concatenative but still clearly synthetic.

* Examples: HTS, Merlin

### 3. Neural End-to-End (Tacotron/FastSpeech Era)

A neural network maps text (or phonemes) directly to mel-spectrograms, then a **vocoder** converts the spectrogram to a waveform. This was the dominant architecture from 2017–2022.

* **Tacotron 2**: seq2seq with attention, excellent quality, slow autoregressive decoding
* **FastSpeech 2**: non-autoregressive (parallel), fast, duration/pitch/energy predictors
* **VITS**: end-to-end VAE + normalizing flow + GAN — no separate vocoder needed
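The two-stage contract (acoustic model → vocoder) can be sketched with dummy arrays. The shapes below are illustrative assumptions: 80 mel bins and a hop size of 256 samples, which are common Tacotron 2 / HiFi-GAN defaults:

```python
import numpy as np

HOP_LENGTH = 256   # waveform samples per mel frame (typical vocoder setting)
N_MELS = 80        # mel filterbank size commonly used by Tacotron 2 / FastSpeech 2

def acoustic_model(phonemes: list) -> np.ndarray:
    """Stand-in for Tacotron/FastSpeech: phonemes -> mel-spectrogram [n_mels, frames]."""
    n_frames = 10 * len(phonemes)            # a duration predictor would decide this
    return np.random.randn(N_MELS, n_frames).astype(np.float32)

def vocoder(mel: np.ndarray) -> np.ndarray:
    """Stand-in for HiFi-GAN: mel [n_mels, frames] -> waveform [frames * hop]."""
    n_frames = mel.shape[1]
    return np.zeros(n_frames * HOP_LENGTH, dtype=np.float32)

mel = acoustic_model(["HH", "AH", "L", "OW"])
wav = vocoder(mel)
print(mel.shape, wav.shape)   # (80, 40) (10240,)
```

VITS collapses these two stages into a single model, which is why it needs no separate vocoder.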

### 4. Codec Language Models (Bark / AudioLM Era)

Encodes audio as discrete tokens using a neural audio codec (EnCodec, DAC), then trains an LLM-style autoregressive model over those tokens. Produces extremely natural speech, laughter, and non-verbal sounds, but is **slow**: 100+ codec tokens must be generated autoregressively for every second of audio.

* Examples: Bark, AudioLM, VALL-E, VoiceCraft
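The token-rate claim is easy to sanity-check. EnCodec at 24 kHz emits 75 codec frames per second of audio, and each frame carries one token per residual codebook (8 codebooks is an assumed configuration; the count varies with the target bitrate):

```python
FRAME_RATE_HZ = 75      # EnCodec @ 24 kHz: codec frames per second of audio
N_CODEBOOKS = 8         # residual VQ codebooks (assumption; varies by bitrate)

tokens_per_second = FRAME_RATE_HZ * N_CODEBOOKS
print(tokens_per_second)        # 600 tokens per second of audio
print(tokens_per_second * 10)   # 6000 autoregressive steps for a 10 s clip
```

Thousands of sequential decoding steps per utterance is what makes this family hard to run in real time.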

### 5. Flow Matching / Diffusion

Uses continuous normalizing flows or score-based diffusion to generate mel-spectrograms or waveforms directly from noise. Fast inference, high quality, excellent voice cloning.

* Examples: F5-TTS, E2-TTS, Voicebox, Matcha-TTS
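At inference time these models integrate a learned velocity field from noise toward data, typically with a small number of Euler steps. A toy version of that integration loop, with a hand-written velocity field standing in for the neural network:

```python
import math

def euler_integrate(x0: float, velocity, n_steps: int = 32) -> float:
    """Integrate dx/dt = velocity(x, t) from t=0 to t=1 — the ODE a
    flow-matching model solves, with a neural net as the velocity field."""
    x, dt = x0, 1.0 / n_steps
    for i in range(n_steps):
        t = i * dt
        x = x + velocity(x, t) * dt   # one Euler step
    return x

# Toy field dx/dt = -x: the exact solution at t=1 is x0 * e^(-1)
approx = euler_integrate(1.0, lambda x, t: -x, n_steps=1000)
print(approx, math.exp(-1))
```

Because the whole trajectory can be integrated in a few fixed steps (rather than one step per token), inference is fast compared to codec LMs.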

***

## Library Landscape

| Library | Approach | Quality | Speed | Voice Cloning | Offline | License |
|---------|----------|---------|-------|--------------|---------|---------|
| **Kokoro** | Flow + HiFi-GAN | ★★★★☆ | ★★★★★ | ✗ | ✓ | Apache 2.0 |
| **Coqui XTTS-v2** | Codec LM | ★★★★★ | ★★★☆☆ | ✓ (3s ref) | ✓ | CPML |
| **F5-TTS** | Flow matching | ★★★★★ | ★★★★☆ | ✓ (ref audio) | ✓ | MIT |
| **StyleTTS2** | Diffusion + style | ★★★★★ | ★★★☆☆ | ✓ | ✓ | MIT |
| **Bark** | Codec LM (GPT) | ★★★★☆ | ★★☆☆☆ | ✗ (voice presets) | ✓ | MIT |
| **edge-tts** | Cloud (Azure) | ★★★★☆ | ★★★★★ | ✗ | ✗ | MIT (client) |
| **OpenVoice V2** | VITS + tone color | ★★★★☆ | ★★★★☆ | ✓ (any voice) | ✓ | MIT |
| **pyttsx3** | OS TTS engine | ★★☆☆☆ | ★★★★★ | ✗ | ✓ | MIT |
| **MeloTTS** | VITS-based | ★★★★☆ | ★★★★☆ | ✗ | ✓ | MIT |

***

## Voice Quality Metrics

### MOS (Mean Opinion Score)

Subjective listening test rated 1–5 by human listeners. Standard TTS quality benchmark.

| Score | Description |
|-------|-------------|
| 5.0 | Excellent — indistinguishable from human |
| 4.0–4.5 | Good — natural, minor artifacts |
| 3.0–4.0 | Fair — clearly synthetic but intelligible |
| < 3.0 | Poor — noticeable robotic quality |

Human speech typically scores ~4.5 MOS. Modern neural TTS (XTTS-v2, F5-TTS) achieves 4.0–4.3.

### WER on TTS Output

Round-trip metric: synthesize → transcribe with ASR → compare to input. Lower WER = better intelligibility.
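A dependency-free WER implementation for the round trip, using word-level edit distance (libraries such as `jiwer` compute the same quantity):

```python
def wer(reference: str, hypothesis: str) -> float:
    """Word error rate: (substitutions + insertions + deletions) / reference words."""
    ref, hyp = reference.lower().split(), hypothesis.lower().split()
    # Levenshtein distance over words via dynamic programming
    d = [[0] * (len(hyp) + 1) for _ in range(len(ref) + 1)]
    for i in range(len(ref) + 1):
        d[i][0] = i
    for j in range(len(hyp) + 1):
        d[0][j] = j
    for i in range(1, len(ref) + 1):
        for j in range(1, len(hyp) + 1):
            cost = 0 if ref[i - 1] == hyp[j - 1] else 1
            d[i][j] = min(d[i - 1][j] + 1,        # deletion
                          d[i][j - 1] + 1,        # insertion
                          d[i - 1][j - 1] + cost) # substitution
    return d[len(ref)][len(hyp)] / max(len(ref), 1)

# TTS says "the quick brown fox"; ASR transcribes "the quick brown box"
print(wer("the quick brown fox", "the quick brown box"))   # 0.25
```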

### RTF (Real-Time Factor)

$$\mathrm{RTF} = \frac{\text{synthesis time (s)}}{\text{audio duration (s)}}$$

RTF < 1.0 means faster than real-time (required for streaming use). RTF < 0.1 is excellent.
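Measuring RTF is a thin wrapper around any synthesis call. A sketch with a stub synthesizer, where `time.sleep` stands in for real model work:

```python
import time

def measure_rtf(synthesize, text: str, sample_rate: int = 24000):
    """Return (audio, RTF) for a single synthesis call."""
    start = time.perf_counter()
    audio = synthesize(text)                # returns a sequence of samples
    elapsed = time.perf_counter() - start
    duration = len(audio) / sample_rate     # seconds of audio produced
    return audio, elapsed / duration

# Stub: "synthesizes" 1 s of silence in ~50 ms -> RTF around 0.05
def stub_tts(text):
    time.sleep(0.05)
    return [0.0] * 24000

_, rtf = measure_rtf(stub_tts, "hello")
print(f"RTF ≈ {rtf:.2f}")
```

Average RTF over several utterances of varying length; short texts are dominated by fixed per-call overhead.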

### Additional Objective Metrics

| Metric | Full Name | Measures | Range | Better |
|--------|-----------|----------|-------|--------|
| **PESQ** | Perceptual Evaluation of Speech Quality (ITU-T P.862) | Narrowband/wideband quality vs. reference | -0.5 – 4.5 | Higher |
| **STOI** | Short-Time Objective Intelligibility | Intelligibility — fraction of speech correctly understood | 0 – 1 | Higher |
| **ViSQOL** | Virtual Speech Quality Objective Listener | Perceptual quality using neurogram similarity | 1 – 5 | Higher |
| **UTMOS** | Unified TTS MOS predictor (neural) | Predicted MOS without human listeners | 1 – 5 | Higher |
| **F0 RMSE** | Root-mean-square error of fundamental frequency | Pitch accuracy vs. reference | Hz | Lower |
| **V/UV error** | Voiced/unvoiced classification error | Voicing correctness | % | Lower |

```python
# PESQ and STOI require a reference (ground-truth) waveform
# pip install pesq pystoi

from pesq import pesq
from pystoi import stoi
import numpy as np

def evaluate_tts_quality(reference: np.ndarray, synthesized: np.ndarray, sr: int = 16000):
    """Compute objective TTS quality metrics vs a reference waveform."""
    pesq_score = pesq(sr, reference, synthesized, "wb")   # "wb" (wideband) requires sr == 16000
    stoi_score = stoi(reference, synthesized, sr, extended=False)
    print(f"PESQ: {pesq_score:.3f}  STOI: {stoi_score:.3f}")
    return pesq_score, stoi_score
```

> **UTMOS** (neural MOS predictor) does not need a reference — it scores any audio:
>
> ```bash
> pip install utmos
> utmos score audio.wav   # outputs predicted MOS
> ```

***

## When to Use Which Library

```
Need offline + fastest possible?
    → Kokoro (Apache 2.0, ONNX, CPU ~0.05 RTF)

Need voice cloning from a short reference?
    → F5-TTS (MIT, best quality/speed balance for cloning)
    → Coqui XTTS-v2 (CPML, excellent multilingual cloning)

Need zero setup, cloud quality?
    → edge-tts (free Azure Neural TTS, 400+ voices)

Need expressive/emotional speech, laughter, music?
    → Bark (slow but uniquely expressive)

Need your own voice or any voice style transfer?
    → OpenVoice V2 (MIT, real-time style mixing)

Need simple system TTS (no ML)?
    → pyttsx3 (wraps OS engine, zero deps)
```

***

## Audio Output Formats

| Format | Sample Rate | Bit Depth | Use for |
|--------|-------------|-----------|---------|
| WAV (PCM) | 22050 / 24000 Hz | 16-bit | Storage, further processing |
| WAV (PCM) | 16000 Hz | 16-bit | ASR round-trip, voice pipeline |
| MP3 | any | compressed | Web delivery |
| OGG Vorbis | any | compressed | Web streaming |

Most neural TTS models output at **22050 Hz** or **24000 Hz**. If feeding back to ASR, resample to 16000 Hz.
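A quick 22050 Hz → 16000 Hz resample via linear interpolation is enough for a sanity check (for production, prefer a polyphase filter such as `scipy.signal.resample_poly` or `librosa.resample`, which anti-alias properly):

```python
import numpy as np

def resample_linear(audio: np.ndarray, orig_sr: int, target_sr: int) -> np.ndarray:
    """Resample a mono waveform with linear interpolation (no anti-alias filter)."""
    n_out = int(round(len(audio) * target_sr / orig_sr))
    x_old = np.linspace(0.0, 1.0, num=len(audio), endpoint=False)
    x_new = np.linspace(0.0, 1.0, num=n_out, endpoint=False)
    return np.interp(x_new, x_old, audio).astype(audio.dtype)

one_second = np.zeros(22050, dtype=np.float32)    # 1 s of TTS output @ 22050 Hz
resampled = resample_linear(one_second, 22050, 16000)
print(len(resampled))                             # 16000 samples = 1 s @ 16 kHz
```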

***

## See Also

* [TTS Algorithms & Theory](/kb/ai/tts/algorithms-theory/)
* [TTS Libraries Comparison](/kb/ai/tts/libraries-comparison/)
* [TTS Integration (Voice Pipeline)](/kb/ai/tts/integration/)
* [ASR Introduction](/kb/ai/asr/introduction/)
* [VAD Introduction](/kb/ai/vad/introduction/)
