TTS Installation

2026-03-21

System dependencies, library installs, model downloads, and platform-specific notes for every major TTS library.

System Dependencies

Ubuntu / Debian

sudo apt update && sudo apt install -y \
    ffmpeg \
    libsndfile1 \
    portaudio19-dev \
    espeak-ng \
    libespeak-ng-dev \
    python3-dev \
    build-essential

Fedora / RHEL

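# ffmpeg is packaged in RPM Fusion; enable that repository first (stock Fedora ships ffmpeg-free)
sudo dnf install -y ffmpeg libsndfile portaudio-devel espeak-ng espeak-ng-devel gcc python3-devel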

macOS

brew install ffmpeg libsndfile portaudio espeak-ng

Windows

Install ffmpeg and add it to PATH.
Install eSpeak-NG and add it to PATH.
PortAudio is bundled in the sounddevice wheel, so no manual install is needed.
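
Once the packages above are installed, a quick way to confirm that the command-line tools are reachable is to look them up on PATH. The following is a minimal sketch using only the standard library; the `REQUIRED_TOOLS` tuple and the `missing_tools` helper are illustrative names, not part of any TTS library.

```python
import shutil

# Executables named in the system-dependency lists above.
REQUIRED_TOOLS = ("ffmpeg", "espeak-ng")


def missing_tools(tools=REQUIRED_TOOLS):
    """Return the names from `tools` that cannot be found on PATH."""
    return [name for name in tools if shutil.which(name) is None]


if __name__ == "__main__":
    missing = missing_tools()
    if missing:
        print("Not on PATH:", ", ".join(missing))
    else:
        print("All required tools found.")
```

This only checks executables. Shared libraries such as libsndfile and PortAudio are exercised when soundfile and sounddevice are imported.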

Kokoro

Fastest CPU TTS. Recommended for production.

# PyTorch backend
pip install kokoro soundfile numpy

# ONNX backend (even faster CPU, no PyTorch)
# Quote the extras so shells such as zsh do not treat [] as a glob pattern
pip install "kokoro[onnx]" soundfile numpy

# misaki G2P, per language (uses eSpeak-NG as a fallback; optional but recommended)
pip install "misaki[en]"      # English
pip install "misaki[ja]"      # Japanese
pip install "misaki[zh]"      # Chinese
pip install "misaki[ko]"      # Korean
pip install "misaki[fr]"      # French

Models are downloaded automatically on first use to ~/.cache/huggingface/hub/.

Manual pre-download:

from huggingface_hub import snapshot_download
snapshot_download("hexgrad/Kokoro-82M", local_dir="./models/kokoro")
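
For the automatic download path, it can be useful to check whether the model is already in the Hugging Face cache before starting a service. The sketch below reconstructs the cache folder from the `HF_HUB_CACHE` and `HF_HOME` environment variables and the hub's `models--<org>--<name>` folder naming. It is an approximation under stated assumptions: it does not handle `XDG_CACHE_HOME`, and both helper names are hypothetical.

```python
import os
from pathlib import Path


def hf_hub_cache() -> Path:
    """Resolve the hub cache directory from HF_HUB_CACHE, then HF_HOME, then the default."""
    if "HF_HUB_CACHE" in os.environ:
        return Path(os.environ["HF_HUB_CACHE"]).expanduser()
    hf_home = os.environ.get("HF_HOME", os.path.join("~", ".cache", "huggingface"))
    return Path(hf_home).expanduser() / "hub"


def cached_repo_dir(repo_id: str) -> Path:
    """Cache folder for a model repo, e.g. models--hexgrad--Kokoro-82M."""
    return hf_hub_cache() / ("models--" + repo_id.replace("/", "--"))


if __name__ == "__main__":
    path = cached_repo_dir("hexgrad/Kokoro-82M")
    print(path, "(present)" if path.is_dir() else "(not downloaded yet)")
```

Files fetched with `local_dir=` as in the manual pre-download are written to that directory instead, so this check applies only to the automatic download.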

Coqui XTTS-v2

# Basic install
pip install TTS

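# XTTS-v2 with GPU: install the CUDA 12.1 PyTorch wheels first, then TTS from PyPI
# (--index-url replaces PyPI, so TTS cannot be resolved in the same command)
pip install torch torchaudio --index-url https://download.pytorch.org/whl/cu121
pip install TTS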

Pre-download model (~2.8GB):

from TTS.api import TTS
tts = TTS("tts_models/multilingual/multi-dataset/xtts_v2")
# Downloads to ~/.local/share/tts/tts_models--multilingual--multi-dataset--xtts_v2/

Manual download (for offline/Docker):

# Using huggingface-cli
pip install huggingface_hub
huggingface-cli download coqui/XTTS-v2 --local-dir ./models/xtts_v2

F5-TTS

pip install f5-tts

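# With GPU support: install the CUDA 12.1 PyTorch wheels first, then f5-tts from PyPI
# (--index-url replaces PyPI, so f5-tts cannot be resolved in the same command)
pip install torch torchaudio --index-url https://download.pytorch.org/whl/cu121
pip install f5-tts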

Pre-download model:

huggingface-cli download SWivid/F5-TTS --local-dir ./models/f5-tts

Or via Python:

from f5_tts.api import F5TTS
tts = F5TTS()  # downloads ~1.2GB on first run

Bark

# Install from the official Suno repository.
# Do not use "pip install bark": that name on PyPI is an unrelated package.
pip install git+https://github.com/suno-ai/bark.git

# Dependencies
pip install transformers accelerate torch torchaudio soundfile

Pre-download all models (~5GB total):

from bark import preload_models
preload_models()
# Downloads to ~/.cache/suno/bark_v0/

Models are stored separately:

~/.cache/suno/bark_v0/
├── text_2.pt          # semantic model (~1.2GB)
├── coarse_2.pt        # coarse acoustic (~1.2GB)
├── fine_2.pt          # fine acoustic (~1.2GB)
└── hubert_base_ls960.pt  # semantic encoder
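
Because the checkpoints are separate files, a partial download can be detected by checking for each one. This is a small sketch that takes the three full-size checkpoint names from the listing above as its default; `missing_bark_files` is a hypothetical helper, not part of Bark.

```python
from pathlib import Path

# File names taken from the listing above (full-size models).
BARK_FILES = ("text_2.pt", "coarse_2.pt", "fine_2.pt")


def missing_bark_files(cache_dir, names=BARK_FILES):
    """Return the checkpoint names that are absent from cache_dir."""
    cache_dir = Path(cache_dir).expanduser()
    return [name for name in names if not (cache_dir / name).is_file()]


if __name__ == "__main__":
    missing = missing_bark_files("~/.cache/suno/bark_v0")
    print("missing:", missing if missing else "none")
```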

Small models (lower quality, much faster):

import os
os.environ["SUNO_USE_SMALL_MODELS"] = "True"
from bark import preload_models
preload_models()

edge-tts

pip install edge-tts
# No model download: synthesis runs on Microsoft's online TTS service, so network access is required

OpenVoice V2

git clone https://github.com/myshell-ai/OpenVoice
cd OpenVoice
pip install -e .
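# MeloTTS (needed by V2) is installed from source, following the OpenVoice usage docs
pip install git+https://github.com/myshell-ai/MeloTTS.git
python -m unidic download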

# Download checkpoints (~500MB)
python -c "
from huggingface_hub import snapshot_download
snapshot_download('myshell-ai/OpenVoiceV2', local_dir='./checkpoints_v2')
"

StyleTTS2

pip install git+https://github.com/yl4579/StyleTTS2.git
pip install torch torchaudio phonemizer einops transformers

# Download model
huggingface-cli download yl4579/StyleTTS2-LibriTTS --local-dir ./models/styletts2

pyttsx3

pip install pyttsx3

# Linux: requires espeak or espeak-ng
sudo apt install espeak-ng

# macOS: uses built-in NSSpeechSynthesizer (no extra deps)
# Windows: uses SAPI5 (no extra deps)

sounddevice (for real-time audio output)

pip install sounddevice
# Requires portaudio (installed above)
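
To see whether PortAudio is usable on a given machine, one option is to ask sounddevice for its device list and treat any failure as "no audio output". The sketch below assumes the `sounddevice.query_devices()` call and its `name` / `max_output_channels` fields; `list_output_devices` is an illustrative helper.

```python
def list_output_devices():
    """Return names of audio output devices, or None if sounddevice/PortAudio is unusable."""
    try:
        # Importing sounddevice fails when the PortAudio library cannot be loaded.
        import sounddevice as sd
        devices = sd.query_devices()
    except Exception:
        return None
    return [d["name"] for d in devices if d["max_output_channels"] > 0]


if __name__ == "__main__":
    names = list_output_devices()
    if names is None:
        print("sounddevice or PortAudio is not available; use file output.")
    else:
        print("Output devices:", names or "none found")
```

A `None` or empty result is the expected outcome on headless servers and in many containers.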

Minimal requirements.txt

# Core TTS stack
kokoro>=0.9.0
soundfile>=0.12.1
numpy>=1.24.0
sounddevice>=0.4.6

# For voice cloning (pick one)
# f5-tts>=0.3.0
# TTS>=0.22.0          # Coqui XTTS-v2

# For cloud TTS (no offline needed)
# edge-tts>=6.1.9

# For Bark
# git+https://github.com/suno-ai/bark.git   # the PyPI name "bark" is an unrelated package
# transformers>=4.35.0
# accelerate>=0.24.0

Docker (CPU)

FROM python:3.11-slim

RUN apt-get update && apt-get install -y \
    ffmpeg libsndfile1 espeak-ng portaudio19-dev \
    && rm -rf /var/lib/apt/lists/*

WORKDIR /app
RUN pip install --no-cache-dir kokoro soundfile numpy sounddevice

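# Pre-download the model weights and the af_heart voice at build time
# (voice files are otherwise fetched on first use)
RUN python -c "from kokoro import KPipeline; KPipeline(lang_code='a').load_voice('af_heart')"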

COPY . .
CMD ["python", "tts_service.py"]

Docker (GPU — XTTS-v2 / F5-TTS)

FROM nvidia/cuda:12.1.1-cudnn8-runtime-ubuntu22.04

RUN apt-get update && apt-get install -y python3 python3-pip ffmpeg libsndfile1 espeak-ng \
    && rm -rf /var/lib/apt/lists/*

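# --index-url replaces PyPI for the whole command, so install the PyTorch wheels separately
RUN pip3 install --no-cache-dir \
    torch torchaudio --index-url https://download.pytorch.org/whl/cu121 \
    && pip3 install --no-cache-dir TTS soundfile numpy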

WORKDIR /app
COPY . .
CMD ["python3", "tts_api.py"]

Platform Notes

Platform	Notes
Raspberry Pi 4	Kokoro ONNX only — too slow otherwise. Use `int8` ONNX. Consider `tiny` voices.
Apple Silicon (M1/M2/M3)	Use `device="mps"` for PyTorch acceleration. Kokoro ONNX also works well.
Windows	pyttsx3 and edge-tts work natively. Kokoro needs eSpeak-NG on PATH in a native Python environment, or can run under WSL.
WSL2	PortAudio may not access mic/speakers — route audio via PulseAudio or use file output only.
Headless server	Use file output only — no sounddevice streaming. Serve audio via API.
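
The device choice in the table can be made at runtime instead of being hard-coded. This is a minimal sketch assuming PyTorch's `torch.cuda.is_available()` and `torch.backends.mps.is_available()` checks; `pick_device` is an illustrative helper, and it falls back to `"cpu"` when PyTorch is not installed (for example, an ONNX-only setup).

```python
def pick_device():
    """Return 'cuda', 'mps', or 'cpu' depending on what the local PyTorch build supports."""
    try:
        import torch
    except ImportError:
        return "cpu"  # PyTorch absent, e.g. an ONNX-only install
    if torch.cuda.is_available():
        return "cuda"
    # torch.backends.mps is missing on older PyTorch releases, so look it up defensively.
    mps = getattr(torch.backends, "mps", None)
    if mps is not None and mps.is_available():
        return "mps"
    return "cpu"


if __name__ == "__main__":
    print("Using device:", pick_device())
```

The returned string can then be passed wherever a library accepts a `device` argument.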

Verify Installation

# Quick sanity check
from kokoro import KPipeline
import numpy as np
import soundfile as sf

pipe = KPipeline(lang_code="a")
chunks = [audio for _, _, audio in pipe("Installation successful.", voice="af_heart")]
sf.write("/tmp/test_tts.wav", np.concatenate(chunks), 24000)
print("TTS OK — check /tmp/test_tts.wav")
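
Beyond checking that the file exists, the written WAV can be inspected with the standard-library `wave` module. The sketch below assumes the file holds integer PCM samples, which is what soundfile is expected to write for `.wav` by default; `wave` rejects floating-point WAV files. `wav_summary` is a hypothetical helper.

```python
import wave


def wav_summary(path):
    """Return (sample_rate_hz, channels, duration_seconds) for a PCM WAV file."""
    with wave.open(str(path), "rb") as wav:
        rate = wav.getframerate()
        return rate, wav.getnchannels(), wav.getnframes() / rate


if __name__ == "__main__":
    rate, channels, seconds = wav_summary("/tmp/test_tts.wav")
    print(f"{rate} Hz, {channels} channel(s), {seconds:.2f} s")
```

For the sanity check above, a sample rate of 24000 Hz and a duration greater than zero would indicate that audio was actually generated.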
