TTS Cheatsheet
2026-03-21
Quick-reference for install commands, library selection, code snippets, and common fixes.
Install Commands
# Kokoro (fastest offline, Apache 2.0)
pip install kokoro soundfile numpy sounddevice
pip install misaki[en] # English G2P
sudo apt install espeak-ng # Required for phonemization
# F5-TTS (best voice cloning, MIT)
pip install f5-tts
# Coqui XTTS-v2 (multilingual cloning, CPML)
pip install TTS
# Bark (expressive/creative, MIT — slow)
pip install bark transformers accelerate
# edge-tts (cloud Azure, zero setup, MIT client)
pip install edge-tts
# OpenVoice V2 (style transfer, MIT)
pip install openvoice melo-tts
# pyttsx3 (OS TTS, zero ML, MIT)
pip install pyttsx3
# Audio utilities
pip install sounddevice soundfile librosa num2words ffmpeg-python
# System deps (Ubuntu)
sudo apt install ffmpeg libsndfile1 portaudio19-dev espeak-ng
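# Quick sanity check (my own addition): the core offline stack should import cleanly
python -c "import kokoro, soundfile, sounddevice; print('TTS stack OK')"

Library Selection Guide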
| Need | Best Choice | Why |
|---|---|---|
| Fastest CPU, offline | Kokoro | RTF ~0.05, Apache 2.0 |
| Best voice cloning | F5-TTS | Flow matching, MIT |
| Multilingual cloning | XTTS-v2 | 17 languages |
| Zero setup, cloud | edge-tts | 400+ voices, free |
| Expressive/emotion | Bark | Laughter, music tokens |
| Voice style transfer | OpenVoice V2 | Tone color converter |
| No ML, instant | pyttsx3 | OS engine wrapper |
| Low latency streaming | Kokoro | Generator-based, first chunk < 100ms |
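RTF (real-time factor) is synthesis time divided by audio duration, so 0.05 means one second of speech takes about 50 ms to generate. A minimal sketch for measuring it on your own hardware (the warm-up pass and test sentence are my own choices):

import time
import numpy as np
from kokoro import KPipeline

pipe = KPipeline(lang_code="a")
_ = list(pipe("Warm up.", voice="af_heart"))  # exclude model/voice load from timing
t0 = time.perf_counter()
audio = np.concatenate([np.asarray(a) for _, _, a in pipe(
    "The quick brown fox jumps over the lazy dog.", voice="af_heart")])
rtf = (time.perf_counter() - t0) / (len(audio) / 24000)
print(f"RTF: {rtf:.3f}")  # below 1.0 means faster than real time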
Kokoro One-Liners
from kokoro import KPipeline
import numpy as np, soundfile as sf
pipe = KPipeline(lang_code="a") # "a"=US EN, "b"=British, "j"=JA, "z"=ZH
# --- Synthesize to file ---
sf.write("out.wav", np.concatenate([a for _,_,a in pipe("Hello!", voice="af_heart")]), 24000)
# --- Stream to speakers ---
import sounddevice as sd
for _, _, audio in pipe("Hello world!", voice="af_heart"):
    sd.play(audio, 24000); sd.wait()
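# --- Custom text splitting (split_pattern kwarg, per the Kokoro README) ---
for _, _, audio in pipe("Line one.\nLine two.", voice="af_heart", split_pattern=r"\n+"):
    sd.play(audio, 24000); sd.wait()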
# --- Different voices ---
VOICES = ["af_heart", "af_bella", "af_sarah", "am_adam", "am_michael", "bf_emma", "bm_george"]
# --- Speed control ---
for _, _, audio in pipe("Speaking slowly.", voice="af_heart", speed=0.8):
    ...
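# --- Save each generated segment to its own file ---
for i, (_, _, audio) in enumerate(pipe("One. Two. Three.", voice="af_heart")):
    sf.write(f"seg_{i}.wav", audio, 24000)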
# --- British English ---
pipe_gb = KPipeline(lang_code="b")
for _, _, audio in pipe_gb("Brilliant!", voice="bf_emma"):
    ...

XTTS-v2 Voice Cloning
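Both cloning snippets below (XTTS-v2 and F5-TTS) want a short, clean reference clip. A hedged prep helper using librosa; the trim threshold and 15-second cap are my own choices, not part of either API:

import librosa, soundfile as sf

def prep_reference(path, out="reference.wav", sr=24000):
    y, _ = librosa.load(path, sr=sr, mono=True)  # resample + downmix to mono
    y, _ = librosa.effects.trim(y, top_db=30)    # drop leading/trailing silence
    sf.write(out, y[: sr * 15], sr)              # cap at 15 s
    return out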
from TTS.api import TTS
tts = TTS("tts_models/multilingual/multi-dataset/xtts_v2", gpu=False)
tts.tts_to_file(
    text="Hello in my cloned voice.",
    speaker_wav="reference.wav",  # 3-15s clean speech
    language="en",
    file_path="output.wav",
)

F5-TTS Voice Cloning
from f5_tts.api import F5TTS
import soundfile as sf
tts = F5TTS()
wav, sr, _ = tts.infer(
    ref_file="reference.wav",
    ref_text="Exact transcript of reference.wav",
    gen_text="Text to generate in that voice.",
    nfe_step=32,
)
sf.write("output.wav", wav, sr)

edge-tts Quick Start
import asyncio, edge_tts
async def speak(text, voice="en-US-AriaNeural"):
    await edge_tts.Communicate(text, voice).save("output.mp3")
asyncio.run(speak("Hello from edge-tts!"))
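# Adjust delivery: rate/volume accept signed percent strings like "+25%"
async def speak_fast(text, voice="en-US-AriaNeural"):
    await edge_tts.Communicate(text, voice, rate="+25%", volume="+0%").save("fast.mp3")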
# List all voices
async def voices():
    return await edge_tts.list_voices()

Streaming TTS (Sentence by Sentence)
import threading, queue
import numpy as np
import sounddevice as sd
from kokoro import KPipeline
def stream_speak(text: str, voice: str = "af_heart"):
    """Generate and play sentence by sentence."""
    pipe = KPipeline(lang_code="a")
    audio_q: queue.Queue = queue.Queue()

    def player():
        with sd.OutputStream(samplerate=24000, channels=1, dtype="float32") as s:
            while True:
                chunk = audio_q.get()
                if chunk is None:  # sentinel: generation finished
                    break
                # Kokoro yields tensors; sounddevice wants float32 numpy
                s.write(np.asarray(chunk, dtype="float32").reshape(-1, 1))

    t = threading.Thread(target=player, daemon=True)
    t.start()
    for _, _, audio in pipe(text, voice=voice):  # one chunk per sentence
        audio_q.put(audio)
    audio_q.put(None)
    t.join()

stream_speak("First sentence plays immediately. Second sentence follows. Done!")

Text Pre-Processing
import re
from num2words import num2words # pip install num2words
def clean_for_tts(text: str) -> str:
    text = re.sub(r"\bAPI\b", "A.P.I.", text)
    text = re.sub(r"\bLLM\b", "L.L.M.", text)
    text = re.sub(r"\bGPU\b", "G.P.U.", text)
    text = re.sub(r"\b(\d+)\b", lambda m: num2words(int(m.group())), text)
    text = text.replace("C++", "C plus plus").replace("C#", "C sharp")
    text = re.sub(r"`[^`]+`", "", text)           # strip inline code
    text = re.sub(r"\*+([^*]+)\*+", r"\1", text)  # strip markdown bold/italic
    text = re.sub(r"#{1,6}\s", "", text)          # strip headings
    return text.strip()
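print(clean_for_tts("The API uses 2 threads."))  # -> "The A.P.I. uses two threads."

Audio Utilities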
import numpy as np
import soundfile as sf
# Fade in/out (prevents clicks)
def fade(audio, ms=10, sr=24000):
    n = min(int(ms * sr / 1000), len(audio) // 4)
    if n == 0:  # clip too short to fade
        return audio
    audio[:n] *= np.linspace(0, 1, n)
    audio[-n:] *= np.linspace(1, 0, n)
    return audio
# Normalize
def normalize(audio, peak=0.95):
    m = np.abs(audio).max()
    return audio / m * peak if m > 0 else audio
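# Join chunks with a short pause between sentences (my own helper, not from any library)
def join_with_pause(chunks, ms=200, sr=24000):
    gap = np.zeros(int(ms * sr / 1000), dtype=np.float32)
    out = []
    for c in chunks:
        out.extend([np.asarray(c, dtype=np.float32), gap])
    return np.concatenate(out[:-1])  # drop the trailing gap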
# Resample (e.g. 24kHz → 16kHz for ASR)
import librosa
def resample(audio, src_sr, dst_sr):
    return librosa.resample(audio, orig_sr=src_sr, target_sr=dst_sr)
# Save as MP3
import subprocess, tempfile, os
def save_mp3(audio, sr, path):
    with tempfile.NamedTemporaryFile(suffix=".wav", delete=False) as t:
        sf.write(t.name, audio, sr, subtype="PCM_16")
        tmp = t.name
    subprocess.run(["ffmpeg", "-y", "-i", tmp, "-codec:a", "libmp3lame",
                    "-qscale:a", "2", path], check=True, capture_output=True)
    os.unlink(tmp)

Common Error Fixes
| Error | Cause | Fix |
|---|---|---|
| ModuleNotFoundError: kokoro | Not installed | pip install kokoro |
| espeak-ng: command not found | Missing system dep | sudo apt install espeak-ng |
| Robotic voice | Cheap voice / no punctuation | Try af_heart, add punctuation |
| Mispronounced acronym | G2P reads it as a word | Add periods: A.P.I. |
| Audio click at sentence end | No fade | Apply fade() to each chunk |
| XTTS-v2 VRAM error | Model too large for GPU | gpu=False → use CPU |
| Bark too slow | CPU inference | Use small models or a GPU |
| edge-tts network error | No internet / rate limit | Retry with backoff, or use Kokoro |
| sounddevice no output | Wrong device | sd.query_devices() → set sd.default.device |
| Cloned voice sounds wrong | Bad reference audio | Clean mono 24 kHz clip, 3–15 s |
| Numbers spoken wrong | Not pre-processed | Use num2words |
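The edge-tts row generalizes into a pattern: retry the cloud voice with backoff, then fall back to offline Kokoro. A minimal sketch (retry count and wiring are my own; stream_speak is defined in the streaming section above):

import asyncio, edge_tts

async def speak_with_fallback(text, voice="en-US-AriaNeural", retries=3):
    for attempt in range(retries):
        try:
            await edge_tts.Communicate(text, voice).save("output.mp3")
            return "output.mp3"
        except Exception:
            await asyncio.sleep(2 ** attempt)  # exponential backoff: 1s, 2s, 4s
    stream_speak(text)  # blocking offline fallback (see Streaming TTS above)
    return None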
Voice Pipeline Latency Summary
| Component | Typical Latency | Notes |
|---|---|---|
| VAD utterance end | 600 ms | Configurable silence window |
| ASR (Whisper base) | 100–300 ms | faster-whisper int8, CPU |
| LLM first token | 200–500 ms | Depends on model/hardware |
| Sentence split | ~10 ms | After ~20 chars buffered |
| Kokoro synthesis | 30–80 ms/sentence | CPU, first chunk only |
| Playback start | ~5 ms | sounddevice buffer |
| Total first audio | 1.0–1.5 s | End-to-end, CPU only |

All TTS Files
| # | Topic | Link |
|---|---|---|
| 01 | Introduction & Overview | → |
| 02 | Algorithms & Theory | → |
| 03 | Libraries Comparison | → |
| 04 | Installation | → |
| 05 | Implementation | → |
| 06 | Real-Time Streaming | → |
| 07 | Integration Guide | → |
| 08 | Troubleshooting | → |
| 09 | Cheatsheet | ← you are here |