TTS Cheatsheet
2026-03-21
Quick-reference for install commands, library selection, code snippets, and common fixes.
Install Commands
# Kokoro (fastest offline, Apache 2.0)
pip install kokoro soundfile numpy sounddevice
pip install misaki[en] # English G2P
sudo apt install espeak-ng # Required for phonemization
# F5-TTS (best voice cloning, MIT)
pip install f5-tts
# Coqui XTTS-v2 (multilingual cloning, CPML)
pip install TTS
# Bark (expressive/creative, MIT — slow)
pip install bark transformers accelerate
# edge-tts (cloud Azure, zero setup, MIT client)
pip install edge-tts
# OpenVoice V2 (style transfer, MIT)
pip install openvoice melo-tts
# pyttsx3 (OS TTS, zero ML, MIT)
pip install pyttsx3
# Audio utilities
pip install sounddevice soundfile librosa num2words ffmpeg-python
# System deps (Ubuntu)
sudo apt install ffmpeg libsndfile1 portaudio19-dev espeak-ng
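# Quick sanity check (my own addition): the core offline stack should import cleanly
python -c "import kokoro, soundfile, sounddevice; print('TTS stack OK')"

Library Selection Guide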
| Need | Best Choice | Why |
|---|---|---|
| Fastest CPU, offline | Kokoro | RTF ~0.05, Apache 2.0 |
| Best voice cloning | F5-TTS | Flow matching, MIT |
| Multilingual cloning | XTTS-v2 | 17 languages |
| Zero setup, cloud | edge-tts | 400+ voices, free |
| Expressive/emotion | Bark | Laughter, music tokens |
| Voice style transfer | OpenVoice V2 | Tone color converter |
| No ML, instant | pyttsx3 | OS engine wrapper |
| Low latency streaming | Kokoro | Generator-based, first chunk < 100ms |
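RTF (real-time factor) is synthesis time divided by audio duration, so 0.05 means one second of speech takes about 50 ms to generate. A minimal sketch for measuring it on your own hardware (the warm-up pass and test sentence are my own choices):

import time
import numpy as np
from kokoro import KPipeline

pipe = KPipeline(lang_code="a")
_ = list(pipe("Warm up.", voice="af_heart"))  # exclude model/voice load from timing
t0 = time.perf_counter()
audio = np.concatenate([np.asarray(a) for _, _, a in pipe(
    "The quick brown fox jumps over the lazy dog.", voice="af_heart")])
rtf = (time.perf_counter() - t0) / (len(audio) / 24000)
print(f"RTF: {rtf:.3f}")  # below 1.0 means faster than real time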
Kokoro One-Liners
from kokoro import KPipeline
import numpy as np, soundfile as sf
pipe = KPipeline(lang_code="a") # "a"=US EN, "b"=British, "j"=JA, "z"=ZH
# --- Synthesize to file ---
sf.write("out.wav", np.concatenate([a for _,_,a in pipe("Hello!", voice="af_heart")]), 24000)
# --- Stream to speakers ---
import sounddevice as sd
for _, _, audio in pipe("Hello world!", voice="af_heart"):
    sd.play(audio, 24000); sd.wait()
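# --- Custom text splitting (split_pattern kwarg, per the Kokoro README) ---
for _, _, audio in pipe("Line one.\nLine two.", voice="af_heart", split_pattern=r"\n+"):
    sd.play(audio, 24000); sd.wait()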
# --- Different voices ---
VOICES = ["af_heart", "af_bella", "af_sarah", "am_adam", "am_michael", "bf_emma", "bm_george"]
# --- Speed control ---
for _, _, audio in pipe("Speaking slowly.", voice="af_heart", speed=0.8):
    ...
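# --- Save each generated segment to its own file ---
for i, (_, _, audio) in enumerate(pipe("One. Two. Three.", voice="af_heart")):
    sf.write(f"seg_{i}.wav", audio, 24000)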
# --- British English ---
pipe_gb = KPipeline(lang_code="b")
for _, _, audio in pipe_gb("Brilliant!", voice="bf_emma"):
    ...

XTTS-v2 Voice Cloning
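Both cloning snippets below (XTTS-v2 and F5-TTS) want a short, clean reference clip. A hedged prep helper using librosa; the trim threshold and 15-second cap are my own choices, not part of either API:

import librosa, soundfile as sf

def prep_reference(path, out="reference.wav", sr=24000):
    y, _ = librosa.load(path, sr=sr, mono=True)  # resample + downmix to mono
    y, _ = librosa.effects.trim(y, top_db=30)    # drop leading/trailing silence
    sf.write(out, y[: sr * 15], sr)              # cap at 15 s
    return out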
from TTS.api import TTS
tts = TTS("tts_models/multilingual/multi-dataset/xtts_v2", gpu=False)
tts.tts_to_file(
    text="Hello in my cloned voice.",
    speaker_wav="reference.wav",  # 3-15s clean speech
    language="en",
    file_path="output.wav",
)

F5-TTS Voice Cloning
from f5_tts.api import F5TTS
import soundfile as sf
tts = F5TTS()
wav, sr, _ = tts.infer(
    ref_file="reference.wav",
    ref_text="Exact transcript of reference.wav",
    gen_text="Text to generate in that voice.",
    nfe_step=32,
)
sf.write("output.wav", wav, sr)

edge-tts Quick Start
import asyncio, edge_tts
async def speak(text, voice="en-US-AriaNeural"):
    await edge_tts.Communicate(text, voice).save("output.mp3")
asyncio.run(speak("Hello from edge-tts!"))
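# Adjust delivery: rate/volume accept signed percent strings like "+25%"
async def speak_fast(text, voice="en-US-AriaNeural"):
    await edge_tts.Communicate(text, voice, rate="+25%", volume="+0%").save("fast.mp3")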
# List all voices
async def voices():
    return await edge_tts.list_voices()

Streaming TTS (Sentence by Sentence)
import threading, queue
import numpy as np
import sounddevice as sd
from kokoro import KPipeline
def stream_speak(text: str, voice: str = "af_heart"):
    """Generate and play sentence by sentence."""
    pipe = KPipeline(lang_code="a")
    audio_q: queue.Queue = queue.Queue()

    def player():
        with sd.OutputStream(samplerate=24000, channels=1, dtype="float32") as s:
            while True:
                chunk = audio_q.get()
                if chunk is None:  # sentinel: generation finished
                    break
                # Kokoro yields tensors; sounddevice wants float32 numpy
                s.write(np.asarray(chunk, dtype="float32").reshape(-1, 1))

    t = threading.Thread(target=player, daemon=True)
    t.start()
    for _, _, audio in pipe(text, voice=voice):  # one chunk per sentence
        audio_q.put(audio)
    audio_q.put(None)
    t.join()

stream_speak("First sentence plays immediately. Second sentence follows. Done!")

Text Pre-Processing
import re
from num2words import num2words # pip install num2words
def clean_for_tts(text: str) -> str:
    text = re.sub(r"\bAPI\b", "A.P.I.", text)
    text = re.sub(r"\bLLM\b", "L.L.M.", text)
    text = re.sub(r"\bGPU\b", "G.P.U.", text)
    text = re.sub(r"\b(\d+)\b", lambda m: num2words(int(m.group())), text)
    text = text.replace("C++", "C plus plus").replace("C#", "C sharp")
    text = re.sub(r"`[^`]+`", "", text)           # strip inline code
    text = re.sub(r"\*+([^*]+)\*+", r"\1", text)  # strip markdown bold/italic
    text = re.sub(r"#{1,6}\s", "", text)          # strip headings
    return text.strip()
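print(clean_for_tts("The API uses 2 threads."))  # -> "The A.P.I. uses two threads."

Audio Utilities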
import numpy as np
import soundfile as sf
# Fade in/out (prevents clicks)
def fade(audio, ms=10, sr=24000):
    n = min(int(ms * sr / 1000), len(audio) // 4)
    if n == 0:  # clip too short to fade
        return audio
    audio[:n] *= np.linspace(0, 1, n)
    audio[-n:] *= np.linspace(1, 0, n)
    return audio
# Normalize
def normalize(audio, peak=0.95):
    m = np.abs(audio).max()
    return audio / m * peak if m > 0 else audio
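# Join chunks with a short pause between sentences (my own helper, not from any library)
def join_with_pause(chunks, ms=200, sr=24000):
    gap = np.zeros(int(ms * sr / 1000), dtype=np.float32)
    out = []
    for c in chunks:
        out.extend([np.asarray(c, dtype=np.float32), gap])
    return np.concatenate(out[:-1])  # drop the trailing gap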
# Resample (e.g. 24kHz → 16kHz for ASR)
import librosa
def resample(audio, src_sr, dst_sr):
    return librosa.resample(audio, orig_sr=src_sr, target_sr=dst_sr)
# Save as MP3
import subprocess, tempfile, os
def save_mp3(audio, sr, path):
    with tempfile.NamedTemporaryFile(suffix=".wav", delete=False) as t:
        sf.write(t.name, audio, sr, subtype="PCM_16")
        tmp = t.name
    subprocess.run(["ffmpeg", "-y", "-i", tmp, "-codec:a", "libmp3lame",
                    "-qscale:a", "2", path], check=True, capture_output=True)
    os.unlink(tmp)

Common Error Fixes
| Error | Cause | Fix |
|---|---|---|
| ModuleNotFoundError: kokoro | Not installed | pip install kokoro |
| espeak-ng: command not found | Missing system dep | sudo apt install espeak-ng |
| Robotic voice | Cheap voice / no punctuation | Try af_heart, add punctuation |
| Mispronounced acronym | G2P reads it as a word | Add periods: A.P.I. |
| Audio click at sentence end | No fade | Apply fade() to each chunk |
| XTTS-v2 VRAM error | Model too large for GPU | gpu=False → use CPU |
| Bark too slow | CPU inference | Use small models or a GPU |
| edge-tts network error | No internet / rate limit | Retry with backoff, or use Kokoro |
| sounddevice no output | Wrong device | sd.query_devices() → set sd.default.device |
| Cloned voice sounds wrong | Bad reference audio | Clean mono 24 kHz clip, 3–15 s |
| Numbers spoken wrong | Not pre-processed | Use num2words |
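The edge-tts row generalizes into a pattern: retry the cloud voice with backoff, then fall back to offline Kokoro. A minimal sketch (retry count and wiring are my own; stream_speak is defined in the streaming section above):

import asyncio, edge_tts

async def speak_with_fallback(text, voice="en-US-AriaNeural", retries=3):
    for attempt in range(retries):
        try:
            await edge_tts.Communicate(text, voice).save("output.mp3")
            return "output.mp3"
        except Exception:
            await asyncio.sleep(2 ** attempt)  # exponential backoff: 1s, 2s, 4s
    stream_speak(text)  # blocking offline fallback (see Streaming TTS above)
    return None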
Voice Pipeline Latency Summary
| Component | Typical Latency | Notes |
|---|---|---|
| VAD utterance end | 600 ms | Configurable silence window |
| ASR (Whisper base) | 100–300 ms | faster-whisper int8, CPU |
| LLM first token | 200–500 ms | Depends on model/hardware |
| Sentence split | ~10 ms | After ~20 chars buffered |
| Kokoro synthesis | 30–80 ms/sentence | CPU, first chunk only |
| Playback start | ~5 ms | sounddevice buffer |
| Total first audio | 1.0–1.5 s | End-to-end, CPU only |

All TTS Files
| # | Topic | Link |
|---|---|---|
| 01 | Introduction & Overview | → |
| 02 | Algorithms & Theory | → |
| 03 | Libraries Comparison | → |
| 04 | Installation | → |
| 05 | Implementation | → |
| 06 | Real-Time Streaming | → |
| 07 | Integration Guide | → |
| 08 | Troubleshooting | → |
| 09 | Cheatsheet | ← you are here |