ASR Cheatsheet

About 883 wordsAbout 3 min

2026-03-21

Quick-reference for install commands, model selection, code snippets, and common fixes.

Install Commands

# Core (recommended stack)
pip install faster-whisper soundfile numpy

# Real-time microphone
pip install sounddevice

# Silero VAD (for silence detection)
pip install torch torchaudio
# torch.hub.load("snakers4/silero-vad", "silero_vad")

# RealtimeSTT (batteries-included real-time)
pip install RealtimeSTT

# openai-whisper (reference impl)
pip install openai-whisper

# HuggingFace Whisper / Wav2Vec2
pip install transformers accelerate

# Vosk (offline, small models)
pip install vosk

# Accuracy evaluation
pip install jiwer

# FastAPI ASR endpoint
pip install fastapi uvicorn python-multipart httpx

# System dependencies (Ubuntu)
sudo apt update && sudo apt install -y ffmpeg libsndfile1 portaudio19-dev

Model Selection Guide

Model	Size	Speed vs large-v3	WER (LibriSpeech)	Best for
`tiny.en`	39 MB	~32×	~5.7%	Live demo, embedded
`base.en`	74 MB	~16×	~4.2%	General offline use
`small.en`	244 MB	~6×	~3.0%	Good accuracy + speed
`medium.en`	769 MB	~2×	~2.4%	High accuracy, server
`large-v2`	1.5 GB	1×	~2.0%	Best accuracy (stable)
`large-v3`	1.5 GB	1×	~1.8%	Best accuracy (current)
`turbo`	809 MB	~8×	~2.1%	Best accuracy/speed ratio

Remove .en suffix for multilingual models (e.g. "base" instead of "base.en").

Compute Type Guide (faster-whisper)

compute_type	Device	Memory	Speed	Use when
`float32`	CPU	High	Slow	Debugging only
`int8`	CPU	Low	Fast	CPU production
`float16`	GPU	Medium	Fast	GPU with 2+ GB VRAM
`int8_float16`	GPU	Low	Fastest	GPU with limited VRAM
`bfloat16`	GPU	Medium	Fast	A100 / H100 GPUs

faster-whisper One-Liners

from faster_whisper import WhisperModel
model = WhisperModel("base.en", device="cpu", compute_type="int8")

# --- Transcribe a file ---
segs, info = model.transcribe("audio.wav", beam_size=5)
text = " ".join(s.text.strip() for s in segs)

# --- With word timestamps ---
segs, _ = model.transcribe("audio.wav", word_timestamps=True)
for seg in segs:
    for w in seg.words:
        print(f"{w.start:.2f}-{w.end:.2f}: {w.word}")

# --- Skip silence with built-in VAD ---
segs, _ = model.transcribe("audio.wav", vad_filter=True)

# --- Force language ---
segs, _ = model.transcribe("audio.wav", language="en")

# --- Translate to English ---
segs, _ = model.transcribe("audio.wav", task="translate")

# --- Fast (greedy, no beam search) ---
segs, _ = model.transcribe("audio.wav", beam_size=1, best_of=1)

# --- From numpy float32 array ---
import numpy as np
audio = np.zeros(16000 * 3, dtype=np.float32)  # 3s silence
segs, _ = model.transcribe(audio, beam_size=5)

# --- Detect language ---
_, info = model.transcribe("audio.wav", beam_size=1)
print(info.language, info.language_probability)

VAD + ASR Pipeline

import torch
import numpy as np
from faster_whisper import WhisperModel

# Load models once
vad_model, utils = torch.hub.load("snakers4/silero-vad", "silero_vad")
get_speech_ts = utils[0]
asr_model = WhisperModel("base.en", device="cpu", compute_type="int8")

def transcribe_with_vad(audio: np.ndarray, sr: int = 16000) -> str:
    tensor = torch.tensor(audio)
    speech_timestamps = get_speech_ts(tensor, vad_model, sampling_rate=sr)
    
    if not speech_timestamps:
        return ""
    
    parts = []
    for ts in speech_timestamps:
        chunk = audio[ts["start"]:ts["end"]]
        segs, _ = asr_model.transcribe(chunk, beam_size=5)
        parts.append(" ".join(s.text.strip() for s in segs))
    
    return " ".join(parts)

RealtimeSTT Quick Start

from RealtimeSTT import AudioToTextRecorder

def on_text(text):
    print(f">> {text}")

recorder = AudioToTextRecorder(
    model="base.en",
    language="en",
    silero_sensitivity=0.4,
    on_realtime_transcription_stabilized=on_text,
)

print("Speak... (Ctrl+C to quit)")
try:
    while True:
        recorder.text(on_text)
except KeyboardInterrupt:
    recorder.stop()

WER Evaluation

from jiwer import wer, cer

reference = "the quick brown fox jumps over the lazy dog"
hypothesis = "the quick brown fox jumped over the lazy dog"

print(f"WER: {wer(reference, hypothesis):.1%}")
print(f"CER: {cer(reference, hypothesis):.1%}")

SRT Generation One-Liner

from faster_whisper import WhisperModel

def to_srt_time(t):
    h,m,s,ms = int(t//3600),int(t%3600//60),int(t%60),int(t%1*1000)
    return f"{h:02d}:{m:02d}:{s:02d},{ms:03d}"

model = WhisperModel("small", device="cpu", compute_type="int8")
segs, _ = model.transcribe("video.mp4", beam_size=5)

with open("subtitles.srt", "w") as f:
    for i, seg in enumerate(segs, 1):
        f.write(f"{i}\n{to_srt_time(seg.start)} --> {to_srt_time(seg.end)}\n{seg.text.strip()}\n\n")

Common Error Fixes

Error	Cause	Fix
`FileNotFoundError: ffmpeg`	ffmpeg not installed	`sudo apt install ffmpeg`
`CUDA error: out of memory`	Model too large for VRAM	Use smaller model or `int8_float16`
Hallucinations on silence	No VAD	Add `vad_filter=True` or pre-filter with silero
Wrong language	Auto-detect failed	Set `language="en"` explicitly
Words cut off	No context between windows	Set `condition_on_previous_text=True`
Very slow on CPU	Default float32	Use `compute_type="int8"`
`InvalidInputDevice`	Wrong audio device	Run `python -c "import sounddevice as sd; print(sd.query_devices())"`
Duplicate words	High temperature	Set `temperature=0` in transcribe
`No module named 'torch'`	Missing dep for VAD	`pip install torch torchaudio`

Key Parameters Reference

model.transcribe(
    audio,                          # path str or float32 numpy array
    language=None,                  # None=auto, "en", "fr", "de", ...
    task="transcribe",              # "transcribe" or "translate"
    beam_size=5,                    # 1=greedy (fast), 5=default, 10=slower
    best_of=5,                      # number of candidates with temperature>0
    patience=1.0,                   # beam search patience factor
    temperature=0,                  # 0=deterministic, 0.2-1.0=sampling
    compression_ratio_threshold=2.4,# filter repeated outputs
    log_prob_threshold=-1.0,        # filter low-confidence segments
    no_speech_threshold=0.6,        # filter silent/noise segments
    condition_on_previous_text=True,# use prev text as context
    initial_prompt=None,            # seed Whisper with domain vocabulary
    word_timestamps=False,          # enable per-word timestamps
    vad_filter=False,               # built-in silero VAD
    vad_parameters=None,            # dict with VAD settings
)

All ASR Files

#	Topic	Link
01	Introduction & Overview	→
02	Algorithms & Theory	→
03	Libraries Comparison	→
04	Installation	→
05	Implementation	→
06	Real-Time Streaming	→
07	Integration Guide	→
08	Troubleshooting	→
09	Cheatsheet	← you are here

AI

VAD

ASR

TTS

llama-swap

llama.cpp

Embedded Sytems

EDK2-UEFI

U-Boot

Yocto

QEMU

QNX

AUTOSAR Adaptive

MISRA C++

ASIL

ASPICE

DevOps

Conan

Artifactory

Jenkins

ASR Cheatsheet

Install Commands

Model Selection Guide

Compute Type Guide (faster-whisper)

faster-whisper One-Liners

VAD + ASR Pipeline

RealtimeSTT Quick Start

WER Evaluation

SRT Generation One-Liner

Common Error Fixes

Key Parameters Reference

All ASR Files