ASR Troubleshooting
2026-03-21
Common problems and their solutions when working with ASR libraries, particularly faster-whisper and openai-whisper.
Hallucinations on Silence
Symptom: Whisper outputs random text, music notation, or "[MUSIC]" when the audio contains silence or low-level noise.
Root cause: Whisper was trained on 30-second clips and fills empty audio with plausible outputs. It does not inherently know "there is nothing to transcribe here."
Fixes:
from faster_whisper import WhisperModel
model = WhisperModel("base.en", device="cpu", compute_type="int8")
# 1. Enable built-in VAD filter
segments, info = model.transcribe(
    "audio.wav",
    vad_filter=True,
    vad_parameters={
        "threshold": 0.5,               # speech prob threshold (0–1)
        "min_speech_duration_ms": 250,  # ignore very short bursts
        "min_silence_duration_ms": 600,
        "speech_pad_ms": 200,
    },
)
# 2. Filter by no_speech_prob after the fact
for seg in segments:
    if seg.no_speech_prob > 0.6:
        continue  # skip silent/noise segments
    print(seg.text)
# 3. Filter by log-probability (very low = hallucination)
for seg in segments:
    if seg.avg_logprob < -1.0:
        continue
    print(seg.text)

Run VAD before ASR (strongest fix):
# Use silero-vad to extract only speech windows first,
# then feed only those chunks to Whisper.
# See /kb/ai/vad/real-time-streaming/ for implementation.
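A minimal sketch of that approach using silero-vad via torch.hub; the helper-function order follows the silero-vad README, and the chunk concatenation is illustrative rather than prescriptive:

import torch
from faster_whisper import WhisperModel

# Load Silero VAD and its helper functions from torch.hub
vad, utils = torch.hub.load("snakers4/silero-vad", "silero_vad")
get_speech_timestamps, _, read_audio, _, collect_chunks = utils

wav = read_audio("audio.wav", sampling_rate=16000)
speech_ts = get_speech_timestamps(wav, vad, sampling_rate=16000)

# Concatenate only the speech regions and hand them to Whisper
speech_only = collect_chunks(speech_ts, wav).numpy()
model = WhisperModel("base.en", device="cpu", compute_type="int8")
segments, _ = model.transcribe(speech_only, language="en")
print(" ".join(seg.text.strip() for seg in segments))

Note that concatenating chunks discards the original timeline; keep speech_ts around if you need to map segment times back to the source audio.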
Wrong Language Detected
Symptom: Whisper transcribes in the wrong language, or transliterates instead of transcribing.
Root cause: Auto-detection uses the first 30 seconds. If that segment is noisy or contains music, the prediction can be wrong.
Fixes:
# Force the language explicitly
segments, info = model.transcribe(
    "audio.wav",
    language="en",  # ISO 639-1 code: "fr", "de", "es", "ja", ...
)
# Check what was detected before forcing
_, info = model.transcribe("audio.wav", beam_size=1)
print(f"Detected: {info.language} ({info.language_probability:.1%})")
# If probability is low (< 0.7), force your expected language
if info.language_probability < 0.7:
    _, _ = model.transcribe("audio.wav", language="en")

# openai-whisper: detect language from first 30s only
import whisper
model = whisper.load_model("base")
audio = whisper.load_audio("audio.wav")
audio_clip = whisper.pad_or_trim(audio)
mel = whisper.log_mel_spectrogram(audio_clip).to(model.device)
_, probs = model.detect_language(mel)
print(max(probs, key=probs.get))

CUDA Out of Memory
Symptom: torch.cuda.OutOfMemoryError or CUDA error: out of memory.
Fixes:
# 1. Use a smaller model
model = WhisperModel("small.en", device="cuda", compute_type="float16")
# instead of: WhisperModel("large-v3", ...)
# 2. Use int8 quantization (halves VRAM usage)
model = WhisperModel("base.en", device="cuda", compute_type="int8_float16")
# int8 weights, float16 activations — best balance
# 3. Use fp16 inference (openai-whisper) to halve activation memory
result = model.transcribe(audio, fp16=True, verbose=False)
# 4. Force CPU fallback if GPU OOM
try:
    model = WhisperModel("base.en", device="cuda", compute_type="float16")
except Exception:
    print("GPU failed, falling back to CPU")
    model = WhisperModel("base.en", device="cpu", compute_type="int8")

VRAM requirements reference:
| Model | float16 | int8_float16 |
|---|---|---|
| tiny | 1 GB | 0.5 GB |
| base | 1 GB | 0.5 GB |
| small | 2 GB | 1 GB |
| medium | 5 GB | 2.5 GB |
| large-v3 | 10 GB | 5 GB |

int8 on CPU uses system RAM rather than VRAM, so it has no VRAM figure.
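Before picking a model size, it can help to check free VRAM directly. A quick sketch, assuming PyTorch is installed:

import torch

if torch.cuda.is_available():
    free, total = torch.cuda.mem_get_info()  # bytes
    print(f"VRAM free: {free / 1e9:.1f} GB of {total / 1e9:.1f} GB")
else:
    print("No CUDA device; use device='cpu' with compute_type='int8'")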
Slow CPU Inference
Symptom: Transcription takes far too long — e.g., 2 minutes to transcribe 30 seconds of audio.
Fixes:
# 1. Use int8 compute type (most impactful)
model = WhisperModel("base.en", device="cpu", compute_type="int8")
# vs default float32 which is 2–4× slower
# 2. Use the smallest model that meets your accuracy needs
# "tiny.en" is ~32× faster than "large-v3"
# 3. Reduce beam size (another large speed lever, small accuracy trade-off)
segments, _ = model.transcribe("audio.wav", beam_size=1) # greedy
# default beam_size=5 does 5× more work
# 4. Disable word timestamps if not needed
segments, _ = model.transcribe("audio.wav", word_timestamps=False)
# 5. Set num_workers for CTranslate2
model = WhisperModel(
    "base.en",
    device="cpu",
    compute_type="int8",
    cpu_threads=8,   # use all physical cores
    num_workers=2,   # parallel decode workers
)

Rule of thumb:
tiny.en + int8 + beam_size=1 → ~10–15× real-time on a 4-core laptop.
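To verify that rule of thumb on your own machine, a minimal timing sketch (the audio.wav file name is a placeholder):

import time
from faster_whisper import WhisperModel

model = WhisperModel("tiny.en", device="cpu", compute_type="int8")
start = time.perf_counter()
segments, info = model.transcribe("audio.wav", beam_size=1)
_ = list(segments)  # segments is a lazy generator; consume it to time the decode
elapsed = time.perf_counter() - start
print(f"{info.duration:.1f}s of audio in {elapsed:.1f}s "
      f"(~{info.duration / elapsed:.1f}x real-time)")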
Words Cut Off at Segment Boundaries
Symptom: The last word of one segment is repeated at the start of the next, or words are swallowed at boundaries.
Root cause: Whisper processes 30-second windows. Without context, it does not know what was said in the previous window.
Fix:
segments, _ = model.transcribe(
    "audio.wav",
    condition_on_previous_text=True,  # feed prev segment text as prompt
    # Default is True; if you're seeing repetition, try False:
    # condition_on_previous_text=False,
    initial_prompt="This is a technical discussion about software engineering.",
    # ^ Providing context improves first-segment accuracy
)

For streaming, always add padding before and after the speech window:
# Pad the speech window with 200ms of audio on each side
sample_rate = 16000  # Hz; match your capture rate
PREROLL = int(0.2 * sample_rate)
speech_audio = full_audio[max(0, start_idx - PREROLL) : end_idx + PREROLL]

Word Timestamps Misaligned
Symptom: Word-level timestamps are off by 200–500ms or more.
Root cause: DTW alignment uses cross-attention weights which can be imprecise near segment boundaries.
Fixes:
# faster-whisper uses the model's alignment heads — try larger models for better alignment
model = WhisperModel("small.en", device="cpu", compute_type="int8")  # better timestamps than tiny
segments, _ = model.transcribe(
    "audio.wav",
    word_timestamps=True,
    # Do not mix with vad_filter if you need precise timestamps;
    # vad_filter shifts the timeline
    vad_filter=False,
)
# Post-process: clamp word times to segment boundaries
for seg in segments:
    for w in (seg.words or []):
        start = max(w.start, seg.start)
        end = min(w.end, seg.end)
        print(f"{w.word} [{start:.2f}s-{end:.2f}s]")

ffmpeg Not Found
Symptom: FileNotFoundError: [Errno 2] No such file or directory: 'ffmpeg' or AudioStream error.
Fix:
# Ubuntu / Debian
sudo apt install ffmpeg
# macOS
brew install ffmpeg
# Windows (Chocolatey)
choco install ffmpeg
# Verify
ffmpeg -version

If ffmpeg is installed but not on PATH in a virtual environment:
import os
os.environ["PATH"] += os.pathsep + "/usr/bin"  # adjust path as needed
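To confirm which ffmpeg binary (if any) the Python process actually sees:

import shutil

path = shutil.which("ffmpeg")
print(path or "ffmpeg not found on PATH for this interpreter")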
faster-whisper Model Format Errors
Symptom: ctranslate2.converters.OptimizerError or ValueError: Unsupported model type.
Cause: Using an OpenAI .pt checkpoint directly instead of a CTranslate2-converted model.
Fix:
# Easiest fix: pass a model name and let faster-whisper fetch the
# CTranslate2-converted weights automatically:
python3 -c "
from faster_whisper import WhisperModel
# This auto-downloads the correct CTranslate2 model format:
model = WhisperModel('base.en')
print('Model loaded OK')
"

Or convert manually:
pip install ctranslate2 "transformers[torch]"
ct2-transformers-converter --model openai/whisper-base.en --output_dir whisper-base-ct2 --copy_files tokenizer.json preprocessor_config.json --quantization int8
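After conversion, point WhisperModel at the output directory instead of a model name (directory name matches the command above):

from faster_whisper import WhisperModel

# Load the locally converted model from its directory
model = WhisperModel("whisper-base-ct2", device="cpu", compute_type="int8")
segments, info = model.transcribe("audio.wav")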
PortAudio / sounddevice Errors
Symptom: OSError: [Errno -9996] Invalid input device or ALSA lib pcm.c:8526.
Fixes:
# List available input devices
python3 -c "import sounddevice as sd; print(sd.query_devices())"
# Set device explicitly by index
import sounddevice as sd
sd.default.device = (1, None) # (input_device_id, output_device_id)
# Linux: reduce ALSA error spam by giving it a valid default device
# Add to ~/.asoundrc:
# pcm.!default { type hw card 0 }
# ctl.!default { type hw card 0 }

# Install portaudio headers if build fails
sudo apt install portaudio19-dev
pip install sounddevice
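Once the device index is set, a two-second capture test confirms audio is actually flowing:

import numpy as np
import sounddevice as sd

SAMPLE_RATE = 16000
rec = sd.rec(int(2 * SAMPLE_RATE), samplerate=SAMPLE_RATE,
             channels=1, dtype="float32")
sd.wait()  # block until recording finishes
print(f"peak level: {np.abs(rec).max():.4f}")  # ~0 means no signal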
Audio Quality Issues
| Issue | Cause | Fix |
|---|---|---|
| High WER on clean speech | Sample rate mismatch | Ensure 16kHz mono input |
| Constant "uh", "um" in output | Filler words not suppressed | Post-process with regex (see sketch below) |
| Foreign words transliterated | Wrong language set | Force the language= param |
| Numbers transcribed as words | No custom vocabulary | Use initial_prompt with domain words |
| Acronyms wrong | BPE tokenizer splits them | Add them to initial_prompt |
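For the filler-word row above, a minimal regex cleanup (the word list is illustrative; extend it for your domain):

import re

FILLERS = re.compile(r"\b(?:uh|um|erm|uhh|umm)\b[,.]?\s*", re.IGNORECASE)

def strip_fillers(text: str) -> str:
    return FILLERS.sub("", text).strip()

print(strip_fillers("So, um, the deploy is, uh, done."))
# -> "So, the deploy is, done."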
Quick Diagnostic Checklist
# diagnostics.py
import soundfile as sf
import numpy as np
def diagnose_audio(path: str):
    audio, sr = sf.read(path, always_2d=False)
    duration = len(audio) / sr
    rms = np.sqrt(np.mean(audio**2))
    channels = 1 if audio.ndim == 1 else audio.shape[1]
    print(f"File: {path}")
    print(f"Sample rate: {sr} Hz (need 16000)")
    print(f"Channels: {channels} (need 1 mono)")
    print(f"Duration: {duration:.2f}s")
    print(f"RMS level: {rms:.4f} (0.01–0.3 OK, <0.005 = silent)")
    print(f"Dtype: {audio.dtype}")
    if sr != 16000:
        print("WARNING: Resample to 16kHz before ASR")
    if channels > 1:
        print("WARNING: Convert to mono (average channels)")
    if rms < 0.005:
        print("WARNING: Audio may be too quiet — check gain")
if __name__ == "__main__":
    import sys
    diagnose_audio(sys.argv[1])

See Also
- ASR Installation
- ASR Implementation
- ASR Cheatsheet
- VAD Troubleshooting — Many ASR problems (hallucinations, empty results) originate in the VAD stage
- TTS Troubleshooting — Use ASR as a round-trip intelligibility test for TTS output