VAD Cheatsheet
2026-03-21
Quick reference for everything VAD — install commands, config, code snippets, and decision tables.
Install Commands
# Minimal (no ML framework)
pip install webrtcvad soundfile numpy scipy
# Recommended general purpose
pip install silero-vad torch torchaudio soundfile sounddevice numpy
# Full (all libraries)
pip install silero-vad webrtcvad pyannote.audio speechbrain \
    torch torchaudio soundfile sounddevice numpy scipy librosa
# Audio I/O
pip install pyaudio # microphone input (PyAudio)
pip install sounddevice # microphone input (sounddevice — easier)
pip install soundfile # WAV read/write
# System dependencies (Ubuntu/Debian)
sudo apt install -y ffmpeg portaudio19-dev build-essential python3-dev
Library Selection
Need <1ms latency, no ML, embedded? → webrtcvad (mode 2)
Need best accuracy, general purpose? → silero-vad (threshold 0.5)
Need speaker diarization + VAD? → pyannote.audio
No PyTorch allowed, simple splitting? → auditok
Full SpeechBrain ASR pipeline? → speechbrain VAD
NVIDIA GPU + NeMo? → MarbleNet
Audio Requirements
| Property | webrtcvad | silero-vad | pyannote |
|---|---|---|---|
| Sample rate | 8000 / 16000 / 32000 / 48000 Hz | 8000 / 16000 Hz | any (resamples) |
| Channels | mono | mono | mono |
| Format | 16-bit PCM bytes | float32 torch tensor | file path or waveform |
| Frame size | exactly 10/20/30 ms | 512 samples (16 kHz) or 256 samples (8 kHz) | N/A |
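Because webrtcvad rejects frames of any other length, it can help to validate and split a PCM buffer before calling the detector. The following is a minimal sketch using only the standard library. It assumes 16-bit mono PCM, and `split_frames`, `VALID_RATES`, and `VALID_FRAME_MS` are hypothetical helper names, not part of webrtcvad.

```python
VALID_RATES = (8000, 16000, 32000, 48000)  # sample rates webrtcvad accepts
VALID_FRAME_MS = (10, 20, 30)              # frame durations webrtcvad accepts
BYTES_PER_SAMPLE = 2                       # 16-bit mono PCM


def split_frames(pcm: bytes, sample_rate: int, frame_ms: int = 20) -> list:
    """Split 16-bit mono PCM into fixed-length frames.

    Any trailing partial frame is dropped, since a short frame would be rejected.
    """
    if sample_rate not in VALID_RATES:
        raise ValueError(f"unsupported sample rate: {sample_rate}")
    if frame_ms not in VALID_FRAME_MS:
        raise ValueError(f"unsupported frame duration: {frame_ms} ms")
    frame_bytes = sample_rate * frame_ms // 1000 * BYTES_PER_SAMPLE
    usable = len(pcm) - len(pcm) % frame_bytes
    return [pcm[i:i + frame_bytes] for i in range(0, usable, frame_bytes)]


# One second of digital silence at 16 kHz, split into 20 ms frames.
frames = split_frames(bytes(16000 * 2), 16000, frame_ms=20)
print(len(frames), len(frames[0]))  # → 50 640
```

Each returned frame can then be passed to `vad.is_speech(frame, sample_rate)`.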
silero-vad — File-Based
from silero_vad import load_silero_vad, read_audio, get_speech_timestamps
model = load_silero_vad()
wav = read_audio("audio.wav", sampling_rate=16000)
segments = get_speech_timestamps(
    wav, model,
    sampling_rate=16000,
    threshold=0.5,                # 0.3 (sensitive) ↔ 0.8 (strict)
    min_speech_duration_ms=250,   # ignore < 250ms
    min_silence_duration_ms=100,  # gap to split segments
    speech_pad_ms=30,             # padding before/after
    return_seconds=True,
)
# → [{"start": 1.23, "end": 4.56}, ...]
silero-vad — Chunk Streaming
import torch
import numpy as np
from silero_vad import load_silero_vad
model = load_silero_vad()
model.reset_states() # IMPORTANT: reset before each new stream
chunk = torch.zeros(512) # 512 samples at 16kHz = 32ms
prob = model(chunk, 16000).item() # → float [0.0, 1.0]
is_speech = prob > 0.5
webrtcvad — File-Based
import webrtcvad, soundfile as sf
vad = webrtcvad.Vad(mode=2) # aggressiveness 0–3
audio, sr = sf.read("audio.wav", dtype="int16", always_2d=False)
frame_ms = 20
frame_size = int(sr * frame_ms / 1000) # samples
pcm = audio.tobytes()
for i in range(0, len(pcm) - frame_size*2 + 1, frame_size*2):
    frame = pcm[i:i + frame_size*2]    # 2 bytes per 16-bit sample
    speech = vad.is_speech(frame, sr)  # True/False for this frame
Energy VAD (no ML dependencies)
import numpy as np, soundfile as sf
def energy_vad(filepath, threshold_db=-40.0, frame_ms=20):
    audio, sr = sf.read(filepath, dtype="float32", always_2d=False)
    if audio.ndim > 1: audio = audio.mean(axis=1)  # downmix to mono
    threshold = 10 ** (threshold_db / 20.0)        # dBFS → linear amplitude
    frame_size = int(sr * frame_ms / 1000)
    results = []
    for i in range(0, len(audio) - frame_size + 1, frame_size):
        rms = np.sqrt(np.mean(audio[i:i+frame_size] ** 2))
        results.append(rms > threshold)
    return results  # [True/False per frame]
Real-Time VAD (sounddevice + silero)
import sounddevice as sd, torch, queue, numpy as np
from silero_vad import load_silero_vad
SR = 16000; CHUNK = 512; THRESH = 0.5; SILENCE_LIMIT = 20
model = load_silero_vad(); model.reset_states()
q = queue.Queue()
def cb(indata, frames, time, status):
    q.put(indata[:, 0].copy())
speech_buf = []; silence_cnt = 0; speaking = False
with sd.InputStream(samplerate=SR, channels=1, dtype="float32",
                    blocksize=CHUNK, callback=cb):
    while True:
        chunk = q.get()
        prob = model(torch.from_numpy(chunk), SR).item()
        if prob > THRESH:
            speaking = True; silence_cnt = 0; speech_buf.append(chunk)
        elif speaking:
            speech_buf.append(chunk); silence_cnt += 1
            if silence_cnt >= SILENCE_LIMIT:
                utterance = np.concatenate(speech_buf)
                # → send utterance to Whisper / ASR
                speech_buf = []; silence_cnt = 0; speaking = False
                model.reset_states()
VAD + Whisper
import whisper, numpy as np
wmodel = whisper.load_model("base.en") # "tiny", "small", "medium", "large"
def transcribe(audio: np.ndarray, sr: int = 16000) -> str:
    # audio: mono float32 at 16 kHz
    return wmodel.transcribe(audio, fp16=False)["text"].strip()
# faster-whisper (recommended)
from faster_whisper import WhisperModel
fmodel = WhisperModel("base.en", device="cpu", compute_type="int8")
def transcribe_fast(audio: np.ndarray) -> str:
    segs, _ = fmodel.transcribe(audio, language="en", beam_size=3)
    return " ".join(s.text for s in segs).strip()
LLM API Call (llama.cpp / llama-swap)
import requests
def ask_llm(text: str, base_url="http://localhost:8080") -> str:
    r = requests.post(f"{base_url}/v1/chat/completions", json={
        "model": "llama-3.2-3b-instruct",
        "messages": [{"role": "user", "content": text}],
        "max_tokens": 200,
    }, timeout=60)
    r.raise_for_status()  # surface HTTP errors instead of failing on a missing key
    return r.json()["choices"][0]["message"]["content"].strip()
Merge Speech Segments
def merge_segments(segments, gap_s=0.5):
    if not segments: return []
    merged = [dict(segments[0])]
    for s in segments[1:]:
        if s["start"] - merged[-1]["end"] <= gap_s:
            merged[-1]["end"] = s["end"]  # close enough: extend previous segment
        else:
            merged.append(dict(s))
    return merged
Save Speech Segments
import soundfile as sf, os, numpy as np
def save_segments(filepath, segments, out_dir="segments", pad=0.3):
    audio, sr = sf.read(filepath, dtype="float32", always_2d=False)
    os.makedirs(out_dir, exist_ok=True)
    for i, s in enumerate(segments):
        start = max(0, int((s["start"] - pad) * sr))       # pad in seconds, clamped to file start
        end = min(len(audio), int((s["end"] + pad) * sr))  # clamped to file end
        sf.write(f"{out_dir}/seg_{i:04d}.wav", audio[start:end], sr)
Tuning Reference
| Symptom | Parameter | Change |
|---|---|---|
| Too many false positives (noise) | threshold | Raise (0.5 → 0.7) |
| Missing quiet speech | threshold | Lower (0.5 → 0.3) |
| Choppy utterances (split at pauses) | min_silence_duration_ms | Raise (100 → 400) |
| Short noise bursts flagged | min_speech_duration_ms | Raise (250 → 500) |
| Late reaction to speech start | ONSET_CHUNKS | Lower (3 → 1) |
| Cuts speech too early | SILENCE_LIMIT | Raise (15 → 25) |
| Background noise | HP filter cutoff | 80–150 Hz |
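The ONSET_CHUNKS and SILENCE_LIMIT rows describe an onset/hangover state machine layered on top of per-chunk speech probabilities. As a rough sketch of how the two knobs interact (`segment_chunks` is a hypothetical helper, and the probabilities are made-up inputs rather than real model output):

```python
def segment_chunks(probs, thresh=0.5, onset_chunks=3, silence_limit=20):
    """Return (start, end) chunk-index pairs, with end exclusive.

    Speech starts after `onset_chunks` consecutive chunks above `thresh`
    and ends after `silence_limit` consecutive chunks at or below it.
    """
    segments = []
    speaking = False
    run = 0      # consecutive speech chunks seen while idle
    silence = 0  # consecutive non-speech chunks seen while speaking
    start = 0
    for i, p in enumerate(probs):
        is_speech = p > thresh
        if not speaking:
            run = run + 1 if is_speech else 0
            if run >= onset_chunks:
                speaking = True
                start = i - onset_chunks + 1  # backdate to the first speech chunk
                silence = 0
        else:
            silence = 0 if is_speech else silence + 1
            if silence >= silence_limit:
                # Trim the trailing silence that triggered the end of speech.
                segments.append((start, i - silence_limit + 1))
                speaking = False
                run = 0
    if speaking:  # stream ended mid-utterance
        segments.append((start, len(probs) - silence))
    return segments


probs = [0.1, 0.9, 0.9, 0.9, 0.8, 0.2, 0.1, 0.1, 0.9, 0.9]
print(segment_chunks(probs, onset_chunks=2, silence_limit=3))
# → [(1, 5), (8, 10)]
```

Lowering `onset_chunks` reacts sooner but lets single-chunk clicks through; raising `silence_limit` tolerates longer pauses at the cost of a later end-of-utterance decision.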
Common Errors
| Error | Cause | Fix |
|---|---|---|
| Error: 10 (webrtcvad) | Wrong sample rate or frame size | Use 8000/16000/32000/48000 Hz, 10/20/30 ms frames |
| silero all-zero probs | LSTM state not reset | Call model.reset_states() |
| silero wrong shape | Chunk size ≠ 512 (16kHz) | Pad to exactly 512 samples |
| float out of range | Audio is int16, not float32 | Divide by 32768.0 |
| OSError: -9996 (PyAudio) | Wrong device index | List devices and use correct index |
| 401 Unauthorized (pyannote) | Invalid/missing HF token | Set HUGGINGFACE_TOKEN env var |
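Two of the silero fixes above, scaling int16 audio to float32 and padding a short final chunk to 512 samples, are easy to get subtly wrong. Here is a minimal NumPy sketch of both; `to_float32` and `pad_chunk` are hypothetical helper names, and a 16 kHz stream is assumed.

```python
import numpy as np

CHUNK = 512  # samples per chunk at 16 kHz


def to_float32(pcm: np.ndarray) -> np.ndarray:
    """Scale int16 samples to float32 in [-1.0, 1.0); cast other input to float32."""
    if pcm.dtype == np.int16:
        return pcm.astype(np.float32) / 32768.0
    return pcm.astype(np.float32)


def pad_chunk(chunk: np.ndarray, size: int = CHUNK) -> np.ndarray:
    """Zero-pad a short chunk on the right so it has exactly `size` samples."""
    if len(chunk) > size:
        raise ValueError(f"chunk has {len(chunk)} samples, expected at most {size}")
    return np.pad(chunk, (0, size - len(chunk)))


raw = np.array([0, 16384, -32768], dtype=np.int16)  # e.g. the tail of a recording
chunk = pad_chunk(to_float32(raw))
print(chunk.shape, chunk.dtype)
```

The padded array can then be wrapped with `torch.from_numpy(chunk)` before it is passed to the model.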
See Also
- Introduction to VAD
- VAD Algorithms & Theory
- VAD Libraries Comparison
- VAD Installation
- VAD Implementation
- Real-Time Streaming VAD
- VAD Integration Guide
- VAD Troubleshooting
- Introduction to llama-swap
- llama.cpp Server
- ASR Cheatsheet — Quick reference for the transcription stage that consumes VAD output
- TTS Cheatsheet — Quick reference for the speech synthesis stage