VAD Libraries Comparison
2026-03-21
A detailed comparison of the major Python VAD libraries, with real-world guidance on when to use each.
Quick Comparison Matrix
| Library | Approach | Model Size | GPU Required | Min Python | License | HuggingFace Required |
|---|---|---|---|---|---|---|
| webrtcvad | GMM | <1 MB | No | 3.6 | BSD-3 | No |
| silero-vad | LSTM (TorchScript) | ~2 MB | No (optional) | 3.8 | MIT | No |
| pyannote.audio | Transformer | ~300 MB | Recommended | 3.8 | MIT* | Yes |
| speechbrain | CRDNN | ~200 MB | Recommended | 3.8 | Apache-2.0 | No |
| nemo | MarbleNet | ~40 MB | Recommended | 3.8 | Apache-2.0 | No |
| auditok | Energy-based | 0 | No | 3.6 | MIT | No |
| librosa + numpy | Custom spectral | 0 | No | 3.7 | ISC | No |
*pyannote.audio requires accepting HuggingFace model terms of service.
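The last row of the matrix covers rolling your own detector. To give a sense of what "custom spectral" means in practice, here is a minimal energy-gate sketch with librosa and numpy; the 30 ms framing and the 1.5× median threshold are illustrative choices, not tuned defaults.
```python
import librosa
import numpy as np

# Load audio as 16 kHz mono
y, sr = librosa.load("audio.wav", sr=16000, mono=True)

# 30 ms frames with a 10 ms hop
frame_length = int(0.030 * sr)
hop_length = int(0.010 * sr)

# Short-time RMS energy per frame
rms = librosa.feature.rms(y=y, frame_length=frame_length,
                          hop_length=hop_length)[0]

# Illustrative adaptive threshold: frames above 1.5x the median RMS
is_speech = rms > 1.5 * np.median(rms)

# Frame indices -> timestamps in seconds
times = librosa.frames_to_time(np.arange(len(rms)), sr=sr,
                               hop_length=hop_length)
for t, s in zip(times, is_speech):
    if s:
        print(f"{t:.2f}s: speech")
```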
1. webrtcvad
GitHub: py-webrtcvad — Python bindings for Google's WebRTC VAD C library
Strengths
- Battle-tested in production (the WebRTC stack behind Google Meet and many other real-time apps)
- Extremely low latency (<1ms per frame)
- No ML framework, runs anywhere (Raspberry Pi, etc.)
- Only ~50 KB binary dependency
Weaknesses
- Only four supported sample rates: 8000, 16000, 32000, and 48000 Hz
- Frame duration must be exactly 10, 20, or 30 ms
- Lower accuracy than ML-based approaches in noisy environments
- No probability score — only a binary per-frame decision (see the smoothing sketch at the end of this section)
Sample Code
```python
import webrtcvad
import wave

vad = webrtcvad.Vad(mode=2)  # aggressiveness 0-3

with wave.open("audio.wav", "rb") as wf:
    sample_rate = wf.getframerate()  # must be 8000, 16000, 32000, or 48000
    frame_duration = 20  # ms — must be 10, 20, or 30
    frame_size = int(sample_rate * frame_duration / 1000)
    while True:
        raw = wf.readframes(frame_size)
        if len(raw) < frame_size * 2:  # 16-bit samples = 2 bytes each
            break
        is_speech = vad.is_speech(raw, sample_rate)
        print("SPEECH" if is_speech else "silence")
```
Best For
- Edge devices, embedded Linux, Raspberry Pi
- Real-time telephony (8kHz)
- When you need absolute minimum latency
- No internet, no ML setup
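Because the library gives only per-frame booleans, most real applications add a smoothing layer on top. Below is a minimal sketch of one common approach, a hangover counter; the enter/exit frame counts are illustrative choices, not part of webrtcvad itself.
```python
import webrtcvad

def frames_to_segments(frames, sample_rate, frame_ms=20,
                       enter_after=3, exit_after=10):
    """Turn per-frame booleans into (start, end) segments in seconds.

    frames: list of raw 16-bit mono PCM byte strings, one VAD frame each.
    enter_after / exit_after are illustrative hangover counts.
    """
    vad = webrtcvad.Vad(2)
    segments, start = [], None
    speech_run = silence_run = 0
    for i, frame in enumerate(frames):
        t = i * frame_ms / 1000.0
        if vad.is_speech(frame, sample_rate):
            speech_run += 1
            silence_run = 0
            if start is None and speech_run >= enter_after:
                # Backdate the start to the first frame of the speech run
                start = t - (enter_after - 1) * frame_ms / 1000.0
        else:
            silence_run += 1
            speech_run = 0
            if start is not None and silence_run >= exit_after:
                segments.append((start, t))
                start = None
    if start is not None:
        segments.append((start, len(frames) * frame_ms / 1000.0))
    return segments
```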
2. silero-vad
GitHub: snakers4/silero-vad
The community's favourite general-purpose VAD. Small, fast, accurate.
Strengths
- State-of-the-art accuracy (reported ~97% F1) with a tiny 2 MB model
- Works in CPU mode (fast enough for real-time)
- Returns probability scores, not just binary
- Official PyTorch Hub loading or local file
- Utility functions for timestamps, splitting audio
- Supports 8000 and 16000 Hz
Weaknesses
- Requires PyTorch (~200MB install)
- Chunk size must match: 256 (8kHz) or 512 (16kHz) samples
- Model load takes ~0.5s on first use
Loading Methods
```python
# Method 1: PyTorch Hub (requires internet the first time)
import torch

model, utils = torch.hub.load(
    repo_or_dir="snakers4/silero-vad",
    model="silero_vad",
    force_reload=False,
)

# Method 2: pip install + local load (recommended for offline)
# pip install silero-vad
from silero_vad import load_silero_vad, read_audio, get_speech_timestamps

model = load_silero_vad()
```
Getting Timestamps
```python
from silero_vad import load_silero_vad, read_audio, get_speech_timestamps

model = load_silero_vad()
wav = read_audio("speech.wav", sampling_rate=16000)

timestamps = get_speech_timestamps(
    wav,
    model,
    sampling_rate=16000,
    threshold=0.5,  # speech probability threshold
    min_speech_duration_ms=250,
    max_speech_duration_s=float("inf"),
    min_silence_duration_ms=100,
    speech_pad_ms=30,
)
# timestamps: [{"start": 1234, "end": 5678}, ...] (in samples)
```
Chunk-by-Chunk Inference
```python
import torch
from silero_vad import load_silero_vad

model = load_silero_vad()
model.reset_states()  # important: reset between audio streams

chunk = torch.zeros(512)  # 512 samples at 16 kHz = 32 ms
prob = model(chunk, 16000).item()
print(f"Speech probability: {prob:.3f}")
```
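For streaming, the pip package also ships a VADIterator helper that wraps this chunk loop and emits segment boundaries. A minimal sketch based on the usage shown in the silero-vad README; the zero-filled chunks stand in for real audio.
```python
import torch
from silero_vad import load_silero_vad, VADIterator

model = load_silero_vad()
vad_iterator = VADIterator(model, threshold=0.5, sampling_rate=16000)

for _ in range(100):
    chunk = torch.zeros(512)  # replace with real 32 ms chunks at 16 kHz
    event = vad_iterator(chunk, return_seconds=True)
    if event is not None:
        print(event)  # {"start": ...} when speech begins, {"end": ...} when it stops

vad_iterator.reset_states()  # reset before processing the next stream
```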
Best For
- General purpose VAD in any Python application
- When you want probability scores
- CPU inference on servers
- Getting precise speech timestamps from files
3. pyannote.audio
GitHub: pyannote/pyannote-audio
Enterprise-grade pipeline combining VAD + speaker diarization.
Strengths
- Highest quality VAD + diarization in one pipeline
- Speaker-aware segmentation
- Pre-trained models on diverse datasets
- Active research project with frequent updates
Weaknesses
- Requires GPU for real-time use
- Requires accepting HuggingFace model license agreements
- Heavy dependency tree (roughly 1 GB installed, including PyTorch)
- High latency for short utterances
Setup (includes HuggingFace token)
```python
# 1. Accept terms at: https://hf.co/pyannote/voice-activity-detection
# 2. Get your token from https://hf.co/settings/tokens
from pyannote.audio import Pipeline

pipeline = Pipeline.from_pretrained(
    "pyannote/voice-activity-detection",
    use_auth_token="YOUR_HF_TOKEN",
)

# Run on a file
output = pipeline("audio.wav")

# Iterate over speech regions
for speech in output.get_timeline().support():
    print(f"Speech: {speech.start:.2f}s → {speech.end:.2f}s")
```
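The pipeline returns a pyannote.core Annotation, so results can also be serialized to RTTM for scoring tools or downstream diarization workflows. A small sketch using Annotation's write_rttm method:
```python
from pyannote.audio import Pipeline

pipeline = Pipeline.from_pretrained(
    "pyannote/voice-activity-detection",
    use_auth_token="YOUR_HF_TOKEN",
)
output = pipeline("audio.wav")

# Serialize the detected speech regions to RTTM
with open("vad.rttm", "w") as f:
    output.write_rttm(f)
```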
With GPU
```python
import torch
from pyannote.audio import Pipeline

pipeline = Pipeline.from_pretrained(
    "pyannote/voice-activity-detection",
    use_auth_token="YOUR_HF_TOKEN",
)
pipeline.to(torch.device("cuda"))
```
Best For
- Speaker diarization ("who spoke when")
- Meeting transcription with speaker labels
- High-quality offline processing (not real-time)
4. auditok
GitHub: amsehili/auditok
Pure Python, energy-based VAD with a convenient CLI and API.
Strengths
- Zero ML dependencies
- Works with any audio format via ffmpeg
- Great CLI tool for splitting audio files
- Simple Python API
Sample Code
```python
import auditok

# Split an audio file into speech regions
regions = auditok.split(
    "audio.wav",
    min_dur=0.2,          # minimum speech duration (seconds)
    max_dur=10,           # maximum speech duration (seconds)
    max_silence=0.3,      # max silence inside a segment (seconds)
    energy_threshold=55,  # audio energy threshold (dB)
)

for i, region in enumerate(regions):
    region.save(f"speech_{i:03d}.wav")
    print(f"Segment {i}: {region.meta.start:.2f}s → {region.meta.end:.2f}s")
```
CLI Usage
```bash
auditok -i audio.wav -e 55 -m 0.2 -s 0.3 --save-to-file speech_{N}.wav
```
Best For
- Splitting podcast/lecture recordings into segments
- CLI batch processing
- When you cannot install PyTorch
5. SpeechBrain
A full speech toolkit (ASR, speaker ID, enhancement) with a built-in pretrained VAD (a CRDNN trained on LibriParty).
Sample Code
```python
from speechbrain.pretrained import VAD

vad = VAD.from_hparams(
    source="speechbrain/vad-crdnn-libriparty",
    savedir="pretrained_models/vad-crdnn",
)

# Get speech boundaries
boundaries = vad.get_speech_segments(
    "audio.wav",
    large_chunk_size=30,
    small_chunk_size=10,
    overlap_small_chunk=True,
    apply_energy_VAD=True,
    double_check=True,
)

# Save to file
vad.save_boundaries(boundaries, save_path="boundaries.csv")
```
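If you want frame-level posteriors rather than final boundaries, the same VAD object exposes the intermediate steps. A sketch using get_speech_prob_file, apply_threshold, and get_boundaries as described in the SpeechBrain VAD docs; the threshold values are illustrative.
```python
# Frame-level speech probabilities for custom post-processing
prob_chunks = vad.get_speech_prob_file("audio.wav")

# Hysteresis thresholding: enter speech above 0.5, leave below 0.25
prob_th = vad.apply_threshold(
    prob_chunks, activation_th=0.5, deactivation_th=0.25
)
boundaries = vad.get_boundaries(prob_th)
```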
Best For
- End-to-end ASR pipeline with SpeechBrain
- When you're already using SpeechBrain for ASR
6. NVIDIA NeMo (MarbleNet)
```python
import nemo.collections.asr as nemo_asr

# Load the pre-trained MarbleNet VAD model
vad_model = nemo_asr.models.EncDecClassificationModel.from_pretrained(
    "vad_marblenet"
)

# Describe the files to process via a NeMo manifest; inference then
# runs over the resulting test dataloader
vad_model.setup_test_data(test_data_config={
    "sample_rate": 16000,
    "manifest_filepath": "manifest.json",
    "labels": ["background", "speech"],
    "batch_size": 1,
    "shuffle": False,
})
```
Best For
- NVIDIA GPU servers with NeMo already set up
- Integration with NeMo's full ASR/NLP pipeline
Decision Flowchart
```
Start: What are your constraints?
│
├─ No ML framework allowed / embedded device
│   └─ webrtcvad or auditok
│
├─ Need probability scores (not just binary)
│   └─ silero-vad
│
├─ Need speaker identity ("who spoke when")
│   └─ pyannote.audio
│
├─ Running in production with NVIDIA GPUs + NeMo
│   └─ NeMo MarbleNet
│
├─ Already using SpeechBrain for ASR
│   └─ SpeechBrain VAD
│
└─ General purpose, accuracy matters most
    └─ silero-vad (recommended default)
```
See Also
- Introduction to VAD
- VAD Installation
- VAD Implementation
- Real-Time Streaming VAD
- ASR Libraries Comparison — faster-whisper includes a built-in silero-vad filter; library choices affect which VAD integrates best
- TTS Libraries Comparison — understanding the full pipeline stack helps when selecting VAD