VAD Algorithms & Theory
2026-03-21
Understanding how VAD algorithms work internally helps you choose the right one, tune its parameters, and debug unexpected behavior.
Signal Fundamentals
Audio is a waveform — a time series of amplitude values. VAD splits this waveform into short frames and classifies each frame.
Amplitude
|      ___        ___________        ___
|     /   \      /           \      /   \
|____/     \____/             \____/     \____
|                                         → Time
|← frame →|← frame →|← frame →|← frame →|

Sample Rate and Frame Size
The frame size controls how fine-grained the VAD decision is:
| Sample Rate | Frame (10ms) | Frame (20ms) | Frame (30ms) |
|---|---|---|---|
| 8,000 Hz | 80 samples | 160 samples | 240 samples |
| 16,000 Hz | 160 samples | 320 samples | 480 samples |
| 32,000 Hz | 320 samples | 640 samples | 960 samples |
Shorter frames = finer resolution but noisier decisions.
Longer frames = smoother but higher latency.
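To make the table concrete: samples per frame is just sample rate × frame duration. A tiny helper (hypothetical, for illustration):

import numpy as np

def frame_size(sample_rate_hz: int, frame_ms: int) -> int:
    """Samples per frame = sample rate (Hz) × frame duration (s)."""
    return sample_rate_hz * frame_ms // 1000

assert frame_size(16_000, 30) == 480  # matches the table row for 16 kHz / 30 ms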
Algorithm 1: Energy-Based VAD
How It Works
Compute the Root Mean Square (RMS) energy of each frame and compare against a threshold.
$E_{\text{rms}} = \sqrt{\dfrac{1}{N}\sum_{i=0}^{N-1} x_i^2}$
import numpy as np
def energy_vad(samples: np.ndarray, threshold: float = 0.02) -> bool:
    rms = np.sqrt(np.mean(samples.astype(np.float32) ** 2))
    return rms > threshold

Adaptive Threshold
A static threshold fails in changing noise conditions. An adaptive threshold tracks the background noise level:
class AdaptiveEnergyVAD:
    def __init__(self, noise_alpha=0.95, speech_threshold_factor=3.0):
        self.noise_level = 0.01
        self.noise_alpha = noise_alpha
        self.factor = speech_threshold_factor

    def is_speech(self, frame: np.ndarray) -> bool:
        rms = np.sqrt(np.mean(frame.astype(np.float32) ** 2))
        threshold = self.noise_level * self.factor
        if rms < threshold:
            # Update noise estimate (exponential moving average)
            self.noise_level = self.noise_alpha * self.noise_level + (1 - self.noise_alpha) * rms
        return rms > threshold

Pros and Cons
| Pros | Cons |
|---|---|
| Zero dependencies | Fails with background music |
| Microsecond latency | Sensitive to microphone gain |
| No model needed | HVAC / fan noise = false positives |
| Easy to tune | Quiet speech = false negatives |
Algorithm 2: Zero Crossing Rate (ZCR)
ZCR counts the number of times the signal crosses zero per frame.
$\text{ZCR} = \dfrac{1}{N-1}\sum_{i=1}^{N-1} \mathbb{1}[x_i \cdot x_{i-1} < 0]$
def zero_crossing_rate(frame: np.ndarray) -> float:
    signs = np.sign(frame)
    signs[signs == 0] = 1
    crossings = np.sum(np.diff(signs) != 0)
    return crossings / len(frame)

- Silence: very low ZCR
- Speech: moderate ZCR (voiced ~0.1, unvoiced consonants higher)
- White noise: very high ZCR
ZCR alone is weak. It's best combined with energy:
def combined_vad(frame, energy_threshold=0.02, zcr_threshold=0.3):
    energy = np.sqrt(np.mean(frame.astype(float) ** 2))
    zcr = zero_crossing_rate(frame)
    return energy > energy_threshold or zcr > zcr_threshold

Algorithm 3: Spectral-Based VAD
Speech has a characteristic spectral distribution. Spectral features are more robust than raw energy.
MFCC-Based Features
Mel-Frequency Cepstral Coefficients (MFCCs) represent the short-term power spectrum of audio in a way that mirrors human hearing.
import librosa
def spectral_vad(audio, sr=16000, threshold=0.5):
    # Extract MFCC features
    mfcc = librosa.feature.mfcc(y=audio, sr=sr, n_mfcc=13)
    # Simple energy in lower MFCCs (proxy for voiced speech)
    energy = np.mean(mfcc[:4, :] ** 2, axis=0)
    return energy > threshold

Spectral Centroid
Speech has a higher spectral centroid than low-frequency noise (HVAC, traffic):
centroid = librosa.feature.spectral_centroid(y=audio, sr=sr)
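The one-liner only extracts the feature; a minimal decision rule on top of it might look like this (the 1 kHz cutoff and energy gate are illustrative assumptions, not tuned values):

import librosa
import numpy as np

def centroid_vad(audio: np.ndarray, sr: int = 16000,
                 centroid_hz: float = 1000.0,
                 energy_threshold: float = 0.02) -> np.ndarray:
    """Per-frame decision: speech if the centroid is high enough AND the frame has energy."""
    centroid = librosa.feature.spectral_centroid(y=audio, sr=sr)[0]  # (n_frames,)
    rms = librosa.feature.rms(y=audio)[0]                            # same framing by default
    return (centroid > centroid_hz) & (rms > energy_threshold)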
Algorithm 4: GMM-Based VAD (WebRTC)
Google's WebRTC VAD (used in Chrome, Zoom, WhatsApp) uses Gaussian Mixture Models (GMMs) trained on a fixed set of features.
How It Works
- Extract 6 features per frame (log energies of six frequency sub-bands)
- Score each frame against two GMMs: one for speech, one for noise
- Compute a likelihood ratio
- Apply aggressiveness mode to set the threshold
Features: [f1, f2, f3, f4, f5, f6]
              │
   ┌──────────┴──────────┐
   ▼                     ▼
P(frame|speech)    P(frame|noise)
   │                     │
   └──────────┬──────────┘
              ▼
      Likelihood Ratio
              │
       > threshold? → SPEECH

Aggressiveness Modes
| Mode | Description | False Positives | False Negatives |
|---|---|---|---|
| 0 | Least aggressive — passes most audio | High | Low |
| 1 | Balanced | Medium | Medium |
| 2 | More aggressive filtering | Low | Medium |
| 3 | Most aggressive — only clear speech | Very Low | High |
Limitations
- Fixed GMM trained on specific data — doesn't adapt to all microphones
- Only accepts 8000, 16000, 32000, or 48000 Hz input
- Frame size must be exactly 10, 20, or 30ms (see the sketch below)
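These constraints are easy to satisfy in practice. A minimal sketch using the py-webrtcvad package (the Vad constructor and is_speech call are the library's documented API; the buffer handling here is illustrative):

import webrtcvad

vad = webrtcvad.Vad(3)  # aggressiveness mode 0–3
SAMPLE_RATE = 16000
FRAME_MS = 30
FRAME_BYTES = SAMPLE_RATE * FRAME_MS // 1000 * 2  # 16-bit mono PCM

def frames_are_speech(pcm: bytes) -> list[bool]:
    """Classify consecutive 30 ms frames of 16-bit little-endian PCM."""
    return [
        vad.is_speech(pcm[i:i + FRAME_BYTES], SAMPLE_RATE)
        for i in range(0, len(pcm) - FRAME_BYTES + 1, FRAME_BYTES)
    ]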
Algorithm 5: DNN/ML-Based VAD
Silero VAD Architecture
Silero VAD is a JIT-compiled TorchScript LSTM model:
Input: float32 audio frame (512 or 256 samples at 16kHz)
    ↓
Encoder: 1D Conv layers → feature extraction
    ↓
RNN: LSTM × 2 layers → temporal context
    ↓
Output: single float in [0, 1] — speech probability

Key properties:
- ~2 MB model size (JIT compiled)
- Processes 512 samples (32ms) at 16kHz
- Returns speech probability per chunk
- No internet required at inference
- CPU inference: ~0.5ms per chunk on modern CPU
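A loading sketch following the snakers4/silero-vad README (the utils tuple layout matches the repo's documented API, but verify against the current release):

import torch

model, utils = torch.hub.load("snakers4/silero-vad", "silero_vad")
get_speech_timestamps, _, read_audio, _, _ = utils

wav = read_audio("speech.wav", sampling_rate=16000)       # file name is illustrative
speech_prob = model(wav[:512], 16000).item()              # probability for one 512-sample chunk
print(f"speech probability: {speech_prob:.2f}")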
pyannote.audio Architecture
Input: 1–5 second audio windows
    ↓
WavLM / HuBERT: pre-trained Transformer (SSL features)
    ↓
Frame-level speech probability
    ↓
Output: Annotation with speech/non-speech timestamps

Much heavier (300MB+) but extremely accurate. Also outputs speaker identity (diarization).
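A sketch of running the pretrained pyannote VAD pipeline (the pipeline name follows the Hugging Face hub; the token placeholder is hypothetical):

from pyannote.audio import Pipeline

pipeline = Pipeline.from_pretrained("pyannote/voice-activity-detection",
                                    use_auth_token="YOUR_HF_TOKEN")
vad = pipeline("audio.wav")
for segment in vad.get_timeline().support():
    print(f"speech: {segment.start:.2f}s – {segment.end:.2f}s")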
Smoothing and Post-processing
Raw per-frame VAD output is noisy. Post-processing is critical.
Median Filtering
from scipy.ndimage import median_filter

# Smooth the binary VAD output over a window of 5 frames
smoothed = median_filter(vad_output, size=5)

Hysteresis (Onset/Offset Delay)
Avoid chopping speech mid-utterance by requiring N consecutive frames before switching state:
class HysteresisVAD:
    def __init__(self, onset_frames=3, offset_frames=10):
        self.onset_frames = onset_frames    # frames before activating
        self.offset_frames = offset_frames  # frames before deactivating
        self.speech_counter = 0
        self.silence_counter = 0
        self.is_speaking = False

    def update(self, is_speech: bool) -> bool:
        if is_speech:
            self.speech_counter += 1
            self.silence_counter = 0
            if self.speech_counter >= self.onset_frames:
                self.is_speaking = True
        else:
            self.silence_counter += 1
            self.speech_counter = 0
            if self.silence_counter >= self.offset_frames:
                self.is_speaking = False
        return self.is_speaking
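Illustrative wiring of the per-frame energy detector from Algorithm 1 through the hysteresis state machine (frame size and the audio source are assumptions):

hysteresis = HysteresisVAD(onset_frames=3, offset_frames=10)
frame_len = 480  # 30 ms at 16 kHz

# audio: 1-D numpy array of samples at 16 kHz (assumed)
for i in range(0, len(audio) - frame_len, frame_len):
    raw = energy_vad(audio[i:i + frame_len])
    stable = hysteresis.update(raw)  # debounced speech/silence state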
Speech Segment Padding
Always add padding before speech onset and after offset to avoid clipping:
def pad_segments(segments, pre_pad=0.3, post_pad=0.5, audio_duration=None):
    padded = []
    for seg in segments:
        start = max(0.0, seg["start"] - pre_pad)
        end = seg["end"] + post_pad
        if audio_duration:
            end = min(audio_duration, end)
        padded.append({"start": start, "end": end})
    return padded
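Note that padding can make adjacent segments overlap. The helper above doesn't handle that; a small merge pass (an assumed addition, not part of the original helper) keeps the output well-formed:

def merge_segments(segments: list[dict]) -> list[dict]:
    """Merge overlapping or touching segments."""
    merged = []
    for seg in sorted(segments, key=lambda s: s["start"]):
        if merged and seg["start"] <= merged[-1]["end"]:
            merged[-1]["end"] = max(merged[-1]["end"], seg["end"])
        else:
            merged.append(dict(seg))
    return merged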
Evaluation Metrics
| Metric | Formula | Meaning |
|---|---|---|
| Precision | TP / (TP + FP) | Of detected speech, how much is real speech? |
| Recall | TP / (TP + FN) | Of real speech, how much was detected? |
| F1 Score | 2 × P × R / (P + R) | Balance of precision and recall |
| Detection Error Rate | (FA + Miss) / Total speech | pyannote standard metric |
| Latency | ms from speech onset to detection | Time to react |
False Alarm Rate and Miss Rate
$\text{FAR} = \dfrac{\text{non-speech frames classified as speech}}{\text{total non-speech frames}}$

$\text{MR} = \dfrac{\text{speech frames classified as non-speech}}{\text{total speech frames}}$
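These frame-level rates (and the table's precision/recall) follow directly from aligned boolean arrays of predictions and ground truth; a minimal sketch:

import numpy as np

def frame_metrics(pred: np.ndarray, truth: np.ndarray) -> dict:
    """pred/truth: boolean arrays of per-frame speech decisions."""
    tp = np.sum(pred & truth)
    fp = np.sum(pred & ~truth)
    fn = np.sum(~pred & truth)
    tn = np.sum(~pred & ~truth)
    return {
        "precision": tp / max(tp + fp, 1),
        "recall":    tp / max(tp + fn, 1),
        "far":       fp / max(fp + tn, 1),  # false alarm rate
        "miss_rate": fn / max(tp + fn, 1),
    }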
Detection Cost Function (DCF)
The standard NIST evaluation metric balances the two error types weighted by their prior probability:
$C_{det} = C_{miss} \cdot P_{miss} \cdot P_{target} + C_{FA} \cdot P_{FA} \cdot (1 - P_{target})$

$\text{normalized DCF} = \dfrac{\min_\theta C_{det}(\theta)}{C_{det,\text{default}}}$

where $C_{det,\text{default}} = \min(C_{miss} \cdot P_{target},\ C_{FA} \cdot (1 - P_{target}))$.
In the standard NIST SRE setting: $C_{FA} = 1$, $C_{miss} = 10$, $P_{target} = 0.01$ — misses are penalized 10× more than false alarms.
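A worked example with the NIST SRE weights above (the error rates are illustrative numbers):

C_miss, C_fa, P_target = 10.0, 1.0, 0.01

def dcf(p_miss: float, p_fa: float) -> float:
    return C_miss * p_miss * P_target + C_fa * p_fa * (1 - P_target)

c_default = min(C_miss * P_target, C_fa * (1 - P_target))  # = 0.1
print(dcf(p_miss=0.05, p_fa=0.02) / c_default)             # normalized DCF ≈ 0.248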
Cascade Effect: VAD Errors on ASR
VAD errors propagate directly into ASR quality:
| VAD Error | Effect on ASR |
|---|---|
| False positive (noise → speech) | Whisper hallucinates text from silence; WER increases |
| False negative (speech → silence) | Beginning/end of utterance clipped; words deleted |
| Late onset detection | First word(s) missing from transcript |
| Early offset detection | Sentence split mid-utterance; two partial transcripts |
| Choppy segments (<300ms) | Whisper cannot decode very short clips reliably |
This is why the choice of VAD threshold, silence padding, and minimum segment duration directly controls ASR WER — not just VAD accuracy alone.
# Measure VAD impact on ASR WER
import torch
from jiwer import wer
# Assumes Silero's vad_model and get_speech_timestamps are loaded as shown above,
# and asr_model is a faster-whisper model exposing .transcribe().

def vad_asr_roundtrip(audio: np.ndarray, sr: int, vad_threshold: float,
                      asr_model, reference: str) -> float:
    """Run VAD then ASR; return WER against the reference transcript."""
    timestamps = get_speech_timestamps(
        torch.tensor(audio), vad_model,
        threshold=vad_threshold, sampling_rate=sr,
        return_seconds=True
    )
    segments_text = []
    for ts in timestamps:
        chunk = audio[int(ts["start"] * sr):int(ts["end"] * sr)]
        segs, _ = asr_model.transcribe(chunk, beam_size=5)
        segments_text.append(" ".join(s.text for s in segs))
    return wer(reference, " ".join(segments_text))
Benchmark Comparison (16kHz, clean speech)
| Library | F1 Score | Latency | CPU Load |
|---|---|---|---|
| silero-vad | ~0.97 | 1–2ms/chunk | Low |
| pyannote VAD | ~0.95 | 50–200ms | High |
| webrtcvad (mode 3) | ~0.89 | <1ms | Minimal |
| energy-based | ~0.75 | <0.1ms | Negligible |
Pitch (F0) as a Speech Discriminator
Voiced human speech is produced by periodic vocal-fold vibration, resulting in a fundamental frequency (F0) of 80–300 Hz for most speakers. Stationary noises (HVAC, fans, electrical hum) lack this periodicity. Detecting F0 is therefore a powerful speech discriminator in challenging environments.
Autocorrelation F0 Estimation
The simplest F0 detector uses the autocorrelation of the waveform:
$R[k] = \sum_{n=0}^{N-1-k} x[n] \cdot x[n+k]$

The lag $k^*$ at which $R[k]$ peaks (after the zero-lag peak) gives the period $T_0 = k^*/f_s$, so $F_0 = f_s/k^*$.
import numpy as np

def detect_f0(frame: np.ndarray, sr: int = 16000,
              f0_min: float = 60, f0_max: float = 400) -> float | None:
    """
    Estimate F0 from a single audio frame using autocorrelation.
    Returns F0 in Hz, or None if frame is unvoiced.
    """
    lag_min = int(sr / f0_max)  # minimum lag for f0_max
    lag_max = int(sr / f0_min)  # maximum lag for f0_min
    # Normalize frame
    frame = frame - frame.mean()
    if np.abs(frame).max() < 1e-6:
        return None  # silence
    # Compute autocorrelation
    corr = np.correlate(frame, frame, mode="full")
    corr = corr[len(corr) // 2:]  # keep positive lags only
    # Normalize by zero-lag power
    if corr[0] < 1e-10:
        return None
    corr_norm = corr / corr[0]
    # Find peak in valid lag range
    search = corr_norm[lag_min:lag_max]
    if len(search) == 0:
        return None
    peak_lag = lag_min + np.argmax(search)
    # Voicing threshold: peak autocorrelation > 0.3
    if corr_norm[peak_lag] < 0.3:
        return None  # unvoiced
    return sr / peak_lag

def f0_vad(frame: np.ndarray, sr: int = 16000) -> bool:
    """Speech detected if F0 is found in the human speech range."""
    return detect_f0(frame, sr) is not None

When to use F0-based VAD: effective when the background has energy in the same frequency bands as speech (music, narrow-band noise). Pure energy VAD would fail there; F0 detection correctly classifies voiced speech.
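As an illustration, the F0 check can gate the energy detector from Algorithm 1 (the threshold is an assumption, not a tuned value):

def robust_vad(frame: np.ndarray, sr: int = 16000,
               energy_threshold: float = 0.02) -> bool:
    """Speech only if the frame both carries energy and shows voicing."""
    rms = np.sqrt(np.mean(frame.astype(np.float32) ** 2))
    return rms > energy_threshold and f0_vad(frame, sr)

Requiring both cues trades some recall on unvoiced consonants for far fewer false positives in tonal or band-limited noise.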
CMVN — Cepstral Mean and Variance Normalization
CMVN normalizes acoustic features across a recording or session to reduce the effect of microphone variation, room acoustics, and channel mismatch. It is standard in HMM-GMM ASR pipelines and is also applied as a pre-processing step for neural acoustic-model input features.
Global CMVN
Given a sequence of $T$ feature vectors $\{f_t\}_{t=1}^{T}$ (e.g., MFCCs or log-mel), compute the global mean and variance and normalize:

$\hat{f}_t = \dfrac{f_t - \mu}{\sigma}, \qquad \mu = \dfrac{1}{T}\sum_t f_t, \qquad \sigma_k^2 = \dfrac{1}{T}\sum_t (f_{t,k} - \mu_k)^2$
import numpy as np

def apply_cmvn(features: np.ndarray, eps: float = 1e-8) -> np.ndarray:
    """
    Apply Cepstral Mean and Variance Normalization to a feature matrix.
    features: shape (n_features, time_frames) — e.g. MFCCs
    returns: normalized features of the same shape
    """
    mean = features.mean(axis=1, keepdims=True)
    std = features.std(axis=1, keepdims=True) + eps
    return (features - mean) / std

# Sliding-window CMVN (for streaming — normalize over a 3s window)
from collections import deque

class SlidingCMVN:
    def __init__(self, window_frames: int = 300):  # 300 × 10ms = 3s
        self.buf = deque(maxlen=window_frames)

    def normalize(self, frame: np.ndarray) -> np.ndarray:
        self.buf.append(frame)
        stack = np.stack(list(self.buf), axis=1)  # (n_feat, window)
        mean = stack.mean(axis=1)
        std = stack.std(axis=1) + 1e-8
        return (frame - mean) / std

Why CMVN matters for VAD: A VAD trained on clean close-mic data may fail on a different microphone or in a reverberant room. CMVN normalizes the feature distribution so the model sees the same statistical shape regardless of recording conditions.
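To make the expected shapes concrete, an illustrative pairing with librosa-extracted MFCCs (the file name is hypothetical):

import librosa

audio, sr = librosa.load("meeting.wav", sr=16000)
mfcc = librosa.feature.mfcc(y=audio, sr=sr, n_mfcc=13)  # (13, n_frames)
mfcc_normalized = apply_cmvn(mfcc)  # zero mean, unit variance per coefficient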
Multi-Microphone / Beamforming VAD
Real-world devices — smart speakers, conference systems, hearing aids — use microphone arrays. Beamforming combines multiple microphone signals to suppress noise from all directions except the target direction, dramatically improving VAD accuracy in noisy environments.
Delay-and-Sum Beamforming
Given M microphones and a target direction θ, the delay-and-sum beamformer aligns and adds signals:
$y[n] = \dfrac{1}{M}\sum_{m=1}^{M} x_m[n - d_m(\theta)]$

where $d_m(\theta) = \lfloor f_s \cdot \tau_m(\theta) \rfloor$ is the delay in samples for microphone $m$ given target direction $\theta$, and $\tau_m(\theta)$ is the time-of-arrival difference relative to a reference mic.
import numpy as np

def delay_and_sum(signals: np.ndarray, delays: list[int]) -> np.ndarray:
    """
    Delay-and-sum beamforming.
    signals: shape (M, T) — M microphone signals, T samples
    delays: list of M integer delays (samples)
    returns: beamformed signal, shape (T,)
    """
    M, T = signals.shape
    max_delay = max(delays)
    output_len = T - max_delay
    output = np.zeros(output_len, dtype=np.float32)
    for m, d in enumerate(delays):
        output += signals[m, d:d + output_len]
    return output / M

def estimate_delays(mic_positions: np.ndarray, direction_deg: float,
                    sr: int = 16000, speed_of_sound: float = 343.0) -> list[int]:
    """
    Estimate inter-microphone delays for a given target direction.
    mic_positions: shape (M, 2) — 2D positions in meters
    direction_deg: target direction in degrees (0 = positive x-axis)
    """
    theta = np.radians(direction_deg)
    unit_vec = np.array([np.cos(theta), np.sin(theta)])
    # Project each mic position onto the direction vector
    projections = mic_positions @ unit_vec
    # Convert to samples (relative to the mic at position 0)
    projections -= projections[0]
    delays = -np.round(projections * sr / speed_of_sound).astype(int)
    # Make all delays non-negative
    delays -= delays.min()
    return delays.tolist()

Practical impact: Delay-and-sum beamforming on a 4-mic linear array can improve SNR by roughly $10\log_{10} M$, about 6 dB for $M = 4$. Combined with Silero VAD, this reduces false positives from diffuse room noise by up to 90% in reverberant environments.
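A hypothetical end-to-end wiring of the two helpers above (the array geometry and captured signals are illustrative):

import numpy as np

# 4-mic linear array with 5 cm spacing along the x-axis
mic_positions = np.array([[0.00, 0.0], [0.05, 0.0], [0.10, 0.0], [0.15, 0.0]])
delays = estimate_delays(mic_positions, direction_deg=45.0)

# mic_signals: float32 array of shape (4, T) captured from the array (assumed)
enhanced = delay_and_sum(mic_signals, delays)
# `enhanced` can now be fed to any single-channel VAD, e.g. Silero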
See Also
- Introduction to VAD
- VAD Libraries Comparison
- VAD Implementation
- VAD Cheatsheet
- ASR Algorithms & Theory — Log-Mel spectrograms, MFCC, and CMVN used in VAD are the same features consumed by ASR acoustic models
- TTS Algorithms & Theory — Pitch (F0) and energy features in TTS prosody modeling are the same acoustic cues VAD uses for speech/non-speech classification