Introduction to Voice Activity Detection (VAD)


2026-03-21

Voice Activity Detection (VAD) determines when a human is speaking in an audio stream. Most modern voice pipelines, from smart speakers to speech-to-text APIs, rely on it to work reliably and efficiently.

What is VAD?

VAD is an algorithm or model that classifies each frame of audio as either speech or non-speech (silence/noise). The output is typically a binary signal or probability score over time.

Audio stream:
[noise][noise][SPEECH SPEECH SPEECH][noise][SPEECH][noise][noise]
                ↑                  ↑       ↑      ↑
             start               end    start    end

Without VAD, downstream systems must process all audio — including silence — which wastes compute, increases latency, and reduces accuracy.

Why VAD Matters

Problem Without VAD	How VAD Solves It
ASR processes silence → hallucinations	Only speech frames are sent to ASR
Continuous microphone → high CPU/GPU	Activate processing only during speech
Privacy: always-on recording	Discard silence frames, never store them
Bandwidth: sending all audio	Transmit only detected speech segments
Battery drain on mobile/edge devices	Sleep CPU between utterances

The VAD Pipeline Position

VAD normally sits at the front of the pipeline, ahead of any downstream task:

Microphone/Audio file
        │
        ▼
  ┌───────────┐
  │    VAD    │  ← Are these frames speech?
  └─────┬─────┘
        │ speech only
        ▼
  ┌────────────────┐
  │ ASR / Whisper  │  ← Transcription
  └─────┬──────────┘
        │
        ▼
  ┌────────────────┐
  │   LLM / NLP    │  ← Understanding / Generation
  └────────────────┘

Types of VAD

1. Energy-Based (Traditional)

Compares the RMS energy of a frame against a threshold. Fast, zero dependencies, but fragile in noisy environments.

speech if RMS(frame) > threshold

Pros: microsecond latency, no model, no GPU
Cons: breaks in noise, music, fan sounds
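
The threshold rule above can be sketched in a few lines of plain Python. This is a minimal illustration, not a production detector: the helper names `rms` and `energy_vad`, the 320-sample frame (20 ms at 16 kHz), and the threshold of 500 for 16-bit samples are all assumptions chosen for the example, and a real threshold would need tuning per microphone and environment.

```python
import math

def rms(frame):
    """Root-mean-square amplitude of a sequence of samples."""
    if not frame:
        return 0.0
    return math.sqrt(sum(s * s for s in frame) / len(frame))

def energy_vad(samples, frame_len=320, threshold=500.0):
    """Return one boolean per full frame: True where RMS exceeds threshold.

    frame_len and threshold are illustrative values (assumptions),
    not defaults taken from any library.
    """
    decisions = []
    for start in range(0, len(samples) - frame_len + 1, frame_len):
        frame = samples[start:start + frame_len]
        decisions.append(rms(frame) > threshold)
    return decisions

# Synthetic input: 2 quiet frames, 2 loud frames (a 440 Hz tone), 1 quiet frame.
quiet = [10] * 320
loud = [int(8000 * math.sin(2 * math.pi * 440 * n / 16000)) for n in range(320)]
signal = quiet * 2 + loud * 2 + quiet

flags = energy_vad(signal)
print(flags)  # [False, False, True, True, False]
```

Note that the loud frames here are a pure tone, not speech, and they are still flagged. That is the weakness listed under Cons: the rule measures loudness, not voice.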

Library	Approach	Size	GPU Needed	Best For
`webrtcvad`	GMM (WebRTC)	<1 MB	No	Low-latency edge/embedded
`silero-vad`	LSTM (PyTorch)	~2 MB	Optional	General purpose, high accuracy
`pyannote.audio`	SincNet + LSTM (PyanNet)	~300 MB	Recommended	Diarization + VAD
`speechbrain`	CRDNN	~200 MB	Recommended	Full ASR toolkit
`nemo`	MarbleNet	~40 MB	Recommended	Production ASR pipelines
Energy + scipy	Custom math	0	No	Quick prototyping

When to Use What

Need very low latency (<20ms)?
├─ Yes → webrtcvad
└─ No → need high accuracy?
         ├─ Yes → silero-vad (best general choice)
         │        or pyannote (if you also need diarization)
         └─ No → energy-based (prototyping only)

Running on edge/embedded (no internet, no GPU)?
└─ webrtcvad or silero-vad (CPU mode)

Want speaker diarization alongside VAD?
└─ pyannote.audio

Full speech pipeline (ASR + VAD + NLP)?
└─ SpeechBrain or NeMo
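
For readers who prefer code to trees, the questions above can be folded into one small function. This is only a sketch: the function name `choose_vad`, its flags, and the order in which the questions are checked are assumptions made for illustration, since the trees above do not define a precedence between them.

```python
def choose_vad(low_latency=False, high_accuracy=False,
               needs_diarization=False, full_pipeline=False):
    """Return a suggested library name based on the decision trees.

    Precedence (full pipeline, then diarization, then latency, then
    accuracy) is an assumption of this sketch.
    """
    if full_pipeline:
        return "speechbrain or nemo"
    if needs_diarization:
        return "pyannote.audio"
    if low_latency:
        return "webrtcvad"
    if high_accuracy:
        return "silero-vad"
    return "energy-based"

print(choose_vad(low_latency=True))    # webrtcvad
print(choose_vad(high_accuracy=True))  # silero-vad
```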

Key Concepts and Terminology

Term	Definition
Frame	Short chunk of audio (typically 10–30ms)
Chunk size	Number of samples per frame (e.g. 512 at 16kHz = 32ms)
Aggressiveness	How aggressively to filter non-speech (0–3 in WebRTC)
Threshold	Probability above which a frame is considered speech
Onset	The moment speech begins
Offset	The moment speech ends
Padding	Extra audio kept before/after a detected speech segment so word onsets and endings are not clipped
Hysteresis	Delay (or separate on/off thresholds) before switching from speech→silence, to avoid choppy cuts
False Positive	Non-speech classified as speech (noise mistaken for voice)
False Negative	Speech classified as non-speech (voice missed)
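
Several of these terms (threshold, onset, offset, hysteresis) interact, so a small sketch may help. The function below turns per-frame probabilities into stable 0/1 decisions: speech begins when a probability reaches an "on" threshold and ends only after a run of low frames. The name `smooth_decisions`, the two thresholds, and the `hangover_frames` parameter are hypothetical choices for this example, not settings of any particular library.

```python
def smooth_decisions(probs, on_threshold=0.6, off_threshold=0.4,
                     hangover_frames=2):
    """Turn per-frame speech probabilities into stable 0/1 decisions.

    Speech starts (onset) when a probability reaches on_threshold.
    It ends (offset) only after hangover_frames consecutive frames
    fall below off_threshold. All values are illustrative.
    """
    decisions = []
    in_speech = False
    low_run = 0  # consecutive low-probability frames seen while in speech
    for p in probs:
        if not in_speech:
            if p >= on_threshold:
                in_speech = True
                low_run = 0
        elif p < off_threshold:
            low_run += 1
            if low_run >= hangover_frames:
                in_speech = False
                low_run = 0
        else:
            low_run = 0
        decisions.append(1 if in_speech else 0)
    return decisions

probs = [0.1, 0.7, 0.9, 0.3, 0.8, 0.5, 0.2, 0.1, 0.1]
print(smooth_decisions(probs))  # [0, 1, 1, 1, 1, 1, 1, 0, 0]
```

The single dip to 0.3 in the middle does not end the segment, which is the "choppy cut" that hysteresis is meant to prevent. The cost is a slightly late offset.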

VAD Output Formats

Binary per frame

[0, 0, 0, 1, 1, 1, 1, 1, 0, 0, 1, 1, 0, 0]
#  silence       SPEECH        SPEECH

Probability score per frame

[0.02, 0.05, 0.91, 0.98, 0.87, 0.03, 0.01]

Timestamp segments

[
    {"start": 1.2, "end": 3.8},
    {"start": 5.1, "end": 7.4},
]
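
The three formats are related: probabilities become binary flags by thresholding, and runs of speech flags become timestamp segments. A minimal conversion sketch follows, assuming a fixed 30 ms frame and a 0.5 threshold. Both values, and the helper name `frames_to_segments`, are assumptions for illustration.

```python
def frames_to_segments(flags, frame_ms=30):
    """Convert per-frame 0/1 flags into a list of {"start", "end"} dicts.

    Times are in seconds, measured from the start of the first frame.
    """
    segments = []
    start = None  # index of the first frame of the current speech run
    for i, flag in enumerate(flags):
        if flag and start is None:
            start = i
        elif not flag and start is not None:
            segments.append({"start": start * frame_ms / 1000,
                             "end": i * frame_ms / 1000})
            start = None
    if start is not None:  # speech ran until the end of the input
        segments.append({"start": start * frame_ms / 1000,
                         "end": len(flags) * frame_ms / 1000})
    return segments

probs = [0.02, 0.05, 0.91, 0.98, 0.87, 0.03, 0.01]
flags = [1 if p > 0.5 else 0 for p in probs]
segments = frames_to_segments(flags)
print(flags)     # [0, 0, 1, 1, 1, 0, 0]
print(segments)  # [{'start': 0.06, 'end': 0.15}]
```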

Sample Audio Requirements

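Many VAD libraries accept only specific audio formats, so input often has to be converted first. The values below are typical; check the documentation of the library you use: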

Property	Typical Requirement
Sample rate	8000, 16000, or 32000 Hz
Channels	Mono (1 channel)
Bit depth	16-bit PCM
Frame size	10, 20, or 30ms
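
These requirements can be checked before audio reaches the VAD. The sketch below uses Python's standard `wave` module to inspect a WAV header against the table above and to compute how many samples and bytes one frame occupies. The helper names `frame_size` and `check_wav` and the default `allowed_rates` tuple are assumptions taken from the table, not from any specific library.

```python
import io
import wave

def frame_size(sample_rate, frame_ms, sample_width=2):
    """Return (samples, bytes) for one mono frame of frame_ms milliseconds."""
    samples = sample_rate * frame_ms // 1000
    return samples, samples * sample_width

def check_wav(fileobj, allowed_rates=(8000, 16000, 32000)):
    """Return a list of problems found in a WAV header (empty if none)."""
    problems = []
    with wave.open(fileobj, "rb") as wav:
        rate = wav.getframerate()
        if rate not in allowed_rates:
            problems.append(f"sample rate {rate} Hz not in {allowed_rates}")
        if wav.getnchannels() != 1:
            problems.append(f"{wav.getnchannels()} channels, expected mono")
        if wav.getsampwidth() != 2:
            problems.append(f"{8 * wav.getsampwidth()}-bit samples, expected 16-bit")
    return problems

# Build a tiny in-memory stereo 44.1 kHz WAV to exercise the checks.
buf = io.BytesIO()
with wave.open(buf, "wb") as out:
    out.setnchannels(2)
    out.setsampwidth(2)
    out.setframerate(44100)
    out.writeframes(b"\x00\x00" * 2 * 100)  # 100 silent stereo frames
buf.seek(0)

print(check_wav(buf))         # two problems: sample rate and channel count
print(frame_size(16000, 30))  # (480, 960)
```

A file that fails these checks would need resampling and downmixing to mono before being split into frames.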
