ASR Libraries Comparison
2026-03-21
Detailed breakdown of every major Python ASR library — models, performance, features, and the right use case for each.
Quick Comparison Matrix
| Library | Model | Size | GPU? | Offline | Streaming | Languages | License |
|---|---|---|---|---|---|---|---|
| openai-whisper | Whisper | 39MB–2.9GB | Optional | ✅ | ❌ (30s chunks) | 100+ | MIT |
| faster-whisper | Whisper (CT2) | same | Optional | ✅ | ✅ | 100+ | MIT |
| RealtimeSTT | faster-whisper | same | Optional | ✅ | ✅ | 100+ | MIT |
| transformers (Wav2Vec2) | Wav2Vec2/HuBERT | 300MB–1GB | Recommended | ✅ | ✅ | Per-model | Apache-2 |
| vosk | Vosk/Kaldi | 40MB–2GB | No | ✅ | ✅ | 20+ | Apache-2 |
| speechbrain | Various | 200MB+ | Recommended | ✅ | ✅ | Multi | Apache-2 |
| nemo | Conformer | 50–500MB | Recommended | ✅ | ✅ | Multi | Apache-2 |
| speech_recognition | Cloud APIs | 0 | Cloud | Partial (Sphinx, Whisper) | ❌ | 100+ | BSD-3 |
| whisper.cpp | Whisper GGML | 39MB–1.5GB | Optional | ✅ | ✅ | 100+ | MIT |
1. openai-whisper
GitHub: openai/whisper
The original Whisper implementation in PyTorch.
Strengths
- Reference implementation — always up to date
- Simple API
- Word-level timestamps
- Translation mode (any language → English)
- Supports 100+ languages and automatic language detection
Weaknesses
- Slower than faster-whisper (2–4×)
- High VRAM usage at large model sizes
- Not designed for true real-time streaming
Quick Usage
import whisper
model = whisper.load_model("base") # tiny/base/small/medium/large-v3/turbo
# Transcribe a file
result = model.transcribe("audio.wav")
print(result["text"])
# With options
result = model.transcribe(
    "audio.wav",
    language="en",          # force language (None = auto-detect)
    task="transcribe",      # "transcribe" or "translate"
    fp16=True,              # use float16 (GPU only)
    word_timestamps=True,   # enable word-level timestamps
    verbose=False,
)
Best For
- Offline file transcription
- Multilingual projects
- When you want the simplest possible Whisper API
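The call above turns on word_timestamps=True but never reads the per-word data back. A minimal sketch of accessing it from the returned dict (this is the standard segments/words structure openai-whisper produces):
# `result` comes from the word_timestamps=True call above
for segment in result["segments"]:
    print(f"[{segment['start']:.2f}s -> {segment['end']:.2f}s] {segment['text']}")
    for word in segment.get("words", []):
        print(f"  {word['start']:.2f}-{word['end']:.2f}: {word['word']}")
Each word entry also carries a probability field, which is useful for flagging low-confidence words.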
2. faster-whisper
GitHub: SYSTRAN/faster-whisper
Whisper reimplemented using CTranslate2 — the community's top pick for production.
Strengths
- 2–4× faster than openai-whisper on CPU
- Up to 4× less VRAM on GPU
- INT8 quantization (CPU) for even lower memory
- Word-level timestamps via DTW
- Built-in VAD filter (silero-vad integration)
- Streaming-friendly API: model.transcribe() returns a lazy generator of segments
Weaknesses
- External dependency (CTranslate2 compiled binaries)
- Slightly different API from openai-whisper
Models
from faster_whisper import WhisperModel
# CPU (int8 quantization — recommended for CPU-only setups)
model = WhisperModel("base.en", device="cpu", compute_type="int8")
# GPU (float16 — recommended for GPU)
model = WhisperModel("large-v3", device="cuda", compute_type="float16")
# GPU (int8 — saves VRAM, slight accuracy trade-off)
model = WhisperModel("large-v3", device="cuda", compute_type="int8_float16")
# Turbo (best speed/accuracy balance)
model = WhisperModel("turbo", device="cuda", compute_type="float16")Transcription
segments, info = model.transcribe(
"audio.wav",
language="en", # None = auto-detect
task="transcribe",
beam_size=5,
best_of=5,
patience=1.0,
temperature=0.0, # 0 = deterministic
condition_on_previous_text=True,
vad_filter=True, # built-in silero VAD to skip silence
vad_parameters={
"threshold": 0.5,
"min_speech_duration_ms": 250,
"min_silence_duration_ms": 2000,
},
word_timestamps=True,
)
# Iterate segments (generator — lazy evaluation)
for seg in segments:
print(f"[{seg.start:.2f}s → {seg.end:.2f}s] {seg.text}")
if seg.words:
for w in seg.words:
print(f" {w.start:.2f}-{w.end:.2f}: {w.word}")Best For
- Production transcription — this is the recommended default
- CPU-only servers (int8 mode)
- When you need word timestamps
- Real-time streaming pipelines
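transcribe() also returns an info object with the detected language and audio duration. Note that segments is a lazy generator and can only be iterated once, so the sketch below (attribute names as in current faster-whisper releases) starts from a fresh call and writes the result to an SRT file:
# `model` is the WhisperModel created above
segments, info = model.transcribe("audio.wav", vad_filter=True)
print(f"Detected language: {info.language} (p={info.language_probability:.2f}), {info.duration:.1f}s of audio")

def fmt_ts(seconds: float) -> str:
    # SRT timestamp format: HH:MM:SS,mmm
    ms = int(round(seconds * 1000))
    h, ms = divmod(ms, 3_600_000)
    m, ms = divmod(ms, 60_000)
    s, ms = divmod(ms, 1_000)
    return f"{h:02d}:{m:02d}:{s:02d},{ms:03d}"

with open("audio.srt", "w", encoding="utf-8") as f:
    for i, seg in enumerate(segments, start=1):
        f.write(f"{i}\n{fmt_ts(seg.start)} --> {fmt_ts(seg.end)}\n{seg.text.strip()}\n\n")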
3. RealtimeSTT
GitHub: KoljaB/RealtimeSTT
A high-level library wrapping faster-whisper with a ready-to-use real-time transcription pipeline.
Strengths
- Easiest way to get microphone → real-time text working
- Built-in VAD (silero) + audio buffering
- Callback-based API
- Handles all the threading complexity for you
Usage
from RealtimeSTT import AudioToTextRecorder
def process_text(text: str):
    print(f"Transcribed: {text}")

recorder = AudioToTextRecorder(
    model="base.en",
    language="en",
    silero_sensitivity=0.4,
    webrtc_sensitivity=2,
    enable_realtime_transcription=True,   # realtime callbacks only fire when enabled
    on_realtime_transcription_stabilized=process_text,
)

print("Speak now...")
recorder.start()
input("Press Enter to stop...")
recorder.stop()
Best For
- Rapid prototyping of voice applications
- When you don't want to manage VAD + ASR threading manually
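If you only need the final text of each utterance rather than live partial updates, the recorder also offers a blocking text() call. A minimal sketch in the style of the project's examples (behaviour may vary between versions):
from RealtimeSTT import AudioToTextRecorder

if __name__ == "__main__":
    recorder = AudioToTextRecorder(model="base.en", language="en")
    print("Speak now...")
    while True:
        # Blocks until a complete utterance is detected, then returns its transcription
        print("You said:", recorder.text())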
4. transformers (Wav2Vec2 / HuBERT / MMS)
GitHub: huggingface/transformers
HuggingFace transformers gives access to hundreds of ASR models.
Key Models
| Model | Size | Best For |
|---|---|---|
| facebook/wav2vec2-base-960h | 360MB | English, fast |
| facebook/wav2vec2-large-960h-lv60-self | 1.18GB | English, best quality |
| facebook/hubert-large-ls960-ft | 1.25GB | English, very accurate |
| facebook/mms-300m | 1.7GB | 1000+ languages |
| openai/whisper-large-v3 | 3GB | Via pipeline API |
Usage: pipeline API (Simplest)
from transformers import pipeline
asr = pipeline(
    "automatic-speech-recognition",
    model="facebook/wav2vec2-base-960h",
    device=0,   # 0 = first GPU, -1 = CPU
)

result = asr("audio.wav")
print(result["text"])

# With chunking for long audio
result = asr(
    "long_audio.wav",
    chunk_length_s=30,
    stride_length_s=5,
    return_timestamps="word",
)
Usage: Manual (More Control)
import torch
import soundfile as sf
from transformers import Wav2Vec2ForCTC, Wav2Vec2Processor
processor = Wav2Vec2Processor.from_pretrained("facebook/wav2vec2-base-960h")
model = Wav2Vec2ForCTC.from_pretrained("facebook/wav2vec2-base-960h")
audio, sr = sf.read("audio.wav")
inputs = processor(
    audio,
    sampling_rate=sr,        # Wav2Vec2 checkpoints expect 16 kHz mono audio
    return_tensors="pt",
    padding=True,
)

with torch.no_grad():
    logits = model(**inputs).logits

predicted_ids = torch.argmax(logits, dim=-1)
transcription = processor.batch_decode(predicted_ids)
print(transcription[0])
Fine-tuning on Custom Data
Wav2Vec2 is the go-to for domain adaptation (medical terms, names, jargon):
from transformers import Wav2Vec2ForCTC, Wav2Vec2Processor, TrainingArguments, Trainer

processor = Wav2Vec2Processor.from_pretrained("facebook/wav2vec2-base-960h")  # or your own vocab/processor

model = Wav2Vec2ForCTC.from_pretrained(
    "facebook/wav2vec2-base",
    ctc_loss_reduction="mean",
    pad_token_id=processor.tokenizer.pad_token_id,
    vocab_size=len(processor.tokenizer),
)
model.freeze_feature_encoder()  # keep the CNN feature extractor weights frozen

# Use HuggingFace Trainer or a custom training loop
Best For
- Fine-tuning on custom vocabulary/domain
- 1000+ language support (MMS model)
- Research and experimentation
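The fine-tuning snippet stops before data preparation. As a rough sketch of the usual CTC preprocessing step (the "audio" and "text" column names are assumptions that depend on your dataset; the pattern follows the standard HuggingFace datasets workflow):
# Hypothetical example: `dataset` rows have an "audio" column (array + sampling_rate,
# e.g. cast with datasets.Audio(sampling_rate=16000)) and a "text" transcript column.
def prepare_example(batch):
    audio = batch["audio"]
    # Raw waveform -> model input values (feature extractor)
    batch["input_values"] = processor(
        audio["array"], sampling_rate=audio["sampling_rate"]
    ).input_values[0]
    # Transcript -> CTC label ids (tokenizer)
    batch["labels"] = processor.tokenizer(batch["text"]).input_ids
    return batch

# dataset = dataset.map(prepare_example, remove_columns=dataset.column_names)
A padding data collator that pads input_values and labels separately is also needed before handing the dataset to Trainer.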
5. Vosk
GitHub: alphacep/vosk-api
Offline streaming ASR based on Kaldi, designed for embedded and real-time use.
Strengths
- Very small models (40MB English small model)
- True token-by-token streaming output
- Works on Raspberry Pi, Android, iOS
- No GPU required
- 20+ language models available
Weaknesses
- Lower accuracy than Whisper
- The small English model has a word error rate of roughly 8–12%
- No multilingual single model
Usage
from vosk import Model, KaldiRecognizer
import wave, json
model = Model("vosk-model-small-en-us-0.15") # download from vosk website
wf = wave.open("audio.wav", "rb")  # must be 16-bit mono PCM WAV
rec = KaldiRecognizer(model, wf.getframerate())
rec.SetWords(True)
results = []
while True:
    data = wf.readframes(4000)
    if not data:
        break
    if rec.AcceptWaveform(data):
        r = json.loads(rec.Result())
        results.append(r)
    else:
        partial = json.loads(rec.PartialResult())
        print(f"Partial: {partial.get('partial', '')}", end="\r")

final = json.loads(rec.FinalResult())
print("\nFinal:", final.get("text", ""))
Best For
- Embedded / Raspberry Pi / Android / iOS
- True word-by-word streaming output
- Applications with strict memory constraints (<100MB)
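Vosk's word-by-word streaming is most useful on live microphone input. A minimal sketch in the style of the official microphone example, using the sounddevice package (an extra dependency assumed here, not installed with vosk):
import json
import queue

import sounddevice as sd
from vosk import Model, KaldiRecognizer

SAMPLE_RATE = 16000
q = queue.Queue()

def callback(indata, frames, time, status):
    # Runs on the audio thread; just hand the raw PCM bytes to the main loop
    q.put(bytes(indata))

model = Model("vosk-model-small-en-us-0.15")
rec = KaldiRecognizer(model, SAMPLE_RATE)

with sd.RawInputStream(samplerate=SAMPLE_RATE, blocksize=8000, dtype="int16",
                       channels=1, callback=callback):
    print("Listening (Ctrl+C to stop)...")
    while True:
        data = q.get()
        if rec.AcceptWaveform(data):
            print(json.loads(rec.Result()).get("text", ""))
        else:
            print(json.loads(rec.PartialResult()).get("partial", ""), end="\r")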
6. speech_recognition (SpeechRecognition)
GitHub: Uberi/SpeechRecognition
Python wrapper for multiple ASR backends: Google, Azure, IBM, Sphinx, Whisper.
Backends Supported
| Backend | Offline | Cost |
|---|---|---|
| Google Cloud STT | No | Pay-per-use |
| Google Web Speech API | No | Free (unofficial) |
| Azure Cognitive Services | No | Pay-per-use |
| IBM Watson | No | Pay-per-use |
| CMU Sphinx | Yes | Free |
| Whisper (local) | Yes | Free |
Usage
import speech_recognition as sr
r = sr.Recognizer()
# From microphone
with sr.Microphone() as source:
    r.adjust_for_ambient_noise(source, duration=1)
    print("Speak:")
    audio = r.listen(source)

# Recognize
try:
    # Google (requires internet)
    text = r.recognize_google(audio)
    print(f"Google: {text}")

    # Whisper (offline)
    text = r.recognize_whisper(audio, model="base.en")
    print(f"Whisper: {text}")
except sr.UnknownValueError:
    print("Could not understand audio")
except sr.RequestError as e:
    print(f"Recognition failed: {e}")
Best For
- Quick prototyping with cloud backends
- When you need a one-liner and accuracy doesn't matter yet
- Educational/demo projects
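The library also handles audio files through sr.AudioFile, which is handy for quick offline tests:
import speech_recognition as sr

r = sr.Recognizer()

# From an audio file (WAV/AIFF/FLAC are supported)
with sr.AudioFile("audio.wav") as source:
    audio = r.record(source)                # read the entire file
    # audio = r.record(source, duration=30) # or just the first 30 seconds

try:
    print(r.recognize_google(audio))
except sr.UnknownValueError:
    print("Could not understand audio")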
Decision Flowchart
What matters most?
│
├─ Speed + accuracy (production)
│ ├─ GPU available → faster-whisper (turbo/large-v3, float16)
│ └─ CPU only → faster-whisper (base/small, int8)
│
├─ Easiest real-time setup
│ └─ RealtimeSTT
│
├─ Fine-tuning on custom vocabulary
│ └─ Wav2Vec2 via transformers
│
├─ Embedded / Raspberry Pi / offline no-GPU
│ └─ Vosk (small model)
│
├─ Cloud API (prototype, don't care about cost)
│ └─ speech_recognition (Google)
│
├─ 1000+ language support
│ └─ MMS (facebook/mms-300m via transformers)
│
└─ NVIDIA production inference stack
  └─ NeMo Conformer-CTC
See Also
- Introduction to ASR
- ASR Installation
- ASR Implementation
- Real-Time Streaming ASR
- VAD Libraries Comparison — faster-whisper's built-in VAD uses silero; library selection affects which VAD to pair with
- TTS Libraries Comparison — ASR WER on TTS output (round-trip test) is a standard quality metric