ASR Libraries Comparison
2026-03-21
Detailed breakdown of every major Python ASR library — models, performance, features, and the right use case for each.
Quick Comparison Matrix
| Library | Model | Size | GPU? | Offline | Streaming | Languages | License |
|---|---|---|---|---|---|---|---|
| openai-whisper | Whisper | 39MB–2.9GB | Optional | ✅ | ❌ (30s chunks) | 100+ | MIT |
| faster-whisper | Whisper (CT2) | same | Optional | ✅ | ✅ | 100+ | MIT |
| RealtimeSTT | faster-whisper | same | Optional | ✅ | ✅ | 100+ | MIT |
| transformers (Wav2Vec2) | Wav2Vec2/HuBERT | 300MB–1GB | Recommended | ✅ | ✅ | Per-model | Apache-2 |
| vosk | Vosk/Kaldi | 40MB–2GB | No | ✅ | ✅ | 20+ | Apache-2 |
| speechbrain | Various | 200MB+ | Recommended | ✅ | ✅ | Multi | Apache-2 |
| nemo | Conformer | 50–500MB | Recommended | ✅ | ✅ | Multi | Apache-2 |
| speech_recognition | Cloud APIs | 0 | Cloud | Partial (Sphinx, Whisper) | ❌ | 100+ | BSD-3 |
| whisper.cpp | Whisper GGML | 39MB–1.5GB | Optional | ✅ | ✅ | 100+ | MIT |
1. openai-whisper
GitHub: openai/whisper
The original Whisper implementation in PyTorch.
Strengths
- Reference implementation — always up to date
- Simple API
- Word-level timestamps
- Translation mode (any language → English)
- Supports 100+ languages and automatic language detection
Weaknesses
- Slower than faster-whisper (2–4×)
- High VRAM usage at large model sizes
- Not designed for true real-time streaming
Quick Usage
import whisper
model = whisper.load_model("base") # tiny/base/small/medium/large-v3/turbo
# Transcribe a file
result = model.transcribe("audio.wav")
print(result["text"])
# With options
result = model.transcribe(
    "audio.wav",
    language="en",          # force language (None = auto-detect)
    task="transcribe",      # "transcribe" or "translate"
    fp16=True,              # use float16 (GPU only)
    word_timestamps=True,   # enable word-level timestamps
    verbose=False,
)
Best For
- Offline file transcription
- Multilingual projects
- When you want the simplest possible Whisper API
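The call above turns on word_timestamps=True but never reads the per-word data back. A minimal sketch of accessing it from the returned dict (this is the standard segments/words structure openai-whisper produces):
# `result` comes from the word_timestamps=True call above
for segment in result["segments"]:
    print(f"[{segment['start']:.2f}s -> {segment['end']:.2f}s] {segment['text']}")
    for word in segment.get("words", []):
        print(f"  {word['start']:.2f}-{word['end']:.2f}: {word['word']}")
Each word entry also carries a probability field, which is useful for flagging low-confidence words.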
2. faster-whisper
GitHub: SYSTRAN/faster-whisper
Whisper reimplemented using CTranslate2 — the community's top pick for production.
Strengths
- 2–4× faster than openai-whisper on CPU
- Up to 4× less VRAM on GPU
- INT8 quantization (CPU) for even lower memory
- Word-level timestamps via DTW
- Built-in VAD filter (silero-vad integration)
- Streaming-friendly API: model.transcribe() returns a lazy generator of segments
Weaknesses
- External dependency (CTranslate2 compiled binaries)
- Slightly different API from openai-whisper
Models
from faster_whisper import WhisperModel
# CPU (int8 quantization — recommended for CPU-only setups)
model = WhisperModel("base.en", device="cpu", compute_type="int8")
# GPU (float16 — recommended for GPU)
model = WhisperModel("large-v3", device="cuda", compute_type="float16")
# GPU (int8 — saves VRAM, slight accuracy trade-off)
model = WhisperModel("large-v3", device="cuda", compute_type="int8_float16")
# Turbo (best speed/accuracy balance)
model = WhisperModel("turbo", device="cuda", compute_type="float16")Transcription
segments, info = model.transcribe(
"audio.wav",
language="en", # None = auto-detect
task="transcribe",
beam_size=5,
best_of=5,
patience=1.0,
temperature=0.0, # 0 = deterministic
condition_on_previous_text=True,
vad_filter=True, # built-in silero VAD to skip silence
vad_parameters={
"threshold": 0.5,
"min_speech_duration_ms": 250,
"min_silence_duration_ms": 2000,
},
word_timestamps=True,
)
# Iterate segments (generator — lazy evaluation)
for seg in segments:
print(f"[{seg.start:.2f}s → {seg.end:.2f}s] {seg.text}")
if seg.words:
for w in seg.words:
print(f" {w.start:.2f}-{w.end:.2f}: {w.word}")Best For
- Production transcription — this is the recommended default
- CPU-only servers (int8 mode)
- When you need word timestamps
- Real-time streaming pipelines
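transcribe() also returns an info object with the detected language and audio duration. Note that segments is a lazy generator and can only be iterated once, so the sketch below (attribute names as in current faster-whisper releases) starts from a fresh call and writes the result to an SRT file:
# `model` is the WhisperModel created above
segments, info = model.transcribe("audio.wav", vad_filter=True)
print(f"Detected language: {info.language} (p={info.language_probability:.2f}), {info.duration:.1f}s of audio")

def fmt_ts(seconds: float) -> str:
    # SRT timestamp format: HH:MM:SS,mmm
    ms = int(round(seconds * 1000))
    h, ms = divmod(ms, 3_600_000)
    m, ms = divmod(ms, 60_000)
    s, ms = divmod(ms, 1_000)
    return f"{h:02d}:{m:02d}:{s:02d},{ms:03d}"

with open("audio.srt", "w", encoding="utf-8") as f:
    for i, seg in enumerate(segments, start=1):
        f.write(f"{i}\n{fmt_ts(seg.start)} --> {fmt_ts(seg.end)}\n{seg.text.strip()}\n\n")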
3. RealtimeSTT
GitHub: KoljaB/RealtimeSTT
A high-level library wrapping faster-whisper with a ready-to-use real-time transcription pipeline.
Strengths
- Easiest way to get microphone → real-time text working
- Built-in VAD (silero) + audio buffering
- Callback-based API
- Handles all the threading complexity for you
Usage
from RealtimeSTT import AudioToTextRecorder
def process_text(text: str):
    print(f"Transcribed: {text}")

recorder = AudioToTextRecorder(
    model="base.en",
    language="en",
    silero_sensitivity=0.4,
    webrtc_sensitivity=2,
    enable_realtime_transcription=True,   # realtime callbacks only fire when enabled
    on_realtime_transcription_stabilized=process_text,
)

print("Speak now...")
recorder.start()
input("Press Enter to stop...")
recorder.stop()
Best For
- Rapid prototyping of voice applications
- When you don't want to manage VAD + ASR threading manually
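If you only need the final text of each utterance rather than live partial updates, the recorder also offers a blocking text() call. A minimal sketch in the style of the project's examples (behaviour may vary between versions):
from RealtimeSTT import AudioToTextRecorder

if __name__ == "__main__":
    recorder = AudioToTextRecorder(model="base.en", language="en")
    print("Speak now...")
    while True:
        # Blocks until a complete utterance is detected, then returns its transcription
        print("You said:", recorder.text())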
4. transformers (Wav2Vec2 / HuBERT / MMS)
GitHub: huggingface/transformers
HuggingFace transformers gives access to hundreds of ASR models.
Key Models
| Model | Size | Best For |
|---|---|---|
| facebook/wav2vec2-base-960h | 360MB | English, fast |
| facebook/wav2vec2-large-960h-lv60-self | 1.18GB | English, best quality |
| facebook/hubert-large-ls960-ft | 1.25GB | English, very accurate |
| facebook/mms-300m | 1.7GB | 1000+ languages |
| openai/whisper-large-v3 | 3GB | Via pipeline API |
Usage: pipeline API (Simplest)
from transformers import pipeline
asr = pipeline(
    "automatic-speech-recognition",
    model="facebook/wav2vec2-base-960h",
    device=0,   # 0 = first GPU, -1 = CPU
)

result = asr("audio.wav")
print(result["text"])

# With chunking for long audio
result = asr(
    "long_audio.wav",
    chunk_length_s=30,
    stride_length_s=5,
    return_timestamps="word",
)
Usage: Manual (More Control)
import torch
import soundfile as sf
from transformers import Wav2Vec2ForCTC, Wav2Vec2Processor
processor = Wav2Vec2Processor.from_pretrained("facebook/wav2vec2-base-960h")
model = Wav2Vec2ForCTC.from_pretrained("facebook/wav2vec2-base-960h")
audio, sr = sf.read("audio.wav")
inputs = processor(
    audio,
    sampling_rate=sr,        # Wav2Vec2 checkpoints expect 16 kHz mono audio
    return_tensors="pt",
    padding=True,
)

with torch.no_grad():
    logits = model(**inputs).logits

predicted_ids = torch.argmax(logits, dim=-1)
transcription = processor.batch_decode(predicted_ids)
print(transcription[0])
Fine-tuning on Custom Data
Wav2Vec2 is the go-to for domain adaptation (medical terms, names, jargon):
from transformers import Wav2Vec2ForCTC, Wav2Vec2Processor, TrainingArguments, Trainer

processor = Wav2Vec2Processor.from_pretrained("facebook/wav2vec2-base-960h")  # or your own vocab/processor

model = Wav2Vec2ForCTC.from_pretrained(
    "facebook/wav2vec2-base",
    ctc_loss_reduction="mean",
    pad_token_id=processor.tokenizer.pad_token_id,
    vocab_size=len(processor.tokenizer),
)
model.freeze_feature_encoder()  # keep the CNN feature extractor weights frozen

# Use HuggingFace Trainer or a custom training loop
Best For
- Fine-tuning on custom vocabulary/domain
- 1000+ language support (MMS model)
- Research and experimentation
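The fine-tuning snippet stops before data preparation. As a rough sketch of the usual CTC preprocessing step (the "audio" and "text" column names are assumptions that depend on your dataset; the pattern follows the standard HuggingFace datasets workflow):
# Hypothetical example: `dataset` rows have an "audio" column (array + sampling_rate,
# e.g. cast with datasets.Audio(sampling_rate=16000)) and a "text" transcript column.
def prepare_example(batch):
    audio = batch["audio"]
    # Raw waveform -> model input values (feature extractor)
    batch["input_values"] = processor(
        audio["array"], sampling_rate=audio["sampling_rate"]
    ).input_values[0]
    # Transcript -> CTC label ids (tokenizer)
    batch["labels"] = processor.tokenizer(batch["text"]).input_ids
    return batch

# dataset = dataset.map(prepare_example, remove_columns=dataset.column_names)
A padding data collator that pads input_values and labels separately is also needed before handing the dataset to Trainer.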
5. Vosk
GitHub: alphacep/vosk-api
Offline streaming ASR based on Kaldi, designed for embedded and real-time use.
Strengths
- Very small models (40MB English small model)
- True token-by-token streaming output
- Works on Raspberry Pi, Android, iOS
- No GPU required
- 20+ language models available
Weaknesses
- Lower accuracy than Whisper
- The small English model has a word error rate of roughly 8–12%
- No multilingual single model
Usage
from vosk import Model, KaldiRecognizer
import wave, json
model = Model("vosk-model-small-en-us-0.15") # download from vosk website
wf = wave.open("audio.wav", "rb")  # must be 16-bit mono PCM WAV
rec = KaldiRecognizer(model, wf.getframerate())
rec.SetWords(True)
results = []
while True:
    data = wf.readframes(4000)
    if not data:
        break
    if rec.AcceptWaveform(data):
        r = json.loads(rec.Result())
        results.append(r)
    else:
        partial = json.loads(rec.PartialResult())
        print(f"Partial: {partial.get('partial', '')}", end="\r")

final = json.loads(rec.FinalResult())
print("\nFinal:", final.get("text", ""))
Best For
- Embedded / Raspberry Pi / Android / iOS
- True word-by-word streaming output
- Applications with strict memory constraints (<100MB)
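Vosk's word-by-word streaming is most useful on live microphone input. A minimal sketch in the style of the official microphone example, using the sounddevice package (an extra dependency assumed here, not installed with vosk):
import json
import queue

import sounddevice as sd
from vosk import Model, KaldiRecognizer

SAMPLE_RATE = 16000
q = queue.Queue()

def callback(indata, frames, time, status):
    # Runs on the audio thread; just hand the raw PCM bytes to the main loop
    q.put(bytes(indata))

model = Model("vosk-model-small-en-us-0.15")
rec = KaldiRecognizer(model, SAMPLE_RATE)

with sd.RawInputStream(samplerate=SAMPLE_RATE, blocksize=8000, dtype="int16",
                       channels=1, callback=callback):
    print("Listening (Ctrl+C to stop)...")
    while True:
        data = q.get()
        if rec.AcceptWaveform(data):
            print(json.loads(rec.Result()).get("text", ""))
        else:
            print(json.loads(rec.PartialResult()).get("partial", ""), end="\r")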
6. speech_recognition (SpeechRecognition)
GitHub: Uberi/SpeechRecognition
Python wrapper for multiple ASR backends: Google, Azure, IBM, Sphinx, Whisper.
Backends Supported
| Backend | Offline | Cost |
|---|---|---|
| Google Cloud STT | No | Pay-per-use |
| Google Web Speech API | No | Free (unofficial) |
| Azure Cognitive Services | No | Pay-per-use |
| IBM Watson | No | Pay-per-use |
| CMU Sphinx | Yes | Free |
| Whisper (local) | Yes | Free |
Usage
import speech_recognition as sr
r = sr.Recognizer()
# From microphone
with sr.Microphone() as source:
    r.adjust_for_ambient_noise(source, duration=1)
    print("Speak:")
    audio = r.listen(source)

# Recognize
try:
    # Google (requires internet)
    text = r.recognize_google(audio)
    print(f"Google: {text}")

    # Whisper (offline)
    text = r.recognize_whisper(audio, model="base.en")
    print(f"Whisper: {text}")
except sr.UnknownValueError:
    print("Could not understand audio")
except sr.RequestError as e:
    print(f"Recognition failed: {e}")
Best For
- Quick prototyping with cloud backends
- When you need a one-liner and accuracy doesn't matter yet
- Educational/demo projects
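The library also handles audio files through sr.AudioFile, which is handy for quick offline tests:
import speech_recognition as sr

r = sr.Recognizer()

# From an audio file (WAV/AIFF/FLAC are supported)
with sr.AudioFile("audio.wav") as source:
    audio = r.record(source)                # read the entire file
    # audio = r.record(source, duration=30) # or just the first 30 seconds

try:
    print(r.recognize_google(audio))
except sr.UnknownValueError:
    print("Could not understand audio")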
Decision Flowchart
What matters most?
│
├─ Speed + accuracy (production)
│ ├─ GPU available → faster-whisper (turbo/large-v3, float16)
│ └─ CPU only → faster-whisper (base/small, int8)
│
├─ Easiest real-time setup
│ └─ RealtimeSTT
│
├─ Fine-tuning on custom vocabulary
│ └─ Wav2Vec2 via transformers
│
├─ Embedded / Raspberry Pi / offline no-GPU
│ └─ Vosk (small model)
│
├─ Cloud API (prototype, don't care about cost)
│ └─ speech_recognition (Google)
│
├─ 1000+ language support
│ └─ MMS (facebook/mms-300m via transformers)
│
└─ NVIDIA production inference stack
  └─ NeMo Conformer-CTC
See Also
- Introduction to ASR
- ASR Installation
- ASR Implementation
- Real-Time Streaming ASR
- VAD Libraries Comparison — faster-whisper's built-in VAD uses silero; library selection affects which VAD to pair with
- TTS Libraries Comparison — ASR WER on TTS output (round-trip test) is a standard quality metric