Troubleshooting
2026-03-21
This guide covers the most common build failures, model load errors, and runtime issues encountered when using llama.cpp, along with specific resolution steps.
Diagnostic First Steps
Before diving into specific errors, gather information:
# 1. Enable verbose output
llama-cli -m model.gguf -p "test" --verbose
# 2. Inspect the GGUF file
./build/bin/llama-gguf-info -m model.gguf | head -40
# 3. Check available RAM
free -h
# 4. Check GPU and VRAM (NVIDIA)
nvidia-smi
# 5. Check Metal (macOS)
system_profiler SPDisplaysDataType | grep VRAM
# 6. Check file integrity
sha256sum model.gguf   # compare with the published hash
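To act on step 6, you can check the file against the hash published on the model's download page. A minimal sketch; the EXPECTED value is a placeholder for the real published hash:

```bash
# Compare the local file against a published SHA-256 (placeholder value)
EXPECTED="0123abc..."   # copy the real hash from the model page
echo "$EXPECTED  model.gguf" | sha256sum --check -
```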
Build Errors
CUDA not found / nvcc not found
Error:
CMake Error: Could not find CUDA toolkit.
Fix:
# Verify CUDA installation
nvcc --version
nvidia-smi
# If installed but not found, set paths
export CUDA_HOME=/usr/local/cuda
export PATH=$CUDA_HOME/bin:$PATH
export LD_LIBRARY_PATH=$CUDA_HOME/lib64:$LD_LIBRARY_PATH
# Retry build
cmake -B build -DGGML_CUDA=ON
No CMAKE_CXX_COMPILER found
Error:
CMake Error: No CMAKE_CXX_COMPILER could be found.
Fix:
# Linux
sudo apt install build-essential
# macOS (must install Xcode command line tools)
xcode-select --install
# Verify
c++ --version
__hgt undefined / CUDA compute capability error
Error:
ggml-cuda.cu: error: identifier "__hgt" is undefined
Cause: The CUDA toolkit is too old (needs 12.0+ for half-precision comparisons), or the build targets a compute architecture without native half-precision support.
Fix: Upgrade CUDA toolkit to 12.x, or limit compute architectures to SM 7.5+:
cmake -B build -DGGML_CUDA=ON -DCMAKE_CUDA_ARCHITECTURES="75;80;86"
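To find the right value for CMAKE_CUDA_ARCHITECTURES, you can query the card's compute capability (this query field works on reasonably recent NVIDIA drivers):

```bash
# Prints e.g. "8.6"; use "86" in CMAKE_CUDA_ARCHITECTURES
nvidia-smi --query-gpu=compute_cap --format=csv,noheader
```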
Metal Framework Not Found (macOS)
Error:
No APPLE_FW_METAL framework found
Fix: Ensure the full Xcode app (not just the command line tools) is installed and selected:
xcode-select -p # should print /Applications/Xcode.app/Contents/Developer
sudo xcode-select -s /Applications/Xcode.app/Contents/Developer
Vulkan Header Not Found
Error:
Could NOT find Vulkan
Fix:
# Linux
sudo apt install libvulkan-dev vulkan-tools
# Alternatively, install the LunarG SDK:
# https://vulkan.lunarg.com/
C++17 Filesystem Error (Older macOS)
Error:
'filesystem' file not found
Cause: macOS < 10.15 does not have std::filesystem.
Fix: Set deployment target to 10.15 or upgrade macOS:
cmake -B build -DCMAKE_OSX_DEPLOYMENT_TARGET=10.15
Model Load Errors
unknown model architecture
Error:
llama_model_loader: error: unknown model architecture: 'gemma3'
Cause: Your llama.cpp build is too old to support this architecture.
Fix: Update llama.cpp:
git pull
cmake -B build && cmake --build build --config Release -j$(nproc)
Old GGML .bin Format
Error:
error: failed to open model file
or the model loads but produces garbage.
Cause: The model file is in the old GGML format (not GGUF). All models must now be GGUF.
Fix: Download the GGUF version of the model, or convert:
python convert_hf_to_gguf.py /path/to/hf-model --outfile model.gguf
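The converter writes an unquantized GGUF (typically F16/BF16); to shrink it, quantize afterwards. A minimal sketch, assuming the llama-quantize binary from a current build and placeholder file names:

```bash
# Quantize the converted GGUF down to Q4_K_M
./build/bin/llama-quantize model.gguf model-Q4_K_M.gguf Q4_K_M
```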
GGUF Version Mismatch
Error:
error: unsupported model file version: X
Fix: The model was created with a newer llama.cpp than your binary. Update your build:
git pull && cmake --build build --config Release -j$(nproc)
Tensor Load Failure
Error:
llama_model_loader: error loading tensor data
GGML_ASSERT: read_size == tensor_size
Cause: The file is corrupted or truncated (incomplete download).
Fix: Re-download the model and verify that the file size matches the size listed by the host:
huggingface-cli download <repo> <file> --force-download
Permission Denied
Error:
error opening file: Permission denied
Fix:
chmod 644 model.gguf
ls -la model.gguf   # confirm the file is readable
Runtime Errors
GGML_ASSERT: n_tokens == 0
Cause: The prompt was empty, or the context overflowed before generation began.
Fix: Check that your prompt is non-empty. If the context is full, increase --ctx-size or shorten the prompt.
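For example (model path and sizes are placeholders):

```bash
# Non-empty prompt, with enough context for prompt + generation
llama-cli -m model.gguf -p "Explain KV caches in one paragraph." \
  --ctx-size 4096 --n-predict 256
```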
Out of Memory — CPU
Error:
terminate called after throwing an instance of 'std::bad_alloc'
Fixes in order of impact (a combined example follows the list):
- Use a smaller quantization (Q4_K_M instead of Q8_0)
- Reduce context size: --ctx-size 2048
- Use a smaller model (7B instead of 13B)
- Enable KV cache quantization: --cache-type-k q8_0 --cache-type-v q8_0
- Disable memory mapping: --no-mmap
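A memory-lean invocation combining several of these (file name is a placeholder; note that also quantizing the V cache may require --flash-attn on some builds):

```bash
llama-cli -m model-Q4_K_M.gguf -p "test" \
  --ctx-size 2048 \
  --cache-type-k q8_0
```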
Out of Memory — GPU / cudaMalloc failed
Error:
CUDA error 2 at /src/ggml-cuda.cu:XXX: out of memory
Fixes (a combined example follows the list):
- Reduce --n-gpu-layers to offload fewer layers
- Reduce --ctx-size
- Use KV cache quantization: --cache-type-k q8_0
- Enable Flash Attention to shrink the attention buffers (and allow V-cache quantization): --flash-attn
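For example, on a card that holds only part of the model (layer count and sizes are placeholders to tune for your VRAM):

```bash
llama-cli -m model.gguf -p "test" \
  --n-gpu-layers 24 \
  --ctx-size 4096 \
  --flash-attn \
  --cache-type-k q8_0 --cache-type-v q8_0
```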
Garbage Output / Repetitive Text
Symptoms: Model repeats phrases, outputs nonsense, or ignores the prompt.
| Cause | Fix |
|---|---|
| Wrong chat template | Add --chat-template llama3 or the correct template for your model |
| Temperature too high | Use --temp 0.7 instead of > 1.5 |
| No repeat penalty | Add --repeat-penalty 1.1 |
| Model weights corrupted | Verify SHA256 of model file |
| Context already full at start | Reduce --ctx-size or shorten prompt |
| Using wrong model type (base vs instruct) | Use the -Instruct variant |
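A sane baseline for an instruct model (file and template names are illustrative; match the template to your model family):

```bash
llama-cli -m Llama-3-8B-Instruct-Q4_K_M.gguf \
  --chat-template llama3 \
  --temp 0.7 --repeat-penalty 1.1 \
  -p "Write a haiku about debugging."
```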
Very Slow CPU Inference
Fixes in order of impact (see the example after this list):
- Set --threads to the physical core count; note that $(nproc) reports logical cores, so halve it on SMT/hyper-threaded CPUs: -t $(nproc)
- Enable AVX2: rebuild with cmake -B build -DGGML_AVX2=ON -DGGML_NATIVE=ON
- Use a lower quantization (Q3_K_M for speed over Q6_K quality)
- Reduce context size
- Use a GPU if available
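For example, pinning to eight physical cores (the count is a placeholder; check lscpu):

```bash
# Physical cores = "Core(s) per socket" x "Socket(s)" in lscpu output
llama-cli -m model-Q4_K_M.gguf -p "test" -t 8
```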
llama-server Returns 503 {"error":"server busy"}
Cause: All parallel slots are occupied.
Fixes (a combined example follows the list):
- Increase the slot count: --parallel 4
- Increase --ctx-size proportionally, since it is shared across slots: --ctx-size 16384
- Enable continuous batching: --cont-batching
- Increase --timeout for slow hardware
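A server sized for four concurrent clients with 4096 tokens of context each (port and sizes are placeholders):

```bash
llama-server -m model.gguf --port 8080 \
  --parallel 4 --ctx-size 16384 \
  --cont-batching
```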
Python encode failed / tokenize failed
Error (llama-cpp-python):
ValueError: Failed to encode string
Fix: Keep your input as a plain Python string and encode it to UTF-8 bytes only at the tokenize call:
from llama_cpp import Llama  # pip install llama-cpp-python
llm = Llama(model_path="model.gguf")  # placeholder path
text = "your text here"  # a str, not b"..."
tokens = llm.tokenize(text.encode("utf-8"))  # tokenize expects UTF-8 bytes
Platform-Specific Issues
macOS: Segfault on Metal
Symptom: Process crashes immediately after model load on Apple Silicon.
Fix:
- Delete the compiled Metal shader cache: rm -rf ~/Library/Caches/ggml_metallib*
- Rebuild: cmake -B build && cmake --build build --config Release
Windows: DLL Not Found
Error:
The code execution cannot proceed because LLVM.dll was not found.
Fix: Install the Visual C++ Redistributable from Microsoft, and ensure the CUDA toolkit runtime DLLs are on PATH.
Windows: Slow Compared to Linux
Symptom: Same hardware, significantly fewer tokens/sec on Windows.
Fix: Disable memory mapping (Windows mmap has high overhead):
llama-cli -m model.gguf --no-mmap
Linux: GLIBCXX_3.4.30 not found with Prebuilt Binary
Error:
./llama-cli: /usr/lib/x86_64-linux-gnu/libstdc++.so.6: version GLIBCXX_3.4.30 not found
Cause: The prebuilt binary was compiled on a system with a newer glibc/libstdc++ than yours.
Fix: Build from source on your system (avoids glibc incompatibility):
cmake -B build && cmake --build build --config Release -j$(nproc)
AMD ROCm: hipErrorNoBinaryForGpu
Error:
hipErrorNoBinaryForGpu: Unable to find code object for all current devices
Fix: The binary was compiled for a different GPU architecture. Rebuild with the correct target:
cmake -B build -DGGML_HIP=ON -DAMDGPU_TARGETS="gfx1100" # replace with your GPU's gfx code
cmake --build build --config Release -j$(nproc)
Find your GPU's gfx code:
rocminfo | grep "gfx"
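If your card is not an officially supported ROCm target, a common workaround is to present it as a compatible architecture via an environment override. Whether this works depends on the GPU and ROCm version, so treat it as a sketch:

```bash
# Example: run a gfx1031/gfx1032 card as gfx1030
HSA_OVERRIDE_GFX_VERSION=10.3.0 ./build/bin/llama-cli -m model.gguf -p "test"
```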
Getting Help
If none of the above resolves your issue:
- Run with --verbose and capture the full output
- Search GitHub Issues by error message
- Open a new issue with: OS, GPU, CUDA/ROCm version, llama.cpp commit (git rev-parse HEAD), model name, and the full error output (the script below gathers the basics)
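A minimal sketch that collects those report fields in one file (swap the nvidia-smi line for rocminfo or system_profiler on other platforms):

```bash
{
  echo "OS:     $(uname -a)"
  echo "Commit: $(git rev-parse HEAD)"
  nvidia-smi --query-gpu=name,driver_version --format=csv,noheader 2>/dev/null
} > report.txt
```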
See Also
- Installation & Build — build flags and prerequisites
- GPU Acceleration — GPU-specific configuration
- Performance Tuning — OOM and slow inference optimizations
- GGUF & Quantization — model format issues