Troubleshooting
2026-03-21
This guide covers the most common build failures, model load errors, and runtime issues encountered when using llama.cpp, along with specific resolution steps.
Diagnostic First Steps
Before diving into specific errors, gather information:
# 1. Enable verbose output
llama-cli -m model.gguf -p "test" --verbose
# 2. Inspect the GGUF file
./build/bin/llama-gguf-info -m model.gguf | head -40
# 3. Check available RAM
free -h
# 4. Check GPU and VRAM (NVIDIA)
nvidia-smi
# 5. Check Metal (macOS)
system_profiler SPDisplaysDataType | grep VRAM
# 6. Check file integrity
sha256sum model.gguf   # compare with the published hash
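To act on step 6, you can check the file against the hash published on the model's download page. A minimal sketch; the EXPECTED value is a placeholder for the real published hash:

```bash
# Compare the local file against a published SHA-256 (placeholder value)
EXPECTED="0123abc..."   # copy the real hash from the model page
echo "$EXPECTED  model.gguf" | sha256sum --check -
```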
Build Errors
CUDA not found / nvcc not found
Error:
CMake Error: Could not find CUDA toolkit.
Fix:
# Verify CUDA installation
nvcc --version
nvidia-smi
# If installed but not found, set paths
export CUDA_HOME=/usr/local/cuda
export PATH=$CUDA_HOME/bin:$PATH
export LD_LIBRARY_PATH=$CUDA_HOME/lib64:$LD_LIBRARY_PATH
# Retry build
cmake -B build -DGGML_CUDA=ON
No CMAKE_CXX_COMPILER found
Error:
CMake Error: No CMAKE_CXX_COMPILER could be found.
Fix:
# Linux
sudo apt install build-essential
# macOS (must install Xcode command line tools)
xcode-select --install
# Verify
c++ --version
__hgt undefined / CUDA compute capability error
Error:
ggml-cuda.cu: error: identifier "__hgt" is undefined
Cause: The CUDA toolkit is too old (needs 12.0+ for half-precision comparisons), or the build targets a compute architecture without native half-precision support.
Fix: Upgrade CUDA toolkit to 12.x, or limit compute architectures to SM 7.5+:
cmake -B build -DGGML_CUDA=ON -DCMAKE_CUDA_ARCHITECTURES="75;80;86"
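To find the right value for CMAKE_CUDA_ARCHITECTURES, you can query the card's compute capability (this query field works on reasonably recent NVIDIA drivers):

```bash
# Prints e.g. "8.6"; use "86" in CMAKE_CUDA_ARCHITECTURES
nvidia-smi --query-gpu=compute_cap --format=csv,noheader
```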
Metal Framework Not Found (macOS)
Error:
No APPLE_FW_METAL framework found
Fix: Ensure the full Xcode app (not just the command line tools) is installed and selected:
xcode-select -p # should print /Applications/Xcode.app/Contents/Developer
sudo xcode-select -s /Applications/Xcode.app/Contents/Developer
Vulkan Header Not Found
Error:
Could NOT find Vulkan
Fix:
# Linux
sudo apt install libvulkan-dev vulkan-tools
# Alternatively, install the LunarG SDK:
# https://vulkan.lunarg.com/
C++17 Filesystem Error (Older macOS)
Error:
'filesystem' file not found
Cause: macOS < 10.15 does not have std::filesystem.
Fix: Set deployment target to 10.15 or upgrade macOS:
cmake -B build -DCMAKE_OSX_DEPLOYMENT_TARGET=10.15
Model Load Errors
unknown model architecture
Error:
llama_model_loader: error: unknown model architecture: 'gemma3'
Cause: Your llama.cpp build is too old to support this architecture.
Fix: Update llama.cpp:
git pull
cmake -B build && cmake --build build --config Release -j$(nproc)
Old GGML .bin Format
Error:
error: failed to open model file
or the model loads but produces garbage.
Cause: The model file is in the old GGML format (not GGUF). All models must now be GGUF.
Fix: Download the GGUF version of the model, or convert:
python convert_hf_to_gguf.py /path/to/hf-model --outfile model.gguf
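The converter writes an unquantized GGUF (typically F16/BF16); to shrink it, quantize afterwards. A minimal sketch, assuming the llama-quantize binary from a current build and placeholder file names:

```bash
# Quantize the converted GGUF down to Q4_K_M
./build/bin/llama-quantize model.gguf model-Q4_K_M.gguf Q4_K_M
```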
GGUF Version Mismatch
Error:
error: unsupported model file version: X
Fix: The model was created with a newer llama.cpp than your binary. Update your build:
git pull && cmake --build build --config Release -j$(nproc)
Tensor Load Failure
Error:
llama_model_loader: error loading tensor data
GGML_ASSERT: read_size == tensor_size
Cause: The file is corrupted or truncated (incomplete download).
Fix: Re-download the model and verify that the file size matches the size listed by the host:
huggingface-cli download <repo> <file> --force-download
Permission Denied
Error:
error opening file: Permission denied
Fix:
chmod 644 model.gguf
ls -la model.gguf   # confirm the file is readable
Runtime Errors
GGML_ASSERT: n_tokens == 0
Cause: The prompt was empty, or the context overflowed before generation began.
Fix: Check that your prompt is non-empty. If the context is full, increase --ctx-size or shorten the prompt.
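For example (model path and sizes are placeholders):

```bash
# Non-empty prompt, with enough context for prompt + generation
llama-cli -m model.gguf -p "Explain KV caches in one paragraph." \
  --ctx-size 4096 --n-predict 256
```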
Out of Memory — CPU
Error:
terminate called after throwing an instance of 'std::bad_alloc'
Fixes in order of impact (a combined example follows the list):
- Use a smaller quantization (Q4_K_M instead of Q8_0)
- Reduce context size: --ctx-size 2048
- Use a smaller model (7B instead of 13B)
- Enable KV cache quantization: --cache-type-k q8_0 --cache-type-v q8_0
- Disable memory mapping: --no-mmap
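A memory-lean invocation combining several of these (file name is a placeholder; note that also quantizing the V cache may require --flash-attn on some builds):

```bash
llama-cli -m model-Q4_K_M.gguf -p "test" \
  --ctx-size 2048 \
  --cache-type-k q8_0
```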
Out of Memory — GPU / cudaMalloc failed
Error:
CUDA error 2 at /src/ggml-cuda.cu:XXX: out of memory
Fixes (a combined example follows the list):
- Reduce --n-gpu-layers to offload fewer layers
- Reduce --ctx-size
- Use KV cache quantization: --cache-type-k q8_0
- Enable Flash Attention to shrink the attention buffers (and allow V-cache quantization): --flash-attn
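For example, on a card that holds only part of the model (layer count and sizes are placeholders to tune for your VRAM):

```bash
llama-cli -m model.gguf -p "test" \
  --n-gpu-layers 24 \
  --ctx-size 4096 \
  --flash-attn \
  --cache-type-k q8_0 --cache-type-v q8_0
```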
Garbage Output / Repetitive Text
Symptoms: Model repeats phrases, outputs nonsense, or ignores the prompt.
| Cause | Fix |
|---|---|
| Wrong chat template | Add --chat-template llama3 or the correct template for your model |
| Temperature too high | Use --temp 0.7 instead of > 1.5 |
| No repeat penalty | Add --repeat-penalty 1.1 |
| Model weights corrupted | Verify SHA256 of model file |
| Context already full at start | Reduce --ctx-size or shorten prompt |
| Using wrong model type (base vs instruct) | Use the -Instruct variant |
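A sane baseline for an instruct model (file and template names are illustrative; match the template to your model family):

```bash
llama-cli -m Llama-3-8B-Instruct-Q4_K_M.gguf \
  --chat-template llama3 \
  --temp 0.7 --repeat-penalty 1.1 \
  -p "Write a haiku about debugging."
```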
Very Slow CPU Inference
Fixes in order of impact (see the example after this list):
- Set --threads to the physical core count; note that $(nproc) reports logical cores, so halve it on SMT/hyper-threaded CPUs: -t $(nproc)
- Enable AVX2: rebuild with cmake -B build -DGGML_AVX2=ON -DGGML_NATIVE=ON
- Use a lower quantization (Q3_K_M for speed over Q6_K quality)
- Reduce context size
- Use a GPU if available
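For example, pinning to eight physical cores (the count is a placeholder; check lscpu):

```bash
# Physical cores = "Core(s) per socket" x "Socket(s)" in lscpu output
llama-cli -m model-Q4_K_M.gguf -p "test" -t 8
```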
llama-server Returns 503 {"error":"server busy"}
Cause: All parallel slots are occupied.
Fixes (a combined example follows the list):
- Increase the slot count: --parallel 4
- Increase --ctx-size proportionally, since it is shared across slots: --ctx-size 16384
- Enable continuous batching: --cont-batching
- Increase --timeout for slow hardware
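A server sized for four concurrent clients with 4096 tokens of context each (port and sizes are placeholders):

```bash
llama-server -m model.gguf --port 8080 \
  --parallel 4 --ctx-size 16384 \
  --cont-batching
```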
Python encode failed / tokenize failed
Error (llama-cpp-python):
ValueError: Failed to encode string
Fix: Keep your input as a plain Python string and encode it to UTF-8 bytes only at the tokenize call:
from llama_cpp import Llama  # pip install llama-cpp-python
llm = Llama(model_path="model.gguf")  # placeholder path
text = "your text here"  # a str, not b"..."
tokens = llm.tokenize(text.encode("utf-8"))  # tokenize expects UTF-8 bytes
Platform-Specific Issues
macOS: Segfault on Metal
Symptom: Process crashes immediately after model load on Apple Silicon.
Fix:
- Delete the compiled Metal shader cache: rm -rf ~/Library/Caches/ggml_metallib*
- Rebuild: cmake -B build && cmake --build build --config Release
Windows: DLL Not Found
Error:
The code execution cannot proceed because LLVM.dll was not found.
Fix: Install the Visual C++ Redistributable from Microsoft, and ensure the CUDA toolkit runtime DLLs are on PATH.
Windows: Slow Compared to Linux
Symptom: Same hardware, significantly fewer tokens/sec on Windows.
Fix: Disable memory mapping (Windows mmap has high overhead):
llama-cli -m model.gguf --no-mmap
Linux: GLIBCXX_3.4.30 not found with Prebuilt Binary
Error:
./llama-cli: /usr/lib/x86_64-linux-gnu/libstdc++.so.6: version GLIBCXX_3.4.30 not found
Cause: The prebuilt binary was compiled on a system with a newer glibc/libstdc++ than yours.
Fix: Build from source on your system (avoids glibc incompatibility):
cmake -B build && cmake --build build --config Release -j$(nproc)
AMD ROCm: hipErrorNoBinaryForGpu
Error:
hipErrorNoBinaryForGpu: Unable to find code object for all current devices
Fix: The binary was compiled for a different GPU architecture. Rebuild with the correct target:
cmake -B build -DGGML_HIP=ON -DAMDGPU_TARGETS="gfx1100" # replace with your GPU's gfx code
cmake --build build --config Release -j$(nproc)
Find your GPU's gfx code:
rocminfo | grep "gfx"
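If your card is not an officially supported ROCm target, a common workaround is to present it as a compatible architecture via an environment override. Whether this works depends on the GPU and ROCm version, so treat it as a sketch:

```bash
# Example: run a gfx1031/gfx1032 card as gfx1030
HSA_OVERRIDE_GFX_VERSION=10.3.0 ./build/bin/llama-cli -m model.gguf -p "test"
```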
Getting Help
If none of the above resolves your issue:
- Run with --verbose and capture the full output
- Search GitHub Issues by error message
- Open a new issue with: OS, GPU, CUDA/ROCm version, llama.cpp commit (git rev-parse HEAD), model name, and the full error output (the script below gathers the basics)
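A minimal sketch that collects those report fields in one file (swap the nvidia-smi line for rocminfo or system_profiler on other platforms):

```bash
{
  echo "OS:     $(uname -a)"
  echo "Commit: $(git rev-parse HEAD)"
  nvidia-smi --query-gpu=name,driver_version --format=csv,noheader 2>/dev/null
} > report.txt
```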
See Also
- Installation & Build — build flags and prerequisites
- GPU Acceleration — GPU-specific configuration
- Performance Tuning — OOM and slow inference optimizations
- GGUF & Quantization — model format issues