---
url: /kb/ai/llama-cpp/troubleshooting/index.md
---
# Troubleshooting

This guide covers the most common build failures, model load errors, and runtime issues encountered when using llama.cpp, along with specific resolution steps.

## Diagnostic First Steps

Before diving into specific errors, gather information:

```bash
# 1. Enable verbose output
llama-cli -m model.gguf -p "test" --verbose

# 2. Inspect the GGUF file
./build/bin/llama-gguf-info -m model.gguf | head -40

# 3. Check available RAM
free -h

# 4. Check GPU and VRAM (NVIDIA)
nvidia-smi

# 5. Check Metal (macOS)
system_profiler SPDisplaysDataType | grep VRAM

# 6. Check file integrity
sha256sum model.gguf   # compare with published hash
```

## Build Errors

### `CUDA not found` / `nvcc not found`

**Error**:

```
CMake Error: Could not find CUDA toolkit.
```

**Fix**:

```bash
# Verify CUDA installation
nvcc --version
nvidia-smi

# If installed but not found, set paths
export CUDA_HOME=/usr/local/cuda
export PATH=$CUDA_HOME/bin:$PATH
export LD_LIBRARY_PATH=$CUDA_HOME/lib64:$LD_LIBRARY_PATH

# Retry build
cmake -B build -DGGML_CUDA=ON
```

### `No CMAKE_CXX_COMPILER found`

**Error**:

```
CMake Error: No CMAKE_CXX_COMPILER could be found.
```

**Fix**:

```bash
# Linux
sudo apt install build-essential

# macOS (must install Xcode command line tools)
xcode-select --install

# Verify
c++ --version
```

### `__hgt` undefined / CUDA compute capability error

**Error**:

```
ggml-cuda.cu: error: identifier "__hgt" is undefined
```

**Cause**: The CUDA toolkit is too old, or the build targets a compute architecture that lacks half-precision comparison intrinsics.

**Fix**: Upgrade CUDA toolkit to 12.x, or limit compute architectures to SM 7.5+:

```bash
cmake -B build -DGGML_CUDA=ON -DCMAKE_CUDA_ARCHITECTURES="75;80;86"
```

### Metal Framework Not Found (macOS)

**Error**:

```
No APPLE_FW_METAL framework found
```

**Fix**: Ensure Xcode (not just command line tools) is installed and active:

```bash
xcode-select -p   # should print /Applications/Xcode.app/Contents/Developer
sudo xcode-select -s /Applications/Xcode.app/Contents/Developer
```

### Vulkan Header Not Found

**Error**:

```
Could NOT find Vulkan
```

**Fix**:

```bash
# Linux
sudo apt install libvulkan-dev vulkan-tools

# Alternatively, install the LunarG SDK:
# https://vulkan.lunarg.com/
```

### C++17 Filesystem Error (Older macOS)

**Error**:

```
'filesystem' file not found
```

**Cause**: macOS < 10.15 does not have `std::filesystem`.

**Fix**: Set deployment target to 10.15 or upgrade macOS:

```bash
cmake -B build -DCMAKE_OSX_DEPLOYMENT_TARGET=10.15
```

## Model Load Errors

### `unknown model architecture`

**Error**:

```
llama_model_loader: error: unknown model architecture: 'gemma3'
```

**Cause**: Your llama.cpp build is too old to support this architecture.

**Fix**: Update llama.cpp:

```bash
git pull
cmake -B build && cmake --build build --config Release -j$(nproc)
```

### Old GGML `.bin` Format

**Error**:

```
error: failed to open model file
```

or the model loads but produces garbage.

**Cause**: The model file is in the old GGML format (not GGUF). All models must now be GGUF.

**Fix**: Download a GGUF version of the model, or re-convert from the original Hugging Face weights:

```bash
python convert_hf_to_gguf.py /path/to/hf-model --outfile model.gguf
```

### GGUF Version Mismatch

**Error**:

```
error: unsupported model file version: X
```

**Fix**: The model was created with a newer llama.cpp than your binary. Update your build:

```bash
git pull && cmake --build build --config Release -j$(nproc)
```

### Tensor Load Failure

**Error**:

```
llama_model_loader: error loading tensor data
GGML_ASSERT: read_size == tensor_size
```

**Cause**: File is corrupted or truncated (incomplete download).

**Fix**: Re-download the model and verify that the file size matches the published size:

```bash
huggingface-cli download <repo> <file> --force-download
```
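
To confirm the new download is complete before loading it again, compare the local file against what the model page lists (sizes and hashes vary per file):

```bash
ls -l model.gguf       # byte count should match the size shown on the model page
sha256sum model.gguf   # compare against the published hash, if one is provided
```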

### Permission Denied

**Error**:

```
error opening file: Permission denied
```

**Fix**:

```bash
chmod 644 model.gguf
ls -la model.gguf   # confirm readable
```

## Runtime Errors

### `GGML_ASSERT: n_tokens == 0`

**Cause**: The prompt was empty, or the context overflowed before generation began.

**Fix**: Check that your prompt is non-empty. If the context is full, increase `--ctx-size` or shorten the prompt.
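
For example, a quick sanity check with a non-empty prompt and a context window large enough to hold it (values are illustrative):

```bash
llama-cli -m model.gguf -p "Hello, world" --ctx-size 4096 -n 64
```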

### Out of Memory — CPU

**Error**:

```
terminate called after throwing an instance of 'std::bad_alloc'
```

Fixes in order of impact:

1. Use a smaller quantization (Q4\_K\_M instead of Q8\_0)
2. Reduce context size: `--ctx-size 2048`
3. Use a smaller model (7B instead of 13B)
4. Enable KV cache quantization: `--cache-type-k q8_0 --cache-type-v q8_0`
5. Keep memory mapping enabled (the default) so the OS can page weights from disk on demand; `--no-mmap` forces the entire model into RAM, and `--mlock` pins it there
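
Several of these can be combined in a single run; a minimal sketch with an illustrative filename and values:

```bash
llama-cli -m model-Q4_K_M.gguf -p "test" \
  --ctx-size 2048 \
  --cache-type-k q8_0
```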

### Out of Memory — GPU / `cudaMalloc failed`

**Error**:

```
CUDA error 2 at /src/ggml-cuda.cu:XXX: out of memory
```

Fixes:

1. Reduce `--n-gpu-layers` to offload fewer layers
2. Reduce `--ctx-size`
3. Use KV cache quantization: `--cache-type-k q8_0`
4. Enable Flash Attention to reduce KV size: `--flash-attn`
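
Applied together, this might look like the following sketch (the layer count of 20 is illustrative; raise or lower it until the model fits in VRAM):

```bash
llama-cli -m model.gguf -p "test" \
  --n-gpu-layers 20 \
  --ctx-size 4096 \
  --flash-attn \
  --cache-type-k q8_0
```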

### Garbage Output / Repetitive Text

Symptoms: Model repeats phrases, outputs nonsense, or ignores the prompt.

| Cause | Fix |
|-------|-----|
| Wrong chat template | Add `--chat-template llama3` (or the correct template for your model) |
| Temperature too high | Use `--temp 0.7` instead of > 1.5 |
| No repeat penalty | Add `--repeat-penalty 1.1` |
| Model weights corrupted | Verify SHA256 of model file |
| Context already full at start | Increase `--ctx-size` or shorten the prompt |
| Using wrong model type (base vs instruct) | Use `-Instruct` variant |
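
As a starting point, a conservative invocation for an instruct-tuned model might look like this sketch (the filename and template name are illustrative; match them to your model):

```bash
llama-cli -m Meta-Llama-3-8B-Instruct.Q4_K_M.gguf -cnv \
  --chat-template llama3 \
  --temp 0.7 --repeat-penalty 1.1
```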

### Very Slow CPU Inference

Fixes in order of impact:

1. Set `--threads` to your physical core count; note that `$(nproc)` reports logical threads, so on a hyperthreaded CPU use roughly half that number
2. Enable AVX2: rebuild with `cmake -B build -DGGML_AVX2=ON -DGGML_NATIVE=ON`
3. Use lower quantization (Q3\_K\_M for speed over Q6\_K quality)
4. Reduce context size
5. Use GPU if available
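
For example, on an 8-core/16-thread desktop CPU (the core count is illustrative):

```bash
# Physical cores only; hyperthreads usually add contention rather than speed
llama-cli -m model.gguf -p "test" -t 8
```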

### llama-server Returns 503 `{"error":"server busy"}`

**Cause**: All parallel slots are occupied.

Fixes:

1. Increase the number of parallel slots: `--parallel 4` (`-np 4`)
2. Increase `--ctx-size` proportionally: `--ctx-size 16384`
3. Enable continuous batching: `--cont-batching`
4. Increase `--timeout` for slow hardware
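
For example, four slots sharing a 16384-token context (each slot then gets roughly 16384 / 4 = 4096 tokens):

```bash
llama-server -m model.gguf --parallel 4 --ctx-size 16384
```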

### Python `encode failed` / `tokenize failed`

**Error** (llama-cpp-python):

```
ValueError: Failed to encode string
```

**Fix**: `Llama.tokenize()` expects UTF-8 bytes, not a `str`; encode the string before passing it:

```python
from llama_cpp import Llama
llm = Llama(model_path="model.gguf")         # path is illustrative
text = "your text here"                      # a str, not b"..."
tokens = llm.tokenize(text.encode("utf-8"))  # tokenize() expects UTF-8 bytes
```

## Platform-Specific Issues

### macOS: Segfault on Metal

**Symptom**: Process crashes immediately after model load on Apple Silicon.

**Fix**:

1. Delete compiled Metal shaders cache:
   ```bash
   rm -rf ~/Library/Caches/ggml_metallib*
   ```
2. Rebuild:
   ```bash
   cmake -B build && cmake --build build --config Release
   ```

### Windows: DLL Not Found

**Error**:

```
The code execution cannot proceed because LLVM.dll was not found.
```

**Fix**: Install the Visual C++ Redistributable from Microsoft, and ensure CUDA toolkit runtime DLLs are in PATH.
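
To see which DLLs the binary expects and whether they resolve from your `PATH`, something like the following can help (run in a Developer Command Prompt; the CUDA DLL name varies by toolkit version and is illustrative):

```bash
dumpbin /dependents llama-cli.exe   # lists the DLLs the executable links against
where cudart64_12.dll               # should resolve to the CUDA bin directory
```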

### Windows: Slow Compared to Linux

**Symptom**: Same hardware, significantly fewer tokens/sec on Windows.

**Fix**: Disable memory mapping (Windows mmap has high overhead):

```bash
llama-cli -m model.gguf --no-mmap
```

### Linux: `GLIBCXX_3.4.30 not found` with Prebuilt Binary

**Error**:

```
./llama-cli: /usr/lib/x86_64-linux-gnu/libstdc++.so.6: version GLIBCXX_3.4.30 not found
```

**Cause**: The prebuilt binary was compiled against a newer libstdc++ (the GCC C++ runtime) than your system provides.

**Fix**: Build from source on your own system, which links against your local libstdc++:

```bash
cmake -B build && cmake --build build --config Release -j$(nproc)
```

### AMD ROCm: `hipErrorNoBinaryForGpu`

**Error**:

```
hipErrorNoBinaryForGpu: Unable to find code object for all current devices
```

**Fix**: The binary was compiled for a different GPU architecture. Rebuild with the correct target:

```bash
cmake -B build -DGGML_HIP=ON -DAMDGPU_TARGETS="gfx1100"  # replace with your GPU's gfx code
cmake --build build --config Release -j$(nproc)
```

Find your GPU's gfx code:

```bash
rocminfo | grep "gfx"
```

## Getting Help

If none of the above resolves your issue:

1. Run with `--verbose` and capture the full output
2. Check [GitHub Issues](https://github.com/ggml-org/llama.cpp/issues) — search by error message
3. Open a new issue with: OS, GPU, CUDA/ROCm version, llama.cpp commit (`git rev-parse HEAD`), model name, full error output

## See Also

* [Installation & Build](/kb/ai/llama-cpp/installation-build/) — build flags and prerequisites
* [GPU Acceleration](/kb/ai/llama-cpp/gpu-acceleration/) — GPU-specific configuration
* [Performance Tuning](/kb/ai/llama-cpp/performance-tuning/) — OOM and slow inference optimizations
* [GGUF & Quantization](/kb/ai/llama-cpp/gguf-quantization/) — model format issues
