# Installation & Build Guide
2026-03-21
llama.cpp offers multiple paths to get running: build from source for maximum hardware optimization, use prebuilt binaries for convenience, or install the Python bindings via pip. This guide covers all three.
## Prerequisites
| Tool | Minimum Version | Purpose |
|---|---|---|
| Git | 2.x | Cloning the repository |
| CMake | 3.14 | Build system |
| C++ compiler | C++17 support | Compilation |
| CUDA Toolkit | 11.8+ (12.x recommended) | CUDA backend |
| ROCm | 5.x | AMD GPU backend |
| Vulkan SDK | 1.3+ | Vulkan backend |
### Install Prerequisites
Linux (Debian/Ubuntu):
```bash
sudo apt update && sudo apt install -y \
    git cmake build-essential
```

macOS:

```bash
xcode-select --install
brew install cmake
```

Windows:
- Install Visual Studio 2022 (or Build Tools) with Desktop C++ workload
- Or install MinGW-w64 via MSYS2 or winget
- Install CMake from cmake.org
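Once installed, a quick way to confirm the toolchain is ready on Linux/macOS (any output at or above the minimums in the table is fine):

```bash
# Confirm each prerequisite is on PATH and meets the minimum version
git --version     # expect 2.x
cmake --version   # expect >= 3.14
c++ --version     # any C++17-capable compiler (GCC or Clang)
```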
## Clone the Repository
```bash
git clone https://github.com/ggml-org/llama.cpp
cd llama.cpp
```

To use a specific release tag:

```bash
git clone --branch b4820 https://github.com/ggml-org/llama.cpp
```

## CPU-Only Build
```bash
cmake -B build
cmake --build build --config Release -j$(nproc)
```

Binaries are placed in `build/bin/`.
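A quick smoke test of the fresh build (recent llama.cpp builds print the build number and commit with `--version`; if yours predates that flag, `--help` works too):

```bash
# Print build info to confirm the binary runs
./build/bin/llama-cli --version
```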
### SIMD Optimization Flags
By default, CMake auto-detects AVX/AVX2 support. Override explicitly:
| Flag | Default | Description |
|---|---|---|
| `GGML_AVX` | ON (auto) | Enable AVX instructions |
| `GGML_AVX2` | ON (auto) | Enable AVX2 (recommended) |
| `GGML_AVX512` | OFF | Enable AVX-512 (check CPU support) |
| `GGML_FMA` | ON (auto) | Fused multiply-add |
| `GGML_F16C` | ON (auto) | F16 conversion intrinsics |
| `GGML_NATIVE` | OFF | Optimize for the exact host CPU (`-march=native`) |
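For example, to produce a binary that also runs on older CPUs without AVX2 (a sketch; pick the flags that match your actual target hardware):

```bash
# Disable newer SIMD extensions for a more portable binary
cmake -B build -DGGML_AVX2=OFF -DGGML_FMA=OFF -DGGML_F16C=OFF
cmake --build build --config Release -j$(nproc)
```

Or, to optimize for the exact host CPU: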
```bash
cmake -B build -DGGML_NATIVE=ON
cmake --build build --config Release -j$(nproc)
```

## CUDA Build (NVIDIA GPUs)
### Requirements
- NVIDIA GPU with Compute Capability 5.0+ (Maxwell or newer)
- CUDA Toolkit 11.8 or 12.x installed
- `nvcc` in PATH
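A quick way to confirm the toolkit and driver are visible (standard CUDA commands):

```bash
nvcc --version   # CUDA compiler version
nvidia-smi       # driver version and visible GPUs
```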
### Build
```bash
cmake -B build \
    -DGGML_CUDA=ON \
    -DCMAKE_CUDA_ARCHITECTURES="75;80;86;89;90"
cmake --build build --config Release -j$(nproc)
```

Auto-detect current GPU only (faster compile):
```bash
cmake -B build \
    -DGGML_CUDA=ON \
    -DCMAKE_CUDA_ARCHITECTURES="native"
```

Common CUDA architectures:
| Architecture | Value | GPU Series |
|---|---|---|
| Turing | 75 | RTX 2000 series |
| Ampere | 80, 86 | RTX 3000 series, A100 |
| Ada Lovelace | 89 | RTX 4000 series |
| Hopper | 90 | H100 |
| Blackwell | 100, 120 | B100/B200 (100), RTX 5000 series (120) |
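If you're unsure of your GPU's compute capability, recent NVIDIA drivers can report it directly (older drivers may not support the `compute_cap` query field):

```bash
# Prints e.g. "8.6" for an RTX 3080; use "86" as the architecture value
nvidia-smi --query-gpu=name,compute_cap --format=csv
```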
## Metal Build (Apple Silicon & AMD)
Metal is automatically detected and enabled on macOS. No extra flags needed.
```bash
cmake -B build
cmake --build build --config Release -j$(nproc)
```

To explicitly disable Metal:

```bash
cmake -B build -DGGML_METAL=OFF
```

Metal provides access to the unified memory pool on Apple Silicon, allowing the GPU to use ordinary RAM. A MacBook Pro M2 with 32 GB RAM can run a 13B Q4_K_M model entirely on GPU.
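To confirm Metal is actually used at runtime, check the startup log for Metal initialization messages (exact wording varies by version; the model path here is a placeholder):

```bash
# Look for ggml_metal_init lines in the load log
./build/bin/llama-cli -m ./models/model.gguf -p "Hi" -n 4 2>&1 | grep -i metal
```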
## Vulkan Build (Cross-Platform GPU)
Vulkan works on NVIDIA, AMD, and Intel GPUs across Linux and Windows as a fallback when CUDA/ROCm are unavailable.
Linux prerequisites:

```bash
sudo apt install libvulkan-dev vulkan-tools glslc
```

Windows: Install the LunarG Vulkan SDK from vulkan.lunarg.com.
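Before building, it's worth confirming the Vulkan driver itself works (`vulkaninfo` ships with vulkan-tools and the LunarG SDK):

```bash
# Should list at least one physical GPU
vulkaninfo --summary
```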
Build:
```bash
cmake -B build -DGGML_VULKAN=ON
cmake --build build --config Release -j$(nproc)
```

## ROCm / HIP Build (AMD GPUs)
### Requirements
- ROCm 5.x or 6.x installed
- Supported AMD GPU (see architecture table below)
```bash
CC=/opt/rocm/bin/amdclang CXX=/opt/rocm/bin/amdclang++ \
cmake -B build \
    -DGGML_HIP=ON \
    -DAMDGPU_TARGETS="gfx1100;gfx1030;gfx906"
cmake --build build --config Release -j$(nproc)
```

Common AMD GPU architecture codes:
| Code | GPU Example |
|---|---|
| `gfx1100` | RX 7900 XT/XTX (RDNA 3) |
| `gfx1030` | RX 6800/6900 (RDNA 2) |
| `gfx906` | MI50, Radeon VII (Vega 20) |
| `gfx90a` | MI200 series |
| `gfx940` | MI300 series |
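If you're unsure which gfx code your GPU uses, `rocminfo` (part of ROCm) reports it:

```bash
# The "Name:" field of the GPU agent shows the gfx target, e.g. gfx1100
rocminfo | grep -i "gfx"
```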
## OpenBLAS Build (CPU Linear Algebra)
```bash
sudo apt install libopenblas-dev   # Linux
brew install openblas              # macOS

cmake -B build -DGGML_BLAS=ON -DGGML_BLAS_VENDOR=OpenBLAS
cmake --build build --config Release -j$(nproc)
```

## Windows Build
MSVC (Visual Studio):
```bash
cmake -B build -G "Visual Studio 17 2022" -A x64
cmake --build build --config Release
```

MSVC + Ninja (faster):
```bash
cmake -B build -G Ninja -DCMAKE_BUILD_TYPE=Release
cmake --build build
```

## Complete CMake Flags Reference
| Flag | Default | Description |
|---|---|---|
| `GGML_CUDA` | OFF | Enable CUDA backend |
| `GGML_METAL` | ON (macOS) | Enable Metal backend |
| `GGML_VULKAN` | OFF | Enable Vulkan backend |
| `GGML_HIP` | OFF | Enable ROCm/HIP backend |
| `GGML_BLAS` | OFF | Enable BLAS backend |
| `GGML_BLAS_VENDOR` | Generic | OpenBLAS, MKL, Accelerate |
| `GGML_NATIVE` | OFF | `-march=native` optimization |
| `GGML_LTO` | OFF | Link-time optimization |
| `GGML_STATIC` | OFF | Static linking |
| `LLAMA_BUILD_TESTS` | OFF | Build test binaries |
| `LLAMA_BUILD_EXAMPLES` | ON | Build example binaries |
| `CMAKE_CUDA_ARCHITECTURES` | — | CUDA arch string |
| `AMDGPU_TARGETS` | — | ROCm arch string |
| `CMAKE_BUILD_TYPE` | Release | Release/Debug/RelWithDebInfo |
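Flags combine freely. As an illustrative example (not a recommendation for every setup), a CUDA build with link-time optimization and tests enabled:

```bash
cmake -B build \
    -DGGML_CUDA=ON \
    -DGGML_LTO=ON \
    -DLLAMA_BUILD_TESTS=ON \
    -DCMAKE_BUILD_TYPE=Release
cmake --build build --config Release -j$(nproc)
```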
## Prebuilt Binaries
If you don't need custom compilation flags, prebuilt binaries are available:
| Platform | URL |
|---|---|
| Linux (CPU) | github.com/ggml-org/llama.cpp/releases |
| Linux (CUDA 12) | Same; look for cudart suffix |
| macOS (arm64) | Same; Metal enabled |
| Windows (CPU) | Same; win-x64 suffix |
| Windows (CUDA) | Same; win-cuda12 suffix |
```bash
# Example: Linux CUDA 12 release
wget https://github.com/ggml-org/llama.cpp/releases/download/b4820/llama-b4820-bin-ubuntu-x64.zip
unzip llama-b4820-bin-ubuntu-x64.zip
```

## Python pip Install
For Python-only usage, install via pip (builds from source or uses prebuilt wheels):
```bash
# CPU
pip install llama-cpp-python

# CUDA
CMAKE_ARGS="-DGGML_CUDA=ON" pip install llama-cpp-python

# Metal
CMAKE_ARGS="-DGGML_METAL=ON" pip install llama-cpp-python
```

For full details see the Python Bindings guide.
## Verifying the Build
```bash
# Check binary exists and shows help
./build/bin/llama-cli --help

# Quick model load test (replace with your model path)
./build/bin/llama-cli \
    -m ./models/llama-3.2-3b-instruct-q4_k_m.gguf \
    -p "Hello" \
    -n 10
```

Expected first lines of output:
```
llama_model_loader: loaded meta data with ... key-value pairs
llama_model_loader: - general.architecture = llama
```

## See Also
- CLI Usage — using the built binaries
- GPU Acceleration — choosing `n-gpu-layers` after a CUDA/Metal build
- GGUF & Quantization — converting and quantizing models
- Server — the HTTP API server binary
- Troubleshooting — build errors and dependency issues