How to Compile and Run llama.cpp with Vulkan
If you're like me and looking for a way to compile and run llama.cpp with Vulkan, here's how to do it.
For Windows and especially older GPUs, just follow along. For Linux and ROCm-supported cards, scroll down until you see the Linux header.
Windows
Running llama-bench with Vulkan on RX 480
I ran `llama-bench.exe` in MSYS2 UCRT64 on my Radeon RX 480 GPU using the Vulkan backend. Below are the details of the test:
Benchmark Results for Sapphire Radeon RX 480 Nitro 8G OC
| Model | Size | Params | Backend | NGL | Test | Tokens/sec |
|---|---|---|---|---|---|---|
| llama 7B Q4_0 | 3.56 GiB | 6.74 B | Vulkan | 100 | pp512 | 124.54 ± 0.83 |
| llama 7B Q4_0 | 3.56 GiB | 6.74 B | Vulkan | 100 | tg128 | 15.26 ± 0.13 |
Not your card? Check some more llama.cpp Vulkan benchmarks
Now let's get into the build process:
Real-Life-Tested Build Instructions for Commit 4ccea21
The build instructions for MSYS2 from the official llama.cpp documentation did not work under Windows.
Here is a compact "From Zero to Llama-Vulkan" guide with exactly the steps that actually worked in the end.
All commands in the following Windows guide are supposed to be run in the MSYS2 UCRT64 terminal!
1) Install and Update Prerequisites
1.1 Install MSYS2 (UCRT64)
- Start MSYS2 UCRT64 with admin privileges and perform a full system update:
  pacman -Syu
- Install the required dependencies:
  pacman -S git \
    mingw-w64-ucrt-x86_64-gcc \
    mingw-w64-ucrt-x86_64-cmake \
    mingw-w64-ucrt-x86_64-vulkan-devel \
    mingw-w64-ucrt-x86_64-shaderc
- Restart MSYS2 with admin privileges and repeat the update if necessary.
Check if MSYS2 can locate glslc:
```msys2
which glslc
```
If no path is returned, add the Vulkan SDK path to the environment variables by completing steps 1.2, 2), and 3).
If a path is returned, proceed with step 4) or 5).
1.2 Install Vulkan SDK from LunarG
- Download the Vulkan SDK (e.g., version 1.4.304.1).
- Keep track of the version you downloaded; if it differs from 1.4.304.1, replace "1.4.304.1" with your version in the commands below!
- Install it and optionally restart your PC.
- Verify in MSYS2 that `glslc` is found (steps 2) and 3)).
2) Determine the Vulkan SDK Path
- Find the Vulkan SDK installation directory in Windows Explorer. The typical path is:
  C:\VulkanSDK\1.4.304.1\Bin\glslc.exe
- Verify that the file exists in MSYS2:
  test -f C:/VulkanSDK/1.4.304.1/Bin/glslc.exe && echo true || echo false
  If `true` is returned, the file exists at the expected location.
3) Set Up MSYS2 Environment Variables
Once you confirm where the SDK is installed (e.g., C:\VulkanSDK\1.4.304.1), configure MSYS2:
export VULKAN_SDK=/c/VulkanSDK/1.4.304.1
export PATH=$VULKAN_SDK/Bin:$PATH
Important Notes:
- As soon as you close the current UCRT64 terminal, those exports are gone! For recompilation, redo step 3) (or persist them as shown in the snippet after the example output below).
- MSYS2 uses the converted path format `/c/...` instead of `C:\...`.
- Verify the setup by running:
  which glslc
  glslc --version
Example Output:
$ which glslc
/c/VulkanSDK/1.4.304.1/Bin/glslc
$ test -f C:/VulkanSDK/1.4.304.1/Bin/glslc.exe && echo true || echo false
true
Now, glslc should be recognized by MSYS2.
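Since those exports vanish when the terminal closes, one option is to persist them in your MSYS2 shell profile. This is a minimal sketch, assuming the default `~/.bashrc` and SDK version 1.4.304.1:

```bash
# append the Vulkan SDK exports to ~/.bashrc so every new UCRT64 terminal picks them up
echo 'export VULKAN_SDK=/c/VulkanSDK/1.4.304.1' >> ~/.bashrc
echo 'export PATH=$VULKAN_SDK/Bin:$PATH' >> ~/.bashrc
# reload the profile in the current terminal
source ~/.bashrc
```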
4) (Optional) Update AMD Graphics Driver (for Vulkan 1.4)
- Download the AMD driver (e.g., for the RX 480).
- Install the latest "Adrenalin Software" and restart Windows (keep in mind you might need to redo step 2) and at least the exports from step 3)!).
- Verify Vulkan support in MSYS2:
  vulkaninfo | grep "Vulkan API"
  If Vulkan 1.3 is still listed, try updating the driver or disabling overlays.
5) Clone or Update llama.cpp
git clone https://github.com/ggerganov/llama.cpp
cd llama.cpp
From now on, the project directory is referred to as llama.cpp.
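The heading says "or Update": if you already have a clone from an earlier attempt, bringing it up to the latest commit (assuming no local changes) is plain git:

```bash
cd llama.cpp   # the project directory
git pull       # fetch and merge the latest upstream changes
```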
6) Build with Vulkan Support
Important: Define _WIN32_WINNT
Relevant for working around a bug in commit 0b3863f (and probably other MSYS2 builds):
To ensure PrefetchVirtualMemory is available and avoid "unavailable" errors, define the Windows API level during compilation:
-DCMAKE_CXX_FLAGS=-D_WIN32_WINNT=0x0602
If you are adventurous, just build without this flag and share your experience for the respective commit's bug.
Check the CMake Installation
You can verify CMake is correctly installed by running:
which cmake
cmake --version
Expected output should look like:
cmake version 3.X.X
Build Steps
- Clear the build folder and compile:
  cd path/to/llama.cpp
  rm -rf build
  cmake -B build -DGGML_VULKAN=ON -DCMAKE_C_FLAGS="-fopenmp" -DCMAKE_CXX_FLAGS="-fopenmp -D_WIN32_WINNT=0x0602"
  cmake --build build --config Release
- Copy the model file into `build/bin` (or use an absolute path):
  cp /path/to/gguf-model/llama-2-7b.Q4_0.gguf build/bin/
- Change to `build/bin` and run:
  cd build/bin
  ./llama-bench.exe -m ./llama-2-7b.Q4_0.gguf -ngl 50 --verbose
  - `-ngl 50` reduces VRAM usage (compared to `-ngl 100`), useful for GPUs with limited memory.
  - If you have enough memory, try `-ngl 100`.
If the output shows that tensors are successfully offloaded to Vulkan and the model loads, the build is successful!
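Beyond benchmarking, you can run a quick interactive sanity check with `llama-cli`, which is built into `build/bin` alongside `llama-bench`. The prompt and token count below are only illustrative:

```bash
# generate up to 128 tokens from a short prompt, offloading 50 layers as above
./llama-cli.exe -m ./llama-2-7b.Q4_0.gguf -ngl 50 -p "Write a haiku about Vulkan." -n 128
```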
7) Summary of Working Steps
- Install the latest Vulkan SDK + AMD drivers to ensure Vulkan 1.4 support.
- Set up environment variables to ensure MSYS2 can locate `glslc`:
  export VULKAN_SDK=/c/VulkanSDK/1.4.304.1
  export PATH=$VULKAN_SDK/Bin:$PATH
- Clear previous builds:
  rm -rf build
- Use CMake with Vulkan and the `_WIN32_WINNT` definition:
  cmake -B build -DGGML_VULKAN=ON -DCMAKE_C_FLAGS="-fopenmp" -DCMAKE_CXX_FLAGS="-fopenmp -D_WIN32_WINNT=0x0602"
  cmake --build build --config Release
- Copy the model (e.g., llama-2-7b.Q4_0.gguf) to `build/bin` and run:
  cd build/bin
  cp /path/to/model/llama-2-7b.Q4_0.gguf .
  ./llama-bench.exe -m ./llama-2-7b.Q4_0.gguf -ngl 50 --verbose
With these steps, the build should work successfully on Windows!
8) Benchmarking
System & Command
Maddin@DESKTOP-5ND4LK3 UCRT64 /h/llama_cpp_vulkan_2/llama.cpp/build/bin
$ ./llama-bench.exe -m ./llama-2-7b.Q4_0.gguf -ngl 100
Vulkan Device Info
ggml_vulkan: Found 1 Vulkan device:
ggml_vulkan: 0 = Radeon (TM) RX 480 Graphics (AMD proprietary driver) | uma: 0 | fp16: 0 | warp size: 64 | shared memory: 32768 | matrix cores: none
Benchmark Results
| Model | Size | Params | Backend | NGL | Test | Tokens/sec |
|---|---|---|---|---|---|---|
| llama 7B Q4_0 | 3.56 GiB | 6.74 B | Vulkan | 100 | pp512 | 124.54 ± 0.83 |
| llama 7B Q4_0 | 3.56 GiB | 6.74 B | Vulkan | 100 | tg128 | 15.26 ± 0.13 |
Build Info
build: cc473cac (4800)
These results show that using Vulkan on the RX 480 provides reasonable performance, though without support for FP16 or matrix cores, performance may not be optimal compared to newer GPUs.
Don't forget to post your benchmarks here!
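If you want to compare several offload settings in one go, `llama-bench` accepts comma-separated values for most parameters; confirm with `./llama-bench.exe --help` on your build:

```bash
# benchmark the same model at three different -ngl values in a single run
./llama-bench.exe -m ./llama-2-7b.Q4_0.gguf -ngl 0,50,100
```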
9) Running the llama.cpp Server
To use llama.cpp as a local inference server, run the following command in the MSYS2 terminal:
./llama-server.exe -m ./llama-2-7b.Q4_0.gguf -ngl 50 -c 4096 --port 8000 --verbose
9.1 Accessing the Server
Once the server is running, navigate to:
http://127.0.0.1:8000
You can now start chatting with the model via API requests or by using a compatible client.
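For example, the server exposes an OpenAI-compatible chat endpoint that you can query with `curl`; a minimal sketch, assuming the port 8000 used above:

```bash
curl http://127.0.0.1:8000/v1/chat/completions \
  -H "Content-Type: application/json" \
  -d '{
        "messages": [
          {"role": "system", "content": "You are a helpful assistant."},
          {"role": "user", "content": "Hello, who are you?"}
        ]
      }'
```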
9.2 Troubleshooting
If you encounter errors, check the terminal logs for hints on misconfigured parameters. You may need to adjust:
- Context size (`-c`) based on your VRAM availability
- Layer offloading (`-ngl`) to optimize GPU memory usage
- Model selection (`-m`) to ensure compatibility with your setup
Consult your loyal LLM with the attached logs for further assistance if needed!
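To have those logs at hand, you can capture the server output to a file while it runs; this is plain shell redirection, nothing llama.cpp-specific:

```bash
# write all server output to server.log while still showing it in the terminal
./llama-server.exe -m ./llama-2-7b.Q4_0.gguf -ngl 50 -c 4096 --port 8000 --verbose 2>&1 | tee server.log
```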
For advanced customization, explore additional parameters like temperature, system prompts, and batch sizes to fine-tune responses.
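As a starting point, sampling and batching can be set on the command line; flag names occasionally change between llama.cpp versions, so treat this as a sketch and check `./llama-server.exe --help`:

```bash
# --temp sets the sampling temperature, -b the logical batch size
./llama-server.exe -m ./llama-2-7b.Q4_0.gguf -ngl 50 -c 4096 --port 8000 --temp 0.7 -b 512
```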
Enjoy experimenting with llama.cpp as your local and private AI assistant!
Linux
Caution: This Dockerized version may not work on WSL2 if your AMD GPU lacks ROCm support (e.g., RX 480/580). WSL2 does not natively support AMD GPUs for compute workloads without ROCm, and AMD's WSL2 ROCm support is limited to newer GPUs. If your GPU is unsupported, Vulkan-based GPU acceleration won't work in WSL2. Check AMD's ROCm compatibility list before proceeding. For non-ROCm GPUs (e.g., RX 480/580), head back to the Windows guide at the top, or use a Linux host with proper AMD GPU drivers.
Check this out if you want the OG-Guide: https://github.com/ollama/ollama/pull/5059#issuecomment-2502095958
Otherwise, continue with this guide:
Steps to Follow
1. Clone the Repository
git clone https://github.com/ggerganov/llama.cpp
cd llama.cpp
2. Modify the Dockerfile
Open .devops/llama-server-vulkan.Dockerfile and comment out the following lines:
# WORKDIR /
# RUN cp /app/build/bin/llama-server /llama-server && \
# rm -rf /app
Next, replace the ENTRYPOINT line at the end of the file with the following and save:
ENTRYPOINT [ "/app/build/bin/llama-server" ]
3. Build the Docker Image
Run the following command to compile the project:
docker build -t llama-cpp-vulkan -f .devops/llama-server-vulkan.Dockerfile .
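Before writing the compose file, you can sanity-check the image with a one-off `docker run`; this sketch assumes the same render device and model path used in the compose file below:

```bash
docker run --rm -it \
  --device /dev/dri/renderD128 \
  -v "$(pwd)/models:/app/models" \
  -p 8080:8080 \
  llama-cpp-vulkan \
  -m /app/models/Qwen2.5-Coder-7B-Instruct-Q4_K_M.gguf -ngl 100 --host 0.0.0.0 --port 8080
```

Everything after the image name is passed as arguments to the `llama-server` entrypoint set in step 2.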
4. Create a docker-compose.yml File
Create a docker-compose.yml file with the following content:
services:
llamacpp-server:
image: llama-cpp-vulkan
container_name: llamacpp-server
environment:
# Alternatively, use "LLAMA_ARG_MODEL_URL" to download the model
LLAMA_ARG_MODEL: /app/models/Qwen2.5-Coder-7B-Instruct-Q4_K_M.gguf
LLAMA_ARG_CTX_SIZE: 8192 # Context size
LLAMA_ARG_N_GPU_LAYERS: 100 # More layers = more performance (utilizes more GPU VRAM)
LLAMA_ARG_N_PARALLEL: 6
LLAMA_ARG_PORT: 8080
devices:
- /dev/dri/renderD128:/dev/dri/renderD128
- /dev/dri/card1:/dev/dri/card1
volumes:
- ./models:/app/models
ports:
- 8080:8080
restart: unless-stopped
5. Download a Model
Download and place the model in a folder named models. For this example, we'll use the Qwen 2.5 Coder 7B in Q4_K_M format. Here's the download link:
Qwen2.5-Coder-7B-Instruct-Q4_K_M.gguf
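A minimal command-line sketch for fetching it; the URL is a placeholder, use the actual download link above:

```bash
mkdir -p models
# replace <download-link> with the URL behind the link above
wget -O models/Qwen2.5-Coder-7B-Instruct-Q4_K_M.gguf "<download-link>"
```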
6. Run the LLM
Run the following command to start the LLM:
docker compose up -d
Wait a few seconds, and then access your LLM at:
http://<IP_OF_YOUR_SERVER>:8080
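If the page does not come up, the container logs usually tell you why (wrong model path, not enough VRAM, missing device permissions); checking them is standard Docker Compose usage:

```bash
# follow the logs of the llamacpp-server service defined in docker-compose.yml
docker compose logs -f llamacpp-server
```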
Notes for LLM Beginners
- To achieve the full performance of your GPU, ensure the model fits entirely into your VRAM.
- This setup does not require any ROCm or CUDA installation. 🎉
Enjoy chatting with your LLM!
Example Performance: AMD 6600 XT with Vulkan
Here's my performance using llama.cpp and Vulkan with the prompt:
Code me a webserver in python3 using flask latest recommendation
Results:
prompt eval time = 271.08 ms / 32 tokens ( 8.47 ms per token, 118.05 tokens per second)
eval time = 19773.05 ms / 849 tokens ( 23.29 ms per token, 42.94 tokens per second)
total time = 20044.13 ms / 881 tokens
Notes:
- With my 6600 XT, the current ceiling for token generation via Vulkan is approximately 50 tokens per second, measured on a 0.5B Qwen model.