How to Compile and Run llama.cpp with Vulkan
If you're like me and looking for a way to compile and run llama.cpp with Vulkan, here's how to do it.
For Windows and especially older GPUs, just follow along. For Linux and ROCm-supported cards, scroll down until you see the Linux header.
Windows
Running llama-bench with Vulkan on RX 480
I ran `llama-bench.exe` in MSYS2 UCRT64 on my Radeon RX 480 GPU using the Vulkan backend. Below are the details of the test:
Benchmark Results for Sapphire Radeon RX 480 Nitro 8G OC
| Model | Size | Params | Backend | NGL | Test | Tokens/sec |
|---|---|---|---|---|---|---|
| llama 7B Q4_0 | 3.56 GiB | 6.74 B | Vulkan | 100 | pp512 | 124.54 ± 0.83 |
| llama 7B Q4_0 | 3.56 GiB | 6.74 B | Vulkan | 100 | tg128 | 15.26 ± 0.13 |
Not your card? Check some more llama.cpp Vulkan benchmarks
Now let's get into the build process:
Real-Life-Tested Build Instructions for Commit 4ccea21
The build instructions for MSYS2 from the official llama.cpp documentation did not work under Windows.
Here is a compact "From Zero to Llama-Vulkan" guide with exactly the steps that actually worked in the end.
All commands in the following Windows guide are supposed to be run in the MSYS2 UCRT64 terminal!
1) Install and Update Prerequisites
1.1 Install MSYS2 (UCRT64)
- Start MSYS2 UCRT64 with admin privileges and perform a full system update:
  pacman -Syu
- Install the required dependencies:
  pacman -S git \
    mingw-w64-ucrt-x86_64-gcc \
    mingw-w64-ucrt-x86_64-cmake \
    mingw-w64-ucrt-x86_64-vulkan-devel \
    mingw-w64-ucrt-x86_64-shaderc
- Restart MSYS2 with admin privileges and repeat the update if necessary.
Check if MSYS2 can locate glslc:
```msys2
which glslc
```
If no path is returned, add the Vulkan SDK path to the environment variables by completing steps 1.2, 2), and 3).
If a path is returned, proceed with step 4) or 5).
1.2 Install Vulkan SDK from LunarG
- Download the Vulkan SDK (e.g., version 1.4.304.1).
- Keep track of the version you downloaded; if it differs from 1.4.304.1, replace "1.4.304.1" with your version in the commands below!
- Install it and optionally restart your PC.
- Verify in MSYS2 that `glslc` is found (steps 2) and 3)).
2) Determine the Vulkan SDK Path
- Find the Vulkan SDK installation directory in Windows Explorer. The typical path is:
  C:\VulkanSDK\1.4.304.1\Bin\glslc.exe
- Verify that the file exists in MSYS2:
  test -f C:/VulkanSDK/1.4.304.1/Bin/glslc.exe && echo true || echo false
  If `true` is returned, the file exists at the expected location.
3) Set Up MSYS2 Environment Variables
Once you confirm where the SDK is installed (e.g., C:\VulkanSDK\1.4.304.1), configure MSYS2:
export VULKAN_SDK=/c/VulkanSDK/1.4.304.1
export PATH=$VULKAN_SDK/Bin:$PATH
Important Notes:
- As soon as you close the current UCRT64 terminal, those exports are gone! For recompilation, redo step 3) (or persist them as shown in the snippet after the example output below).
- MSYS2 uses the converted path format `/c/...` instead of `C:\...`.
- Verify the setup by running:
  which glslc
  glslc --version
Example Output:
$ which glslc
/c/VulkanSDK/1.4.304.1/Bin/glslc
$ test -f C:/VulkanSDK/1.4.304.1/Bin/glslc.exe && echo true || echo false
true
Now, glslc should be recognized by MSYS2.
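Since those exports vanish when the terminal closes, one option is to persist them in your MSYS2 shell profile. This is a minimal sketch, assuming the default `~/.bashrc` and SDK version 1.4.304.1:

```bash
# append the Vulkan SDK exports to ~/.bashrc so every new UCRT64 terminal picks them up
echo 'export VULKAN_SDK=/c/VulkanSDK/1.4.304.1' >> ~/.bashrc
echo 'export PATH=$VULKAN_SDK/Bin:$PATH' >> ~/.bashrc
# reload the profile in the current terminal
source ~/.bashrc
```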
4) (Optional) Update AMD Graphics Driver (for Vulkan 1.4)
- Download the AMD driver (e.g., for the RX 480).
- Install the latest "Adrenalin Software" and restart Windows (keep in mind you might need to redo step 2) and at least the exports from step 3)!).
- Verify Vulkan support in MSYS2:
  vulkaninfo | grep "Vulkan API"
  If Vulkan 1.3 is still listed, try updating the driver or disabling overlays.
5) Clone or Update llama.cpp
git clone https://github.com/ggerganov/llama.cpp
cd llama.cpp
From now on, the project directory is referred to as llama.cpp.
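The heading says "or Update": if you already have a clone from an earlier attempt, bringing it up to the latest commit (assuming no local changes) is plain git:

```bash
cd llama.cpp   # the project directory
git pull       # fetch and merge the latest upstream changes
```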
6) Build with Vulkan Support
Important: Define _WIN32_WINNT
Relevant for working around a bug in commit 0b3863f (and probably other MSYS2 builds):
To ensure PrefetchVirtualMemory is available and avoid "unavailable" errors, define the Windows API level during compilation:
-DCMAKE_CXX_FLAGS=-D_WIN32_WINNT=0x0602
If you are adventurous, just build without this flag and share your experience for the respective commit's bug.
Check the CMake Installation
You can verify CMake is correctly installed by running:
which cmake
cmake --version
Expected output should look like:
cmake version 3.X.X
Build Steps
- Clear the build folder and compile:
  cd path/to/llama.cpp
  rm -rf build
  cmake -B build -DGGML_VULKAN=ON -DCMAKE_C_FLAGS="-fopenmp" -DCMAKE_CXX_FLAGS="-fopenmp -D_WIN32_WINNT=0x0602"
  cmake --build build --config Release
- Copy the model file into `build/bin` (or use an absolute path):
  cp /path/to/gguf-model/llama-2-7b.Q4_0.gguf build/bin/
- Change to `build/bin` and run:
  cd build/bin
  ./llama-bench.exe -m ./llama-2-7b.Q4_0.gguf -ngl 50 --verbose
  - `-ngl 50` reduces VRAM usage (compared to `-ngl 100`), useful for GPUs with limited memory.
  - If you have enough memory, try `-ngl 100`.
If the output shows that tensors are successfully offloaded to Vulkan and the model loads, the build is successful!
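Beyond benchmarking, you can run a quick interactive sanity check with `llama-cli`, which is built into `build/bin` alongside `llama-bench`. The prompt and token count below are only illustrative:

```bash
# generate up to 128 tokens from a short prompt, offloading 50 layers as above
./llama-cli.exe -m ./llama-2-7b.Q4_0.gguf -ngl 50 -p "Write a haiku about Vulkan." -n 128
```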
7) Summary of Working Steps
- Install the latest Vulkan SDK + AMD drivers to ensure Vulkan 1.4 support.
- Set up environment variables to ensure MSYS2 can locate `glslc`:
  export VULKAN_SDK=/c/VulkanSDK/1.4.304.1
  export PATH=$VULKAN_SDK/Bin:$PATH
- Clear previous builds:
  rm -rf build
- Use CMake with Vulkan and the `_WIN32_WINNT` definition:
  cmake -B build -DGGML_VULKAN=ON -DCMAKE_C_FLAGS="-fopenmp" -DCMAKE_CXX_FLAGS="-fopenmp -D_WIN32_WINNT=0x0602"
  cmake --build build --config Release
- Copy the model (e.g., llama-2-7b.Q4_0.gguf) to `build/bin` and run:
  cd build/bin
  cp /path/to/model/llama-2-7b.Q4_0.gguf .
  ./llama-bench.exe -m ./llama-2-7b.Q4_0.gguf -ngl 50 --verbose
With these steps, the build should work successfully on Windows!
8) Benchmarking
System & Command
Maddin@DESKTOP-5ND4LK3 UCRT64 /h/llama_cpp_vulkan_2/llama.cpp/build/bin
$ ./llama-bench.exe -m ./llama-2-7b.Q4_0.gguf -ngl 100
Vulkan Device Info
ggml_vulkan: Found 1 Vulkan device:
ggml_vulkan: 0 = Radeon (TM) RX 480 Graphics (AMD proprietary driver) | uma: 0 | fp16: 0 | warp size: 64 | shared memory: 32768 | matrix cores: none
Benchmark Results
| Model | Size | Params | Backend | NGL | Test | Tokens/sec |
|---|---|---|---|---|---|---|
| llama 7B Q4_0 | 3.56 GiB | 6.74 B | Vulkan | 100 | pp512 | 124.54 ± 0.83 |
| llama 7B Q4_0 | 3.56 GiB | 6.74 B | Vulkan | 100 | tg128 | 15.26 ± 0.13 |
Build Info
build: cc473cac (4800)
These results show that using Vulkan on the RX 480 provides reasonable performance, though without support for FP16 or matrix cores, performance may not be optimal compared to newer GPUs.
Don't forget to post your benchmarks here!
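If you want to compare several offload settings in one go, `llama-bench` accepts comma-separated values for most parameters; confirm with `./llama-bench.exe --help` on your build:

```bash
# benchmark the same model at three different -ngl values in a single run
./llama-bench.exe -m ./llama-2-7b.Q4_0.gguf -ngl 0,50,100
```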
9) Running the llama.cpp Server
To use llama.cpp as a local inference server, run the following command in the MSYS2 terminal:
./llama-server.exe -m ./llama-2-7b.Q4_0.gguf -ngl 50 -c 4096 --port 8000 --verbose
9.1 Accessing the Server
Once the server is running, navigate to:
http://127.0.0.1:8000
You can now start chatting with the model via API requests or by using a compatible client.
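For example, the server exposes an OpenAI-compatible chat endpoint that you can query with `curl`; a minimal sketch, assuming the port 8000 used above:

```bash
curl http://127.0.0.1:8000/v1/chat/completions \
  -H "Content-Type: application/json" \
  -d '{
        "messages": [
          {"role": "system", "content": "You are a helpful assistant."},
          {"role": "user", "content": "Hello, who are you?"}
        ]
      }'
```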
9.2 Troubleshooting
If you encounter errors, check the terminal logs for hints on misconfigured parameters. You may need to adjust:
- Context size (`-c`) based on your VRAM availability
- Layer offloading (`-ngl`) to optimize GPU memory usage
- Model selection (`-m`) to ensure compatibility with your setup
Consult your loyal LLM with the attached logs for further assistance if needed!
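To have those logs at hand, you can capture the server output to a file while it runs; this is plain shell redirection, nothing llama.cpp-specific:

```bash
# write all server output to server.log while still showing it in the terminal
./llama-server.exe -m ./llama-2-7b.Q4_0.gguf -ngl 50 -c 4096 --port 8000 --verbose 2>&1 | tee server.log
```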
For advanced customization, explore additional parameters like temperature, system prompts, and batch sizes to fine-tune responses.
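As a starting point, sampling and batching can be set on the command line; flag names occasionally change between llama.cpp versions, so treat this as a sketch and check `./llama-server.exe --help`:

```bash
# --temp sets the sampling temperature, -b the logical batch size
./llama-server.exe -m ./llama-2-7b.Q4_0.gguf -ngl 50 -c 4096 --port 8000 --temp 0.7 -b 512
```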
Enjoy experimenting with llama.cpp as your local and private AI assistant!
Linux
Caution: This Dockerized version may not work on WSL2 if your AMD GPU lacks ROCm support (e.g., RX 480/580). WSL2 does not natively support AMD GPUs for compute workloads without ROCm, and AMD's WSL2 ROCm support is limited to newer GPUs. If your GPU is unsupported, Vulkan-based GPU acceleration won't work in WSL2. Check AMD's ROCm compatibility list before proceeding. For non-ROCm GPUs (e.g., RX 480/580), head back to the Windows guide at the top, or use a Linux host with proper AMD GPU drivers.
Check this out if you want the OG-Guide: https://github.com/ollama/ollama/pull/5059#issuecomment-2502095958
Otherwise, continue with this guide:
Steps to Follow
1. Clone the Repository
git clone https://github.com/ggerganov/llama.cpp
cd llama.cpp
2. Modify the Dockerfile
Open .devops/llama-server-vulkan.Dockerfile and comment out the following lines:
# WORKDIR /
# RUN cp /app/build/bin/llama-server /llama-server && \
# rm -rf /app
Next, replace the ENTRYPOINT line at the end of the file with the following and save:
ENTRYPOINT [ "/app/build/bin/llama-server" ]
3. Build the Docker Image
Run the following command to compile the project:
docker build -t llama-cpp-vulkan -f .devops/llama-server-vulkan.Dockerfile .
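Before writing the compose file, you can sanity-check the image with a one-off `docker run`; this sketch assumes the same render device and model path used in the compose file below:

```bash
docker run --rm -it \
  --device /dev/dri/renderD128 \
  -v "$(pwd)/models:/app/models" \
  -p 8080:8080 \
  llama-cpp-vulkan \
  -m /app/models/Qwen2.5-Coder-7B-Instruct-Q4_K_M.gguf -ngl 100 --host 0.0.0.0 --port 8080
```

Everything after the image name is passed as arguments to the `llama-server` entrypoint set in step 2.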
4. Create a docker-compose.yml File
Create a docker-compose.yml file with the following content:
services:
llamacpp-server:
image: llama-cpp-vulkan
container_name: llamacpp-server
environment:
# Alternatively, use "LLAMA_ARG_MODEL_URL" to download the model
LLAMA_ARG_MODEL: /app/models/Qwen2.5-Coder-7B-Instruct-Q4_K_M.gguf
LLAMA_ARG_CTX_SIZE: 8192 # Context size
LLAMA_ARG_N_GPU_LAYERS: 100 # More layers = more performance (utilizes more GPU VRAM)
LLAMA_ARG_N_PARALLEL: 6
LLAMA_ARG_PORT: 8080
devices:
- /dev/dri/renderD128:/dev/dri/renderD128
- /dev/dri/card1:/dev/dri/card1
volumes:
- ./models:/app/models
ports:
- 8080:8080
restart: unless-stopped
5. Download a Model
Download and place the model in a folder named models. For this example, we'll use the Qwen 2.5 Coder 7B in Q4_K_M format. Here's the download link:
Qwen2.5-Coder-7B-Instruct-Q4_K_M.gguf
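A minimal command-line sketch for fetching it; the URL is a placeholder, use the actual download link above:

```bash
mkdir -p models
# replace <download-link> with the URL behind the link above
wget -O models/Qwen2.5-Coder-7B-Instruct-Q4_K_M.gguf "<download-link>"
```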
6. Run the LLM
Run the following command to start the LLM:
docker compose up -d
Wait a few seconds, and then access your LLM at:
http://<IP_OF_YOUR_SERVER>:8080
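If the page does not come up, the container logs usually tell you why (wrong model path, not enough VRAM, missing device permissions); checking them is standard Docker Compose usage:

```bash
# follow the logs of the llamacpp-server service defined in docker-compose.yml
docker compose logs -f llamacpp-server
```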
Notes for LLM Beginners
- To achieve the full performance of your GPU, ensure the model fits entirely into your VRAM.
- This setup does not require any ROCm or CUDA installation. 🎉
Enjoy chatting with your LLM!
Example Performance: AMD 6600 XT with Vulkan
Here's my performance using llama.cpp and Vulkan with the prompt:
Code me a webserver in python3 using flask latest recommendation
Results:
prompt eval time = 271.08 ms / 32 tokens ( 8.47 ms per token, 118.05 tokens per second)
eval time = 19773.05 ms / 849 tokens ( 23.29 ms per token, 42.94 tokens per second)
total time = 20044.13 ms / 881 tokens
Notes:
- With my 6600 XT, the current ceiling for token generation via Vulkan is approximately 50 tokens per second, measured on a 0.5B Qwen model.