# VulkanIlm: GPU-Accelerated Local LLMs for Everyone (Vulkan + "Ilm", knowledge)

VulkanIlm is a Python-first wrapper and CLI around `llama.cpp`'s Vulkan backend that brings fast local LLM inference to AMD, Intel, and NVIDIA GPUs, no CUDA required. Built for developers with legacy or non-NVIDIA hardware.
- What: a Python library + CLI to run LLMs locally using Vulkan GPU acceleration.
- Why: most acceleration tooling targets CUDA/NVIDIA; VulkanIlm opens up fast local inference to AMD and Intel users.
- Quick result: small models run up to ~33× faster on iGPUs (see benchmarks below); mid-range and large legacy GPUs get ~4–6× speedups vs CPU.
## Features

- 🚀 Significant speedups vs CPU on legacy GPUs and iGPUs
- 🎮 Broad GPU support: AMD, Intel, NVIDIA (via Vulkan)
- 🐍 Python-first API + easy CLI tools
- ⚡ Auto-detection + GPU-specific optimizations
- 📦 Automatic build/install of the `llama.cpp` Vulkan backend
- 🔄 Real-time streaming token generation
- ✅ Reproducible benchmark scripts in `benchmarks/`
## Benchmarks

Benchmarks were measured with Gemma-3n-E4B-it (6.9B) unless noted. Results depend on model quantization, GPU drivers, OS, and system load.
| Hardware (OS) | Model | CPU time | Vulkan (GPU) time | Speedup |
|---|---|---|---|---|
| Dell E7250, i7-5600U iGPU (Fedora 42 Workstation) | TinyLLaMA-1.1B-Chat (Q4_K_M) | 121 s | 3 s | 33× |
| AMD RX 580 8 GB (Ubuntu 22.04.5 LTS) | Gemma-3n-E4B-it (6.9B) | 188.47 s | 44.74 s | 4.21× |
| Intel Arc A770 | Gemma-3n-E4B-it (6.9B) | ~120 s | ~25 s | ~4.8× |
| AMD RX 6600 | Gemma-3n-E4B-it (6.9B) | ~90 s | ~18 s | ~5.0× |
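The speedup column is simply CPU wall time divided by GPU wall time; for example, for the RX 580 row:

```python
# Speedup = CPU wall time / GPU wall time, using the RX 580 row above.
cpu_s, gpu_s = 188.47, 44.74
print(f"speedup: {cpu_s / gpu_s:.2f}x")  # -> speedup: 4.21x
```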
### iGPU notes

- The Dell E7250 result shows that older integrated GPUs can be very effective for smaller LLMs when using Vulkan.
- Smaller models and appropriate quantizations are more iGPU-friendly. Driver and version differences significantly affect results.
### Other tested (functional) models

- DeepSeek-R1-Distill-Qwen-1.5B-unsloth-bnb-4bit: runs (not benchmarked).
- LLaMA 3.1 8B: runs (not benchmarked).
### Why Vulkan instead of ROCm on legacy AMD

- ROCm does not officially support `gfx803` (RX 580).
- Some community members try ROCm 5/6 workarounds on the RX 580, but they are unstable and unsupported.
- VulkanIlm offers a Vulkan-based path that avoids ROCm entirely on legacy AMD cards.
## Quick start

```bash
git clone https://github.com/Talnz007/VulkanIlm.git
cd VulkanIlm
pip install -e .
```
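A quick way to confirm the editable install worked; a minimal sketch that only assumes the documented `vulkan_ilm` package and the `vulkanilm` CLI entry point:

```python
# Sanity check after `pip install -e .`: the package imports and the
# `vulkanilm` console script is on PATH.
import shutil

import vulkan_ilm  # documented package name

print("package:", vulkan_ilm.__name__)
print("CLI on PATH:", shutil.which("vulkanilm") is not None)
```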
## Prerequisites

- Python 3.9+ (see the quick check below)
- Vulkan-capable GPU (AMD RX 400+, Intel Arc/Xe, NVIDIA GTX 900+)
- Vulkan drivers installed and working
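A one-liner to verify the Python requirement programmatically:

```python
# Fails fast if the interpreter is older than the required 3.9.
import sys

assert sys.version_info >= (3, 9), "VulkanIlm requires Python 3.9+"
```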
### Install Vulkan tools (if needed)

Ubuntu / Debian:

```bash
sudo apt update
sudo apt install vulkan-tools libvulkan-dev
```

Fedora / RHEL:

```bash
sudo dnf install vulkan-tools vulkan-devel
```

Verify:

```bash
vulkaninfo
```
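If you want the same check from a script, something like this works. This is a sketch: it assumes `vulkaninfo` from vulkan-tools is on PATH, and that the `--summary` flag of recent releases is available.

```python
# Scripted version of the `vulkaninfo` check above. Only the exit code is
# inspected, since the textual output format varies between versions.
import subprocess

def vulkan_available() -> bool:
    try:
        result = subprocess.run(
            ["vulkaninfo", "--summary"],
            capture_output=True, text=True, timeout=30,
        )
        return result.returncode == 0
    except (FileNotFoundError, subprocess.TimeoutExpired):
        return False

print("Vulkan detected" if vulkan_available() else "Vulkan not detected")
```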
## CLI usage

```bash
# Auto-install llama.cpp with Vulkan support
vulkanilm install

# Check your GPU setup
vulkanilm vulkan-info

# Search and download models (if supported)
vulkanilm search "llama"
vulkanilm download microsoft/DialoGPT-medium

# Generate text
vulkanilm ask path/to/model.gguf --prompt "Explain quantum computing"

# Stream tokens in real-time
vulkanilm stream path/to/model.gguf "Tell me a story about AI"

# Run a benchmark
vulkanilm benchmark path/to/model.gguf --prompt "Benchmark prompt" --repeat 3
```
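For CI or scripted smoke tests you can drive the same CLI from Python. A sketch: only the documented `vulkanilm vulkan-info` subcommand is assumed, and only its exit code is checked, since the output format is not a contract.

```python
# Smoke test: run `vulkanilm vulkan-info` and report whether GPU detection
# succeeded. Don't parse the output; it may change between versions.
import subprocess

result = subprocess.run(
    ["vulkanilm", "vulkan-info"],
    capture_output=True, text=True,
)
print("GPU detection:", "ok" if result.returncode == 0 else "failed")
print(result.stdout)
```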
## Python API

```python
from vulkan_ilm import Llama

# Load model (auto GPU optimization)
llm = Llama("path/to/model.gguf", gpu_layers=16)

# Synchronous generation
response = llm.ask("Explain the term 'ilm' in an AI context.")
print(response)

# Streaming generation
for token in llm.stream_ask_real("Tell me about the Vulkan API"):
    print(token, end='', flush=True)
```
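Building on the documented API, a small sketch of a loader that tries GPU offload first and falls back to CPU-only inference. It assumes `Llama(...)` raises an exception when Vulkan initialization fails, which may vary by version:

```python
from vulkan_ilm import Llama

def load_model(path: str, gpu_layers: int = 16) -> Llama:
    """Try GPU offload first; fall back to CPU-only if init fails."""
    try:
        return Llama(path, gpu_layers=gpu_layers)
    except Exception:
        # gpu_layers=0 keeps everything on the CPU, so the app still works
        # on machines without a usable Vulkan driver.
        return Llama(path, gpu_layers=0)

llm = load_model("path/to/model.gguf")
print(llm.ask("Summarize the Vulkan API in one sentence."))
```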
## Reproducing benchmarks

- Use the exact model file and quantization referenced in `benchmarks/` (GGUF + quantization).
- Use the benchmark script in `benchmarks/run_benchmark.sh`.
- Record the driver version, OS version, CPU frequency governor, and system load.
- Run benchmarks multiple times (cold and warm cache) and average the results, as in the sketch below.
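A minimal timing harness following that guidance: discard the first (cold) run, then average several warm runs. It uses only the documented `Llama.ask` call; the model path and prompt are placeholders.

```python
import time
from statistics import mean

from vulkan_ilm import Llama

llm = Llama("path/to/model.gguf", gpu_layers=16)
prompt = "Benchmark prompt"

def timed_run() -> float:
    """Wall-clock one generation of the benchmark prompt."""
    start = time.perf_counter()
    llm.ask(prompt)
    return time.perf_counter() - start

timed_run()  # cold run: treated as warm-up and discarded
warm_times = [timed_run() for _ in range(3)]
print(f"warm mean: {mean(warm_times):.2f} s over {len(warm_times)} runs")
```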
## Troubleshooting

- Activate a venv and reinstall:

  ```bash
  python3 -m venv venv
  source venv/bin/activate
  pip install -e .
  ```

- Or run via Poetry:

  ```bash
  poetry run vulkanilm install
  ```

- Install `glslc` (Vulkan SDK / vulkan-tools):

  ```bash
  # Fedora
  sudo dnf install glslc

  # Ubuntu/Debian
  sudo apt install vulkan-tools
  ```

  Verify: `glslc --version`

- Install the libcurl development package:

  ```bash
  # Fedora
  sudo dnf install libcurl-devel

  # Ubuntu/Debian
  sudo apt install libcurl4-openssl-dev
  ```
## Project structure

```text
VulkanIlm/
├── vulkan_ilm/
│   ├── cli.py
│   ├── llama.py
│   ├── vulkan/
│   │   └── detector.py
│   ├── benchmark.py
│   ├── installer.py
│   └── streaming.py
├── benchmarks/        # benchmark scripts & data
├── pyproject.toml
└── README.md
```
## Contributing

We welcome contributions! Useful areas:

- GPU testing across drivers and OSes
- Additional model formats and quantization recipes
- Memory and performance optimizations
- Docs, reproducible benchmarks, and examples

See `CONTRIBUTING.md` for details, and look for `good-first-issue` tags.
## About the name

Ilm (علم) means knowledge or wisdom. Combined with Vulkan: "knowledge on fire", making fast local AI accessible to everyone, regardless of GPU brand or budget. 🔥
## License

MIT. See `LICENSE` for details.
## Links

- Repo: https://github.com/Talnz007/VulkanIlm
- Issues: report bugs or request features on GitHub
- Discussions: community Q&A
Built with passion by @Talnz007 — bringing fast, local AI to legacy GPUs everywhere.