
VulkanIlm 🚀🔥

GPU-Accelerated Local LLMs for Everyone (Vulkan + Ilm — "knowledge")

VulkanIlm is a Python-first wrapper and CLI around llama.cpp's Vulkan backend that brings fast local LLM inference to AMD, Intel, and NVIDIA GPUs — no CUDA required. Built for developers with legacy or non-NVIDIA hardware.


TL;DR

  • What: Python library + CLI to run LLMs locally using Vulkan GPU acceleration.
  • Why: Most acceleration tooling targets CUDA/NVIDIA — VulkanIlm opens GPU acceleration to AMD & Intel users.
  • Quick result: Small models can run up to ~33× faster on iGPUs (see benchmarks below); mid/large legacy GPUs get ~4–6× speedups vs CPU.

Key features

  • 🚀 Significant speedups vs CPU on legacy GPUs and iGPUs
  • 🎮 Broad GPU support: AMD, Intel, NVIDIA (via Vulkan)
  • 🐍 Python-first API + easy CLI tools
  • ⚡ Auto detection + GPU-specific optimizations
  • 📦 Auto build/install of llama.cpp Vulkan backend
  • 🔄 Real-time streaming token generation
  • ✅ Reproducible benchmark scripts in benchmarks/

Benchmarks (summary)

Benchmarks measured with Gemma-3n-E4B-it (6.9B) unless noted. Results depend on model quantization, GPU drivers, OS, and system load.

Hardware (OS)                                     | Model                        | CPU time | Vulkan (GPU) time | Speedup
Dell E7250 (i7-5600U iGPU), Fedora 42 Workstation | TinyLLaMA-1.1B-Chat (Q4_K_M) | 121 s    | 3 s               | 33×
AMD RX 580 8 GB, Ubuntu 22.04.5 LTS (Jammy)       | Gemma-3n-E4B-it (6.9B)       | 188.47 s | 44.74 s           | 4.21×
Intel Arc A770                                    | Gemma-3n-E4B-it (6.9B)       | ~120 s   | ~25 s             | ~4.8×
AMD RX 6600                                       | Gemma-3n-E4B-it (6.9B)       | ~90 s    | ~18 s             | ~5.0×
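
Speedup is CPU time divided by Vulkan time; for example, the RX 580 row gives 188.47 s / 44.74 s ≈ 4.21×. Entries marked "~" are approximate.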

iGPU notes

  • The Dell E7250 result shows that even older integrated GPUs can be very effective for smaller LLMs when using Vulkan.
  • Smaller models and appropriate quantizations (e.g., Q4_K_M) are more iGPU-friendly; driver and version differences significantly affect results.

Other tested (functional) models

  • DeepSeek-R1-Distill-Qwen-1.5B-unsloth-bnb-4bit — runs (not benchmarked).
  • LLaMA 3.1 8B — runs (not benchmarked).

ROCm / AMD notes

  • ROCm is not officially supported for gfx803 (RX 580).
  • Some community members try ROCm 5/6 workarounds on RX 580, but they are unstable/unsupported.
  • VulkanIlm offers a Vulkan-based path that avoids ROCm on legacy AMD cards.
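
To confirm Vulkan actually sees a legacy card before installing, you can check the device reported by the driver (vulkaninfo prints a deviceName field for each GPU):

vulkaninfo | grep -i devicename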

Install

Quick start

git clone https://github.com/Talnz007/VulkanIlm.git
cd VulkanIlm
pip install -e .

Prerequisites

  • Python 3.9+
  • Vulkan-capable GPU (AMD RX 400+, Intel Arc/Xe, NVIDIA GTX 900+)
  • Vulkan drivers installed and working

Install Vulkan tools (if needed)

Ubuntu / Debian:

sudo apt update
sudo apt install vulkan-tools libvulkan-dev

Fedora / RHEL:

sudo dnf install vulkan-tools vulkan-devel

Verify:

vulkaninfo
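
On recent vulkan-tools builds, vulkaninfo --summary gives a much shorter report listing just the detected GPUs:

vulkaninfo --summary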

Usage

CLI examples

# Auto-install llama.cpp with Vulkan support
vulkanilm install

# Check your GPU setup
vulkanilm vulkan-info

# Search and download models (if supported)
vulkanilm search "llama"
vulkanilm download microsoft/DialoGPT-medium

# Generate text
vulkanilm ask path/to/model.gguf --prompt "Explain quantum computing"

# Stream tokens in real-time
vulkanilm stream path/to/model.gguf "Tell me a story about AI"

# Run a benchmark
vulkanilm benchmark path/to/model.gguf --prompt "Benchmark prompt" --repeat 3

Python API (example)

from vulkan_ilm import Llama

# Load model (auto GPU optimization)
llm = Llama("path/to/model.gguf", gpu_layers=16)

# Synchronous generation
response = llm.ask("Explain the term 'ilm' in AI context.")
print(response)

# Streaming generation
for token in llm.stream_ask_real("Tell me about Vulkan API"):
    print(token, end='', flush=True)
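
If a model does not fit in VRAM, offloading fewer layers is the usual llama.cpp remedy. A minimal fallback sketch, assuming gpu_layers=0 keeps inference fully on the CPU (as in llama.cpp) and that load failures surface as ordinary exceptions:

from vulkan_ilm import Llama

try:
    # Offload 16 transformer layers to the GPU; lower this if VRAM runs out.
    llm = Llama("path/to/model.gguf", gpu_layers=16)
except Exception:
    # Assumed fallback: gpu_layers=0 runs everything on the CPU.
    llm = Llama("path/to/model.gguf", gpu_layers=0)

print(llm.ask("Hello from the fallback path."))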

Reproduce benchmarks (quick checklist)

  1. Use the exact GGUF model file and quantization referenced in benchmarks/.
  2. Use the benchmark script in benchmarks/run_benchmark.sh.
  3. Record: driver version, OS version, CPU frequency governor, and system load.
  4. Run benchmarks multiple times (cold and warm cache) and average the results, as in the sketch below.
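
As a starting point for step 4, here is a minimal timing sketch using only the Python API shown above; the averaging logic is illustrative, not the benchmarks/ script itself:

import time

from vulkan_ilm import Llama

llm = Llama("path/to/model.gguf", gpu_layers=16)
prompt = "Benchmark prompt"

llm.ask(prompt)  # warm-up pass (warm-cache run)

times = []
for _ in range(3):  # repeat and average, per step 4
    start = time.perf_counter()
    llm.ask(prompt)
    times.append(time.perf_counter() - start)

print(f"mean: {sum(times) / len(times):.2f} s over {len(times)} runs")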

Troubleshooting (Linux)

vulkanilm: command not found

  • Activate venv and reinstall:
python3 -m venv venv
source venv/bin/activate
pip install -e .
  • Or run via Poetry:
poetry run vulkanilm install

Could NOT find Vulkan (missing: glslc)

  • Install glslc (part of the shaderc tools / Vulkan SDK):
# Fedora
sudo dnf install glslc

# Ubuntu/Debian (older releases may need the LunarG Vulkan SDK instead)
sudo apt install glslc

Verify: glslc --version

Could NOT find CURL

  • Install libcurl dev:
# Fedora
sudo dnf install libcurl-devel

# Ubuntu/Debian
sudo apt install libcurl4-openssl-dev

Project structure

VulkanIlm/
├── vulkan_ilm/
│   ├── cli.py
│   ├── llama.py
│   ├── vulkan/
│   │   └── detector.py
│   ├── benchmark.py
│   ├── installer.py
│   └── streaming.py
├── benchmarks/             # benchmark scripts & data
├── pyproject.toml
└── README.md

Contributing

We welcome contributions! Useful areas:

  • GPU testing across drivers & OSes
  • Additional model formats & quant recipes
  • Memory & perf optimizations
  • Docs, reproducible benchmarks, and examples

See CONTRIBUTING.md for details. Look for good-first-issue tags.


The story behind the name

Ilm (علم) = knowledge / wisdom. Combined with Vulkan — “knowledge on fire”: making fast local AI accessible to everyone, regardless of GPU brand or budget. 🔥


License

MIT — see LICENSE for details.


Links & support


Built with passion by @Talnz007 — bringing fast, local AI to legacy GPUs everywhere.
