A comprehensive CLI tool for benchmarking GPU performance across CUDA, Triton, and PyTorch implementations. Designed as both a practical benchmarking tool and an educational resource for learning GPU programming and optimization.
- Multi-Backend Benchmarking - Compare performance across CUDA C++, Triton, and PyTorch
- Educational Examples - Learn GPU programming concepts with progressively complex examples
- Comprehensive Metrics - Memory bandwidth, FLOPS, latency, and throughput analysis (see the timing sketch after this list)
- Configurable Benchmarks - Test various problem sizes, data types, and configurations
- Extensible Architecture - Easy to add new kernels and benchmarks
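As a point of reference for the metrics above: bandwidth and FLOP throughput are derived from a measured kernel time plus a count of bytes moved or operations performed. The sketch below shows the standard CUDA-event timing recipe in plain PyTorch; it is illustrative only, not this suite's internal API.

```python
import torch

def time_kernel_ms(fn, warmup=10, iters=100):
    """Average wall time of a CUDA-backed callable, in milliseconds."""
    for _ in range(warmup):
        fn()
    torch.cuda.synchronize()
    start = torch.cuda.Event(enable_timing=True)
    end = torch.cuda.Event(enable_timing=True)
    start.record()
    for _ in range(iters):
        fn()
    end.record()
    torch.cuda.synchronize()
    return start.elapsed_time(end) / iters

n = 1 << 24
x = torch.randn(n, device="cuda")
y = torch.randn(n, device="cuda")

# Vector add reads x and y and writes the result: 3 * n * 4 bytes moved.
ms = time_kernel_ms(lambda: x + y)
gb_per_s = 3 * n * x.element_size() / (ms * 1e-3) / 1e9
print(f"vector add: {gb_per_s:.1f} GB/s")

# An (m x k) @ (k x n) matmul performs roughly 2 * m * k * n FLOPs.
a = torch.randn(1024, 1024, device="cuda")
b = torch.randn(1024, 1024, device="cuda")
ms = time_kernel_ms(lambda: a @ b)
gflops = 2 * 1024**3 / (ms * 1e-3) / 1e9
print(f"matmul: {gflops:.1f} GFLOP/s")
```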
- Vector addition, multiplication (see the Triton sketch after this list)
- Memory bandwidth tests
- Coalesced vs non-coalesced memory access patterns
- Matrix multiplication (naive and shared memory algorithms)
- Reduction operations (sum)
- Element-wise operations (ReLU, GELU, Softmax)
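For a taste of the educational material, a minimal Triton vector-add kernel looks like this. This is a self-contained sketch in the style of Triton's own tutorials; the function names are illustrative, not this suite's module layout.

```python
import torch
import triton
import triton.language as tl

@triton.jit
def vector_add_kernel(x_ptr, y_ptr, out_ptr, n_elements, BLOCK_SIZE: tl.constexpr):
    # Each program instance handles one contiguous block of elements.
    pid = tl.program_id(axis=0)
    offsets = pid * BLOCK_SIZE + tl.arange(0, BLOCK_SIZE)
    mask = offsets < n_elements  # guard the final partial block
    x = tl.load(x_ptr + offsets, mask=mask)
    y = tl.load(y_ptr + offsets, mask=mask)
    tl.store(out_ptr + offsets, x + y, mask=mask)

def vector_add(x: torch.Tensor, y: torch.Tensor) -> torch.Tensor:
    out = torch.empty_like(x)
    n = x.numel()
    grid = (triton.cdiv(n, 1024),)
    vector_add_kernel[grid](x, y, out, n, BLOCK_SIZE=1024)
    return out

x = torch.randn(10_000, device="cuda")
y = torch.randn(10_000, device="cuda")
assert torch.allclose(vector_add(x, y), x + y)
```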
```bash
# Install from source
git clone <repo-url>
cd gpu-benchmark-suite
pip install -e .

# Or using uv
uv pip install -e .
```
```bash
# Run all benchmarks
gpu-bench run-all

# Run a specific category
gpu-bench run memory

# Compare implementations
gpu-bench compare vector-add --sizes 1024,4096,16384

# Profile a specific kernel
gpu-bench profile matmul-cuda --size 1024
```
```text
gpu-benchmark-suite/
├── src/gpu_benchmark/
│   ├── cli/            # CLI interface
│   ├── benchmarks/     # Benchmark implementations
│   │   ├── memory.py   # Memory benchmarks
│   │   ├── math.py     # Math benchmarks
│   │   ├── cuda/       # CUDA wrapper implementations*
│   │   ├── triton/     # Triton implementations*
│   │   └── pytorch/    # PyTorch implementations*
│   │       # *New benchmarks will use this structure for cleaner organization
│   ├── core/           # Core benchmark infrastructure
│   ├── metrics/        # Performance metrics collection (stub)
│   ├── profiling/      # Profiling and analysis tools (stub)
│   └── utils/          # Utilities and helpers (empty)
├── kernels/            # CUDA source files
├── examples/           # Educational examples
├── tests/              # Basic test suite
└── docs/               # Documentation
```
- Memory Benchmarks - vector addition, multiplication, memory copy, strided access (see the strided-access sketch after this list)
- Math Benchmarks - matrix multiplication (naive & shared memory), reduction sum, activation functions (ReLU, GELU, Softmax)
- Multi-Backend Support - CUDA, Triton, and PyTorch implementations for core benchmarks
- CLI Interface - full command-line interface with device info, benchmark listing, running, and comparison
- Educational Examples - basics tutorial in the `examples/` directory
- Performance Profiling - basic profiling command available, but functionality is limited
- Metrics Collection - core infrastructure in place; detailed analysis still to be done
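The coalesced-versus-strided comparison can be reproduced in plain PyTorch without the suite at all. The sketch below (illustrative only, not the suite's code) times a contiguous copy of n elements against a stride-8 copy of the same n elements, which touches eight times as many cache lines per useful value:

```python
import torch

def timed_ms(fn, warmup=10, iters=100):
    """Average CUDA-event time of fn() in milliseconds."""
    for _ in range(warmup):
        fn()
    start = torch.cuda.Event(enable_timing=True)
    end = torch.cuda.Event(enable_timing=True)
    start.record()
    for _ in range(iters):
        fn()
    end.record()
    torch.cuda.synchronize()
    return start.elapsed_time(end) / iters

n = 1 << 22
src = torch.randn(8 * n, device="cuda")

# Both copies read and write n float32 values, but the strided read
# fetches a full cache line for each single element it actually uses.
contiguous_ms = timed_ms(lambda: src[:n].clone())
strided_ms = timed_ms(lambda: src[::8].clone())
print(f"contiguous: {contiguous_ms:.3f} ms  stride-8: {strided_ms:.3f} ms")
```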
- Advanced Kernels - convolutions, attention mechanisms, custom compute patterns
- Memory Optimization - shared memory utilization, cache optimization, memory hierarchy analysis
- Math - additional reduction operations (max, min, mean, etc.), advanced matrix operations (decompositions, solvers), FFT and signal processing kernels
- Profiling - NVIDIA Nsight integration for detailed profiling, memory access pattern analysis, occupancy optimization tools, energy consumption benchmarking
- Education - additional tutorial examples beyond basics, interactive Jupyter notebooks for each concept, performance optimization workshops
- Explore the `examples/` directory for comprehensive learning materials; see `examples/README.md` for the complete learning path from basic concepts to advanced optimization techniques.
- Run basic benchmarks, starting with memory operations (`memory/vector-add`).
- Use the `compare` command to understand the trade-offs between CUDA, Triton, and PyTorch.
- Try the matrix multiplication and reduction benchmarks, then explore the other math ops.
- Add your own kernels following the existing patterns (see the hypothetical sketch below).
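The extension points live under `src/gpu_benchmark/core/`. The skeleton below is purely hypothetical: the `Benchmark` base class and `register_benchmark` decorator are invented names used to illustrate the shape of a new benchmark, so copy the structure of an existing benchmark rather than this sketch verbatim.

```python
import torch
# Hypothetical imports -- check src/gpu_benchmark/core/ for the real names.
from gpu_benchmark.core import Benchmark, register_benchmark  # hypothetical

@register_benchmark("memory/vector-scale")  # hypothetical decorator
class VectorScale(Benchmark):
    """Scales a vector by a constant: one read and one write per element."""

    def setup(self, size: int) -> None:
        # Allocate inputs once so timing covers only the kernel itself.
        self.x = torch.randn(size, device="cuda")

    def run_pytorch(self) -> torch.Tensor:
        return 2.0 * self.x

    def bytes_moved(self, size: int) -> int:
        # One read plus one write of `size` float32 elements.
        return 2 * size * 4
```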
Note: currently only the `01_cuda_basics` example is implemented. Additional examples listed in `examples/README.md` are planned for future development.
- NVIDIA GPU (Compute Capability 7.0+)
- CUDA Toolkit 11.8+
- Python 3.11+
- PyTorch 2.0+
- Triton 2.0+
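To verify that an environment meets these requirements, a quick sanity check using standard PyTorch/Triton calls (independent of this suite):

```python
import torch
import triton

assert torch.cuda.is_available(), "No CUDA device visible to PyTorch"
major, minor = torch.cuda.get_device_capability()
assert (major, minor) >= (7, 0), f"Compute capability {major}.{minor} < 7.0"
print(f"GPU: {torch.cuda.get_device_name(0)} (sm_{major}{minor})")
print(f"CUDA (as built into PyTorch): {torch.version.cuda}")
print(f"PyTorch: {torch.__version__}, Triton: {triton.__version__}")
```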