## Summary
CubeCL 0.6.0 introduces significant enhancements to performance, functionality, and compatibility across various backends. Key features include n-dimensional convolution, multi-stage matrix multiplication (matmul), and dynamic shared memory support for CUDA. Performance optimizations, such as the reworked `into_contiguous` and double buffering, improve efficiency. New functionality, including random number generation, fp8/fp6 support, and recursive profiling, extends the library's capabilities.
Bug fixes address issues in backends (Metal, HIP, Vulkan, WASM), memory alignment, and deadlocks.
## What's New

### Features
- N-Dimensional Convolution: Added support for n-dimensional convolution operations (@wingertge, #649).
- Multi-Stage Convolution: Implemented multi-stage convolution for enhanced processing (@wingertge, #602).
- Matrix Multiplication Enhancements:
  - Added double-stage matmul with `k > 1` (@louisfd, #653).
  - Generalized tilewise loading for multiple tiles (@louisfd, #655).
  - Introduced ordered double buffering (@louisfd, #680); see the conceptual double-buffering sketch after this list.
  - Added specialized configs, event listener refactoring, and selection improvements (@louisfd, @nathanielsimard, #710, #711, #719, #722, #749, #751).
  - Unit matmul with plane matmul merging and double buffering (@louisfd, #686, #697).
- Random Number Generation: Added random number generation with vectorized kernels and improved tests (@Cielbird, #673, #677, #679, #681, #682).
- Low-Precision Support: Added fp8, fp6, and theoretical fp4 support (@wingertge, #675).
- Dynamic Shared Memory on CUDA: Enabled dynamic shared memory allocation for CUDA (@wingertge, #620).
- Intrinsic Macro: Introduced intrinsic macro support for enhanced flexibility (@wingertge, #639).
- Recursive Profiling: Added recursive profiling capabilities (@nathanielsimard, #674).
- Sync Plane Instruction: Added the `sync_plane` instruction for synchronization (@louisfd, #676).
- CubeCL Configuration: Introduced configuration options for CubeCL (@nathanielsimard, #665).
- Multi-Tensor Allocation: Added support for multi-tensor allocation to handle quantization (@wingertge, #661).
- Autotune Enhancements:
  - Made autotune optional (@nathanielsimard, #685).
  - Added basic error handling for autotune (@nathanielsimard, #738).
  - Improved matmul selection and fixed tuner deadlocks (@nathanielsimard, #771, #782).
- f16 Support for WGSL: Added f16 support to the WGSL backend (@wingertge, #658).
- GFX10 (RDNA2) Support: Added support for GFX10 architecture (@VirxEC, #662).
- Graphviz Output for SPIR-V: Added Graphviz output to `spirv-dump` for better visualization (@wingertge, #664).
- PTX WMMA for CUDA: Added PTX WMMA support for CUDA (@syl20bnr, #668).
- Tunable Priority: Introduced tunable priority for improved control (@nathanielsimard, #768).
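The double-buffering items above refer to work inside CubeCL's GPU matmul kernels, which is not shown here. The plain-Rust sketch below only illustrates the general double-buffering idea: filling one staging buffer with the next tile while computing on the other. All names are illustrative; none of this is CubeCL's actual API.

```rust
/// Plain-Rust illustration of double buffering for a tiled dot product.
/// `tile` is the stage size (the chunk of `k` each stage processes).
/// On a GPU, the copy into the inactive buffer overlaps with the
/// multiply-accumulate on the active buffer; here the steps run
/// sequentially, so only the buffer swapping is demonstrated.
fn tiled_dot_double_buffered(a: &[f32], b: &[f32], tile: usize) -> f32 {
    assert_eq!(a.len(), b.len());
    assert!(tile > 0 && a.len() >= tile && a.len() % tile == 0);

    // Two staging buffers per input: one being computed on, one being filled.
    let mut buf_a = [vec![0.0f32; tile], vec![0.0f32; tile]];
    let mut buf_b = [vec![0.0f32; tile], vec![0.0f32; tile]];

    // Preload the first tile into buffer 0.
    buf_a[0].copy_from_slice(&a[..tile]);
    buf_b[0].copy_from_slice(&b[..tile]);

    let mut acc = 0.0f32;
    let num_tiles = a.len() / tile;
    for t in 0..num_tiles {
        let cur = t % 2;
        let nxt = (t + 1) % 2;

        // "Prefetch" the next tile into the inactive buffer (issued before
        // the compute loop to mirror how a GPU kernel would overlap the two).
        if t + 1 < num_tiles {
            let start = (t + 1) * tile;
            buf_a[nxt].copy_from_slice(&a[start..start + tile]);
            buf_b[nxt].copy_from_slice(&b[start..start + tile]);
        }

        // Compute on the current buffer.
        for i in 0..tile {
            acc += buf_a[cur][i] * buf_b[cur][i];
        }
    }
    acc
}
```

On a GPU, the two copies would be asynchronous loads into shared memory, which is what lets the next stage's data movement hide behind the current stage's arithmetic.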
### Performance Improvements
- Reworked `into_contiguous` for better performance (@wingertge, #621); a conceptual sketch of the operation follows this list.
- Optimized double buffering event cleanup (@nathanielsimard, #663).
- Reduced mixed precision overhead (@nathanielsimard, #619).
- Improved compilation times (@nathanielsimard, #669).
- Sped up SPIR-V compilation and softened matmul autotune key (@nathanielsimard, #740).
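For background on the `into_contiguous` rework above: the operation copies a strided tensor into a buffer laid out contiguously (row-major) for its shape. The sketch below is a minimal CPU illustration of that idea; the function name and signature are hypothetical and this is not CubeCL's implementation, which runs as a GPU kernel.

```rust
/// Minimal CPU illustration of an `into_contiguous`-style copy:
/// read a tensor through arbitrary strides and write it out in
/// row-major (contiguous) order. Purely illustrative, not CubeCL code.
fn into_contiguous_cpu(data: &[f32], shape: &[usize], strides: &[usize]) -> Vec<f32> {
    let num_elems: usize = shape.iter().product();
    let mut out = Vec::with_capacity(num_elems);

    for linear in 0..num_elems {
        // Decompose the row-major linear index into per-dimension
        // coordinates, then map the coordinates through the source strides.
        let mut remainder = linear;
        let mut offset = 0;
        for d in (0..shape.len()).rev() {
            let coord = remainder % shape[d];
            remainder /= shape[d];
            offset += coord * strides[d];
        }
        out.push(data[offset]);
    }
    out
}

// Example: a 2x3 view stored column-major (strides [1, 2]) becomes row-major:
// into_contiguous_cpu(&[1., 4., 2., 5., 3., 6.], &[2, 3], &[1, 2])
//     == vec![1., 2., 3., 4., 5., 6.]
```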
### Bug Fixes
- Fixed cluster issues caused by merges (@wingertge, #648).
- Corrected an edge case in `calculate_cube_count_elemwise` (@wingertge, #646).
- Fixed Metal and HIP slice offset issues (@louisfd, #651).
- Resolved inner mutability and register mutability issues (@nathanielsimard, #652, #656).
- Fixed deadlock by avoiding lock captures (@ArthurBrussee, #657).
- Corrected buffer offset alignment and size calculation (@wingertge, #684).
- Fixed WASM by using `cfg(std_io)` (@ArthurBrussee, #670).
- Addressed Vulkan atomics issues (@nathanielsimard, #704).
- Fixed configuration environment parsing (@nathanielsimard, #678).
- Corrected random interval and logger profile issues (@laggui, @nathanielsimard, #744, #683).
- Fixed Metal backend tests and removed unused warnings (@louisfd, #762, #763).
- Addressed SPIR-V issues, including CMMA offset and compilation (@marcantoinem, @nathanielsimard, #752, #764).
- Fixed matmul cube count overflow (@louisfd, #760).
- Resolved tuner deadlock (@nathanielsimard, #782).
- Fixed the benchmark API for dead code elimination and memory alignment (@nathanielsimard, #712); a general sketch of guarding benchmarks against dead-code elimination follows this list.
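On the benchmark/dead-code-elimination fix above (#712): the actual change is internal to CubeCL's benchmark API and is not reproduced here. As general background, plain Rust benchmarks usually route inputs and results through `std::hint::black_box` so the optimizer cannot delete the measured work. A minimal, generic sketch:

```rust
use std::hint::black_box;
use std::time::Instant;

/// Minimal sketch of guarding a benchmark against dead-code elimination.
/// Without `black_box`, the compiler may notice the result is unused and
/// remove the whole computation, making the timing meaningless.
fn bench_sum(iters: u32) {
    let data: Vec<f32> = (0..1_000_000).map(|i| i as f32).collect();
    let start = Instant::now();
    for _ in 0..iters {
        // `black_box` on the input keeps the compiler from pre-computing the
        // sum; on the output it keeps the result from being optimized away.
        let sum: f32 = black_box(&data).iter().sum();
        black_box(sum);
    }
    println!("avg: {:?}", start.elapsed() / iters);
}
```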
### Refactorings
- Unified slice implementation across backends (@nathanielsimard, #644).
- Refactored `init` to `IntoMut` (@nathanielsimard, #659).
- Split `cubecl-linalg` into `cubecl-matmul` and `cubecl-convolution` (@louisfd, #708).
- Moved SPIR-V extension methods to the `rspirv-ext` crate (@wingertge, #596).
- Refactored matmul tiling scheme, setup, and compute resource dependency (@louisfd, #707, #709, #716).
- Moved profile logging to `ComputeClient` and made it async (@ArthurBrussee, #692).
- Improved unit selector and HIP device refactoring (@nathanielsimard, #758, #761).
- Cleaned up SPIR-V backend code (@marcantoinem, #769).
### Documentation & Testing
- Fixed typo in CubeCL book (@marcantoinem, #666).
- Improved documentation with additional CubeCL book pages (@marcantoinem, #733, #774).
- Enhanced matmul documentation and refactoring (@louisfd, #772, #775).
- Improved debug information (@nathanielsimard, #689).
- Added finer-grained feature flags for matmul tests (@louisfd, #734).
- Updated matmul benchmarks (@nathanielsimard, #781).
### Dependencies & Maintenance
- Bumped version to 0.6.0 (@syl20bnr, #643).
- Updated the `cudarc` dependency (@wingertge, #637).
- Updated `cubecl-hip-sys` to version 6.4.4348201 (@syl20bnr, #743).
- Bumped major versions of dependencies (@ArthurBrussee, #776).
- Silenced the `MAPPABLE_PRIMARY_BUFFERS` warning (@ArthurBrussee, #688).
Thank you to all contributors for making CubeCL 0.6.0 possible!