## Summary
CubeCL 0.6.0 introduces significant enhancements to performance, functionality, and compatibility across various backends. Key features include n-dimensional convolution, multi-stage matrix multiplication (matmul), and dynamic shared memory support for CUDA. Performance optimizations, such as the reworked `into_contiguous` and double buffering, improve efficiency. New functionality, including random number generation, fp8/fp6 support, and recursive profiling, extends the library's capabilities.
Bug fixes address issues in backends (Metal, HIP, Vulkan, WASM), memory alignment, and deadlocks.
## What's New

### Features
- N-Dimensional Convolution: Added support for n-dimensional convolution operations (@wingertge, #649).
- Multi-Stage Convolution: Implemented multi-stage convolution for enhanced processing (@wingertge, #602).
- Matrix Multiplication Enhancements:
  - Added double-stage matmul with `k > 1` (@louisfd, #653).
  - Generalized tilewise loading for multiple tiles (@louisfd, #655).
  - Introduced ordered double buffering (@louisfd, #680); see the conceptual double-buffering sketch after this list.
  - Added specialized configs, event listener refactoring, and selection improvements (@louisfd, @nathanielsimard, #710, #711, #719, #722, #749, #751).
  - Unit matmul with plane matmul merging and double buffering (@louisfd, #686, #697).
- Random Number Generation: Added random number generation with vectorized kernels and improved tests (@Cielbird, #673, #677, #679, #681, #682).
- Low-Precision Support: Added fp8, fp6, and theoretical fp4 support (@wingertge, #675).
- Dynamic Shared Memory on CUDA: Enabled dynamic shared memory allocation for CUDA (@wingertge, #620).
- Intrinsic Macro: Introduced intrinsic macro support for enhanced flexibility (@wingertge, #639).
- Recursive Profiling: Added recursive profiling capabilities (@nathanielsimard, #674).
- Sync Plane Instruction: Added the `sync_plane` instruction for synchronization (@louisfd, #676).
- CubeCL Configuration: Introduced configuration options for CubeCL (@nathanielsimard, #665).
- Multi-Tensor Allocation: Added support for multi-tensor allocation to handle quantization (@wingertge, #661).
- Autotune Enhancements:
  - Made autotune optional (@nathanielsimard, #685).
  - Added basic error handling for autotune (@nathanielsimard, #738).
  - Improved matmul selection and fixed tuner deadlocks (@nathanielsimard, #771, #782).
- f16 Support for WGSL: Added f16 support to the WGSL backend (@wingertge, #658).
- GFX10 (RDNA2) Support: Added support for GFX10 architecture (@VirxEC, #662).
- Graphviz Output for SPIR-V: Added Graphviz output to `spirv-dump` for better visualization (@wingertge, #664).
- PTX WMMA for CUDA: Added PTX WMMA support for CUDA (@syl20bnr, #668).
- Tunable Priority: Introduced tunable priority for improved control (@nathanielsimard, #768).
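The double-buffering items above refer to work inside CubeCL's GPU matmul kernels, which is not shown here. The plain-Rust sketch below only illustrates the general double-buffering idea: filling one staging buffer with the next tile while computing on the other. All names are illustrative; none of this is CubeCL's actual API.

```rust
/// Plain-Rust illustration of double buffering for a tiled dot product.
/// `tile` is the stage size (the chunk of `k` each stage processes).
/// On a GPU, the copy into the inactive buffer overlaps with the
/// multiply-accumulate on the active buffer; here the steps run
/// sequentially, so only the buffer swapping is demonstrated.
fn tiled_dot_double_buffered(a: &[f32], b: &[f32], tile: usize) -> f32 {
    assert_eq!(a.len(), b.len());
    assert!(tile > 0 && a.len() >= tile && a.len() % tile == 0);

    // Two staging buffers per input: one being computed on, one being filled.
    let mut buf_a = [vec![0.0f32; tile], vec![0.0f32; tile]];
    let mut buf_b = [vec![0.0f32; tile], vec![0.0f32; tile]];

    // Preload the first tile into buffer 0.
    buf_a[0].copy_from_slice(&a[..tile]);
    buf_b[0].copy_from_slice(&b[..tile]);

    let mut acc = 0.0f32;
    let num_tiles = a.len() / tile;
    for t in 0..num_tiles {
        let cur = t % 2;
        let nxt = (t + 1) % 2;

        // "Prefetch" the next tile into the inactive buffer (issued before
        // the compute loop to mirror how a GPU kernel would overlap the two).
        if t + 1 < num_tiles {
            let start = (t + 1) * tile;
            buf_a[nxt].copy_from_slice(&a[start..start + tile]);
            buf_b[nxt].copy_from_slice(&b[start..start + tile]);
        }

        // Compute on the current buffer.
        for i in 0..tile {
            acc += buf_a[cur][i] * buf_b[cur][i];
        }
    }
    acc
}
```

On a GPU, the two copies would be asynchronous loads into shared memory, which is what lets the next stage's data movement hide behind the current stage's arithmetic.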
### Performance Improvements
- Reworked `into_contiguous` for better performance (@wingertge, #621); a conceptual sketch of the operation follows this list.
- Optimized double buffering event cleanup (@nathanielsimard, #663).
- Reduced mixed precision overhead (@nathanielsimard, #619).
- Improved compilation times (@nathanielsimard, #669).
- Sped up SPIR-V compilation and softened matmul autotune key (@nathanielsimard, #740).
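For background on the `into_contiguous` rework above: the operation copies a strided tensor into a buffer laid out contiguously (row-major) for its shape. The sketch below is a minimal CPU illustration of that idea; the function name and signature are hypothetical and this is not CubeCL's implementation, which runs as a GPU kernel.

```rust
/// Minimal CPU illustration of an `into_contiguous`-style copy:
/// read a tensor through arbitrary strides and write it out in
/// row-major (contiguous) order. Purely illustrative, not CubeCL code.
fn into_contiguous_cpu(data: &[f32], shape: &[usize], strides: &[usize]) -> Vec<f32> {
    let num_elems: usize = shape.iter().product();
    let mut out = Vec::with_capacity(num_elems);

    for linear in 0..num_elems {
        // Decompose the row-major linear index into per-dimension
        // coordinates, then map the coordinates through the source strides.
        let mut remainder = linear;
        let mut offset = 0;
        for d in (0..shape.len()).rev() {
            let coord = remainder % shape[d];
            remainder /= shape[d];
            offset += coord * strides[d];
        }
        out.push(data[offset]);
    }
    out
}

// Example: a 2x3 view stored column-major (strides [1, 2]) becomes row-major:
// into_contiguous_cpu(&[1., 4., 2., 5., 3., 6.], &[2, 3], &[1, 2])
//     == vec![1., 2., 3., 4., 5., 6.]
```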
### Bug Fixes
- Fixed cluster issues caused by merges (@wingertge, #648).
- Corrected an edge case in `calculate_cube_count_elemwise` (@wingertge, #646).
- Fixed Metal and HIP slice offset issues (@louisfd, #651).
- Resolved inner mutability and register mutability issues (@nathanielsimard, #652, #656).
- Fixed deadlock by avoiding lock captures (@ArthurBrussee, #657).
- Corrected buffer offset alignment and size calculation (@wingertge, #684).
- Fixed WASM by using `cfg(std_io)` (@ArthurBrussee, #670).
- Addressed Vulkan atomics issues (@nathanielsimard, #704).
- Fixed configuration environment parsing (@nathanielsimard, #678).
- Corrected random interval and logger profile issues (@laggui, @nathanielsimard, #744, #683).
- Fixed Metal backend tests and removed unused warnings (@louisfd, #762, #763).
- Addressed SPIR-V issues, including CMMA offset and compilation (@marcantoinem, @nathanielsimard, #752, #764).
- Fixed matmul cube count overflow (@louisfd, #760).
- Resolved tuner deadlock (@nathanielsimard, #782).
- Fixed the benchmark API for dead code elimination and memory alignment (@nathanielsimard, #712); a general sketch of guarding benchmarks against dead-code elimination follows this list.
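On the benchmark/dead-code-elimination fix above (#712): the actual change is internal to CubeCL's benchmark API and is not reproduced here. As general background, plain Rust benchmarks usually route inputs and results through `std::hint::black_box` so the optimizer cannot delete the measured work. A minimal, generic sketch:

```rust
use std::hint::black_box;
use std::time::Instant;

/// Minimal sketch of guarding a benchmark against dead-code elimination.
/// Without `black_box`, the compiler may notice the result is unused and
/// remove the whole computation, making the timing meaningless.
fn bench_sum(iters: u32) {
    let data: Vec<f32> = (0..1_000_000).map(|i| i as f32).collect();
    let start = Instant::now();
    for _ in 0..iters {
        // `black_box` on the input keeps the compiler from pre-computing the
        // sum; on the output it keeps the result from being optimized away.
        let sum: f32 = black_box(&data).iter().sum();
        black_box(sum);
    }
    println!("avg: {:?}", start.elapsed() / iters);
}
```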
### Refactorings
- Unified slice implementation across backends (@nathanielsimard, #644).
- Refactored `init` to `IntoMut` (@nathanielsimard, #659).
- Split `cubecl-linalg` into `cubecl-matmul` and `cubecl-convolution` (@louisfd, #708).
- Moved SPIR-V extension methods to the `rspirv-ext` crate (@wingertge, #596).
- Refactored matmul tiling scheme, setup, and compute resource dependency (@louisfd, #707, #709, #716).
- Moved profile logging to `ComputeClient` and made it async (@ArthurBrussee, #692).
- Improved unit selector and HIP device refactoring (@nathanielsimard, #758, #761).
- Cleaned up SPIR-V backend code (@marcantoinem, #769).
### Documentation & Testing
- Fixed typo in CubeCL book (@marcantoinem, #666).
- Improved documentation with additional CubeCL book pages (@marcantoinem, #733, #774).
- Enhanced matmul documentation and refactoring (@louisfd, #772, #775).
- Improved debug information (@nathanielsimard, #689).
- Added finer-grained feature flags for matmul tests (@louisfd, #734).
- Updated matmul benchmarks (@nathanielsimard, #781).
### Dependencies & Maintenance
- Bumped version to 0.6.0 (@syl20bnr, #643).
- Updated the `cudarc` dependency (@wingertge, #637).
- Updated `cubecl-hip-sys` to version 6.4.4348201 (@syl20bnr, #743).
- Bumped major versions of dependencies (@ArthurBrussee, #776).
- Silenced the `MAPPABLE_PRIMARY_BUFFERS` warning (@ArthurBrussee, #688).
Thank you to all contributors for making CubeCL 0.6.0 possible!