Personal records of CUDA kernel implementations. These implementations are not heavily optimized and are intended mainly for learning purposes.
- Softmax
- ReLU
- GeLU
- GEMM
- Layer Normalization
- Multi-Head Self-Attention
- Matrix Transpose
- Multi-Tensor Apply (e.g., addition)
More kernels are coming...
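
To give a flavor of the kernels in this repo, below is a minimal sketch of an element-wise ReLU kernel with a grid-stride loop. It is an illustrative example, not the exact implementation used here; the kernel and variable names are placeholders.

```cuda
#include <cuda_runtime.h>

// Element-wise ReLU: out[i] = max(in[i], 0).
// A grid-stride loop lets any launch configuration cover any input length.
__global__ void relu_kernel(const float* in, float* out, int n) {
    for (int i = blockIdx.x * blockDim.x + threadIdx.x; i < n;
         i += gridDim.x * blockDim.x) {
        out[i] = fmaxf(in[i], 0.0f);
    }
}

int main() {
    const int n = 1 << 20;
    float *d_in, *d_out;
    cudaMalloc(&d_in, n * sizeof(float));
    cudaMalloc(&d_out, n * sizeof(float));
    // ... fill d_in with data ...

    const int block = 256;
    const int grid = (n + block - 1) / block;
    relu_kernel<<<grid, block>>>(d_in, d_out, n);
    cudaDeviceSynchronize();

    cudaFree(d_in);
    cudaFree(d_out);
    return 0;
}
```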