In this work, we implement FlashAttention-based hardware accelerators in C++ with our proposed ExpMul operator, which fuses the floating-point exponential function and multiplication into simple add and shift operations in fixed-point arithmetic, without the need for an additional conversion back to the floating-point domain, since the result is produced directly as a floating-point number. To evaluate power metrics, we run inference with Google's FLAN-T5 LLM. More specifically, we run the PyTorch model from Hugging Face and extract inter-layer results for the different tasks included in the GLUE dataset to use as inputs in main.cc.
Most of the floating-point functionality utilizes the Fast-Float4HLS library, publicly available on GitHub.
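As a rough illustration of the ExpMul idea, the host-side C++ sketch below computes exp(x)·v by converting x·log2(e) to fixed point, adding its integer part directly to the exponent field of v, and approximating 2^f for the fractional part with a single polynomial segment. This is only a conceptual sketch with names of our own choosing; the synthesizable bfloat16 operator lives in src/fused_operators.h and uses piecewise-linear coefficients (presumably those generated by utils/gen_pwl_coeff.py).

```cpp
// Conceptual, host-side sketch of the ExpMul idea: compute exp(x) * v
// without calling exp() and without a fixed-point -> floating-point
// conversion at the end. NOT the synthesizable operator in
// src/fused_operators.h; names and the 2^f approximation are illustrative.
#include <cstdint>
#include <cstring>
#include <cmath>
#include <cstdio>

float expmul_sketch(float x, float v) {
    // exp(x) * v = 2^(x * log2(e)) * v
    const float LOG2E = 1.4426950408889634f;
    float t = x * LOG2E;                  // in hardware: a fixed-point multiply

    int   n = (int)std::floor(t);         // integer part -> exponent adjustment
    float f = t - (float)n;               // fractional part, 0 <= f < 1

    // Add n directly to the biased exponent field of v (a simple adder in
    // hardware). Assumes v != 0 and the result stays in the normal range.
    uint32_t bits;
    std::memcpy(&bits, &v, sizeof(bits));
    bits += (uint32_t)n << 23;
    float scaled;                          // scaled == v * 2^n
    std::memcpy(&scaled, &bits, sizeof(scaled));

    // 2^f approximated by one quadratic segment (exact at f = 0, 0.5, 1);
    // a piecewise-linear table would tighten this further.
    return scaled * (1.0f + 0.657f * f + 0.343f * f * f);
}

int main() {
    float x = -1.25f, v = 0.75f;
    std::printf("approx: %g  exact: %g\n", expmul_sketch(x, v), std::exp(x) * v);
    return 0;
}
```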
This repository is organized as follows:
```
.
├── src
│   ├── attention.h
│   ├── bf16_arithm.h
│   ├── defines.h
│   ├── file_io.h
│   ├── fused_operators.h
│   ├── logging.h
│   ├── main.cc
│   ├── math_ops.h
│   └── reduction.h
│
├── utils
│   ├── gen_pwl_coeff.py
│   └── pack.py
│
├── LICENSE
├── README.md
└── setup.sh
```
./src/
This directory contains the C++ implementation of the FlashAttention-based accelerators with the ExpMul operator.
The attention.h file contains the implementation of the FlashAttention accelerators.
The fused_operators.h file contains the implementation of the ExpMul operator.
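For readers unfamiliar with FlashAttention, the snippet below is a minimal, single-query, scalar-value software model of the online-softmax recurrence that such accelerators implement; the exp(·)·(·) products marked in the loop are exactly the operations an ExpMul unit replaces. All names here are illustrative and do not reflect the actual interface of attention.h.

```cpp
// Minimal software model of the online-softmax recurrence behind
// FlashAttention, for one query and scalar values. The exp(.)*(.)
// products marked below are the ExpMul candidates.
#include <algorithm>
#include <cmath>
#include <cstddef>
#include <cstdio>
#include <limits>
#include <vector>

// Streaming attention for one query: scores s[i] = q.k_i, scalar values v[i].
float online_attention(const std::vector<float>& s, const std::vector<float>& v) {
    float m = -std::numeric_limits<float>::infinity(); // running max
    float l = 0.0f;                                    // running sum of exponentials
    float o = 0.0f;                                    // running weighted sum

    for (std::size_t i = 0; i < s.size(); ++i) {
        float m_new = std::max(m, s[i]);
        float alpha = std::exp(m - m_new);     // rescaling factor for old accumulators
        float p     = std::exp(s[i] - m_new);  // weight of the new element

        l = l * alpha + p;         // exp(.)*(.)  -> ExpMul candidate
        o = o * alpha + p * v[i];  // exp(.)*(.)  -> ExpMul candidates
        m = m_new;
    }
    return o / l; // softmax(s) . v
}

int main() {
    std::vector<float> s = {0.5f, 2.0f, -1.0f};
    std::vector<float> v = {1.0f, 2.0f, 3.0f};
    std::printf("attention output: %g\n", online_attention(s, v));
    return 0;
}
```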
./utils/
This directory contains Python utility scripts.
./setup.sh
A bash script to fetch all required dependencies.
TODO
- Python scripts for automatically loading and extracting FLAN-T5 inputs on GLUE.
- Fix dependency issues regarding the HLS math library and Fast-Float4HLS.
Contributors
Currently active: Kosmas Alexandridis and Giorgos Dimitrakopoulos
License
Fused-ExpMul is licensed with the MIT License. You are completely free to redistribute your work derived from Fused-ExpMul.