In this work we implement in C++ hardware accelerators for attention inference following FlashAttention-2 and our proposed Flash-D approach, which computes the output vector as a weighted sum of its previous value and an incoming value vector. The weight is computed incrementally, and its formulation eliminates the max computation otherwise needed for numerical stability and hides the final vector division inside a sigmoid calculation. To evaluate power metrics, we run inference using Microsoft's PromptBench framework. More specifically, we run PyTorch models from Hugging Face and extract inter-layer results for different prompts to use as inputs in main.cc.
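As a rough illustration of the recurrence described above, the sketch below updates a single output row in plain C++ floats. This is not the HLS code in flash_atten.h, and the function and variable names are illustrative only; it assumes the weight takes the form w_j = sigmoid(s_j - log(l_{j-1})), with l the running softmax denominator, which follows from exp(s_j) / (l_{j-1} + exp(s_j)) = sigmoid(s_j - log(l_{j-1})) and is why no running max and no final division are needed.

```cpp
#include <cmath>
#include <cstdio>
#include <vector>

// Division-free online attention update for one output row (plain-float reference,
// not the HLS implementation; names are illustrative). For scores of moderate
// magnitude this matches standard softmax attention up to rounding.
void online_attention_row(const std::vector<float>& scores,          // s_1..s_N (scaled q.k_j)
                          const std::vector<std::vector<float>>& V,  // N value vectors of dim d
                          std::vector<float>& out) {                 // attention output, dim d
  const std::size_t d = V.front().size();
  out = V.front();          // o_1 = v_1, since the first weight is exp(s_1)/exp(s_1) = 1
  float log_l = scores[0];  // log of the running softmax denominator, l_1 = exp(s_1)
  for (std::size_t j = 1; j < scores.size(); ++j) {
    // w_j = exp(s_j) / (l_{j-1} + exp(s_j)) = sigmoid(s_j - log(l_{j-1}))
    const float w = 1.0f / (1.0f + std::exp(log_l - scores[j]));
    for (std::size_t k = 0; k < d; ++k)
      out[k] = (1.0f - w) * out[k] + w * V[j][k];  // o_j = (1 - w_j) * o_{j-1} + w_j * v_j
    // log(l_j) = log(l_{j-1} + exp(s_j)) = s_j - log(w_j): no max tracking or division
    log_l = scores[j] - std::log(w);
  }
}

int main() {
  // Tiny example: 3 key/value pairs, head dimension 2.
  std::vector<float> s = {0.5f, -1.0f, 2.0f};
  std::vector<std::vector<float>> V = {{1.0f, 0.0f}, {0.0f, 1.0f}, {1.0f, 1.0f}};
  std::vector<float> o;
  online_attention_row(s, V, o);
  std::printf("o = [%f, %f]\n", o[0], o[1]);
  return 0;
}
```

Feeding one query row's scaled Q.K scores through this loop yields the same result as standard softmax attention up to floating-point rounding, provided the scores stay in a range where the float sigmoid does not underflow.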
Most of the floating-point functionality utilizes the Fast-Float4HLS library, publicly available on GitHub.
This repository is organized as follows:
.
├── src
│ ├── defines.h
│ ├── flash_atten.h
│ ├── file_io.h
│ ├── fp_arithm.h
│ ├── logging.h
│ ├── main.cc
│ ├── math_ops.h
│ └── reduction.h
│
├── utils
│ ├── gen_pwl_coeff.py
│ └── pack.py
│
├── LICENSE
├── README.md
└── setup.sh
./src/
This directory contains the C++ source files. The flash_atten.h file contains the implementation of the FlashAttention-2 and Flash-D accelerators.
./utils/
This directory contains Python utility scripts.
./setup.sh
A bash script to fetch all required dependencies.
TODO
- Update design files to their optimized versions.
- Add Python script flows for running PromptBench outputs on the C++ designs.
- Provide extension files for 8-bit reduced-precision Fast-Float4HLS datatypes.
Contributors
Currently active: Kosmas Alexandridis and Giorgos Dimitrakopoulos

License
Flash-D is licensed under the MIT License. You are free to redistribute your work derived from Flash-D.