I'm not sure how to enable Flash Attention for BERT when starting the Triton server, in order to accelerate inference.

Replies: 1 comment

It's probably not what you are looking for, but compiling your model to TensorRT will use Flash Attention.
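
If you go the TensorRT route, here is a minimal sketch of one way to do it, assuming a Hugging Face `bert-base-uncased` checkpoint (the model name, shapes, and file names are illustrative, not from the thread): export the model to ONNX, then build an engine with `trtexec`. With `--fp16`, TensorRT can select its fused multi-head attention kernels for transformer patterns on recent GPUs, which is the Flash-Attention-style path mentioned above.

```python
import torch
from transformers import BertModel

# return_dict=False makes the forward pass return plain tuples,
# which torch.onnx.export traces cleanly.
model = BertModel.from_pretrained("bert-base-uncased", return_dict=False).eval()

# Dummy inputs fixing batch=1, seq_len=128 for the export trace.
input_ids = torch.ones(1, 128, dtype=torch.long)
attention_mask = torch.ones(1, 128, dtype=torch.long)

torch.onnx.export(
    model,
    (input_ids, attention_mask),
    "model.onnx",
    input_names=["input_ids", "attention_mask"],
    output_names=["last_hidden_state", "pooler_output"],
    dynamic_axes={
        "input_ids": {0: "batch", 1: "seq"},
        "attention_mask": {0: "batch", 1: "seq"},
        "last_hidden_state": {0: "batch", 1: "seq"},
    },
    opset_version=17,
)

# Then build the TensorRT engine from the ONNX file; the shape
# ranges below are illustrative:
#   trtexec --onnx=model.onnx --saveEngine=model.plan --fp16 \
#     --minShapes=input_ids:1x1,attention_mask:1x1 \
#     --optShapes=input_ids:8x128,attention_mask:8x128 \
#     --maxShapes=input_ids:32x512,attention_mask:32x512
```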
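To then serve the compiled engine at Triton start-up, Triton only needs the plan file placed in a model repository. A sketch of the layout, with illustrative names:

```text
model_repository/
└── bert/
    ├── config.pbtxt
    └── 1/
        └── model.plan   # the engine built by trtexec above
```

A minimal `config.pbtxt` for a TensorRT engine (Triton can auto-complete most of the remaining input/output details from the plan itself):

```text
platform: "tensorrt_plan"
max_batch_size: 32
```

Then launch the server pointing at the repository, e.g. `tritonserver --model-repository=/path/to/model_repository`. The attention kernel choice is baked into the engine at build time, so nothing Flash-Attention-specific needs to be enabled on the Triton side.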