Is there a CPU equivalent implementation of the _flash_attention_forward function in llama.cpp? #12193
guoguo1314 asked this question in Q&A (Unanswered)
Hello everyone, I would like to ask whether there is a CPU equivalent of the _flash_attention_forward function in llama.cpp. The reference implementation is here:
https://github.com/huggingface/transformers/blob/main/src/transformers/modeling_flash_attention_utils.py#L231
Of course, the following implementations would also work for me:
1) The actual implementations of the core sub-functions of the _flash_attention_forward function, specifically referring to self-implemented versions of _upad_input, flash_attn_varlen_func, and pad_input.
2) Alternatively, implementations of functions equivalent to these three sub-functions, particularly flash_attn_varlen_func. Having equivalent implementations for all three would be even better (see the sketch after this list for roughly what I mean by "equivalent").
3) Or any other ideas would be welcome.
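For clarity, here is a minimal plain-PyTorch sketch of what I understand flash_attn_varlen_func to compute on the packed (total_tokens, n_heads, head_dim) layout with cu_seqlens offsets. The function name and argument handling here are my own simplification for illustration, not the flash-attn API, and this is only a naive reference, not an optimized implementation:

```python
# Naive CPU reference for variable-length attention over packed sequences.
# Assumes q, k, v have shape (total_tokens, n_heads, head_dim) and cu_seqlens
# holds cumulative sequence-start offsets, (batch_size + 1,) ints.
import math
import torch

def varlen_attention_reference(q, k, v, cu_seqlens, causal=True, scale=None):
    if scale is None:
        scale = 1.0 / math.sqrt(q.size(-1))
    out = torch.empty_like(q)
    for b in range(cu_seqlens.numel() - 1):
        s, e = int(cu_seqlens[b]), int(cu_seqlens[b + 1])
        # Slice out one sequence and move heads first: (n_heads, seq_len, head_dim)
        qb = q[s:e].transpose(0, 1)
        kb = k[s:e].transpose(0, 1)
        vb = v[s:e].transpose(0, 1)
        # Scaled dot-product scores: (n_heads, seq_len, seq_len)
        scores = torch.matmul(qb, kb.transpose(-1, -2)) * scale
        if causal:
            seq_len = e - s
            mask = torch.triu(torch.ones(seq_len, seq_len, dtype=torch.bool), diagonal=1)
            scores = scores.masked_fill(mask, float("-inf"))
        probs = torch.softmax(scores, dim=-1)
        # Back to (seq_len, n_heads, head_dim) and scatter into the packed output
        out[s:e] = torch.matmul(probs, vb).transpose(0, 1)
    return out
```

If something in llama.cpp matches this numerically, the _upad_input / pad_input parts should reduce to a gather/scatter driven by the attention mask, if I understand them correctly.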
thanks