Is there a CPU equivalent implementation of the _flash_attention_forward function in llama.cpp? #12193
guoguo1314 asked this question in Q&A (Unanswered)
Hello everyone, I would like to ask whether there is a CPU equivalent of the _flash_attention_forward function in llama.cpp. The reference implementation is here:
https://github.com/huggingface/transformers/blob/main/src/transformers/modeling_flash_attention_utils.py#L231
Of course, the following implementations would also work for me:
1) The actual implementations of the core sub-functions of the _flash_attention_forward function, specifically referring to self-implemented versions of _upad_input, flash_attn_varlen_func, and pad_input.
2) Alternatively, implementations of functions equivalent to these three sub-functions, particularly flash_attn_varlen_func. Having equivalent implementations for all three would be even better (see the sketch after this list for roughly what I mean by "equivalent").
3) Or any other ideas would be welcome.
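For clarity, here is a minimal plain-PyTorch sketch of what I understand flash_attn_varlen_func to compute on the packed (total_tokens, n_heads, head_dim) layout with cu_seqlens offsets. The function name and argument handling here are my own simplification for illustration, not the flash-attn API, and this is only a naive reference, not an optimized implementation:

```python
# Naive CPU reference for variable-length attention over packed sequences.
# Assumes q, k, v have shape (total_tokens, n_heads, head_dim) and cu_seqlens
# holds cumulative sequence-start offsets, (batch_size + 1,) ints.
import math
import torch

def varlen_attention_reference(q, k, v, cu_seqlens, causal=True, scale=None):
    if scale is None:
        scale = 1.0 / math.sqrt(q.size(-1))
    out = torch.empty_like(q)
    for b in range(cu_seqlens.numel() - 1):
        s, e = int(cu_seqlens[b]), int(cu_seqlens[b + 1])
        # Slice out one sequence and move heads first: (n_heads, seq_len, head_dim)
        qb = q[s:e].transpose(0, 1)
        kb = k[s:e].transpose(0, 1)
        vb = v[s:e].transpose(0, 1)
        # Scaled dot-product scores: (n_heads, seq_len, seq_len)
        scores = torch.matmul(qb, kb.transpose(-1, -2)) * scale
        if causal:
            seq_len = e - s
            mask = torch.triu(torch.ones(seq_len, seq_len, dtype=torch.bool), diagonal=1)
            scores = scores.masked_fill(mask, float("-inf"))
        probs = torch.softmax(scores, dim=-1)
        # Back to (seq_len, n_heads, head_dim) and scatter into the packed output
        out[s:e] = torch.matmul(probs, vb).transpose(0, 1)
    return out
```

If something in llama.cpp matches this numerically, the _upad_input / pad_input parts should reduce to a gather/scatter driven by the attention mask, if I understand them correctly.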
thanks