Replies: 5 comments 1 reply
-
I think we could definitely separate them, but until now we haven't had anyone asking about this with a concrete use case, so I don't think this is something we should prioritize (we have enough tasks already!).
This is why we have factored out … (see the sketch below).
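For illustration, a minimal sketch of plugging a custom function into the module's `attention_fn` argument, assuming that is the hook being referred to; the toy `uniform_attention` is mine, not part of Flax, and the two-argument `(inputs_q, inputs_kv)` call matches the signature under discussion:

```python
import jax
import jax.numpy as jnp
import flax.linen as nn

def uniform_attention(query, key, value, **kwargs):
    # Toy attention variant (for illustration only): every query attends
    # uniformly to all key/value positions, i.e. we just mean-pool the
    # values over the kv-length axis.
    # Shapes follow flax.linen.dot_product_attention's convention:
    #   query/key/value: [batch, length, num_heads, head_dim]
    del key, kwargs  # a real attention_fn would score query against key
    pooled = value.mean(axis=-3, keepdims=True)           # [batch, 1, heads, dim]
    return jnp.repeat(pooled, query.shape[-3], axis=-3)   # [batch, q_len, heads, dim]

attn = nn.MultiHeadDotProductAttention(num_heads=4, attention_fn=uniform_attention)
x = jnp.ones((2, 16, 32))                        # [batch, length, features]
params = attn.init(jax.random.PRNGKey(0), x, x)  # (inputs_q, inputs_kv)
y = attn.apply(params, x, x)                     # y.shape == (2, 16, 32)
```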
I think you are technically right, but on the other hand the name `MultiHeadDotProductAttention` is known from the literature, and since dot-product attention is the default I think the current name is acceptable as well. I don't have a strong opinion here.
-
Hello! If this is still on the agenda, it would be great to have keys and values separated. (I am converting from Haiku, where this is implemented in `MultiHeadAttention`.) An example where the separation is needed is attentive neural processes (see Figure 2). Thanks and kind regards,
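For context, a minimal sketch of the cross-attention step in question, written against `flax.linen.dot_product_attention`; the tensor names and shapes are illustrative, not from the paper's code:

```python
import jax.numpy as jnp
import flax.linen as nn

# Attentive-neural-process cross-attention (Kim et al. 2019, Fig. 2):
# queries come from the target inputs, keys from the context inputs,
# and values from the per-context-point representations r_i, so keys
# and values are genuinely different tensors.
batch, n_context, n_target, heads, head_dim = 2, 10, 5, 4, 8
x_target  = jnp.ones((batch, n_target,  heads, head_dim))  # queries
x_context = jnp.ones((batch, n_context, heads, head_dim))  # keys
r_context = jnp.ones((batch, n_context, heads, head_dim))  # values

out = nn.dot_product_attention(x_target, x_context, r_context)
# out: [batch, n_target, heads, head_dim], one representation per target
```

A module whose call signature only accepts a single `inputs_kv` cannot express this without dropping down to the function-level API.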
-
After this commit, …
-
@cgarciae @marcvanzee I second renaming to … (perhaps even just …).
-
Come to think of it, is there any reason not to just call it …?
-
Hey! I am very curious about the API exposed by `MultiHeadDotProductAttention`. Currently it has the following signature:
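(The block below is my reconstruction of the signature from the `flax.linen` source of the time; the elided fields and exact defaults are assumptions.)

```python
import flax.linen as nn

class MultiHeadDotProductAttention(nn.Module):
    num_heads: int
    # ... further hyperparameters (qkv_features, out_features,
    # dropout_rate, attention_fn=dot_product_attention, ...) elided ...

    def __call__(self, inputs_q, inputs_kv, mask=None, deterministic=None):
        # inputs_q:  [batch, q_length, features]  -> projected to queries
        # inputs_kv: [batch, kv_length, features] -> projected to BOTH
        #            keys and values (the coupling discussed below)
        ...
```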
It seems to be tying `keys` and `values` together into `inputs_kv`, which (while probably the most common option) seems restrictive: both Keras and PyTorch let you define them separately. I don't know of any case where they might be different, but who knows what researchers can come up with. Would it be worth separating `inputs_kv` into `inputs_k` and `inputs_v`?
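For concreteness, a hypothetical sketch of what a module with separate key and value inputs could look like, assembled from `nn.DenseGeneral` and `nn.dot_product_attention`; the `SplitKVAttention` name and its internals are mine, not an existing Flax API:

```python
import jax
import jax.numpy as jnp
import flax.linen as nn

class SplitKVAttention(nn.Module):
    # Hypothetical module: like MultiHeadDotProductAttention, but keys
    # and values are projected from separate input tensors.
    num_heads: int
    qkv_features: int

    @nn.compact
    def __call__(self, inputs_q, inputs_k, inputs_v):
        head_dim = self.qkv_features // self.num_heads
        dense = lambda name, x: nn.DenseGeneral(
            features=(self.num_heads, head_dim), name=name)(x)
        q = dense('query', inputs_q)  # [batch, q_len,  heads, head_dim]
        k = dense('key',   inputs_k)  # [batch, kv_len, heads, head_dim]
        v = dense('value', inputs_v)  # [batch, kv_len, heads, head_dim]
        x = nn.dot_product_attention(q, k, v)
        # project the heads back to the input feature dimension
        return nn.DenseGeneral(features=inputs_q.shape[-1],
                               axis=(-2, -1), name='out')(x)

q_in = jnp.ones((2, 5, 32))
k_in = jnp.ones((2, 10, 32))
v_in = jnp.ones((2, 10, 32))  # free to differ from k_in
module = SplitKVAttention(num_heads=4, qkv_features=32)
params = module.init(jax.random.PRNGKey(0), q_in, k_in, v_in)
y = module.apply(params, q_in, k_in, v_in)  # [2, 5, 32]
```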
Also, the following comes to mind:
Given `attention_fn`, `MultiHeadAttention` would be the better name here, since contrary to the PyTorch and Keras implementations the user can use whatever form of attention they want.