Background
We need two sets of kernels for MLA:

head_dim_qk=192, head_dim_vo=128
head_dim_qk=576, head_dim_vo=512 (K=V)

and serving engines are expected to pick between them according to the use case:

o_1, lse_1 = cross_attention(c_q, q_pe, c_kv)
(c_q: (n, 128, 512), q_pe: (n, 128, 64), c_kv: (n_kv, 576), o_1: (n, 128, 512), lse_1: (n, 128))
o_2, lse_2 = self_attention(q, k, v_new)
(q: (n, 128, 192), k: (n, 128, 192), v_new: (n, 128, 128), o_2: (n, 128, 128), lse_2: (n, 128))
o, lse = merge(W_UV(o_1), lse_1, o_2, lse_2)

Note that the shapes above imply the 576-dim qk space of the second kernel is the 512-dim latent part (c_q, c_kv) plus the 64-dim RoPE part (q_pe), and that W_UV maps o_1 from the 512-dim latent space to the 128-dim per-head value space so it can be merged with o_2.
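
The merge step is the standard log-sum-exp (LSE) combination of two partial softmax-attention results computed over disjoint sets of keys. A minimal PyTorch sketch, assuming both partial outputs are already in the same per-head value space (i.e. W_UV has been applied to o_1) and using illustrative names rather than any particular engine's API:

```python
import torch

def merge(o_a, lse_a, o_b, lse_b):
    # o_a, o_b:     (n, num_heads, head_dim) partial attention outputs
    # lse_a, lse_b: (n, num_heads) log-sum-exp of the raw attention scores
    lse = torch.logaddexp(lse_a, lse_b)         # LSE over the union of keys
    w_a = torch.exp(lse_a - lse).unsqueeze(-1)  # softmax mass contributed by part a
    w_b = torch.exp(lse_b - lse).unsqueeze(-1)  # softmax mass contributed by part b
    return w_a * o_a + w_b * o_b, lse

# Hypothetical usage with the shapes above (n tokens, 128 heads, head_dim_vo=128):
# o, lse = merge(W_UV(o_1), lse_1, o_2, lse_2)
```

Because each kernel returns its LSE alongside its output, the two attention passes can use different head dimensions internally and the merged result is still exactly what a single softmax over all keys would have produced.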