[Question]: Is it possible to get BOTH the last_hidden_states and pooler_output of an embedding model like BERT? #11274

FdyCN · 2025-01-17T08:21:06Z

I found that llama.cpp can only return last_hidden_states with LLAMA_POOLING_TYPE_NONE, or pooler_output with LLAMA_POOLING_TYPE_CLS or LLAMA_POOLING_TYPE_MEAN. What if I want both of them? Is that possible?

Please help me. Thank you.

ggerganov · 2025-01-17T09:33:37Z

Maintainer

Programmatically, you can obtain any intermediate result from the computation using the eval callback. See the llama-eval-callback example for a demo.
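For readers who have not used the eval callback, here is a minimal sketch of the pattern, written against mock types so it compiles without llama.cpp. It assumes the callback has the shape `bool(tensor, ask, user_data)` and is called in two phases; `mock_tensor`, `capture_state`, `run_mock_graph`, and the tensor name `result_embd` are hypothetical stand-ins, so verify the real types and names against the llama-eval-callback example.

```cpp
#include <string>
#include <vector>

// Hypothetical stand-in for ggml_tensor: just a name and some values.
struct mock_tensor {
    std::string        name;
    std::vector<float> data;
};

// User data handed to the callback: which tensor to grab and where to put it.
struct capture_state {
    std::string        target;
    std::vector<float> captured;
};

// Assumed callback shape: called once with ask=true ("do you want this tensor?")
// and, if the answer was yes, again with ask=false once the data is ready.
static bool capture_cb(mock_tensor * t, bool ask, void * user_data) {
    auto * st = static_cast<capture_state *>(user_data);
    const bool wanted = (t->name == st->target);
    if (ask) {
        return wanted;        // skip every tensor we do not care about
    }
    st->captured = t->data;   // copy the values out
    return true;              // true = keep evaluating the graph
}

// Toy replacement for the engine's graph evaluation loop (not llama.cpp code).
static void run_mock_graph(std::vector<mock_tensor> & graph,
                           bool (*cb)(mock_tensor *, bool, void *),
                           void * user_data) {
    for (auto & t : graph) {
        if (cb(&t, true, user_data) && !cb(&t, false, user_data)) {
            break;            // the callback asked to stop
        }
    }
}
```

In real code the callback and its user data would be registered on the context parameters before the context is created, and the mock loop would be replaced by the actual decode call.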

4 replies

FdyCN Jan 20, 2025
Author

Thank you for the reply. I have used the callback to dump layer outputs for debugging before. But how can I dump just one specific layer's output? The only way I can think of is to check the tensor name in the callback, which feels like hardcoding. Is there a better way to use the callback?

Thank you

FdyCN Jan 20, 2025
Author

I reviewed the callback example. Here is another question: I saw that the callback and its user data are initialized through common_params. However, I want a runtime switch so that, for example, the i-th llama_decode call gets pooler_output and the (i+1)-th llama_decode call gets last_hidden_states. What I mean is that the callback is a static setting, but I need a dynamic setting that can be controlled at runtime. (BTW, always getting pooler_output and last_hidden_states together is OK, but more flexibility is better anyway.) @ggerganov

ggerganov Jan 20, 2025
Maintainer

Extracting the embeddings right before the language modelling head or the pooler seems like a relatively common practice in many applications, so we should extend the public API to do this in an easy way. Something like adding llama_get_embeddings_pre() and llama_get_embeddings_post() calls.

> But how can I dump just one specific layer's output?

The callback can tell the engine which tensors to return and which to skip. See how the return value works.

> However, I want a runtime switch so that, for example, the i-th llama_decode call gets pooler_output and the (i+1)-th llama_decode call gets last_hidden_states.

The callback can have any kind of dynamic logic that you want. The llama-eval-callback is a very basic example - more complex logic can be implemented.

> The only way I can think of is to check the tensor name in the callback, which feels like hardcoding. Is there a better way to use the callback?

Currently, there is no other way. Maybe you can make an initial pass where your callback will analyze the graph and the tensor names in it. And then subsequent calls would know which tensor to extract.
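The "initial pass" idea could look roughly like the sketch below: the first run only records tensor names and requests no data, a helper then picks a target from what was seen, and later runs extract just that tensor. Everything here is a mock under stated assumptions: `mock_tensor`, `probe_state`, `choose_target`, and `run_mock_graph` are hypothetical, and names such as `result_embd_pooled` should be confirmed by an actual discovery pass on your model.

```cpp
#include <string>
#include <vector>

// Hypothetical stand-in for ggml_tensor.
struct mock_tensor {
    std::string        name;
    std::vector<float> data;
};

struct probe_state {
    bool                     discovering = true; // first pass: collect names only
    std::vector<std::string> seen;               // tensor names in graph order
    std::string              target;             // chosen after discovery
    std::vector<float>       captured;
};

// Assumed callback shape: bool(tensor, ask, user_data).
static bool probe_cb(mock_tensor * t, bool ask, void * user_data) {
    auto * st = static_cast<probe_state *>(user_data);
    if (st->discovering) {
        if (ask) {
            st->seen.push_back(t->name);
        }
        return false;                  // names only, never request data
    }
    const bool wanted = (t->name == st->target);
    if (ask) {
        return wanted;
    }
    st->captured = t->data;
    return true;
}

// Pick the last seen tensor whose name contains `hint`, then leave discovery mode.
static bool choose_target(probe_state & st, const std::string & hint) {
    for (auto it = st.seen.rbegin(); it != st.seen.rend(); ++it) {
        if (it->find(hint) != std::string::npos) {
            st.target      = *it;
            st.discovering = false;
            return true;
        }
    }
    return false;
}

// Toy replacement for the engine's graph evaluation loop (not llama.cpp code).
static void run_mock_graph(std::vector<mock_tensor> & graph,
                           bool (*cb)(mock_tensor *, bool, void *),
                           void * user_data) {
    for (auto & t : graph) {
        if (cb(&t, true, user_data) && !cb(&t, false, user_data)) {
            break;
        }
    }
}
```

Matching on a substring still depends on tensor names, but the hardcoded part shrinks to one hint that is checked against the graph at startup instead of being assumed.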

Answer selected by FdyCN

FdyCN Jan 20, 2025
Author

Thank you for the reply. I added a switch to the callback user data so I can control whether all embeddings are dumped. That solves my problem for now. Thank you again! :)
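A switch of this kind might be sketched as follows. This is not the poster's actual code: `embd_switch`, `switch_cb`, and `mock_decode` are hypothetical, the graph is a mock, and the tensor names `result_embd` and `result_embd_pooled` are assumptions that should be checked against the real graph. The point is only that the flags live in the user data, so they can be flipped between decode calls without re-registering the callback.

```cpp
#include <string>
#include <vector>

// Hypothetical stand-in for ggml_tensor.
struct mock_tensor {
    std::string        name;
    std::vector<float> data;
};

// Assumed tensor names; verify them on the real graph.
static const char * const HIDDEN_NAME = "result_embd";
static const char * const POOLED_NAME = "result_embd_pooled";

// User data: flags that can be changed at runtime, plus output buffers.
struct embd_switch {
    bool               want_hidden = false;
    bool               want_pooled = false;
    std::vector<float> hidden;
    std::vector<float> pooled;
};

// Assumed callback shape: bool(tensor, ask, user_data).
static bool switch_cb(mock_tensor * t, bool ask, void * user_data) {
    auto * sw = static_cast<embd_switch *>(user_data);
    const bool is_hidden = sw->want_hidden && t->name == HIDDEN_NAME;
    const bool is_pooled = sw->want_pooled && t->name == POOLED_NAME;
    if (ask) {
        return is_hidden || is_pooled;   // request data only for enabled outputs
    }
    if (is_hidden) sw->hidden = t->data;
    if (is_pooled) sw->pooled = t->data;
    return true;
}

// Toy stand-in for one decode call: evaluates a fixed three-node mock graph.
static void mock_decode(void * user_data) {
    std::vector<mock_tensor> graph = {
        {"inp_embd",  {0.5f}},
        {HIDDEN_NAME, {1.0f, 2.0f}},
        {POOLED_NAME, {3.0f}},
    };
    for (auto & t : graph) {
        if (switch_cb(&t, true, user_data) && !switch_cb(&t, false, user_data)) {
            break;
        }
    }
}
```

Setting `want_pooled` before one call and `want_hidden` before the next gives the alternating behaviour asked about earlier in the thread, and setting both captures the two outputs in a single pass.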
