How to decrease time to generate first token? #297
Unanswered
VenkatLohithDasari asked this question in Q&A
Replies: 0 comments
I copied the code from example_ws.py to enable text streaming. It works well, but there is one big problem: it takes a long time to generate the first token, while the rest of the tokens come out fairly fast, around 15 t/s. Is there any way to fix this?
My chatbot works like this: it takes the user message, detects the intent of the message, then builds an appropriate prompt for that intent using f-strings. So the prompt always changes depending on context. I'm mentioning this in case it's relevant to why the first token is slow.
I'd like to know what generating the first token actually depends on. If there is some sort of conversion of the prompt into math before generation starts, maybe we could cache that result in a variable and only convert the newly appended part of the prompt? Is that possible? Sorry if I sound dumb, I'm not an AI programmer...
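The idea described above, reusing the work done for an unchanged prompt prefix and only running the model over newly appended tokens, is essentially what KV-cache (prefix) reuse does. Below is a minimal sketch of the bookkeeping involved; `PrefixCache` and `_prefill` are hypothetical names, not part of example_ws.py or any real backend, and `_prefill` merely stands in for the model's expensive prompt-processing pass.

```python
class PrefixCache:
    """Sketch of prefix caching: only tokens appended after the cached
    prompt prefix need a new (expensive) forward pass."""

    def __init__(self):
        self.tokens = []      # tokens already processed
        self.state = None     # stand-in for the model's KV cache
        self.processed = 0    # how many tokens we actually prefilled

    def _prefill(self, tokens, state=None):
        # Hypothetical stand-in for the model's prefill pass; a real
        # backend would run attention here and return the updated KV cache.
        self.processed += len(tokens)
        state = list(state) if state is not None else []
        state.extend(tokens)
        return state

    def process(self, tokens):
        # Length of the shared prefix between the new prompt and the cached one.
        n = 0
        while n < min(len(self.tokens), len(tokens)) and self.tokens[n] == tokens[n]:
            n += 1
        if n == len(self.tokens) and self.state is not None:
            # Cache hit: only the appended suffix needs a forward pass.
            self.state = self._prefill(tokens[n:], self.state)
        else:
            # Prompt diverged (e.g. the f-string prefix changed): the cached
            # state past the divergence point is invalid, so reprocess fully.
            self.state = self._prefill(tokens)
        self.tokens = list(tokens)
        return self.state
```

Note the caveat in the `else` branch: because the prompt here is rebuilt per intent with f-strings, any change near the start of the prompt invalidates the cache, so this only helps when the changing part (e.g. the latest user message) is appended at the end.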