I used 8x Nvidia H20 GPUs to run inference testing based on the sample code, and found that throughput was only about 1 token/s, with the GPUs not fully utilized. Is this in line with expectations?
Currently, vLLM and SGLang have not been adapted to MiniMax. Are there any other inference engines you would recommend?
Thank you for your feedback. We are currently planning to support our models on open-source inference frameworks. At the same time, we also welcome community developers to join us in advancing the support for our models in open-source inference engines.
I have the same issue. On an 8x H100 server, with a fixed input of 2048 tokens and an output of 240 tokens, generation takes 620 seconds at batch_size 1 and 765 seconds at batch_size 2. That is too slow. Which open-source inference frameworks will be supported?
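For reference, 240 output tokens in 620 seconds works out to roughly 0.4 tokens/s, which is consistent with the ~1 token/s report above. A minimal sketch of how such a tokens/s number can be measured from the Hugging Face sample path is shown below; the model id, dtype, and generation settings are assumptions, not the exact benchmark configuration used here.

```python
# Minimal throughput-measurement sketch (assumed model id and settings).
import time
import torch
from transformers import AutoModelForCausalLM, AutoTokenizer

model_name = "MiniMaxAI/MiniMax-Text-01"  # assumption, adjust to your checkpoint
tokenizer = AutoTokenizer.from_pretrained(model_name, trust_remote_code=True)
model = AutoModelForCausalLM.from_pretrained(
    model_name,
    torch_dtype=torch.bfloat16,
    device_map="auto",          # shard the model across all available GPUs
    trust_remote_code=True,
)

prompt = "Hello " * 2048        # stand-in for a ~2048-token input
inputs = tokenizer(prompt, return_tensors="pt",
                   truncation=True, max_length=2048).to(model.device)

start = time.time()
outputs = model.generate(**inputs, max_new_tokens=240, do_sample=False)
elapsed = time.time() - start

new_tokens = outputs.shape[1] - inputs["input_ids"].shape[1]
print(f"{new_tokens} tokens in {elapsed:.1f}s -> {new_tokens / elapsed:.2f} tokens/s")
```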
We have submitted PR #13454 to vLLM, and the performance improvement compared to the Hugging Face implementation is very significant. You might want to give it a try.
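A rough sketch of what trying the model through vLLM might look like once that PR is available in your build is below; the model id and tensor_parallel_size are assumptions, so adjust them to your checkpoint and GPU count.

```python
# Sketch of running inference through vLLM (assumed model id and parallelism).
from vllm import LLM, SamplingParams

llm = LLM(
    model="MiniMaxAI/MiniMax-Text-01",  # assumption, adjust to your checkpoint
    tensor_parallel_size=8,             # e.g. 8x H100 or H20
    trust_remote_code=True,
)
params = SamplingParams(temperature=0.0, max_tokens=240)

outputs = llm.generate(["Explain the benefits of paged attention."], params)
print(outputs[0].outputs[0].text)
```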
Very cool model!