This repo contains worker code that you can deploy as a Docker container and use on Runpod Serverless. It uses vLLM under the hood to run inference on a given model. It supports a wide range of LLMs, including Llama2, Mistral, Falcon, StarCoder, BLOOM, and many more! (Check out all supported models here.)
- 🌟 How to use
- 🏗️ build docker image (Optional)
- 🚀 deploy to Runpod Serverless
- 📦 Request Body
- 🔗 Environment Variables
- 🚀 GPU Type Guide
- 📝 License
- 📚 References
- 🙏 Thanks
- Clone this repository
- Build the Docker image
- Push the Docker image to your Docker registry
- Deploy to Runpod Serverless
Build the image with the following command:

```bash
docker build -t <your docker registry>/<your docker image name>:<your docker image tag> .
```
Push the image to your Docker registry with the following command:

```bash
docker push <your docker registry>/<your docker image name>:<your docker image tag>
```
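For example, assuming your registry is a Docker Hub account named `yourname` and you tag the image `runpod-worker-vllm:latest` (both placeholders), the two commands would be:

```bash
# Build the worker image from the repository root (registry, name, and tag are placeholders)
docker build -t yourname/runpod-worker-vllm:latest .

# Push it so Runpod can pull it when your endpoint scales up
docker push yourname/runpod-worker-vllm:latest
```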
Or you can use the `lobstrate/runpod-worker-vllm` image from Docker Hub.
Once your Docker image is on a registry, you can deploy it to Runpod Serverless. Here is a step-by-step guide. We will set up a network volume so that the model can be downloaded from the Hugging Face Hub into the network volume: when the endpoint receives its first request, it downloads the model from the Hugging Face Hub into the network volume, and on subsequent requests it uses the model from the network volume for inference. (Even if the worker scales down to 0, the model persists in the network volume.)
You need to create a network volume that the worker will use to download your LLM model from the Hugging Face Hub. You can create a network volume from the Runpod UI:
- Click on Storage in the Runpod sidebar, under the Serverless tab.
- Click on the `+ Network Volume` button.
- Select a datacenter region closest to your users.
- Give a name to your network volume.
- Select a size for your network volume.
- Click on the `Create` button.
Note: To get a rough estimate of how much storage you need, check your model's size on https://huggingface.co. Click on the Files and versions tab and add up how much storage is needed to store all the files.
After creating a network volume, you need to create a template for your worker to use. For this:
- Click on Custom Templates in the Runpod sidebar, under the Serverless tab.
- Click on the `New Template` button.
- Give a name to your template.
- Enter your Docker image name in the `Container Image` field. This is the same image you pushed to your Docker registry in the previous step. (Or you can enter the `lobstrate/runpod-worker-vllm:latest` image from Docker Hub.)
- Select the container disk size. (This doesn't matter much, as we are using the network volume for model storage.)
- [IMPORTANT] Enter the environment variables for your model. `MODEL_NAME` is required; it is used to download your model from the Hugging Face Hub. (Refer to the Environment Variables section for more details, and see the example right after this list.)
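For example, if the model should be pulled straight from the Hugging Face Hub, the only variable you strictly need on the template is `MODEL_NAME` (the value below is purely illustrative):

```bash
# Minimal template configuration: MODEL_NAME is the only required variable (illustrative value)
MODEL_NAME=mistralai/Mistral-7B-Instruct-v0.1
```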
After creating a template, you need to create an endpoint for your worker to use. For this:
- Click on Endpoints in the Runpod sidebar, under the Serverless tab.
- Click on the `New Endpoint` button.
- Give a name to your endpoint.
- Select the template you created in the previous step.
- Select a GPU type. You can follow the GPU Type Guide below to select the right GPU type for your model.
- Enter the active and max worker counts.
- Check the fast boot option.
- Select the network volume you created in the previous step.
- Click on the `Create` button.
After creating an endpoint, you can test it out inside the Runpod UI. For this:
- Click on the Requests tab on your endpoint page.
- Click on the `Run` button.

You can also modify your request body. Check out the Request Body section for more details.
This is the request body you can send to your endpoint:
```json
{
  "input": {
    "prompt": "Say, Hello World!",
    "max_tokens": 50,
    // other params...
  }
}
```
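Once the endpoint is up, you can also send this body from outside the Runpod UI. Below is a minimal `curl` sketch, assuming Runpod's standard synchronous `/runsync` route and an API key from your Runpod account; the endpoint ID is a placeholder, so copy the exact URL shown on your endpoint page:

```bash
# Synchronous request to the endpoint (endpoint ID and API key are placeholders)
curl -X POST "https://api.runpod.ai/v2/<endpoint_id>/runsync" \
  -H "Authorization: Bearer <your_runpod_api_key>" \
  -H "Content-Type: application/json" \
  -d '{"input": {"prompt": "Say, Hello World!", "max_tokens": 50}}'
```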
All the params you can send to your endpoint are listed here (a combined example follows the list):

- `prompt`: The prompt you want to generate from.
- `max_tokens`: Maximum number of tokens to generate per output sequence.
- `n`: Number of output sequences to return for the given prompt.
- `best_of`: Number of output sequences that are generated from the prompt. From these `best_of` sequences, the top `n` sequences are returned. `best_of` must be greater than or equal to `n`. This is treated as the beam width when `use_beam_search` is True. By default, `best_of` is set to `n`.
- `presence_penalty`: Float that penalizes new tokens based on whether they appear in the generated text so far. Values > 0 encourage the model to use new tokens, while values < 0 encourage the model to repeat tokens.
- `frequency_penalty`: Float that penalizes new tokens based on their frequency in the generated text so far. Values > 0 encourage the model to use new tokens, while values < 0 encourage the model to repeat tokens.
- `repetition_penalty`: Float that penalizes new tokens based on whether they appear in the generated text so far. Values > 1 encourage the model to use new tokens, while values < 1 encourage the model to repeat tokens.
- `temperature`: Float that controls the randomness of the sampling. Lower values make the model more deterministic, while higher values make the model more random. Zero means greedy sampling.
- `top_p`: Float that controls the cumulative probability of the top tokens to consider. Must be in (0, 1]. Set to 1 to consider all tokens.
- `top_k`: Integer that controls the number of top tokens to consider. Set to -1 to consider all tokens.
- `use_beam_search`: Whether to use beam search instead of sampling.
- `length_penalty`: Float that penalizes sequences based on their length. Used in beam search.
- `early_stopping`: Controls the stopping condition for beam search. It accepts the following values: `True`, where the generation stops as soon as there are `best_of` complete candidates; `False`, where a heuristic is applied and the generation stops when it is very unlikely to find better candidates; `"never"`, where the beam search procedure only stops when there cannot be better candidates (canonical beam search algorithm).
- `stop`: List of strings that stop the generation when they are generated. The returned output will not contain the stop strings.
- `stop_token_ids`: List of tokens that stop the generation when they are generated. The returned output will contain the stop tokens unless the stop tokens are special tokens.
- `ignore_eos`: Whether to ignore the EOS token and continue generating tokens after the EOS token is generated.
- `logprobs`: Number of log probabilities to return per output token. Note that the implementation follows the OpenAI API: the returned result includes the log probabilities on the `logprobs` most likely tokens, as well as the chosen tokens. The API will always return the log probability of the sampled token, so there may be up to `logprobs + 1` elements in the response.
- `prompt_logprobs`: Number of log probabilities to return per prompt token.
- `skip_special_tokens`: Whether to skip special tokens in the output.
- `spaces_between_special_tokens`: Whether to add spaces between special tokens in the output. Defaults to True.
- `logits_processors`: List of functions that modify logits based on previously generated tokens.
These are the environment variables you can define on your Runpod template:

| key | value | optional |
|---|---|---|
| MODEL_NAME | your model name | false |
| HF_HOME | /runpod-volume | true |
| HUGGING_FACE_HUB_TOKEN | your huggingface token | true |
| MODEL_REVISION | your model revision | true |
| MODEL_BASE_PATH | your model base path | true |
| TOKENIZER | your tokenizer | true |
Note: You can get your huggingface token from https://huggingface.co/settings/token
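For example, a template that serves a gated model from the Hugging Face Hub and keeps the Hub cache on the network volume might define the following variables (the model name and token are placeholders):

```bash
# Example template environment variables (values are placeholders)
MODEL_NAME=meta-llama/Llama-2-7b-chat-hf   # required: Hugging Face repo to download
MODEL_REVISION=main                        # optional: pin a specific model revision
HF_HOME=/runpod-volume                     # optional: keep the Hugging Face cache on the network volume
HUGGING_FACE_HUB_TOKEN=hf_xxxxxxxxxxxx     # optional: needed for gated or private models
```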
Here is a rough estimate of how much VRAM you need for your model. You can use this table to select the right GPU type for your model.
| Model Parameters | Storage & VRAM |
|---|---|
| 7B | 6 GB |
| 13B | 9 GB |
| 33B | 19 GB |
| 65B | 35 GB |
| 70B | 38 GB |
This project is licensed under the MIT License - see the LICENSE file for details
Special thanks to @Jorghi12 and @ashleykleynhans for helping out with this project.