Sagemaker support for inference #5163

Tarun3679 · 2023-08-25T16:08:34Z

Tarun3679
Aug 25, 2023

Hi, I am trying to test the throughout using vLLMs while inference. I am using amazon sagemaker. My typical notebook example is this one - https://github.com/huggingface/notebooks/blob/5ef609e9078e6248d73f28106e60ddafa9359db1/sagemaker/24_train_bloom_peft_lora/sagemaker-notebook.ipynb . Are there any resources which I can use as reference to deploy an endpoint using Vllm on sagemaker?

lanking520 · 2023-08-25T21:40:11Z

lanking520
Aug 25, 2023

There are ways to do this. Current DeepJavaLibrary support your use case

docker pull deepjavalibrary/djl-serving:deepspeed-nightly

Using this container with

serving.properties

engine=Python
option.model_id=openlm-research/open_llama_7b_v2
option.trust_remote_code=true
option.tensor_parallel_degree=max
option.rolling_batch=vllm
option.max_rolling_batch_size=32

requirements.txt

vllm==0.1.3
pandas

will work for your case. Tested with G5 and P4D instances.

https://github.com/deepjavalibrary/djl-demo/blob/master/aws/sagemaker/large-model-inference/BYOC_template_with_LMI_solution.ipynb

0 replies

Tarun3679 · 2023-08-29T18:58:24Z

Tarun3679
Aug 29, 2023
Author

Are we supposed to mention anything in the model.py?

0 replies

mariokostelac · 2023-10-16T19:11:16Z

mariokostelac
Oct 16, 2023

From what I see, https://docs.djl.ai/docs/demos/aws/sagemaker/large-model-inference/sample-llm/vllm_deploy_llama_13b.html contains the tutorial for doing that.

0 replies

ronyadgar · 2023-12-27T09:29:50Z

ronyadgar
Dec 27, 2023

Is there any options to use vLLM on the model.py file ? When I try this I got

RuntimeError: ProcessGroupNCCL is only supported with GPUs, no GPUs found!

%%writefile models2/model.py
from djl_python import Input, Output
from vllm import LLM
# Check whether CUDA (thus Nvidia GPU) is avaiable

# Define model and tokenizer function variable
client = None

# Model loader function
def load_model():
  global client
  client = LLM("mistralai/Mistral-7B-Instruct-v0.2", trust_remote_code=True, tensor_parallel_size=8)

# Handler function
def handle(input: Input):
  print('handler called', flush=True)

  # Check if input is empty
  if input.is_empty():
      return None

  input = input.get_as_json()
  print("ron input", input)
  input_prompt = str(input.get('prompt', ''))

  if len(input_prompt) < 1:
     return None

  # Load the model
  if client is None:
     load_model()

  output_words = client.generate(payloads, sampling_params=sampling_params)

  # Send result to output
  output = Output()
  output.add(output_words)

  return output

0 replies

AnirudhJadhav-toddleapp · 2024-03-05T04:28:46Z

AnirudhJadhav-toddleapp
Mar 5, 2024

There are ways to do this. Current DeepJavaLibrary support your use case
docker pull deepjavalibrary/djl-serving:deepspeed-nightly
Using this container with

serving.properties
engine=Python
option.model_id=openlm-research/open_llama_7b_v2
option.trust_remote_code=true
option.tensor_parallel_degree=max
option.rolling_batch=vllm
option.max_rolling_batch_size=32
requirements.txt
vllm==0.1.3
pandas
will work for your case. Tested with G5 and P4D instances.

https://github.com/deepjavalibrary/djl-demo/blob/master/aws/sagemaker/large-model-inference/BYOC_template_with_LMI_solution.ipynb

hey @lanking520 , https://github.com/deepjavalibrary/djl-demo/blob/master/aws/sagemaker/large-model-inference/sample-llm/vllm_rollingbatch_deploy_customized_processing.ipynb
i used this library to deploy but it did not work. followed the same steps but i got the following error.
"Error hosting endpoint lmi-model-2024-03-04-10-30-51-503: Failed. Reason: The primary container for production variant AllTraffic did not pass the ping health check. Please check CloudWatch logs for this endpoint. "
there were other errors about session.wait_for_endpoint and session.create_endpoint.

0 replies

siddvenk · 2024-06-14T15:31:30Z

siddvenk
Jun 14, 2024

@AnirudhJadhav-toddleapp

The sample that Qing has shared is quite old at this point. I recommend that you follow our guide here https://github.com/deepjavalibrary/djl-serving/blob/master/serving/docs/lmi/deployment_guide/deploying-your-endpoint.md#option-2-configuration---environment-variables. Replace HF_MODEL_ID with the huggingface model id you are trying deploy.

If you have a custom model, or artifacts stored in s3, we have some details on using sagemaker's support for uncompressed model artifacts here https://github.com/deepjavalibrary/djl-serving/blob/master/serving/docs/lmi/deployment_guide/deploying-your-endpoint.md#option-2-configuration---environment-variables.

Hope this helps.

0 replies

geraldstanje · 2024-07-23T01:44:22Z

geraldstanje
Jul 23, 2024

how can this be run without djl-serving?

can you run vllm/vllm-openai:latest container on aws sagemaker? if not what needs to be changed to make it work?

1 reply

JianyuZhan Aug 14, 2024

Yes, check my vllm-on-sagemaker: https://github.com/JianyuZhan/vllm-on-sagemaker; it runs latest vllm on sagemaker, with a thin fastapi server

vidhig · 2024-11-19T16:44:42Z

vidhig
Nov 19, 2024

When a model is deployed as a sagemaker endpoint using DLJ+vLLM, is it deployed as an openai compatible server or is it following offline inference within the endpoint?

0 replies

Provide feedback

Saved searches

Use saved searches to filter your results more quickly

Sagemaker support for inference #5163

{{title}}

Replies: 8 comments 1 reply

{{title}}

{{title}}

{{title}}

{{title}}

{{title}}

{{title}}

{{title}}

{{title}}

{{editor}}'s edit

{{editor}}'s edit

{{title}}

Select a reply

Sagemaker support for inference #5163

Replies: 8 comments · 1 reply

Tarun3679 Aug 29, 2023 Author

Replies: 8 comments 1 reply

Tarun3679
Aug 29, 2023
Author