I see your guide for running locally; can you give me an example of running via an API? #1

Open
Dang-Tu-lang opened this issue Sep 10, 2024 · 1 comment

Comments

@Dang-Tu-lang

Hello, the algorithm you provided is quite novel, but I want to run it through an API. Can you give me an example? I tried to modify the code, but most of the time running against the API gives me a connection error.


l-i-p-f commented Nov 1, 2024

Hi, I suggest you take a look at the generate function in the IO_System class to see how to use gpt-3.5-turbo. The author provides this example as a guide for using the OpenAI API.
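
For reference, a chat-completions call with the openai Python client (v1+) looks roughly like the sketch below. The parameters mirror what generate passes through (max_tokens, num_return, stop tokens, temperature); the prompt and values here are placeholders, and the repo's own helper may wrap the call differently.

from openai import OpenAI

client = OpenAI()  # reads OPENAI_API_KEY from the environment

response = client.chat.completions.create(
    model="gpt-3.5-turbo",
    messages=[{"role": "user", "content": "Hello"}],  # placeholder prompt
    max_tokens=256,
    n=2,                 # analogous to num_return
    stop=["\n\n"],
    temperature=0.8,
)
outputs = [choice.message.content for choice in response.choices]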

Alternatively, you can refer to my solution below. I have deployed my own vLLM service on the server. I tried it, and it works well. This setup allows me to experiment with different models, and it's faster.

Notice:

1. Be sure to add the required arguments in arguments.py (a minimal sketch follows this list).

2. Verification required: I switched from running the Qwen2.5-7B-Instruct model locally to calling a Qwen2.5-72B-Instruct API, which showed a 68% speed improvement in my test. However, it also resulted in half the number of total calls and seemed to process significantly fewer tokens. I've only run this code briefly, so I may check later to confirm whether these findings are due to the different models or my modifications.

Update on point 2: I had previously missed the num_return parameter. After adding it back to the API request, the total time cost and token consumption returned to normal.
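
Below is a rough sketch of the extra arguments mentioned in point 1, meant to be appended to the existing parser in arguments.py. The flag names simply follow the attributes the fixed generate() reads (args.api, args.api_url, args.api_model_name); the defaults, including the localhost URL, are illustrative and depend on how you launched your vLLM server. If arguments.py already defines --api, it may only need "vllm_api" added as an allowed choice.

# hypothetical additions to arguments.py; `parser` is the repo's existing ArgumentParser
parser.add_argument("--api", type=str, default="vllm_api",
                    help="backend selector; 'vllm_api' enables the HTTP branch in generate()")
parser.add_argument("--api_url", type=str,
                    default="http://localhost:8000/v1/chat/completions",  # assumed endpoint of the vLLM OpenAI-compatible server
                    help="chat-completions endpoint of the deployed vLLM service")
parser.add_argument("--api_model_name", type=str, default="Qwen/Qwen2.5-72B-Instruct",
                    help="model name the vLLM server was launched with")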

Fixed version:

import requests


class IO_System:
    """Input/Output system"""

    def __init__(self, args, tokenizer, model) -> None:
        # ... former code here
        # added: endpoint and model name of the OpenAI-compatible vLLM server
        self.api_url = args.api_url
        self.model_name = args.api_model_name

    def generate(self, model_input, max_tokens: int, num_return: int, stop_tokens):
        io_output_list = []

        if self.api == "vllm_api":
            # treat a single prompt like a batch of one
            is_single_input = isinstance(model_input, str)
            inputs = [model_input] if is_single_input else model_input

            for _input in inputs:
                params = {
                    "model": self.model_name,
                    "stream": False,
                    "temperature": self.temperature,
                    "top_k": self.top_k,
                    "top_p": self.top_p,
                    "max_tokens": max_tokens,
                    "stop": stop_tokens,
                    "n": num_return,  # request num_return samples per prompt
                    "messages": [{"role": "user", "content": _input}],
                }
                try:
                    vllm_response = requests.post(self.api_url, json=params).json()
                    # keep every returned sample, not just choices[0]
                    outputs = [choice["message"]["content"] for choice in vllm_response["choices"]]
                    token_count = vllm_response["usage"]["completion_tokens"]
                except Exception as e:
                    raise RuntimeError(f"API Requests Error: {e}") from e

                io_output_list.append(outputs)
                self.call_counter += 1
                self.token_counter += token_count

            # a single string prompt returns a flat list of num_return outputs;
            # a list of prompts returns one such list per prompt
            return io_output_list[0] if is_single_input else io_output_list
        # ... former code here
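
A hypothetical smoke test of the vllm_api branch (the prompt, stop tokens, and io instance are made up for illustration; io is an IO_System constructed with args.api == "vllm_api" and the arguments sketched above):

# expect num_return samples back for a single string prompt
outputs = io.generate(
    "What is 17 * 24? Let's think step by step.",
    max_tokens=256,
    num_return=4,
    stop_tokens=["\n\n"],
)
print(len(outputs))  # 4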
