integrate aiter #516
base: main
Conversation
Can you please direct the PR to the upstream vllm (https://github.com/vllm-project/vllm.git) instead of rocm/vllm ?
In your upstream PR, please use
In the upstream PR, performance gain values and lm_eval results have to be attached alongside the PR description.
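For reference, a minimal sketch of how such lm_eval numbers could be collected (this assumes the lm-evaluation-harness vLLM backend; the task name, batch size, and `<model_path>` placeholder are illustrative, not taken from this PR):

```shell
# Hypothetical example: evaluate the same checkpoint with lm-evaluation-harness.
# <model_path> is a placeholder for the local model directory; task choice is arbitrary.
lm_eval --model vllm \
  --model_args pretrained=<model_path>,tensor_parallel_size=8,dtype=float16 \
  --tasks gsm8k \
  --batch_size auto
```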
@fsx950223 FYI, upstream refers to https://github.com/vllm-project/vllm , and may I know which AITER version you are using?
8d167e698fb5ecf54d1315e2cae0da6c6a2746b5
Is it from a branch? I couldn't find the commit on the main branch of AITER.
Please direct your PRs to the upstream vllm (https://github.com/vllm-project/vllm.git).
Accepting PRs into the ROCm fork (https://github.com/ROCm/vllm) will require a clear, previously communicated exception.
This only works with prebuilt AITER: PREBUILD_KERNELS=1 python setup.py develop.
VLLM_ROCM_USE_AITER=1 VLLM_USE_V1=1 vllm serve /models/models--amd--Llama-3.1-405B-Instruct-FP8-KV/snapshots/2505537398e7cfda52f6d666f315c03db8e4697c/ --tensor-parallel-size 8 --gpu-memory-utilization 0.9 --trust-remote-code --disable-log-requests --block-size 128 --max-model-len 32768 --dtype float16 --quantization fp8 --no-enable-prefix-caching
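Once the server is up, a quick sanity check against the OpenAI-compatible endpoint might look like the following (this assumes the default port 8000; the prompt and token count are arbitrary):

```shell
# Hypothetical smoke test: vllm serve exposes an OpenAI-compatible API,
# by default on http://localhost:8000. The "model" field should match the
# path passed to `vllm serve` above.
curl http://localhost:8000/v1/completions \
  -H "Content-Type: application/json" \
  -d '{
        "model": "/models/models--amd--Llama-3.1-405B-Instruct-FP8-KV/snapshots/2505537398e7cfda52f6d666f315c03db8e4697c/",
        "prompt": "Hello",
        "max_tokens": 16
      }'
```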