Launch error at 4090 for sageattn_qk_int8_pv_fp8_cuda #61
Comments
The L40S hits the same error.
Another question: for the "sm90" architecture (e.g. H100), why do you dispatch to the sageattn_qk_int8_pv_fp16_cuda kernel? It also has powerful fp8 2D capability.
Sorry, I cannot reproduce this error.
Thank you for your kind reply. The CUDA version is 12.6, and my code is below. Please give me a hand.
    import os
    os.environ["CUDA_DEVICE_ORDER"] = "PCI_BUS_ID"
    import time
    import torch
    F.scaled_dot_product_attention = sageattn_qk_int8_pv_fp8_cuda
    prompt = "A panda, dressed in a small, red jacket and a tiny hat, sits on a wooden stool in a serene bamboo forest. The panda's fluffy paws strum a miniature acoustic guitar, producing soft, melodic tunes. Nearby, a few other pandas gather, watching curiously and some clapping in rhythm. Sunlight filters through the tall bamboo, casting a gentle glow on the scene. The panda's face is expressive, showing concentration and joy as it plays. The background includes a small, flowing stream and vibrant green foliage, enhancing the peaceful and magical atmosphere of this unique musical performance."
    pipe = CogVideoXPipeline.from_pretrained(
    pipe.vae.enable_slicing()
    start = time.time()
    export_to_video(video, "output/output.mp4", fps=8)
I suspect this is the problem mentioned in #50. Please run the code on device 0 again. We will merge this PR as soon as possible.
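For anyone following along, here is a minimal sketch of one way to follow the "run on device 0" suggestion. It assumes the environment variables are set before torch is imported. The helper name `pin_to_device` is hypothetical and not part of SageAttention.

```python
import os


def pin_to_device(index: int = 0) -> dict:
    """Expose a single physical GPU to this process.

    Hypothetical helper for illustration. Call it before importing torch,
    because CUDA reads these variables when it initializes.
    """
    # Number devices the same way nvidia-smi does (by PCI bus position).
    os.environ["CUDA_DEVICE_ORDER"] = "PCI_BUS_ID"
    # Hide every GPU except the requested one; it then shows up as cuda:0.
    os.environ["CUDA_VISIBLE_DEVICES"] = str(index)
    return {
        key: os.environ[key]
        for key in ("CUDA_DEVICE_ORDER", "CUDA_VISIBLE_DEVICES")
    }


settings = pin_to_device(0)
print(settings)
```

With only one device visible, the script cannot accidentally launch a kernel on a GPU of a different architecture.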
@Andy0422 On H100, the mma instruction for fp8 performs poorly, so we use the fp16 kernel, which is more accurate and runs at roughly the same speed. We are working on an H100 kernel that uses wgmma, which can offer a real speedup.
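The dispatch described here can be sketched as a lookup on compute capability. This is only an illustration, not the library's actual dispatch code: the function `pick_kernel_name` is hypothetical, and the sm89 to fp8 mapping is an assumption based on the 4090 and L40S reports in this thread. Only the two kernel names come from the discussion.

```python
from typing import Tuple

# Kernel names as they appear in this thread.
FP8_KERNEL = "sageattn_qk_int8_pv_fp8_cuda"
FP16_KERNEL = "sageattn_qk_int8_pv_fp16_cuda"


def pick_kernel_name(capability: Tuple[int, int]) -> str:
    """Map a (major, minor) compute capability to a kernel name.

    Hypothetical helper: covers only the two architectures discussed here.
    """
    if capability == (8, 9):
        # sm89 (e.g. RTX 4090, L40S): assumed to use the fp8 PV kernel.
        return FP8_KERNEL
    if capability == (9, 0):
        # sm90 (e.g. H100): fp16 PV kernel, since fp8 via mma is slow there.
        return FP16_KERNEL
    raise ValueError(f"no mapping sketched for capability {capability}")


print(pick_kernel_name((8, 9)))
print(pick_kernel_name((9, 0)))
```

In a real script the tuple would typically come from `torch.cuda.get_device_capability(device)`.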
I see. On H100 it runs now, but very slowly.
@jason-huang03 Fri Dec 6 11:58:12 2024
We will look into making our compilation script compatible with machines that have more than one type of GPU.
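As a rough sketch of what such a build would need: collect the compute capability of every GPU in the machine and build for all of them rather than only the first. The helper `arch_list` below is hypothetical. It formats the result the way PyTorch's `TORCH_CUDA_ARCH_LIST` variable expects; whether this project's compilation script honors that variable is an assumption.

```python
from typing import Iterable, Tuple


def arch_list(capabilities: Iterable[Tuple[int, int]]) -> str:
    """Turn per-GPU (major, minor) capabilities into a string like '8.9;9.0'.

    Hypothetical helper: deduplicates and sorts so that a machine with
    mixed GPUs gets one entry per distinct architecture.
    """
    unique = sorted(set(capabilities))
    return ";".join(f"{major}.{minor}" for major, minor in unique)


# Example: a machine with two sm89 cards and one sm90 card.
print(arch_list([(9, 0), (8, 9), (8, 9)]))
```

In practice the input could be gathered with `torch.cuda.get_device_capability(i)` for each `i` in `range(torch.cuda.device_count())`.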
Hi,
I believe the 4090 supports fp8 2D, so why do I get the following error? Thanks.
Exception has occurred: RuntimeError
CUDA error: unspecified launch failure
CUDA kernel errors might be asynchronously reported at some other API call, so the stacktrace below might be incorrect.
For debugging consider passing CUDA_LAUNCH_BLOCKING=1.
Compile with `TORCH_USE_CUDA_DSA` to enable device-side assertions.
File "/home/wei.zhao/SageAttention/example/sageattn_cogvideo.py", line 27, in
video = pipe(
RuntimeError: CUDA error: unspecified launch failure
CUDA kernel errors might be asynchronously reported at some other API call, so the stacktrace below might be incorrect.
For debugging consider passing CUDA_LAUNCH_BLOCKING=1.
Compile with `TORCH_USE_CUDA_DSA` to enable device-side assertions.
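The error message suggests CUDA_LAUNCH_BLOCKING=1. A minimal sketch of enabling it from inside the script, assuming it runs before torch is imported (the helper name `enable_blocking_launches` is hypothetical):

```python
import os


def enable_blocking_launches() -> str:
    """Make CUDA kernel launches synchronous for debugging.

    Hypothetical helper. With this set, an "unspecified launch failure"
    is reported at the call that caused it, not at a later API call.
    Expect slower execution, so use it only while debugging.
    """
    os.environ["CUDA_LAUNCH_BLOCKING"] = "1"
    return os.environ["CUDA_LAUNCH_BLOCKING"]


value = enable_blocking_launches()
print("CUDA_LAUNCH_BLOCKING =", value)
```

The same effect can be had from the shell by prefixing the command with `CUDA_LAUNCH_BLOCKING=1`.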