
Launch error at 4090 for sageattn_qk_int8_pv_fp8_cuda #61

Open
Andy0422 opened this issue Dec 5, 2024 · 10 comments
Labels
bug Something isn't working

Comments


Andy0422 commented Dec 5, 2024

Hi,

I believe the 4090 supports FP8 matrix multiplication, so why does the following error occur? Thanks.

```
Exception has occurred: RuntimeError
CUDA error: unspecified launch failure
CUDA kernel errors might be asynchronously reported at some other API call, so the stacktrace below might be incorrect.
For debugging consider passing CUDA_LAUNCH_BLOCKING=1.
Compile with TORCH_USE_CUDA_DSA to enable device-side assertions.
  File "/home/wei.zhao/SageAttention/example/sageattn_cogvideo.py", line 27, in <module>
    video = pipe(
RuntimeError: CUDA error: unspecified launch failure
CUDA kernel errors might be asynchronously reported at some other API call, so the stacktrace below might be incorrect.
For debugging consider passing CUDA_LAUNCH_BLOCKING=1.
Compile with TORCH_USE_CUDA_DSA to enable device-side assertions.
```

Author

Andy0422 commented Dec 5, 2024

The L40S hits the same error.

Author

Andy0422 commented Dec 5, 2024

Another question: for the arch "sm90" (e.g. H100), why do you dispatch it to the sageattn_qk_int8_pv_fp16_cuda kernel? It also has powerful FP8 capability.

Member

jt-zhang commented Dec 6, 2024

Sorry, I cannot reproduce this error.
By the way, if you want to use FP8, please make sure your CUDA version is >= 12.4.
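A quick way to verify this is to compare the toolkit version PyTorch reports against 12.4. A minimal helper (my sketch, not part of SageAttention), assuming a dotted `major.minor` version string:

```python
def cuda_supports_fp8(version: str) -> bool:
    """True if a CUDA version string such as '12.6' is at least 12.4."""
    major, minor = (int(p) for p in version.split(".")[:2])
    return (major, minor) >= (12, 4)
```

You would call it as `cuda_supports_fp8(torch.version.cuda)` on the machine in question.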

Author

Andy0422 commented Dec 6, 2024

> Sorry, I can not reproduce this error. By the way, if you want to use FP8, please ensure that the CUDA VERSION is >= 12.4.

Thank you for your kind reply. My CUDA version is 12.6, and my code is below; please give me a hand:

```python
import os

os.environ["CUDA_DEVICE_ORDER"] = "PCI_BUS_ID"
os.environ["CUDA_VISIBLE_DEVICES"] = "3"

import time

import torch
from diffusers import CogVideoXPipeline
from diffusers.utils import export_to_video, export_to_gif
from sageattention import sageattn, sageattn_qk_int8_pv_fp16_triton, sageattn_qk_int8_pv_fp16_cuda, sageattn_qk_int8_pv_fp8_cuda
import torch.nn.functional as F

F.scaled_dot_product_attention = sageattn_qk_int8_pv_fp8_cuda

prompt = "A panda, dressed in a small, red jacket and a tiny hat, sits on a wooden stool in a serene bamboo forest. The panda's fluffy paws strum a miniature acoustic guitar, producing soft, melodic tunes. Nearby, a few other pandas gather, watching curiously and some clapping in rhythm. Sunlight filters through the tall bamboo, casting a gentle glow on the scene. The panda's face is expressive, showing concentration and joy as it plays. The background includes a small, flowing stream and vibrant green foliage, enhancing the peaceful and magical atmosphere of this unique musical performance."

pipe = CogVideoXPipeline.from_pretrained(
    "/home/dataset/SD/cogvideox-2b",
    torch_dtype=torch.float16
).to("cuda")

pipe.vae.enable_slicing()
pipe.vae.enable_tiling()

start = time.time()
video = pipe(
    prompt=prompt,
    num_videos_per_prompt=1,
    num_inference_steps=50,
    num_frames=49,
    guidance_scale=6,
    generator=torch.Generator(device="cuda").manual_seed(42),
).frames[0]
print("sage attn timing = ", time.time() - start)

export_to_video(video, "output/output.mp4", fps=8)
export_to_gif(video, "output/output.gif", fps=8)
```

```
+-----------------------------------------------------------------------------------------+
| NVIDIA-SMI 560.35.03              Driver Version: 560.35.03      CUDA Version: 12.6     |
|-----------------------------------------+------------------------+----------------------+
| GPU  Name                 Persistence-M | Bus-Id          Disp.A | Volatile Uncorr. ECC |
| Fan  Temp   Perf          Pwr:Usage/Cap |           Memory-Usage | GPU-Util  Compute M. |
|                                         |                        |               MIG M. |
|=========================================+========================+======================|
|   0  NVIDIA A100-PCIE-40GB          On  |   00000000:3C:00.0 Off |                    0 |
| N/A   29C    P0             69W /  250W |    6917MiB /  40960MiB |     11%      Default |
|                                         |                        |             Disabled |
+-----------------------------------------+------------------------+----------------------+
|   1  NVIDIA A100-PCIE-40GB          On  |   00000000:40:00.0 Off |                    0 |
| N/A   28C    P0             62W /  250W |   21005MiB /  40960MiB |      0%      Default |
|                                         |                        |             Disabled |
+-----------------------------------------+------------------------+----------------------+
|   2  NVIDIA H100 PCIe               On  |   00000000:CC:00.0 Off |                    0 |
| N/A   32C    P0             83W /  350W |   11544MiB /  81559MiB |      0%      Default |
|                                         |                        |             Disabled |
```

@jason-huang03
Member

I suppose it is the problem mentioned in #50. Please run the code on device 0 again. We will merge that PR as soon as possible.

Member

jason-huang03 commented Dec 6, 2024

@Andy0422
Also, your environment contains two different GPU types. Our compilation script may not support this configuration at present. Perhaps the FP8 kernel was not compiled because of the A100, so when the code runs on the H100 there is an error.
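If that is the case, a quick diagnostic is to check whether the installed build actually exposes the FP8 kernel. A hypothetical check (this helper is not part of SageAttention):

```python
import importlib

def kernel_available(name: str, module: str = "sageattention") -> bool:
    """Return True if `module` can be imported and exposes the symbol `name`."""
    try:
        mod = importlib.import_module(module)
    except ImportError:
        return False
    return hasattr(mod, name)

# e.g. kernel_available("sageattn_qk_int8_pv_fp8_cuda")
```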

@jason-huang03
Member

@Andy0422 On H100 the FP8 mma instruction has poor performance, so we use FP16, which is more accurate and has roughly the same speed. We are working on an H100 kernel that uses wgmma, which can offer a real speedup.
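As a sketch of the dispatch described above, keyed on CUDA compute capability (the kernel names are taken from this thread; the actual selection logic inside SageAttention may differ):

```python
def pick_kernel(capability):
    """Map a (major, minor) CUDA compute capability to a kernel name,
    following the rationale in this thread: Ada (sm89) has fast FP8 MMA,
    while Hopper (sm90) and Ampere fall back to the FP16 CUDA kernel."""
    major, minor = capability
    if (major, minor) == (8, 9):   # RTX 4090 / L40S (Ada)
        return "sageattn_qk_int8_pv_fp8_cuda"
    if major >= 8:                 # H100 (FP8 mma is slow here) and A100/A10
        return "sageattn_qk_int8_pv_fp16_cuda"
    return "sageattn_qk_int8_pv_fp16_triton"
```

You would feed it `torch.cuda.get_device_capability()` for the device in use.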

Author

Andy0422 commented Dec 6, 2024

> I suppose it is the problem mentioned in #50 . Please run the code on device 0 again. We will merge this PR as soon as possible.

I see. On the H100 it runs now, but very slowly.

Author

Andy0422 commented Dec 6, 2024

@jason-huang03
Another problem when running on the 4090:
```
Traceback (most recent call last):
  File "/home/wei.zhao/SageAttention/example/sageattn_cogvideo.py", line 27, in <module>
    video = pipe(
  File "/usr/local/lib/python3.10/dist-packages/torch/utils/_contextlib.py", line 115, in decorate_context
    return func(*args, **kwargs)
  File "/usr/local/lib/python3.10/dist-packages/diffusers/pipelines/cogvideo/pipeline_cogvideox.py", line 684, in __call__
    noise_pred = self.transformer(
  File "/usr/local/lib/python3.10/dist-packages/torch/nn/modules/module.py", line 1511, in _wrapped_call_impl
    return self._call_impl(*args, **kwargs)
  File "/usr/local/lib/python3.10/dist-packages/torch/nn/modules/module.py", line 1520, in _call_impl
    return forward_call(*args, **kwargs)
  File "/usr/local/lib/python3.10/dist-packages/diffusers/models/transformers/cogvideox_transformer_3d.py", line 473, in forward
    hidden_states, encoder_hidden_states = block(
  File "/usr/local/lib/python3.10/dist-packages/torch/nn/modules/module.py", line 1511, in _wrapped_call_impl
    return self._call_impl(*args, **kwargs)
  File "/usr/local/lib/python3.10/dist-packages/torch/nn/modules/module.py", line 1520, in _call_impl
    return forward_call(*args, **kwargs)
  File "/usr/local/lib/python3.10/dist-packages/diffusers/models/transformers/cogvideox_transformer_3d.py", line 132, in forward
    attn_hidden_states, attn_encoder_hidden_states = self.attn1(
  File "/usr/local/lib/python3.10/dist-packages/torch/nn/modules/module.py", line 1511, in _wrapped_call_impl
    return self._call_impl(*args, **kwargs)
  File "/usr/local/lib/python3.10/dist-packages/torch/nn/modules/module.py", line 1520, in _call_impl
    return forward_call(*args, **kwargs)
  File "/usr/local/lib/python3.10/dist-packages/diffusers/models/attention_processor.py", line 495, in forward
    return self.processor(
  File "/usr/local/lib/python3.10/dist-packages/diffusers/models/attention_processor.py", line 1954, in __call__
    hidden_states = hidden_states.transpose(1, 2).reshape(batch_size, -1, attn.heads * head_dim)
RuntimeError: CUDA error: unspecified launch failure
CUDA kernel errors might be asynchronously reported at some other API call, so the stacktrace below might be incorrect.
For debugging consider passing CUDA_LAUNCH_BLOCKING=1.
Compile with TORCH_USE_CUDA_DSA to enable device-side assertions.
```

```
Fri Dec  6 11:58:12 2024
+-----------------------------------------------------------------------------------------+
| NVIDIA-SMI 560.35.03              Driver Version: 560.35.03      CUDA Version: 12.6     |
|-----------------------------------------+------------------------+----------------------+
| GPU  Name                 Persistence-M | Bus-Id          Disp.A | Volatile Uncorr. ECC |
| Fan  Temp   Perf          Pwr:Usage/Cap |           Memory-Usage | GPU-Util  Compute M. |
|                                         |                        |               MIG M. |
|=========================================+========================+======================|
|   0  NVIDIA L40S                    On  |   00000000:67:00.0 Off |                    0 |
| N/A   25C    P8             32W /  350W |       1MiB /  46068MiB |      0%      Default |
|                                         |                        |                  N/A |
+-----------------------------------------+------------------------+----------------------+
|   1  NVIDIA A10                     On  |   00000000:E5:00.0 Off |                    0 |
|  0%   26C    P8             15W /  150W |       1MiB /  23028MiB |      0%      Default |
|                                         |                        |                  N/A |
+-----------------------------------------+------------------------+----------------------+
|   2  NVIDIA A10                     On  |   00000000:E6:00.0 Off |                    0 |
|  0%   26C    P8             15W /  150W |       1MiB /  23028MiB |      0%      Default |
|                                         |                        |                  N/A |
+-----------------------------------------+------------------------+----------------------+
|   3  NVIDIA GeForce RTX 4090        On  |   00000000:E8:00.0 Off |                    0 |
| 45%   26C    P8             22W /  450W |       1MiB /  23028MiB |      0%      Default |
|                                         |                        |                  N/A |
+-----------------------------------------+------------------------+----------------------+
```

@jason-huang03
Member

We will look into making our compilation script compatible with machines that mix GPU types.

@jason-huang03 jason-huang03 added the bug Something isn't working label Dec 11, 2024