Group offloading introduces bad results

### Describe the bug

I am using group offloading to save GPU memory, but it degrades the results. On an A800 the cosine similarity is about 0.934, which is unexpected. On a 4090 it is about 0.78, which is even worse.
Could anyone suggest how to fix the precision?
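
For anyone trying to reproduce the numbers, here is a minimal sketch of the metric. It assumes the similarity is computed between the flattened output of a run without group offloading and a run with it; the report does not state this, so treat it as an assumption. The values in `reference` and `offloaded` are hypothetical placeholders, not real model outputs. With PyTorch tensors, `torch.nn.functional.cosine_similarity` on the flattened tensors should give the same quantity.

```python
import math
from typing import Sequence


def cosine_similarity(a: Sequence[float], b: Sequence[float]) -> float:
    """Cosine similarity of two equally sized, flattened outputs."""
    if len(a) != len(b):
        raise ValueError("inputs must have the same number of elements")
    # fsum keeps the accumulation error small for long vectors.
    dot = math.fsum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(math.fsum(x * x for x in a))
    norm_b = math.sqrt(math.fsum(y * y for y in b))
    if norm_a == 0.0 or norm_b == 0.0:
        raise ValueError("cosine similarity is undefined for a zero vector")
    return dot / (norm_a * norm_b)


# Hypothetical values standing in for flattened model outputs.
reference = [0.5, -1.0, 2.0, 0.25]  # assumed: run without group offloading
offloaded = [0.5, -1.0, 2.0, 0.25]  # assumed: run with group offloading
score = cosine_similarity(reference, offloaded)
```

Identical outputs give a score of 1 up to rounding, so values such as 0.934 or 0.78 indicate that the two runs differ substantially rather than by floating-point noise.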

### Reproduction

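```python
import torch
from diffusers.hooks import apply_group_offloading

# `transformer` and `self.local_rank` are defined by the surrounding pipeline code.
apply_group_offloading(
    transformer,
    onload_device=torch.device(f"cuda:{self.local_rank}"),
    offload_device=torch.device("cpu"),
    offload_type="block_level",
    num_blocks_per_group=1,  # offload one block at a time
    non_blocking=True,  # asynchronous host/device copies
    use_stream=True,  # overlap transfers with compute on a separate CUDA stream
)
```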

### Logs

```shell

```

### System Info

I tried diffusers 0.33.1 and 0.34.

### Who can help?

_No response_

Group offloading introduces bad results #11981
