add diffusers nexfort example (#998)
- [x] sd1.5
- [x] sdxl
- [x] sd2

---------

Co-authored-by: Li Junliang <[email protected]>
marigoold and lijunliangTG authored Jul 12, 2024
1 parent f498be2 commit 5aeb01f
Showing 7 changed files with 331 additions and 0 deletions.
9 changes: 9 additions & 0 deletions benchmarks/text_to_image.py
@@ -62,6 +62,7 @@ def parse_args():
parser.add_argument("--input-image", type=str, default=INPUT_IMAGE)
parser.add_argument("--control-image", type=str, default=CONTROL_IMAGE)
parser.add_argument("--output-image", type=str, default=OUTPUT_IMAGE)
parser.add_argument("--print-output", action="store_true")
parser.add_argument("--throughput", action="store_true")
parser.add_argument("--deepcache", action="store_true")
parser.add_argument(
@@ -384,6 +385,14 @@ def get_kwarg_inputs():
print(f"Max used CUDA memory : {cuda_mem_after_used:.3f}GiB")
print("=======================================")

if args.print_output:
    from onediff.utils.import_utils import is_nexfort_available
    if is_nexfort_available():
        from nexfort.utils.term_image import print_image

        for image in output_images:
            print_image(image, max_width=80)

if args.output_image is not None:
    output_images[0].save(args.output_image)
else:
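The hunk above guards an optional dependency: the terminal image printer is imported only when nexfort is available, so the benchmark still runs without it. A minimal sketch of that pattern follows, using only the standard library. The helper names `load_optional` and `show_images` are hypothetical, and this is not how `is_nexfort_available` is actually implemented.

```python
import importlib


def load_optional(module_name, attr):
    """Return `module_name.attr` if the module can be imported, else None."""
    try:
        module = importlib.import_module(module_name)
    except ImportError:
        # Covers ModuleNotFoundError too: the optional package is missing.
        return None
    return getattr(module, attr, None)


def show_images(images, printer):
    """Print each image with `printer`; do nothing when no printer is available."""
    if printer is None:
        return 0
    for image in images:
        printer(image, max_width=80)
    return len(images)


# Assumption: the module path and function name are taken from the diff above.
print_image = load_optional("nexfort.utils.term_image", "print_image")
show_images([], print_image)
```

With this shape, a missing nexfort install silently skips the preview instead of raising at import time.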
Binary file added imgs/nexfort_sd1-5_demo.png
Binary file added imgs/nexfort_sd2_demo.png
Binary file added imgs/nexfort_sdxl_demo.png
108 changes: 108 additions & 0 deletions onediff_diffusers_extensions/examples/sd1.5/README.md
@@ -0,0 +1,108 @@
# Run SD1.5 with nexfort backend (Beta Release)

1. [Environment Setup](#environment-setup)
   - [Set Up OneDiff](#set-up-onediff)
   - [Set Up NexFort Backend](#set-up-nexfort-backend)
   - [Set Up Diffusers Library](#set-up-diffusers)
   - [Set Up SD1.5](#set-up-sd15)
2. [Execution Instructions](#run)
   - [Run Without Compilation (Baseline)](#run-without-compilation-baseline)
   - [Run With Compilation](#run-with-compilation)
3. [Performance Comparison](#performance-comparison)
4. [Dynamic Shape for SD1.5](#dynamic-shape-for-sd15)
5. [Quality](#quality)

## Environment setup
### Set up onediff
https://github.com/siliconflow/onediff?tab=readme-ov-file#installation

### Set up nexfort backend
https://github.com/siliconflow/onediff/tree/main/src/onediff/infer_compiler/backends/nexfort

### Set up diffusers

```shell
pip3 install --upgrade diffusers[torch]
```
### Set up SD1.5
Model version for diffusers: https://huggingface.co/runwayml/stable-diffusion-v1-5

HF pipeline: https://github.com/huggingface/diffusers/blob/main/docs/source/en/api/pipelines/stable_diffusion/overview.md

## Run

### Run without compilation (Baseline)
```shell
python3 benchmarks/text_to_image.py \
--model runwayml/stable-diffusion-v1-5 \
--height 512 --width 512 \
--scheduler none \
--steps 20 \
--output-image ./stable-diffusion-v1-5.png \
--prompt "beautiful scenery nature glass bottle landscape, , purple galaxy bottle," \
--compiler none \
--seed 1 \
--print-output
```

### Run with compilation

```shell
python3 benchmarks/text_to_image.py \
--model runwayml/stable-diffusion-v1-5 \
--height 512 --width 512 \
--scheduler none \
--steps 20 \
--output-image ./stable-diffusion-v1-5-compile.png \
--prompt "beautiful scenery nature glass bottle landscape, , purple galaxy bottle," \
--compiler nexfort \
--compiler-config '{"mode": "cudagraphs:benchmark:max-autotune:low-precision:cache-all", "memory_format": "channels_last", "options": {"inductor.optimize_linear_epilogue": false, "overrides.conv_benchmark": true, "overrides.matmul_allow_tf32": true}}' \
--seed 1 \
--print-output
```
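The `--compiler-config` value is a JSON object passed as a single shell argument, so quoting mistakes are easy to make. As a minimal sketch (not part of the benchmark script), the string can be generated from a Python dict. The helper name `build_compiler_config` is hypothetical; the keys and values are copied from the command above.

```python
import json
import shlex


def build_compiler_config(mode_flags, options, memory_format="channels_last"):
    """Serialize a compiler config to a JSON string.

    Hypothetical helper: it only mirrors the JSON layout used in the
    command above and does not validate the option names.
    """
    config = {
        "mode": ":".join(mode_flags),
        "memory_format": memory_format,
        "options": options,
    }
    return json.dumps(config)


config_json = build_compiler_config(
    ["cudagraphs", "benchmark", "max-autotune", "low-precision", "cache-all"],
    {
        "inductor.optimize_linear_epilogue": False,
        "overrides.conv_benchmark": True,
        "overrides.matmul_allow_tf32": True,
    },
)

# shlex.quote wraps the JSON so the shell passes it as one argument.
print("--compiler-config", shlex.quote(config_json))
```

Python's `False`/`True` are serialized as JSON `false`/`true`, which matches the literal string in the shell command.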

## Performance comparison

Testing on NVIDIA GeForce RTX 3090 / 4090, with image size of 512*512, iterating 20 steps:

| Metric | RTX3090, 512*512 | RTX4090, 512*512 |
| ------------------------------------ | --------------------- | --------------------- |
| Data update date (yyyy-mm-dd) | 2024-07-10 | 2024-07-10 |
| PyTorch iteration speed | 21.20 it/s | 34.46 it/s |
| OneDiff iteration speed | 48.00 it/s (+126.4%) | 81.81 it/s (+137.4%) |
| PyTorch E2E time | 1.07 s | 0.67 s |
| OneDiff E2E time | 0.48 s (-55.1%) | 0.28 s (-58.2%) |
| PyTorch Max Mem Used | 2.627 GiB | 2.616 GiB |
| OneDiff Max Mem Used | 2.587 GiB | 2.709 GiB |
| PyTorch Warmup with Run time | | |
| OneDiff Warmup with Compilation time | 233.61 s <sup>1</sup> | 177.321 s <sup>2</sup> |
| OneDiff Warmup with Cache time       | 41.120 s              | 30.019 s              |

<sup>1</sup> OneDiff warmup with compilation time was measured on an Intel(R) Xeon(R) Silver 4314 CPU @ 2.40GHz. It is for reference only and varies considerably across CPUs.

<sup>2</sup> Measured on an AMD EPYC 7543 32-Core Processor.
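The percentages in the table are relative changes against the PyTorch baseline. A short sketch of the arithmetic, with the function name `relative_change` chosen here for illustration and the inputs copied from the RTX 3090 column:

```python
def relative_change(baseline: float, optimized: float) -> float:
    """Return the change from baseline to optimized in percent, rounded to one decimal."""
    return round((optimized - baseline) / baseline * 100, 1)


# RTX 3090, 512*512 values from the table above.
speed_gain = relative_change(21.20, 48.00)  # iteration speed, it/s
time_cut = relative_change(1.07, 0.48)      # E2E time, s

print(f"iteration speed: {speed_gain:+.1f}%")
print(f"E2E time: {time_cut:+.1f}%")
```

A positive value means the compiled run is higher than the baseline (good for it/s), a negative value means it is lower (good for E2E time).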

## Dynamic shape for SD1.5

<!-- TODO -->

Run:

```shell
python3 benchmarks/text_to_image.py \
--model runwayml/stable-diffusion-v1-5 \
--height 512 --width 512 \
--scheduler none \
--steps 20 \
--output-image ./stable-diffusion-v1-5-compile.png \
--prompt "beautiful scenery nature glass bottle landscape, , purple galaxy bottle," \
--compiler nexfort \
--compiler-config '{"mode": "cudagraphs:max-autotune:low-precision:cache-all", "memory_format": "channels_last", "options": {"inductor.optimize_linear_epilogue": false, "overrides.conv_benchmark": true, "overrides.matmul_allow_tf32": true}, "dynamic": true}' \
--run_multiple_resolutions 1
```
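Compared with the "Run with compilation" command, the dynamic-shape config drops `benchmark` from the mode string and adds `"dynamic": true`. The sketch below derives one config from the other so the two stay in sync. The helper name `to_dynamic` is hypothetical and the values are copied from the commands in this README.

```python
import copy
import json

# Config from the "Run with compilation" command.
static_config = {
    "mode": "cudagraphs:benchmark:max-autotune:low-precision:cache-all",
    "memory_format": "channels_last",
    "options": {
        "inductor.optimize_linear_epilogue": False,
        "overrides.conv_benchmark": True,
        "overrides.matmul_allow_tf32": True,
    },
}


def to_dynamic(config):
    """Return a copy of `config` without the benchmark flag and with dynamic enabled."""
    dynamic_config = copy.deepcopy(config)
    flags = [flag for flag in config["mode"].split(":") if flag != "benchmark"]
    dynamic_config["mode"] = ":".join(flags)
    dynamic_config["dynamic"] = True
    return dynamic_config


dynamic_config = to_dynamic(static_config)
print(json.dumps(dynamic_config))
```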

## Quality
With nexfort as the OneDiff compilation backend, the generated images are lossless compared with the uncompiled baseline.

<p align="center">
<img src="../../../imgs/nexfort_sd1-5_demo.png">
</p>
105 changes: 105 additions & 0 deletions onediff_diffusers_extensions/examples/sd2/README.md
@@ -0,0 +1,105 @@
# Run SD2 with nexfort backend (Beta Release)

1. [Environment Setup](#environment-setup)
   - [Set Up OneDiff](#set-up-onediff)
   - [Set Up NexFort Backend](#set-up-nexfort-backend)
   - [Set Up Diffusers Library](#set-up-diffusers)
   - [Set Up SD2](#set-up-sd2)
2. [Execution Instructions](#run)
   - [Run Without Compilation (Baseline)](#run-without-compilation-baseline)
   - [Run With Compilation](#run-with-compilation)
3. [Performance Comparison](#performance-comparison)
4. [Dynamic Shape for SD2](#dynamic-shape-for-sd2)
5. [Quality](#quality)

## Environment setup
### Set up onediff
https://github.com/siliconflow/onediff?tab=readme-ov-file#installation

### Set up nexfort backend
https://github.com/siliconflow/onediff/tree/main/src/onediff/infer_compiler/backends/nexfort

### Set up diffusers

```shell
pip3 install --upgrade diffusers[torch]
```
### Set up SD2
Model version for diffusers: https://huggingface.co/stabilityai/stable-diffusion-2

HF pipeline: https://github.com/huggingface/diffusers/blob/main/docs/source/en/api/pipelines/stable_diffusion/stable_diffusion_2.md

## Run

### Run without compilation (Baseline)
```shell
python3 benchmarks/text_to_image.py \
--model stabilityai/stable-diffusion-2-1 \
--height 768 --width 768 \
--scheduler none \
--steps 20 \
--output-image ./stable-diffusion-2-1.png \
--prompt "beautiful scenery nature glass bottle landscape, , purple galaxy bottle," \
--compiler none \
--print-output
```

### Run with compilation

```shell
python3 benchmarks/text_to_image.py \
--model stabilityai/stable-diffusion-2-1 \
--height 768 --width 768 \
--scheduler none \
--steps 20 \
--output-image ./stable-diffusion-2-1-compile.png \
--prompt "beautiful scenery nature glass bottle landscape, , purple galaxy bottle," \
--compiler nexfort \
--compiler-config '{"mode": "cudagraphs:benchmark:max-autotune:low-precision:cache-all", "memory_format": "channels_last", "options": {"triton.fuse_attention_allow_fp16_reduction": false, "inductor.optimize_linear_epilogue": false, "overrides.conv_benchmark": true, "overrides.matmul_allow_tf32": true}}' \
--print-output
```
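The baseline and compiled commands above share most of their arguments and differ only in `--compiler`, `--compiler-config`, and the output path. As a sketch for scripting the comparison, both argument lists can be built from one place. The helper `build_command` is hypothetical; the flags and values are copied from the commands above, and nothing is executed here.

```python
import json
import sys

# Arguments shared by the baseline and the compiled run.
COMMON_ARGS = [
    "benchmarks/text_to_image.py",
    "--model", "stabilityai/stable-diffusion-2-1",
    "--height", "768", "--width", "768",
    "--scheduler", "none",
    "--steps", "20",
    "--prompt", "beautiful scenery nature glass bottle landscape, , purple galaxy bottle,",
    "--print-output",
]

NEXFORT_CONFIG = {
    "mode": "cudagraphs:benchmark:max-autotune:low-precision:cache-all",
    "memory_format": "channels_last",
    "options": {
        "triton.fuse_attention_allow_fp16_reduction": False,
        "inductor.optimize_linear_epilogue": False,
        "overrides.conv_benchmark": True,
        "overrides.matmul_allow_tf32": True,
    },
}


def build_command(compiler, output_image, compiler_config=None):
    """Return the argv list for one benchmark run."""
    cmd = [sys.executable, *COMMON_ARGS,
           "--compiler", compiler,
           "--output-image", output_image]
    if compiler_config is not None:
        cmd += ["--compiler-config", json.dumps(compiler_config)]
    return cmd


baseline_cmd = build_command("none", "./stable-diffusion-2-1.png")
compiled_cmd = build_command("nexfort", "./stable-diffusion-2-1-compile.png", NEXFORT_CONFIG)
print(len(baseline_cmd), len(compiled_cmd))
```

Each list can then be passed to `subprocess.run(cmd, check=True)` from the repository root. Passing a list avoids shell quoting of the JSON argument entirely.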

## Performance comparison

Testing on NVIDIA GeForce RTX 3090 / 4090, with image size of 768\*768 and 512\*512, iterating 20 steps:

| Metric | RTX3090, 768*768 | RTX3090, 512*512 | RTX4090, 768*768 | RTX4090, 512*512 |
| ------------------------------------ | -------------------- | -------------------- | --------------------- | --------------------- |
| Data update date (yyyy-mm-dd) | 2024-07-10 | 2024-07-10 | 2024-07-10 | 2024-07-10 |
| PyTorch iteration speed | 10.45 it/s | 22.84 it/s | 12.34 it/s | 39.06 it/s |
| OneDiff iteration speed | 15.93 it/s (+52.4%) | 44.84 it/s (+96.3%) | 31.63 it/s (+156.3%) | 83.63 it/s (+114.1%) |
| PyTorch E2E time                     | 2.10 s               | 0.97 s               | 1.78 s                | 0.58 s                |
| OneDiff E2E time                     | 1.35 s (-35.7%)      | 0.49 s (-49.5%)      | 0.68 s (-61.8%)       | 0.26 s (-55.2%)       |
| PyTorch Max Mem Used | 3.767 GiB | 3.025 GiB | 3.767 GiB | 3.024 GiB |
| OneDiff Max Mem Used | 3.558 GiB | 3.018 GiB | 3.567 GiB | 3.016 GiB |
| PyTorch Warmup with Run time | | | | |
| OneDiff Warmup with Compilation time | 301.54 s <sup>1</sup> | 222.18 s <sup>1</sup> | 195.34 s <sup>2</sup> | 165.29 s <sup>1</sup> |
| OneDiff Warmup with Cache time | 113.04 s | 44.94 s | 32.41 s | 30.10 s |

<sup>1</sup> OneDiff warmup with compilation time was measured on an Intel(R) Xeon(R) Silver 4314 CPU @ 2.40GHz. It is for reference only and varies considerably across CPUs.

<sup>2</sup> Measured on an AMD EPYC 7543 32-Core Processor.

## Dynamic shape for SD2

Run:

```shell
python3 benchmarks/text_to_image.py \
--model stabilityai/stable-diffusion-2-1 \
--height 768 --width 768 \
--scheduler none \
--steps 20 \
--output-image ./stable-diffusion-2-1-compile.png \
--prompt "beautiful scenery nature glass bottle landscape, , purple galaxy bottle," \
--compiler nexfort \
--compiler-config '{"mode": "cudagraphs:max-autotune:low-precision:cache-all", "memory_format": "channels_last", "options": {"inductor.optimize_linear_epilogue": false, "overrides.conv_benchmark": true, "overrides.matmul_allow_tf32": true}, "dynamic": true}' \
--run_multiple_resolutions 1
```

## Quality
With nexfort as the OneDiff compilation backend, the generated images are lossless compared with the uncompiled baseline.

<p align="center">
<img src="../../../imgs/nexfort_sd2_demo.png">
</p>
109 changes: 109 additions & 0 deletions onediff_diffusers_extensions/examples/sdxl/README.md
@@ -0,0 +1,109 @@
# Run SDXL with nexfort backend (Beta Release)

1. [Environment Setup](#environment-setup)
   - [Set Up OneDiff](#set-up-onediff)
   - [Set Up NexFort Backend](#set-up-nexfort-backend)
   - [Set Up Diffusers Library](#set-up-diffusers)
   - [Set Up SDXL](#set-up-sdxl)
2. [Execution Instructions](#run)
   - [Run Without Compilation (Baseline)](#run-without-compilation-baseline)
   - [Run With Compilation](#run-with-compilation)
3. [Performance Comparison](#performance-comparison)
4. [Dynamic Shape for SDXL](#dynamic-shape-for-sdxl)
5. [Quality](#quality)

## Environment setup
### Set up onediff
https://github.com/siliconflow/onediff?tab=readme-ov-file#installation

### Set up nexfort backend
https://github.com/siliconflow/onediff/tree/main/src/onediff/infer_compiler/backends/nexfort

### Set up diffusers

```shell
pip3 install --upgrade diffusers[torch]
```
### Set up SDXL
Model version for diffusers: https://huggingface.co/stabilityai/stable-diffusion-xl-base-1.0

HF pipeline: https://github.com/huggingface/diffusers/blob/main/docs/source/en/api/pipelines/stable_diffusion/stable_diffusion_xl.md

## Run

### Run without compilation (Baseline)
```shell
python3 benchmarks/text_to_image.py \
--model stabilityai/stable-diffusion-xl-base-1.0 \
--height 1024 --width 1024 \
--scheduler none \
--steps 20 \
--output-image ./stable-diffusion-xl.png \
--prompt "beautiful scenery nature glass bottle landscape, , purple galaxy bottle," \
--compiler none \
--variant fp16 \
--seed 1 \
--print-output
```

### Run with compilation

```shell
python3 benchmarks/text_to_image.py \
--model stabilityai/stable-diffusion-xl-base-1.0 \
--height 1024 --width 1024 \
--scheduler none \
--steps 20 \
--output-image ./stable-diffusion-xl-compile.png \
--prompt "beautiful scenery nature glass bottle landscape, , purple galaxy bottle," \
--compiler nexfort \
--compiler-config '{"mode": "benchmark:cudagraphs:max-autotune:low-precision:cache-all", "memory_format": "channels_last", "options": {"inductor.optimize_linear_epilogue": false, "overrides.conv_benchmark": true, "overrides.matmul_allow_tf32": true}}' \
--variant fp16 \
--seed 1 \
--print-output
```

## Performance comparison

Testing on NVIDIA GeForce RTX 3090 / 4090, with image size of 1024*1024, iterating 20 steps:

| Metric | RTX 3090 1024*1024 | RTX 4090 1024*1024 |
| ------------------------------------ | --------------------- | --------------------- |
| Data update date (yyyy-mm-dd) | 2024-07-10 | 2024-07-10 |
| PyTorch iteration speed | 4.08 it/s | 6.93 it/s |
| OneDiff iteration speed | 7.21 it/s (+76.7%) | 13.92 it/s (+100.9%) |
| PyTorch E2E time | 5.60 s | 3.23 s |
| OneDiff E2E time | 3.41 s (-39.1%) | 1.67 s (-48.3%) |
| PyTorch Max Mem Used | 10.467 GiB | 10.467 GiB |
| OneDiff Max Mem Used | 12.004 GiB | 12.021 GiB |
| PyTorch Warmup with Run time | | |
| OneDiff Warmup with Compilation time | 474.36 s <sup>1</sup> | 236.54 s <sup>2</sup> |
| OneDiff Warmup with Cache time | 306.84 s | 104.57 s |

<sup>1</sup> OneDiff warmup with compilation time was measured on an Intel(R) Xeon(R) Silver 4314 CPU @ 2.40GHz. It is for reference only and varies considerably across CPUs.

<sup>2</sup> Measured on an AMD EPYC 7543 32-Core Processor.
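Because warmup takes minutes while each run saves only seconds, it can help to estimate how many images are needed before compilation pays off. The sketch below is a rough estimate only: it assumes each E2E time corresponds to one generated image and treats the baseline warmup as zero, since the table leaves "PyTorch Warmup with Run time" blank. The function name `break_even_images` is hypothetical.

```python
import math


def break_even_images(warmup_s: float, baseline_e2e_s: float, compiled_e2e_s: float) -> int:
    """Return the smallest number of images whose saved time covers the warmup."""
    saved_per_image = baseline_e2e_s - compiled_e2e_s
    if saved_per_image <= 0:
        raise ValueError("compiled run must be faster than the baseline")
    return math.ceil(warmup_s / saved_per_image)


# RTX 3090, 1024*1024 values from the table above.
cold = break_even_images(474.36, 5.60, 3.41)  # warmup with compilation
warm = break_even_images(306.84, 5.60, 3.41)  # warmup with cache

print(f"break-even after {cold} images (cold), {warm} images (cached)")
```

For short-lived jobs that generate only a few images, the uncompiled baseline may therefore finish sooner overall.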


## Dynamic shape for SDXL

Run:

```shell
python3 benchmarks/text_to_image.py \
--model stabilityai/stable-diffusion-xl-base-1.0 \
--height 1024 --width 1024 \
--scheduler none \
--steps 20 \
--output-image ./stable-diffusion-xl-compile.png \
--prompt "beautiful scenery nature glass bottle landscape, , purple galaxy bottle," \
--compiler nexfort \
--compiler-config '{"mode": "cudagraphs:max-autotune:low-precision:cache-all", "memory_format": "channels_last", "options": {"inductor.optimize_linear_epilogue": false, "overrides.conv_benchmark": true, "overrides.matmul_allow_tf32": true}, "dynamic": true}' \
--run_multiple_resolutions 1
```

## Quality
With nexfort as the OneDiff compilation backend, the generated images are lossless compared with the uncompiled baseline.

<p align="center">
<img src="../../../imgs/nexfort_sdxl_demo.png">
</p>
