add diffusers nexfort example (#998)

- [x] sd1.5
- [x] sdxl
- [x] sd2

---------

Co-authored-by: Li Junliang <[email protected]>
siliconflow · Jul 12, 2024 · 5aeb01f
1 parent f498be2
commit 5aeb01f

Showing 7 changed files with 331 additions and 0 deletions.
diff --git a/benchmarks/text_to_image.py b/benchmarks/text_to_image.py
@@ -62,6 +62,7 @@ def parse_args():
     parser.add_argument("--input-image", type=str, default=INPUT_IMAGE)
     parser.add_argument("--control-image", type=str, default=CONTROL_IMAGE)
     parser.add_argument("--output-image", type=str, default=OUTPUT_IMAGE)
+    parser.add_argument("--print-output", action="store_true")
     parser.add_argument("--throughput", action="store_true")
     parser.add_argument("--deepcache", action="store_true")
     parser.add_argument(
@@ -384,6 +385,14 @@ def get_kwarg_inputs():
     print(f"Max used CUDA memory : {cuda_mem_after_used:.3f}GiB")
     print("=======================================")
 
+    if args.print_output:
+        from onediff.utils.import_utils import is_nexfort_available
+        if is_nexfort_available():
+            from nexfort.utils.term_image import print_image
+
+            for image in output_images:
+                print_image(image, max_width=80)
+
     if args.output_image is not None:
         output_images[0].save(args.output_image)
     else:
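
With this hunk, `--print-output` does nothing when nexfort is not installed, because the preview is guarded by an availability check. The guard pattern can be sketched with the standard library alone. This is a minimal sketch under assumptions: `module_available` and `preview_images` are hypothetical names, and `importlib.util.find_spec` stands in for onediff's `is_nexfort_available`, whose implementation is not shown in this diff.

```python
import importlib.util


def module_available(name: str) -> bool:
    """Return True if a top-level module can be found, without importing it."""
    try:
        return importlib.util.find_spec(name) is not None
    except (ImportError, ValueError):
        # find_spec can raise for broken or half-initialized packages.
        return False


def preview_images(images, max_width: int = 80) -> bool:
    """Print images to the terminal if the optional renderer is installed.

    Returns False (and prints nothing) when nexfort is missing, mirroring
    the silent skip in the hunk above.
    """
    if not module_available("nexfort"):
        return False
    # Imported lazily so the script still runs without nexfort.
    from nexfort.utils.term_image import print_image

    for image in images:
        print_image(image, max_width=max_width)
    return True


if __name__ == "__main__":
    print("json available:", module_available("json"))
    print("nexfort available:", module_available("nexfort"))
```

Returning a boolean instead of skipping silently lets a caller warn the user that the flag had no effect.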

diff --git a/imgs/nexfort_sd1-5_demo.png b/imgs/nexfort_sd1-5_demo.png
diff --git a/imgs/nexfort_sd2_demo.png b/imgs/nexfort_sd2_demo.png
diff --git a/imgs/nexfort_sdxl_demo.png b/imgs/nexfort_sdxl_demo.png
diff --git a/onediff_diffusers_extensions/examples/sd1.5/README.md b/onediff_diffusers_extensions/examples/sd1.5/README.md
@@ -0,0 +1,108 @@
+# Run SD1.5 with nexfort backend (Beta Release)
+
+1. [Environment Setup](#environment-setup)
+   - [Set Up OneDiff](#set-up-onediff)
+   - [Set Up NexFort Backend](#set-up-nexfort-backend)
+   - [Set Up Diffusers Library](#set-up-diffusers)
+   - [Set Up SD1.5](#set-up-sd15)
+2. [Execution Instructions](#run)
+   - [Run Without Compilation (Baseline)](#run-without-compilation-baseline)
+   - [Run With Compilation](#run-with-compilation)
+3. [Performance Comparison](#performance-comparison)
+4. [Dynamic Shape for SD1.5](#dynamic-shape-for-sd15)
+5. [Quality](#quality)
+
+## Environment setup
+### Set up onediff
+https://github.com/siliconflow/onediff?tab=readme-ov-file#installation
+
+### Set up nexfort backend
+https://github.com/siliconflow/onediff/tree/main/src/onediff/infer_compiler/backends/nexfort
+
+### Set up diffusers
+
+```shell
+pip3 install --upgrade "diffusers[torch]"
+```
+### Set up SD1.5
+Model version for diffusers: https://huggingface.co/runwayml/stable-diffusion-v1-5
+
+HF pipeline: https://github.com/huggingface/diffusers/blob/main/docs/source/en/api/pipelines/stable_diffusion/overview.md
+
+## Run
+
+### Run without compilation (Baseline)
+```shell
+python3 benchmarks/text_to_image.py \
+  --model runwayml/stable-diffusion-v1-5 \
+  --height 512 --width 512 \
+  --scheduler none \
+  --steps 20 \
+  --output-image ./stable-diffusion-v1-5.png \
+  --prompt "beautiful scenery nature glass bottle landscape, , purple galaxy bottle," \
+  --compiler none \
+  --seed 1 \
+  --print-output
+```
+
+### Run with compilation
+
+```shell
+python3 benchmarks/text_to_image.py \
+  --model runwayml/stable-diffusion-v1-5 \
+  --height 512 --width 512 \
+  --scheduler none \
+  --steps 20 \
+  --output-image ./stable-diffusion-v1-5-compile.png \
+  --prompt "beautiful scenery nature glass bottle landscape, , purple galaxy bottle," \
+  --compiler nexfort \
+  --compiler-config '{"mode": "cudagraphs:benchmark:max-autotune:low-precision:cache-all", "memory_format": "channels_last", "options": {"inductor.optimize_linear_epilogue": false, "overrides.conv_benchmark": true, "overrides.matmul_allow_tf32": true}}' \
+  --seed 1 \
+  --print-output
+```
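
The `--compiler-config` value above is a JSON object passed as a single shell argument, which is easy to misquote by hand. The sketch below builds the same string from a Python dict and quotes it for the shell. It assumes the benchmark script parses the value as JSON; the variable names are illustrative only.

```python
import json
import shlex

# Same settings as the README command, written as a Python dict.
compiler_config = {
    "mode": "cudagraphs:benchmark:max-autotune:low-precision:cache-all",
    "memory_format": "channels_last",
    "options": {
        "inductor.optimize_linear_epilogue": False,
        "overrides.conv_benchmark": True,
        "overrides.matmul_allow_tf32": True,
    },
}

# json.dumps emits lowercase true/false, as JSON requires.
config_json = json.dumps(compiler_config)

# shlex.quote wraps the string so the shell passes it as one argument.
print("--compiler-config " + shlex.quote(config_json))
```

Editing the dict and regenerating the argument avoids mismatched quotes or braces when options are added or removed.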
+
+## Performance comparison
+
+Tested on NVIDIA GeForce RTX 3090 / 4090 with an image size of 512*512 and 20 inference steps:
+
+| Metric                               | RTX3090, 512*512      | RTX4090, 512*512      |
+| ------------------------------------ | --------------------- | --------------------- |
+| Data update date (yyyy-mm-dd)        | 2024-07-10            | 2024-07-10            |
+| PyTorch iteration speed              | 21.20 it/s            | 34.46 it/s            |
+| OneDiff iteration speed              | 48.00 it/s (+126.4%)  | 81.81 it/s (+137.4%)  |
+| PyTorch E2E time                     | 1.07 s                | 0.67 s                |
+| OneDiff E2E time                     | 0.48 s (-55.1%)       | 0.28 s (-58.2%)       |
+| PyTorch Max Mem Used                 | 2.627 GiB             | 2.616 GiB             |
+| OneDiff Max Mem Used                 | 2.587 GiB             | 2.709 GiB             |
+| PyTorch Warmup with Run time         |                       |                       |
+| OneDiff Warmup with Compilation time | 233.61 s <sup>1</sup> | 177.321 s <sup>2</sup> |
+| OneDiff Warmup with Cache time       | 41.120 s              | 30.019 s              |
+
+<sup>1</sup> OneDiff warmup with compilation time was measured on an Intel(R) Xeon(R) Silver 4314 CPU @ 2.40GHz. The figure is for reference only; it varies considerably across CPUs.
+
+<sup>2</sup> Measured on an AMD EPYC 7543 32-Core Processor.
+
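
The percentages in the table follow from the relative change against the PyTorch baseline, `(optimized - baseline) / baseline`. A small sketch that reproduces them from the table's own numbers (the function and variable names are illustrative):

```python
def percent_change(baseline: float, optimized: float) -> float:
    """Relative change of `optimized` versus `baseline`, in percent."""
    return (optimized - baseline) / baseline * 100.0


# (label, PyTorch value, OneDiff value), copied from the SD1.5 table.
rows = [
    ("RTX3090 iteration speed (it/s)", 21.20, 48.00),
    ("RTX4090 iteration speed (it/s)", 34.46, 81.81),
    ("RTX3090 E2E time (s)", 1.07, 0.48),
    ("RTX4090 E2E time (s)", 0.67, 0.28),
]

for label, baseline, optimized in rows:
    print(f"{label}: {percent_change(baseline, optimized):+.1f}%")
```

A positive value is a gain for iteration speed, while a negative value is a gain for end-to-end time.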
+## Dynamic shape for SD1.5
+
+ <!-- TODO -->
+
+Run:
+
+```shell
+python3 benchmarks/text_to_image.py \
+  --model runwayml/stable-diffusion-v1-5 \
+  --height 512 --width 512 \
+  --scheduler none \
+  --steps 20 \
+  --output-image ./stable-diffusion-v1-5-compile.png \
+  --prompt "beautiful scenery nature glass bottle landscape, , purple galaxy bottle," \
+  --compiler nexfort \
+  --compiler-config '{"mode": "cudagraphs:max-autotune:low-precision:cache-all", "memory_format": "channels_last", "options": {"inductor.optimize_linear_epilogue": false, "overrides.conv_benchmark": true, "overrides.matmul_allow_tf32": true}, "dynamic": true}' \
+  --run_multiple_resolutions 1
+```
+
+## Quality
+With nexfort as the backend for onediff compilation acceleration, the generated images are lossless, as the demo below illustrates.
+
+<p align="center">
+<img src="../../../imgs/nexfort_sd1-5_demo.png">
+</p>
diff --git a/onediff_diffusers_extensions/examples/sd2/README.md b/onediff_diffusers_extensions/examples/sd2/README.md
@@ -0,0 +1,105 @@
+# Run SD2 with nexfort backend (Beta Release)
+
+1. [Environment Setup](#environment-setup)
+   - [Set Up OneDiff](#set-up-onediff)
+   - [Set Up NexFort Backend](#set-up-nexfort-backend)
+   - [Set Up Diffusers Library](#set-up-diffusers)
+   - [Set Up SD2](#set-up-sd2)
+2. [Execution Instructions](#run)
+   - [Run Without Compilation (Baseline)](#run-without-compilation-baseline)
+   - [Run With Compilation](#run-with-compilation)
+3. [Performance Comparison](#performance-comparison)
+4. [Dynamic Shape for SD2](#dynamic-shape-for-sd2)
+5. [Quality](#quality)
+
+## Environment setup
+### Set up onediff
+https://github.com/siliconflow/onediff?tab=readme-ov-file#installation
+
+### Set up nexfort backend
+https://github.com/siliconflow/onediff/tree/main/src/onediff/infer_compiler/backends/nexfort
+
+### Set up diffusers
+
+```shell
+pip3 install --upgrade "diffusers[torch]"
+```
+### Set up SD2
+Model version for diffusers: https://huggingface.co/stabilityai/stable-diffusion-2
+
+HF pipeline: https://github.com/huggingface/diffusers/blob/main/docs/source/en/api/pipelines/stable_diffusion/stable_diffusion_2.md
+
+## Run
+
+### Run without compilation (Baseline)
+```shell
+python3 benchmarks/text_to_image.py \
+  --model stabilityai/stable-diffusion-2-1 \
+  --height 768 --width 768 \
+  --scheduler none \
+  --steps 20 \
+  --output-image ./stable-diffusion-2-1.png \
+  --prompt "beautiful scenery nature glass bottle landscape, , purple galaxy bottle," \
+  --compiler none \
+  --print-output
+```
+
+### Run with compilation
+
+```shell
+python3 benchmarks/text_to_image.py \
+  --model stabilityai/stable-diffusion-2-1 \
+  --height 768 --width 768 \
+  --scheduler none \
+  --steps 20 \
+  --output-image ./stable-diffusion-2-1-compile.png \
+  --prompt "beautiful scenery nature glass bottle landscape, , purple galaxy bottle," \
+  --compiler nexfort \
+  --compiler-config '{"mode": "cudagraphs:benchmark:max-autotune:low-precision:cache-all", "memory_format": "channels_last", "options": {"triton.fuse_attention_allow_fp16_reduction": false, "inductor.optimize_linear_epilogue": false, "overrides.conv_benchmark": true, "overrides.matmul_allow_tf32": true}}' \
+  --print-output
+```
+
+## Performance comparison
+
+Tested on NVIDIA GeForce RTX 3090 / 4090 with image sizes of 768\*768 and 512\*512 and 20 inference steps:
+
+| Metric                               | RTX3090, 768*768     | RTX3090, 512*512     | RTX4090, 768*768      | RTX4090, 512*512      |
+| ------------------------------------ | -------------------- | -------------------- | --------------------- | --------------------- |
+| Data update date (yyyy-mm-dd)        | 2024-07-10           | 2024-07-10           | 2024-07-10            | 2024-07-10            |
+| PyTorch iteration speed              | 10.45 it/s           | 22.84 it/s           | 12.34 it/s            | 39.06 it/s            |
+| OneDiff iteration speed              | 15.93 it/s (+52.4%)  | 44.84 it/s (+96.3%)  | 31.63 it/s (+156.3%)  | 83.63 it/s (+114.1%)  |
+| PyTorch E2E time                     | 2.10 s               | 0.97 s               | 1.78 s                | 0.58 s                |
+| OneDiff E2E time                     | 1.35 s (-35.7%)      | 0.49 s (-49.5%)      | 0.68 s (-61.8%)       | 0.26 s (-55.2%)       |
+| PyTorch Max Mem Used                 | 3.767 GiB            | 3.025 GiB            | 3.767 GiB             | 3.024 GiB             |
+| OneDiff Max Mem Used                 | 3.558 GiB            | 3.018 GiB            | 3.567 GiB             | 3.016 GiB             |
+| PyTorch Warmup with Run time         |                      |                      |                       |                       |
+| OneDiff Warmup with Compilation time | 301.54 s<sup>1</sup> | 222.18 s<sup>1</sup> | 195.34 s <sup>2</sup> | 165.29 s <sup>1</sup> |
+| OneDiff Warmup with Cache time       | 113.04 s             | 44.94 s              | 32.41 s               | 30.10 s               |
+
+<sup>1</sup> OneDiff warmup with compilation time was measured on an Intel(R) Xeon(R) Silver 4314 CPU @ 2.40GHz. The figure is for reference only; it varies considerably across CPUs.
+
+<sup>2</sup> Measured on an AMD EPYC 7543 32-Core Processor.
+
+## Dynamic shape for SD2
+
+Run:
+
+```shell
+python3 benchmarks/text_to_image.py \
+  --model stabilityai/stable-diffusion-2-1 \
+  --height 768 --width 768 \
+  --scheduler none \
+  --steps 20 \
+  --output-image ./stable-diffusion-2-1-compile.png \
+  --prompt "beautiful scenery nature glass bottle landscape, , purple galaxy bottle," \
+  --compiler nexfort \
+  --compiler-config '{"mode": "cudagraphs:max-autotune:low-precision:cache-all", "memory_format": "channels_last", "options": {"inductor.optimize_linear_epilogue": false, "overrides.conv_benchmark": true, "overrides.matmul_allow_tf32": true}, "dynamic": true}' \
+  --run_multiple_resolutions 1
+```
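
The dynamic-shape configuration differs from the static one in a few places that are easy to miss in a long one-line JSON string. The sketch below compares the two SD2 configs from this README. It assumes that `mode` is a colon-separated list of tokens, which is how the strings read but is not confirmed here.

```python
import json

# Config from "Run with compilation" (static shapes).
static_cfg = json.loads(
    '{"mode": "cudagraphs:benchmark:max-autotune:low-precision:cache-all", '
    '"memory_format": "channels_last", "options": '
    '{"triton.fuse_attention_allow_fp16_reduction": false, '
    '"inductor.optimize_linear_epilogue": false, '
    '"overrides.conv_benchmark": true, "overrides.matmul_allow_tf32": true}}'
)

# Config from "Dynamic shape for SD2".
dynamic_cfg = json.loads(
    '{"mode": "cudagraphs:max-autotune:low-precision:cache-all", '
    '"memory_format": "channels_last", "options": '
    '{"inductor.optimize_linear_epilogue": false, '
    '"overrides.conv_benchmark": true, "overrides.matmul_allow_tf32": true}, '
    '"dynamic": true}'
)

# Assumption: mode tokens are separated by ":".
static_mode = set(static_cfg["mode"].split(":"))
dynamic_mode = set(dynamic_cfg["mode"].split(":"))

mode_only_static = sorted(static_mode - dynamic_mode)
options_only_static = sorted(set(static_cfg["options"]) - set(dynamic_cfg["options"]))
keys_only_dynamic = sorted(set(dynamic_cfg) - set(static_cfg))

print("mode tokens only in static config:", mode_only_static)
print("options only in static config:", options_only_static)
print("top-level keys only in dynamic config:", keys_only_dynamic)
```

For these two strings the differences are the `benchmark` mode token, the `triton.fuse_attention_allow_fp16_reduction` option, and the `dynamic` key.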
+
+## Quality
+With nexfort as the backend for onediff compilation acceleration, the generated images are lossless, as the demo below illustrates.
+
+<p align="center">
+<img src="../../../imgs/nexfort_sd2_demo.png">
+</p>
diff --git a/onediff_diffusers_extensions/examples/sdxl/README.md b/onediff_diffusers_extensions/examples/sdxl/README.md
@@ -0,0 +1,109 @@
+# Run SDXL with nexfort backend (Beta Release)
+
+1. [Environment Setup](#environment-setup)
+   - [Set Up OneDiff](#set-up-onediff)
+   - [Set Up NexFort Backend](#set-up-nexfort-backend)
+   - [Set Up Diffusers Library](#set-up-diffusers)
+   - [Set Up SDXL](#set-up-sdxl)
+2. [Execution Instructions](#run)
+   - [Run Without Compilation (Baseline)](#run-without-compilation-baseline)
+   - [Run With Compilation](#run-with-compilation)
+3. [Performance Comparison](#performance-comparison)
+4. [Dynamic Shape for SDXL](#dynamic-shape-for-sdxl)
+5. [Quality](#quality)
+
+## Environment setup
+### Set up onediff
+https://github.com/siliconflow/onediff?tab=readme-ov-file#installation
+
+### Set up nexfort backend
+https://github.com/siliconflow/onediff/tree/main/src/onediff/infer_compiler/backends/nexfort
+
+### Set up diffusers
+
+```shell
+pip3 install --upgrade "diffusers[torch]"
+```
+### Set up SDXL
+Model version for diffusers: https://huggingface.co/stabilityai/stable-diffusion-xl-base-1.0
+
+HF pipeline: https://github.com/huggingface/diffusers/blob/main/docs/source/en/api/pipelines/stable_diffusion/stable_diffusion_xl.md
+
+## Run
+
+### Run without compilation (Baseline)
+```shell
+python3 benchmarks/text_to_image.py \
+  --model stabilityai/stable-diffusion-xl-base-1.0 \
+  --height 1024 --width 1024 \
+  --scheduler none \
+  --steps 20 \
+  --output-image ./stable-diffusion-xl.png \
+  --prompt "beautiful scenery nature glass bottle landscape, , purple galaxy bottle," \
+  --compiler none \
+  --variant fp16 \
+  --seed 1 \
+  --print-output
+```
+
+### Run with compilation
+
+```shell
+python3 benchmarks/text_to_image.py \
+  --model stabilityai/stable-diffusion-xl-base-1.0 \
+  --height 1024 --width 1024 \
+  --scheduler none \
+  --steps 20 \
+  --output-image ./stable-diffusion-xl-compile.png \
+  --prompt "beautiful scenery nature glass bottle landscape, , purple galaxy bottle," \
+  --compiler nexfort \
+  --compiler-config '{"mode": "benchmark:cudagraphs:max-autotune:low-precision:cache-all", "memory_format": "channels_last", "options": {"inductor.optimize_linear_epilogue": false, "overrides.conv_benchmark": true, "overrides.matmul_allow_tf32": true}}' \
+  --variant fp16 \
+  --seed 1 \
+  --print-output
+```
+
+## Performance comparison
+
+Tested on NVIDIA GeForce RTX 3090 / 4090 with an image size of 1024*1024 and 20 inference steps:
+
+| Metric                               | RTX3090, 1024*1024    | RTX4090, 1024*1024    |
+| ------------------------------------ | --------------------- | --------------------- |
+| Data update date (yyyy-mm-dd)        | 2024-07-10            | 2024-07-10            |
+| PyTorch iteration speed              | 4.08 it/s             | 6.93 it/s             |
+| OneDiff iteration speed              | 7.21 it/s (+76.7%)    | 13.92 it/s (+100.9%)  |
+| PyTorch E2E time                     | 5.60 s                | 3.23 s                |
+| OneDiff E2E time                     | 3.41 s (-39.1%)       | 1.67 s (-48.3%)       |
+| PyTorch Max Mem Used                 | 10.467 GiB            | 10.467 GiB            |
+| OneDiff Max Mem Used                 | 12.004 GiB            | 12.021 GiB            |
+| PyTorch Warmup with Run time         |                       |                       |
+| OneDiff Warmup with Compilation time | 474.36 s <sup>1</sup> | 236.54 s <sup>2</sup> |
+| OneDiff Warmup with Cache time       | 306.84 s              | 104.57 s              |
+
+<sup>1</sup> OneDiff warmup with compilation time was measured on an Intel(R) Xeon(R) Silver 4314 CPU @ 2.40GHz. The figure is for reference only; it varies considerably across CPUs.
+
+<sup>2</sup> Measured on an AMD EPYC 7543 32-Core Processor.
+
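
Warmup takes minutes while the per-image saving is a second or two, so the speedup only pays off after enough images. A rough break-even estimate from the table's numbers is sketched below. It treats the PyTorch warmup as zero because the table leaves that row blank, so the counts are upper bounds; `breakeven_images` is an illustrative helper, not part of onediff.

```python
import math


def breakeven_images(warmup_s: float, baseline_e2e_s: float, compiled_e2e_s: float) -> int:
    """Smallest number of images after which warmup cost is recovered.

    Assumes the baseline has no warmup cost of its own.
    """
    saving = baseline_e2e_s - compiled_e2e_s
    if saving <= 0:
        raise ValueError("compiled run is not faster than the baseline")
    return math.ceil(warmup_s / saving)


# (warmup seconds, PyTorch E2E seconds, OneDiff E2E seconds) from the SDXL table.
cases = {
    "RTX3090, fresh compilation": (474.36, 5.60, 3.41),
    "RTX3090, warm cache": (306.84, 5.60, 3.41),
    "RTX4090, fresh compilation": (236.54, 3.23, 1.67),
    "RTX4090, warm cache": (104.57, 3.23, 1.67),
}

for label, (warmup, baseline, compiled) in cases.items():
    print(f"{label}: about {breakeven_images(warmup, baseline, compiled)} images")
```

On these figures the break-even point ranges from roughly 70 images (RTX 4090 with a warm cache) to a little over 200 (RTX 3090 with a fresh compilation).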
+
+## Dynamic shape for SDXL
+
+Run:
+
+```shell
+python3 benchmarks/text_to_image.py \
+  --model stabilityai/stable-diffusion-xl-base-1.0 \
+  --height 1024 --width 1024 \
+  --scheduler none \
+  --steps 20 \
+  --output-image ./stable-diffusion-xl-compile.png \
+  --prompt "beautiful scenery nature glass bottle landscape, , purple galaxy bottle," \
+  --compiler nexfort \
+  --compiler-config '{"mode": "cudagraphs:max-autotune:low-precision:cache-all", "memory_format": "channels_last", "options": {"inductor.optimize_linear_epilogue": false, "overrides.conv_benchmark": true, "overrides.matmul_allow_tf32": true}, "dynamic": true}' \
+  --run_multiple_resolutions 1
+```
+
+## Quality
+With nexfort as the backend for onediff compilation acceleration, the generated images are lossless, as the demo below illustrates.
+
+<p align="center">
+<img src="../../../imgs/nexfort_sdxl_demo.png">
+</p>