diff --git a/README.md b/README.md
index 4069bad52..5e3011497 100644
--- a/README.md
+++ b/README.md
@@ -77,7 +77,7 @@ The Full Introduction of OneDiff:
- [Installation](#installation)
- [OneDiff Installation](#onediff-installation)
- [Install a compiler backend](#install-a-compiler-backend)
- - [(Optional) Install NexFort](#optional-install-nexfort)
+ - [(Optional) Install Nexfort](#optional-install-nexfort)
- [(Optional) Install OneFlow](#optional-install-oneflow)
- [2. Install torch and diffusers](#2-install-torch-and-diffusers)
- [3. Install OneDiff](#3-install-onediff)
@@ -188,10 +188,11 @@ When considering the choice between OneFlow and Nexfort, either one is optional,
- For all other cases, it is recommended to use OneFlow. Note that optimizations within OneFlow will gradually transition to Nexfort in the future.
-##### (Optional) Install NexFort
+##### (Optional) Install Nexfort
+A detailed introduction to Nexfort is available [here](https://github.com/siliconflow/onediff/tree/main/onediff/src/onediff/infer_compiler/backends/nexfort/README.md).
```bash
-python3 -m pip install -U torch torchvision torchaudio --index-url https://download.pytorch.org/whl/cu121
+python3 -m pip install -U torch==2.3.0 torchvision==0.18.0 torchaudio==2.3.0 torchao==0.1
python3 -m pip install -U nexfort
```
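+
+To sanity-check the installation, you can verify that the pinned packages import cleanly (a minimal, optional check, assuming the usual `torch`/`torchao`/`nexfort` module names):
+```bash
+python3 -c "import torch, torchao, nexfort; print(torch.__version__)"
+```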
diff --git a/benchmarks/text_to_image.py b/benchmarks/text_to_image.py
index cb006fc85..a0f565770 100644
--- a/benchmarks/text_to_image.py
+++ b/benchmarks/text_to_image.py
@@ -267,7 +267,7 @@ def main():
options = json.loads(args.compiler_config)
else:
# config with string
- options = '{"mode": "max-optimize:max-autotune:freezing:benchmark:low-precision:cudagraphs", "memory_format": "channels_last"}'
+ options = '{"mode": "max-optimize:max-autotune:freezing", "memory_format": "channels_last"}'
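+ # Note: the "mode" value is a colon-separated list of nexfort
+ # optimization flags; heavier modes lengthen the first (compilation)
+ # run but speed up steady-state inference.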
pipe = compile_pipe(
pipe, backend="nexfort", options=options, fuse_qkv_projections=True
)
diff --git a/onediff_diffusers_extensions/examples/pixart_alpha/README.md b/onediff_diffusers_extensions/examples/pixart_alpha/README.md
index 38ae417d2..eb0f55669 100644
--- a/onediff_diffusers_extensions/examples/pixart_alpha/README.md
+++ b/onediff_diffusers_extensions/examples/pixart_alpha/README.md
@@ -58,14 +58,10 @@ python3 ./benchmarks/text_to_image.py \
```
## Performance comparison
-### nexfort compile config
-- compiler-config default is `{"mode": "max-optimize:max-autotune:freezing:benchmark:low-precision:cudagraphs", "memory_format": "channels_last"}` in `/benchmarks/text_to_image.py`
- - setting `--compiler-config '{"mode": "max-autotune", "memory_format": "channels_last"}'` will reduce compilation time and just slightly reduce the performance
- - setting `--compiler-config '{"mode": "jit:disable-runtime-fusion", "memory_format": "channels_last"}'` will reduce compilation time to 21.832s, but will reduce the performance
-- fuse_qkv_projections: True
### Metric
+#### On A100
| Metric | NVIDIA A100-PCIE-40GB (1024 * 1024) |
| ------------------------------------------------ | ----------------------------------- |
| Data update date(yyyy-mm-dd) | 2024-05-23 |
@@ -76,11 +72,12 @@ python3 ./benchmarks/text_to_image.py \
| PyTorch Max Mem Used | 14.445GiB |
| OneDiff Max Mem Used | 13.855GiB |
| PyTorch Warmup with Run time | 4.100s |
-| OneDiff Warmup with Compilation time<sup>1</sup> | 776.170s |
+| OneDiff Warmup with Compilation time<sup>1</sup> | 510.170s |
| OneDiff Warmup with Cache time | 111.563s |
<sup>1</sup> OneDiff Warmup with Compilation time is tested on Intel(R) Xeon(R) Gold 6348 CPU @ 2.60GHz. Note this is just for reference, and it varies a lot across different CPUs.
+#### On H800
| Metric | NVIDIA H800 (1024 * 1024) |
| ------------------------------------------------ | ----------------------------------- |
| Data update date(yyyy-mm-dd) | 2024-05-29 |
@@ -96,6 +93,14 @@ python3 ./benchmarks/text_to_image.py \
<sup>2</sup> Intel(R) Xeon(R) Platinum 8468.
+#### Nexfort compile config and warmup cost
+- compiler-config (an example command follows this list)
+  - the default in `/benchmarks/text_to_image.py` is `{"mode": "max-optimize:max-autotune:freezing", "memory_format": "channels_last"}`; compilation takes about 500 seconds
+  - setting `--compiler-config '{"mode": "max-autotune", "memory_format": "channels_last"}'` reduces compilation time to about 60 seconds and only slightly reduces performance
+  - setting `--compiler-config '{"mode": "max-optimize:max-autotune:freezing:benchmark:low-precision:cudagraphs", "memory_format": "channels_last"}'` gives the best performance, but compilation takes about 700 seconds
+  - setting `--compiler-config '{"mode": "jit:disable-runtime-fusion", "memory_format": "channels_last"}'` reduces compilation time to about 20 seconds, but at a cost in performance
+- fuse_qkv_projections: True
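+
+For example, to trade a little performance for a much shorter warmup, the mode can be passed on the command line (a sketch based on the benchmark commands shown earlier in this README):
+```bash
+python3 ./benchmarks/text_to_image.py \
+  --model PixArt-alpha/PixArt-XL-2-1024-MS \
+  --compiler nexfort \
+  --compiler-config '{"mode": "max-autotune", "memory_format": "channels_last"}'
+```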
+
## Quantization
OneDiff's nexfort backend works closely with Torchao to support model quantization. Quantization can reduce the runtime memory requirement and increase inference speed.