Describe the bug
Inference with https://huggingface.co/togethercomputer/RedPajama-INCITE-7B-Base becomes about 60% slower after updating the Linux kernel from 5.19.0-41 to 6.2.0-35: the time cost increases from 2.19 s to 3.57 s.

Reproduction script (generate.py):
```python
import torch
import intel_extension_for_pytorch as ipex
import time
import argparse
from transformers import AutoTokenizer, AutoModelForCausalLM

if __name__ == '__main__':
    parser = argparse.ArgumentParser(description='Predict Tokens using `generate()` API for Llama2 model')
    parser.add_argument('--repo-id-or-model-path', type=str, default="meta-llama/Llama-2-7b-chat-hf",
                        help='The huggingface repo id for the Llama2 (e.g. `meta-llama/Llama-2-7b-chat-hf` and `meta-llama/Llama-2-13b-chat-hf`) to be downloaded'
                             ', or the path to the huggingface checkpoint folder')
    parser.add_argument('--n-predict', type=int, default=32,
                        help='Max tokens to predict')
    args = parser.parse_args()
    model_path = args.repo_id_or_model_path
    prompt = "Once upon a time, there existed a little girl who liked to have adventures. She wanted to go to places and meet new people, and have fun."

    # Load the model in FP16 and move it to the XPU device
    model = AutoModelForCausalLM.from_pretrained(model_path,
                                                 trust_remote_code=True,
                                                 use_cache=True)
    model = model.half().to('xpu')

    # Load tokenizer
    tokenizer = AutoTokenizer.from_pretrained(model_path, trust_remote_code=True)

    # Generate predicted tokens
    with torch.inference_mode():
        input_ids = tokenizer.encode(prompt, return_tensors="pt").to('xpu')
        # The ipex model needs a warmup run so that the timed run below is accurate
        output = model.generate(input_ids,
                                max_new_tokens=args.n_predict)
        # Timed inference run
        st = time.time()
        output = model.generate(input_ids,
                                max_new_tokens=args.n_predict)
        torch.xpu.synchronize()
        end = time.time()
        output = output.cpu()
        output_str = tokenizer.decode(output[0], skip_special_tokens=True)
        print(f'Inference time: {end-st} s')
        print('-'*20, 'Prompt', '-'*20)
        print(prompt)
        print('-'*20, 'Output', '-'*20)
        print(output_str)
```
On Linux 5.19, the time cost is 2.19 seconds with the following commands:

```bash
source /opt/intel/oneapi/setvars.sh
export USE_XETLA=OFF
export SYCL_PI_LEVEL_ZERO_USE_IMMEDIATE_COMMANDLISTS=1
python generate.py --repo-id-or-model-path /mnt/disk1/models/RedPajama-INCITE-7B-Base
```
On Linux 6.2, the time cost is 3.57 seconds with the following commands:

```bash
source /opt/intel/oneapi/setvars.sh
python generate.py --repo-id-or-model-path /mnt/disk1/models/RedPajama-INCITE-7B-Base
```
On Linux 6.2, setting USE_XETLA=OFF and SYCL_PI_LEVEL_ZERO_USE_IMMEDIATE_COMMANDLISTS=1 makes things worse: the time cost is 4.08 seconds with the following commands:

```bash
source /opt/intel/oneapi/setvars.sh
export USE_XETLA=OFF
export SYCL_PI_LEVEL_ZERO_USE_IMMEDIATE_COMMANDLISTS=1
python generate.py --repo-id-or-model-path /mnt/disk1/models/RedPajama-INCITE-7B-Base
```
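For convenience, the two configurations can be compared back to back on whichever kernel is currently booted. The sketch below is only an illustrative helper, not part of the original reproduction; it reuses the generate.py script and model path from this report and just records the kernel version alongside each run.

```bash
#!/bin/bash
# Illustrative helper (assumed setup from this report: oneAPI installed,
# generate.py and the local model path available).
source /opt/intel/oneapi/setvars.sh
echo "Kernel: $(uname -r)"

echo "--- Run 1: default environment ---"
unset USE_XETLA SYCL_PI_LEVEL_ZERO_USE_IMMEDIATE_COMMANDLISTS
python generate.py --repo-id-or-model-path /mnt/disk1/models/RedPajama-INCITE-7B-Base

echo "--- Run 2: USE_XETLA=OFF + immediate command lists ---"
export USE_XETLA=OFF
export SYCL_PI_LEVEL_ZERO_USE_IMMEDIATE_COMMANDLISTS=1
python generate.py --repo-id-or-model-path /mnt/disk1/models/RedPajama-INCITE-7B-Base
```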
Versions
CPU: i9 13900K
GPU: GUNNIR Arc A770
OS: Ubuntu 22.04.3
Python: 3.9.18
Dependencies:
accelerate 0.21.0
antlr4-python3-runtime 4.9.3
certifi 2023.7.22
charset-normalizer 3.2.0
einops 0.6.1
filelock 3.12.4
fsspec 2023.9.1
huggingface-hub 0.17.2
idna 3.4
intel-extension-for-pytorch 2.0.110+xpu
Jinja2 3.1.2
MarkupSafe 2.1.3
mkl-include 2023.2.0
mkl-static 2023.2.0
mpmath 1.3.0
networkx 3.1
ninja 1.11.1
numpy 1.26.0
omegaconf 2.3.0
packaging 23.1
pandas 2.1.0
Pillow 10.0.1
pip 23.2.1
protobuf 4.24.3
psutil 5.9.5
py-cpuinfo 9.0.0
python-dateutil 2.8.2
pytz 2023.3.post1
PyYAML 6.0.1
regex 2023.8.8
requests 2.31.0
safetensors 0.3.3
sentencepiece 0.1.99
setuptools 68.0.0
six 1.16.0
sympy 1.12
tabulate 0.9.0
tiktoken 0.5.1
tokenizers 0.13.3
torch 2.0.1a0+cxx11.abi
torchvision 0.15.2a0+cxx11.abi
tqdm 4.66.1
transformers 4.31.0
transformers-stream-generator 0.0.4
typing_extensions 4.8.0
tzdata 2023.3
urllib3 2.0.4
wheel 0.38.4