Describe the bug
Inference with https://huggingface.co/togethercomputer/RedPajama-INCITE-7B-Base becomes about 60% slower after updating the Linux kernel from 5.19.0-41 to 6.2.0-35: the time cost increases from 2.19 s to 3.57 s.

Reproduction script (generate.py):
```python
import torch
import intel_extension_for_pytorch as ipex
import time
import argparse
from transformers import AutoTokenizer, AutoModelForCausalLM

if __name__ == '__main__':
    parser = argparse.ArgumentParser(description='Predict Tokens using `generate()` API for Llama2 model')
    parser.add_argument('--repo-id-or-model-path', type=str, default="meta-llama/Llama-2-7b-chat-hf",
                        help='The huggingface repo id for the Llama2 (e.g. `meta-llama/Llama-2-7b-chat-hf` and `meta-llama/Llama-2-13b-chat-hf`) to be downloaded'
                             ', or the path to the huggingface checkpoint folder')
    parser.add_argument('--n-predict', type=int, default=32,
                        help='Max tokens to predict')
    args = parser.parse_args()
    model_path = args.repo_id_or_model_path
    prompt = "Once upon a time, there existed a little girl who liked to have adventures. She wanted to go to places and meet new people, and have fun."

    # Load the model in FP16 and move it to the XPU device
    model = AutoModelForCausalLM.from_pretrained(model_path,
                                                 trust_remote_code=True,
                                                 use_cache=True)
    model = model.half().to('xpu')

    # Load tokenizer
    tokenizer = AutoTokenizer.from_pretrained(model_path, trust_remote_code=True)

    # Generate predicted tokens
    with torch.inference_mode():
        input_ids = tokenizer.encode(prompt, return_tensors="pt").to('xpu')
        # The ipex model needs a warmup run so that the timed run below is accurate
        output = model.generate(input_ids,
                                max_new_tokens=args.n_predict)
        # Timed inference run
        st = time.time()
        output = model.generate(input_ids,
                                max_new_tokens=args.n_predict)
        torch.xpu.synchronize()
        end = time.time()
        output = output.cpu()
        output_str = tokenizer.decode(output[0], skip_special_tokens=True)
        print(f'Inference time: {end-st} s')
        print('-'*20, 'Prompt', '-'*20)
        print(prompt)
        print('-'*20, 'Output', '-'*20)
        print(output_str)
```
On Linux 5.19, the time cost is 2.19 seconds with the following commands:

```bash
source /opt/intel/oneapi/setvars.sh
export USE_XETLA=OFF
export SYCL_PI_LEVEL_ZERO_USE_IMMEDIATE_COMMANDLISTS=1
python generate.py --repo-id-or-model-path /mnt/disk1/models/RedPajama-INCITE-7B-Base
```
On Linux 6.2, the time cost is 3.57 seconds with the following commands:

```bash
source /opt/intel/oneapi/setvars.sh
python generate.py --repo-id-or-model-path /mnt/disk1/models/RedPajama-INCITE-7B-Base
```
On Linux 6.2, setting USE_XETLA=OFF and SYCL_PI_LEVEL_ZERO_USE_IMMEDIATE_COMMANDLISTS=1 makes things worse: the time cost is 4.08 seconds with the following commands:

```bash
source /opt/intel/oneapi/setvars.sh
export USE_XETLA=OFF
export SYCL_PI_LEVEL_ZERO_USE_IMMEDIATE_COMMANDLISTS=1
python generate.py --repo-id-or-model-path /mnt/disk1/models/RedPajama-INCITE-7B-Base
```
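For convenience, the two configurations can be compared back to back on whichever kernel is currently booted. The sketch below is only an illustrative helper, not part of the original reproduction; it reuses the generate.py script and model path from this report and just records the kernel version alongside each run.

```bash
#!/bin/bash
# Illustrative helper (assumed setup from this report: oneAPI installed,
# generate.py and the local model path available).
source /opt/intel/oneapi/setvars.sh
echo "Kernel: $(uname -r)"

echo "--- Run 1: default environment ---"
unset USE_XETLA SYCL_PI_LEVEL_ZERO_USE_IMMEDIATE_COMMANDLISTS
python generate.py --repo-id-or-model-path /mnt/disk1/models/RedPajama-INCITE-7B-Base

echo "--- Run 2: USE_XETLA=OFF + immediate command lists ---"
export USE_XETLA=OFF
export SYCL_PI_LEVEL_ZERO_USE_IMMEDIATE_COMMANDLISTS=1
python generate.py --repo-id-or-model-path /mnt/disk1/models/RedPajama-INCITE-7B-Base
```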
Versions
CPU: i9 13900K
GPU: GUNNIR Arc A770
OS: Ubuntu 22.04.3
Python: 3.9.18
Dependencies:
accelerate 0.21.0
antlr4-python3-runtime 4.9.3
certifi 2023.7.22
charset-normalizer 3.2.0
einops 0.6.1
filelock 3.12.4
fsspec 2023.9.1
huggingface-hub 0.17.2
idna 3.4
intel-extension-for-pytorch 2.0.110+xpu
Jinja2 3.1.2
MarkupSafe 2.1.3
mkl-include 2023.2.0
mkl-static 2023.2.0
mpmath 1.3.0
networkx 3.1
ninja 1.11.1
numpy 1.26.0
omegaconf 2.3.0
packaging 23.1
pandas 2.1.0
Pillow 10.0.1
pip 23.2.1
protobuf 4.24.3
psutil 5.9.5
py-cpuinfo 9.0.0
python-dateutil 2.8.2
pytz 2023.3.post1
PyYAML 6.0.1
regex 2023.8.8
requests 2.31.0
safetensors 0.3.3
sentencepiece 0.1.99
setuptools 68.0.0
six 1.16.0
sympy 1.12
tabulate 0.9.0
tiktoken 0.5.1
tokenizers 0.13.3
torch 2.0.1a0+cxx11.abi
torchvision 0.15.2a0+cxx11.abi
tqdm 4.66.1
transformers 4.31.0
transformers-stream-generator 0.0.4
typing_extensions 4.8.0
tzdata 2023.3
urllib3 2.0.4
wheel 0.38.4