RedPajama-INCITE-7B-Base fp16 is 60% slower on Arc A770 after upgrading the Linux kernel from 5.19.0-41 to 6.2.0-35 #458

Open
@qiuxin2012


Describe the bug

https://huggingface.co/togethercomputer/RedPajama-INCITE-7B-Base runs about 60% slower after upgrading the Linux kernel from 5.19.0-41 to 6.2.0-35: inference time increases from 2.19 s to 3.57 s (a ~63% increase). The benchmark script:

import torch
import intel_extension_for_pytorch as ipex
import time
import argparse

from transformers import AutoTokenizer, AutoModelForCausalLM

if __name__ == '__main__':
    parser = argparse.ArgumentParser(description='Predict tokens using the `generate()` API')
    parser.add_argument('--repo-id-or-model-path', type=str, default="meta-llama/Llama-2-7b-chat-hf",
                        help='The Hugging Face repo id of the model to benchmark '
                             '(e.g. `togethercomputer/RedPajama-INCITE-7B-Base`), '
                             'or the path to a local checkpoint folder')
    parser.add_argument('--n-predict', type=int, default=32,
                        help='Max tokens to predict')

    args = parser.parse_args()
    model_path = args.repo_id_or_model_path
    prompt = "Once upon a time, there existed a little girl who liked to have adventures. She wanted to go to places and meet new people, and have fun."

    # Load the model, then cast it to fp16 and move it to the Arc GPU ('xpu')
    model = AutoModelForCausalLM.from_pretrained(model_path,
                                                 trust_remote_code=True,
                                                 use_cache=True)
    model = model.half().to('xpu')

    # Load tokenizer
    tokenizer = AutoTokenizer.from_pretrained(model_path, trust_remote_code=True)

    # Generate predicted tokens
    with torch.inference_mode():
        input_ids = tokenizer.encode(prompt, return_tensors="pt").to('xpu')
        # the IPEX model needs a warmup run so the timed run below is accurate
        output = model.generate(input_ids,
                                max_new_tokens=args.n_predict)
        # wait for outstanding warmup kernels before starting the timer
        torch.xpu.synchronize()

        # start inference
        st = time.time()
        output = model.generate(input_ids,
                                max_new_tokens=args.n_predict)
        torch.xpu.synchronize()
        end = time.time()
        output = output.cpu()
        output_str = tokenizer.decode(output[0], skip_special_tokens=True)
        print(f'Inference time: {end-st} s')
        print('-'*20, 'Prompt', '-'*20)
        print(prompt)
        print('-'*20, 'Output', '-'*20)
        print(output_str)
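
For reference when comparing kernels, a small snippet like the one below can print the running kernel and library versions next to each timing run. This is a minimal sketch, not part of the original repro; the helper name report_env is illustrative.

import platform

import torch
import intel_extension_for_pytorch as ipex  # registers the torch.xpu backend

def report_env():
    # platform.release() returns the running kernel, e.g. '5.19.0-41-generic'
    print(f'Kernel: {platform.release()}')
    print(f'torch:  {torch.__version__}')
    print(f'ipex:   {ipex.__version__}')
    if torch.xpu.is_available():
        print(f'XPU:    {torch.xpu.get_device_name(0)}')

report_env()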

On Linux 5.19, inference takes 2.19 seconds with the following command:

source /opt/intel/oneapi/setvars.sh
export USE_XETLA=OFF
export SYCL_PI_LEVEL_ZERO_USE_IMMEDIATE_COMMANDLISTS=1
python generate.py --repo-id-or-model-path /mnt/disk1/models/RedPajama-INCITE-7B-Base

On Linux 6.2, inference takes 3.57 seconds with the following command:

source /opt/intel/oneapi/setvars.sh
python generate.py --repo-id-or-model-path /mnt/disk1/models/RedPajama-INCITE-7B-Base

On Linux 6.2, exporting USE_XETLA=OFF and SYCL_PI_LEVEL_ZERO_USE_IMMEDIATE_COMMANDLISTS=1 makes things worse: inference takes 4.08 seconds with the following command:

source /opt/intel/oneapi/setvars.sh
export USE_XETLA=OFF
export SYCL_PI_LEVEL_ZERO_USE_IMMEDIATE_COMMANDLISTS=1
python generate.py --repo-id-or-model-path /mnt/disk1/models/RedPajama-INCITE-7B-Base
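
Since the two environment variables help on 5.19 but hurt on 6.2, it may be worth timing all four combinations on each kernel. Below is a minimal sketch that reruns the script above under each setting; it assumes setvars.sh has already been sourced in the parent shell, and the model path is the one from the commands above.

import os
import subprocess

MODEL = '/mnt/disk1/models/RedPajama-INCITE-7B-Base'

# Each variable is either left unset (the default) or set to the value
# used in the commands above.
options = [
    {},                                                        # defaults
    {'USE_XETLA': 'OFF'},
    {'SYCL_PI_LEVEL_ZERO_USE_IMMEDIATE_COMMANDLISTS': '1'},
    {'USE_XETLA': 'OFF',
     'SYCL_PI_LEVEL_ZERO_USE_IMMEDIATE_COMMANDLISTS': '1'},
]

for overrides in options:
    env = os.environ.copy()
    env.update(overrides)
    print('Overrides:', overrides or '(none)')
    subprocess.run(['python', 'generate.py',
                    '--repo-id-or-model-path', MODEL],
                   env=env, check=True)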

Versions
CPU: Intel Core i9-13900K
GPU: GUNNIR Arc A770
OS: Ubuntu 22.04.3
Python: 3.9.18
Dependencies:

accelerate                    0.21.0
antlr4-python3-runtime        4.9.3
certifi                       2023.7.22
charset-normalizer            3.2.0
einops                        0.6.1
filelock                      3.12.4
fsspec                        2023.9.1
huggingface-hub               0.17.2
idna                          3.4
intel-extension-for-pytorch   2.0.110+xpu
Jinja2                        3.1.2
MarkupSafe                    2.1.3
mkl-include                   2023.2.0
mkl-static                    2023.2.0
mpmath                        1.3.0
networkx                      3.1
ninja                         1.11.1
numpy                         1.26.0
omegaconf                     2.3.0
packaging                     23.1
pandas                        2.1.0
Pillow                        10.0.1
pip                           23.2.1
protobuf                      4.24.3
psutil                        5.9.5
py-cpuinfo                    9.0.0
python-dateutil               2.8.2
pytz                          2023.3.post1
PyYAML                        6.0.1
regex                         2023.8.8
requests                      2.31.0
safetensors                   0.3.3
sentencepiece                 0.1.99
setuptools                    68.0.0
six                           1.16.0
sympy                         1.12
tabulate                      0.9.0
tiktoken                      0.5.1
tokenizers                    0.13.3
torch                         2.0.1a0+cxx11.abi
torchvision                   0.15.2a0+cxx11.abi
tqdm                          4.66.1
transformers                  4.31.0
transformers-stream-generator 0.0.4
typing_extensions             4.8.0
tzdata                        2023.3
urllib3                       2.0.4
wheel                         0.38.4
