Baichuan devices #554

Merged: 9 commits merged into main on Sep 20, 2024

Conversation

@ShawnXuan (Contributor) commented on Sep 19, 2024

Inference

  • cuda PASS
python projects/Baichuan/pipeline.py --mode=huggingface --model_path=/root/models/Baichuan2-7B-Chat
  • xpu PASS
python projects/Baichuan/pipeline.py --mode=huggingface --device=xpu --model_path=/root/models/Baichuan2-7B-Chat
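
The two inference runs differ only in the --device flag. As a rough illustration of what that flag implies (not the actual projects/Baichuan/pipeline.py code), device selection usually reduces to building a flow.device from the argument and placing the model and inputs on it. The sketch below assumes an oneflow build that registers "xpu" as a device type, which is what this PR targets; the model loading is only a placeholder.

import argparse
import oneflow as flow

parser = argparse.ArgumentParser()
# Same flags as the commands above; "xpu" only works on an oneflow build
# that ships the xpu device type (an assumption of this sketch).
parser.add_argument("--device", default="cuda", choices=["cuda", "xpu", "cpu"])
parser.add_argument("--model_path", required=True)
args = parser.parse_args()

device = flow.device(args.device)
# args.model_path would be passed to the real Baichuan2 model/tokenizer loader;
# here a dummy token tensor stands in for the pipeline inputs.
tokens = flow.ones(1, 8, dtype=flow.int64).to(device)
print("inputs placed on", tokens.device)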

Training

  • cuda PASS, but total_loss turns NaN at iteration 88 (a NaN-guard sketch follows the xpu command below)
export NUM_GPUS=8
python3 -m oneflow.distributed.launch \
    --nproc_per_node ${NUM_GPUS} \
    --nnodes 1 \
    --node_rank 0 \
    --master_addr 127.0.0.1 \
    --master_port 12345 \
        tools/train_net.py --config-file=projects/Baichuan/configs/baichuan_sft.py \
            graph.enabled=True \
            train.input_placement_device="cuda" \
            train.dist.device_type="cuda" \
            train.dist.pipeline_parallel_size=${NUM_GPUS}
[09/19 14:39:40 lb.utils.events]:  eta: 22:07:15  iteration: 87/18660  consumed_samples: 704  total_loss: 10.36  time: 4.2893 s/iter  data_time: 0.0105 s/iter total_throughput: 1.87 samples/s lr: 6.99e-07
[09/19 14:39:44 lb.utils.events]:  eta: 22:07:07  iteration: 88/18660  consumed_samples: 712  total_loss: nan  time: 4.2889 s/iter  data_time: 0.0104 s/iter total_throughput: 1.87 samples/s lr: 7.07e-07
NaN or Inf found in input tensor.
  • xpu OOM after 7 iterations
export NUM_GPUS=1
python3 -m oneflow.distributed.launch \
    --nproc_per_node ${NUM_GPUS} \
    --nnodes 1 \
    --node_rank 0 \
    --master_addr 127.0.0.1 \
    --master_port 12345 \
        tools/train_net.py --config-file=projects/Baichuan/configs/baichuan_sft.py \
            graph.enabled=False \
            train.input_placement_device="xpu" \
            train.dist.device_type="xpu" \
            train.dist.pipeline_parallel_size=${NUM_GPUS}
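
Two follow-up notes on the training results above. For the cuda run, the loss jumps from 10.36 to nan between iterations 87 and 88, which usually points at a numerical blow-up (overflow or exploding gradients) rather than a data problem. A cheap way to localize and soften it is to guard each optimizer step with a finiteness check plus gradient clipping; the sketch below uses plain oneflow calls and is not LiBai's actual trainer hook.

import oneflow as flow

def guarded_step(model, optimizer, loss, max_norm=1.0):
    # Skip the update if the loss is already NaN/Inf so one bad batch
    # does not poison the weights.
    if flow.isnan(loss).any() or flow.isinf(loss).any():
        optimizer.zero_grad()
        return False
    loss.backward()
    # Clipping the gradient norm is the usual first mitigation for a
    # sudden loss -> nan jump; 1.0 is just a common starting value.
    flow.nn.utils.clip_grad_norm_(model.parameters(), max_norm)
    optimizer.step()
    optimizer.zero_grad()
    return True

For the xpu run, an OOM after 7 iterations with graph.enabled=False may indicate memory growth during eager training rather than an immediate capacity problem. The usual first steps are lowering the micro batch size or sequence length in projects/Baichuan/configs/baichuan_sft.py (standard LiBai train configs expose a train.train_micro_batch_size field; treating that key name as an assumption here), or enabling activation checkpointing if the config supports it.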

@ShawnXuan merged commit 13f1b12 into main on Sep 20, 2024
2 checks passed
@ShawnXuan deleted the baichuan_devices branch on Sep 20, 2024 at 13:28