Baichuan devices #554

Merged: 9 commits merged into main on Sep 20, 2024

Conversation

@ShawnXuan (Contributor) commented on Sep 19, 2024

Inference

  • cuda PASS
python projects/Baichuan/pipeline.py --mode=huggingface --model_path=/root/models/Baichuan2-7B-Chat
  • xpu PASS
python projects/Baichuan/pipeline.py --mode=huggingface --device=xpu --model_path=/root/models/Baichuan2-7B-Chat
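
The two inference runs differ only in the --device flag. As a rough illustration of what that flag implies (not the actual projects/Baichuan/pipeline.py code), device selection usually reduces to building a flow.device from the argument and placing the model and inputs on it. The sketch below assumes an oneflow build that registers "xpu" as a device type, which is what this PR targets; the model loading is only a placeholder.

import argparse
import oneflow as flow

parser = argparse.ArgumentParser()
# Same flags as the commands above; "xpu" only works on an oneflow build
# that ships the xpu device type (an assumption of this sketch).
parser.add_argument("--device", default="cuda", choices=["cuda", "xpu", "cpu"])
parser.add_argument("--model_path", required=True)
args = parser.parse_args()

device = flow.device(args.device)
# args.model_path would be passed to the real Baichuan2 model/tokenizer loader;
# here a dummy token tensor stands in for the pipeline inputs.
tokens = flow.ones(1, 8, dtype=flow.int64).to(device)
print("inputs placed on", tokens.device)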

Training

  • cuda PASS, but total_loss turns NaN at iteration 88 (a NaN-guard sketch follows the xpu command below)
export NUM_GPUS=8
python3 -m oneflow.distributed.launch \
    --nproc_per_node ${NUM_GPUS} \
    --nnodes 1 \
    --node_rank 0 \
    --master_addr 127.0.0.1 \
    --master_port 12345 \
        tools/train_net.py --config-file=projects/Baichuan/configs/baichuan_sft.py \
            graph.enabled=True \
            train.input_placement_device="cuda" \
            train.dist.device_type="cuda" \
            train.dist.pipeline_parallel_size=${NUM_GPUS}
[09/19 14:39:40 lb.utils.events]:  eta: 22:07:15  iteration: 87/18660  consumed_samples: 704  total_loss: 10.36  time: 4.2893 s/iter  data_time: 0.0105 s/iter total_throughput: 1.87 samples/s lr: 6.99e-07
[09/19 14:39:44 lb.utils.events]:  eta: 22:07:07  iteration: 88/18660  consumed_samples: 712  total_loss: nan  time: 4.2889 s/iter  data_time: 0.0104 s/iter total_throughput: 1.87 samples/s lr: 7.07e-07
NaN or Inf found in input tensor.
  • xpu OOM after 7 iterations
export NUM_GPUS=1
python3 -m oneflow.distributed.launch \
    --nproc_per_node ${NUM_GPUS} \
    --nnodes 1 \
    --node_rank 0 \
    --master_addr 127.0.0.1 \
    --master_port 12345 \
        tools/train_net.py --config-file=projects/Baichuan/configs/baichuan_sft.py \
            graph.enabled=False \
            train.input_placement_device="xpu" \
            train.dist.device_type="xpu" \
            train.dist.pipeline_parallel_size=${NUM_GPUS}
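
Two follow-up notes on the training results above. For the cuda run, the loss jumps from 10.36 to nan between iterations 87 and 88, which usually points at a numerical blow-up (overflow or exploding gradients) rather than a data problem. A cheap way to localize and soften it is to guard each optimizer step with a finiteness check plus gradient clipping; the sketch below uses plain oneflow calls and is not LiBai's actual trainer hook.

import oneflow as flow

def guarded_step(model, optimizer, loss, max_norm=1.0):
    # Skip the update if the loss is already NaN/Inf so one bad batch
    # does not poison the weights.
    if flow.isnan(loss).any() or flow.isinf(loss).any():
        optimizer.zero_grad()
        return False
    loss.backward()
    # Clipping the gradient norm is the usual first mitigation for a
    # sudden loss -> nan jump; 1.0 is just a common starting value.
    flow.nn.utils.clip_grad_norm_(model.parameters(), max_norm)
    optimizer.step()
    optimizer.zero_grad()
    return True

For the xpu run, an OOM after 7 iterations with graph.enabled=False may indicate memory growth during eager training rather than an immediate capacity problem. The usual first steps are lowering the micro batch size or sequence length in projects/Baichuan/configs/baichuan_sft.py (standard LiBai train configs expose a train.train_micro_batch_size field; treating that key name as an assumption here), or enabling activation checkpointing if the config supports it.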

@ShawnXuan merged commit 13f1b12 into main on Sep 20, 2024
2 checks passed
@ShawnXuan deleted the baichuan_devices branch on Sep 20, 2024 at 13:28