Running sh train.sh hangs in the resnet50 benchmark with 4 and 8 GPUs on a single machine #152

Open
wuyujiji opened this issue Oct 28, 2020 · 1 comment


wuyujiji commented Oct 28, 2020

Question

Hi, I recently built the OneFlow environment and ran the resnet50 model from OneFlow-Benchmark. It runs successfully with 1 GPU and with 2 GPUs on a single machine, but hangs with 4 GPUs and with 8 GPUs on a single machine.

Environment

GPU: Tesla V100 16GB
Python: 3.6
CUDA: 10.0
cuDNN: 7
OneFlow: 0.2.0
OneFlow-Benchmark: master@f09f31ea8c3da6a1cc193081eb544b92d8e504c2

Log info:
NUM_EPOCH=2
DATA_ROOT=/workdir/data/mini-imagenet/ofrecord

Running resnet50: num_gpu_per_node = 4, num_nodes = 1.

dtype = float32
gpu_num_per_node = 4
num_nodes = 1
node_ips = ['192.168.1.13', '192.168.1.14']
ctrl_port = 50051
model = resnet50
use_fp16 = None
use_xla = None
channel_last = None
pad_output = None
num_epochs = 2
model_load_dir = None
batch_size_per_device = 128
val_batch_size_per_device = 50
nccl_fusion_threshold_mb = 0
nccl_fusion_max_ops = 0
fuse_bn_relu = False
fuse_bn_add_relu = False
gpu_image_decoder = False
image_path = test_img/tiger.jpg
num_classes = 1000
num_examples = 1281167
num_val_examples = 50000
rgb_mean = [123.68, 116.779, 103.939]
rgb_std = [58.393, 57.12, 57.375]
image_shape = [3, 224, 224]
label_smoothing = 0.1
model_save_dir = ./output/snapshots/model_save-20201028202443
log_dir = ./output
loss_print_every_n_iter = 100
image_size = 224
resize_shorter = 256
train_data_dir = /workdir/data/mini-imagenet/ofrecord/train
train_data_part_num = 8
val_data_dir = /workdir/data/mini-imagenet/ofrecord/val
val_data_part_num = 8
optimizer = sgd
learning_rate = 1.024
wd = 3.0517578125e-05
momentum = 0.875
lr_decay = cosine
lr_decay_rate = 0.94
lr_decay_epochs = 2
warmup_epochs = 5
decay_rate = 0.9
epsilon = 1.0
gradient_clipping = 0.0

Time stamp: 2020-10-28-20:24:43
Loading data from /workdir/data/mini-imagenet/ofrecord/train
Optimizer: SGD
Loading data from /workdir/data/mini-imagenet/ofrecord/val

Then it hangs there for a long time with no further output.
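
When it hangs like this, it can help to confirm whether the worker processes are busy or blocked. A minimal diagnostic sketch, assuming py-spy is available and <pid> stands for a hung worker's process id (both the tool choice and the placeholder are illustrative, not part of the original run):

    nvidia-smi                         # are the 4 workers using the GPUs at all?
    python3 -m pip install py-spy      # sampling profiler for running Python processes
    py-spy dump --pid <pid>            # print the Python stack of the hung worker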

To Reproduce

  1. Build the OneFlow environment
    python3 -m pip install --find-links https://oneflow-inc.github.io/nightly oneflow_cu100
  2. Clone the source of OneFlow-Benchmark
    git clone https://github.com/Oneflow-Inc/OneFlow-Benchmark.git
  3. Download mini-imagenet
    Note: to run multiple GPUs on one machine, I copied part-00000 into 8 pieces of data in the train and validation folders, respectively
  4. Change the content of the shell script
    cd Classification/cnns/
    vim train.sh
    set --train_data_part_num=8
    set --val_data_part_num=8
    set gpu_num_per_node=4 # GPU count is 1, 2, 4, or 8; 1 and 2 run normally, but 4 and 8 hang
  5. Run the shell script (a consolidated sketch of steps 3-5 follows this list)
    sh train.sh
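
For clarity, steps 3-5 can be summarized in one shell sketch. The DATA_ROOT path is taken from the log above; the part-0000N names for the copies and the copy loop itself are assumptions about how the duplication was done, so treat this as illustrative rather than a verbatim copy of the setup:

    # Step 3 (illustrative): duplicate the single ofrecord part into 8 parts so
    # that --train_data_part_num=8 / --val_data_part_num=8 find matching files.
    DATA_ROOT=/workdir/data/mini-imagenet/ofrecord
    for split in train val; do
        for i in 1 2 3 4 5 6 7; do
            cp "$DATA_ROOT/$split/part-00000" "$DATA_ROOT/$split/part-0000$i"
        done
    done

    # Steps 4-5: after editing train.sh as described above, launch the run
    cd OneFlow-Benchmark/Classification/cnns
    sh train.sh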
wuyujiji (Author) commented:
This is a new error with 3 GPUs:

F1028 20:39:07.626992 206314 collective_boxing_executor.cpp:452] Check failed: ncclGroupEnd() : unhandled system error (2)
*** Check failure stack trace: ***
@ 0x7f09480a08dd google::LogMessage::Fail()
@ 0x7f09480a4a1c google::LogMessage::SendToLog()
@ 0x7f09480a0403 google::LogMessage::Flush()
@ 0x7f09480a5439 google::LogMessageFatal::~LogMessageFatal()
@ 0x7f0947005ff4 oneflow::boxing::collective::NcclCollectiveBoxingExecutorBackend::Init()
@ 0x7f0947006e5d oneflow::boxing::collective::CollectiveBoxingExecutor::CollectiveBoxingExecutor()
@ 0x7f09470adcc9 oneflow::Runtime::NewAllGlobal()
@ 0x7f09470ae63e oneflow::Runtime::Runtime()
@ 0x7f0947095bf3 (unknown)
@ 0x7f0946e24425 (unknown)
@ 0x7f096f40a04a _PyCFunction_FastCallDict
@ 0x7f096f475a3f (unknown)
@ 0x7f096f46a0a7 _PyEval_EvalFrameDefault
@ 0x7f096f47582a (unknown)
@ 0x7f096f475b63 (unknown)
@ 0x7f096f46a0a7 _PyEval_EvalFrameDefault
@ 0x7f096f47582a (unknown)
@ 0x7f096f475b63 (unknown)
@ 0x7f096f475b63 (unknown)
@ 0x7f096f46a0a7 _PyEval_EvalFrameDefault
@ 0x7f096f47582a (unknown)
@ 0x7f096f475b63 (unknown)
@ 0x7f096f46a0a7 _PyEval_EvalFrameDefault
@ 0x7f096f474c5a (unknown)
@ 0x7f096f4758da (unknown)
@ 0x7f096f46a0a7 _PyEval_EvalFrameDefault
@ 0x7f096f476c5a _PyFunction_FastCallDict
@ 0x7f096f3cc6be _PyObject_FastCallDict
@ 0x7f096f3cc7d1 _PyObject_Call_Prepend
@ 0x7f096f3cc443 PyObject_Call
@ 0x7f096f41f555 (unknown)
@ 0x7f096f41bf12 (unknown)
