Running sh train.sh hangs in the resnet50 benchmark with 4 and 8 GPUs on a single machine #152

Open
wuyujiji opened this issue Oct 28, 2020 · 1 comment


wuyujiji commented Oct 28, 2020

Question

Hi, I recently built the OneFlow environment and ran the resnet50 model from OneFlow-Benchmark. It runs successfully with 1 GPU and with 2 GPUs on a single machine, but hangs with 4 GPUs and with 8 GPUs on a single machine.

Environment

GPU: Tesla V100 16GB
Python: 3.6
CUDA: 10.0
cuDNN: 7
OneFlow: 0.2.0
OneFlow-Benchmark: master@f09f31ea8c3da6a1cc193081eb544b92d8e504c2

Log info:
NUM_EPOCH=2
DATA_ROOT=/workdir/data/mini-imagenet/ofrecord

Running resnet50: num_gpu_per_node = 4, num_nodes = 1.

dtype = float32
gpu_num_per_node = 4
num_nodes = 1
node_ips = ['192.168.1.13', '192.168.1.14']
ctrl_port = 50051
model = resnet50
use_fp16 = None
use_xla = None
channel_last = None
pad_output = None
num_epochs = 2
model_load_dir = None
batch_size_per_device = 128
val_batch_size_per_device = 50
nccl_fusion_threshold_mb = 0
nccl_fusion_max_ops = 0
fuse_bn_relu = False
fuse_bn_add_relu = False
gpu_image_decoder = False
image_path = test_img/tiger.jpg
num_classes = 1000
num_examples = 1281167
num_val_examples = 50000
rgb_mean = [123.68, 116.779, 103.939]
rgb_std = [58.393, 57.12, 57.375]
image_shape = [3, 224, 224]
label_smoothing = 0.1
model_save_dir = ./output/snapshots/model_save-20201028202443
log_dir = ./output
loss_print_every_n_iter = 100
image_size = 224
resize_shorter = 256
train_data_dir = /workdir/data/mini-imagenet/ofrecord/train
train_data_part_num = 8
val_data_dir = /workdir/data/mini-imagenet/ofrecord/val
val_data_part_num = 8
optimizer = sgd
learning_rate = 1.024
wd = 3.0517578125e-05
momentum = 0.875
lr_decay = cosine
lr_decay_rate = 0.94
lr_decay_epochs = 2
warmup_epochs = 5
decay_rate = 0.9
epsilon = 1.0
gradient_clipping = 0.0

Time stamp: 2020-10-28-20:24:43
Loading data from /workdir/data/mini-imagenet/ofrecord/train
Optimizer: SGD
Loading data from /workdir/data/mini-imagenet/ofrecord/val

Then it hangs there for a long time with no further output.
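
When it hangs like this, it can help to confirm whether the worker processes are busy or blocked. A minimal diagnostic sketch, assuming py-spy is available and <pid> stands for a hung worker's process id (both the tool choice and the placeholder are illustrative, not part of the original run):

    nvidia-smi                         # are the 4 workers using the GPUs at all?
    python3 -m pip install py-spy      # sampling profiler for running Python processes
    py-spy dump --pid <pid>            # print the Python stack of the hung worker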

To Reproduce

  1. Build the OneFlow environment
    python3 -m pip install --find-links https://oneflow-inc.github.io/nightly oneflow_cu100
  2. Clone the source of OneFlow-Benchmark
    git clone https://github.com/Oneflow-Inc/OneFlow-Benchmark.git
  3. Download mini-imagenet
    Note: to run multiple GPUs on one machine, I copied part-00000 into 8 pieces of data in the train and validation folders, respectively
  4. Change the content of the shell script
    cd Classification/cnns/
    vim train.sh
    set --train_data_part_num=8
    set --val_data_part_num=8
    set gpu_num_per_node=4 # GPU count is 1, 2, 4, or 8; 1 and 2 run normally, but 4 and 8 hang
  5. Run the shell script (a consolidated sketch of steps 3-5 follows this list)
    sh train.sh
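
For clarity, steps 3-5 can be summarized in one shell sketch. The DATA_ROOT path is taken from the log above; the part-0000N names for the copies and the copy loop itself are assumptions about how the duplication was done, so treat this as illustrative rather than a verbatim copy of the setup:

    # Step 3 (illustrative): duplicate the single ofrecord part into 8 parts so
    # that --train_data_part_num=8 / --val_data_part_num=8 find matching files.
    DATA_ROOT=/workdir/data/mini-imagenet/ofrecord
    for split in train val; do
        for i in 1 2 3 4 5 6 7; do
            cp "$DATA_ROOT/$split/part-00000" "$DATA_ROOT/$split/part-0000$i"
        done
    done

    # Steps 4-5: after editing train.sh as described above, launch the run
    cd OneFlow-Benchmark/Classification/cnns
    sh train.sh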
wuyujiji (Author) commented:
This is a new error with 3 GPUs:

F1028 20:39:07.626992 206314 collective_boxing_executor.cpp:452] Check failed: ncclGroupEnd() : unhandled system error (2)
*** Check failure stack trace: ***
@ 0x7f09480a08dd google::LogMessage::Fail()
@ 0x7f09480a4a1c google::LogMessage::SendToLog()
@ 0x7f09480a0403 google::LogMessage::Flush()
@ 0x7f09480a5439 google::LogMessageFatal::~LogMessageFatal()
@ 0x7f0947005ff4 oneflow::boxing::collective::NcclCollectiveBoxingExecutorBackend::Init()
@ 0x7f0947006e5d oneflow::boxing::collective::CollectiveBoxingExecutor::CollectiveBoxingExecutor()
@ 0x7f09470adcc9 oneflow::Runtime::NewAllGlobal()
@ 0x7f09470ae63e oneflow::Runtime::Runtime()
@ 0x7f0947095bf3 (unknown)
@ 0x7f0946e24425 (unknown)
@ 0x7f096f40a04a _PyCFunction_FastCallDict
@ 0x7f096f475a3f (unknown)
@ 0x7f096f46a0a7 _PyEval_EvalFrameDefault
@ 0x7f096f47582a (unknown)
@ 0x7f096f475b63 (unknown)
@ 0x7f096f46a0a7 _PyEval_EvalFrameDefault
@ 0x7f096f47582a (unknown)
@ 0x7f096f475b63 (unknown)
@ 0x7f096f475b63 (unknown)
@ 0x7f096f46a0a7 _PyEval_EvalFrameDefault
@ 0x7f096f47582a (unknown)
@ 0x7f096f475b63 (unknown)
@ 0x7f096f46a0a7 _PyEval_EvalFrameDefault
@ 0x7f096f474c5a (unknown)
@ 0x7f096f4758da (unknown)
@ 0x7f096f46a0a7 _PyEval_EvalFrameDefault
@ 0x7f096f476c5a _PyFunction_FastCallDict
@ 0x7f096f3cc6be _PyObject_FastCallDict
@ 0x7f096f3cc7d1 _PyObject_Call_Prepend
@ 0x7f096f3cc443 PyObject_Call
@ 0x7f096f41f555 (unknown)
@ 0x7f096f41bf12 (unknown)
