Question
Hi, I recently built the OneFlow environment and ran the ResNet-50 example from OneFlow-Benchmark. It runs successfully with 1 GPU and with 2 GPUs on a single machine, but it hangs with 4 GPUs and with 8 GPUs on a single machine.
Environment
gpu: Tesla V100 16GB
python: 3.6
cuda: 10.0
cudnn: 7
oneflow: 0.2.0
OneFlow-benchmark: master@f09f31ea8c3da6a1cc193081eb544b92d8e504c2
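For reference, the environment above can be confirmed with the commands below (a minimal sketch; it assumes the wheel was installed under the name oneflow_cu100, matching the install command in the reproduction steps):
python3 -V                         # Python version (3.6)
python3 -m pip show oneflow_cu100  # installed OneFlow wheel and its version
nvidia-smi                         # GPU model (Tesla V100 16GB) and driver
nvcc --version                     # CUDA toolkit version (10.0)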
Log info:
NUM_EPOCH=2
DATA_ROOT=/workdir/data/mini-imagenet/ofrecord
Running resnet50: num_gpu_per_node = 4, num_nodes = 1.
dtype = float32
gpu_num_per_node = 4
num_nodes = 1
node_ips = ['192.168.1.13', '192.168.1.14']
ctrl_port = 50051
model = resnet50
use_fp16 = None
use_xla = None
channel_last = None
pad_output = None
num_epochs = 2
model_load_dir = None
batch_size_per_device = 128
val_batch_size_per_device = 50
nccl_fusion_threshold_mb = 0
nccl_fusion_max_ops = 0
fuse_bn_relu = False
fuse_bn_add_relu = False
gpu_image_decoder = False
image_path = test_img/tiger.jpg
num_classes = 1000
num_examples = 1281167
num_val_examples = 50000
rgb_mean = [123.68, 116.779, 103.939]
rgb_std = [58.393, 57.12, 57.375]
image_shape = [3, 224, 224]
label_smoothing = 0.1
model_save_dir = ./output/snapshots/model_save-20201028202443
log_dir = ./output
loss_print_every_n_iter = 100
image_size = 224
resize_shorter = 256
train_data_dir = /workdir/data/mini-imagenet/ofrecord/train
train_data_part_num = 8
val_data_dir = /workdir/data/mini-imagenet/ofrecord/val
val_data_part_num = 8
optimizer = sgd
learning_rate = 1.024
wd = 3.0517578125e-05
momentum = 0.875
lr_decay = cosine
lr_decay_rate = 0.94
lr_decay_epochs = 2
warmup_epochs = 5
decay_rate = 0.9
epsilon = 1.0
gradient_clipping = 0.0
Time stamp: 2020-10-28-20:24:43
Loading data from /workdir/data/mini-imagenet/ofrecord/train
Optimizer: SGD
Loading data from /workdir/data/mini-imagenet/ofrecord/val
Then it hangs for a long time with no further output.
To Reproduce
python3 -m pip install --find-links https://oneflow-inc.github.io/nightly oneflow_cu100
git clone https://github.com/Oneflow-Inc/OneFlow-Benchmark.git
Note: to run multi-GPU on a single machine, I copied part-00000 into 8 data parts in the train and validation folders, respectively (see the sketch after these steps).
cd Classification/cnns/
vim train.sh
set --train_data_part_num=8
set --val_data_part_num=8
set gpu_num_per_node=4 # tried 1, 2, 4, and 8 GPUs in turn; 1 and 2 run normally, but 4 and 8 hang
sh train.sh
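Putting the steps together, this is roughly what the data preparation and script edits look like (a sketch for illustration; the cp loop and the part-0000N file names are my assumption based on the existing part-00000, and only the flags listed here were changed in Classification/cnns/train.sh, everything else was left at the repo defaults):
# duplicate part-00000 into 8 parts so *_data_part_num=8 has files to read
cd /workdir/data/mini-imagenet/ofrecord
for i in 1 2 3 4 5 6 7; do
  cp train/part-00000 train/part-0000$i
  cp val/part-00000 val/part-0000$i
done
# flags changed in Classification/cnns/train.sh (other options unchanged):
#   --train_data_part_num=8
#   --val_data_part_num=8
#   gpu_num_per_node=4   # 1 and 2 work; 4 and 8 hang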