CNN benchmark cannot run #130

Open
JF-D opened this issue Sep 15, 2020 · 9 comments

Comments

@JF-D

JF-D commented Sep 15, 2020

I followed the instructions in your CNN benchmark to train resnet50 with synthetic data. After I executed train.sh, it failed with the following output. Can you offer some help?

------------------------------------------------------------------
Time stamp: 2020-09-15-13:38:02
Traceback (most recent call last):
  File "of_cnn_train_val.py", line 64, in <module>
    @flow.global_function("train", get_train_config(args))
  File "/home/duanjiangfei/OneFlow-Benchmark/Classification/cnns/job_function_util.py", line 33, in get_train_config
    train_config = _default_config(args)
  File "/home/duanjiangfei/OneFlow-Benchmark/Classification/cnns/job_function_util.py", line 28, in _default_config
    config.enable_fuse_add_to_output(True)
  File "/home/duanjiangfei/.local.pt1.5s1/lib/python3.7/site-packages/oneflow/python/framework/function_util.py", line 54, in __getattr__
    assert attr_name in name2default
AssertionError
@ShawnXuan
Collaborator

enable_fuse_add_to_output is a new feature that can speed up resnet50 training.
Please try commenting out the line config.enable_fuse_add_to_output(True) to avoid this error.
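If you want to keep the speed-up on newer builds, a try/except guard is an alternative to removing the line. This is only a sketch of such a guard (not the upstream code); note that the traceback above shows the old wheel raises AssertionError from __getattr__, so catching AttributeError alone would not be enough:

# hypothetical guard inside _default_config() in job_function_util.py
try:
    config.enable_fuse_add_to_output(True)  # fusion speed-up, only exposed by newer oneflow builds
except (AttributeError, AssertionError):
    pass  # older oneflow release: silently skip the fusion optimization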

@JF-D
Author

JF-D commented Sep 15, 2020

@ShawnXuan After that, I run into other errors. It seems the version of OneFlow-Benchmark is not consistent with the version of oneflow, which causes many errors.
[screenshot of the errors]

@JF-D
Author

JF-D commented Sep 15, 2020

I can train BERT on a single node. But for two nodes, I use this script:

NUM_NODES=$1
NODE_IPS=$2

DATA_DIR=/home/duanjiangfei/OneFlow-Benchmark/LanguageModeling/BERT/wiki_ofrecord_seq_len_128_example
python run_pretraining.py \
  --gpu_num_per_node=8 \
  --num_nodes=$NUM_NODES \
  --node_ips=$NODE_IPS \
  --learning_rate=1e-4 \
  --batch_size_per_device=64 \
  --iter_num=100 \
  --loss_print_every_n_iter=20 \
  --seq_length=128 \
  --max_predictions_per_seq=20 \
  --num_hidden_layers=12 \
  --num_attention_heads=12 \
  --max_position_embeddings=512 \
  --type_vocab_size=2 \
  --vocab_size=30522 \
  --attention_probs_dropout_prob=0.1 \
  --hidden_dropout_prob=0.1 \
  --hidden_size_per_head=64 \
  --data_part_num=1 \
  --data_dir=$DATA_DIR \
  --log_dir=./log \
  --model_save_every_n_iter=10000 \
  --save_last_snapshot=False \
  --model_save_dir=./snapshots

But I get the following error:

Time stamp: 2020-09-16-01:52:58
[libprotobuf ERROR /oneflow-src/manylinux2014-build-cache-cuda-10.1/build-third-party/protobuf/src/protobuf/src/google/protobuf/text_format.cc:303] Error parsing text-format oneflow.EnvProto: Message missing required fields: ctrl_port
WARNING: Logging before InitGoogleLogging() is written to STDERR
E0916 01:52:58.912607 194902 error.cpp:26]  Check failed: TxtString2PbMessage(env_proto_str, &env_proto)        failed to parse env_proto
machine {
  id: 0
  addr: "10.5.8.54"
}
machine {
  id: 1
  addr: "10.5.8.69"
}
cpp_logging_conf {
  log_dir: "./log"
}
grpc_use_no_signal: true
Traceback (most recent call last):
  File "run_pretraining.py", line 120, in <module>
    main()
  File "run_pretraining.py", line 104, in main
    snapshot = Snapshot(args.model_save_dir, args.model_load_dir)
  File "/home/duanjiangfei/OneFlow-Benchmark/LanguageModeling/BERT/util.py", line 48, in __init__
    self._check_point.init()
  File "/home/duanjiangfei/.local.pt1.5s1/lib/python3.7/site-packages/oneflow/python/framework/session_context.py", line 49$
 in Func
    GetDefaultSession().TryInit()
  File "/home/duanjiangfei/.local.pt1.5s1/lib/python3.7/site-packages/oneflow/python/framework/session_util.py", line 204, $
n TryInit
    self.Init()
  File "/home/duanjiangfei/.local.pt1.5s1/lib/python3.7/site-packages/oneflow/python/framework/session_util.py", line 211, $
n Init
    oneflow.env.init()
  File "/home/duanjiangfei/.local.pt1.5s1/lib/python3.7/site-packages/oneflow/python/framework/env_util.py", line 53, in api
_env_init
    return enable_if.unique([env_init, do_nothing])()
  File "/home/duanjiangfei/.local.pt1.5s1/lib/python3.7/site-packages/oneflow/python/framework/env_util.py", line 61, in env
_init
    c_api_util.InitEnv(default_env_proto)
  File "/home/duanjiangfei/.local.pt1.5s1/lib/python3.7/site-packages/oneflow/python/framework/c_api_util.py", line 100, in
InitEnv
    raise JobBuildAndInferError(error)
oneflow.python.framework.job_build_and_infer_error.JobBuildAndInferError:

error msg:


check_failed_error {
}

Check failed: TxtString2PbMessage(env_proto_str, &env_proto)    failed to parse env_proto
machine {
  id: 0
  addr: "10.5.8.54"
}
machine {
  id: 1
  addr: "10.5.8.69"
}
cpp_logging_conf {
  log_dir: "./log"
}
grpc_use_no_signal: true

Do you know what the reason is? @ShawnXuan

@yuanms2

yuanms2 commented Sep 16, 2020

@JF-D Sorry, this is a stupid mistake in the script. Please uncomment the following line; we will update the script today.

#flow.env.ctrl_port(12138)
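For clarity, a minimal sketch of the relevant env setup once that line is uncommented (only flow.env.ctrl_port is confirmed by this thread; 12138 is simply the port used in the original script, and the node list still comes from --num_nodes / --node_ips):

import oneflow as flow

# multi-node runs need an explicit control port; without it the generated
# EnvProto lacks the required ctrl_port field, which is exactly the
# "Message missing required fields: ctrl_port" parse error shown above
flow.env.ctrl_port(12138)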

@JF-D
Author

JF-D commented Sep 16, 2020

Thanks a lot. The BERT benchmark now runs successfully, but I still cannot run the CNN benchmark due to #130 (comment)

@yuanms2

yuanms2 commented Sep 16, 2020

Thanks. This is due to an incompatibility between the benchmark scripts and the oneflow release. You can try building from source to use the latest oneflow. We will also try our best to release a new version ASAP.

@ShawnXuan
Collaborator

ShawnXuan commented Sep 17, 2020

The default values of fuse_bn_relu and fuse_bn_add_relu have been temporarily changed to False, and will be switched back to True after the next oneflow release. Please update your code; it should be fixed. Thanks! @JF-D
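Assuming these are exposed as command-line flags in the benchmark's argument parser, a hypothetical sketch of the flipped defaults (not the exact upstream diff) looks like this:

import argparse

def str2bool(v):
    # stand-in parser for boolean command-line flags
    return str(v).lower() in ("1", "true", "yes", "y")

parser = argparse.ArgumentParser()
# defaults temporarily set to False so older oneflow wheels still run
parser.add_argument("--fuse_bn_relu", type=str2bool, default=False,
                    help="fuse batch_norm and relu; re-enable after the next oneflow release")
parser.add_argument("--fuse_bn_add_relu", type=str2bool, default=False,
                    help="fuse batch_norm, add and relu; re-enable after the next oneflow release")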

@JF-D
Author

JF-D commented Sep 18, 2020

@ShawnXuan Thanks.
@yuanms2 I think you should add some git tags to clarify the different versions of the benchmark.

I have one more question. You only release the speed of the BERT-base model; have you tried the BERT-large model? I can get similar speed using the benchmark: BERT-base throughput is 145 samples/s. My machine is a 32G V100 (SXM2) + pytorch1.5 + cuda 10.1. Since I have some BERT-large results that I measured about 2 months ago, I compared them with the oneflow benchmark. The oneflow BERT-large is about 45 samples/s (~1400 ms/iter), while my pytorch result is about 800 ms/iter (single card + bs64). This result doesn't quite match the benchmark.
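(As a sanity check on those numbers: at batch size 64 per device, ~1400 ms/iter works out to 64 / 1.4 s ≈ 46 samples/s, which matches the ~45 samples/s figure, while 800 ms/iter corresponds to 64 / 0.8 s = 80 samples/s.)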

@yuanms2

yuanms2 commented Sep 19, 2020

@JF-D Thank you. We will look into BERT-large training.
