CNN benchmark cannot run #130

Open
JF-D opened this issue Sep 15, 2020 · 9 comments

Comments

@JF-D

JF-D commented Sep 15, 2020

I followed the instructions in your CNN benchmark to train resnet50 with synthetic data. After I executed train.sh, it failed with the following output. Can you offer some help?

------------------------------------------------------------------
Time stamp: 2020-09-15-13:38:02
Traceback (most recent call last):
  File "of_cnn_train_val.py", line 64, in <module>
    @flow.global_function("train", get_train_config(args))
  File "/home/duanjiangfei/OneFlow-Benchmark/Classification/cnns/job_function_util.py", line 33, in get_train_config
    train_config = _default_config(args)
  File "/home/duanjiangfei/OneFlow-Benchmark/Classification/cnns/job_function_util.py", line 28, in _default_config
    config.enable_fuse_add_to_output(True)
  File "/home/duanjiangfei/.local.pt1.5s1/lib/python3.7/site-packages/oneflow/python/framework/function_util.py", line 54, in __getattr__
    assert attr_name in name2default
AssertionError
@ShawnXuan
Collaborator

enable_fuse_add_to_output is a new feature that can speed up resnet50 training.
Please try commenting out the line config.enable_fuse_add_to_output(True) to avoid this error.
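If you want to keep the speed-up on newer builds, a try/except guard is an alternative to removing the line. This is only a sketch of such a guard (not the upstream code); note that the traceback above shows the old wheel raises AssertionError from __getattr__, so catching AttributeError alone would not be enough:

# hypothetical guard inside _default_config() in job_function_util.py
try:
    config.enable_fuse_add_to_output(True)  # fusion speed-up, only exposed by newer oneflow builds
except (AttributeError, AssertionError):
    pass  # older oneflow release: silently skip the fusion optimization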

@JF-D
Author

JF-D commented Sep 15, 2020

@ShawnXuan After that, I run into other errors. It seems the version of OneFlow-Benchmark is not consistent with the version of oneflow, which causes many errors.
[screenshot of the errors]

@JF-D
Author

JF-D commented Sep 15, 2020

I can train BERT on a single node. But for two nodes, I use this script:

NUM_NODES=$1
NODE_IPS=$2

DATA_DIR=/home/duanjiangfei/OneFlow-Benchmark/LanguageModeling/BERT/wiki_ofrecord_seq_len_128_example
python run_pretraining.py \
  --gpu_num_per_node=8 \
  --num_nodes=$NUM_NODES \
  --node_ips=$NODE_IPS \
  --learning_rate=1e-4 \
  --batch_size_per_device=64 \
  --iter_num=100 \
  --loss_print_every_n_iter=20 \
  --seq_length=128 \
  --max_predictions_per_seq=20 \
  --num_hidden_layers=12 \
  --num_attention_heads=12 \
  --max_position_embeddings=512 \
  --type_vocab_size=2 \
  --vocab_size=30522 \
  --attention_probs_dropout_prob=0.1 \
  --hidden_dropout_prob=0.1 \
  --hidden_size_per_head=64 \
  --data_part_num=1 \
  --data_dir=$DATA_DIR \
  --log_dir=./log \
  --model_save_every_n_iter=10000 \
  --save_last_snapshot=False \
  --model_save_dir=./snapshots

But I get the following error:

Time stamp: 2020-09-16-01:52:58
[libprotobuf ERROR /oneflow-src/manylinux2014-build-cache-cuda-10.1/build-third-party/protobuf/src/protobuf/src/google/protobuf/text_format.cc:303] Error parsing text-format oneflow.EnvProto: Message missing required fields: ctrl_port
WARNING: Logging before InitGoogleLogging() is written to STDERR
E0916 01:52:58.912607 194902 error.cpp:26]  Check failed: TxtString2PbMessage(env_proto_str, &env_proto)        failed to parse env_proto
machine {
  id: 0
  addr: "10.5.8.54"
}
machine {
  id: 1
  addr: "10.5.8.69"
}
cpp_logging_conf {
  log_dir: "./log"
}
grpc_use_no_signal: true
Traceback (most recent call last):
  File "run_pretraining.py", line 120, in <module>
    main()
  File "run_pretraining.py", line 104, in main
    snapshot = Snapshot(args.model_save_dir, args.model_load_dir)
  File "/home/duanjiangfei/OneFlow-Benchmark/LanguageModeling/BERT/util.py", line 48, in __init__
    self._check_point.init()
  File "/home/duanjiangfei/.local.pt1.5s1/lib/python3.7/site-packages/oneflow/python/framework/session_context.py", line 49$
 in Func
    GetDefaultSession().TryInit()
  File "/home/duanjiangfei/.local.pt1.5s1/lib/python3.7/site-packages/oneflow/python/framework/session_util.py", line 204, $
n TryInit
    self.Init()
  File "/home/duanjiangfei/.local.pt1.5s1/lib/python3.7/site-packages/oneflow/python/framework/session_util.py", line 211, $
n Init
    oneflow.env.init()
  File "/home/duanjiangfei/.local.pt1.5s1/lib/python3.7/site-packages/oneflow/python/framework/env_util.py", line 53, in api
_env_init
    return enable_if.unique([env_init, do_nothing])()
  File "/home/duanjiangfei/.local.pt1.5s1/lib/python3.7/site-packages/oneflow/python/framework/env_util.py", line 61, in env
_init
    c_api_util.InitEnv(default_env_proto)
  File "/home/duanjiangfei/.local.pt1.5s1/lib/python3.7/site-packages/oneflow/python/framework/c_api_util.py", line 100, in
InitEnv
    raise JobBuildAndInferError(error)
oneflow.python.framework.job_build_and_infer_error.JobBuildAndInferError:

error msg:


check_failed_error {
}

Check failed: TxtString2PbMessage(env_proto_str, &env_proto)    failed to parse env_proto
machine {
  id: 0
  addr: "10.5.8.54"
}
machine {
  id: 1
  addr: "10.5.8.69"
}
cpp_logging_conf {
  log_dir: "./log"
}
grpc_use_no_signal: true

Do you know what the reason is? @ShawnXuan

@yuanms2

yuanms2 commented Sep 16, 2020

@JF-D Sorry, this is a stupid mistake in the script. Please uncomment the following line; we will update the script today.

#flow.env.ctrl_port(12138)
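For clarity, a minimal sketch of the relevant env setup once that line is uncommented (only flow.env.ctrl_port is confirmed by this thread; 12138 is simply the port used in the original script, and the node list still comes from --num_nodes / --node_ips):

import oneflow as flow

# multi-node runs need an explicit control port; without it the generated
# EnvProto lacks the required ctrl_port field, which is exactly the
# "Message missing required fields: ctrl_port" parse error shown above
flow.env.ctrl_port(12138)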

@JF-D
Author

JF-D commented Sep 16, 2020

Thanks a lot. The BERT benchmark now runs successfully, but I still cannot run the CNN benchmark due to #130 (comment)

@yuanms2

yuanms2 commented Sep 16, 2020

Thanks. This is due to an incompatibility between the benchmark scripts and the oneflow release. You can try building from source to use the latest oneflow. We will also try our best to release a new version ASAP.

@ShawnXuan
Collaborator

ShawnXuan commented Sep 17, 2020

The default values of fuse_bn_relu and fuse_bn_add_relu have been temporarily changed to False, and will be switched back to True after the next oneflow release. Please update your code; it should be fixed. Thanks! @JF-D
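Assuming these are exposed as command-line flags in the benchmark's argument parser, a hypothetical sketch of the flipped defaults (not the exact upstream diff) looks like this:

import argparse

def str2bool(v):
    # stand-in parser for boolean command-line flags
    return str(v).lower() in ("1", "true", "yes", "y")

parser = argparse.ArgumentParser()
# defaults temporarily set to False so older oneflow wheels still run
parser.add_argument("--fuse_bn_relu", type=str2bool, default=False,
                    help="fuse batch_norm and relu; re-enable after the next oneflow release")
parser.add_argument("--fuse_bn_add_relu", type=str2bool, default=False,
                    help="fuse batch_norm, add and relu; re-enable after the next oneflow release")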

@JF-D
Author

JF-D commented Sep 18, 2020

@ShawnXuan Thanks.
@yuanms2 I think you should add some git tags to clarify the different versions of the benchmark.

I have one more question. You only release the speed of the BERT-base model; have you tried the BERT-large model? I can get similar speed using the benchmark: BERT-base throughput is 145 samples/s. My machine is a 32G V100 (SXM2) + pytorch1.5 + cuda 10.1. Since I have some BERT-large results that I measured about 2 months ago, I compared them with the oneflow benchmark. The oneflow BERT-large is about 45 samples/s (~1400 ms/iter), while my pytorch result is about 800 ms/iter (single card + bs64). This result doesn't quite match the benchmark.
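(As a sanity check on those numbers: at batch size 64 per device, ~1400 ms/iter works out to 64 / 1.4 s ≈ 46 samples/s, which matches the ~45 samples/s figure, while 800 ms/iter corresponds to 64 / 0.8 s = 80 samples/s.)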

@yuanms2

yuanms2 commented Sep 19, 2020

@JF-D Thank you. We will look into BERT-large training.
