CNN benchmark cannot run #130
Comments
@ShawnXuan After that, I run into other errors. It seems the version of oneflow-benchmark is not consistent with the installed version of oneflow, which causes many of these errors.
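A quick way to confirm a version mismatch is to print the installed oneflow version and compare it against the benchmark checkout. This is only a rough sketch; the clone directory name `OneFlow-Benchmark` and the assumption that oneflow exposes `__version__` are mine, so adjust as needed:

```bash
# Print the installed oneflow version (assumes oneflow exposes __version__).
python3 -c "import oneflow; print(oneflow.__version__)"

# Show which benchmark commit/branch is checked out so it can be matched
# against a compatible oneflow release (path is an assumption).
cd OneFlow-Benchmark && git log -1 --oneline && git branch --show-current
```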
I can train BERT on a single node, but for two nodes I use this script:
And I get the following error:
Do you know what the reason is? @ShawnXuan
@JF-D Sorry, this is a stupid mistake in the script. Please uncomment the following line. We will update the script today.
Thanks a lot. The BERT benchmark now runs successfully. But I still cannot run the CNN benchmark, due to #130 (comment).
Thanks. This is due to an incompatibility between the benchmark scripts and the oneflow release. You can try building oneflow from source to use the latest code. We will also try our best to release a new version ASAP.
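For reference, building from source typically looks something like the sketch below. The repository URL is the public OneFlow repo on GitHub, but the CMake options and dependency steps are assumptions and may differ between releases, so please check the official build instructions:

```bash
# Fetch the OneFlow sources.
git clone https://github.com/Oneflow-Inc/oneflow.git
cd oneflow

# Out-of-source CMake build; the options shown (build type, CUDA settings)
# are assumptions and may need adjusting for your environment.
mkdir -p build && cd build
cmake .. -DCMAKE_BUILD_TYPE=Release
make -j"$(nproc)"
```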
The default value of
@ShawnXuan Thanks. I have one more question. You only release the speed of the BERT-base model; have you tried the BERT-large model? I can get a similar speed with the benchmark: BERT-base throughput is 145 samples/s. My machine is a 32 GB V100 (SXM2) with PyTorch 1.5 and CUDA 10.1. Since I have some BERT-large results measured about two months ago, I compared them with the oneflow benchmark. OneFlow BERT-large runs at about 45 samples/s (~1400 ms/iter), while my PyTorch result is about 800 ms/iter (single card, batch size 64). This result doesn't quite match the benchmark.
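For context, iteration time converts to throughput as batch_size / iteration_time. The batch size of 64 for the oneflow run is an assumption on my part (it is only stated explicitly for the PyTorch run):

```bash
# throughput (samples/s) = batch_size / iteration_time_in_seconds
# Assumes a per-device batch size of 64 for both measurements.
echo "scale=1; 64 / 1.4" | bc   # oneflow BERT-large: ~45.7 samples/s at ~1400 ms/iter
echo "scale=1; 64 / 0.8" | bc   # PyTorch BERT-large: ~80.0 samples/s at ~800 ms/iter
```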
@JF-D Thank you. We will look into BERT-large training.
I followed the instructions in your CNN benchmark to train ResNet-50 with sync data. After I executed train.sh, it failed with the following output. Can you offer some help?