Hello,
When I run this command:
torchrun --standalone --nproc_per_node=1 \bin/train.py --config configs/tblock4_train.yaml --output-path train_outputs
The following error appears:
(train) clientadmin@clientadmin-Precision-3660:~/likang/MPI/MLI_new$ torchrun --standalone --nproc_per_node=1 \bin/train.py --config configs/tblock4_train.yaml --output-path train_outputs
2023-11-10 16:19:31.813855: I tensorflow/core/platform/cpu_feature_guard.cc:193] This TensorFlow binary is optimized with oneAPI Deep Neural Network Library (oneDNN) to use the following CPU instructions in performance-critical operations: AVX2 AVX_VNNI FMA
To enable them in other operations, rebuild TensorFlow with the appropriate compiler flags.
Logging directory: train_outputs/outputs/tblock4_train_1gpu/log
20it [00:00, 2088.12it/s]
ERROR:torch.distributed.elastic.multiprocessing.api:failed (exitcode: 1) local_rank: 0 (pid: 7347) of binary: /home/clientadmin/anaconda3/envs/train/bin/python
Traceback (most recent call last):
  File "/home/clientadmin/anaconda3/envs/train/bin/torchrun", line 33, in <module>
    sys.exit(load_entry_point('torch==1.10.0', 'console_scripts', 'torchrun')())
  File "/home/clientadmin/anaconda3/envs/train/lib/python3.8/site-packages/torch/distributed/elastic/multiprocessing/errors/__init__.py", line 345, in wrapper
    return f(*args, **kwargs)
  File "/home/clientadmin/anaconda3/envs/train/lib/python3.8/site-packages/torch/distributed/run.py", line 719, in main
    run(args)
  File "/home/clientadmin/anaconda3/envs/train/lib/python3.8/site-packages/torch/distributed/run.py", line 710, in run
    elastic_launch(
  File "/home/clientadmin/anaconda3/envs/train/lib/python3.8/site-packages/torch/distributed/launcher/api.py", line 131, in __call__
    return launch_agent(self._config, self._entrypoint, list(args))
  File "/home/clientadmin/anaconda3/envs/train/lib/python3.8/site-packages/torch/distributed/launcher/api.py", line 259, in launch_agent
    raise ChildFailedError(
torch.distributed.elastic.multiprocessing.errors.ChildFailedError:
bin/train.py FAILED
Failures:
<NO_OTHER_FAILURES>
Root Cause (first observed failure):
[0]:
time : 2023-11-10_16:19:35
host : clientadmin-Precision-3660
rank : 0 (local_rank: 0)
exitcode : 1 (pid: 7347)
error_file: <N/A>
traceback : To enable traceback see: https://pytorch.org/docs/stable/elastic/errors.html
My environment: PyTorch 1.10.0 (build py3.8_cuda11.3_cudnn8.2.0_0) with CUDA 11.3.
Could you help me out?
Thank you very much.
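A note on the report above: it shows `error_file: <N/A>` and only the launcher's traceback, so the actual exception raised inside `bin/train.py` is not visible. The linked PyTorch page describes the `record` decorator for capturing it. Below is a minimal sketch of how that might look, assuming the script has a `main` entry point. The `train` function and its error message are hypothetical placeholders, not the real code from this repository, and the `ImportError` fallback is only there so the sketch runs without PyTorch installed.

```python
import sys
import traceback

try:
    # Available in PyTorch 1.10: reports the worker's exception to torchrun
    from torch.distributed.elastic.multiprocessing.errors import record
except ImportError:
    # Assumption: no-op fallback so this sketch also runs without PyTorch
    def record(fn):
        return fn


def train(config_path: str) -> None:
    """Hypothetical stand-in for the real training code in bin/train.py."""
    raise RuntimeError(f"simulated failure while loading {config_path}")


@record
def main(config_path: str) -> int:
    try:
        train(config_path)
    except Exception:
        # Print the worker's own traceback before torchrun summarizes it
        traceback.print_exc(file=sys.stderr)
        raise
    return 0

# In the real script this would be called from the entry point, e.g.:
#     if __name__ == "__main__":
#         main("configs/tblock4_train.yaml")
```

With the entry point decorated this way, the "Root Cause" section of the torchrun report should include the worker's traceback instead of `<N/A>`, which would show what actually failed after the `20it` progress line.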