ERROR:torch.distributed.elastic.multiprocessing.api:failed (exitcode: 1) local_rank: 0 (pid: 7347) of binary #18

Open
bearatom opened this issue Nov 10, 2023 · 0 comments

Hello,

When I run this command:
torchrun --standalone --nproc_per_node=1 \bin/train.py --config configs/tblock4_train.yaml --output-path train_outputs

The following error appears:
(train) clientadmin@clientadmin-Precision-3660:~/likang/MPI/MLI_new$ torchrun --standalone --nproc_per_node=1 \bin/train.py --config configs/tblock4_train.yaml --output-path train_outputs
2023-11-10 16:19:31.813855: I tensorflow/core/platform/cpu_feature_guard.cc:193] This TensorFlow binary is optimized with oneAPI Deep Neural Network Library (oneDNN) to use the following CPU instructions in performance-critical operations: AVX2 AVX_VNNI FMA
To enable them in other operations, rebuild TensorFlow with the appropriate compiler flags.
Logging directory: train_outputs/outputs/tblock4_train_1gpu/log
20it [00:00, 2088.12it/s]
ERROR:torch.distributed.elastic.multiprocessing.api:failed (exitcode: 1) local_rank: 0 (pid: 7347) of binary: /home/clientadmin/anaconda3/envs/train/bin/python
Traceback (most recent call last):
File "/home/clientadmin/anaconda3/envs/train/bin/torchrun", line 33, in <module>
sys.exit(load_entry_point('torch==1.10.0', 'console_scripts', 'torchrun')())
File "/home/clientadmin/anaconda3/envs/train/lib/python3.8/site-packages/torch/distributed/elastic/multiprocessing/errors/__init__.py", line 345, in wrapper
return f(*args, **kwargs)
File "/home/clientadmin/anaconda3/envs/train/lib/python3.8/site-packages/torch/distributed/run.py", line 719, in main
run(args)
File "/home/clientadmin/anaconda3/envs/train/lib/python3.8/site-packages/torch/distributed/run.py", line 710, in run
elastic_launch(
File "/home/clientadmin/anaconda3/envs/train/lib/python3.8/site-packages/torch/distributed/launcher/api.py", line 131, in __call__
return launch_agent(self._config, self._entrypoint, list(args))
File "/home/clientadmin/anaconda3/envs/train/lib/python3.8/site-packages/torch/distributed/launcher/api.py", line 259, in launch_agent
raise ChildFailedError(
torch.distributed.elastic.multiprocessing.errors.ChildFailedError:

bin/train.py FAILED

Failures:
<NO_OTHER_FAILURES>

Root Cause (first observed failure):
[0]:
time : 2023-11-10_16:19:35
host : clientadmin-Precision-3660
rank : 0 (local_rank: 0)
exitcode : 1 (pid: 7347)
error_file: <N/A>
traceback : To enable traceback see: https://pytorch.org/docs/stable/elastic/errors.html
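
For reference, the page linked in the `traceback` line describes decorating the script's entry point with `record` from `torch.distributed.elastic.multiprocessing.errors`, so that the worker's real exception is reported instead of only `exitcode: 1`. A minimal sketch of that idea follows. The `main` function and its `fail` parameter are hypothetical stand-ins (the actual structure of bin/train.py is not shown in this issue), and the `ImportError` fallback is only there so the sketch runs where torch is not installed.

```python
# Sketch: surface the worker's real exception when launched by torchrun.
# `main` is a hypothetical stand-in for the entry point of bin/train.py.
import traceback

try:
    # Real PyTorch API: records the worker's exception for the launcher.
    from torch.distributed.elastic.multiprocessing.errors import record
except ImportError:
    # Fallback (assumption for illustration): no-op decorator without torch.
    def record(fn):
        return fn


@record
def main(fail=False):
    """Hypothetical training entry point."""
    if fail:
        raise RuntimeError("simulated failure inside the training script")
    return "ok"


def run_and_report(fail=False):
    """Run main() and return (result, formatted traceback or None)."""
    try:
        return main(fail=fail), None
    except Exception:
        # The exception is re-raised by `record`, so it can be formatted here.
        return None, traceback.format_exc()


if __name__ == "__main__":
    result, tb = run_and_report(fail=True)
    print(tb or result)
```

With the decorator on the real entry point, the `error_file` and `traceback` fields of the failure summary should carry the underlying Python error rather than `<N/A>`.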

My environment is pytorch==1.10.0 (build py3.8_cuda11.3_cudnn8.2.0_0),
and my CUDA version is 11.3.

Could you help me out?

Thank you very much.
