
fork error when running on Kubernetes #44

Open

uprush opened this issue Dec 12, 2023 · 2 comments

@uprush commented Dec 12, 2023

Hi,

The benchmark fails with the following error when running on Kubernetes. I was able to work around it by setting the environment variable RDMAV_FORK_SAFE=0, but I am not sure whether this has a performance impact or causes other issues.

root@mlperf-storage:/mlperf/storage# ./benchmark.sh run --workload unet3d --num-accelerators 8 --results-dir /mnt/fb1/unet3d_results --param dataset.data_folder=/mnt/fb1/unet3d_data --param dataset.num_subfolders_train=16 --param dataset.num_files_train=4687
[INFO] 2023-12-12T07:16:13.865342 Running DLIO with 8 process(es) [/mlperf/storage/dlio_benchmark/src/dlio_benchmark.py:104]
[INFO] 2023-12-12T07:16:13.865599 Reading workload YAML config file '/mlperf/storage/storage-conf/workload/unet3d.yaml' [/mlperf/storage/dlio_benchmark/src/dlio_benchmark.py:106]
[INFO] 2023-12-12T07:16:13.979505 Max steps per epoch: 146 = 1 * 4687 / 4 / 8 (samples per file * num files / batch size / comm size) [/mlperf/storage/dlio_benchmark/src/dlio_benchmark.py:274]
[INFO] 2023-12-12T07:16:13.979733 Starting epoch 1: 146 steps expected [/mlperf/storage/dlio_benchmark/src/utils/statscounter.py:129]
[INFO] 2023-12-12T07:16:13.980126 Prefetch size is 0; a default prefetch factor of 2 will be set to Torch DataLoader. [/mlperf/storage/dlio_benchmark/src/reader/torch_data_loader_reader.py:123]
[INFO] 2023-12-12T07:16:13.980436 Starting block 1 [/mlperf/storage/dlio_benchmark/src/utils/statscounter.py:195]
A process has executed an operation involving a call
to the fork() system call to create a child process.

As a result, the libfabric EFA provider is operating in
a condition that could result in memory corruption or
other system errors.

For the libfabric EFA provider to work safely when fork()
is called, you will need to set the following environment
variable:
          RDMAV_FORK_SAFE

However, setting this environment variable can result in
signficant performance impact to your application due to
increased cost of memory registration.

You may want to check with your application vendor to see
if an application-level alternative (of not using fork)
exists.

Your job will now abort.

python3:1099 terminated with signal 6 at PC=7f9034457a7c SP=7ffcd0aa49c0.  Backtrace:
/lib/x86_64-linux-gnu/libc.so.6(pthread_kill+0x12c)[0x7f9034457a7c]
/lib/x86_64-linux-gnu/libc.so.6(raise+0x16)[0x7f9034403476]
/lib/x86_64-linux-gnu/libc.so.6(abort+0xd3)[0x7f90343e97f3]
/lib/x86_64-linux-gnu/libfabric.so.1(+0x76b4e)[0x7f8ea631eb4e]
/lib/x86_64-linux-gnu/libc.so.6(+0xeafb8)[0x7f90344abfb8]
/lib/x86_64-linux-gnu/libc.so.6(__libc_fork+0x71)[0x7f90344ab781]
@johnugeorge (Contributor) commented:
Use RDMAV_FORK_SAFE=1 ./benchmark.sh run --workload unet3d --num-accelerators 8 --results-dir /mnt/fb1/unet3d_results --param dataset.data_folder=/mnt/fb1/unet3d_data --param dataset.num_subfolders_train=16 --param dataset.num_files_train=4687
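Since the benchmark runs on Kubernetes, the variable can also be set once in the pod spec instead of on every command line. This is an illustrative fragment only; the pod name, container name, and image below are placeholders, not part of the benchmark's actual manifests:

```yaml
# Hypothetical pod spec fragment: name, container, and image are placeholders.
apiVersion: v1
kind: Pod
metadata:
  name: mlperf-storage
spec:
  containers:
    - name: benchmark
      image: mlperf-storage:latest   # placeholder image
      env:
        - name: RDMAV_FORK_SAFE     # read by libfabric/rdma-core at startup
          value: "1"
```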

@marktheunissen commented:
Is there any further information on how to run without setting RDMAV_FORK_SAFE=1? There is apparently a performance penalty when running in that mode.
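One fork-free direction worth investigating (an assumption on my part, not verified against this benchmark): the abort happens because the Torch DataLoader workers are created with fork(), which the libfabric EFA provider cannot tolerate. PyTorch's DataLoader accepts a multiprocessing_context argument, so passing a "spawn" context would start workers as fresh interpreters rather than forks. The stdlib sketch below shows the mechanism:

```python
import multiprocessing as mp

def _worker(q):
    # Runs in a freshly spawned interpreter, not a fork of the parent.
    q.put("child started via spawn")

def run_with_spawn():
    # "spawn" launches a new Python process for each worker instead of
    # fork()ing the parent, so memory regions registered with the RDMA NIC
    # by libfabric are never silently duplicated into the child.
    ctx = mp.get_context("spawn")
    q = ctx.Queue()
    p = ctx.Process(target=_worker, args=(q,))
    p.start()
    msg = q.get()
    p.join()
    return msg

if __name__ == "__main__":
    print(run_with_spawn())
```

In DataLoader terms this would look like `DataLoader(..., num_workers=8, multiprocessing_context="spawn")`; spawn has a higher per-worker startup cost than fork, so whether it beats RDMAV_FORK_SAFE=1 would need measuring.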
