
fork error when running on Kubernetes #44

Open

uprush opened this issue Dec 12, 2023 · 2 comments

@uprush commented Dec 12, 2023

Hi,

The benchmark fails with the following error when running on Kubernetes. I was able to work around it by setting the environment variable RDMAV_FORK_SAFE=0, but I am not sure whether this has a performance impact or causes other issues.

root@mlperf-storage:/mlperf/storage# ./benchmark.sh run --workload unet3d --num-accelerators 8 --results-dir /mnt/fb1/unet3d_results --param dataset.data_folder=/mnt/fb1/unet3d_data --param dataset.num_subfolders_train=16 --param dataset.num_files_train=4687
[INFO] 2023-12-12T07:16:13.865342 Running DLIO with 8 process(es) [/mlperf/storage/dlio_benchmark/src/dlio_benchmark.py:104]
[INFO] 2023-12-12T07:16:13.865599 Reading workload YAML config file '/mlperf/storage/storage-conf/workload/unet3d.yaml' [/mlperf/storage/dlio_benchmark/src/dlio_benchmark.py:106]
[INFO] 2023-12-12T07:16:13.979505 Max steps per epoch: 146 = 1 * 4687 / 4 / 8 (samples per file * num files / batch size / comm size) [/mlperf/storage/dlio_benchmark/src/dlio_benchmark.py:274]
[INFO] 2023-12-12T07:16:13.979733 Starting epoch 1: 146 steps expected [/mlperf/storage/dlio_benchmark/src/utils/statscounter.py:129]
[INFO] 2023-12-12T07:16:13.980126 Prefetch size is 0; a default prefetch factor of 2 will be set to Torch DataLoader. [/mlperf/storage/dlio_benchmark/src/reader/torch_data_loader_reader.py:123]
[INFO] 2023-12-12T07:16:13.980436 Starting block 1 [/mlperf/storage/dlio_benchmark/src/utils/statscounter.py:195]
A process has executed an operation involving a call
to the fork() system call to create a child process.

As a result, the libfabric EFA provider is operating in
a condition that could result in memory corruption or
other system errors.

For the libfabric EFA provider to work safely when fork()
is called, you will need to set the following environment
variable:
          RDMAV_FORK_SAFE

However, setting this environment variable can result in
signficant performance impact to your application due to
increased cost of memory registration.

You may want to check with your application vendor to see
if an application-level alternative (of not using fork)
exists.

Your job will now abort.

python3:1099 terminated with signal 6 at PC=7f9034457a7c SP=7ffcd0aa49c0.  Backtrace:
/lib/x86_64-linux-gnu/libc.so.6(pthread_kill+0x12c)[0x7f9034457a7c]
/lib/x86_64-linux-gnu/libc.so.6(raise+0x16)[0x7f9034403476]
/lib/x86_64-linux-gnu/libc.so.6(abort+0xd3)[0x7f90343e97f3]
/lib/x86_64-linux-gnu/libfabric.so.1(+0x76b4e)[0x7f8ea631eb4e]
/lib/x86_64-linux-gnu/libc.so.6(+0xeafb8)[0x7f90344abfb8]
/lib/x86_64-linux-gnu/libc.so.6(__libc_fork+0x71)[0x7f90344ab781]
@johnugeorge (Contributor) commented:
Use RDMAV_FORK_SAFE=1 ./benchmark.sh run --workload unet3d --num-accelerators 8 --results-dir /mnt/fb1/unet3d_results --param dataset.data_folder=/mnt/fb1/unet3d_data --param dataset.num_subfolders_train=16 --param dataset.num_files_train=4687
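Since the benchmark runs on Kubernetes, the variable can also be set once in the pod spec instead of on every command line. This is an illustrative fragment only; the pod name, container name, and image below are placeholders, not part of the benchmark's actual manifests:

```yaml
# Hypothetical pod spec fragment: name, container, and image are placeholders.
apiVersion: v1
kind: Pod
metadata:
  name: mlperf-storage
spec:
  containers:
    - name: benchmark
      image: mlperf-storage:latest   # placeholder image
      env:
        - name: RDMAV_FORK_SAFE     # read by libfabric/rdma-core at startup
          value: "1"
```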

@marktheunissen commented:
Is there any further information on how to run without setting RDMAV_FORK_SAFE=1? There is apparently a performance penalty when running in that mode.
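One fork-free direction worth investigating (an assumption on my part, not verified against this benchmark): the abort happens because the Torch DataLoader workers are created with fork(), which the libfabric EFA provider cannot tolerate. PyTorch's DataLoader accepts a multiprocessing_context argument, so passing a "spawn" context would start workers as fresh interpreters rather than forks. The stdlib sketch below shows the mechanism:

```python
import multiprocessing as mp

def _worker(q):
    # Runs in a freshly spawned interpreter, not a fork of the parent.
    q.put("child started via spawn")

def run_with_spawn():
    # "spawn" launches a new Python process for each worker instead of
    # fork()ing the parent, so memory regions registered with the RDMA NIC
    # by libfabric are never silently duplicated into the child.
    ctx = mp.get_context("spawn")
    q = ctx.Queue()
    p = ctx.Process(target=_worker, args=(q,))
    p.start()
    msg = q.get()
    p.join()
    return msg

if __name__ == "__main__":
    print(run_with_spawn())
```

In DataLoader terms this would look like `DataLoader(..., num_workers=8, multiprocessing_context="spawn")`; spawn has a higher per-worker startup cost than fork, so whether it beats RDMAV_FORK_SAFE=1 would need measuring.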
