Fix segfaults when using CUDA #1397
Summary: switch from using xxd to bin2c when generating the .ptx.c files so that the PTX data can be null-terminated.

With newer drivers or CUDA versions, vmaf now segfaults when trying to do anything on the GPU. The coredumps indicate that the crash happens somewhere inside the cuModuleLoadData calls in init_fex_cuda.

Documentation for cuModuleLoadData states that its `image` argument can be "obtained by mapping a cubin or PTX or fatbin file, [or] passing a cubin or PTX or fatbin file as a NULL-terminated text string...". It looks like VMAF is trying to do the latter, encoding PTX text files as an ASCII string using xxd, but there's no null terminator in the data because nothing asked for one. I'm a CUDA noob and don't know how this ever worked on older driver versions, but I tried editing the .ptx.c files by hand to add 0x00 bytes at the end and it worked!

Switch from xxd to bin2c (which is distributed with the cuda-nvcc package), which supports a `--padd` option to add a null byte to the PTX data, eliminating the segfaults. The arrays got renamed slightly to remove the src_ prefix, since bin2c doesn't do any automatic naming of the output array.
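To make the missing terminator concrete, here is a minimal sketch in plain C (no CUDA required). The array names and the `is_null_terminated` helper are hypothetical and exist only for illustration; they are not the names generated in the VMAF tree. The first array mimics what an `xxd -i` style dump of a PTX text file looks like, and the second mimics the same bytes with the trailing 0x00 that this PR asks bin2c to append.

```c
#include <stddef.h>
#include <string.h>

/* Hypothetical stand-in for an xxd-style dump: raw PTX text bytes,
 * with no trailing 0x00 because the input file does not contain one. */
static const unsigned char ptx_unterminated[] = {
    '.', 'v', 'e', 'r', 's', 'i', 'o', 'n'
};

/* The same bytes followed by an explicit 0x00 pad byte, which is the
 * shape of data this PR expects from bin2c with its --padd option. */
static const unsigned char ptx_terminated[] = {
    '.', 'v', 'e', 'r', 's', 'i', 'o', 'n', 0x00
};

/* Returns 1 if a 0x00 byte occurs within the first len bytes of buf,
 * 0 otherwise. A consumer that treats buf as a C string would read
 * past the end of the array in the 0 case. */
static int is_null_terminated(const unsigned char *buf, size_t len)
{
    return memchr(buf, 0, len) != NULL;
}
```

Calling `is_null_terminated(ptx_unterminated, sizeof ptx_unterminated)` returns 0, while the same call on `ptx_terminated` returns 1. A check like this could serve as a cheap sanity test on the generated arrays before they are handed to cuModuleLoadData.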
Thanks for the contribution! @kylophone is this something you could easily test?
I have the same issues as described in #1357 using the latest Nvidia driver and CUDA, and this fix is working for me. If testing is a blocker for this PR, I'm sharing the tests I've done to move this forward.
On master, both running vmaf_cuda using ffmpeg and vmaf's CUDA unit tests are crashing due to
Tested this on: NVIDIA GeForce RTX 3060, Driver Version: 570.86.16, CUDA Version: 12.8
With the fix (rebased to upstream master) the unit tests are passing:
ffmpeg:
Without the fix ffmpeg is crashing at
With the fix:
This should resolve #1357