Description
Describe the bug
First, I apologize in advance for not being able to provide a short repro.
With GROMACS on Arc A770 and UHD770 (same machine), when running tests with a recent IntelLLVM build, some tests quite reliably (but not always) hang when using the Level Zero backend.
Observations:
- It happens on A770 and UHD770. Haven't tried other GPUs. The reproducer below uses Docker, but the behavior is identical when running on the same machine without any isolation.
- The application is single-threaded, nothing fancy. There are only two threads in the process: the app itself and the NEO runtime.
- With OpenCL or L0 v2 (`SYCL_UR_USE_LEVEL_ZERO_V2=1`), the tests pass just fine.
- With L0 v1 with in-order lists (`UR_L0_USE_DRIVER_INORDER_LISTS=1`), the tests pass on A770, but still deadlock on UHD770.
- The problem only happens if `sycl::ext::oneapi::experimental::submit` is used. If I build GROMACS with `-DGMX_SYCL_ENABLE_EXPERIMENTAL_SUBMIT_API=OFF`, or hack `ExecCGCommand::enqueueImpQueue` in IntelLLVM to have `DiscardUrEvent` always false, things work fine (the two submission paths are sketched after this list).
- Enabling `SYCL_UR_TRACE=2` also makes the test pass.
- With `ZE_DEBUG=1`, the bug reproduces slightly less reliably on bare metal, and not at all in Docker or when run under gdb (even without any breakpoints). See logs below.
- Running under `gdb` (without tracing), the test hangs in `zeHostSynchronize`, which is consistent with the `ZE_TRACE` output:
```
#0 0x00007e41edcfd7db in sched_yield () from /lib/x86_64-linux-gnu/libc.so.6
#1 0x00007e41ea111dde in ?? () from /lib/x86_64-linux-gnu/libze_intel_gpu.so.1
#2 0x00007e41ea107baa in ?? () from /lib/x86_64-linux-gnu/libze_intel_gpu.so.1
#3 0x00007e41ec5a9ecb in ur::level_zero::urEventWait(unsigned int, ur_event_handle_t_* const*) () from /opt/llvm-project/build/install/lib/libur_adapter_level_zero.so.0
#4 0x00007e41ec7f95e7 in urEventWait () from /opt/llvm-project/build/install/lib/libur_loader.so.0
#5 0x00007e41ed981a3f in sycl::_V1::detail::event_impl::waitInternal(bool*) () from /opt/llvm-project/build/install/lib/libsycl.so.8
#6 0x00007e41ed981ba0 in sycl::_V1::detail::event_impl::wait(std::shared_ptr<sycl::_V1::detail::event_impl>, bool*) () from /opt/llvm-project/build/install/lib/libsycl.so.8
#7 0x00007e41ed981cac in sycl::_V1::detail::event_impl::wait_and_throw(std::shared_ptr<sycl::_V1::detail::event_impl>) () from /opt/llvm-project/build/install/lib/libsycl.so.8
#8 0x00007e41eda964b8 in sycl::_V1::event::wait_and_throw() () from /opt/llvm-project/build/install/lib/libsycl.so.8
#9 0x00007e41ef2998e8 in gmx::StatePropagatorDataGpu::Impl::waitVelocitiesReadyOnHost(gmx::AtomLocality) () from /opt/gromacs/build/bin/../lib/libgromacs.so.10
```
- Running with `NEOReadDebugKeys=1 PrintIoctlEntries=1`, the program seems to be stuck in a loop of `IOCTL DRM_IOCTL_I915_GET_RESET_STATS called` / `IOCTL DRM_IOCTL_I915_GET_RESET_STATS returns 0`.
- Killing the application produces `Fence expiration time out i915-0000:00:02.0:mdrun-pull-test[2687535:8a!` in `dmesg`.
- Bisecting IntelLLVM leads to c19e176. Bisecting further, by changing `UR_LEVEL_ZERO_LOADER_TAG` in `unified-runtime/cmake/FetchLevelZero.cmake`, leads to "Only Enable Teardown thread on windows and remove debug on success" (oneapi-src/level-zero#323). Given the sensitivity of the bug to tracing output, I suspect that this is not the breaking commit, but that it somehow makes the bug more likely by making the destructors faster. With tracing enabled, there are quite a few `zeEventDestroy` etc. calls happening around the problematic code, and it's in the middle of the run, so L0 should not be shutting down. Race condition in NEO / L0?
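For reference, a minimal sketch of the two submission paths mentioned above, assuming a DPC++ build where `<sycl/sycl.hpp>` exposes the sycl_ext_oneapi_enqueue_functions extension; this is an illustration, not the GROMACS code. The eventless `submit` is what lets the runtime discard the backing UR event (the `DiscardUrEvent` path), while the classic `queue::submit` returns a `sycl::event`:

```cpp
#include <sycl/sycl.hpp>

namespace syclex = sycl::ext::oneapi::experimental;

int main() {
  sycl::queue q{sycl::gpu_selector_v};
  int *buf = sycl::malloc_device<int>(1024, q);

  // Eventless path: no sycl::event is returned, so the runtime is free to
  // discard the backing UR/L0 event. This is the path that hangs.
  syclex::submit(q, [&](sycl::handler &h) {
    h.parallel_for(sycl::range<1>{1024}, [=](sycl::id<1> i) { buf[i] = 42; });
  });

  // Classic path: a sycl::event is returned and must be backed by a real
  // L0 event (DiscardUrEvent stays false). This path works fine.
  sycl::event e = q.submit([&](sycl::handler &h) {
    h.parallel_for(sycl::range<1>{1024}, [=](sycl::id<1> i) { buf[i] += 1; });
  });
  e.wait();

  q.wait();
  sycl::free(buf, q);
  return 0;
}
```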
To reproduce
- Build the container with the attached Dockerfile.
- Run `docker run --device /dev/dri -e ONEAPI_DEVICE_SELECTOR=level_zero:0 -w /opt/gromacs/build --rm -it CONTAINER_NAME ./bin/mdrun-pull-test -ntmpi 1 -ntomp 1`.
- See that it hangs after `20 steps, 0.0 ps.`
Environment
- Ubuntu 24.04, 6.8.0-60-generic; quick test with 6.11.0-25-generic shows the same behavior
- Target device and vendor: Intel GPU (A770, UHD770)
- DPC++ version: a5c7d88
- Dependencies version:
```
[level_zero:gpu] Intel(R) oneAPI Unified Runtime over Level-Zero, Intel(R) Arc(TM) A770 Graphics 12.55.8 [1.6.33276.160000]
Platforms: 1
Platform [#1]:
Version : 1.6
Name : Intel(R) oneAPI Unified Runtime over Level-Zero
Vendor : Intel(R) Corporation
Devices : 1
Type : gpu
Version : 12.55.8
Name : Intel(R) Arc(TM) A770 Graphics
Vendor : Intel(R) Corporation
Driver : 1.6.33276.160000
UUID : 13412816086800030000000
DeviceID : 22176
Num SubDevices : 0
Num SubSubDevices : 0
Aspects : gpu fp16 online_compiler online_linker queue_profiling usm_device_allocations usm_host_allocations usm_shared_allocations ext_intel_pci_address ext_intel_gpu_eu_count ext_intel_gpu_eu_simd_width ext_intel_gpu_slices ext_intel_gpu_subslices_per_slice ext_intel_gpu_eu_count_per_subslice atomic64 ext_intel_device_info_uuid ext_intel_gpu_hw_threads_per_eu ext_intel_free_memory ext_intel_device_id ext_intel_memory_clock_rate ext_intel_memory_bus_width ext_intel_legacy_image ext_oneapi_bindless_images ext_oneapi_bindless_images_1d_usm ext_oneapi_bindless_images_2d_usm ext_oneapi_external_memory_import ext_oneapi_external_semaphore_import ext_intel_esimd ext_oneapi_ballot_group ext_oneapi_fixed_size_group ext_oneapi_opportunistic_group ext_oneapi_tangle_group ext_intel_matrix ext_oneapi_limited_graph ext_oneapi_private_alloca ext_oneapi_bindless_sampled_image_fetch_1d_usm ext_oneapi_bindless_sampled_image_fetch_2d_usm ext_oneapi_bindless_sampled_image_fetch_2d ext_oneapi_bindless_sampled_image_fetch_3d ext_oneapi_queue_profiling_tag ext_oneapi_virtual_mem ext_oneapi_image_array ext_oneapi_virtual_functions ext_intel_spill_memory_size ext_intel_current_clock_throttle_reasons ext_intel_power_limits ext_oneapi_async_memory_alloc
info::device::sub_group_sizes: 8 16 32
Architecture: intel_gpu_acm_g10
```
Reverting to Compute Runtime 24.52 (Level Zero version 1.6.32224.500000) does not seem to change the behavior; the application still hangs.
Additional context
I am willing to run more tests, and I understand that this is a pain to debug based on the description alone, but I'm stumped right now and would appreciate some advice on what to try next.
Other than two of the integration tests, things work just fine, and there's nothing special about those two that hang.
Normal execution time for the test is a couple seconds.
I'm attaching two log files: ze_debug_fail.txt (bare-metal run with `ZE_DEBUG=1 ONEAPI_DEVICE_SELECTOR=level_zero:0`, which ends up hanging) and ze_debug_pass.txt (bare-metal run with `SYCL_UR_TRACE=2 ZE_DEBUG=1 ONEAPI_DEVICE_SELECTOR=level_zero:0`, which ends up passing). Diffing them, one can more or less see what's going on in UR when the application is hanging:
```
UR <--- EventCreate( Queue->Context, Queue, IsMultiDevice, HostVisible.value(), Event, Queue->CounterBasedEventsEnabled, false , Queue->InterruptBasedEventsEnabled)(UR_RESULT_SUCCESS)
UR <--- createEventAndAssociateQueue(Queue, Event, CommandType, CommandList, IsInternal, false)(UR_RESULT_SUCCESS)
UR ---> setSignalEvent(Queue, UseCopyEngine, &ZeEvent, Event, NumEventsInWaitList, EventWaitList, CommandList->second.ZeQueue)
UR <--- setSignalEvent(Queue, UseCopyEngine, &ZeEvent, Event, NumEventsInWaitList, EventWaitList, CommandList->second.ZeQueue)(UR_RESULT_SUCCESS)
calling zeCommandListAppendMemoryCopy() with ZeEvent 448225560
NumEventsInWaitList 1: 448224632
ZE ---> zeCommandListAppendMemoryCopy(ZeCommandList, Dst, Src, Size, ZeEvent, WaitList.Length, WaitList.ZeEventList)
UR ---> Queue->executeCommandList(CommandList, BlockingWrite, OkToBatch)
UR <--- Queue->executeCommandList(CommandList, BlockingWrite, OkToBatch)(UR_RESULT_SUCCESS)
<--- urEnqueueUSMMemcpy(.hQueue = 0x1aa70220, .blocking = 0, .pDst = 0x7226a92bf000, .pSrc = 0xffffc001ff3f6000, .size = 7776, .numEventsInWaitList = 0, .phEventWaitList = nullptr, .phEvent = nullptr) -> UR_RESULT_SUCCESS;
---> urEnqueueEventsWaitWithBarrierExt
UR ---> TmpWaitList.createAndRetainUrZeEventList( NumEventsInWaitList, EventWaitList, Queue, false )
UR <--- TmpWaitList.createAndRetainUrZeEventList( NumEventsInWaitList, EventWaitList, Queue, false )(UR_RESULT_SUCCESS)
UR ---> Queue->Context->getAvailableCommandList( Queue, CmdList, false , NumEventsInWaitList, EventWaitList, OkToBatch, nullptr )
UR ---> Queue->insertStartBarrierIfDiscardEventsMode(CommandList)
UR <--- Queue->insertStartBarrierIfDiscardEventsMode(CommandList)(UR_RESULT_SUCCESS)
UR <--- Queue->Context->getAvailableCommandList( Queue, CmdList, false , NumEventsInWaitList, EventWaitList, OkToBatch, nullptr )(UR_RESULT_SUCCESS)
UR ---> insertBarrierIntoCmdList(CmdList, TmpWaitList, ResultEvent, IsInternal, InterruptBasedEventsEnabled)
UR ---> createEventAndAssociateQueue( Queue, &Event, UR_COMMAND_EVENTS_WAIT_WITH_BARRIER, CmdList, IsInternal, InterruptBasedEventsEnabled)
UR ---> EventCreate( Queue->Context, Queue, IsMultiDevice, HostVisible.value(), Event, Queue->CounterBasedEventsEnabled, false , Queue->InterruptBasedEventsEnabled)
Cache empty (Host Visible: 1, Profiling: 0, Counter: 0, Interrupt: 0, Device: 0x1a9f1dc0)
ZE ---> zeEventCreate(ZeEventPool, &ZeEventDesc, &ZeEvent)
UR <--- EventCreate( Queue->Context, Queue, IsMultiDevice, HostVisible.value(), Event, Queue->CounterBasedEventsEnabled, false , Queue->InterruptBasedEventsEnabled)(UR_RESULT_SUCCESS)
UR ---> ur::level_zero::urEventRetain(*Event)
UR <--- ur::level_zero::urEventRetain(*Event)(UR_RESULT_SUCCESS)
UR <--- createEventAndAssociateQueue( Queue, &Event, UR_COMMAND_EVENTS_WAIT_WITH_BARRIER, CmdList, IsInternal, InterruptBasedEventsEnabled)(UR_RESULT_SUCCESS)
ZE ---> zeCommandListAppendWaitOnEvents(CmdList->first, EventWaitList.Length, EventWaitList.ZeEventList)
ZE ---> zeCommandListAppendSignalEvent(CmdList->first, Event->ZeEvent)
UR <--- insertBarrierIntoCmdList(CmdList, TmpWaitList, ResultEvent, IsInternal, InterruptBasedEventsEnabled)(UR_RESULT_SUCCESS)
UR ---> Queue->executeCommandList(CmdList, false , OkToBatch)
UR <--- Queue->executeCommandList(CmdList, false , OkToBatch)(UR_RESULT_SUCCESS)
<--- urEnqueueEventsWaitWithBarrierExt(.hQueue = 0x1aa70220, .pProperties = 0x7ffd0ae54530 ((struct ur_exp_enqueue_ext_properties_t){.stype = UR_STRUCTURE_TYPE_EXP_ENQUEUE_EXT_PROPERTIES, .pNext = nullptr, .flags = 0}), .numEventsInWaitList = 0, .phEventWaitList = nullptr, .phEvent = 0x7ffd0ae54400 (0x1ab40020)) -> UR_RESULT_SUCCESS;
---> urEventWait
UR ---> UrQueue->executeAllOpenCommandLists()
UR <--- UrQueue->executeAllOpenCommandLists()(UR_RESULT_SUCCESS)
ZeEvent = 448003464 # This line (with different value) is the last thing the hanging process prints
ZE ---> zeHostSynchronize(ZeEvent)
UR ---> CleanupEventListFromResetCmdList(EventListToCleanup, QueueLocked)
UR ---> CleanupCompletedEvent(Event, QueueLocked, true )
UR ---> urEventReleaseInternal(Event)
UR <--- urEventReleaseInternal(Event)(UR_RESULT_SUCCESS)
UR ---> urEventReleaseInternal(DepEvent)
UR <--- urEventReleaseInternal(DepEvent)(UR_RESULT_SUCCESS)
UR ---> urEventReleaseInternal(DepEvent)
UR <--- urEventReleaseInternal(DepEvent)(UR_RESULT_SUCCESS)
UR ---> urEventReleaseInternal(DepEvent)
UR <--- urEventReleaseInternal(DepEvent)(UR_RESULT_SUCCESS)
UR ---> urEventReleaseInternal(DepEvent)
UR <--- urEventReleaseInternal(DepEvent)(UR_RESULT_SUCCESS)
UR ---> urEventReleaseInternal(DepEvent)
UR <--- urEventReleaseInternal(DepEvent)(UR_RESULT_SUCCESS)
UR ---> urEventReleaseInternal(DepEvent)
UR <--- urEventReleaseInternal(DepEvent)(UR_RESULT_SUCCESS)
UR ---> urEventReleaseInternal(DepEvent)
UR <--- urEventReleaseInternal(DepEvent)(UR_RESULT_SUCCESS)
UR ---> urEventReleaseInternal(DepEvent)
UR <--- urEventReleaseInternal(DepEvent)(UR_RESULT_SUCCESS)
UR ---> urEventReleaseInternal(DepEvent)
UR <--- urEventReleaseInternal(DepEvent)(UR_RESULT_SUCCESS)
UR ---> urEventReleaseInternal(DepEvent)
UR <--- urEventReleaseInternal(DepEvent)(UR_RESULT_SUCCESS)
UR ---> urEventReleaseInternal(DepEvent)
UR <--- urEventReleaseInternal(DepEvent)(UR_RESULT_SUCCESS)
UR ---> urEventReleaseInternal(DepEvent)
UR <--- urEventReleaseInternal(DepEvent)(UR_RESULT_SUCCESS)
UR <--- CleanupCompletedEvent(Event, QueueLocked, true )(UR_RESULT_SUCCESS)
UR ---> urEventReleaseInternal(Event)
UR <--- urEventReleaseInternal(Event)(UR_RESULT_SUCCESS)
UR ---> CleanupCompletedEvent(Event, QueueLocked, true )
UR ---> ur::level_zero::urKernelRelease(AssociatedKernel)
UR <--- ur::level_zero::urKernelRelease(AssociatedKernel)(UR_RESULT_SUCCESS)
```
ze_debug_pass.txt
ze_debug_fail.txt
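For orientation, here is a rough, self-contained sketch of the Level Zero call pattern the failing trace shows around `urEnqueueEventsWaitWithBarrierExt`: wait on the dependency events, signal the barrier event, then block on the host in `zeEventHostSynchronize` (the call the hanging process never returns from). This is not a reproducer for the bug; it uses an immediate command list for brevity where the trace goes through a regular command list, all setup is illustrative, and error checking is omitted:

```cpp
#include <level_zero/ze_api.h>
#include <cstdint>

int main() {
  // Error checking omitted; every call returns a ze_result_t.
  zeInit(ZE_INIT_FLAG_GPU_ONLY);

  uint32_t count = 1;
  ze_driver_handle_t driver;
  zeDriverGet(&count, &driver);
  count = 1;
  ze_device_handle_t device;
  zeDeviceGet(driver, &count, &device);

  ze_context_desc_t ctxDesc = {ZE_STRUCTURE_TYPE_CONTEXT_DESC};
  ze_context_handle_t context;
  zeContextCreate(driver, &ctxDesc, &context);

  // Host-visible events, matching "Host Visible: 1" in the failing trace.
  ze_event_pool_desc_t poolDesc = {ZE_STRUCTURE_TYPE_EVENT_POOL_DESC};
  poolDesc.flags = ZE_EVENT_POOL_FLAG_HOST_VISIBLE;
  poolDesc.count = 2;
  ze_event_pool_handle_t pool;
  zeEventPoolCreate(context, &poolDesc, 1, &device, &pool);

  ze_event_desc_t evDesc = {ZE_STRUCTURE_TYPE_EVENT_DESC};
  evDesc.signal = ZE_EVENT_SCOPE_FLAG_HOST;
  ze_event_handle_t dep, barrier;
  evDesc.index = 0;
  zeEventCreate(pool, &evDesc, &dep);
  evDesc.index = 1;
  zeEventCreate(pool, &evDesc, &barrier);

  ze_command_queue_desc_t qDesc = {ZE_STRUCTURE_TYPE_COMMAND_QUEUE_DESC};
  qDesc.mode = ZE_COMMAND_QUEUE_MODE_ASYNCHRONOUS;
  ze_command_list_handle_t cmdList;
  zeCommandListCreateImmediate(context, device, &qDesc, &cmdList);

  // The barrier pattern from insertBarrierIntoCmdList in the trace:
  // wait on the dependency event, then signal the barrier event...
  zeCommandListAppendWaitOnEvents(cmdList, 1, &dep);
  zeCommandListAppendSignalEvent(cmdList, barrier);

  zeEventHostSignal(dep); // stand-in for the preceding memcpy completing

  // ...and the host then blocks here; in the bug, this never returns.
  zeEventHostSynchronize(barrier, UINT64_MAX);

  zeCommandListDestroy(cmdList);
  zeEventDestroy(barrier);
  zeEventDestroy(dep);
  zeEventPoolDestroy(pool);
  zeContextDestroy(context);
  return 0;
}
```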
EDIT 1: Added info about a newer kernel and an older compute-runtime (no effect).