I hope the issue is still relevant, and sorry for the delayed response. I've been updating the repo and noticed your issue report.
The quick answer is that Intel MPI is available and supported on platforms with MOFED. Based on the description and information you provided, the issue lies outside the Intel MPI library, and I would recommend updating the MOFED/UCX part of the stack. If the issue is still there, you may also need to update the NIC firmware. In general, I would recommend contacting the NVIDIA support team.
As a quick experiment, you may try the non-DCT transport path: FI_MLX_TLS=ud,sm,self
It is less demanding, and if it works, then most likely something is wrong with the DC infrastructure. You may need to run ucx_info and ibv_devinfo -v to check the availability of the DC/DCT feature.
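As a concrete sketch of the experiment above (the mpirun command line, binary name, and grep patterns are illustrative assumptions, not from this thread):

```shell
# Force the libfabric mlx provider's UCX layer onto non-DCT transports:
# unreliable datagram, shared memory, and loopback.
export FI_MLX_TLS=ud,sm,self

# Confirm the variable is set before launching the job.
echo "FI_MLX_TLS=${FI_MLX_TLS}"

# Illustrative launch; substitute your own binary and host list.
# mpirun -n 2 ./my_mpi_app

# Diagnostics mentioned above: check whether UCX and the HCA report
# DC/DCT transport support (patterns are a rough starting point).
# ucx_info -d | grep -i dc
# ibv_devinfo -v | grep -i dct
```

If the job runs cleanly with this setting but fails without it, that points at the DC/DCT path rather than the MPI library itself.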
In the future, it would be great if you could direct support-related questions to our regular support channel or to the Intel forum sections dedicated to the product.
Hello,
My customer Fujitsu reports an issue below.
In the field, the following errors occur when an MPI program is executed for the first time immediately after server startup.
The errors occur only on the first MPI execution and do not recur from the second run onwards.
This phenomenon occurred between 16:00 and 18:00 on July 14th.
When I asked NVIDIA to check the MOFED driver, they told me that there was no error on the driver side.
Furthermore, they said that Intel MPI is based on libfabric, which OFED does not support: if the customer wants to use Intel MPI, they need the full stack, including the IB driver and libraries, from Intel. Running Intel MPI on top of NVIDIA OFED is outside NVIDIA's support scope.
Does Intel support Intel MPI with NVIDIA MOFED, without NVIDIA's support?
If so, could you please investigate this issue?
Or does Intel recommend using Intel MPI only with Intel's IB driver?
Intel MPI
OS: RHEL7.9
MOFED: 5.2-1.0.4.0
HCA: CX5 (EDR) (FW: 16.29.1016)
----
[0] MPI startup(): Intel(R) MPI Library, Version 2021.2 Build 20210302 (id: f4f7c92cd)
[0] MPI startup(): Copyright (C) 2003-2021 Intel Corporation. All rights reserved.
[0] MPI startup(): library kind: release
[0] MPI startup(): libfabric version: 1.11.0-impi
[0] MPI startup(): libfabric provider: mlx
[1657788057.566554] [cmp-044:38365:0] mpool.c:193 UCX ERROR Failed to allocate memory pool (name=devx dbrec) chunk: Out of memory
[1657788057.582960] [cmp-046:37100:0] dc_mlx5_devx.c:66 UCX ERROR mlx5dv_devx_obj_create(DCT) failed, syndrome 0: Resource temporarily unavailable
[1657788057.586338] [cmp-038:42220:0] dc_mlx5_devx.c:66 UCX ERROR mlx5dv_devx_obj_create(DCT) failed, syndrome 0: Resource temporarily unavailable
----
Thanks,
Shinto