
Error occurs at the first MPI run #5


Open
tetsushinto opened this issue Sep 2, 2022 · 2 comments

@tetsushinto

Hello,

My customer Fujitsu reports the issue below.

In the field, the following errors occur when an MPI program is executed for the first time immediately after server startup.
The errors occur only on the first MPI run, not on subsequent runs.
This phenomenon occurred between 16:00 and 18:00 on July 14th.

When I asked NVIDIA to check the MOFED driver, they told me there was no error on the driver side.
Furthermore, they said that Intel MPI is based on libfabric, which OFED does not support; customers who want to use Intel MPI need the full stack, including the IB driver and libraries, from Intel. Running Intel MPI on top of NVIDIA OFED is outside NVIDIA's support scope.

Does Intel support Intel MPI with NVIDIA MOFED, without NVIDIA support?
If so, could you please investigate this issue?

Or does Intel recommend using Intel MPI only with Intel's IB driver?

Intel MPI
OS: RHEL 7.9
MOFED: 5.2-1.0.4.0
HCA: CX5 (EDR) (FW: 16.29.1016)

----
[0] MPI startup(): Intel(R) MPI Library, Version 2021.2 Build 20210302 (id: f4f7c92cd)
[0] MPI startup(): Copyright (C) 2003-2021 Intel Corporation. All rights reserved.
[0] MPI startup(): library kind: release
[0] MPI startup(): libfabric version: 1.11.0-impi
[0] MPI startup(): libfabric provider: mlx
[1657788057.566554] [cmp-044:38365:0] mpool.c:193 UCX ERROR Failed to allocate memory pool (name=devx dbrec) chunk: Out of memory
[1657788057.582960] [cmp-046:37100:0] dc_mlx5_devx.c:66 UCX ERROR mlx5dv_devx_obj_create(DCT) failed, syndrome 0: Resource temporarily unavailable
[1657788057.586338] [cmp-038:42220:0] dc_mlx5_devx.c:66 UCX ERROR mlx5dv_devx_obj_create(DCT) failed, syndrome 0: Resource temporarily unavailable
----

Thanks,
Shinto

@tetsushinto (Author)

Hello,

Could you please comment on this issue?

Regards,
Shinto

@ddurnov (Contributor) commented Dec 5, 2023

Hello Shinto,

I hope the issue is still relevant, and sorry for the delayed response. I was updating the repo and noticed your issue report.

The quick answer is that Intel MPI is available and supported on platforms with MOFED. Based on the description and information you provided, the issue is outside the Intel MPI library, and I would recommend updating the MOFED/UCX part of the stack. If the issue is still there, you may need to update the NIC firmware. In general, I would recommend contacting the NVIDIA support team.
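For reference, a minimal sketch of checking the installed MOFED release and HCA firmware version with the standard OFED utilities (exact output formats vary by release):

----
# Print the installed MOFED release string
ofed_info -s

# Print the HCA firmware version (fw_ver field)
ibv_devinfo | grep fw_ver
----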

As a quick experiment, you may try the non-DCT transport path: FI_MLX_TLS=ud,sm,self
It is less demanding, and if it works, then most likely something is wrong with the DC infrastructure. You may need to run ucx_info and ibv_devinfo -v to check the state of DC/DCT feature availability; see the sketch below.
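A minimal sketch of that experiment and the follow-up checks, assuming a standard Intel MPI mpirun (Hydra) launch; the hostfile, process counts, and application name are placeholders:

----
# Workaround experiment: use the UD-based transports instead of DC
FI_MLX_TLS=ud,sm,self mpirun -n 4 -ppn 2 -f hostfile ./your_mpi_app

# List available UCX transports; DC support shows up as dc_mlx5/dc entries
ucx_info -d | grep -i dc

# Dump verbose device capabilities from the verbs layer
ibv_devinfo -v
----

If the run succeeds with FI_MLX_TLS=ud,sm,self but fails without it, that points at the DC/DCT path in the UCX/MOFED stack rather than at Intel MPI itself.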

In the future, it would be great if you could direct support-related questions to our regular support channel or the Intel forum sections dedicated to the product.
