ISSUES about pgigpu+openmpi with WCYCL1850 and CMPASO-NYF #5623

lulu1599 · 2023-03-28T05:20:48Z

Regarding E3SM maint-2.1, I tried to run it on a CPU platform with intel+impi, and it worked well with the case WCYCL1850+ne30pg2_EC30to60E2r2. However, when I tried to switch to a GPU platform using PGI+openmpi, I encountered the following issues:

1. When I configured the case with "--compset WCYCL1850 --res ne30pg2_EC30to60E2r2" on the GPU platform, the build failed. During "./case.build" I got an nvlink multiple-definition error:
   "nvlink error : Multiple definition of 'mpas_vector_reconstruction_mpas_reconstruct_1d_gpu_647_gpu' in '../../mpas-framework/src/libocn.a:mpas_vector_reconstruction.f90.o', first defined in '../../mpas-framework/src/libice.a:mpas_vector_reconstruction.f90.o'"
   I searched the forum but could not find a relevant solution. It seems to point to a compatibility issue between the PGI compiler and the source code (or perhaps my compile parameters), but I am not sure. Do you know what the reason might be?
2. I then tried another case that only activates the ocean component, "--compset CMPASO-NYF --res T62_oEC60to30v3", to avoid the nvlink problem above. However, when I ran "./case.submit", the job did not run on the GPU card; it still ran on the CPU. Did I miss something?

The config_machine.xml, cmake files, running scripts, and log files are attached. Hope an expert can help me, thanks!

config_machine.xml.txt
create_case.sh.txt
Depends.pgigpu.cmake.txt
pgigpu_lu-gpu.cmake.txt
preview_run.log

philipwjones · 2023-03-28T16:04:19Z

Maintainer

@lulu1599 On your second issue, it looks like your files don't have the correct flags for building the GPU-enabled ocean. You will need a -DMPAS_OPENACC flag for preprocessing, and the Fortran flags should include -acc -Minfo=accel (the -ta flag is somewhat optional, but if you know the architecture it can sometimes help). It looks like the latter are set in LDFLAGS but not on the compile line. With the -Minfo flag on, you should see the compiler generate additional output for acceleration, which is a good way to check whether it is actually generating GPU instructions.
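One way to confirm the flags reached the compile line (and not only the link line) is to test individual compile commands copied from the build log. The following is a minimal sketch; `missing_acc_flags` is a hypothetical helper name, and it only matches the exact spellings quoted above, so a variant such as `-acc=gpu` would be reported as missing.

```shell
# Hypothetical helper: print which GPU-related flags are absent from one
# compile command line (passed as a single string). Prints nothing if all
# three flags named in this thread are present.
missing_acc_flags() {
  line=" $1 "          # pad with spaces so each flag matches as a whole word
  missing=""
  for flag in -DMPAS_OPENACC -acc -Minfo=accel; do
    case "$line" in
      *" $flag "*) ;;                      # flag found
      *) missing="$missing $flag" ;;       # flag not found
    esac
  done
  # Strip the leading space before printing.
  printf '%s\n' "${missing# }"
}
```

For example, `missing_acc_flags "$(grep -m1 'mpas_vector_reconstruction' e3sm.bldlog.txt)"` would check one line of a build log; the log file name here is only a placeholder.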

On the first issue, we have seen that in the past, but I think it disappeared with later compiler versions. Which version of Nvidia/PGI are you using?

lulu1599 · 2023-03-29T01:56:49Z

Author

Thanks a lot!

1. I'm trying the flags you mentioned to see if they work.
2. My PGI version is 21.9-0 and my CUDA version is 11.4. Maybe I should try a newer PGI?

Thanks again,
Jingyu

lulu1599 · 2023-03-29T02:48:31Z

Author

Hi! I have added the flags:

`string(APPEND CPPDEFS " -DMPAS_OPENACC")`

`string(APPEND FFLAGS "-acc -Minfo=accel")`

When I set the number of processes to 8 (-n 8) with --ngpu-per-node 8, e3sm.exe was submitted to the GPU successfully. However, when I set the number of processes to 16 (-n 16) with --ngpu-per-node 8, I can't see my processes on the GPU. Here's my run info; perhaps you can spot a clue?

![image](https://user-images.githubusercontent.com/51312558/228414079-d70b171e-4d7f-4541-b660-d800cc02818b.png)
![image](https://user-images.githubusercontent.com/51312558/228414141-cc5b3f93-9aae-4dab-bc5e-9e423e14e9c9.png)
![image](https://user-images.githubusercontent.com/51312558/228414154-92f0eb3c-450d-448d-b52b-fe490ff30690.png)
![image](https://user-images.githubusercontent.com/51312558/228414172-75c6fa31-2d68-4d52-9b29-f4b36f11b7b7.png)

philipwjones · 2023-03-29T17:34:59Z

Maintainer

Since the first one works, it seems you have a GPU-enabled executable, so I suspect this is an issue with how your job launcher/scheduler allocates resources and whether it supports multiple ranks per GPU.
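To illustrate what "multiple ranks per GPU" means in practice, here is a minimal sketch of a round-robin mapping from node-local MPI rank to GPU index, as a launcher wrapper might do it. It assumes Open MPI (which sets `OMPI_COMM_WORLD_LOCAL_RANK` for each rank) and the CUDA runtime's `CUDA_VISIBLE_DEVICES` variable; `gpu_for_local_rank` and `GPUS_PER_NODE` are hypothetical names, and nothing here is confirmed to be how E3SM or this particular scheduler assigns devices.

```shell
# Hypothetical helper: choose a GPU index for a node-local rank by
# round-robin, so 16 ranks on an 8-GPU node share devices two per GPU.
gpu_for_local_rank() {
  local_rank=$1
  gpus_per_node=$2
  echo $(( local_rank % gpus_per_node ))
}

# Only act when started under Open MPI, which sets this variable per rank.
# GPUS_PER_NODE is an assumed setting; it defaults to 8 as in this thread.
if [ -n "${OMPI_COMM_WORLD_LOCAL_RANK:-}" ]; then
  CUDA_VISIBLE_DEVICES=$(gpu_for_local_rank "$OMPI_COMM_WORLD_LOCAL_RANK" "${GPUS_PER_NODE:-8}")
  export CUDA_VISIBLE_DEVICES
fi
```

Even with such a mapping, sharing can still be refused if the GPUs are configured in an exclusive compute mode, so the scheduler and device settings are worth checking first.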

lulu1599 · 2023-03-30T08:58:45Z

Author

OK, thanks! Your answers are very helpful!

lulu1599 · 2023-04-06T06:45:24Z

Author

Hi!

I have another question. When I run the WCYCL1850 case on a GPU node, does the EAM module use the Kokkos library while mpas-o/mpas-si use OpenACC?
Can the other modules be accelerated by the GPU? (I'm using E3SM v2.1.)
Here are the cmake file and log file for my "--compset WCYCL1850 --res ne30pg2_EC30to60E2r2" case. I think it's a Kokkos error; do you know how to fix this?
e3sm.bldlog.230406-143814.txt
pgigpu_hpc.cmake.txt

I'm new to running on GPUs; thanks for your guidance!

rljacob · 2023-04-06T15:12:26Z

Maintainer

In v2.1, only mpas-o/mpas-si have GPU capability. EAM does use Kokkos in part of the code, but it's in "cpu mode".

lulu1599 · 2023-04-07T08:32:06Z

Author

OK, thanks for your reply.

lulu1599 · 2023-04-11T01:02:16Z

Author

Here's another question. Among the mpas.part.* files in this path, the smallest partition count available is 8. I know this corresponds to the number of ocean processes. How can I run with fewer than 8 ocean processes, such as 1, 2, or 4?

philipwjones · 2023-04-11T15:23:15Z

Maintainer

To create a new partition, use the gpmetis command-line tool from METIS (you'll need METIS on your machine; many of our supported machines have it available as a module). If you have the module loaded or METIS installed, just use
gpmetis graph_file num_procs
where graph_file is the main graph file in the path you noted (the file with the same name but without the .part.x suffix).

That's for the ocean only. The sea-ice does some additional load balancing and the process is a bit more involved.
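The gpmetis step for the ocean can be scripted for several process counts at once. This is a minimal sketch: `GRAPH_FILE` is a placeholder to be pointed at the main graph file, `partition_file_name` is a hypothetical helper, and the loop is skipped unless both the file and gpmetis are found. As far as I know gpmetis expects at least 2 parts, so only 2 and 4 are generated here; whether a single-process run needs a partition file at all is something to verify separately.

```shell
# Hypothetical helper: name of the file gpmetis writes for a given graph
# file and part count (METIS appends ".part.<nparts>" to the input name).
partition_file_name() {
  printf '%s.part.%s\n' "$1" "$2"
}

# Placeholder: set GRAPH_FILE to the main graph file before running.
GRAPH_FILE=${GRAPH_FILE:-}

# Run only when the graph file exists and gpmetis is on the PATH.
if [ -n "$GRAPH_FILE" ] && [ -f "$GRAPH_FILE" ] && command -v gpmetis >/dev/null 2>&1; then
  for nparts in 2 4; do
    gpmetis "$GRAPH_FILE" "$nparts"
    echo "expected output: $(partition_file_name "$GRAPH_FILE" "$nparts")"
  done
fi
```

The resulting files sit next to the graph file and follow the same .part.N naming as the existing partitions.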

lulu1599 · 2023-04-13T02:32:19Z

Author

That’s really helpful!
