I've been working on the cabmBuild branch and noticed some unexpected behavior while testing CSR. To test larger amounts of data per particle, a new version of the testing code ps_combo.cpp was made, ps_combo32.cpp, which stores a size-32 array of doubles for each particle instead of the original size-3 array. This is linked here.
During comparative testing for CabM on AiMOS, CSR aborted with an out of memory error at 50,000 elements and 50,000,000 particles. The error message is included below:
Test Command:
./ps_combo32 50000 50000000 1 -p 50 -n 1
Generating particle distribution with strategy: Uniform
Building CSR
Performing 100 iterations of rebuild on each structure
Beginning push on structure CSR
Beginning rebuild on structure CSR
terminate called after throwing an instance of 'std::runtime_error'
what(): cudaMalloc( &ptr, arg_alloc_size ) error( cudaErrorMemoryAllocation): out of memory /gpfs/u/barn/MPFS/MPFSmttw/pumipic_CabM/kokkos/core/src/Cuda/Kokkos_CudaSpace.cpp:175
Traceback functionality not available
[dcs044:159743] *** Process received signal ***
[dcs044:159743] Signal: Aborted (6)
[dcs044:159743] Signal code: (-6)
[dcs044:159743] [ 0] [0x7fff8ad704d8]
[dcs044:159743] [ 1] /usr/lib64/libc.so.6(abort+0x2b4)[0x7fff89412094]
[dcs044:159743] [ 2] /gpfs/u/software/ppc64le-rhel7/gcc/7.4.0/1/lib64/libstdc++.so.6(_ZN9__gnu_cxx27__verbose_terminate_handlerEv+0x1c4)[0x7fff897a0644]
[dcs044:159743] [ 3] /gpfs/u/software/ppc64le-rhel7/gcc/7.4.0/1/lib64/libstdc++.so.6(+0xab364)[0x7fff8979b364]
[dcs044:159743] [ 4] /gpfs/u/software/ppc64le-rhel7/gcc/7.4.0/1/lib64/libstdc++.so.6(_ZSt9terminatev+0x20)[0x7fff8979b420]
[dcs044:159743] [ 5] /gpfs/u/software/ppc64le-rhel7/gcc/7.4.0/1/lib64/libstdc++.so.6(__cxa_throw+0x80)[0x7fff8979b8e0]
[dcs044:159743] [ 6] ./ps_combo32(_ZN6Kokkos4Impl23throw_runtime_exceptionERKNSt7__cxx1112basic_stringIcSt11char_traitsIcESaIcEEE+0xc4)[0x101aedc0]
[dcs044:159743] [ 7] ./ps_combo32(_ZN6Kokkos4Impl25cuda_internal_error_throwE9cudaErrorPKcS3_i+0x170)[0x101b0f40]
[dcs044:159743] [ 8] ./ps_combo32(_ZN6Kokkos4Impl23cuda_internal_safe_callE9cudaErrorPKcS3_i+0x60)[0x101b4128]
[dcs044:159743] [ 9] ./ps_combo32(_ZNK6Kokkos9CudaSpace8allocateEm+0x60)[0x101b6478]
[dcs044:159743] [10] ./ps_combo32(_ZN6Kokkos4Impl22SharedAllocationRecordINS_9CudaSpaceEvEC2ERKS2_RKNSt7__cxx1112basic_stringIcSt11char_traitsIcESaIcEEEmPFvPNS1_IvvEEE+0x4c)[0x101b78a8]
[dcs044:159743] [11] ./ps_combo32(_ZN6Kokkos4ViewIPA32_dJNS_10LayoutLeftENS_6DeviceINS_4CudaENS_9CudaSpaceEEEEEC2IJNSt7__cxx1112basic_stringIcSt11char_traitsIcESaIcEEEEEERKNS_4Impl12ViewCtorPropIJDpT_EEERKNSt9enable_ifIXntsrSK_11has_pointerES3_E4typeE+0x10c)[0x1016632c]
[dcs044:159743] [12] ./ps_combo32(_ZN7pumipic3CSRINS_11MemberTypesIJiA32_ddEEEN6Kokkos9CudaSpaceEE7rebuildENS4_4ViewIPiJNS4_6DeviceINS4_4CudaES5_EEEEESC_PPv+0x308)[0x1017f628]
[dcs044:159743] [13] ./ps_combo32(main+0x1800)[0x100a8e60]
[dcs044:159743] [14] /usr/lib64/libc.so.6(+0x25200)[0x7fff893f5200]
[dcs044:159743] [15] /usr/lib64/libc.so.6(__libc_start_main+0xc4)[0x7fff893f53f4]
[dcs044:159743] *** End of error message ***
--------------------------------------------------------------------------
Primary job terminated normally, but 1 process returned
a non-zero exit code. Per user-direction, the job has been aborted.
--------------------------------------------------------------------------
--------------------------------------------------------------------------
mpirun noticed that process rank 0 with PID 0 on node dcs044 exited on signal 6 (Aborted).
--------------------------------------------------------------------------
However, neither the SCS nor the CabM particle structure fails until our next set of tests at 75,000 elements and 75,000,000 particles. We investigated by running ps_combo32 again with the number of iterations at line 89 (originally 100) reduced to 1. In this case, all three particle structures failed with an out of memory error at 75,000 elements and 75,000,000 particles. This leads me to suspect that there is some sort of large-scale memory error in CSR or possibly in the testing code. (See the EDIT below.)
For reference, the set of tests we were running is in the file test_largeE_largeP.sh, located here (using the second commented-out call to ps_combo, which is the one for use on AiMOS).
EDIT: Upon further inspection, this does not seem to be a memory leak. However, CSR is using much more memory than expected. I've checked, and it seems that particles_on_process is being calculated correctly, here. I ran some performance tests on CSR using the Kokkos memory-usage tools, here, with the test mpirun -np 1 ./ps_combo160 1000 1000000 1 -n 1 on a 6-GPU node on AiMOS. I found that, at their maximums, CabM uses 331.2 MB and CSR uses 470.8 MB. This is unexpected because CabM pads its storage and should therefore be allocating more memory than CSR. I think I've tracked it down to the particle_info temporary MTVs in CSR::rebuild, here, but I'm not sure how it could be allocating this much extra space.
UPDATE: The issue was found. Because CSR uses MTVs to store its particle data and continually creates and destroys them, these get calls were leaving a few smart pointers to the original set of data alive. Thus, when rebuilding, CSR was using 3x the memory of ptcl_data instead of just 2x. For now, this has been fixed by enclosing these get calls in a for loop, so that the smart pointers go out of scope before the call to migrate/rebuild.
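To illustrate the mechanism described above, here is a minimal host-only sketch, not the actual pumipic code. It assumes a reference-counted buffer behaves like std::shared_ptr; the names Buffer, Handle, Store, live_buffers and both peak_* functions are hypothetical. A plain nested block stands in for the for loop mentioned in the fix, since either one ends the handle's lifetime before the rebuild.

```cpp
#include <cassert>
#include <cstddef>
#include <memory>
#include <vector>

// Hypothetical stand-in for a reference-counted device allocation.
// live_buffers tracks how many buffers are allocated at any moment.
static int live_buffers = 0;

struct Buffer {
  std::vector<double> data;
  explicit Buffer(std::size_t n) : data(n) { ++live_buffers; }
  ~Buffer() { --live_buffers; }
  Buffer(const Buffer&) = delete;
  Buffer& operator=(const Buffer&) = delete;
};
using Handle = std::shared_ptr<Buffer>;

// Hypothetical particle store: rebuild allocates a fresh buffer, then swaps it in.
struct Store {
  Handle ptcl_data;
  explicit Store(std::size_t n) : ptcl_data(std::make_shared<Buffer>(n)) {}
  Handle get() const { return ptcl_data; }  // shares ownership with the caller
  void rebuild(std::size_t n, int& peak) {
    Handle fresh = std::make_shared<Buffer>(n);
    if (live_buffers > peak) peak = live_buffers;
    ptcl_data = fresh;  // the old buffer is freed only if no other handle remains
  }
};

// A handle obtained before the rebuilds and never released pins the original buffer.
int peak_with_lingering_handle(std::size_t n) {
  assert(n > 0);
  Store s(n);
  int peak = live_buffers;
  Handle view = s.get();
  view->data[0] = 1.0;
  s.rebuild(n, peak);  // original + new
  s.rebuild(n, peak);  // original (pinned) + current + new
  return peak;
}

// Scoping the handle lets the original buffer be freed at the first swap.
int peak_with_scoped_handle(std::size_t n) {
  assert(n > 0);
  Store s(n);
  int peak = live_buffers;
  {
    Handle view = s.get();
    view->data[0] = 1.0;
  }  // handle released here, before any rebuild
  s.rebuild(n, peak);
  s.rebuild(n, peak);
  return peak;
}
```

Under these assumptions, peak_with_lingering_handle(8) reaches three live buffers while peak_with_scoped_handle(8) never exceeds two, which mirrors the 3x versus 2x footprint reported above.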
A general fix has been proposed and is currently underway whereby a second copy of ptcl_data would be stored at all times for swapping purposes (like SCS) for both CSR and CabanaM.
Once CSR has its swapping implementation done, we could probably close this issue, although the issue is still technically there for cases in which CSR increases in size so that it triggers a full rebuild.
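As a rough sanity check on the 2x versus 3x footprint, the sketch below estimates particle-data memory for the failing runs. The per-particle layout (int, double[32], double) is read from the MemberTypes template arguments in the backtrace; the 32 GiB device capacity is an assumption, not something stated in this thread, and the estimate ignores offsets, padding, and every other allocation. The names kBytesPerParticle, kAssumedCapacity, footprint_bytes and fits are hypothetical.

```cpp
#include <cassert>
#include <cstdint>

// Per-particle payload for MemberTypes<int, double[32], double>,
// assuming a 4-byte int and 8-byte double.
const std::uint64_t kBytesPerParticle = 4 + 32 * 8 + 8;

// ASSUMPTION: 32 GiB of device memory per GPU (not stated in the thread).
const std::uint64_t kAssumedCapacity = 32ull << 30;

// Bytes needed to hold `copies` full copies of the particle data.
inline std::uint64_t footprint_bytes(std::uint64_t particles, std::uint64_t copies) {
  assert(copies > 0);
  return particles * kBytesPerParticle * copies;
}

// True if that many copies fit within the given capacity.
inline bool fits(std::uint64_t particles, std::uint64_t copies, std::uint64_t capacity) {
  return footprint_bytes(particles, copies) <= capacity;
}
```

With these numbers, 50,000,000 particles need about 13.4 GB per copy: two copies (26.8 GB) fit under the assumed capacity and three (40.2 GB) do not, while 75,000,000 particles already exceed it with two copies. That is consistent with CSR failing one test size earlier than SCS and CabM, though only as an estimate under the stated assumptions.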