Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Wrong information for some particles after CabM particle structures rebuild operation #119

Open
zhangchonglin opened this issue Mar 22, 2024 · 7 comments
Labels
bug Something isn't working CabM

Comments

@zhangchonglin
Copy link
Contributor

zhangchonglin commented Mar 22, 2024

While developing on a new particle code, Comet, I encountered the following issue:

  • If using CabM particle structure, when the number of particles per GPU is >~50 million, information for some particles are incorrect after particle structure rebuild.
  • When simply switching to SCS particle structure, everything works fine.
  • This is likely related to the earlier XGCm issue when 64 million particles per GPU was used in that case: https://github.com/SCOREC/xgcm/issues/163.
Steps to reproduce the issue:
  • In each simulation time steps I generate some new particles and then perform a rebuild operation similar as here:
    auto pID = structure->get<0>();
    kkLidView new_element("new_element", structure->capacity());
    auto setElement = PS_LAMBDA(const lid_t& e, const lid_t& p, const bool& mask) {
    if (mask)
    new_element(p) = (e*3 + p + 2) % ne; //assign to diff elems
    else
    new_element(p) = -1;
    pID(p) = p;
    };
    ps::parallel_for(structure, setElement, "setElement");
    kkLidView new_particle_elements("new_particle_elements", nnp);
    auto new_particles = particle_structs::createMemberViews<Types>(nnp);
    auto new_particle_access = particle_structs::getMemberView<Types,0>(new_particles);
    auto val1 = particle_structs::getMemberView<Types,1>(new_particles);
    auto val2 = particle_structs::getMemberView<Types,2>(new_particles);
    lid_t cap = structure->capacity();
    //Assign new ptcl elements and identifiers
    Kokkos::parallel_for("new_particle_elements", nnp,
    KOKKOS_LAMBDA (const int& i){
    new_particle_elements(i) = i%ne;
    new_particle_access(i) = i+cap;
    val1(i, 0) = val1(i, 1) = val1(i, 2) = val2(i) = 0;
    });
    //Rebuild with new ptcls
    structure->rebuild(new_element, new_particle_elements, new_particles);
  • I ensure all the particles information are within a specified (x, y) range (within the simulation domain), for example: 0<=x<=1, 0<=y<=1.
  • But after particle rebuild, many of the particles' position are well outside the simulation domain.
Similarly, if no new particles are being added:
  • during each time step, update particles' position through push;
  • then perform search to find particles' new position/element;
  • after that perform a particle structure rebuild;
  • If checking particle information after the rebuild, many of the particles have wrong coordinates information: not in the desired element at all. (I assume it will also be true for other particle properties, but coordinates are easier to verify).
  • The reason that after rebuild every particle should be in its correct element is that particle search will mark particles (which are no longer in the simulation domain) for deletion.
  • But apparently, this is not the case, so the most likely reason is some bug in the particle rebuild operation for CabM particle structure.
@zhangchonglin zhangchonglin added bug Something isn't working CabM labels Mar 22, 2024
@zhangchonglin
Copy link
Contributor Author

With CabM particle structure and only using 1 MPI rank (no migration needed), when the number of particles reaches a value ~>30 million, the following line is even returning negative number of particles, so definitely some bugs in CabM as num_ptcls value is already corrupted.

lid_t nPtcls() const {return num_ptcls;}

@dhyan1272
Copy link
Contributor

Yes, I see similar issues, the information associated with the particles gets changed and particles are no longer in the element.

@diamog
Copy link

diamog commented Oct 15, 2024

I tried to reproduce this on my local machine, but was unable to get a failure. Can either of you send a simplified example that produces the erroneous particle data after rebuild, so I can better investigate the issue?

@zhangchonglin
Copy link
Contributor Author

Hi Gerrett @diamog: thank you for looking into this.

  • This only happens when the number of particles per GPU is greater than ~40-50 millions, it may be challenging to debug.
  • I will see if I can reproduce this with a case having a very coarse mesh but large number of particles. After that I will send that to you.

@zhangchonglin
Copy link
Contributor Author

@diamog: I made a smaller XGCm test case with 1000 mesh elements that can reproduce the error on Summit:

  • But this case still uses 4 MPI ranks and 4 GPUs.
  • To reproduce the error, it requires more than 50 million particles per GPU.
  • With 50 million particles, I did not observe the issue.
  • With 55 million particles per GPU, I observe similar issue as in https://github.com/SCOREC/xgcm/issues/163.
Particle 0 is outside toroidal section [4.712389 < 0.000000 < 6.283185]
Particle 0 is outside toroidal section [4.712389 < 0.000000 < 6.283185]
Particle 0 is outside toroidal section [4.712389 < 0.000000 < 6.283185]
Particle 0 is outside toroidal section [4.712389 < 0.000000 < 6.283185]
Particle 0 is outside toroidal section [4.712389 < 0.000000 < 6.283185]
Particle 0 is outside toroidal section [4.712389 < 0.000000 < 6.283185]
Particle 0 is outside toroidal section [4.712389 < 0.000000 < 6.283185]
Particle 0 is outside toroidal section [4.712389 < 0.000000 < 6.283185]
Particle 0 is outside toroidal section [4.712389 < 0.000000 < 6.283185]
Particle 0 is outside toroidal section [4.712389 < 0.000000 < 6.283185]
Particle 0 is outside toroidal section [4.712389 < 0.000000 < 6.283185]
Particle 0 is outside toroidal section [4.712389 < 0.000000 < 6.283185]
Particle 0 is outside toroidal section [4.712389 < 0.000000 < 6.283185]
Particle 0 is outside toroidal section [4.712389 < 0.000000 < 6.283185]
Particle 0 is outside toroidal section [4.712389 < 0.000000 < 6.283185]
Particle 0 is outside toroidal section [4.712389 < 0.000000 < 6.283185]
Particle 0 is outside toroidal section [4.712389 < 0.000000 < 6.283185]
Particle 0 is outside toroidal section [4.712389 < 0.000000 < 6.283185]
Particle 0 is outside toroidal section [4.712389 < 0.000000 < 6.283185]
Particle 0 is outside toroidal section [4.712389 < 0.000000 < 6.283185]
Particle 0 is outside toroidal section [4.712389 < 0.000000 < 6.283185]
Particle 0 is outside toroidal section [4.712389 < 0.000000 < 6.283185]
Particle 0 is outside toroidal section [4.712389 < 0.000000 < 6.283185]
Particle 0 is outside toroidal section [4.712389 < 0.000000 < 6.283185]
Particle 0 is outside toroidal section [4.712389 < 0.000000 < 6.283185]
Particle 0 is outside toroidal section [4.712389 < 0.000000 < 6.283185]

Looking at this error closer, it corresponds to this code segment, https://github.com/SCOREC/xgcm/blob/1a0ad1e80a4cb099c3fd6b31ab10a1c835aebc4c/src/xgcm_particle.hpp#L329-L344:

    //Check to see if particles are all in correct places
    fp_t major_phi = mesh.getMajorPlaneAngle();
    fp_t minor_phi = mesh.getMinorPlaneAngle();
    auto coords = ptcls->template get<PTCL_COORDS>();
    auto pids = ptcls->template get<PTCL_IDS>();
    auto checkPtcls = PS_LAMBDA(const int& e, const int& p, const bool& m) {
      if (m) {
        auto pid = pids(p);
        if (!is_safe[e])
          printf("Particle %d is in an unsafe element\n", pid);
        if (coords(p,2) < minor_phi || coords(p,2) > major_phi)
          printf("Particle %d is outside toroidal section [%f < %f < %f]\n", pid,
                 minor_phi, coords(p,2), major_phi);
      }
    };
    ps::parallel_for(ptcls, checkPtcls, "check particles");

This suggests that pids array does not contain correct information any more as all particles' pid are 0, in addition to ptcls->template get<PTCL_COORDS>() does not contain correct information.

Please let me know if this case is still be too large for your test.

@diamog
Copy link

diamog commented Oct 31, 2024

Small update, I was able to reproduce a similar failure in pumi-pic's pseudoXGCm. I had to push particle counts to >70 million, but I am seeing particle data getting corrupted. In my case, the mask is getting corrupted so particles "disappear" mid simulation.

I don't have access to Summit, nor xgcm anymore so any reproduction will need to be within pumi-pic. I can work with my current failure and see what I can figure out.

@zhangchonglin
Copy link
Contributor Author

@diamog: Hi Gerrett, thank you for letting us know. I think this is the same issue as I was encountering: particle data is getting corrupted. I think working with pseudoXGCm should be completely fine as you are also seeing this issue when number of particles is very large.

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
bug Something isn't working CabM
Projects
None yet
Development

No branches or pull requests

3 participants