EAMxx: Adds aerosols heterogeneous freezing calculations in P3 microphysics #6947

Open: wants to merge 9 commits into base: master


singhbalwinder
Contributor

@singhbalwinder singhbalwinder commented Jan 25, 2025

Heterogeneous freezing calculations driven by prognostic aerosols are
added to P3 microphysics. Setting use_hetfrz_classnuc to true
turns on these calculations; otherwise, P3 uses the default
prescribed-aerosol calculations.

[BFB] for EAM and EAMxx


github-actions bot commented Jan 25, 2025

PR Preview Action v1.6.0

🚀 View preview at
https://E3SM-Project.github.io/E3SM/pr-preview/pr-6947/

Built to branch gh-pages at 2025-02-06 00:34 UTC.
Preview will be ready when the GitHub Pages deployment is complete.

@singhbalwinder
Contributor Author

TODO:

  1. Turn off this feature for default EAMxx
  2. Revive commented-out P3 tests after adding missing arguments in various function signatures.

@singhbalwinder singhbalwinder changed the title Adds aerosols heterogeneous freezing calculations in P3 microphysics EAMxx: Adds aerosols heterogeneous freezing calculations in P3 microphysics Jan 28, 2025
@mahf708
Contributor

mahf708 commented Jan 29, 2025

Quick comments:

  1. Please turn off the feature by default
  2. Please follow the do_ice_production procedure (as an example) for passing the flag inside
  3. Please hide most (if not all) additions inside if-else guards (with the new flag), for example, the add_required calls and such
  4. Please keep tests intact if you intend to integrate this
  5. Also, ensure you don't break PAM/MMF2 (I am almost 100% certain you're currently breaking it)
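
Items 1 and 3 of the list above can be sketched roughly as follows. This is only an illustration, not EAMxx code: P3Options and required_fields are hypothetical names, "qc" and "nc" are placeholder field names, and only use_hetfrz_classnuc and hetfrz_contact_nucleation_tend are taken from this PR.

```cpp
#include <cassert>
#include <string>
#include <vector>

// Hypothetical options struct; the flag name is the one introduced by this PR.
struct P3Options {
  bool use_hetfrz_classnuc = false;  // feature is off by default
};

// Hypothetical helper returning the fields the process would request.
std::vector<std::string> required_fields(const P3Options& opts) {
  // Placeholder names for fields that are always needed.
  std::vector<std::string> fields = {"qc", "nc"};
  if (opts.use_hetfrz_classnuc) {
    // Requested only when the new flag is on.
    fields.push_back("hetfrz_contact_nucleation_tend");
  }
  return fields;
}
```

With the request guarded this way, a default run never asks for the new field, so configurations that do not provide it should be unaffected.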

Contributor

@bartgol bartgol left a comment

I have a few comments. Mostly: why are lots of unit tests now commented out?

components/eamxx/cime_config/namelist_defaults_scream.xml (outdated, resolved)
const auto mask = qc_incld > qsmall;
switch (Iflag) {
case 1: // cloud droplet immersion freezing
ncheti_cnt.set(mask, frzimm*1.0e6/rho /* frzimm input is in [#/cm3] */ , Zero);
Contributor

In all these "set" calls, how often do you expect the mask to be true/false? If the mask could often be ALL false (not sometimes, often), then you may consider using if statements, to avoid computing the packs for the true case for nothing (e.g., in the 1st line we have to compute frzimm*1e6/rho regardless of whether we need it or not).

Note: this nano-opt makes sense only if you expect mask to be often false. I assume that's not the case, since qsmall is very small. But I don't know how qc_incld is computed, so maybe it's often 0?
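
The trade-off described above can be sketched with plain arrays. Pack, Mask, cloud_mask, any, immersion_always, and immersion_guarded are hypothetical stand-ins written for this illustration; they are not the EKAT types or the P3 implementation, and only the expression frzimm*1.0e6/rho and the test qc_incld > qsmall come from the snippet under review.

```cpp
#include <array>
#include <cassert>
#include <cstddef>

// Hypothetical stand-ins for a SIMD pack and its mask; not the EKAT API.
constexpr std::size_t kPackSize = 4;
using Pack = std::array<double, kPackSize>;
using Mask = std::array<bool, kPackSize>;

// Lane-wise test qc_incld > qsmall.
Mask cloud_mask(const Pack& qc_incld, double qsmall) {
  Mask m{};
  for (std::size_t i = 0; i < kPackSize; ++i) m[i] = qc_incld[i] > qsmall;
  return m;
}

// True if at least one lane of the mask is set.
bool any(const Mask& m) {
  for (bool b : m) {
    if (b) return true;
  }
  return false;
}

// Variant A: always evaluate frzimm*1e6/rho, then select per lane.
Pack immersion_always(const Pack& qc_incld, const Pack& frzimm,
                      const Pack& rho, double qsmall) {
  const Mask mask = cloud_mask(qc_incld, qsmall);
  Pack out{};
  for (std::size_t i = 0; i < kPackSize; ++i) {
    const double value = frzimm[i] * 1.0e6 / rho[i];  // computed for every lane
    out[i] = mask[i] ? value : 0.0;
  }
  return out;
}

// Variant B: return zeros early when no lane passes the mask.
Pack immersion_guarded(const Pack& qc_incld, const Pack& frzimm,
                       const Pack& rho, double qsmall) {
  const Mask mask = cloud_mask(qc_incld, qsmall);
  Pack out{};  // value-initialized to zero
  if (!any(mask)) return out;  // skip the arithmetic entirely
  for (std::size_t i = 0; i < kPackSize; ++i) {
    out[i] = mask[i] ? frzimm[i] * 1.0e6 / rho[i] : 0.0;
  }
  return out;
}
```

Both variants return the same values; the guarded one only saves work when the mask is frequently all false, which is the condition raised in the comment.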

Contributor Author

That is a good question. I am not sure about that. @kaizhangpnl or @AaronDonahue might know if mask can often be false or not.

@mahf708 mahf708 requested a review from brhillman February 2, 2025 18:16
@mahf708
Contributor

mahf708 commented Feb 2, 2025

Requesting reviews from @hassanbeydoun and @brhillman because I know they're very curious about and interested in this part of the p3 code

@singhbalwinder singhbalwinder force-pushed the jroverf/singhbalwinder/eamxx/add-het-frz-p3_rebase1_1 branch from 487f9b6 to 9be599e Compare February 3, 2025 19:25
@singhbalwinder singhbalwinder marked this pull request as ready for review February 5, 2025 20:57
@singhbalwinder
Contributor Author

@mahf708 and @bartgol: I have now addressed all the review comments. Please let me know if anything is still missing. The P3 tests passed on Compy. I will try running the tests on PM-GPUs.

Contributor

@mahf708 mahf708 left a comment

Looks good to me. If the tests pass, I support merging.

For the record, I will note that Balwinder, Luca, and I all agree we likely need to restructure the P3 code at some point in the future. This is outside the scope of the current PR, and we will think about finding the time to do it at a later point.

@mahf708
Contributor

mahf708 commented Feb 5, 2025

One of the public CI tests failed with the error below, which I think is related to my earlier comment (#6947 (comment)):

 FAIL:
!m_add_time_dim
/__w/E3SM/E3SM/components/eamxx/src/share/io/scorpio_output.cpp:477
Error! Time-dependent output field 'hetfrz_contact_nucleation_tend' has not been initialized yet
.

 FAIL:
!m_add_time_dim
/__w/E3SM/E3SM/components/eamxx/src/share/io/scorpio_output.cpp:477
Error! Time-dependent output field 'hetfrz_contact_nucleation_tend' has not been initialized yet
.

 FAIL:
!m_add_time_dim
/__w/E3SM/E3SM/components/eamxx/src/share/io/scorpio_output.cpp:477
Error! Time-dependent output field 'hetfrz_contact_nucleation_tend' has not been initialized yet
.

 FAIL:
!m_add_time_dim
/__w/E3SM/E3SM/components/eamxx/src/share/io/scorpio_output.cpp:477
Error! Time-dependent output field 'hetfrz_contact_nucleation_tend' has not been initialized yet
.

To reproduce locally, run this test:

ERS_Ld5_P4.ne4pg2_oQU480.F2010-SCREAMv1-MPASSI.<MACHINE>_<COMPILER>.eamxx-prod

@singhbalwinder
Contributor Author

Thanks, Naser! With your help, I have fixed this test.

@odiazib odiazib self-requested a review February 6, 2025 00:40
Contributor

@mahf708 mahf708 left a comment

@singhbalwinder, looks good to me from the standpoint of actual p3 runtime and eamxx runtime, so I am approving.

Note there are likely two sticky problems that someone (not me, because I already gave up, see link below) has to contend with one way or another:

  1. Your PR is making the p3 unit tests fail (these are p3_tests and p3_sk_tests, found under components/eamxx/p3/tests). The change seems to be in the comparison only, so something you did is changing those. That aligns with my prior experience, but I opted to close the PR rather than figuring it out. See link below.
  2. A slightly less sticky problem is resolving the MMF2 test. It doesn't fail in the CI here because it didn't run at all, but if you test it locally, it will likely fail to build: try SMS_Ln3_P4.ne4pg2_oQU480.F2010-MMF2 on some machine with this PR and it will almost certainly fail to build. I can probably help you fix this if you want; I have fixed these failures multiple times in the past.

I ran into this precise situation a few weeks ago, with even much simpler code edits, and decided not to bother with it. You can see the discussion here: #6938

@bartgol
Contributor

bartgol commented Feb 6, 2025

> 1. Your PR is making the p3 unit tests fail (these are p3_tests and p3_sk_tests, found under components/eamxx/p3/tests). The change seems to be in the comparison only, so something you did is changing those. That aligns with my prior experience, but I opted to close the PR rather than figuring it out. See link below.

@singhbalwinder I noticed that only p3_tests/p3_sk_tests fail, while all of the XYZ_baseline_cmp tests (where XYZ includes p3) pass. Is it because you hard-code the new fields to the constant value they have in master?

Also, it's interesting that the tests pass in the FPE build. The main difference between FPE and DBG is that the former uses a pack size of 1. That said, CUDA builds also use a pack size of 1, and yet they fail. I would love it if you dug a bit to see if there's an explanation for why all standalone tests fail but the FPE build passes. If the failures and passes are expected, then great. If not, I'd hold off the merge.

@mahf708
Contributor

mahf708 commented Feb 6, 2025

Update: the MMF2 test fails with this annoying error:

e3sm.exe: /home/runner/_work/E3SM/E3SM/externals/ekat/src/ekat/kokkos/ekat_subview_utils.hpp:32: ekat::Unmanaged<Kokkos::View<ST*, Kokkos::LayoutRight, Props ...> > ekat::subview(ViewLR<ST**, Props ...>&, int) [with ST = const Pack<double, 1>; Props = {Kokkos::Device<Kokkos::Serial, Kokkos::HostSpace>, Kokkos::MemoryTraits<0>}; Unmanaged<Kokkos::View<ST*, Kokkos::LayoutRight, Props ...> > = Kokkos::View<const Pack<double, 1>*, Kokkos::LayoutRight, Kokkos::Device<Kokkos::Serial, Kokkos::HostSpace>, Kokkos::MemoryTraits<1> >; ViewLR<ST**, Props ...> = Kokkos::View<const Pack<double, 1>**, Kokkos::LayoutRight, Kokkos::Device<Kokkos::Serial, Kokkos::HostSpace>, Kokkos::MemoryTraits<0> >]: Assertion `v.data() != nullptr' failed.

Program received signal SIGABRT: Process abort signal.

This type of error is almost certainly to do with missing views in the diagnostic_inputs struct based on my prior experience, but I could be misremembering things. Could be fixed by populating these in the PAM interface.

@bartgol
Contributor

bartgol commented Feb 6, 2025

> Update: the MMF2 test fails with this annoying error:
>
> e3sm.exe: /home/runner/_work/E3SM/E3SM/externals/ekat/src/ekat/kokkos/ekat_subview_utils.hpp:32: ekat::Unmanaged<Kokkos::View<ST*, Kokkos::LayoutRight, Props ...> > ekat::subview(ViewLR<ST**, Props ...>&, int) [with ST = const Pack<double, 1>; Props = {Kokkos::Device<Kokkos::Serial, Kokkos::HostSpace>, Kokkos::MemoryTraits<0>}; Unmanaged<Kokkos::View<ST*, Kokkos::LayoutRight, Props ...> > = Kokkos::View<const Pack<double, 1>*, Kokkos::LayoutRight, Kokkos::Device<Kokkos::Serial, Kokkos::HostSpace>, Kokkos::MemoryTraits<1> >; ViewLR<ST**, Props ...> = Kokkos::View<const Pack<double, 1>**, Kokkos::LayoutRight, Kokkos::Device<Kokkos::Serial, Kokkos::HostSpace>, Kokkos::MemoryTraits<0> >]: Assertion `v.data() != nullptr' failed.
>
> Program received signal SIGABRT: Process abort signal.
>
> This type of error is almost certainly to do with missing views in the diagnostic_inputs struct based on my prior experience, but I could be misremembering things. Could be fixed by populating these in the PAM interface.

It could also be a bad subview index, e.g., trying to subview an (ncols,nlevs) view at first index ncols ...
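
The two hypotheses above (a view that was never populated, or a first index that is out of range) can be told apart with a check like the one sketched here. View2D, SubviewStatus, and check_subview are hypothetical names for this illustration; this is not the ekat::subview implementation and it does not diagnose the actual MMF2 failure.

```cpp
#include <cassert>
#include <cstddef>
#include <vector>

// Hypothetical 2D view of shape (ncols, nlevs); not a Kokkos::View.
struct View2D {
  std::size_t ncols;
  std::size_t nlevs;
  std::vector<double> data;  // empty models a view that was never allocated
};

enum class SubviewStatus { Ok, NullData, BadIndex };

// Classify a request for the row at first index icol.
SubviewStatus check_subview(const View2D& v, std::size_t icol) {
  if (v.data.empty()) return SubviewStatus::NullData;   // nothing to take a subview of
  if (icol >= v.ncols) return SubviewStatus::BadIndex;  // valid indices are 0..ncols-1
  return SubviewStatus::Ok;
}
```

Requesting index ncols on a populated view reports BadIndex, while any request on an unpopulated view reports NullData, which is the case the assertion on v.data() guards against.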

@singhbalwinder
Contributor Author

Both p3_tests/p3_sk_tests passed on my end on Compy (standalone build) and pm-gpu (using test-all-scream). Yesterday I tried with different pack sizes and omp threads on Compy and they all passed. I do not expect any non-MAM4xx tests to fail as the new flag is set to false by default.

In the P3 tests, I am using the engine to generate input for the new flag. I thought it was okay to use the engine, as the asserts will compare the outputs consistently (e.g., output when the flag is true on device vs. output when the flag is true on host, and vice versa). Should I hardwire it to false always?

Once I reproduce it locally, I should be able to debug it. I am currently looking at ways to reproduce it.
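
One effect that may be worth ruling out is sketched below, under the assumption (not confirmed in this thread) that the flag and the other test inputs are drawn from the same engine. In that case an extra draw for the flag shifts every later input, so comparisons within one run stay consistent while a comparison against inputs generated without the extra draw does not. draw_inputs and draw_flag are hypothetical names, and std::mt19937 stands in for whatever engine the tests use.

```cpp
#include <cassert>
#include <cstdint>
#include <random>
#include <vector>

// Draw n raw values from the engine, standing in for generated test inputs.
std::vector<std::uint32_t> draw_inputs(std::mt19937& engine, int n) {
  std::vector<std::uint32_t> values;
  for (int i = 0; i < n; ++i) {
    values.push_back(static_cast<std::uint32_t>(engine()));
  }
  return values;
}

// Hypothetical: derive a boolean flag from one engine draw.
bool draw_flag(std::mt19937& engine) {
  return (engine() & 1u) != 0u;
}
```

Two engines with the same seed produce identical inputs, but if one of them first spends a draw on the flag, its inputs are the other's sequence shifted by one.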

@mahf708
Contributor

mahf708 commented Feb 6, 2025

> Both p3_tests/p3_sk_tests passed on my end on Compy (standalone build) and pm-gpu (using test-all-scream). Yesterday I tried with different pack sizes and omp threads on Compy and they all passed. I do not expect any non-MAM4xx tests to fail as the new flag is set to false by default.
>
> In p3 tests, I am using the engine to generate input for the new flag. I thought it was okay to use engine as the asserts will compare the outputs consistently (e.g. output when flag is true on device vs. output when flag is true on host and vice versa). Should I hardwire it to false always?
>
> Once I reproduce it locally, I should be able to debug it. I am currently looking at ways to reproduce it.

The failures have to do with the baseline comparison, so you likely need to generate the baselines from before this PR and then run the tests with comparison enabled. Check out the test-all-scream options.

@rljacob rljacob added the EAMxx PRs focused on capabilities for EAMxx label Feb 6, 2025