Performance scaling #94

Open
MartinDix opened this issue Sep 11, 2024 · 14 comments

@MartinDix

From Spencer's 20240827-release-preindustrial+concentrations-run with 192 atmosphere, 180 ocean, 16 ice PEs

Walltime from UM atm.fort6.pe0 files

[Image: esm_time]

The difference between the PBS time and the UM time is around 60 s. Variation here will make the scaling analysis trickier.
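
For reference, the UM number here is the "Maximum Elapsed Wallclock Time" reported in each atm.fort6.pe0 file. Below is a minimal sketch of pulling it out of a set of runs; the payu-style archive/output* layout is an assumption, not necessarily how these runs are organised:

```python
import re
from pathlib import Path

# Assumed layout: one atm.fort6.pe0 per run under archive/output*/atmosphere/.
WALLTIME_RE = re.compile(r"Maximum Elapsed Wallclock Time:\s*([0-9.]+)")

def um_walltime(fort6_path: Path) -> float | None:
    """Return the UM 'Maximum Elapsed Wallclock Time' in seconds, or None if absent."""
    match = WALLTIME_RE.search(fort6_path.read_text(errors="ignore"))
    return float(match.group(1)) if match else None

for path in sorted(Path("archive").glob("output*/atmosphere/atm.fort6.pe0")):
    print(path.parent.parent.name, um_walltime(path))
```

The PBS "Walltime Used" comes from the job logs, so the two can be compared run by run.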

@MartinDix

MartinDix commented Sep 11, 2024

Similarly for the 20240823-dev-historical-concentrations-run
[Image: esm_time]

This seems to be system related, because both runs got faster at about the same time (~Aug 31).

@MartinDix

From the internal UM timer diagnostics, e.g.

```
 Maximum Elapsed Wallclock Time:    4347.89442100020
 Speedup:    191.985828730169
 --------------------------------------------
                              Non-Inclusive Timer Summary for PE    0
   ROUTINE              CALLS  TOT CPU    AVERAGE   TOT WALL  AVERAGE  % CPU    % WALL    SPEED-UP
  1 ATM_STEP             ****    836.89      0.05    836.93      0.05   19.25   19.25      1.00
  2 Convect              ****    523.78      0.03    523.78      0.03   12.05   12.05      1.00
  3 SL_tracer2           ****    469.82      0.03    469.83      0.03   10.81   10.81      1.00
  4 SL_tracer1           ****    445.61      0.03    445.62      0.03   10.25   10.25      1.00
...
```

Calculated standard deviations of the routine wall times across the runs. The sorted values are:

| Routine | SD (s) | Total time (s) |
| --- | --- | --- |
| TOTAL | 173.8 | 4452 |
| SFEXCH | 94.6 | 182 |
| Atmos_Physics2 | 46.3 | 146 |
| PUTO2A_COMM | 15.1 | 31 |
| ATM_STEP | 12.9 | 829 |
| U_MODEL | 11.4 | 18 |
| GETO2A_COMM | 7.4 | 55 |

It is notable that most of the standard deviation comes from just two relatively inexpensive routines. Excluding these removes most of the variation:

[Image: esm_time2]

SFEXCH includes boundary-layer calculations without any communication, so it's a mystery why it should be so variable.
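
For what it's worth, a minimal sketch of how a per-routine spread like the table above can be computed, assuming each run's Non-Inclusive Timer Summary has already been parsed into a dict mapping routine name to total wall time (the parsing step is omitted, and `routine_spread` is a hypothetical helper, not the script actually used):

```python
from statistics import mean, stdev

def routine_spread(runs: list[dict[str, float]]) -> list[tuple[str, float, float]]:
    """Return (routine, wall-time standard deviation, mean wall time), largest SD first.

    `runs` holds one dict per run, mapping routine names from the Non-Inclusive
    Timer Summary to their total wall time in seconds (needs at least two runs).
    """
    rows = []
    for name in set().union(*runs):
        samples = [run.get(name, 0.0) for run in runs]
        rows.append((name, stdev(samples), mean(samples)))
    return sorted(rows, key=lambda row: row[1], reverse=True)
```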

@blimlim

blimlim commented Oct 18, 2024

The routines U_MODEL and PUTO2A_COMM also contribute to the variation, though less so:
[Image: Slow routines hist]

In the pre-industrial simulation, however, U_MODEL had a larger contribution to the total walltime variation:
[Image: Slow routines PI]

Other typically expensive routines show much less variation:
[Image: Stable routines hist]

Slowdowns of the high-variability routines are fairly consistent across all the processors, according to the MPP : None Inclusive timer summary WALLCLOCK TIMES table, where there's typically a ~4 s difference between the min and max time for SFEXCH:
[Image: Historical run SFEXCH]

U_MODEL is less consistent (~30 s difference between max and min), though the max and min times follow similar variations:
[Image: Historical run U_MODEL]

@blimlim

blimlim commented Oct 18, 2024

Repeating a single-year simulation, with runs broken down into month-long segments, also shows large variation across the different simulations and different months. Some individual months take almost double the time of others.
[Image: total walltime]

The same four routines contribute significantly to the variations, especially U_MODEL. At this smaller scale, the four routines' times don't correlate with each other as clearly.
[Image: routines month]

and in this case, there's still significant variation after subtracting these routines' times from the total walltime:
[Image: total walltime minus routines]

@blimlim

blimlim commented Oct 22, 2024

Occasional extreme cases can occur. The following results are for a month-long simulation configured to write separate UM output each day, run on 2024-10-04. The exact same simulation was then repeated on 2024-10-08:

2024-10-04:

Walltime Used: 00:21:16    

2024-10-08:

Walltime Used: 00:07:14  

Differences in mean routine times across the PEs are

| Routine | Time difference (s) |
| --- | --- |
| SFEXCH | 251.73 |
| U_MODEL | 198.00 |
| Atmos_Physics2 | 160.68 |
| ATM_STEP | 83.77 |
| PUTO2A_COMM | 69.58 |
| Q_Pos_Ctl | 39.90 |
| GETO2A_COMM | 13.09 |
| SW Rad | 7.97 |
| INCRTIME | 3.01 |
| Diags | 2.27 |
| INITIAL | 2.27 |
| PPCTL_REINIT | 1.06 |
| MICROPHYS_CTL | 0.16 |
| DUMPCTL | 0.13 |
| SF_IMPL | 0.13 |
| INITDUMP | 0.07 |
| PHY_DIAG | 0.05 |
| PE_Helmholtz | 0.03 |
| UP_ANCIL | 0.02 |
| KMKHZ | 0.02 |

Plots of the same thing:

[Image: slow_routines]

[Image: fast_routines]

It's mostly the same routines as before causing the slowdown. However, ATM_STEP and Q_Pos_Ctl are both much slower in the 2024-10-04 run, while they were both fairly stable in the longer historical simulation.
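
The table above is the change in mean-over-PEs wall time between the two runs. A sketch of that comparison, assuming the per-PE wallclock tables have already been parsed into dicts of routine name to a list of times, one entry per PE (`mean_routine_differences` is a hypothetical helper):

```python
from statistics import mean

def mean_routine_differences(slow: dict[str, list[float]],
                             fast: dict[str, list[float]]) -> list[tuple[str, float]]:
    """Difference in mean-over-PEs wall time (slow run minus fast run), largest first.

    Each input maps a routine name to its wall times across all PEs for one run.
    """
    common = slow.keys() & fast.keys()
    diffs = [(name, mean(slow[name]) - mean(fast[name])) for name in common]
    return sorted(diffs, key=lambda item: item[1], reverse=True)
```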

@blimlim

blimlim commented Oct 25, 2024

These variations make it hard to draw strong conclusions from scaling tests. From 5-year simulations with various atmosphere decompositions, providing an extra node to the atmosphere and using the atm.240.X16.Y15 decomposition appears to decrease the run length by ~10% with very little impact on SU usage:

[Image: 5 Runs atmosphere scaling]

and cuts down the significant ocean coupling wait time:
[Image: coupling wait times]

However, when extending the atm.240.X16.Y15-ocn.180-ice.12 run to 20 years, there were massive variations in the walltimes, with the worst years being slower than the slowest times for the default atm.192.X16.Y12-ocn.180-ice.12 decomposition.

[Image]

The same routines were again implicated in the slowdowns:
[Image: atm_240_routine_times]

It's hard to tell whether adding more processors to the atmosphere somehow made the timing inherently more unstable, or whether that run was unlucky and impacted by unknown external factors. There are simultaneous runs of each decomposition currently going, which will hopefully help us understand this better.

@blimlim

blimlim commented Nov 6, 2024

Running amip simulations with 192 (standard) and 240 processors, the runs appeared much more stable, with only a single jump occurring in the 192 PE simulation.
[Image]
The jump again corresponded to increased time in the Atmos_Physics2 and SFEXCH routines; however, the other routines usually implicated seemed fairly stable.
[Image]

It's unclear whether the amip configuration is inherently more stable than the coupled configuration, or if system conditions at the time happened to make these runs more stable.


To test this, we tried running two amip and two coupled simulations at the same time (192 and 240 atmosphere PEs in each case, with 16 PEs in the x direction and 15 in the y direction for the 240 PE case). Each configuration was run one year at a time, with each year set off at the same time across the configurations. Due to a mistake, the first three years of the 240 PE configurations were out of sync, and one year's PBS logs from the 192 PE coupled run went missing somehow, but apart from that the runs were in sync.

Very few instabilities occurred in any of the runs:

[Image]
[Image]

Only one year of the 240 PE coupled run jumped up; however, this was due to the atmosphere waiting for the ocean, and appears to have a different cause from the instabilities we'd seen so far:

[Image]

Based on these simulations alone, using 240 processors with the x:16, y:15 decomposition looks favourable for the default configuration. We get a ~10% drop in walltime for practically no difference in SUs. However, it's again hard to make any concrete recommendations due to the instabilities that have appeared in our other tests.
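
As a rough cross-check of the SU claim (the numbers below are my own assumptions, not taken from these runs: whole-node requests on 48-core Gadi nodes, the normal queue rate of 2 SU per core-hour, and an illustrative 10% walltime reduction):

```python
CORES_PER_NODE = 48          # Gadi Cascade Lake nodes (assumption for this estimate)
SU_PER_CORE_HOUR = 2         # normal queue charge rate (assumption)

def su_per_year(atm: int, ocn: int, ice: int, walltime_hours: float) -> float:
    """SUs for one model year, rounding the core request up to whole nodes."""
    cores = atm + ocn + ice
    nodes = -(-cores // CORES_PER_NODE)      # ceiling division
    return nodes * CORES_PER_NODE * walltime_hours * SU_PER_CORE_HOUR

baseline = su_per_year(192, 180, 12, walltime_hours=1.0)   # atm.192.X16.Y12-ocn.180-ice.12
larger = su_per_year(240, 180, 12, walltime_hours=0.9)     # atm.240.X16.Y15, ~10% faster
print(larger / baseline)   # ~1.01: ~12% more cores and ~10% less walltime roughly cancel
```

Under those assumptions the extra node is almost exactly paid for by the shorter walltime, consistent with the "practically no difference in SUs" observation.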

@blimlim

blimlim commented Dec 2, 2024

@aidanheerdegen suggested some extra sanity checks to test whether writing diagnostics contributes to the wall time variation. The previous simulations were run using the "standard" presets, apart from those in the last comment, where I used the draft "spinup" preset to save space.

I've done two additional runs – one with the "standard" preset, and one with no output – both of which were started at the same time. Both increased the atmosphere to 240 processors, as I'd like to confirm whether this configuration is generally faster.

The following shows the wall times for the two runs:
[Image]

Both were essentially stable, with only two single-year jumps in the diagnostic saving run.

[Image]

The routine behind the largest jump, GETO2A_COMM, differed from those we typically saw earlier, so this particular jump likely had a different cause. Since we didn't see any coherent large swings in either run's wall times, it's hard to say what role the diagnostics play.

In any case, it looks like under "normal/good" circumstances, increasing the atmosphere to 240 processors does cut down the walltime without a cost in SUs. @MartinDix and @aidanheerdegen, would you be in favour of changing the decomposition in our released configurations?

@aidanheerdegen

aidanheerdegen commented Dec 2, 2024

> In any case, it looks like under "normal/good" circumstances, increasing the atmosphere to 240 processors does cut down the walltime without a cost in SUs. @MartinDix and @aidanheerdegen, would you be in favour of changing the decomposition in our released configurations?

Seems like a no-brainer to me.

Some other points:

  1. Those runs with the horrendous runtime variations in the first 60 years seem to have been related to some gadi hardware issues. Arguably, if we're that far outside the expected runtime, maybe we should be stopping user simulations and feeding back to NCI that there are issues?
  2. When comparing month by month, the variation correlates with the boreal and austral winters, when the sea-ice computation kicks in. I see CICE is using 12 CPUs with 180 for the ocean, whereas ACCESS-OM2 uses 24 CPUs with 216 for the ocean. Is it worth bumping the CICE CPUs up a little to see if it helps smooth out some of the variation? Especially if the much larger ocean and atmosphere are waiting on CICE?
  3. Might be worth asking @manodeep's opinion.
  4. Running without diagnostics shows two useful pieces of data: diagnostics add 10% to wall time, and they can add up to an additional 15% due to transient disk performance problems.

@manodeep

manodeep commented Dec 2, 2024

Thanks @aidanheerdegen. I took a quick look through the issue:

  • 10% extra time for collecting diagnostics info seems reasonable (though collecting only timers should be effectively free, so presumably more diagnostic info is being collected).
  • I skimmed through the source for SFEXCH and, other than replacing functions called within a for-loop with an equivalent function that absorbs the loop into its body, nothing else immediately stood out as a potential culprit. NUMA issues could also cause high variability - are the tasks pinned to cores, or can the OS migrate threads? (A quick affinity check is sketched below.)
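
On the pinning question, one quick sanity check is to have every rank report its CPU affinity at startup. A minimal sketch, assuming mpi4py is available in the job environment (for the Fortran executables themselves, a launcher-level report such as Open MPI's --report-bindings would give the same information):

```python
import os
import socket

from mpi4py import MPI   # assumption: mpi4py is installed in the job's environment

rank = MPI.COMM_WORLD.Get_rank()

# On Linux, sched_getaffinity reports the CPUs this process is allowed to run on.
# A pinned task shows one core (or one NUMA domain's cores); an unpinned task
# typically shows every core on the node, meaning the OS is free to migrate it.
allowed = sorted(os.sched_getaffinity(0))
print(f"rank {rank} on {socket.gethostname()}: CPUs {allowed}", flush=True)
```

If each rank reports the full set of node CPUs rather than a small pinned subset, tasks can migrate across cores and NUMA domains, which could plausibly contribute to the variability.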

@aidanheerdegen

> 10% extra time for collecting diagnostics info seems reasonable

Sorry @manodeep, but confusingly in the ESM world "diagnostics" often (and in this case) refers to output data fields. They are "diagnosed" (derived) from the physical (prognostic) fields; for the ocean, e.g., the prognostic fields are temperature, salinity and velocity (I might have missed something). Everything else is derived from them, and there is a huge number of diagnostic variables, so they're never all saved, as some are for very specific use cases. Also, as the resolution increases, the amount of disk space these variables consume increases dramatically, so often only very specific diagnostic variables are saved.

@manodeep

manodeep commented Dec 2, 2024

Hahahaha - I should have known better than to assume terminology in an entirely different field 😄! 10% for i/o seems reasonable - presumably the io is (somewhat) parallel?

@aidanheerdegen

> Hahahaha - I should have known better than to assume terminology in an entirely different field 😄!

Well it gets even more confusing, because I know the MOM5 (ocean) model has diag_step as a user configurable setting for a number of the sub-components that controls .. wait for it .. model diagnostics, which is what you were referring to above: runtime analysis of model performance, conservation of terms etc.

It's almost intentionally confusing! :)

> 10% for i/o seems reasonable - presumably the io is (somewhat) parallel?

Different models differ. The ocean model gathers ranks and outputs a number of tiles from separate PEs in parallel, and then stitches them together in a post-processing step. CICE (ice model) is serial, and becomes a serious bottleneck for higher resolutions, so the OM2 model uses PIO to parallelise the IO. Not sure about the atmosphere. Again at higher core counts it uses an IO server (xios), but I'm not sure for this model.

The problem with IO isn't the overhead so much, it's the incredible variability when the lustre file system gets overloaded. It means we have to put in some wall time padding to allow for this, which affects the queue time, though I've not quantified how much difference it makes.

@manodeep

manodeep commented Dec 3, 2024

> Different models differ. The ocean model gathers ranks and outputs a number of tiles from separate PEs in parallel, and then stitches them together in a post-processing step. CICE (ice model) is serial, and becomes a serious bottleneck for higher resolutions, so the OM2 model uses PIO to parallelise the IO. Not sure about the atmosphere. Again at higher core counts it uses an IO server (xios), but I'm not sure for this model.
>
> The problem with IO isn't the overhead so much, it's the incredible variability when the lustre file system gets overloaded. It means we have to put in some wall time padding to allow for this, which affects the queue time, though I've not quantified how much difference it makes.

I have certainly encountered issues with the metadata servers being overloaded by other jobs, resulting in significant differences in runtime. One option could be to write to node-level filesystems (typically jobfs) and then periodically push the output to Lustre. Another option could be to use asynchronous IO.
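
A minimal sketch of the jobfs-staging idea (the paths, file pattern and flush point are placeholders; on Gadi the node-local scratch directory is exposed to the job via $PBS_JOBFS):

```python
import os
import shutil
from pathlib import Path

# Placeholder paths: write model output to node-local jobfs during the run, then
# copy finished files to Lustre in one pass (e.g. at the end of each segment).
jobfs_outdir = Path(os.environ["PBS_JOBFS"]) / "model_output"
lustre_outdir = Path("/scratch/<project>/<user>/run/archive")   # hypothetical destination

def flush_to_lustre() -> None:
    lustre_outdir.mkdir(parents=True, exist_ok=True)
    for path in sorted(jobfs_outdir.glob("*.nc")):
        shutil.copy2(path, lustre_outdir / path.name)
        path.unlink()   # free node-local space once the copy has landed on Lustre
```

This concentrates the Lustre traffic into a few large sequential copies rather than many small writes during the timestep loop, though the final copy can still be slow if the filesystem happens to be loaded at that moment.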
