Performance scaling #94

Open
MartinDix opened this issue Sep 11, 2024 · 14 comments

@MartinDix

From Spencer's 20240827-release-preindustrial+concentrations-run with 192 atmosphere, 180 ocean, 16 ice PEs

Walltime from UM atm.fort6.pe0 files

[Image: esm_time]

The difference between the PBS time and the UM time is around 60 s. Variation here will make the scaling analysis trickier.
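
For reference, the UM number here is the "Maximum Elapsed Wallclock Time" reported in each atm.fort6.pe0 file. Below is a minimal sketch of pulling it out of a set of runs; the payu-style archive/output* layout is an assumption, not necessarily how these runs are organised:

```python
import re
from pathlib import Path

# Assumed layout: one atm.fort6.pe0 per run under archive/output*/atmosphere/.
WALLTIME_RE = re.compile(r"Maximum Elapsed Wallclock Time:\s*([0-9.]+)")

def um_walltime(fort6_path: Path) -> float | None:
    """Return the UM 'Maximum Elapsed Wallclock Time' in seconds, or None if absent."""
    match = WALLTIME_RE.search(fort6_path.read_text(errors="ignore"))
    return float(match.group(1)) if match else None

for path in sorted(Path("archive").glob("output*/atmosphere/atm.fort6.pe0")):
    print(path.parent.parent.name, um_walltime(path))
```

The PBS "Walltime Used" comes from the job logs, so the two can be compared run by run.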

@MartinDix

MartinDix commented Sep 11, 2024

Similarly for the 20240823-dev-historical-concentrations-run
[Image: esm_time]

This seems to be system related, because both runs got faster at about the same time (~Aug 31).

@MartinDix

From the internal UM timer diagnostics, e.g.

```
 Maximum Elapsed Wallclock Time:    4347.89442100020
 Speedup:    191.985828730169
 --------------------------------------------
                              Non-Inclusive Timer Summary for PE    0
   ROUTINE              CALLS  TOT CPU    AVERAGE   TOT WALL  AVERAGE  % CPU    % WALL    SPEED-UP
  1 ATM_STEP             ****    836.89      0.05    836.93      0.05   19.25   19.25      1.00
  2 Convect              ****    523.78      0.03    523.78      0.03   12.05   12.05      1.00
  3 SL_tracer2           ****    469.82      0.03    469.83      0.03   10.81   10.81      1.00
  4 SL_tracer1           ****    445.61      0.03    445.62      0.03   10.25   10.25      1.00
...
```

Calculated standard deviations of the routine wall times across the runs. The sorted values are:

| Routine | SD (s) | Total time (s) |
| --- | --- | --- |
| TOTAL | 173.8 | 4452 |
| SFEXCH | 94.6 | 182 |
| Atmos_Physics2 | 46.3 | 146 |
| PUTO2A_COMM | 15.1 | 31 |
| ATM_STEP | 12.9 | 829 |
| U_MODEL | 11.4 | 18 |
| GETO2A_COMM | 7.4 | 55 |

It is notable that most of the standard deviation comes from just two relatively inexpensive routines. Excluding these removes most of the variation:

[Image: esm_time2]

SFEXCH includes boundary-layer calculations without any communication, so it's a mystery why it should be so variable.
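
For what it's worth, a minimal sketch of how a per-routine spread like the table above can be computed, assuming each run's Non-Inclusive Timer Summary has already been parsed into a dict mapping routine name to total wall time (the parsing step is omitted, and `routine_spread` is a hypothetical helper, not the script actually used):

```python
from statistics import mean, stdev

def routine_spread(runs: list[dict[str, float]]) -> list[tuple[str, float, float]]:
    """Return (routine, wall-time standard deviation, mean wall time), largest SD first.

    `runs` holds one dict per run, mapping routine names from the Non-Inclusive
    Timer Summary to their total wall time in seconds (needs at least two runs).
    """
    rows = []
    for name in set().union(*runs):
        samples = [run.get(name, 0.0) for run in runs]
        rows.append((name, stdev(samples), mean(samples)))
    return sorted(rows, key=lambda row: row[1], reverse=True)
```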

@blimlim

blimlim commented Oct 18, 2024

The routines U_MODEL and PUTO2A_COMM also contribute to the variation, though less so:
[Image: Slow routines hist]

In the pre-industrial simulation, however, U_MODEL had a larger contribution to the total walltime variation:
[Image: Slow routines PI]

Other typically expensive routines show much less variation:
[Image: Stable routines hist]

Slowdowns of the high-variability routines are fairly consistent across all the processors, according to the MPP : None Inclusive timer summary WALLCLOCK TIMES table, where there's typically a ~4 s difference between the min and max time for SFEXCH:
[Image: Historical run SFEXCH]

U_MODEL is less consistent (~30 s difference between max and min), though the max and min times follow similar variations:
[Image: Historical run U_MODEL]

@blimlim

blimlim commented Oct 18, 2024

Repeating a single-year simulation, with runs broken down into month-long segments, also shows large variation across the different simulations and different months. Some individual months take almost double the time of others.
[Image: total walltime]

The same four routines contribute significantly to the variations, especially U_MODEL. At this smaller scale, the four routines' times don't correlate with each other as clearly.
[Image: routines month]

and in this case, there's still significant variation after subtracting these routines' times from the total walltime:
[Image: total walltime minus routines]

@blimlim

blimlim commented Oct 22, 2024

Occasional extreme cases can occur. The following results are for a month-long simulation configured to write separate UM output each day, run on 2024-10-04. The exact same simulation was then repeated on 2024-10-08:

2024-10-04:

Walltime Used: 00:21:16    

2024-10-08:

Walltime Used: 00:07:14  

Differences in mean routine times across the PEs are

| Routine | Time difference (s) |
| --- | --- |
| SFEXCH | 251.73 |
| U_MODEL | 198.00 |
| Atmos_Physics2 | 160.68 |
| ATM_STEP | 83.77 |
| PUTO2A_COMM | 69.58 |
| Q_Pos_Ctl | 39.90 |
| GETO2A_COMM | 13.09 |
| SW Rad | 7.97 |
| INCRTIME | 3.01 |
| Diags | 2.27 |
| INITIAL | 2.27 |
| PPCTL_REINIT | 1.06 |
| MICROPHYS_CTL | 0.16 |
| DUMPCTL | 0.13 |
| SF_IMPL | 0.13 |
| INITDUMP | 0.07 |
| PHY_DIAG | 0.05 |
| PE_Helmholtz | 0.03 |
| UP_ANCIL | 0.02 |
| KMKHZ | 0.02 |

Plots of the same thing:

[Image: slow_routines]

[Image: fast_routines]

It's mostly the same routines as before causing the slowdown. However, ATM_STEP and Q_Pos_Ctl are both much slower in the 2024-10-04 run, while they were both fairly stable in the longer historical simulation.
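
The table above is the change in mean-over-PEs wall time between the two runs. A sketch of that comparison, assuming the per-PE wallclock tables have already been parsed into dicts of routine name to a list of times, one entry per PE (`mean_routine_differences` is a hypothetical helper):

```python
from statistics import mean

def mean_routine_differences(slow: dict[str, list[float]],
                             fast: dict[str, list[float]]) -> list[tuple[str, float]]:
    """Difference in mean-over-PEs wall time (slow run minus fast run), largest first.

    Each input maps a routine name to its wall times across all PEs for one run.
    """
    common = slow.keys() & fast.keys()
    diffs = [(name, mean(slow[name]) - mean(fast[name])) for name in common]
    return sorted(diffs, key=lambda item: item[1], reverse=True)
```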

@blimlim

blimlim commented Oct 25, 2024

These variations make it hard to draw strong conclusions from scaling tests. From 5-year simulations with various atmosphere decompositions, providing an extra node to the atmosphere and using the atm.240.X16.Y15 decomposition appears to decrease the run length by ~10% with very little impact on SU usage:

[Image: 5 Runs atmosphere scaling]

and cuts down the significant ocean coupling wait time:
[Image: coupling wait times]

However, when extending the atm.240.X16.Y15-ocn.180-ice.12 run to 20 years, there were massive variations in the walltimes, with the worst years being slower than the slowest times for the default atm.192.X16.Y12-ocn.180-ice.12 decomposition.

[Image]

The same routines were again implicated in the slowdowns:
[Image: atm_240_routine_times]

It's hard to tell whether adding more processors to the atmosphere somehow made the timing inherently more unstable, or whether that run was unlucky and impacted by unknown external factors. There are simultaneous runs of each decomposition currently going, which will hopefully help us understand this better.

@blimlim

blimlim commented Nov 6, 2024

Running amip simulations with 192 (standard) and 240 processors, the runs appeared much more stable, with only a single jump occurring in the 192 PE simulation.
[Image]
The jump again corresponded to increased time in the Atmos_Physics2 and SFEXCH routines; however, the other routines usually implicated seemed fairly stable.
[Image]

It's unclear whether the amip configuration is inherently more stable than the coupled configuration, or if system conditions at the time happened to make these runs more stable.


To test this, we tried running two amip and two coupled simulations at the same time (192 and 240 atmosphere PEs in each case, with 16 PEs in the x direction and 15 in the y direction for the 240 PE case). Each configuration was run one year at a time, with each year set off at the same time across the configurations. Due to a mistake, the first three years of the 240 PE configurations were out of sync, and one year's PBS logs from the 192 PE coupled run went missing somehow, but apart from that the runs were in sync.

Very few instabilities occurred in any of the runs:

[Image]
[Image]

Only one year of the 240 PE coupled run jumped up; however, this was due to the atmosphere waiting for the ocean, and appears to have a different cause from the instabilities we'd seen so far:

[Image]

Based on these simulations alone, using 240 processors with the x:16, y:15 decomposition looks favourable for the default configuration. We get a ~10% drop in walltime for practically no difference in SUs. However, it's again hard to make any concrete recommendations due to the instabilities that have appeared in our other tests.
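
As a rough cross-check of the SU claim (the numbers below are my own assumptions, not taken from these runs: whole-node requests on 48-core Gadi nodes, the normal queue rate of 2 SU per core-hour, and an illustrative 10% walltime reduction):

```python
CORES_PER_NODE = 48          # Gadi Cascade Lake nodes (assumption for this estimate)
SU_PER_CORE_HOUR = 2         # normal queue charge rate (assumption)

def su_per_year(atm: int, ocn: int, ice: int, walltime_hours: float) -> float:
    """SUs for one model year, rounding the core request up to whole nodes."""
    cores = atm + ocn + ice
    nodes = -(-cores // CORES_PER_NODE)      # ceiling division
    return nodes * CORES_PER_NODE * walltime_hours * SU_PER_CORE_HOUR

baseline = su_per_year(192, 180, 12, walltime_hours=1.0)   # atm.192.X16.Y12-ocn.180-ice.12
larger = su_per_year(240, 180, 12, walltime_hours=0.9)     # atm.240.X16.Y15, ~10% faster
print(larger / baseline)   # ~1.01: ~12% more cores and ~10% less walltime roughly cancel
```

Under those assumptions the extra node is almost exactly paid for by the shorter walltime, consistent with the "practically no difference in SUs" observation.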

@blimlim

blimlim commented Dec 2, 2024

@aidanheerdegen suggested some extra sanity checks to test whether writing diagnostics contributes to the wall time variation. The previous simulations were run using the "standard" presets, apart from those in the last comment, where I used the draft "spinup" preset to save space.

I've done two additional runs – one with the "standard" preset, and one with no output – both of which were started at the same time. Both increased the atmosphere to 240 processors, as I'd like to confirm whether this configuration is generally faster.

The following shows the wall times for the two runs:
[Image]

Both were essentially stable, with only two single-year jumps in the diagnostic saving run.

[Image]

The routine behind the largest jump, GETO2A_COMM, differed from those we typically saw earlier, so this particular jump likely had a different cause. Since we didn't see any coherent large swings in either run's wall times, it's hard to say what role the diagnostics play.

In any case, it looks like under "normal/good" circumstances, increasing the atmosphere to 240 processors does cut down the walltime without a cost in SUs. @MartinDix and @aidanheerdegen, would you be in favour of changing the decomposition in our released configurations?

@aidanheerdegen

aidanheerdegen commented Dec 2, 2024

> In any case, it looks like under "normal/good" circumstances, increasing the atmosphere to 240 processors does cut down the walltime without a cost in SUs. @MartinDix and @aidanheerdegen, would you be in favour of changing the decomposition in our released configurations?

Seems like a no-brainer to me.

Some other points:

  1. Those runs with the horrendous runtime variations in the first 60 years seem to have been related to some gadi hardware issues. Arguably, if we're that far outside the expected runtime, maybe we should be stopping user simulations and feeding back to NCI that there are issues?
  2. When comparing month by month, the variation correlates with the boreal and austral winters, when the sea-ice computation kicks in. I see CICE is using 12 CPUs with 180 for the ocean, whereas ACCESS-OM2 uses 24 CPUs with 216 for the ocean. Is it worth bumping the CICE CPUs up a little to see if it helps smooth out some of the variation? Especially if the much larger ocean and atmosphere are waiting on CICE?
  3. Might be worth asking @manodeep's opinion.
  4. Running without diagnostics shows two useful pieces of data: diagnostics add 10% to wall time, and they can add up to an additional 15% due to transient disk performance problems.

@manodeep

manodeep commented Dec 2, 2024

Thanks @aidanheerdegen. I took a quick look through the issue:

  • 10% extra time for collecting diagnostics info seems reasonable (though collecting only timers should be effectively free, so presumably more diagnostic info is being collected).
  • I skimmed through the source for SFEXCH and, other than replacing functions called within a for-loop with an equivalent function that absorbs the loop into its body, nothing else immediately stood out as a potential culprit. NUMA issues could also cause high variability - are the tasks pinned to cores, or can the OS migrate threads? (A quick affinity check is sketched below.)
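
On the pinning question, one quick sanity check is to have every rank report its CPU affinity at startup. A minimal sketch, assuming mpi4py is available in the job environment (for the Fortran executables themselves, a launcher-level report such as Open MPI's --report-bindings would give the same information):

```python
import os
import socket

from mpi4py import MPI   # assumption: mpi4py is installed in the job's environment

rank = MPI.COMM_WORLD.Get_rank()

# On Linux, sched_getaffinity reports the CPUs this process is allowed to run on.
# A pinned task shows one core (or one NUMA domain's cores); an unpinned task
# typically shows every core on the node, meaning the OS is free to migrate it.
allowed = sorted(os.sched_getaffinity(0))
print(f"rank {rank} on {socket.gethostname()}: CPUs {allowed}", flush=True)
```

If each rank reports the full set of node CPUs rather than a small pinned subset, tasks can migrate across cores and NUMA domains, which could plausibly contribute to the variability.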

@aidanheerdegen

> 10% extra time for collecting diagnostics info seems reasonable

Sorry @manodeep, but confusingly in the ESM world "diagnostics" often (and in this case) refers to output data fields. They are "diagnosed" (derived) from the physical (prognostic) fields; for the ocean, e.g., the prognostic fields are temperature, salinity and velocity (I might have missed something). Everything else is derived from them, and there is a huge number of diagnostic variables, so they're never all saved, as some are for very specific use cases. Also, as the resolution increases, the amount of disk space these variables consume increases dramatically, so often only very specific diagnostic variables are saved.

@manodeep

manodeep commented Dec 2, 2024

Hahahaha - I should have known better than to assume terminology in an entirely different field 😄! 10% for i/o seems reasonable - presumably the io is (somewhat) parallel?

@aidanheerdegen

> Hahahaha - I should have known better than to assume terminology in an entirely different field 😄!

Well it gets even more confusing, because I know the MOM5 (ocean) model has diag_step as a user configurable setting for a number of the sub-components that controls .. wait for it .. model diagnostics, which is what you were referring to above: runtime analysis of model performance, conservation of terms etc.

It's almost intentionally confusing! :)

> 10% for i/o seems reasonable - presumably the io is (somewhat) parallel?

Different models differ. The ocean model gathers ranks and outputs a number of tiles from separate PEs in parallel, and then stitches them together in a post-processing step. CICE (ice model) is serial, and becomes a serious bottleneck for higher resolutions, so the OM2 model uses PIO to parallelise the IO. Not sure about the atmosphere. Again at higher core counts it uses an IO server (xios), but I'm not sure for this model.

The problem with IO isn't the overhead so much, it's the incredible variability when the lustre file system gets overloaded. It means we have to put in some wall time padding to allow for this, which affects the queue time, though I've not quantified how much difference it makes.

@manodeep

manodeep commented Dec 3, 2024

> Different models differ. The ocean model gathers ranks and outputs a number of tiles from separate PEs in parallel, and then stitches them together in a post-processing step. CICE (ice model) is serial, and becomes a serious bottleneck for higher resolutions, so the OM2 model uses PIO to parallelise the IO. Not sure about the atmosphere. Again at higher core counts it uses an IO server (xios), but I'm not sure for this model.
>
> The problem with IO isn't the overhead so much, it's the incredible variability when the lustre file system gets overloaded. It means we have to put in some wall time padding to allow for this, which affects the queue time, though I've not quantified how much difference it makes.

I have certainly encountered issues with the metadata servers being overloaded by other jobs, resulting in significant differences in runtime. One option could be to write to node-level filesystems (typically jobfs) and then periodically push the output to Lustre. Another option could be to use asynchronous IO.
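
A minimal sketch of the jobfs-staging idea (the paths, file pattern and flush point are placeholders; on Gadi the node-local scratch directory is exposed to the job via $PBS_JOBFS):

```python
import os
import shutil
from pathlib import Path

# Placeholder paths: write model output to node-local jobfs during the run, then
# copy finished files to Lustre in one pass (e.g. at the end of each segment).
jobfs_outdir = Path(os.environ["PBS_JOBFS"]) / "model_output"
lustre_outdir = Path("/scratch/<project>/<user>/run/archive")   # hypothetical destination

def flush_to_lustre() -> None:
    lustre_outdir.mkdir(parents=True, exist_ok=True)
    for path in sorted(jobfs_outdir.glob("*.nc")):
        shutil.copy2(path, lustre_outdir / path.name)
        path.unlink()   # free node-local space once the copy has landed on Lustre
```

This concentrates the Lustre traffic into a few large sequential copies rather than many small writes during the timestep loop, though the final copy can still be slow if the filesystem happens to be loaded at that moment.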
