
Conversation


@huebleruwm huebleruwm commented Nov 24, 2025

There are two parts here - getting a new version of CLUBB in, and enabling the descending grid mode in clubb_intr. Both goals were split up over many commits. Fixes #1411

New CLUBB

The first commits are dedicated to getting a new version of CLUBB in. Because clubb_intr had diverged between cam_development (which had significant changes to enable GPUization) and UWM's branch (which had redimensioning changes), the merge was done manually.

The first 6 commits here each correspond to a specific version of the CLUBB release:
3d40d8c0e03a298ae3925564bc4db2f0df5e9437 works with clubb hash 1632cf12
e67fc4719fa21f8e449b024a0e3b6df2d0a7f8cb works with clubb hash 673beb05
187d7b536c2f36968fc7f5e1b9d1167e430ad03f works with clubb hash dc302b95
e4b71220b33aeaddb0afc68c9103555edccb59eb works with clubb hash dc302b95
703aca60ed1e0b6b24f2cd890c3a4497041d25b8 works with clubb hash d5957b30
4d9b1b8a528ca532d964c1799e1860e96e068a12 works with clubb hash d5957b30
(to use a given clubb hash, go to src/physics/clubb and run git checkout <hash> there)

These commits all have to do with just getting a new version of clubb in, so we need to ensure that at least 4d9b1b8a528ca532d964c1799e1860e96e068a12 is working correctly. The later commits have more complicated changes to clubb_intr, so if we find any problems with this branch, we should test commit 4d9b1b8a528ca532d964c1799e1860e96e068a12 as a next step.

clubb_intr improvements

The next commits after getting new clubb in are increasingly ambitious changes aimed at simplifying clubb_intr and reducing its runtime cost.

e60848b4ec4df90a3060ffd7f664fab42e847509 introduces a parameter, clubb_grid_dir, in clubb_intr that controls the grid direction used when calling advance_clubb_core. When compiling with -O0 and setting l_test_grid_generalization = .true. in src/clubb/src/CLUBB_core/model_flag.F90, the results are BFB in either grid direction.

dddff494966bf2bf4341fa4a7526b2f8b0f3d16e separates the flipping code from the copying code (copying CAM-sized arrays to CLUBB-sized arrays). Should all be BFB.

055e53f70741531a58d0f7da788b824c76fef087 pushes the flipping code inward until it sits directly around the call to advance_clubb_core, and changes the grid-direction control to a flag, l_ascending_grid. This should all be BFB as well, and I tested with clubb_cloudtop_cooling, clubb_rainevap_turb, and do_clubb_mf all true to ensure BFBness in either ascending or descending mode. One caveat - the clubb_mf code assumes an ascending grid, so before calling it we flip the arrays, then flip the outputs back to descending.
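
For reference, the flip itself is just a reversal of the vertical index; here is a minimal standalone sketch (the names and sizes are placeholders, not the actual clubb_intr variables):

```fortran
! Minimal sketch of the vertical flip done around advance_clubb_core.
! ncol, nzt, and field are illustrative placeholders only.
program flip_sketch
   implicit none
   integer, parameter :: ncol = 4, nzt = 5
   real :: field(ncol,nzt), flipped(ncol,nzt)
   integer :: k

   ! fill each level with its index so the flip is easy to see
   do k = 1, nzt
      field(:,k) = real(k)
   end do

   ! descending -> ascending: reverse the vertical index with an array section
   flipped(:,1:nzt) = field(:,nzt:1:-1)

   ! ... advance_clubb_core would operate on the ascending copy here ...

   ! ascending -> descending: the same section expression undoes the flip
   field(:,1:nzt) = flipped(:,nzt:1:-1)

   print *, 'top level after round trip:', field(1,1)   ! prints 1.0
end program flip_sketch
```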

2fb2ba9bd6b1ec5e5c60039f90d0c1020663d0c9 is a pretty safe intermediate commit, mainly moving stuff around in preparation for redimensioning.

bded8a561131e4dbbccad293f14226e5e8c0e856 makes some very safe redimensioning changes.

fce8e1b1b5e8d3e93232c2129be7671fae79db23 is the big one: it redimensions most pbuf arrays and uses them instead of the clubb-sized local ones. This allows us to avoid the vast majority of the data copying and delete the local versions of the clubb arrays.
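
The gist of the change, as a standalone sketch (not CAM code; all names are placeholders): when the persistent buffer already has the shape the worker routine expects, it can be passed straight through instead of being copied into a local work array and back.

```fortran
program redim_sketch
   implicit none
   integer, parameter :: ncol = 4, nzt = 5
   real :: pbuf_field(ncol,nzt)   ! stands in for a pbuf field, now dimensioned (ncol,nzt)
   real :: local_copy(ncol,nzt)   ! the kind of local array the commit deletes

   call random_number(pbuf_field)

   ! Old pattern: copy in, advance, copy out
   local_copy = pbuf_field
   call advance_stub(local_copy, ncol, nzt)
   pbuf_field = local_copy

   ! New pattern: the buffer is already CLUBB-shaped, so it is passed directly
   call advance_stub(pbuf_field, ncol, nzt)

contains
   subroutine advance_stub(a, nc, nz)
      integer, intent(in)    :: nc, nz
      real,    intent(inout) :: a(nc,nz)
      a = a + 1.0   ! stands in for advance_clubb_core updating the field in place
   end subroutine advance_stub
end program redim_sketch
```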

The remaining commits are mostly safe, easy changes, or commits to make the GPU code work.

Testing

I plan to run an ECT test comparing the cam_development head I started with to the head of this branch. Answers are expected to change slightly due to the ghost point removal in CLUBB, so I think it is unlikely that this passes, but if it does that would be fantastic, and it might be the only testing we really need.

If that first ECT test doesn't pass, then I will assume that the difference is due to the ghost point removal (or other small bit-changing commits in CLUBB over the past year), and rely on @adamrher to check that the results are good.

The biggest question is whether the answer changes are acceptable; the only real differences expected are from the new version of CLUBB. If the answers from this branch look bad, we should go back to commit 4d9b1b8a528ca532d964c1799e1860e96e068a12 and check the answers from that, since it only includes the new version of CLUBB and NO unnecessary clubb_intr changes. If the answers still look bad in that commit, then we have a few more commits we can step back through to try to figure out which one introduces the differences. If the hypothetical problem is present in the first commit (3d40d8c0e03a298ae3925564bc4db2f0df5e9437), then the problem is harder, because that commit includes (pretty much only) the removal of the ghost point in CLUBB, which is expected to change answers, but hopefully not significantly.

Again, if that first ECT test fails, then I can still run another ECT test between the version where CLUBB is up to date (4d9b1b8a528ca532d964c1799e1860e96e068a12) and the head of this branch. The changes between that commit and the head may be slightly bit-changing without -O0, but definitely shouldn't be answer-changing. If this ECT test fails, then I've made a mistake in the changes meant to avoid the flipping/copying, and I'll have to step back through and figure out what bad thing I did.

Some other tests I've run along the way help confirm at least some of these changes:

  • I've ensured BFBness between ascending and descending modes since it was introduced in commit e60848b4ec4df90a3060ffd7f664fab42e847509
  • The ERP test passes
  • Commit 4d9b1b8a528ca532d964c1799e1860e96e068a12 should be working on the GPU, but later commits are untested

Next steps

I left a number of things messy for now, such as comments and gptl timers. Once we confirm these changes, I'd like to go through and make some of those nicer as a final step.

Performance

In addition to the clubb_intr changes improving performance, we should use this as an opportunity to try some flags that should be significantly faster (a hedged user_nl_cam sketch follows this list):

  • clubb_penta_solve_method = 2 uses our custom pentadiagonal matrix solvers, which should be significantly faster than lapack and should pass an ECT test
  • clubb_tridiag_solve_method = 2 uses our custom tridiagonal solver, which should also be faster and pass an ECT test
  • clubb_fill_holes_type = 4 uses a different hole-filling algorithm that in theory is just all-around better and faster than our current one; I suspect it will pass ECT, but I have yet to test it
  • clubb_l_call_pdf_closure_twice = .false. will avoid a pdf_closure call (which is significant in cost) and reduce the memory footprint. I think it will have minimal effect (based on my visual analysis of what it affects in the code), but it is the most likely to break the ECT test
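
If these pan out, the corresponding user_nl_cam fragment might look like the sketch below (assuming all four flags are plumbed through as CAM namelist variables; the values are just those listed above):

```
! custom pentadiagonal solver (should pass ECT)
 clubb_penta_solve_method = 2
! custom tridiagonal solver (should pass ECT)
 clubb_tridiag_solve_method = 2
! alternative hole-filling algorithm (not yet ECT tested as of this comment)
 clubb_fill_holes_type = 4
! most likely to break the ECT test; include only if validated
 clubb_l_call_pdf_closure_twice = .false.
```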

@cacraigucar cacraigucar marked this pull request as draft November 24, 2025 20:36
@cacraigucar
Collaborator

@huebleruwm - After reading through this text, I moved this PR to draft. Once it is ready for the SEs to review and process it for bringing into CAM, please move it out of draft.

…grid, and flipping sections have been consolidated to directly around the time stepping loop. Next is to push them inward until the only clubb grid calculations are done inside advance_clubb_core.
…s descending BFB (except wp3_ta but that's internal and doesn't matter) - this is even true with clubb_cloudtop_cooling, clubb_rainevap_turb, and do_clubb_mf.
…ending mode, even though there is no notion of ascending in it. Perhaps there is some bug with the flipper? Otherwise it's an internal interaction between clubb and silhs somehow. It's very upsetting, but I think it's time to give up and come back to it someday.
…oning yet. All changes should be BFB, but a handful of field outputs are different above top_lev because I made it zero everything above top_lev. Among the fields that differ: some (RTM_CLUBB RTP2_CLUBB THLM_CLUBB UP2_CLUBB WP2_CLUBB) were being initialized to a tolerance above top_lev, and others (THLP2_CLUBB RTP2_CLUBB UM_CLUBB VM_CLUBB) were never initialized or set above top_lev, so we were outputting random garbage.
…ootprint and data copying steps in clubb_intr, mainly by switching pbuf variables (which are mostly clubb_inouts) to have nzm or nzt dimensions, allowing them to be passed into clubb directly and making the copying/flipping step unnecessary. This was tested with a top_lev > 1, so it should be well tested. There were some above top_lev interactions found, which seem erroneous, so I've marked them with a TODO and a little explanation.
@huebleruwm huebleruwm force-pushed the cam_dev_clubb_nov2025 branch from ed11ffc to b2b3232 on November 26, 2025 10:11
@huebleruwm
Author

@adamrher I have done a number of ECT tests and it was pretty much as expected. I made a baseline from ESCOMP/cam_development commit 4c480c8e8c7e3138d8d5cb3118f635c71886ecf7 (which has cam6_4_128 as the last tag). The commit from this branch that just brings in the newest version of CLUBB (4d9b1b8a528ca532d964c1799e1860e96e068a12) fails the ECT test when compared to the cam_development baseline. So we will need some help confirming the changes as acceptable.

To check the more "dangerous" changes to clubb_intr, I made another baseline from commit 4d9b1b8a528ca532d964c1799e1860e96e068a12 (with new clubb), then compared that to the head of this branch, and that passed the ECT test. I also made sure to run with settings that have top_lev = 3 so that we can test the redimensioning changes, which I did by setting trop_cloud_top_press to 1.D3 in the namelist instead of 1.D2.
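
For reference, the top_lev change amounts to a one-line user_nl_cam edit (the value is the one mentioned above):

```
 trop_cloud_top_press = 1.D3
```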

Other things I tested:

  • When running with -O0 and l_test_grid_generalization = .true. (clubb internal parameter), then the code is BFB between the ascending and descending modes in clubb_intr (changed by setting l_ascending_grid)
  • Also when running with -O0, the code is BFB between 4d9b1b8a528ca532d964c1799e1860e96e068a12 and the head of this branch, outside of the few output variables that differ above top_lev; see the comment from this commit for more detail.
  • The ERP test passes
  • The GPU code runs to completion. I haven't checked results with an ECT test yet, but that's the only thing left on the list to do, and any further changes to make that work (if needed) won't affect CPU results.

So I think the clubb_intr changes are pretty well tested and confirmed. But the changes from new clubb don't pass the ECT test, and we'll need to confirm them as acceptable.

Note

Also, I found a number of examples where clubb_intr is interacting above top_lev, and I'm not sure exactly what we should do about that. For now I've left the errors in place for BFBness, but I've made little code comments explaining the issue (all contain a TODO). There's also a TODO about how we're setting concld_pbuf, which seems suspicious, but is not a clear bug. @adamrher could you take a look at these "TODO" comments and see if you have any insight, please? I added what I think the "correct" lines should be, and what the errors were that I left in for BFBness.

@huebleruwm
Author

I ran some ECT tests with the performance options:

  • clubb_tridiag_solve_method = 2 passes the ECT test and should be significantly faster
  • clubb_penta_solve_method = 2 passes the ECT test and should be significantly faster, but requires clubb_release commit bba14c86a748da7993af72c056a9f6505cf43f1b or newer, since I just fixed a bug in it
  • clubb_fill_holes_type = 4 FAILS the ECT test, which was expected since it fills holes in the same conservative way but over completely different ranges, making it answer-changing. This should also be decently faster, so it would be nice to test if possible.

@adamrher
Collaborator

adamrher commented Dec 5, 2025

@huebleruwm I checked out this PR branch, and while the clubb_intr.F90 changes seem to be there, .gitmodules is pointing to the clubb external currently on cam_development:

[submodule "clubb"]
        path = src/physics/clubb
        url = https://github.com/larson-group/clubb_release
        fxrequired = AlwaysRequired
        fxsparse = ../.clubb_sparse_checkout
        fxtag = clubb_4ncar_20240605_73d60f6_gpufixes_posinf
        fxDONOTUSEurl = https://github.com/larson-group/clubb_release

This should be pointing to a different tag / hash containing support for descending / ascending options, no?

@huebleruwm
Author

This should be pointing to a different tag / hash containing support for descending / ascending options, no?

Yes. In hindsight, it would've made much more sense to update that with the right hash at each commit, rather than just including it in the commit comment.

I've just been going to src/physics/clubb and running git checkout master to get the newest version, or git checkout <hash> when testing the first handful of commits:
cam 3d40d8c0e03a298ae3925564bc4db2f0df5e9437 works with clubb hash 1632cf12
cam e67fc4719fa21f8e449b024a0e3b6df2d0a7f8cb works with clubb hash 673beb05
cam 187d7b536c2f36968fc7f5e1b9d1167e430ad03f works with clubb hash dc302b95
cam e4b71220b33aeaddb0afc68c9103555edccb59eb works with clubb hash dc302b95
cam 703aca60ed1e0b6b24f2cd890c3a4497041d25b8 works with clubb hash d5957b30 or newer
cam 4d9b1b8a528ca532d964c1799e1860e96e068a12 works with clubb hash d5957b30 or newer

@huebleruwm
Author

git checkout 4d9b1b8a528ca532d964c1799e1860e96e068a12 would get you the version with only the necessary clubb_intr changes. git checkout cam_dev_clubb_nov2025 would get you the head of this branch, which includes the extensive performance focused clubb_intr changes as well. Both should work with the newest clubb code.

@adamrher
Collaborator

adamrher commented Dec 8, 2025

I'm having trouble grabbing the clubb hashes. I am able to check out master from https://github.com/larson-group/clubb_release.git, but it doesn't recognize d5957b30:

Directory: /glade/campaign/cgd/amp/aherring/src/cam_dev_clubb_nov2025/src/physics/clubb
1044 /glade/u/home/aherring> ls
src
1045 /glade/u/home/aherring> git checkout d5957b30
error: pathspec 'd5957b30' did not match any file(s) known to git

Does this hash pertain to a different tag than master?

@huebleruwm
Author

Does this hash pertain to a different tag than master?

They should be on master, but I did just check and neither dc302b95 nor d5957b30 exists. Sorry about that, not sure what happened there; here's what those last 2 should be:
dc302b95 => ea0b892fd2c25049cb5d1429994dc80e4ab6ed9f
d5957b30 => 665f3896481654acdb49de9afae25d7a9b24a81c

@huebleruwm
Author

huebleruwm commented Dec 9, 2025

I have fixed up the GPU code now and confirmed correctness with an ECT test. Using a CPU baseline with intel, the GPU results with nvhpc (using top_lev > 1 and clubb/clubb_intr running on GPUs) match according to the ECT test.

All the other tests I've run seem to be passing too. ECT tests and ERP tests pass, even when changing the l_ascending_grid flag now in clubb_intr. So if @adamrher's science checks pass, then I think this should be ready to merge.

@cacraigucar cacraigucar marked this pull request as ready for review December 9, 2025 22:59
@adamrher
Collaborator

adamrher commented Dec 9, 2025

That's great to hear. My science validation runs are running, but may take a couple days to finish. In the meantime I noticed l_ascending_grid is not a CAM namelist, and I'm thinking it should be. Do you agree, and if so could you push that through?

Collaborator

@cacraigucar cacraigucar left a comment

I took a quick look at the code (not a thorough code review at all). A couple of items jumped out at me.

Comment on lines +3949 to +3952
<entry id="clubb_fill_holes_type" type="integer" category="pblrad"
group="clubb_params_nl" valid_values="0,1,2,3,4,5,6" >
Selects which algorithm the fill_holes routine uses to correct
below threshold values in field solutions.
Collaborator

Could the different values be documented with their algorithm names? (This is the namelist documentation so it will be useful for folks wanting to change the values)

Author

Definitely a good idea, added here

end do
end do
end if

!---- TODO: there seems to a an above top_lev interaction here that changes answers.
Collaborator

A number of "todo" comments are in this code. Please see if they need to be addressed or if the comments can be removed.

Collaborator

I told @huebleruwm I'd go through all these "todo" blocks; it's on my todo list.

@adamrher
Collaborator

adamrher commented Dec 10, 2025

@huebleruwm I started testing the performance enhancing flags, and found that two of them are causing the model to bomb out when l_ascending_grid=.false., but otherwise seem to be stable when l_ascending_grid=.true. These are clubb_fill_holes_type = 4 and clubb_penta_solve_method = 2. The other flag, clubb_tridiag_solve_method = 2, seems to be stable for both ascending and descending options. I did not test clubb_l_call_pdf_closure_twice = .false. since we've previously found this degrades sharp transitions in fluxes near the top of the boundary layer.

I'm doing all this work in the case directory:

/glade/derecho/scratch/aherring/cam6_4_128_FHISTC_LTso_ne30pg3_ne30pg3_mg17_2176pes_251209_clubbflip_run2pe1

which is getting through the derecho queue at a good clip because it's set up as a 4 node job with JOB_PRIORITY=premium. Feel free to clone it if it helps with debugging. I haven't looked too closely at the errors, other than noticing that the monotonic flux limiter gives lots of warnings prior to the clubb_penta_solve_method = 2 failures (e.g., cesm.log.3902372.desched1.251210-033200), whereas the errors associated with clubb_fill_holes_type = 4 seem to blow up the model in the dycore (cesm.log.3909330.desched1.251210-105814) or the vertical diffusion solver (cesm.log.3909140.desched1.251210-103505), which I suspect is because CLUBB is providing a garbage state. The dycore failure blows up near the top of the model, at layer 2.

I'm doing these experiments with CAM hash fce8e1b1b5e8d3e93232c2129be7671fae79db23 and CLUBB hash 665f3896481654acdb49de9afae25d7a9b24a81c.

@huebleruwm
Author

I started testing the performance enhancing flags, and found that two of them are causing the model to bomb out when l_ascending_grid=.false.

I found a bug in the penta_lu_solver in descending mode the other day, and have since fixed it.

  • clubb_penta_solve_method = 2 passes the ECT test and should be significantly faster, but requires clubb_release commit bba14c86a748da7993af72c056a9f6505cf43f1b or newer, since I just fixed a bug in it

Are you still testing with an older version of CLUBB? The head of clubb_release/master should work with the head of this branch, and includes the penta solve bug fix. Or the commit mentioned above should work too. Could you retry with that newer commit please?

The new fill_holes method being unstable in descending mode is a surprise though. I shall investigate.

@adamrher
Collaborator

Are you still testing with an older version of CLUBB? The head of clubb_release/master should work with the head of this branch, and includes the penta solve bug fix. Or the commit mentioned above should work too. Could you retry with that newer commit please?

Apologies I missed the comment about fixing a bug in the penta solve. I re-ran with clubb_release/master and it ran successfully.

@huebleruwm
Author

That's great to hear. My science validation runs are running, but may take a couple days to finish. In the meantime I noticed l_ascending_grid is not a CAM namelist, and I'm thinking it should be. Do you agree, and if so could you push that through?

I was debating this a little in my head. It felt alright to leave it hardcoded, since it's mainly a debugging/testing option now, but I have added it to the namelist now in this commit.

@huebleruwm
Author

@vlarson - just adding this comment to include Vince in this PR

@adamrher
Collaborator

I was debating this a little in my head. It felt alright to leave it hardcoded, since it's mainly a debugging/testing option now, but I have added it to the namelist now in this commit.

I feel like we'll be working out bugs combining descending with different clubb options for some time, and so it might be useful to be able to revert this at the namelist level. OTOH CLUBB+MF uses ascending ordering, and I don't think I want to support both options, but rather convert CLUBB+MF to descending in this PR (I presume you didn't do that). For that reason we may want to hard code it, but we could also add an endrun statement if someone tries to combine CLUBB+MF with ascending.
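
Concretely, the guard could look something like the sketch below (purely illustrative; the flag names come from this thread, the check would only make sense once CLUBB+MF is descending-only, and where it would live in clubb_intr is not decided here):

```fortran
! Sketch of the suggested consistency check; do_clubb_mf and
! clubb_l_ascending_grid are the flags discussed in this thread, everything
! else is a placeholder.
program mf_grid_check_sketch
   implicit none
   logical :: do_clubb_mf            = .true.
   logical :: clubb_l_ascending_grid = .true.

   if ( do_clubb_mf .and. clubb_l_ascending_grid ) then
      ! in CAM this would presumably be a call to endrun from cam_abortutils
      stop 'clubb_intr: CLUBB+MF supports only the descending grid; set clubb_l_ascending_grid = .false.'
   end if
end program mf_grid_check_sketch
```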

@huebleruwm
Author

huebleruwm commented Dec 11, 2025

I feel like we'll be working out bugs combining descending with different clubb options for some time, and so it might be useful to be able to revert this at the namelist level. OTOH CLUBB+MF uses ascending ordering, and I don't think I want to support both options, but rather convert CLUBB+MF to descending in this PR (I presume you didn't do that). For that reason we may want to hard code it, but we could also add an endrun statement if someone tries to combine CLUBB+MF with ascending.

That makes sense. And you are correct, clubb_mf wasn't touched other than the vertical redimensioning (from nz to nzm and nzt), and those were just changes I copied directly from Brian's modifications to it, see here.

I did notice that clubb_mf was written for ascending though, so I surrounded the call to integrate_mf with flipping code and a little comment about why. I also turned on clubb_mf when I was making the BFB grid-flipping changes to make sure I didn't break it in the process, so clubb_mf should work and should not be affected by the clubb_l_ascending_grid flag.

Switching it to descending mode would still be nice though, because then we can just delete the flipping code.

@adamrher
Collaborator

Summary of our meeting. There is a performance degradation in clubb_intr for the current head of this PR branch. For my test (3a in the plot), the CAM hash is c9e9d97e07ea4b5e34fdb4f77bcd7ea4dc8298cf and the CLUBB hash is 3b1b5a297a0d17097ce25773f685a548c758e5ac (Updating the clubb core and silhs version files). The "a" in 3a refers to ascending, so for that specific hash the hard-coded value of l_ascending_grid needs to be changed to true.

The performance is not degraded in the 1a run, which refers to CAM hash 4d9b1b8a528ca532d964c1799e1860e96e068a12 and CLUBB hash 665f3896481654acdb49de9afae25d7a9b24a81c.

[timing plot: temp_bar, 2176 PEs]

There is also a problem with clubb_fill_holes_type = 4 when combined with descending. Looking back at that message, I determined this using the problematic commit "2d", so I will test whether the errors also occur with my "3d" configuration.

@adamrher
Collaborator

@huebleruwm I reproduced the issue with clubb_fill_holes_type = 4 using the current head of this CAM branch, using the head of CLUBB master. Setting clubb_l_ascending_grid = .false. and clubb_fill_holes_type = 4 causes an instability in the dycore, near the top of the model (negative pressure thickness at level 2). The error does not occur when using the ascending grid option.

To reproduce these results, my case command is:

/glade/campaign/cgd/amp/aherring/src/cam_dev_clubb_nov2025_p5/cime/scripts/create_newcase --case /glade/derecho/scratch/aherring/cam6_4_128_FHISTC_LTso_ne30pg3_ne30pg3_mg17_512pes_251215_run5a_halo --compset FHISTC_LTso --res ne30pg3_ne30pg3_mg17 --driver nuopc --walltime 00:29:00 --mach derecho --pecount 512 --project P93300042 --compiler intel --queue main --run-unsupported

@huebleruwm
Author

There is a performance degradation in clubb_intr for the current head of this PR branch.

I've just made another commit that might improve performance meaningfully. I didn't find a single "culprit", but some of the things I fixed were things I did but never really polished. Some of the slicing I lazily left in could explain why the mean times are similar but the maxes can be much larger (e.g., slicing costs only happen when pcols /= ncol).

I also included a bunch of other random little improvements. @adamrher could you do another timing run with the newest changes and compare them to the others? Hopefully it was just these little things causing the slowdown.

I used the ECT test to make sure results match cam commit 4d9b1b8a528ca532d964c1799e1860e96e068a12 on both CPUs and GPUs.

@adamrher
Collaborator

adamrher commented Dec 16, 2025

Sure, I'll give it a go (edit -- derecho compute nodes are down today; I'll try it on our local cluster).

@adamrher
Collaborator

@huebleruwm I ran the head of your branch for two years, and I'm sorry to report that it didn't move the needle in the slightest. I ran it with more detailed timings -- ./xmlchange TIMER_LEVEL=10 -- which provided all the timing blocks inside clubb_intr. I've mapped out the code blocks for these timers (see below), and I'm re-running the 1a commit so I can triangulate where the slowdown is occurring. I should have an update by the end of the day.

[screenshot: mapping of clubb_intr timers to code blocks]

@adamrher
Collaborator

@huebleruwm I figured out the cause of ~half of the slowdown. It looks like you've moved the array flipping into the nadv loop, whereas this flipping occurs outside the loop in cam_development. Since nadv=2, you're doing the flipping twice (if running with l_ascending_grid=.true.).

To elaborate a bit on the nested loops, the clubb_tend_cam call is nested inside the cld_macmic_num_steps loop in physpkg.F90. By default cld_macmic_num_steps=3, and so with an 1800 s physics time-step, it calls clubb_tend_cam every 600 s. The CLUBB time-step is constrained to 300 s or less, and so nadv=2, meaning we call advance_clubb_core_api twice in a row in clubb_tend_cam.

The impact of moving the array flipping into the nadv loop can be tested by running cld_macmic_num_steps=6, which causes nadv=1. Note that increasing the macmic sub-cycling will increase the cost of clubb_tend_cam overall because we're calling it twice as often, but we're doing the array flipping and calling advance_clubb_core_api only once per clubb_tend_cam call. Here are some timings for several 4 month runs on 4 nodes:

[timing plot: temp_bar]

1a refers to the commit where you only updated the clubb external (CAM hash 4d9b1b8a528ca532d964c1799e1860e96e068a12), and 6a refers to head of your branch with l_ascending_grid=.true. "mm" refers to the value of cld_macmic_num_steps. When mm=3, 6a is 11.5% more expensive than 1a. When mm=6, 6a is 4.8% more expensive than 1a. This indicates that ~half of the performance hit is from moving the array flipping into the nadv loop.

I'll start looking into the cause of the 4.8% slowdown when mm=6 -- starting by first running a longer two year job to get more accurate timings for the mm=3 and mm=6 cases.

As to whether you should move the flipping outside of nadv I'm not so sure it's worth the trouble if the purpose of ascending is to provide some backwards compatibility for debugging the descending code. We might want to get some input from Cheryl, Jesse, Kate on whether retaining the ascending functionality is even necessary in this PR (I could then switch clubb_mf to descending only in this PR).

@huebleruwm
Author

huebleruwm commented Dec 18, 2025

As to whether you should move the flipping outside of nadv I'm not so sure it's worth the trouble if the purpose of ascending is to provide some backwards compatibility for debugging the descending code.

That's what I was thinking too - it's for debugging rather than speed, and it's minimally invasive when it's directly around the advance_clubb_core call. I was on the side of removing the ascending code at first, but we already found 2 bugs (fill_holes and penta_lu_solve) that were only triggered when running descending mode in CAM, and being able to rule out problems in ascending mode did make bug hunting much simpler.

I ran some timing runs too, but only 10 days long, and found only speedups in clubb_intr when running in descending mode (except for 1a, since that functionality didn't exist yet). Spreadsheet here.

[timing plot]
  • new clubb_core: same as 1a (4d9b1b8a528ca532d964c1799e1860e96e068a12)
  • new clubb_intr: same as 6a, branch head
  • new_fast: branch head + penta_solve = 2 + tridiag_solve = 2 + fill_holes = 4 (with newest clubb version that has the fill_holes fix)
  • too_fast: new_fast + clubb_l_diag_Lscale_from_tau = T + clubb_l_call_pdf_closure_twice = F (just to experiment, these options degrade answers).

From 1a to 6a I found clubb_intr reducing in cost by ~29% though. I also saw advance_clubb_core increasing by ~5% - I think because of the added slicing in the call when pcols/=ncol, but that was expected, and it can go away when we stop using pbuf. The big discrepancy is definitely interesting though; I wonder if there's anything we're doing differently besides the run length. I remember you mentioning setting pecount to 1024, but was that for the timing runs? I used 512. Here's the bash script I made to do the runs, in case anything there jumps out (/glade/u/home/huebler/Code/CAM/run_scripts/temp_test.sh).

Turning on the performance options (new solvers and fixed version of new hole filler) reduced advance_clubb_core cost by ~12%, which was less than I expected, but I think the difference is that I was using estimates from testing with cheaper options, mainly clubb_l_diag_Lscale_from_tau = T + clubb_l_call_pdf_closure_twice = F. If you subtract the cost of these options, and the added cost from the (likely) slicing, then the speedup would be ~21%, which makes sense. So I think the performance options we do want to enable are giving the expected speedups.

@adamrher
Collaborator

That's what I was thinking too - it's for debugging rather than speed, and it's minimally invasive when it's directly around the advance_clubb_core call.

We could put if statements around the l_ascending_grid blocks - if t=1 for the before block and if t=nadv for the after block?
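
In other words, something like the sketch below (standalone and purely illustrative; nadv, the sizes, and the stub all stand in for the real clubb_intr code):

```fortran
! Sketch of the suggested guards: flip to ascending only on the first substep
! and flip back only after the last one, so the (debug-only) ascending path
! flips once per clubb_tend_cam call instead of once per nadv iteration.
program nadv_flip_sketch
   implicit none
   integer, parameter :: ncol = 4, nzt = 5, nadv = 2
   logical, parameter :: l_ascending_grid = .true.
   real :: field(ncol,nzt)
   integer :: t

   call random_number(field)

   do t = 1, nadv
      if ( l_ascending_grid .and. t == 1 )    field(:,1:nzt) = field(:,nzt:1:-1)
      call advance_stub(field, ncol, nzt)     ! stands in for advance_clubb_core_api
      if ( l_ascending_grid .and. t == nadv ) field(:,1:nzt) = field(:,nzt:1:-1)
   end do

contains
   subroutine advance_stub(a, nc, nz)
      integer, intent(in)    :: nc, nz
      real,    intent(inout) :: a(nc,nz)
      a = a + 1.0
   end subroutine advance_stub
end program nadv_flip_sketch
```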

I also saw advance_clubb_core increasing by ~5% - I think because of the added slicing in the call when pcols/=ncol

Is this added slicing in 1a, or just 6a?

The big discrepancy is definitely interesting though; I wonder if there's anything we're doing differently besides the run length.

By big discrepancy, you mean that 4.7% slowdown I'm getting for 1a v. 6a w/ ascending on? Could that be the added advance_clubb_core costs from the slicing?

Would you be willing to move the timers, specifically the clubb_tend_cam:advance_clubb_core_api around the advance_clubb_core call, and clubb_tend_cam:flip-index inside the l_ascending_grid blocks?

@huebleruwm
Author

huebleruwm commented Dec 18, 2025

One of the reasons the new ascending code is slower is that the flipping step used to be combined with the data copy step. When we copy the CAM data structures (e.g., state_loc) to arrays for calling CLUBB (e.g., cloud_frac_inout), before we would copy and flip within the same loop, but now we copy and then flip separately. This was to make the flipping option minimally invasive, affecting only advance_clubb_core, but it adds a large amount of cost when turning on l_ascending_grid.
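
A standalone sketch of the difference (the field names are placeholders; the real code loops over many such fields):

```fortran
! Sketch contrasting the old combined copy+flip loop with the current
! separate copy-then-flip steps.
program copy_flip_sketch
   implicit none
   integer, parameter :: ncol = 4, nzt = 5
   real :: cam_field(ncol,nzt)     ! stands in for a CAM-side field (descending)
   real :: clubb_field(ncol,nzt)   ! stands in for a CLUBB input array
   integer :: i, k

   call random_number(cam_field)

   ! Old approach: one pass that copies and flips at the same time
   do k = 1, nzt
      do i = 1, ncol
         clubb_field(i,k) = cam_field(i,nzt-k+1)
      end do
   end do

   ! Current approach: copy first, then flip (only when l_ascending_grid is
   ! true), which touches the data twice but keeps the flip isolated to the
   ! code directly around advance_clubb_core.
   clubb_field = cam_field
   clubb_field(:,1:nzt) = clubb_field(:,nzt:1:-1)

   print *, clubb_field(1,1) == cam_field(1,nzt)   ! T: both approaches agree
end program copy_flip_sketch
```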

We are also flipping more arrays now: some of the pdf_params need flipping too because SILHS is descending-only now. Some other things didn't (and don't) need flipping at all because they weren't (and aren't) used in clubb_intr, but they are flipped now just in case they do get used in the future.

The double flipping also adds cost, but I believe the only ways to reduce the number of times we flip will make it more invasive and require more code to work in ascending mode. For example, if we flip the arrays only once per nadv loop, then all the code inside the nadv loop (like integrate_mf) will need to support both ascending and descending modes, or have their own grid-direction conditional flipping step.

So we could definitely reduce the cost in ascending mode by just not flipping certain things or by reducing the frequency of flips, but I vote we just don't worry about it since it's only a debugging option.

By big discrepancy, you mean that 4.7% slowdown I'm getting for 1a v. 6a w/ ascending on? Could that be the added advance_clubb_core costs from the slicing?

By discrepancy I meant why you saw a cost increase in descending mode, but I'm seeing a decrease. The last test you did in descending mode found it slower overall than 0a and 1a. But now that I think about it, that wasn't using the clubb_tend_cam timers directly, was it? I'm not sure why that would give such a different answer, but it might have something to do with it.

@adamrher
Collaborator

So we could definitely reduce the cost in ascending mode by just not flipping certain things or by reducing the frequency of flips, but I vote we just don't worry about it since it's only a debugging option.

That's fine by me.

From 1a to 6a I found clubb_intr reducing in cost by ~29% though. I also saw advance_clubb_core increasing by ~5%

I was able to reproduce this result (labeled 6d below), with clubb_intr (clubb_tend_cam - advance_clubb_core) reducing costs by 25% and advance_clubb_core increasing by 5%. This is based on 4 month runs with 512 tasks.

Turning on the performance options (new solvers and fixed version of new hole filler) reduced advance_clubb_core cost by ~12%

I was able to get basically the same number, a 15% reduction in advance_clubb_core costs (case 7d-new_fast below).

Here's my stacked bar plot (sorry about the lack of color). The bottom is advance_clubb_core and the top is clubb_intr = clubb_tend_cam - advance_clubb_core. Overall, going from 0a (clean cam6_4_128) to 7d-new_fast, the clubb_tend_cam cost is reduced by 15%.

[stacked bar plot: temp_bar]

I'll move on to the science validation for 7d-new_fast.

@huebleruwm
Author

I was able to get basically the same number, a 15% reduction in advance_clubb_core costs (case 7d-new_fast below).

Oh excellent, maybe the difference was in the timers being looked at.

Would you be willing to move the timers, specifically the clubb_tend_cam:advance_clubb_core_api around the advance_clubb_core call, and clubb_tend_cam:flip-index inside the l_ascending_grid blocks?

I think it would be nice to move the timers a little too. At least to have the clubb_tend_cam:advance_clubb_core_api timer directly around the call, since then we can actually compare the time in ascending vs descending without the flipping cost. I realized that another explanation for why advance_clubb_core could be slightly slower is that it's slower in descending mode - a 5% increase overall just for that little slicing in the call seems odd, and ascending vs descending performance is something I haven't explored in detail yet. I'll move the timers and do a couple more runs to check.
