Switch from xPPTRF to xPOTRF to improve TurbSim speed on macOS #3123
Feature or improvement description
This PR rewrites the subroutines LAPACK_DPPTRF and LAPACK_SPPTRF in NWTC_LAPACK.f90, replacing the packed-storage Cholesky decomposition (xPPTRF) with the full-storage Cholesky decomposition (xPOTRF). To stay compatible with existing callers, the subroutine signatures are unchanged; an internal wrapper handles the conversion between the packed and full storage formats.
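The packed-to-full round trip described above can be sketched in a few lines. This is a minimal illustration in pure Python, not the Fortran code from this PR: the function names are hypothetical, the upper-triangle (UPLO = 'U') column-major packed layout is assumed, and a textbook Cholesky loop stands in for the xPOTRF call.

```python
import math


def unpack_upper(ap, n):
    """Expand a column-major packed upper triangle into a full n x n matrix."""
    a = [[0.0] * n for _ in range(n)]
    for j in range(n):
        for i in range(j + 1):
            # 0-based form of the LAPACK rule AP(i + (j-1)*j/2) = A(i,j), i <= j
            a[i][j] = ap[i + j * (j + 1) // 2]
    return a


def pack_upper(a, n):
    """Copy the upper triangle of a full matrix back into packed storage."""
    ap = [0.0] * (n * (n + 1) // 2)
    for j in range(n):
        for i in range(j + 1):
            ap[i + j * (j + 1) // 2] = a[i][j]
    return ap


def cholesky_upper_full(a, n):
    """Factor A = U^T U in place on full storage (stand-in for xPOTRF)."""
    for j in range(n):
        for i in range(j):
            s = a[i][j] - sum(a[k][i] * a[k][j] for k in range(i))
            a[i][j] = s / a[i][i]
        d = a[j][j] - sum(a[k][j] ** 2 for k in range(j))
        if d <= 0.0:
            raise ValueError(f"leading minor of order {j + 1} is not positive definite")
        a[j][j] = math.sqrt(d)


def pptrf_via_potrf(ap, n):
    """Keep the packed-in, packed-out interface; factor on full storage inside."""
    a = unpack_upper(ap, n)
    cholesky_upper_full(a, n)
    return pack_upper(a, n)


# Symmetric positive definite example: [[4, 2, 2], [2, 5, 3], [2, 3, 6]]
packed_a = [4.0, 2.0, 5.0, 2.0, 3.0, 6.0]
packed_u = pptrf_via_potrf(packed_a, 3)
print(packed_u)
```

The caller still passes and receives a packed array, so only the body of the routine changes; the temporary full matrix exists only for the duration of the call.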
This change results in a substantial speed improvement for TurbSim on macOS, with minimal additional memory overhead.
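To put the memory overhead in perspective, the element counts of packed and full storage can be worked out directly. The sketch below rests on two assumptions that are not stated in this PR: that the factored matrix has one row per grid point, and that it holds 4-byte reals. The grid sizes are the ones used in the tests below.

```python
def packed_elements(n):
    """Elements stored for one triangle of an n x n symmetric matrix."""
    return n * (n + 1) // 2


def full_elements(n):
    """Elements stored for the full n x n matrix."""
    return n * n


BYTES_PER_REAL = 4  # assumption: single precision

for ny, nz in [(43, 43), (23, 23)]:
    n = ny * nz  # assumption: matrix order equals the number of grid points
    packed_mib = packed_elements(n) * BYTES_PER_REAL / 2**20
    full_mib = full_elements(n) * BYTES_PER_REAL / 2**20
    print(f"{ny} x {nz} grid: n = {n}, packed = {packed_mib:.2f} MiB, full = {full_mib:.2f} MiB")
```

Under these assumptions the full matrix needs slightly less than twice the storage of the packed one, since n*n / (n*(n+1)/2) = 2n/(n+1).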
Related issue, if one exists
#3120
Impacted areas of the software
TurbSim
Test results, if applicable
(1) macOS
I compiled TurbSim using GCC 15.2.0 with the following build flags:
I used both versions of TurbSim to generate two .bts files: (i) a 43 x 43 grid, 120 seconds; (ii) a 23 x 23 grid, 600 seconds. The timing results on macOS 26.2 (M4 Pro) are shown below, in seconds; Coh2h() is the caller of LAPACK_xPPTRF:
(i) [timing table, including a Coh2h() row]
(ii) [timing table, including a Coh2h() row]
Furthermore, the .bts files produced by the two versions differ only in the metadata section, specifically at offset 0x42 ($n_{character}$) and in the associated $Character_i$ bytes (typically version information and generation time); the subsequent data sections are identical.
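A comparison of this kind can be reproduced with a short script. The sketch below is one possible approach, not necessarily the one used for this PR: it measures the common prefix and the common suffix of two byte strings, so the check still works if the description strings have different lengths and the data section is shifted. The function names and the file names in the usage comment are hypothetical.

```python
from pathlib import Path


def common_prefix_len(a, b):
    """Number of leading bytes that are identical in a and b."""
    limit = min(len(a), len(b))
    i = 0
    while i < limit and a[i] == b[i]:
        i += 1
    return i


def common_suffix_len(a, b, prefix_len):
    """Number of trailing bytes that are identical, not overlapping the prefix."""
    limit = min(len(a), len(b)) - prefix_len
    i = 0
    while i < limit and a[-1 - i] == b[-1 - i]:
        i += 1
    return i


def summarize_diff(a, b):
    """Locate the differing middle region between two byte strings."""
    prefix = common_prefix_len(a, b)
    suffix = common_suffix_len(a, b, prefix)
    return {
        "first_diff_offset": prefix,
        "identical_tail_bytes": suffix,
        "differing_bytes_a": len(a) - prefix - suffix,
        "differing_bytes_b": len(b) - prefix - suffix,
    }


# Usage with two output files (hypothetical names):
#   old = Path("old_version.bts").read_bytes()
#   new = Path("new_version.bts").read_bytes()
#   print(summarize_diff(old, new))
```

If the two files differ only in the metadata, `first_diff_offset` falls inside the header and `identical_tail_bytes` covers the whole data section.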
(2) Windows
I compiled TurbSim using IFORT (from Intel oneAPI 2024.2.1) and IFX (from Intel oneAPI 2025.0.1) at the O2 optimization level. The timing results on Windows 11 24H2 (AMD 9950X) are shown below:
(i) [timing table, including a Coh2h() row]
(ii) [timing table, including a Coh2h() row]
After switching to SPOTRF, TurbSim on Windows runs at least as fast as before.
Note that on Windows the .bts files generated by the two versions of TurbSim (built with the same compiler) differ slightly. For engineering purposes, however, this difference is negligible.