Conversation

@LucasBoTang
Collaborator

Summary

This PR introduces a CUDA-based implementation of the preconditioner module, including Ruiz, Pock–Chambolle, and objective-bound scaling.

Main Changes

  • Replaced preconditioner.c with preconditioner.cu
  • Modified initialize_solver_state in solver.cu for GPU preconditioner integration

Implementation Details

  • The matrix is stored in CSR format, with an additional row ID array to enable efficient row-wise scaling (A[i,j] *= E[i]) without extra lookups.
  • Added an auxiliary array recording the mapping of each A element to its corresponding position in Aᵀ, enabling synchronized scaling of A and Aᵀ without atomics or additional CSR/CSC conversions.
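A minimal sketch of how these two auxiliary arrays can combine in a single scaling kernel (all identifiers here are illustrative, not the actual names in preconditioner.cu):

```cuda
// Illustrative sketch: scale each nonzero of A by its row factor E[i],
// mirroring the update into A^T via a precomputed position map.
__global__ void scale_rows_csr(double *vals,       // CSR values of A
                               double *vals_t,     // values of A^T
                               const int *row_id,  // row index of each nonzero
                               const int *to_t,    // position of vals[k] in A^T
                               const double *E,    // per-row scaling factors
                               int nnz)
{
    int k = blockIdx.x * blockDim.x + threadIdx.x;
    if (k < nnz) {
        double v = vals[k] * E[row_id[k]];  // A[i,j] *= E[i], no row search
        vals[k] = v;
        vals_t[to_t[k]] = v;  // each A^T slot written exactly once: no atomics
    }
}
```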

Next Step

  • Benchmark GPU vs CPU preconditioner runtime before merging

Note

reduce_bound_norm_sq_atomic currently relies on atomicAdd(double*) for the bound-norm reduction, which requires CMAKE_CUDA_ARCHITECTURES ≥ 60 (native double-precision atomicAdd is only available on sm_60 and newer).
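For context, the pattern in question looks roughly like this (kernel and variable names are illustrative, not the actual code):

```cuda
// Sketch of the atomic variant: every thread folds its squared bound
// into one global accumulator. atomicAdd on double is only supported
// natively from compute capability 6.0, hence the sm_60 requirement.
__global__ void bound_norm_sq_atomic(const double *bounds, int n,
                                     double *result)
{
    int i = blockIdx.x * blockDim.x + threadIdx.x;
    if (i < n)
        atomicAdd(result, bounds[i] * bounds[i]);
}
```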

Would it be preferable to:

  • Switch to a portable single-block shared-memory reduction (no atomics),
  • Redesign the reduction kernel, or
  • Keep the current implementation and require sm_60+?

@LucasBoTang
Collaborator Author

LucasBoTang commented Nov 16, 2025

This update fixes two issues in the GPU preconditioner:

  1. Corrected objective/bound rescaling: The previous GPU code applied the wrong scaling to the bounds and objective, which caused PDHG to need far more iterations. All scaling is now applied correctly on the GPU.

  2. Improved reduce_bound_norm_sq_kernel: Replaced atomic accumulation with a shared-memory block reduction. This removes the atomic overhead and makes the result deterministic and fast.
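A sketch of what such a reduction can look like, assuming the standard single-block tree-reduction pattern (the kernel name and BLOCK size here are illustrative):

```cuda
// Single-block shared-memory reduction for the squared bound norm,
// launched as <<<1, BLOCK>>>. A grid-stride loop lets one block cover
// the whole array; the tree reduction assumes BLOCK is a power of two.
// With no atomics, the summation order is fixed, so the result is
// deterministic across runs.
#define BLOCK 256

__global__ void reduce_bound_norm_sq(const double *bounds, int n,
                                     double *result)
{
    __shared__ double sdata[BLOCK];
    double local = 0.0;

    // each thread accumulates a strided slice of the input
    for (int i = threadIdx.x; i < n; i += BLOCK)
        local += bounds[i] * bounds[i];
    sdata[threadIdx.x] = local;
    __syncthreads();

    // standard tree reduction in shared memory
    for (int s = BLOCK / 2; s > 0; s >>= 1) {
        if (threadIdx.x < s)
            sdata[threadIdx.x] += sdata[threadIdx.x + s];
        __syncthreads();
    }
    if (threadIdx.x == 0)
        *result = sdata[0];
}
```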

@LucasBoTang
Collaborator Author

LucasBoTang commented Nov 25, 2025

Hi @ZedongPeng and @jinwen-yang,

I have now completed the experiments on MIPLIB and the Mittelmann LP benchmark.

An example figure comparing CPU and GPU preconditioning times across all tested instances, together with the full results and plots, is available in the following notebook: Read_Summary.ipynb

Overall, the results show that GPU preconditioning is consistently faster than CPU preconditioning, often by several orders of magnitude across the benchmarks. The few cases in which the CPU appears marginally faster occur only when the preconditioning time on the CPU is already very small (< 0.1 sec).

That said, because the overall solve time is typically dominated by the iterative optimization phase rather than by preconditioning, the choice between CPU and GPU preconditioning has only a minor impact on total solve time.

@LucasBoTang
Collaborator Author

LucasBoTang commented Nov 25, 2025

Numerical mismatch observations

During the full MIPLIB and Mittelmann benchmark runs, I observed minor numerical mismatches between the CPU and GPU runs. These differences occur only in a small subset of instances.

Below is the full list of mismatches detected:

[WARNING] mismatch in Primal obj for instance ns1687037:
         Status Iterations   Primal obj      Dual obj
CPU  TIME_LIMIT  247379000  4.673210786  -250.2518284
GPU  TIME_LIMIT  250602000   4.64700347  -249.3609757
------------------------------------------------------------
[WARNING] mismatch in Primal obj for instance physiciansched3-3:
      Status Iterations   Primal obj     Dual obj
CPU  OPTIMAL   52986800  2432689.742  2432616.795
GPU  OPTIMAL   54340200   2432694.06  2432526.403
------------------------------------------------------------
[WARNING] mismatch in Dual obj for instance bdry2:
      Status Iterations      Primal obj        Dual obj
CPU  OPTIMAL   33554200  0.004177643673  0.004222602356
GPU  OPTIMAL   33558800  0.004177664638  0.004227880325
------------------------------------------------------------
[WARNING] mismatch in Primal obj for instance piperout-d20:
      Status Iterations   Primal obj     Dual obj
CPU  OPTIMAL    1381600  25533.84286  25533.84195
GPU  OPTIMAL    1560400  25533.94809  25533.84247
------------------------------------------------------------
[WARNING] mismatch in Primal obj for instance sct1:
      Status Iterations    Primal obj      Dual obj
CPU  OPTIMAL      49000   -218.089755  -218.1330731
GPU  OPTIMAL      49000  -218.0897577  -218.1330731
------------------------------------------------------------
[WARNING] mismatch in Primal obj for instance proteindesign121hz512p9:
      Status Iterations   Primal obj     Dual obj
CPU  OPTIMAL    1898200   1423.90403  1423.962777
GPU  OPTIMAL    2684200  1423.904585  1423.944118
------------------------------------------------------------
[WARNING] mismatch in Primal obj for instance sct5:
      Status Iterations    Primal obj      Dual obj
CPU  OPTIMAL       5700  -228.1952931  -228.1610876
GPU  OPTIMAL       5700  -228.1952954  -228.1610876
------------------------------------------------------------
[WARNING] mismatch in Primal obj for instance gmut-76-50:
      Status Iterations    Primal obj      Dual obj
CPU  OPTIMAL     772600  -14173883.52  -14174418.73
GPU  OPTIMAL     612600  -14173884.23  -14175314.28
------------------------------------------------------------
[WARNING] mismatch in Primal obj for instance gmut-75-50:
      Status Iterations    Primal obj      Dual obj
CPU  OPTIMAL     549000  -14182312.02  -14182139.66
GPU  OPTIMAL     549000  -14182312.01  -14182139.61
------------------------------------------------------------
[WARNING] mismatch in Primal obj for instance hgms62:
      Status Iterations    Primal obj      Dual obj
CPU  OPTIMAL   20860200  -53405.66375  -53416.34387
GPU  OPTIMAL   20860200  -53405.66394  -53416.34369
------------------------------------------------------------
[WARNING] mismatch in Primal obj for instance neos-4413714-turia:
      Status Iterations   Primal obj     Dual obj
CPU  OPTIMAL   25849200  43.47503004  43.47695371
GPU  OPTIMAL   26097200  43.47871915   43.4715827
------------------------------------------------------------
[WARNING] mismatch in Primal obj for instance triptim7:
      Status Iterations   Primal obj    Dual obj
CPU  OPTIMAL     456200  2337.078721  2336.73184
GPU  OPTIMAL     456200   2337.07871  2336.73184
------------------------------------------------------------
[WARNING] mismatch in Primal obj for instance neos-3208254-reiu:
      Status Iterations    Primal obj      Dual obj
CPU  OPTIMAL       9000  -38604.86166  -38606.26268
GPU  OPTIMAL       9000  -38604.86167  -38606.26268
------------------------------------------------------------
[WARNING] mismatch in Dual obj for instance shs1042:
      Status Iterations  Primal obj     Dual obj
CPU  OPTIMAL    3663200  9442.47166  9444.253513
GPU  OPTIMAL    3663200  9442.47166  9444.253512
------------------------------------------------------------
[WARNING] mismatch in Primal obj for instance physiciansched6-1:
      Status Iterations   Primal obj     Dual obj
CPU  OPTIMAL     301200  7499.998066     7501.475
GPU  OPTIMAL     301200  7499.997697  7501.480112
------------------------------------------------------------
[WARNING] mismatch in Primal obj for instance physiciansched3-4:
      Status Iterations   Primal obj     Dual obj
CPU  OPTIMAL   72542200  1124642.272  1124806.832
GPU  OPTIMAL   52752600  1124688.273  1124585.589
------------------------------------------------------------
[WARNING] mismatch in Primal obj for instance triptim4:
      Status Iterations   Primal obj     Dual obj
CPU  OPTIMAL     929800  9.205938105  9.206941924
GPU  OPTIMAL     929800  9.205929986  9.206942036
------------------------------------------------------------
[WARNING] mismatch in Primal obj for instance triptim2:
      Status Iterations   Primal obj     Dual obj
CPU  OPTIMAL    1300800  10.86447831  10.86236456
GPU  OPTIMAL    1284600  10.86393657  10.86326915
------------------------------------------------------------
[WARNING] mismatch in Primal obj for instance neos-3229051-yass:
              Status Iterations        Primal obj     Dual obj
CPU  DUAL_INFEASIBLE        200  -1.992184297e+12  1441960.542
GPU  DUAL_INFEASIBLE        200  -1.992184298e+12  1441960.541
------------------------------------------------------------
[WARNING] mismatch in Dual obj for instance satellites2-25:
      Status Iterations    Primal obj      Dual obj
CPU  OPTIMAL       5100  -30.00002622  -29.99759189
GPU  OPTIMAL       5100  -30.00002622  -29.99759469
------------------------------------------------------------
[WARNING] mismatch in Primal obj for instance ds-big:
      Status Iterations   Primal obj     Dual obj
CPU  OPTIMAL     933200  86.83684649  86.81978274
GPU  OPTIMAL     934000  86.83691952  86.81978541
------------------------------------------------------------
[WARNING] mismatch in Primal obj for instance gmut-76-40:
      Status Iterations    Primal obj      Dual obj
CPU  OPTIMAL     191800  -14171085.83  -14171901.21
GPU  OPTIMAL     191800  -14171085.81   -14171901.2
------------------------------------------------------------
[WARNING] mismatch in Primal obj for instance germanrr:
      Status Iterations   Primal obj     Dual obj
CPU  OPTIMAL     345400  45980142.71  45980135.42
GPU  OPTIMAL     347000  45980143.85  45980135.42
------------------------------------------------------------
[WARNING] mismatch in Primal obj for instance neos-4332810-sesia:
      Status Iterations    Primal obj      Dual obj
CPU  OPTIMAL     740600  -33931.28309   -33930.7782
GPU  OPTIMAL    2763400  -33929.56003  -33926.81054
------------------------------------------------------------
[WARNING] mismatch in Primal obj for instance proteindesign122trx11p8:
      Status Iterations   Primal obj     Dual obj
CPU  OPTIMAL    1416200  1720.461027  1720.694719
GPU  OPTIMAL    1416200  1720.461026  1720.694715
------------------------------------------------------------
[WARNING] mismatch in Primal obj for instance neos-4332801-seret:
      Status Iterations   Primal obj     Dual obj
CPU  OPTIMAL    3552400  222338.3266  222313.5325
GPU  OPTIMAL    1400800  222342.9597  222320.9406
------------------------------------------------------------
[WARNING] mismatch in Primal obj for instance neos-4535459-waipa:
         Status Iterations       Primal obj          Dual obj
CPU  TIME_LIMIT    7867000  3.059144606e+15  -4.360493003e+10
GPU  TIME_LIMIT    7886000   3.05706559e+15  -4.368560407e+10
------------------------------------------------------------
[WARNING] mismatch in Dual obj for instance satellites3-25:
      Status Iterations   Primal obj      Dual obj
CPU  OPTIMAL       4900  -39.0003573  -38.99905447
GPU  OPTIMAL       4900  -39.0003573  -38.99905805
------------------------------------------------------------
[WARNING] mismatch in Dual obj for instance satellites4-25:
      Status Iterations    Primal obj      Dual obj
CPU  OPTIMAL       5000  -41.00318391  -41.01144414
GPU  OPTIMAL       5000  -41.00318391  -41.01144562
------------------------------------------------------------
[WARNING] mismatch in Dual obj for instance satellites2-40:
      Status Iterations    Primal obj      Dual obj
CPU  OPTIMAL       5300  -29.99997468  -30.00582128
GPU  OPTIMAL       5300  -29.99997468  -30.00581932
------------------------------------------------------------
Total mismatches: 30 / 414 instances

In most mismatched cases, the differences are at the level of floating-point noise, and both implementations return the same status and a very similar number of iterations.

However, a few instances stand out with noticeably larger discrepancies between the CPU and GPU runs:

neos-4332810-sesia

  • CPU
    • Primal obj: -33931.28309
    • Dual obj: -33930.77820
    • Relative duality gap: 1.5e-5
  • GPU
    • Primal obj: -33929.56003
    • Dual obj: -33926.81054
    • Relative duality gap: 8.1e-5

neos-4332801-seret

  • CPU
    • Primal obj: 222338.3266
    • Dual obj: 222313.5325
    • Relative duality gap: 1.1e-4
  • GPU
    • Primal obj: 222342.9597
    • Dual obj: 222320.9406
    • Relative duality gap: 9.9e-5
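For reference, the reported gaps are consistent with measuring the gap relative to the primal objective; e.g., for the sesia GPU run:

$$ \frac{|p - d|}{|p|} = \frac{|{-33929.56003} - ({-33926.81054})|}{33929.56003} = \frac{2.74949}{33929.56003} \approx 8.1 \times 10^{-5} $$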

All solutions satisfy a relative duality gap of order 1e-4 or better, which is well within the expected tolerance, so the current CPU/GPU differences are acceptable and consistent with expected floating-point behavior. However, the difference in iteration counts between the CPU and GPU runs on these two instances is large.

Preconditioner timing observations

I also looked at the GPU preconditioner timing. Almost all instances have very cheap preconditioning on the GPU (Ruiz + Pock–Chambolle + bound-objective are usually on the order of 1e-3 seconds in total).

The only clear outliers are again neos-4332810-sesia and neos-4332801-seret. For these two instances, the GPU preconditioner times are:

neos-4332810-sesia

  • Ruiz scaling (10 iterations): 2.155691 sec
  • Pock-Chambolle scaling (alpha=1.0): 0.613833 sec
  • Bound-objective scaling: 0.020169 sec

neos-4332801-seret

  • Ruiz scaling (10 iterations): 2.158601 sec
  • Pock-Chambolle scaling (alpha=1.0): 0.607071 sec
  • Bound-objective scaling: 0.020089 sec

So these are currently the only instances where GPU preconditioning exceeds ~2 seconds; all other instances are around the 1e-3 second level for the same stages. Most of the extra time comes specifically from the Ruiz scaling stage, which is significantly more expensive in these instances compared to the rest of the benchmark set.

For completeness, here are the CPU preconditioner times for the same two problematic instances:

  • neos-4332810-sesia

    • Ruiz scaling (10 iterations): 3.481281 sec
    • Pock-Chambolle scaling (alpha=1.0): 0.946331 sec
    • Bound-objective scaling: 0.100667 sec
  • neos-4332801-seret

    • Ruiz scaling (10 iterations): 3.377352 sec
    • Pock-Chambolle scaling (alpha=1.0): 0.930314 sec
    • Bound-objective scaling: 0.091834 sec

As expected, CPU preconditioning is consistently slower on these instances. However, the GPU times for these two cases are still unusually large.

@LucasBoTang
Collaborator Author

LucasBoTang commented Dec 3, 2025

  • waipa

    • larger nnz ≈ 2.57e8
    • but dimensions are moderate (~1.6e5 rows / ~9.6e5 cols)
  • sesia / seret

    • smaller nnz ≈ 4.7e7
    • but dimensions are extreme (~2e7 rows / cols)

Although waipa has far more nonzeros than neos-4332810-sesia and neos-4332801-seret, its GPU rescaling is much faster because the preconditioner’s runtime is not determined solely by nnz. A large portion of the cost comes from vector-level kernels whose complexity scales with:

$$ O\big( (m + n) \times \text{Ruiz iterations} \big) $$

where $m$ is the number of constraints and $n$ the number of variables.

Because the GPU implementation repeatedly launches vector kernels over arrays of size m + n during Ruiz (10 iterations), the dimension dominates the runtime. This is why waipa, despite having much larger nnz, finishes much faster.
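A back-of-the-envelope comparison using the dimensions above (counting only the vector-element updates per Ruiz iteration):

$$ (m+n)_{\text{sesia/seret}} \approx 4 \times 10^{7} \qquad \text{vs.} \qquad (m+n)_{\text{waipa}} \approx 1.1 \times 10^{6} $$

so each Ruiz iteration performs roughly 36× more vector-element work on sesia/seret than on waipa, even though waipa's nnz is about 5× larger.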

@LucasBoTang
Collaborator Author

LucasBoTang commented Dec 4, 2025

I fully compared the cumulative scaling factors produced by the CPU and GPU preconditioners on neos-4332801-seret:

Comparing constraint rescaling vectors...
==== Constraints ====

  • Length compared: 19912111
  • max |cpu - gpu| = 3.553e-15
  • mean |cpu - gpu| = 1.099e-18
  • min |cpu - gpu| = 0.000e+00
  • num(|diff| > 1.0e-12) = 0

Comparing variable rescaling vectors...
==== Variables ====

  • Length compared: 20682487
  • max |cpu - gpu| = 2.913e-13
  • mean |cpu - gpu| = 1.395e-15
  • min |cpu - gpu| = 0.000e+00
  • num(|diff| > 1.0e-12) = 0

I also compared the two scalar scaling factors used in the bound-objective rescaling step:

[CPU] con_bound_rescale = 3.8017920572592861e-04
[CPU] obj_vec_rescale   = 5.3256277297530932e-06

[GPU] con_bound_rescale = 3.8017920570129657e-04
[GPU] obj_vec_rescale   = 5.3256277287640704e-06

These results confirm that the CPU and GPU preconditioners produce numerically equivalent scaling: the vectors agree elementwise to better than 1e-12, and the two scalar factors agree to a relative difference on the order of 1e-10. As the next step, I will check initialize_solver_state in solver.cu.

@LucasBoTang
Collaborator Author

LucasBoTang commented Dec 4, 2025

I performed an additional experiment to check whether the discrepancies come from the solver (solver.cu).

To isolate the effect, I swapped the CPU/GPU scaling factors:

  • solver.cu with CPU preconditioning + GPU scaling factors
  • solver.cu with GPU preconditioning + CPU scaling factors

Under this setup, the CPU and GPU solvers behave consistently: given identical scaling vectors, they produce nearly identical trajectories and final results.

This strongly suggests that solver.cu is not introducing any systematic numerical bias.

Therefore, @ZedongPeng and @jinwen-yang, even though the scaling vectors differ only at the level of 1e−13 to 1e−15, the total iteration count on these two instances ends up quite different between CPU and GPU. Do you think such tiny discrepancies in scaling could naturally lead to large divergence in iteration counts for these particular problems? Or does this indicate that these instances are extremely numerically sensitive?

@LucasBoTang
Collaborator Author

LucasBoTang commented Dec 12, 2025

I checked the values in initialize_step_size_and_primal_weight():

  • max_sv returned by estimate_maximum_singular_value(...)
  • state->step_size = 0.998 / max_sv
  • state->primal_weight

I compared CPU and GPU runs on the following instances:

  • neos-4332810-sesia
  • neos-4332801-seret

Instance: neos-4332810-sesia

  • GPU:
    • max_sv = 0.99981011762876171
    • step_size = 0.99818953859653392
    • primal_weight = 1
  • CPU:
    • max_sv = 0.99981011762876093
    • step_size = 0.99818953859653470
    • primal_weight = 1

Instance: neos-4332801-seret

  • GPU:
    • max_sv = 0.99981011762876171
    • step_size = 0.99818953859653392
    • primal_weight = 1
  • CPU:
    • max_sv = 0.99981011762876093
    • step_size = 0.99818953859653470
    • primal_weight = 1
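As a quick consistency check, dividing the reported values reproduces the printed step sizes:

$$ \frac{0.998}{0.99981011762876171} \approx 0.998189538596534 $$

which matches the reported step_size on both devices to the shown precision, so step_size = 0.998 / max_sv was evaluated identically in both paths.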

This indicates that the CPU/GPU discrepancy is not coming from initialize_step_size_and_primal_weight() (nor from the singular-value estimation feeding it), since both paths produce essentially identical initialization parameters.
