Conversation

@LucasBoTang
Collaborator

Summary

This PR introduces a CUDA-based implementation of the preconditioner module, including Ruiz, Pock–Chambolle, and objective-bound scaling.

Main Changes

  • Replaced preconditioner.c with preconditioner.cu
  • Modified initialize_solver_state in solver.cu for GPU preconditioner integration

Implementation Details

  • The matrix is stored in CSR format, with an additional row ID array to enable efficient row-wise scaling (A[i,j] *= E[i]) without extra lookups.
  • Added an auxiliary array recording the mapping of each A element to its corresponding position in Aᵀ, enabling synchronized scaling of A and Aᵀ without atomics or additional CSR/CSC conversions.
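A minimal sketch of how these two auxiliary arrays can combine in a single scaling kernel (all identifiers here are illustrative, not the actual names in preconditioner.cu):

```cuda
// Illustrative sketch: scale each nonzero of A by its row factor E[i],
// mirroring the update into A^T via a precomputed position map.
__global__ void scale_rows_csr(double *vals,       // CSR values of A
                               double *vals_t,     // values of A^T
                               const int *row_id,  // row index of each nonzero
                               const int *to_t,    // position of vals[k] in A^T
                               const double *E,    // per-row scaling factors
                               int nnz)
{
    int k = blockIdx.x * blockDim.x + threadIdx.x;
    if (k < nnz) {
        double v = vals[k] * E[row_id[k]];  // A[i,j] *= E[i], no row search
        vals[k] = v;
        vals_t[to_t[k]] = v;  // each A^T slot written exactly once: no atomics
    }
}
```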

Next Step

  • Benchmark GPU vs CPU preconditioner runtime before merging

Note

reduce_bound_norm_sq_atomic currently relies on atomicAdd(double*) for the bound-norm reduction, which requires CMAKE_CUDA_ARCHITECTURES ≥ 60 (native double-precision atomicAdd is only available on sm_60 and newer).
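For context, the pattern in question looks roughly like this (kernel and variable names are illustrative, not the actual code):

```cuda
// Sketch of the atomic variant: every thread folds its squared bound
// into one global accumulator. atomicAdd on double is only supported
// natively from compute capability 6.0, hence the sm_60 requirement.
__global__ void bound_norm_sq_atomic(const double *bounds, int n,
                                     double *result)
{
    int i = blockIdx.x * blockDim.x + threadIdx.x;
    if (i < n)
        atomicAdd(result, bounds[i] * bounds[i]);
}
```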

Would it be preferable to:

  • Switch to a portable single-block shared-memory reduction (no atomics),
  • Redesign the reduction kernel, or
  • Keep the current implementation and require sm_60+?

@LucasBoTang
Collaborator Author

LucasBoTang commented Nov 16, 2025

This update fixes two issues in the GPU preconditioner:

  1. Corrected objective/bound rescaling: The previous GPU code applied the wrong scaling to the bounds and objective, which caused PDHG to need far more iterations. All scaling is now applied correctly on the GPU.

  2. Improved reduce_bound_norm_sq_kernel: Replaced atomic accumulation with a shared-memory block reduction. This removes the atomic overhead and makes the result deterministic and fast.
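A sketch of what such a reduction can look like, assuming the standard single-block tree-reduction pattern (the kernel name and BLOCK size here are illustrative):

```cuda
// Single-block shared-memory reduction for the squared bound norm,
// launched as <<<1, BLOCK>>>. A grid-stride loop lets one block cover
// the whole array; the tree reduction assumes BLOCK is a power of two.
// With no atomics, the summation order is fixed, so the result is
// deterministic across runs.
#define BLOCK 256

__global__ void reduce_bound_norm_sq(const double *bounds, int n,
                                     double *result)
{
    __shared__ double sdata[BLOCK];
    double local = 0.0;

    // each thread accumulates a strided slice of the input
    for (int i = threadIdx.x; i < n; i += BLOCK)
        local += bounds[i] * bounds[i];
    sdata[threadIdx.x] = local;
    __syncthreads();

    // standard tree reduction in shared memory
    for (int s = BLOCK / 2; s > 0; s >>= 1) {
        if (threadIdx.x < s)
            sdata[threadIdx.x] += sdata[threadIdx.x + s];
        __syncthreads();
    }
    if (threadIdx.x == 0)
        *result = sdata[0];
}
```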

@LucasBoTang
Collaborator Author

LucasBoTang commented Nov 25, 2025

Hi @ZedongPeng and @jinwen-yang,

I have now completed the experiments on MIPLIB and the Mittelmann LP benchmark.

An example figure comparing CPU and GPU preconditioning times across all tested instances, together with the full results and plots, is available in the following notebook: Read_Summary.ipynb

Overall, the results show that GPU preconditioning is consistently faster than CPU preconditioning, often by several orders of magnitude across the benchmarks. The few cases in which the CPU appears marginally faster occur only when the preconditioning time on the CPU is already very small (< 0.1 sec).

That said, because the overall solve time is typically dominated by the iterative optimization phase rather than by preconditioning, the choice between CPU and GPU preconditioning has only a minor impact on total solve time.

@LucasBoTang
Collaborator Author

LucasBoTang commented Nov 25, 2025

Numerical mismatch observations

During the full MIPLIB and Mittelmann benchmark runs, I observed minor numerical mismatches between the CPU and GPU runs. These differences occur only in a small subset of instances.

Below is the full list of mismatches detected:

[WARNING] mismatch in Primal obj for instance ns1687037:
         Status Iterations   Primal obj      Dual obj
CPU  TIME_LIMIT  247379000  4.673210786  -250.2518284
GPU  TIME_LIMIT  250602000   4.64700347  -249.3609757
------------------------------------------------------------
[WARNING] mismatch in Primal obj for instance physiciansched3-3:
      Status Iterations   Primal obj     Dual obj
CPU  OPTIMAL   52986800  2432689.742  2432616.795
GPU  OPTIMAL   54340200   2432694.06  2432526.403
------------------------------------------------------------
[WARNING] mismatch in Dual obj for instance bdry2:
      Status Iterations      Primal obj        Dual obj
CPU  OPTIMAL   33554200  0.004177643673  0.004222602356
GPU  OPTIMAL   33558800  0.004177664638  0.004227880325
------------------------------------------------------------
[WARNING] mismatch in Primal obj for instance piperout-d20:
      Status Iterations   Primal obj     Dual obj
CPU  OPTIMAL    1381600  25533.84286  25533.84195
GPU  OPTIMAL    1560400  25533.94809  25533.84247
------------------------------------------------------------
[WARNING] mismatch in Primal obj for instance sct1:
      Status Iterations    Primal obj      Dual obj
CPU  OPTIMAL      49000   -218.089755  -218.1330731
GPU  OPTIMAL      49000  -218.0897577  -218.1330731
------------------------------------------------------------
[WARNING] mismatch in Primal obj for instance proteindesign121hz512p9:
      Status Iterations   Primal obj     Dual obj
CPU  OPTIMAL    1898200   1423.90403  1423.962777
GPU  OPTIMAL    2684200  1423.904585  1423.944118
------------------------------------------------------------
[WARNING] mismatch in Primal obj for instance sct5:
      Status Iterations    Primal obj      Dual obj
CPU  OPTIMAL       5700  -228.1952931  -228.1610876
GPU  OPTIMAL       5700  -228.1952954  -228.1610876
------------------------------------------------------------
[WARNING] mismatch in Primal obj for instance gmut-76-50:
      Status Iterations    Primal obj      Dual obj
CPU  OPTIMAL     772600  -14173883.52  -14174418.73
GPU  OPTIMAL     612600  -14173884.23  -14175314.28
------------------------------------------------------------
[WARNING] mismatch in Primal obj for instance gmut-75-50:
      Status Iterations    Primal obj      Dual obj
CPU  OPTIMAL     549000  -14182312.02  -14182139.66
GPU  OPTIMAL     549000  -14182312.01  -14182139.61
------------------------------------------------------------
[WARNING] mismatch in Primal obj for instance hgms62:
      Status Iterations    Primal obj      Dual obj
CPU  OPTIMAL   20860200  -53405.66375  -53416.34387
GPU  OPTIMAL   20860200  -53405.66394  -53416.34369
------------------------------------------------------------
[WARNING] mismatch in Primal obj for instance neos-4413714-turia:
      Status Iterations   Primal obj     Dual obj
CPU  OPTIMAL   25849200  43.47503004  43.47695371
GPU  OPTIMAL   26097200  43.47871915   43.4715827
------------------------------------------------------------
[WARNING] mismatch in Primal obj for instance triptim7:
      Status Iterations   Primal obj    Dual obj
CPU  OPTIMAL     456200  2337.078721  2336.73184
GPU  OPTIMAL     456200   2337.07871  2336.73184
------------------------------------------------------------
[WARNING] mismatch in Primal obj for instance neos-3208254-reiu:
      Status Iterations    Primal obj      Dual obj
CPU  OPTIMAL       9000  -38604.86166  -38606.26268
GPU  OPTIMAL       9000  -38604.86167  -38606.26268
------------------------------------------------------------
[WARNING] mismatch in Dual obj for instance shs1042:
      Status Iterations  Primal obj     Dual obj
CPU  OPTIMAL    3663200  9442.47166  9444.253513
GPU  OPTIMAL    3663200  9442.47166  9444.253512
------------------------------------------------------------
[WARNING] mismatch in Primal obj for instance physiciansched6-1:
      Status Iterations   Primal obj     Dual obj
CPU  OPTIMAL     301200  7499.998066     7501.475
GPU  OPTIMAL     301200  7499.997697  7501.480112
------------------------------------------------------------
[WARNING] mismatch in Primal obj for instance physiciansched3-4:
      Status Iterations   Primal obj     Dual obj
CPU  OPTIMAL   72542200  1124642.272  1124806.832
GPU  OPTIMAL   52752600  1124688.273  1124585.589
------------------------------------------------------------
[WARNING] mismatch in Primal obj for instance triptim4:
      Status Iterations   Primal obj     Dual obj
CPU  OPTIMAL     929800  9.205938105  9.206941924
GPU  OPTIMAL     929800  9.205929986  9.206942036
------------------------------------------------------------
[WARNING] mismatch in Primal obj for instance triptim2:
      Status Iterations   Primal obj     Dual obj
CPU  OPTIMAL    1300800  10.86447831  10.86236456
GPU  OPTIMAL    1284600  10.86393657  10.86326915
------------------------------------------------------------
[WARNING] mismatch in Primal obj for instance neos-3229051-yass:
              Status Iterations        Primal obj     Dual obj
CPU  DUAL_INFEASIBLE        200  -1.992184297e+12  1441960.542
GPU  DUAL_INFEASIBLE        200  -1.992184298e+12  1441960.541
------------------------------------------------------------
[WARNING] mismatch in Dual obj for instance satellites2-25:
      Status Iterations    Primal obj      Dual obj
CPU  OPTIMAL       5100  -30.00002622  -29.99759189
GPU  OPTIMAL       5100  -30.00002622  -29.99759469
------------------------------------------------------------
[WARNING] mismatch in Primal obj for instance ds-big:
      Status Iterations   Primal obj     Dual obj
CPU  OPTIMAL     933200  86.83684649  86.81978274
GPU  OPTIMAL     934000  86.83691952  86.81978541
------------------------------------------------------------
[WARNING] mismatch in Primal obj for instance gmut-76-40:
      Status Iterations    Primal obj      Dual obj
CPU  OPTIMAL     191800  -14171085.83  -14171901.21
GPU  OPTIMAL     191800  -14171085.81   -14171901.2
------------------------------------------------------------
[WARNING] mismatch in Primal obj for instance germanrr:
      Status Iterations   Primal obj     Dual obj
CPU  OPTIMAL     345400  45980142.71  45980135.42
GPU  OPTIMAL     347000  45980143.85  45980135.42
------------------------------------------------------------
[WARNING] mismatch in Primal obj for instance neos-4332810-sesia:
      Status Iterations    Primal obj      Dual obj
CPU  OPTIMAL     740600  -33931.28309   -33930.7782
GPU  OPTIMAL    2763400  -33929.56003  -33926.81054
------------------------------------------------------------
[WARNING] mismatch in Primal obj for instance proteindesign122trx11p8:
      Status Iterations   Primal obj     Dual obj
CPU  OPTIMAL    1416200  1720.461027  1720.694719
GPU  OPTIMAL    1416200  1720.461026  1720.694715
------------------------------------------------------------
[WARNING] mismatch in Primal obj for instance neos-4332801-seret:
      Status Iterations   Primal obj     Dual obj
CPU  OPTIMAL    3552400  222338.3266  222313.5325
GPU  OPTIMAL    1400800  222342.9597  222320.9406
------------------------------------------------------------
[WARNING] mismatch in Primal obj for instance neos-4535459-waipa:
         Status Iterations       Primal obj          Dual obj
CPU  TIME_LIMIT    7867000  3.059144606e+15  -4.360493003e+10
GPU  TIME_LIMIT    7886000   3.05706559e+15  -4.368560407e+10
------------------------------------------------------------
[WARNING] mismatch in Dual obj for instance satellites3-25:
      Status Iterations   Primal obj      Dual obj
CPU  OPTIMAL       4900  -39.0003573  -38.99905447
GPU  OPTIMAL       4900  -39.0003573  -38.99905805
------------------------------------------------------------
[WARNING] mismatch in Dual obj for instance satellites4-25:
      Status Iterations    Primal obj      Dual obj
CPU  OPTIMAL       5000  -41.00318391  -41.01144414
GPU  OPTIMAL       5000  -41.00318391  -41.01144562
------------------------------------------------------------
[WARNING] mismatch in Dual obj for instance satellites2-40:
      Status Iterations    Primal obj      Dual obj
CPU  OPTIMAL       5300  -29.99997468  -30.00582128
GPU  OPTIMAL       5300  -29.99997468  -30.00581932
------------------------------------------------------------
Total mismatches: 30 / 414 instances

In most mismatched cases, the differences are at the level of floating-point noise, and both implementations return the same status and a very similar number of iterations.

However, a few instances stand out with noticeably larger discrepancies between the CPU and GPU runs:

neos-4332810-sesia

  • CPU
    • Primal obj: -33931.28309
    • Dual obj: -33930.77820
    • Relative duality gap: 1.5e-5
  • GPU
    • Primal obj: -33929.56003
    • Dual obj: -33926.81054
    • Relative duality gap: 8.1e-5

neos-4332801-seret

  • CPU
    • Primal obj: 222338.3266
    • Dual obj: 222313.5325
    • Relative duality gap: 1.1e-4
  • GPU
    • Primal obj: 222342.9597
    • Dual obj: 222320.9406
    • Relative duality gap: 9.9e-5
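For reference, the reported gaps are consistent with measuring the gap relative to the primal objective; e.g., for the sesia GPU run:

$$ \frac{|p - d|}{|p|} = \frac{|{-33929.56003} - ({-33926.81054})|}{33929.56003} = \frac{2.74949}{33929.56003} \approx 8.1 \times 10^{-5} $$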

All solutions satisfy a relative duality gap of order 1e-4 or better, which is well within the expected tolerance, so the current CPU/GPU differences are acceptable and consistent with expected floating-point behavior. However, the difference in iteration counts between the CPU and GPU runs on these two instances is large.

Preconditioner timing observations

I also looked at the GPU preconditioner timing. Almost all instances have very cheap preconditioning on the GPU (Ruiz + Pock–Chambolle + bound-objective are usually on the order of 1e-3 seconds in total).

The only clear outliers are again neos-4332810-sesia and neos-4332801-seret. For these two instances, the GPU preconditioner times are:

neos-4332810-sesia

  • Ruiz scaling (10 iterations): 2.155691 sec
  • Pock-Chambolle scaling (alpha=1.0): 0.613833 sec
  • Bound-objective scaling: 0.020169 sec

neos-4332801-seret

  • Ruiz scaling (10 iterations): 2.158601 sec
  • Pock-Chambolle scaling (alpha=1.0): 0.607071 sec
  • Bound-objective scaling: 0.020089 sec

So these are currently the only instances where GPU preconditioning exceeds ~2 seconds; all other instances are around the 1e-3 second level for the same stages. Most of the extra time comes specifically from the Ruiz scaling stage, which is significantly more expensive in these instances compared to the rest of the benchmark set.

For completeness, here are the CPU preconditioner times for the same two problematic instances:

  • neos-4332810-sesia

    • Ruiz scaling (10 iterations): 3.481281 sec
    • Pock-Chambolle scaling (alpha=1.0): 0.946331 sec
    • Bound-objective scaling: 0.100667 sec
  • neos-4332801-seret

    • Ruiz scaling (10 iterations): 3.377352 sec
    • Pock-Chambolle scaling (alpha=1.0): 0.930314 sec
    • Bound-objective scaling: 0.091834 sec

As expected, CPU preconditioning is consistently slower on these instances. However, the GPU times for these two cases are still unusually large.

@LucasBoTang
Collaborator Author

LucasBoTang commented Dec 3, 2025

  • waipa

    • larger nnz ≈ 2.57e8
    • but dimensions are moderate (~1.6e5 rows / ~9.6e5 cols)
  • sesia / seret

    • smaller nnz ≈ 4.7e7
    • but dimensions are extreme (~2e7 rows / cols)

Although waipa has far more nonzeros than neos-4332810-sesia and neos-4332801-seret, its GPU rescaling is much faster because the preconditioner’s runtime is not determined solely by nnz. A large portion of the cost comes from vector-level kernels whose complexity scales with:

$$ O\big( (m + n) \times \text{Ruiz iterations} \big) $$

where $m$ is the number of constraints and $n$ the number of variables.

Because the GPU implementation repeatedly launches vector kernels over arrays of size m + n during Ruiz (10 iterations), the dimension dominates the runtime. This is why waipa, despite having much larger nnz, finishes much faster.
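A back-of-the-envelope comparison using the dimensions above (counting only the vector-element updates per Ruiz iteration):

$$ (m+n)_{\text{sesia/seret}} \approx 4 \times 10^{7} \qquad \text{vs.} \qquad (m+n)_{\text{waipa}} \approx 1.1 \times 10^{6} $$

so each Ruiz iteration performs roughly 36× more vector-element work on sesia/seret than on waipa, even though waipa's nnz is about 5× larger.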

@LucasBoTang
Collaborator Author

LucasBoTang commented Dec 4, 2025

I fully compared the cumulative scaling factors produced by the CPU and GPU preconditioners on neos-4332801-seret:

Comparing constraint rescaling vectors...
==== Constraints ====

  • Length compared: 19912111
  • max |cpu - gpu| = 3.553e-15
  • mean |cpu - gpu| = 1.099e-18
  • min |cpu - gpu| = 0.000e+00
  • num(|diff| > 1.0e-12) = 0

Comparing variable rescaling vectors...
==== Variables ====

  • Length compared: 20682487
  • max |cpu - gpu| = 2.913e-13
  • mean |cpu - gpu| = 1.395e-15
  • min |cpu - gpu| = 0.000e+00
  • num(|diff| > 1.0e-12) = 0

I also compared the two scalar scaling factors used in the bound-objective rescaling step:

[CPU] con_bound_rescale = 3.8017920572592861e-04
[CPU] obj_vec_rescale   = 5.3256277297530932e-06

[GPU] con_bound_rescale = 3.8017920570129657e-04
[GPU] obj_vec_rescale   = 5.3256277287640704e-06

These results confirm that the CPU and GPU preconditioners produce numerically equivalent scaling: the vectors agree elementwise to better than 1e-12, and the two scalar factors agree to a relative difference on the order of 1e-10. As the next step, I will check initialize_solver_state in solver.cu.

@LucasBoTang
Collaborator Author

LucasBoTang commented Dec 4, 2025

I performed an additional experiment to check whether the discrepancies come from the solver (solver.cu).

To isolate the effect, I swapped the CPU/GPU scaling factors:

  • solver.cu with CPU preconditioning + GPU scaling factors
  • solver.cu with GPU preconditioning + CPU scaling factors

Under this setup, the CPU and GPU solvers behave consistently: given identical scaling vectors, they produce nearly identical trajectories and final results.

This strongly suggests that solver.cu is not introducing any systematic numerical bias.

Therefore, @ZedongPeng and @jinwen-yang, even though the scaling vectors differ only at the level of 1e−13 to 1e−15, the total iteration count on these two instances ends up quite different between CPU and GPU. Do you think such tiny discrepancies in scaling could naturally lead to large divergence in iteration counts for these particular problems? Or does this indicate that these instances are extremely numerically sensitive?

@LucasBoTang
Collaborator Author

LucasBoTang commented Dec 12, 2025

I checked the values in initialize_step_size_and_primal_weight():

  • max_sv returned by estimate_maximum_singular_value(...)
  • state->step_size = 0.998 / max_sv
  • state->primal_weight

I compared CPU and GPU runs on the following instances:

  • neos-4332810-sesia
  • neos-4332801-seret

Instance: neos-4332810-sesia

  • GPU:
    • max_sv = 0.99981011762876171
    • step_size = 0.99818953859653392
    • primal_weight = 1
  • CPU:
    • max_sv = 0.99981011762876093
    • step_size = 0.99818953859653470
    • primal_weight = 1

Instance: neos-4332801-seret

  • GPU:
    • max_sv = 0.99981011762876171
    • step_size = 0.99818953859653392
    • primal_weight = 1
  • CPU:
    • max_sv = 0.99981011762876093
    • step_size = 0.99818953859653470
    • primal_weight = 1
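As a quick consistency check, dividing the reported values reproduces the printed step sizes:

$$ \frac{0.998}{0.99981011762876171} \approx 0.998189538596534 $$

which matches the reported step_size on both devices to the shown precision, so step_size = 0.998 / max_sv was evaluated identically in both paths.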

This indicates that the CPU/GPU discrepancy is not coming from initialize_step_size_and_primal_weight() (nor from the singular-value estimation feeding it), since both paths produce essentially identical initialization parameters.
