Skip to content

[HSolver] PPCG Solver (CPU/CUDA), Parallel Optimization, and Diagonalization Refactor#7517

Open
Roux-sq wants to merge 41 commits into
deepmodeling:developfrom
collapsar-z:diago_final
Open

[HSolver] PPCG Solver (CPU/CUDA), Parallel Optimization, and Diagonalization Refactor#7517
Roux-sq wants to merge 41 commits into
deepmodeling:developfrom
collapsar-z:diago_final

Conversation

@Roux-sq

@Roux-sq Roux-sq commented Jun 25, 2026

Copy link
Copy Markdown

Reminder

  • Have you linked an issue with this pull request?
  • Have you added adequate unit tests and/or case tests for your pull request?
  • Have you noticed possible changes of behavior below or in the linked issue?
  • Have you explained the changes of codes in core modules of ESolver, HSolver, ElecState, Hamilt, Operator or Psi? (ignore if not applicable)

Linked Issue

Fix #...

Unit Tests and/or Case Tests for my changes

  • A unit test is added for each new feature or bug fix.

What's changed?

  • Example: My changes might affect the performance of the application under certain conditions, and I have tested the impact on various scenarios...

Any changes of core modules? (ignore if not applicable)

  • Example: I have added a new virtual function in the esolver base class in order to ...

[HSolver] PPCG Solver (CPU/CUDA), Parallel Optimization, and Diagonalization Refactor

Overview

This PR introduces a systematic extension and refactor of the ABACUS Hamiltonian eigenvalue solver module. The main changes include:

  1. New PPCG solver: implements the Projected Preconditioned Conjugate Gradient method with band-locking, Rayleigh-Ritz acceleration, extra buffer bands, and CPU/CUDA Device-path support.
  2. OpenMP parallel optimization: adds band-level OpenMP parallel regions to iterative solvers such as BPCG, Davidson, and DavSubspace.
  3. Shared orthogonalization refactor: adds DiagOrthogonalizer to unify duplicated orthogonalization logic across multiple solvers.
  4. DiagoTrace convergence diagnostics: provides optional CSV tracing to help analyze the iteration process of different solvers.
  5. DiagoAutoSelector solver recommendation: provides explainable solver recommendations based on problem size and runtime environment, without changing default behavior.
  6. Test and benchmark support: adds PPCG tests, OpenMP consistency tests, DiagoTrace validation paths, and related benchmark targets.

By default, these changes preserve the original solve flow. The new diagnostics and recommendation features are enabled explicitly through environment variables.

The main changes are concentrated in source/source_hsolver/, including the new solver, shared orthogonalization utilities, diagnostic tools, and related tests/benchmarks.


Main Changes

PPCG Eigenvalue Solver

This PR adds the DiagoPPCG class, implementing the Projected Preconditioned Conjugate Gradient method. The implementation follows the existing ABACUS T, Device template and device abstraction system, supports the CPU path, and provides corresponding CUDA GPU support and benchmark targets.

Key features:

  • Band-locking: converged bands are locked to reduce drift and avoid repeated work in later iterations.
  • Rayleigh-Ritz acceleration: periodically performs subspace diagonalization to improve convergence.
  • Extra buffer bands: uses an additional 10% of bands as a convergence buffer to reduce the risk of stagnation.
  • MPI band-group distribution: adapts to the existing MPI band-group distribution; locked and unlocked bands are handled through their corresponding paths.
  • CPU/CUDA device paths: adapts CPU and CUDA GPU execution through the ABACUS device abstraction layer.
  • Subspace dimension strategy: uses a 2D subspace in the first iteration and a 3D subspace in later iterations.

The PPCG path can be selected through the INPUT solver setting.


OpenMP Parallel Optimization

Band-level OpenMP parallelism is added to multiple iterative solvers to improve efficiency in multi-band cases.

Solver Parallelized regions
BPCG line_minimize_with_block_op, calc_grad_with_block_op, normalize_op
Davidson convergence check in diag_once, gradient calculation, preconditioning, refresh
DavSubspace convergence check in diag_once, generalized eigenproblem assembly, refresh

The OpenMP regions use schedule(static) and are guarded by if(nband > threshold) conditions to avoid thread overhead on small problems.


Shared Orthogonalization Refactor

This PR adds DiagOrthogonalizer<T, Device> at:

source/source_hsolver/module_diag/diag_orthogonalizer.h

The class provides shared orthogonalization operations that repeatedly appear in different solvers:

  • Cholesky-based orthonormalization: cholesky_orth_parallel
  • Projection-based orthogonalization: project_out_parallel, overlap_with_metric, project_out_with_coeff
  • Schmidt orthogonalization with S metric: schmidt_orthogonalize_s_metric
  • Plain Gram-Schmidt orthogonalization and orthogonality checks

This refactor replaces part of the scattered hand-written orthogonalization logic in BPCG, CG, Davidson, and PPCG, reducing duplicated implementations and providing a more centralized maintenance point.

Davidson requires extra care: its orthogonalization logic is not an isolated step, but is coupled with dynamic subspace expansion, nbase/notconv state updates, new-direction generation in cal_grad, and the refresh restart process. Therefore, the Davidson path is refactored conservatively: only the safely reusable low-level orthogonalization operations are extracted, while the solver-specific iteration flow remains intact.


DiagoTrace Convergence Diagnostics

This PR adds a lightweight header-only CSV tracer:

source/source_hsolver/module_diag/diago_trace.h

It records convergence data for each iteration, helping with performance analysis and debugging.

Environment variables:

ABACUS_DIAGO_TRACE=1              # Enable tracing; rank 0 only by default
ABACUS_DIAGO_TRACE_FILE=<path>    # Custom output path
ABACUS_DIAGO_TRACE_ALL_RANKS=1    # Output one file per MPI rank

Output columns:

solver, rank, iter, nband, max_residual, avg_residual, n_converged, orth_error, note

Currently integrated into:

  • PPCG
  • BPCG
  • CG
  • Davidson

By default, DiagoTrace is fully disabled: it does not create files and does not change the solve flow. In MPI runs, only rank 0 writes CSV by default. When ABACUS_DIAGO_TRACE_ALL_RANKS=1 is enabled, each rank writes a separate file to avoid multiple processes writing to the same CSV file.


DiagoAutoSelector Solver Recommendation

This PR adds a header-only solver recommendation utility:

source/source_hsolver/module_diag/diago_auto_selector.h

It gives explainable solver recommendations based on problem features, including:

  • current method;
  • calculation type;
  • matrix/basis size;
  • number of bands;
  • number of MPI pool processes;
  • SCF iteration index;
  • device type.

Environment variables:

ABACUS_DIAGO_AUTO_REPORT=1    # Report recommendation only; do not change the actual solver
ABACUS_DIAGO_AUTO_SELECT=1    # Use the recommended solver to override the current method

The recommendation rules are roughly prioritized as follows:

  1. GPU device -> bpcg: PPCG has CUDA-path support, but before sufficient benchmark data is available, AutoSelector conservatively recommends the more regular block-CG path for GPU cases.
  2. NSCF calculation -> dav, prioritizing the more robust full-convergence path.
  3. MPI + many bands -> bpcg, as block operations can better benefit from parallel execution.
  4. Many bands -> ppcg, where projected/block updates may be beneficial.
  5. Large PW subspace + non-initial SCF step -> dav_subspace.
  6. Default case -> cg, as the conservative fallback.

AutoSelector is a heuristic recommendation system. It does not claim to provide the optimal solver for every system. Its main purpose is to provide explainable selection hints for benchmarks and future rule calibration.

Safety mechanisms:

  • ABACUS_DIAGO_AUTO_REPORT=1 reports recommendations only and does not change the solver.
  • Actual switching happens only when ABACUS_DIAGO_AUTO_SELECT=1 is set.
  • After the preconditioner has been set up, AutoSelector does not switch across the dav_subspace boundary, avoiding inconsistent preconditioner usage.

Benchmark and Test Infrastructure

This PR adds or extends the following test/benchmark targets:

Target Description
MODULE_HSOLVER_ppcg PPCG unit tests, including random Hermitian matrices and fixed-size matrices
MODULE_HSOLVER_ppcg_bench PPCG performance benchmark
MODULE_HSOLVER_bpcg_bench BPCG performance benchmark
MODULE_HSOLVER_david_bench Davidson performance benchmark
MODULE_HSOLVER_ppcg_bench_cuda PPCG benchmark target for CUDA environments
MODULE_HSOLVER_openmp_consistency OpenMP consistency test covering BPCG and Davidson with different OMP_NUM_THREADS settings

Example test commands:

# PPCG unit test
ctest --test-dir build -V -R MODULE_HSOLVER_ppcg

# OpenMP consistency test
ctest --test-dir build -V -R MODULE_HSOLVER_openmp_consistency

DiagoTrace can be checked with:

export ABACUS_DIAGO_TRACE=1
export ABACUS_DIAGO_TRACE_FILE=diago_trace.csv

After running the relevant hsolver tests, check that diago_trace.csv is generated and contains the header:

solver,rank,iter,nband,max_residual,avg_residual,n_converged,orth_error,note

DiagoAutoSelector can first be validated in report-only mode:

export ABACUS_DIAGO_AUTO_REPORT=1
unset ABACUS_DIAGO_AUTO_SELECT

This mode reports the recommendation and reason only, without changing the actual solver. It is suitable as an observation tool before benchmarking.


Compatibility and Behavior Changes

By default, this PR does not change existing user behavior:

  • Existing solvers are still selected according to input parameters.
  • OpenMP regions are guarded by _OPENMP and band-count thresholds.
  • DiagoTrace is disabled by default.
  • DiagoAutoSelector is disabled by default.
  • ABACUS_DIAGO_AUTO_REPORT=1 reports recommendations only and does not change the solver.
  • The recommended solver is used only when ABACUS_DIAGO_AUTO_SELECT=1 is explicitly set.

Therefore, the new capabilities are added incrementally. PW solver selection may change only when auto-selection mode is explicitly enabled.


Breaking Changes

No default breaking changes.

Notes:

  • ABACUS_DIAGO_AUTO_SELECT=1 is explicit opt-in behavior and may change the selected PW solver.
  • DiagoTrace generates CSV files when enabled; it produces no extra output by default.
  • AutoSelector is currently a heuristic rule system and can be further calibrated with more benchmark data.

Follow-Up Work

Potential follow-up work includes:

  • Calibrating AutoSelector rules with more real-system benchmark data.
  • Recovering performance for Davidson paths affected by the orthogonalization refactor.
  • Extending DiagoTrace statistics, for example by outputting globally reduced residuals.
  • Running systematic benchmarks for PPCG, BPCG, and Davidson across different band counts and parallel scales.

collapsar-z and others added 30 commits May 15, 2026 16:17
…feat: add 4 bigger matrixs to test ppcg algorithm and update CMakeLists.txt
… the efficiency of previous ppcg and new ppcg
…p and calc_grad_with_block_op

Refactor the outer band loops to use coarse-grained #pragma omp parallel
with separate work-sharing for/if directives. Each band's MPI reductions
(Parallel_Reduce::reduce_pool) are collected into per-band arrays and
performed serially inside #pragma omp single barriers, eliminating the
need for MPI_THREAD_MULTIPLE and nested parallelism.

Key changes:
- line_minimize_with_block_op: 5-step parallel pipeline (norm, reduce,
  normalize+epsilo, reduce, update) with n_band > 4 guard.
- calc_grad_with_block_op: 7-step parallel pipeline (norm, reduce,
  normalize+epsilo, reduce, err+beta, reduce, update) with n_band > 4 guard.
- Replace BlasConnector::dot with manual std::norm accumulation to avoid
  thread-safety issues with BLAS dot inside OpenMP loops.
Replace per-band dot_real_op + vector_div_constant_op calls with a
3-step parallel pipeline: (1) parallel norm accumulation, (2) serial
MPI reduce, (3) parallel division. This avoids repeated BLAS1 calls
and nested threading from vector_div_constant_op's internal parallel.
Uses if(notconv > 4) guard.
- math_kernel_op_vec.cpp: add if(dim > 256) guards to all OpenMP
  parallel loops (vector_mul_real_op, vector_mul_vector_op,
  vector_div_constant_op, vector_div_vector_op, vector_add_vector_op)
  to avoid thread-spawn overhead on small arrays.

- diago_openmp_consistency_test.cpp: new gtest suite verifying that
  BPCG and Davidson produce bitwise-identical eigenvalues across
  OMP_NUM_THREADS={1,2,4}.

- CMakeLists.txt: add MODULE_HSOLVER_openmp_consistency target.
Copilot AI review requested due to automatic review settings June 25, 2026 14:57

Copilot AI left a comment

Copy link
Copy Markdown

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Pull request overview

This pull request extends and refactors ABACUS’s Hamiltonian diagonalization (HSolver) stack by adding a new PPCG solver (CPU/CUDA), introducing band-level OpenMP parallelism in iterative solvers, consolidating shared orthogonalization logic, and adding optional diagnostics (CSV tracing) plus an auto-selection helper for solver recommendation/override.

Changes:

  • Add PPCG solver implementation and integrate it into PW HSolver (plus input/printing support).
  • Add OpenMP parallel regions across BPCG/Davidson/DavSubspace and refactor orthogonalization helpers into DiagOrthogonalizer.
  • Add diagnostics/tools and test/benchmark targets (PPCG tests/bench, OpenMP consistency test, DiagoTrace, DiagoAutoSelector).

Reviewed changes

Copilot reviewed 27 out of 28 changed files in this pull request and generated 6 comments.

Show a summary per file
File Description
source/source_io/module_parameter/read_input_item_elec_stru.cpp Extend ks_solver input options/validation to include ppcg.
source/source_hsolver/test/diago_ppcg_test.cpp Add PPCG unit tests against LAPACK reference eigenvalues.
source/source_hsolver/test/diago_ppcg_bench.cpp Add PPCG CPU benchmark executable emitting CSV.
source/source_hsolver/test/diago_ppcg_bench_cuda.cpp Add PPCG CUDA benchmark executable emitting CSV.
source/source_hsolver/test/diago_openmp_consistency_test.cpp Add OpenMP thread-count consistency test for BPCG/Davidson.
source/source_hsolver/test/diago_mock.h Minor comment formatting adjustments in mock utilities.
source/source_hsolver/test/diago_david_bench.cpp Add Davidson benchmark executable emitting CSV.
source/source_hsolver/test/diago_bpcg_bench.cpp Add BPCG benchmark executable emitting CSV.
source/source_hsolver/test/CMakeLists.txt Register new tests/benchmarks and adjust link libraries.
source/source_hsolver/module_diag/diago_trace.h Add optional CSV convergence tracing utility.
source/source_hsolver/module_diag/diago_auto_selector.h Add heuristic solver recommendation/auto-select helper (env-controlled).
source/source_hsolver/module_diag/diag_orthogonalizer.h Introduce shared orthogonalization/projection kernels for solvers.
source/source_hsolver/kernels/bpcg_kernel_op.cpp Add OpenMP parallelization and restructure CPU kernels (BPCG-related).
source/source_hsolver/hsolver_pw.h Add static storage for PPCG “extra bands” across SCF steps.
source/source_hsolver/hsolver_pw.cpp Integrate PPCG path and DiagoAutoSelector into PW solver flow.
source/source_hsolver/hsolver_pw_sdft.cpp Extend supported solver list to include ppcg.
source/source_hsolver/diago_ppcg.h Add PPCG solver interface/controls (extra bands, block sizes).
source/source_hsolver/diago_ppcg.cpp Add PPCG solver implementation (CPU/GPU templates), with tracing hooks.
source/source_hsolver/diago_david.cpp Add OpenMP parallel regions + integrate DiagoTrace (Davidson).
source/source_hsolver/diago_dav_subspace.cpp Add OpenMP parallel regions in DavSubspace.
source/source_hsolver/diago_cg.cpp Use shared orthogonalizer utilities + integrate DiagoTrace (CG).
source/source_hsolver/diago_bpcg.cpp Use shared orthogonalizer utilities + integrate DiagoTrace (BPCG).
source/source_hsolver/CMakeLists.txt Add PPCG implementation source to HSolver build objects.
source/source_estate/elecstate_print.cpp Add ppcg short code for SCF iteration printing.
source/source_base/kernels/math_kernel_op_vec.cpp Add OpenMP threshold guards for vector kernels on CPU.
CMakeFiles/CMakeSystem.cmake New file added (appears to be generated build artifact).
cmake/FindLapack.cmake Switch to directly including CMake’s built-in FindLAPACK module.
cmake/FindBlas.cmake Switch to directly including CMake’s built-in FindBLAS module.
Comments suppressed due to low confidence (2)

cmake/FindBlas.cmake:12

  • Including CMake’s built-in FindBLAS module directly bypasses find_package(... REQUIRED) behavior; if BLAS is not found, configuration can proceed with BLAS_LIBRARIES empty, causing later link errors that are harder to diagnose. Add an explicit fatal check after the include to preserve REQUIRED semantics.
include("${CMAKE_ROOT}/Modules/FindBLAS.cmake")

if(NOT TARGET BLAS::BLAS)
    add_library(BLAS::BLAS UNKNOWN IMPORTED)
    set_target_properties(BLAS::BLAS PROPERTIES

source/source_hsolver/kernels/bpcg_kernel_op.cpp:291

  • precondition_op now allocates a std::vector of length dim inside the (potentially parallel) loop over notconv. This can cause significant allocation overhead and memory churn for large dim/notconv, reducing the benefit of OpenMP parallelization. Prefer a per-thread buffer allocated once per thread and reused across m.
#ifdef _OPENMP
#pragma omp parallel for schedule(static) if(notconv > 4)
#endif
        for (int m = 0; m < notconv; m++)
        {
            std::vector<Real> pre(dim, 0.0);
            for (size_t i = 0; i < dim; i++)
            {
                Real x = std::abs(precondition[i] - eigenvalues[m]);
                pre[i] = 0.5 * (1.0 + x + sqrt(1 + (x - 1.0) * (x - 1.0)));
            }
            ModuleBase::vector_div_vector_op<T, base_device::DEVICE_CPU>()(
                                                             dim,
                                                             psi_iter + (nbase + m) * dim,
                                                             psi_iter + (nbase + m) * dim,
                                                             pre.data());
        }

💡 Add Copilot custom instructions for smarter, more guided reviews. Learn how to get started.

Comment thread source/source_hsolver/diago_ppcg.cpp
Comment thread source/source_hsolver/diago_ppcg.cpp
Comment thread cmake/FindLapack.cmake
Comment thread CMakeFiles/CMakeSystem.cmake Outdated
Comment thread source/source_hsolver/hsolver_pw.cpp Outdated
Comment thread source/source_hsolver/hsolver_pw.cpp
Roux-sq and others added 11 commits June 25, 2026 23:11
Co-authored-by: Copilot Autofix powered by AI <175728472+Copilot@users.noreply.github.com>
…dson convergence flags

std::vector<bool> packs bits and is not thread-safe under concurrent
parallel writes from OpenMP, causing non-deterministic hangs/crashes
(e.g. 01_PW/035_PW_15_SO with many threads). Use std::vector<int>
for independent per-element writes in diago_david and diago_dav_subspace.
DiagOrthogonalizer::rotate_parallel copied workspace back to block
using the global band count (ncol), but both buffers are local to the
MPI band group and only contain plintrans.ncolB columns. With bndpar>1
this overran the buffer, corrupted heap metadata, and caused segfaults
or malloc_consolidate errors in BPCG-based SDFT runs such as
06_SDFT/12_PW_BPCG_SDFT_5D11S.

Use plintrans.ncolB for the syncmem copy size.
The original 1e-10 tolerance was stricter than necessary for this
regression test. The solvers are now configured with tight convergence
thresholds, so 1e-5 is sufficient to verify thread-count invariance
without being sensitive to benign floating-point reordering.
Many unit tests declared test data/scripts with install() so the
fixtures were only available after cmake --install, causing ctest to
report failures or "Not Run" right after cmake --build. Replace
install(FILES ...) and install(DIRECTORY ...) with configure_file(...
COPYONLY) or file(COPY ...) so the fixtures are copied into the build
tree during configuration.

This fixes the bulk of locally runnable unit tests; remaining failures
are due to missing optional dependencies (ELPA, LIBRI, MLALGO) or
pre-existing test issues unrelated to this PR.
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment

Projects

None yet

Development

Successfully merging this pull request may close these issues.

4 participants