[HSolver] PPCG Solver (CPU/CUDA), Parallel Optimization, and Diagonalization Refactor by Roux-sq · Pull Request #7517 · deepmodeling/abacus-develop

Roux-sq · 2026-06-25T14:57:18Z

Reminder

Have you linked an issue with this pull request?
Have you added adequate unit tests and/or case tests for your pull request?
Have you noticed possible changes of behavior below or in the linked issue?
Have you explained the changes of codes in core modules of ESolver, HSolver, ElecState, Hamilt, Operator or Psi? (ignore if not applicable)

Linked Issue

Fix #...

Unit Tests and/or Case Tests for my changes

A unit test is added for each new feature or bug fix.

What's changed?

Example: My changes might affect the performance of the application under certain conditions, and I have tested the impact on various scenarios...

Any changes of core modules? (ignore if not applicable)

Example: I have added a new virtual function in the esolver base class in order to ...

[HSolver] PPCG Solver (CPU/CUDA), Parallel Optimization, and Diagonalization Refactor

Overview

This PR introduces a systematic extension and refactor of the ABACUS Hamiltonian eigenvalue solver module. The main changes include:

New PPCG solver: implements the Projected Preconditioned Conjugate Gradient method with band-locking, Rayleigh-Ritz acceleration, extra buffer bands, and CPU/CUDA Device-path support.
OpenMP parallel optimization: adds band-level OpenMP parallel regions to iterative solvers such as BPCG, Davidson, and DavSubspace.
Shared orthogonalization refactor: adds DiagOrthogonalizer to unify duplicated orthogonalization logic across multiple solvers.
DiagoTrace convergence diagnostics: provides optional CSV tracing to help analyze the iteration process of different solvers.
DiagoAutoSelector solver recommendation: provides explainable solver recommendations based on problem size and runtime environment, without changing default behavior.
Test and benchmark support: adds PPCG tests, OpenMP consistency tests, DiagoTrace validation paths, and related benchmark targets.

By default, these changes preserve the original solve flow. The new diagnostics and recommendation features are enabled explicitly through environment variables.

The main changes are concentrated in source/source_hsolver/, including the new solver, shared orthogonalization utilities, diagnostic tools, and related tests/benchmarks.

Main Changes

PPCG Eigenvalue Solver

This PR adds the DiagoPPCG class, implementing the Projected Preconditioned Conjugate Gradient method. The implementation follows the existing ABACUS T, Device template and device abstraction system, supports the CPU path, and provides corresponding CUDA GPU support and benchmark targets.

Key features:

Band-locking: converged bands are locked to reduce drift and avoid repeated work in later iterations.
Rayleigh-Ritz acceleration: periodically performs subspace diagonalization to improve convergence.
Extra buffer bands: uses an additional 10% of bands as a convergence buffer to reduce the risk of stagnation.
MPI band-group distribution: adapts to the existing MPI band-group distribution; locked and unlocked bands are handled through their corresponding paths.
CPU/CUDA device paths: adapts CPU and CUDA GPU execution through the ABACUS device abstraction layer.
Subspace dimension strategy: uses a 2D subspace in the first iteration and a 3D subspace in later iterations.

The PPCG path can be selected through the INPUT solver setting.

OpenMP Parallel Optimization

Band-level OpenMP parallelism is added to multiple iterative solvers to improve efficiency in multi-band cases.

Solver	Parallelized regions
BPCG	`line_minimize_with_block_op`, `calc_grad_with_block_op`, `normalize_op`
Davidson	convergence check in `diag_once`, gradient calculation, preconditioning, refresh
DavSubspace	convergence check in `diag_once`, generalized eigenproblem assembly, refresh

The OpenMP regions use schedule(static) and are guarded by if(nband > threshold) conditions to avoid thread overhead on small problems.

Shared Orthogonalization Refactor

This PR adds DiagOrthogonalizer<T, Device> at:

source/source_hsolver/module_diag/diag_orthogonalizer.h

The class provides shared orthogonalization operations that repeatedly appear in different solvers:

Cholesky-based orthonormalization: cholesky_orth_parallel
Projection-based orthogonalization: project_out_parallel, overlap_with_metric, project_out_with_coeff
Schmidt orthogonalization with S metric: schmidt_orthogonalize_s_metric
Plain Gram-Schmidt orthogonalization and orthogonality checks

This refactor replaces part of the scattered hand-written orthogonalization logic in BPCG, CG, Davidson, and PPCG, reducing duplicated implementations and providing a more centralized maintenance point.

Davidson requires extra care: its orthogonalization logic is not an isolated step, but is coupled with dynamic subspace expansion, nbase/notconv state updates, new-direction generation in cal_grad, and the refresh restart process. Therefore, the Davidson path is refactored conservatively: only the safely reusable low-level orthogonalization operations are extracted, while the solver-specific iteration flow remains intact.

DiagoTrace Convergence Diagnostics

This PR adds a lightweight header-only CSV tracer:

source/source_hsolver/module_diag/diago_trace.h

It records convergence data for each iteration, helping with performance analysis and debugging.

Environment variables:

ABACUS_DIAGO_TRACE=1              # Enable tracing; rank 0 only by default
ABACUS_DIAGO_TRACE_FILE=<path>    # Custom output path
ABACUS_DIAGO_TRACE_ALL_RANKS=1    # Output one file per MPI rank

Output columns:

solver, rank, iter, nband, max_residual, avg_residual, n_converged, orth_error, note

Currently integrated into:

PPCG
BPCG
CG
Davidson

By default, DiagoTrace is fully disabled: it does not create files and does not change the solve flow. In MPI runs, only rank 0 writes CSV by default. When ABACUS_DIAGO_TRACE_ALL_RANKS=1 is enabled, each rank writes a separate file to avoid multiple processes writing to the same CSV file.

DiagoAutoSelector Solver Recommendation

This PR adds a header-only solver recommendation utility:

source/source_hsolver/module_diag/diago_auto_selector.h

It gives explainable solver recommendations based on problem features, including:

current method;
calculation type;
matrix/basis size;
number of bands;
number of MPI pool processes;
SCF iteration index;
device type.

Environment variables:

ABACUS_DIAGO_AUTO_REPORT=1    # Report recommendation only; do not change the actual solver
ABACUS_DIAGO_AUTO_SELECT=1    # Use the recommended solver to override the current method

The recommendation rules are roughly prioritized as follows:

GPU device -> bpcg: PPCG has CUDA-path support, but before sufficient benchmark data is available, AutoSelector conservatively recommends the more regular block-CG path for GPU cases.
NSCF calculation -> dav, prioritizing the more robust full-convergence path.
MPI + many bands -> bpcg, as block operations can better benefit from parallel execution.
Many bands -> ppcg, where projected/block updates may be beneficial.
Large PW subspace + non-initial SCF step -> dav_subspace.
Default case -> cg, as the conservative fallback.

AutoSelector is a heuristic recommendation system. It does not claim to provide the optimal solver for every system. Its main purpose is to provide explainable selection hints for benchmarks and future rule calibration.

Safety mechanisms:

ABACUS_DIAGO_AUTO_REPORT=1 reports recommendations only and does not change the solver.
Actual switching happens only when ABACUS_DIAGO_AUTO_SELECT=1 is set.
After the preconditioner has been set up, AutoSelector does not switch across the dav_subspace boundary, avoiding inconsistent preconditioner usage.

Benchmark and Test Infrastructure

This PR adds or extends the following test/benchmark targets:

Target	Description
`MODULE_HSOLVER_ppcg`	PPCG unit tests, including random Hermitian matrices and fixed-size matrices
`MODULE_HSOLVER_ppcg_bench`	PPCG performance benchmark
`MODULE_HSOLVER_bpcg_bench`	BPCG performance benchmark
`MODULE_HSOLVER_david_bench`	Davidson performance benchmark
`MODULE_HSOLVER_ppcg_bench_cuda`	PPCG benchmark target for CUDA environments
`MODULE_HSOLVER_openmp_consistency`	OpenMP consistency test covering BPCG and Davidson with different `OMP_NUM_THREADS` settings

Example test commands:

# PPCG unit test
ctest --test-dir build -V -R MODULE_HSOLVER_ppcg

# OpenMP consistency test
ctest --test-dir build -V -R MODULE_HSOLVER_openmp_consistency

DiagoTrace can be checked with:

export ABACUS_DIAGO_TRACE=1
export ABACUS_DIAGO_TRACE_FILE=diago_trace.csv

After running the relevant hsolver tests, check that diago_trace.csv is generated and contains the header:

solver,rank,iter,nband,max_residual,avg_residual,n_converged,orth_error,note

DiagoAutoSelector can first be validated in report-only mode:

export ABACUS_DIAGO_AUTO_REPORT=1
unset ABACUS_DIAGO_AUTO_SELECT

This mode reports the recommendation and reason only, without changing the actual solver. It is suitable as an observation tool before benchmarking.

Compatibility and Behavior Changes

By default, this PR does not change existing user behavior:

Existing solvers are still selected according to input parameters.
OpenMP regions are guarded by _OPENMP and band-count thresholds.
DiagoTrace is disabled by default.
DiagoAutoSelector is disabled by default.
ABACUS_DIAGO_AUTO_REPORT=1 reports recommendations only and does not change the solver.
The recommended solver is used only when ABACUS_DIAGO_AUTO_SELECT=1 is explicitly set.

Therefore, the new capabilities are added incrementally. PW solver selection may change only when auto-selection mode is explicitly enabled.

Breaking Changes

No default breaking changes.

Notes:

ABACUS_DIAGO_AUTO_SELECT=1 is explicit opt-in behavior and may change the selected PW solver.
DiagoTrace generates CSV files when enabled; it produces no extra output by default.
AutoSelector is currently a heuristic rule system and can be further calibrated with more benchmark data.

Follow-Up Work

Potential follow-up work includes:

Calibrating AutoSelector rules with more real-system benchmark data.
Recovering performance for Davidson paths affected by the orthogonalization refactor.
Extending DiagoTrace statistics, for example by outputting globally reduced residuals.
Running systematic benchmarks for PPCG, BPCG, and Davidson across different band counts and parallel scales.

…feat: add 4 bigger matrixs to test ppcg algorithm and update CMakeLists.txt

… the efficiency of previous ppcg and new ppcg

…p and calc_grad_with_block_op Refactor the outer band loops to use coarse-grained #pragma omp parallel with separate work-sharing for/if directives. Each band's MPI reductions (Parallel_Reduce::reduce_pool) are collected into per-band arrays and performed serially inside #pragma omp single barriers, eliminating the need for MPI_THREAD_MULTIPLE and nested parallelism. Key changes: - line_minimize_with_block_op: 5-step parallel pipeline (norm, reduce, normalize+epsilo, reduce, update) with n_band > 4 guard. - calc_grad_with_block_op: 7-step parallel pipeline (norm, reduce, normalize+epsilo, reduce, err+beta, reduce, update) with n_band > 4 guard. - Replace BlasConnector::dot with manual std::norm accumulation to avoid thread-safety issues with BLAS dot inside OpenMP loops.

Replace per-band dot_real_op + vector_div_constant_op calls with a 3-step parallel pipeline: (1) parallel norm accumulation, (2) serial MPI reduce, (3) parallel division. This avoids repeated BLAS1 calls and nested threading from vector_div_constant_op's internal parallel. Uses if(notconv > 4) guard.

- math_kernel_op_vec.cpp: add if(dim > 256) guards to all OpenMP parallel loops (vector_mul_real_op, vector_mul_vector_op, vector_div_constant_op, vector_div_vector_op, vector_add_vector_op) to avoid thread-spawn overhead on small arrays. - diago_openmp_consistency_test.cpp: new gtest suite verifying that BPCG and Davidson produce bitwise-identical eigenvalues across OMP_NUM_THREADS={1,2,4}. - CMakeLists.txt: add MODULE_HSOLVER_openmp_consistency target.

…openmp-orth-ppcg-v2

…g-v2

Copilot

Pull request overview

This pull request extends and refactors ABACUS’s Hamiltonian diagonalization (HSolver) stack by adding a new PPCG solver (CPU/CUDA), introducing band-level OpenMP parallelism in iterative solvers, consolidating shared orthogonalization logic, and adding optional diagnostics (CSV tracing) plus an auto-selection helper for solver recommendation/override.

Changes:

Add PPCG solver implementation and integrate it into PW HSolver (plus input/printing support).
Add OpenMP parallel regions across BPCG/Davidson/DavSubspace and refactor orthogonalization helpers into DiagOrthogonalizer.
Add diagnostics/tools and test/benchmark targets (PPCG tests/bench, OpenMP consistency test, DiagoTrace, DiagoAutoSelector).

Reviewed changes

Copilot reviewed 27 out of 28 changed files in this pull request and generated 6 comments.

Show a summary per file

File	Description
source/source_io/module_parameter/read_input_item_elec_stru.cpp	Extend `ks_solver` input options/validation to include `ppcg`.
source/source_hsolver/test/diago_ppcg_test.cpp	Add PPCG unit tests against LAPACK reference eigenvalues.
source/source_hsolver/test/diago_ppcg_bench.cpp	Add PPCG CPU benchmark executable emitting CSV.
source/source_hsolver/test/diago_ppcg_bench_cuda.cpp	Add PPCG CUDA benchmark executable emitting CSV.
source/source_hsolver/test/diago_openmp_consistency_test.cpp	Add OpenMP thread-count consistency test for BPCG/Davidson.
source/source_hsolver/test/diago_mock.h	Minor comment formatting adjustments in mock utilities.
source/source_hsolver/test/diago_david_bench.cpp	Add Davidson benchmark executable emitting CSV.
source/source_hsolver/test/diago_bpcg_bench.cpp	Add BPCG benchmark executable emitting CSV.
source/source_hsolver/test/CMakeLists.txt	Register new tests/benchmarks and adjust link libraries.
source/source_hsolver/module_diag/diago_trace.h	Add optional CSV convergence tracing utility.
source/source_hsolver/module_diag/diago_auto_selector.h	Add heuristic solver recommendation/auto-select helper (env-controlled).
source/source_hsolver/module_diag/diag_orthogonalizer.h	Introduce shared orthogonalization/projection kernels for solvers.
source/source_hsolver/kernels/bpcg_kernel_op.cpp	Add OpenMP parallelization and restructure CPU kernels (BPCG-related).
source/source_hsolver/hsolver_pw.h	Add static storage for PPCG “extra bands” across SCF steps.
source/source_hsolver/hsolver_pw.cpp	Integrate PPCG path and DiagoAutoSelector into PW solver flow.
source/source_hsolver/hsolver_pw_sdft.cpp	Extend supported solver list to include `ppcg`.
source/source_hsolver/diago_ppcg.h	Add PPCG solver interface/controls (extra bands, block sizes).
source/source_hsolver/diago_ppcg.cpp	Add PPCG solver implementation (CPU/GPU templates), with tracing hooks.
source/source_hsolver/diago_david.cpp	Add OpenMP parallel regions + integrate DiagoTrace (Davidson).
source/source_hsolver/diago_dav_subspace.cpp	Add OpenMP parallel regions in DavSubspace.
source/source_hsolver/diago_cg.cpp	Use shared orthogonalizer utilities + integrate DiagoTrace (CG).
source/source_hsolver/diago_bpcg.cpp	Use shared orthogonalizer utilities + integrate DiagoTrace (BPCG).
source/source_hsolver/CMakeLists.txt	Add PPCG implementation source to HSolver build objects.
source/source_estate/elecstate_print.cpp	Add `ppcg` short code for SCF iteration printing.
source/source_base/kernels/math_kernel_op_vec.cpp	Add OpenMP threshold guards for vector kernels on CPU.
CMakeFiles/CMakeSystem.cmake	New file added (appears to be generated build artifact).
cmake/FindLapack.cmake	Switch to directly including CMake’s built-in FindLAPACK module.
cmake/FindBlas.cmake	Switch to directly including CMake’s built-in FindBLAS module.

Comments suppressed due to low confidence (2)

cmake/FindBlas.cmake:12

Including CMake’s built-in FindBLAS module directly bypasses find_package(... REQUIRED) behavior; if BLAS is not found, configuration can proceed with BLAS_LIBRARIES empty, causing later link errors that are harder to diagnose. Add an explicit fatal check after the include to preserve REQUIRED semantics.

include("${CMAKE_ROOT}/Modules/FindBLAS.cmake")

if(NOT TARGET BLAS::BLAS)
    add_library(BLAS::BLAS UNKNOWN IMPORTED)
    set_target_properties(BLAS::BLAS PROPERTIES

source/source_hsolver/kernels/bpcg_kernel_op.cpp:291

precondition_op now allocates a std::vector of length dim inside the (potentially parallel) loop over notconv. This can cause significant allocation overhead and memory churn for large dim/notconv, reducing the benefit of OpenMP parallelization. Prefer a per-thread buffer allocated once per thread and reused across m.

#ifdef _OPENMP
#pragma omp parallel for schedule(static) if(notconv > 4)
#endif
        for (int m = 0; m < notconv; m++)
        {
            std::vector<Real> pre(dim, 0.0);
            for (size_t i = 0; i < dim; i++)
            {
                Real x = std::abs(precondition[i] - eigenvalues[m]);
                pre[i] = 0.5 * (1.0 + x + sqrt(1 + (x - 1.0) * (x - 1.0)));
            }
            ModuleBase::vector_div_vector_op<T, base_device::DEVICE_CPU>()(
                                                             dim,
                                                             psi_iter + (nbase + m) * dim,
                                                             psi_iter + (nbase + m) * dim,
                                                             pre.data());
        }

💡 Add Copilot custom instructions for smarter, more guided reviews. Learn how to get started.

Co-authored-by: Copilot Autofix powered by AI <175728472+Copilot@users.noreply.github.com>

…dson convergence flags std::vector<bool> packs bits and is not thread-safe under concurrent parallel writes from OpenMP, causing non-deterministic hangs/crashes (e.g. 01_PW/035_PW_15_SO with many threads). Use std::vector<int> for independent per-element writes in diago_david and diago_dav_subspace.

DiagOrthogonalizer::rotate_parallel copied workspace back to block using the global band count (ncol), but both buffers are local to the MPI band group and only contain plintrans.ncolB columns. With bndpar>1 this overran the buffer, corrupted heap metadata, and caused segfaults or malloc_consolidate errors in BPCG-based SDFT runs such as 06_SDFT/12_PW_BPCG_SDFT_5D11S. Use plintrans.ncolB for the syncmem copy size.

The original 1e-10 tolerance was stricter than necessary for this regression test. The solvers are now configured with tight convergence thresholds, so 1e-5 is sufficient to verify thread-count invariance without being sensitive to benign floating-point reordering.

Many unit tests declared test data/scripts with install() so the fixtures were only available after cmake --install, causing ctest to report failures or "Not Run" right after cmake --build. Replace install(FILES ...) and install(DIRECTORY ...) with configure_file(... COPYONLY) or file(COPY ...) so the fixtures are copied into the build tree during configuration. This fixes the bulk of locally runnable unit tests; remaining failures are due to missing optional dependencies (ELPA, LIBRI, MLALGO) or pre-existing test issues unrelated to this PR.

collapsar-z and others added 30 commits May 15, 2026 16:17

加入了PPCG的实现

b69e1e9

fix ppcg and pass tests

2d51b95

fix: bugs in diago_ppcg_test.cpp; delete diago_ppcg_simple_test.cpp; …

0ac427a

…feat: add 4 bigger matrixs to test ppcg algorithm and update CMakeLists.txt

feat: add some methods to faster ppcg algorithm, add tests to compare…

9a1618d

… the efficiency of previous ppcg and new ppcg

fix: bugs in compare bash

2348988

remove benchmark dir

e8f3406

remove benchmark dir

205516f

add annotation to ppcg code, change ppcg code to faster it

0182cc8

Add OpenMP parallelization to bpcg, davidson, dav_subspace kernels

90ea6f6

fix some bugs in ppcg and tried to faster the algo

58b3a95

add gpu

542bb4d

add gpu

66f4f85

WIP: 本地修改

f4ecedf

made ppcg FASTER

1822b95

Refactor hsolver orthogonalization kernels

7019422

add bench.cpp

17e6880

perf: restore batch gemm and planSchmidtOrth in Davidson

5756596

try to fix ppcg

ff49bd6

Merge branch 'origin/feat/hsolver-orth-refactor-opt' into feat/merge-…

c549a85

…openmp-orth-ppcg-v2

Merge branch 'origin/feat/openmp_opt' into feat/merge-openmp-orth-ppc…

5732c50

…g-v2

Merge branch 'origin/fix/ppcg-v2' into feat/merge-openmp-orth-ppcg-v2

a61a9a6

fix: remove duplicate benchmark targets after merging openmp_opt

f3e2e0b

remove md files

2960874

fix MPI benchmark

348359f

add more MPI for ppcg

fd4b61e

review all the changes, clear redundant part

0eae506

merge all the feats

467864c

Copilot AI review requested due to automatic review settings June 25, 2026 14:57

Copilot started reviewing on behalf of Roux-sq June 25, 2026 14:57 View session

Copilot AI reviewed Jun 25, 2026

View reviewed changes

Roux-sq and others added 11 commits June 25, 2026 23:11

Potential fix for pull request finding

30c8e17

Co-authored-by: Copilot Autofix powered by AI <175728472+Copilot@users.noreply.github.com>

Merge branch 'develop' into diago_final

2aeb87f

fix bugs suggested by copilot

4746593

fix memory_recorder.h reference

d8a881f

改动xj

d4906c2

改动xj

d85547e

改动xj

d27e745

mohanchen added the project_learning label Jun 27, 2026

Provide feedback

Saved searches

Use saved searches to filter your results more quickly

Uh oh!

[HSolver] PPCG Solver (CPU/CUDA), Parallel Optimization, and Diagonalization Refactor#7517

[HSolver] PPCG Solver (CPU/CUDA), Parallel Optimization, and Diagonalization Refactor#7517
Roux-sq wants to merge 41 commits into
deepmodeling:developfrom
collapsar-z:diago_final

Roux-sq commented Jun 25, 2026

Uh oh!

Copilot AI left a comment

Uh oh!

Uh oh!

Uh oh!

Uh oh!

Uh oh!

Uh oh!

Uh oh!

Reviewers

Assignees

Labels

Projects

Milestone

Development

Uh oh!

4 participants

Uh oh!

Conversation

Roux-sq commented Jun 25, 2026

Reminder

Linked Issue

Unit Tests and/or Case Tests for my changes

What's changed?

Any changes of core modules? (ignore if not applicable)

[HSolver] PPCG Solver (CPU/CUDA), Parallel Optimization, and Diagonalization Refactor

Overview

Main Changes

PPCG Eigenvalue Solver

OpenMP Parallel Optimization

Shared Orthogonalization Refactor

DiagoTrace Convergence Diagnostics

DiagoAutoSelector Solver Recommendation

Benchmark and Test Infrastructure

Compatibility and Behavior Changes

Breaking Changes

Follow-Up Work

Uh oh!

Copilot AI left a comment

Choose a reason for hiding this comment

Pull request overview

Reviewed changes

Uh oh!

Uh oh!

Uh oh!

Uh oh!

Uh oh!

Uh oh!

Uh oh!

Reviewers

Assignees

Labels

Projects

Milestone

Development

Uh oh!

4 participants