[HSolver] PPCG Solver (CPU/CUDA), Parallel Optimization, and Diagonalization Refactor#7517
Open
Roux-sq wants to merge 41 commits into
Open
[HSolver] PPCG Solver (CPU/CUDA), Parallel Optimization, and Diagonalization Refactor#7517Roux-sq wants to merge 41 commits into
Roux-sq wants to merge 41 commits into
Conversation
…feat: add 4 bigger matrixs to test ppcg algorithm and update CMakeLists.txt
… the efficiency of previous ppcg and new ppcg
…p and calc_grad_with_block_op Refactor the outer band loops to use coarse-grained #pragma omp parallel with separate work-sharing for/if directives. Each band's MPI reductions (Parallel_Reduce::reduce_pool) are collected into per-band arrays and performed serially inside #pragma omp single barriers, eliminating the need for MPI_THREAD_MULTIPLE and nested parallelism. Key changes: - line_minimize_with_block_op: 5-step parallel pipeline (norm, reduce, normalize+epsilo, reduce, update) with n_band > 4 guard. - calc_grad_with_block_op: 7-step parallel pipeline (norm, reduce, normalize+epsilo, reduce, err+beta, reduce, update) with n_band > 4 guard. - Replace BlasConnector::dot with manual std::norm accumulation to avoid thread-safety issues with BLAS dot inside OpenMP loops.
Replace per-band dot_real_op + vector_div_constant_op calls with a 3-step parallel pipeline: (1) parallel norm accumulation, (2) serial MPI reduce, (3) parallel division. This avoids repeated BLAS1 calls and nested threading from vector_div_constant_op's internal parallel. Uses if(notconv > 4) guard.
- math_kernel_op_vec.cpp: add if(dim > 256) guards to all OpenMP
parallel loops (vector_mul_real_op, vector_mul_vector_op,
vector_div_constant_op, vector_div_vector_op, vector_add_vector_op)
to avoid thread-spawn overhead on small arrays.
- diago_openmp_consistency_test.cpp: new gtest suite verifying that
BPCG and Davidson produce bitwise-identical eigenvalues across
OMP_NUM_THREADS={1,2,4}.
- CMakeLists.txt: add MODULE_HSOLVER_openmp_consistency target.
…openmp-orth-ppcg-v2
There was a problem hiding this comment.
Pull request overview
This pull request extends and refactors ABACUS’s Hamiltonian diagonalization (HSolver) stack by adding a new PPCG solver (CPU/CUDA), introducing band-level OpenMP parallelism in iterative solvers, consolidating shared orthogonalization logic, and adding optional diagnostics (CSV tracing) plus an auto-selection helper for solver recommendation/override.
Changes:
- Add PPCG solver implementation and integrate it into PW HSolver (plus input/printing support).
- Add OpenMP parallel regions across BPCG/Davidson/DavSubspace and refactor orthogonalization helpers into
DiagOrthogonalizer. - Add diagnostics/tools and test/benchmark targets (PPCG tests/bench, OpenMP consistency test, DiagoTrace, DiagoAutoSelector).
Reviewed changes
Copilot reviewed 27 out of 28 changed files in this pull request and generated 6 comments.
Show a summary per file
| File | Description |
|---|---|
| source/source_io/module_parameter/read_input_item_elec_stru.cpp | Extend ks_solver input options/validation to include ppcg. |
| source/source_hsolver/test/diago_ppcg_test.cpp | Add PPCG unit tests against LAPACK reference eigenvalues. |
| source/source_hsolver/test/diago_ppcg_bench.cpp | Add PPCG CPU benchmark executable emitting CSV. |
| source/source_hsolver/test/diago_ppcg_bench_cuda.cpp | Add PPCG CUDA benchmark executable emitting CSV. |
| source/source_hsolver/test/diago_openmp_consistency_test.cpp | Add OpenMP thread-count consistency test for BPCG/Davidson. |
| source/source_hsolver/test/diago_mock.h | Minor comment formatting adjustments in mock utilities. |
| source/source_hsolver/test/diago_david_bench.cpp | Add Davidson benchmark executable emitting CSV. |
| source/source_hsolver/test/diago_bpcg_bench.cpp | Add BPCG benchmark executable emitting CSV. |
| source/source_hsolver/test/CMakeLists.txt | Register new tests/benchmarks and adjust link libraries. |
| source/source_hsolver/module_diag/diago_trace.h | Add optional CSV convergence tracing utility. |
| source/source_hsolver/module_diag/diago_auto_selector.h | Add heuristic solver recommendation/auto-select helper (env-controlled). |
| source/source_hsolver/module_diag/diag_orthogonalizer.h | Introduce shared orthogonalization/projection kernels for solvers. |
| source/source_hsolver/kernels/bpcg_kernel_op.cpp | Add OpenMP parallelization and restructure CPU kernels (BPCG-related). |
| source/source_hsolver/hsolver_pw.h | Add static storage for PPCG “extra bands” across SCF steps. |
| source/source_hsolver/hsolver_pw.cpp | Integrate PPCG path and DiagoAutoSelector into PW solver flow. |
| source/source_hsolver/hsolver_pw_sdft.cpp | Extend supported solver list to include ppcg. |
| source/source_hsolver/diago_ppcg.h | Add PPCG solver interface/controls (extra bands, block sizes). |
| source/source_hsolver/diago_ppcg.cpp | Add PPCG solver implementation (CPU/GPU templates), with tracing hooks. |
| source/source_hsolver/diago_david.cpp | Add OpenMP parallel regions + integrate DiagoTrace (Davidson). |
| source/source_hsolver/diago_dav_subspace.cpp | Add OpenMP parallel regions in DavSubspace. |
| source/source_hsolver/diago_cg.cpp | Use shared orthogonalizer utilities + integrate DiagoTrace (CG). |
| source/source_hsolver/diago_bpcg.cpp | Use shared orthogonalizer utilities + integrate DiagoTrace (BPCG). |
| source/source_hsolver/CMakeLists.txt | Add PPCG implementation source to HSolver build objects. |
| source/source_estate/elecstate_print.cpp | Add ppcg short code for SCF iteration printing. |
| source/source_base/kernels/math_kernel_op_vec.cpp | Add OpenMP threshold guards for vector kernels on CPU. |
| CMakeFiles/CMakeSystem.cmake | New file added (appears to be generated build artifact). |
| cmake/FindLapack.cmake | Switch to directly including CMake’s built-in FindLAPACK module. |
| cmake/FindBlas.cmake | Switch to directly including CMake’s built-in FindBLAS module. |
Comments suppressed due to low confidence (2)
cmake/FindBlas.cmake:12
- Including CMake’s built-in FindBLAS module directly bypasses find_package(... REQUIRED) behavior; if BLAS is not found, configuration can proceed with BLAS_LIBRARIES empty, causing later link errors that are harder to diagnose. Add an explicit fatal check after the include to preserve REQUIRED semantics.
include("${CMAKE_ROOT}/Modules/FindBLAS.cmake")
if(NOT TARGET BLAS::BLAS)
add_library(BLAS::BLAS UNKNOWN IMPORTED)
set_target_properties(BLAS::BLAS PROPERTIES
source/source_hsolver/kernels/bpcg_kernel_op.cpp:291
- precondition_op now allocates a std::vector of length dim inside the (potentially parallel) loop over notconv. This can cause significant allocation overhead and memory churn for large dim/notconv, reducing the benefit of OpenMP parallelization. Prefer a per-thread buffer allocated once per thread and reused across m.
#ifdef _OPENMP
#pragma omp parallel for schedule(static) if(notconv > 4)
#endif
for (int m = 0; m < notconv; m++)
{
std::vector<Real> pre(dim, 0.0);
for (size_t i = 0; i < dim; i++)
{
Real x = std::abs(precondition[i] - eigenvalues[m]);
pre[i] = 0.5 * (1.0 + x + sqrt(1 + (x - 1.0) * (x - 1.0)));
}
ModuleBase::vector_div_vector_op<T, base_device::DEVICE_CPU>()(
dim,
psi_iter + (nbase + m) * dim,
psi_iter + (nbase + m) * dim,
pre.data());
}
💡 Add Copilot custom instructions for smarter, more guided reviews. Learn how to get started.
Co-authored-by: Copilot Autofix powered by AI <175728472+Copilot@users.noreply.github.com>
…dson convergence flags std::vector<bool> packs bits and is not thread-safe under concurrent parallel writes from OpenMP, causing non-deterministic hangs/crashes (e.g. 01_PW/035_PW_15_SO with many threads). Use std::vector<int> for independent per-element writes in diago_david and diago_dav_subspace.
DiagOrthogonalizer::rotate_parallel copied workspace back to block using the global band count (ncol), but both buffers are local to the MPI band group and only contain plintrans.ncolB columns. With bndpar>1 this overran the buffer, corrupted heap metadata, and caused segfaults or malloc_consolidate errors in BPCG-based SDFT runs such as 06_SDFT/12_PW_BPCG_SDFT_5D11S. Use plintrans.ncolB for the syncmem copy size.
The original 1e-10 tolerance was stricter than necessary for this regression test. The solvers are now configured with tight convergence thresholds, so 1e-5 is sufficient to verify thread-count invariance without being sensitive to benign floating-point reordering.
Many unit tests declared test data/scripts with install() so the fixtures were only available after cmake --install, causing ctest to report failures or "Not Run" right after cmake --build. Replace install(FILES ...) and install(DIRECTORY ...) with configure_file(... COPYONLY) or file(COPY ...) so the fixtures are copied into the build tree during configuration. This fixes the bulk of locally runnable unit tests; remaining failures are due to missing optional dependencies (ELPA, LIBRI, MLALGO) or pre-existing test issues unrelated to this PR.
This file contains hidden or bidirectional Unicode text that may be interpreted or compiled differently than what appears below. To review, open the file in an editor that reveals hidden Unicode characters.
Learn more about bidirectional Unicode characters
Sign up for free
to join this conversation on GitHub.
Already have an account?
Sign in to comment
Add this suggestion to a batch that can be applied as a single commit.This suggestion is invalid because no changes were made to the code.Suggestions cannot be applied while the pull request is closed.Suggestions cannot be applied while viewing a subset of changes.Only one suggestion per line can be applied in a batch.Add this suggestion to a batch that can be applied as a single commit.Applying suggestions on deleted lines is not supported.You must change the existing code in this line in order to create a valid suggestion.Outdated suggestions cannot be applied.This suggestion has been applied or marked resolved.Suggestions cannot be applied from pending reviews.Suggestions cannot be applied on multi-line comments.Suggestions cannot be applied while the pull request is queued to merge.Suggestion cannot be applied right now. Please check back later.
Reminder
Linked Issue
Fix #...
Unit Tests and/or Case Tests for my changes
What's changed?
Any changes of core modules? (ignore if not applicable)
[HSolver] PPCG Solver (CPU/CUDA), Parallel Optimization, and Diagonalization Refactor
Overview
This PR introduces a systematic extension and refactor of the ABACUS Hamiltonian eigenvalue solver module. The main changes include:
DiagOrthogonalizerto unify duplicated orthogonalization logic across multiple solvers.By default, these changes preserve the original solve flow. The new diagnostics and recommendation features are enabled explicitly through environment variables.
The main changes are concentrated in
source/source_hsolver/, including the new solver, shared orthogonalization utilities, diagnostic tools, and related tests/benchmarks.Main Changes
PPCG Eigenvalue Solver
This PR adds the
DiagoPPCGclass, implementing the Projected Preconditioned Conjugate Gradient method. The implementation follows the existing ABACUST, Devicetemplate and device abstraction system, supports the CPU path, and provides corresponding CUDA GPU support and benchmark targets.Key features:
The PPCG path can be selected through the INPUT solver setting.
OpenMP Parallel Optimization
Band-level OpenMP parallelism is added to multiple iterative solvers to improve efficiency in multi-band cases.
line_minimize_with_block_op,calc_grad_with_block_op,normalize_opdiag_once, gradient calculation, preconditioning, refreshdiag_once, generalized eigenproblem assembly, refreshThe OpenMP regions use
schedule(static)and are guarded byif(nband > threshold)conditions to avoid thread overhead on small problems.Shared Orthogonalization Refactor
This PR adds
DiagOrthogonalizer<T, Device>at:The class provides shared orthogonalization operations that repeatedly appear in different solvers:
cholesky_orth_parallelproject_out_parallel,overlap_with_metric,project_out_with_coeffschmidt_orthogonalize_s_metricThis refactor replaces part of the scattered hand-written orthogonalization logic in BPCG, CG, Davidson, and PPCG, reducing duplicated implementations and providing a more centralized maintenance point.
Davidson requires extra care: its orthogonalization logic is not an isolated step, but is coupled with dynamic subspace expansion,
nbase/notconvstate updates, new-direction generation incal_grad, and therefreshrestart process. Therefore, the Davidson path is refactored conservatively: only the safely reusable low-level orthogonalization operations are extracted, while the solver-specific iteration flow remains intact.DiagoTrace Convergence Diagnostics
This PR adds a lightweight header-only CSV tracer:
It records convergence data for each iteration, helping with performance analysis and debugging.
Environment variables:
Output columns:
Currently integrated into:
By default, DiagoTrace is fully disabled: it does not create files and does not change the solve flow. In MPI runs, only rank 0 writes CSV by default. When
ABACUS_DIAGO_TRACE_ALL_RANKS=1is enabled, each rank writes a separate file to avoid multiple processes writing to the same CSV file.DiagoAutoSelector Solver Recommendation
This PR adds a header-only solver recommendation utility:
It gives explainable solver recommendations based on problem features, including:
Environment variables:
The recommendation rules are roughly prioritized as follows:
bpcg: PPCG has CUDA-path support, but before sufficient benchmark data is available, AutoSelector conservatively recommends the more regular block-CG path for GPU cases.dav, prioritizing the more robust full-convergence path.bpcg, as block operations can better benefit from parallel execution.ppcg, where projected/block updates may be beneficial.dav_subspace.cg, as the conservative fallback.AutoSelector is a heuristic recommendation system. It does not claim to provide the optimal solver for every system. Its main purpose is to provide explainable selection hints for benchmarks and future rule calibration.
Safety mechanisms:
ABACUS_DIAGO_AUTO_REPORT=1reports recommendations only and does not change the solver.ABACUS_DIAGO_AUTO_SELECT=1is set.dav_subspaceboundary, avoiding inconsistent preconditioner usage.Benchmark and Test Infrastructure
This PR adds or extends the following test/benchmark targets:
MODULE_HSOLVER_ppcgMODULE_HSOLVER_ppcg_benchMODULE_HSOLVER_bpcg_benchMODULE_HSOLVER_david_benchMODULE_HSOLVER_ppcg_bench_cudaMODULE_HSOLVER_openmp_consistencyOMP_NUM_THREADSsettingsExample test commands:
DiagoTrace can be checked with:
After running the relevant hsolver tests, check that
diago_trace.csvis generated and contains the header:DiagoAutoSelector can first be validated in report-only mode:
This mode reports the recommendation and reason only, without changing the actual solver. It is suitable as an observation tool before benchmarking.
Compatibility and Behavior Changes
By default, this PR does not change existing user behavior:
_OPENMPand band-count thresholds.ABACUS_DIAGO_AUTO_REPORT=1reports recommendations only and does not change the solver.ABACUS_DIAGO_AUTO_SELECT=1is explicitly set.Therefore, the new capabilities are added incrementally. PW solver selection may change only when auto-selection mode is explicitly enabled.
Breaking Changes
No default breaking changes.
Notes:
ABACUS_DIAGO_AUTO_SELECT=1is explicit opt-in behavior and may change the selected PW solver.Follow-Up Work
Potential follow-up work includes: