perf/Optimize neighbor search setup #7528
Pull request overview
This PR optimizes the MD neighbor-search setup in NeighborSearch by making periodic image generation and inside/ghost classification more cache-friendly and optionally parallel for large workloads, while aiming to preserve the existing output ordering semantics.
Changes:
- Precomputes “base cell” atoms once (type/local order) and reuses them to fill periodic images deterministically by image index.
- Adds an OpenMP-enabled setup path for large generated-atom counts, and an OpenMP-enabled inside/ghost classification path with ordered sequential write-back.
- Adds `ModuleBase::timer` instrumentation to `set_member_variables`, `init`, and `build_neighbors`.
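
The image fill described in the first two bullets can be sketched as follows. This is a minimal illustration, not the PR's code: `BaseAtom`, `ImageAtom`, and `fill_images` are hypothetical names, the cell is assumed orthorhombic with edge lengths `lx`, `ly`, `lz`, and image shifts are assumed to start at zero. The point it shows is that when each image writes only to its own slot range, the output order is fixed by the image index and not by thread scheduling.

```cpp
#include <cassert>
#include <cstddef>
#include <vector>

// Hypothetical stand-in types; not the PR's actual data structures.
struct BaseAtom { double x, y, z; int type; int local; };
struct ImageAtom { double x, y, z; int type; int local; int atom_id; };

// Fill every periodic image from the precomputed base-cell atoms.
// Image img owns the slot range [img * n, (img + 1) * n), so the result
// does not depend on which thread handles which image.
std::vector<ImageAtom> fill_images(const std::vector<BaseAtom>& base,
                                   int nx, int ny, int nz,
                                   double lx, double ly, double lz)
{
    assert(nx > 0 && ny > 0 && nz > 0);
    const std::size_t n = base.size();
    const long long nimages = 1LL * nx * ny * nz;
    std::vector<ImageAtom> out(static_cast<std::size_t>(nimages) * n);

#ifdef _OPENMP
    #pragma omp parallel for schedule(static)
#endif
    for (long long img = 0; img < nimages; ++img)
    {
        // Decode the image index into integer cell shifts (x varies slowest).
        const int ix = static_cast<int>(img / (1LL * ny * nz));
        const int iy = static_cast<int>((img / nz) % ny);
        const int iz = static_cast<int>(img % nz);
        for (std::size_t i = 0; i < n; ++i)
        {
            const std::size_t slot = static_cast<std::size_t>(img) * n + i;
            // atom_id is the linear index into the full image-major array.
            out[slot] = {base[i].x + ix * lx, base[i].y + iy * ly, base[i].z + iz * lz,
                         base[i].type, base[i].local, static_cast<int>(slot)};
        }
    }
    return out;
}
```

Because the base atoms are sorted once (type, then local order) before this loop, the image-major / type / local-atom order falls out of the slot arithmetic without any post-sort.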
    const int atoms_per_image = static_cast<int>(base_atoms.size());
    const int nimage_x = glayerX_ + glayerX_minus_;
    const int nimage_y = glayerY_ + glayerY_minus_;
    const int nimage_z = glayerZ_ + glayerZ_minus_;
    const int nimages = nimage_x * nimage_y * nimage_z;
    const int total_atoms = nimages * atoms_per_image;
Fixed in c40ee17. I updated the setup count calculations to use checked long long intermediates and explicitly validate that nimage_x/y/z, nimages, and total_atoms fit in int before they are used for indexing or vector sizing. If the generated neighbor search atom count exceeds the supported int indexing range, the code now exits via ModuleBase::WARNING_QUIT instead of overflowing silently.
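
A sketch of the kind of check this reply describes, under stated assumptions: `int` is 32 bits and `long long` is 64 bits, the helper names are hypothetical, and a thrown `std::overflow_error` stands in for the `ModuleBase::WARNING_QUIT` exit so that the example stays self-contained. It is not the code from the commit.

```cpp
#include <cassert>
#include <climits>
#include <stdexcept>

// Multiply two non-negative counts using a long long intermediate and
// reject any result that does not fit in int. With 32-bit int inputs the
// 64-bit product itself cannot overflow.
int checked_mul_to_int(int a, int b, const char* what)
{
    if (a < 0 || b < 0) { throw std::overflow_error(what); }
    const long long wide = static_cast<long long>(a) * static_cast<long long>(b);
    if (wide > INT_MAX) { throw std::overflow_error(what); }
    return static_cast<int>(wide);
}

// Adding two layer counts also goes through a wide intermediate.
int checked_add_to_int(int a, int b, const char* what)
{
    const long long wide = static_cast<long long>(a) + static_cast<long long>(b);
    if (wide < 0 || wide > INT_MAX) { throw std::overflow_error(what); }
    return static_cast<int>(wide);
}

// Validate each intermediate count before it is used for indexing or sizing.
int total_generated_atoms(int gx, int gx_minus, int gy, int gy_minus,
                          int gz, int gz_minus, int atoms_per_image)
{
    const int nimage_x = checked_add_to_int(gx, gx_minus, "nimage_x");
    const int nimage_y = checked_add_to_int(gy, gy_minus, "nimage_y");
    const int nimage_z = checked_add_to_int(gz, gz_minus, "nimage_z");
    const int nxy = checked_mul_to_int(nimage_x, nimage_y, "nimages");
    const int nimages = checked_mul_to_int(nxy, nimage_z, "nimages");
    return checked_mul_to_int(nimages, atoms_per_image, "total_atoms");
}
```

For example, three images per axis with 2048 atoms per image gives 27 * 2048 = 55296 atoms, while 2000 images per axis is rejected at the `nimages` step.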
Summary
This PR optimizes the MD neighbor-search setup path in `NeighborSearch`. The change keeps the existing ordering semantics while reducing setup overhead for larger systems.
A conservative threshold is used so smaller systems keep the serial setup path.
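
One way to combine a size threshold with an order-preserving parallel classification is sketched below. The threshold value, the `Point` type, the `classify` function, and the half-open box test are all assumptions made for illustration; the PR's actual threshold and inside/ghost criterion are not given here. The flags are computed in a loop that may run in parallel, and the output lists are then filled sequentially so their order matches a plain serial scan.

```cpp
#include <cassert>
#include <cstddef>
#include <vector>

struct Point { double x, y, z; };  // hypothetical atom position type

// Hypothetical cutoff; smaller inputs stay on the serial path.
constexpr std::size_t kParallelThreshold = 100000;

// Split atom indices into inside/ghost lists for the box [lo, hi)^3.
void classify(const std::vector<Point>& atoms, double lo, double hi,
              std::vector<int>& inside, std::vector<int>& ghost)
{
    const long long n = static_cast<long long>(atoms.size());
    // vector<char> rather than vector<bool>: distinct elements can be
    // written from different threads without sharing a packed word.
    std::vector<char> is_inside(atoms.size(), 0);
    const bool use_parallel = atoms.size() >= kParallelThreshold;
    (void)use_parallel;  // unused when OpenMP is disabled

    // Pass 1: independent per-atom flags, safe to run in parallel.
#ifdef _OPENMP
    #pragma omp parallel for schedule(static) if (use_parallel)
#endif
    for (long long i = 0; i < n; ++i)
    {
        const Point& p = atoms[static_cast<std::size_t>(i)];
        const bool in = p.x >= lo && p.x < hi && p.y >= lo && p.y < hi
                        && p.z >= lo && p.z < hi;
        is_inside[static_cast<std::size_t>(i)] = in ? 1 : 0;
    }

    // Pass 2: sequential write-back keeps ascending index order in both lists.
    inside.clear();
    ghost.clear();
    for (long long i = 0; i < n; ++i)
    {
        if (is_inside[static_cast<std::size_t>(i)]) { inside.push_back(static_cast<int>(i)); }
        else { ghost.push_back(static_cast<int>(i)); }
    }
}
```

The sequential second pass costs one extra linear scan, which is the price paid for output that is identical at every thread count.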
Correctness
The implementation preserves:
- `all_atoms_` image-major / type / local-atom order.
- `atom_id` as the linear all-atom index.
- `all_atoms_` scan.

Performance
Environment:
- `np=1`
- `OMP_PROC_BIND=spread`
- `OMP_PLACES=cores`

2048 atom LJ NVE, 200 steps

|  |  | `NeighborSearch::init` | `set_member_variables` |
|---|---|---|---|
| 10.79 s | 10.42 s | 0.79 s | 0.38 s |
| 10.61 s | 10.23 s | 0.76 s | 0.36 s |
| 11.22 s | 10.85 s | 0.77 s | 0.36 s |

8192 atom LJ NVE, 100 steps

|  |  | `NeighborSearch::init` | `set_member_variables` |
|---|---|---|---|
| 24.19 s | 23.82 s | 1.60 s | 0.78 s |
| 21.68 s | 21.32 s | 0.75 s | 0.34 s |
| 21.07 s | 20.71 s | 0.64 s | 0.34 s |
| 20.70 s | 20.34 s | 0.52 s | 0.30 s |
| 21.44 s | 21.07 s | 0.65 s | 0.37 s |

Best 8192 result
- 24.19 s -> 20.70 s, about 1.17x.
- `NeighborSearch::init`: 1.60 s -> 0.52 s, about 3.08x.
- `set_member_variables`: 0.78 s -> 0.30 s, about 2.60x.

The 2048-atom case is mostly noise-bound and does not benefit from high thread counts, so this PR should be read as a larger-system setup optimization.
Tests
`git diff --check` `c9a28da13`: passed