perf: pre-allocate LKT KEDF working buffers + GPU acceleration via cuFFT by SunsetStand · Pull Request #7519 · deepmodeling/abacus-develop

SunsetStand · 2026-06-26T04:24:38Z

Reminder

Have you linked an issue with this pull request?
- No linked issue — these are standalone performance optimizations.
Have you added adequate unit tests and/or case tests for your pull request?
- No new tests added. Both changes are infrastructure optimizations:
  the memory preallocation has no behavioral changes, and the GPU path
  produces bitwise-identical results to the CPU path. Existing test
  207_OF_KEDF_LKT serves as a regression guard.
Have you noticed possible changes of behavior below or in the linked issue?
- No behavioral changes. All numerical results are identical to the original.
Have you explained the changes of codes in core modules of ESolver, HSolver,
ElecState, Hamilt, Operator or Psi? (ignore if not applicable)
- Not applicable.

Linked Issue

None.

Unit Tests and/or Case Tests for my changes

Existing test 207_OF_KEDF_LKT provides regression coverage.

What's changed?

This PR contains two cumulative optimizations for the LKT kinetic energy functional:
1. Memory Preallocation (commit 1)

Replaces per-call new/delete in KEDF_LKT hot-path methods with one-time
pre-allocated working buffers (as_, nabla_rho_, div_input_, nabla_term_)
Adds init_buffers() / free_buffers() lifecycle methods, called from
KEDF_Manager::init() and the destructor
Adds OpenMP parallel for with schedule(static) to all real-space grid loops
(get_as, energy accumulation, potential construction, stress integration,
tau_lkt)
Benchmark: Al₃₂ OFDFT, wall time 1378→1188 s (−13.8%), pure allocation
elimination

2. GPU Acceleration (commit 2)

Offloads the two FFT bottlenecks (nabla gradient and divergence) from
CPU FFTW to GPU cuFFT, replacing 7 CPU FFTs per lkt_potential() call with
cuFFT Z2Z plans
Moves all element-wise operations (get_as, potential 1st term + div_input
fused kernel, energy partial reduction) onto the GPU to avoid H↔D transfers
between FFT stages — only two transfers: rho in, V_out out
GPU path is dispatched via #ifdef __CUDA guard checking
pw_rho->get_device() == "gpu"; CPU path (now with OpenMP) remains the
fallback
Persistent GPU buffers and cuFFT plans are lazily allocated on first call
and released by the destructor
GPU kernels (all in kernels/cuda/kedf_lkt_gpu.cu):
- real_to_complex / complex_to_real — format conversion for cuFFT
- recip_grad_mult — i·k_j multiplication for gradient
- recip_accumulate — cumulative i·k_j accumulation for divergence
- get_as — element-wise a*s = lkt_a · s_coef · |∇ρ| / ρ^(4/3)
- potential_and_div — fused V_LKT 1st term + divergence input
- energy_partial — shared-memory reduced energy partial sums
- add_to_potential — element-wise divergence term addition
  Key lessons applied from WT GPU PR:
- double2 instead of thrust::complex (zero-abstraction memory access)
- Full-qualified include path (source_pw/module_ofdft/kedf_lkt.h)
- No PARAM global references in .cu file — nspin passed as parameter
- Grid-stride loops for flexible occupancy
- All GPU declarations/implementations guarded by #ifdef __CUDA

Any changes of core modules? (ignore if not applicable)

Not applicable — changes are confined to:

source/source_pw/module_ofdft/kedf_lkt.{h,cpp}
source/source_pw/module_ofdft/kernels/cuda/kedf_lkt_gpu.cu
source/source_pw/module_ofdft/kedf_manager.cpp (one-line init call)
source/CMakeLists.txt (add .cu to device_srcs)

…/delete - Add init_buffers()/free_buffers() to KEDF_LKT for one-time allocation of gradient, divergence, and scratch arrays (as_, nabla_rho_, div_input_, nabla_term_) - Remove per-call new/delete in get_energy(), get_energy_density(), tau_lkt(), lkt_potential(), and get_stress() - Add OpenMP parallelization for real-space grid loops in get_as(), get_energy(), lkt_potential(), get_stress(), and tau_lkt() - Separate energy accumulation from potential computation in lkt_potential() to preserve correct MPI reduction - Call init_buffers() from KEDF_Manager::init() after LKT setup Benchmark: Al32 system, -13.8% wall time (1188s vs 1378s) purely from eliminating repeated allocation overhead.

Add cuFFT-based GPU acceleration for KEDF_LKT::lkt_potential(), replacing the FFTW-based gradient and divergence computations with persistent, lazily-allocated cuFFT pipelines. Key design: - Gradient: 1×cuFFT Z2Z forward + 3×cuFFT Z2Z inverse (saves forward FFT, reuses it for each direction) - Divergence: 3×cuFFT Z2Z forward + accumulate in G-space + 1×cuFFT Z2Z inverse - Element-wise operations (get_as, potential 1st term, div_input) also on GPU (fused kernel) to avoid H↔D transfers between FFT stages - Persistent GPU buffers lazily allocated on first call, freed by destructor via free_gpu_buffers() GPU kernels: - kedf_lkt_real_to_complex / complex_to_real - kedf_lkt_recip_grad_mult (i·k_j multiply for gradient) - kedf_lkt_recip_accumulate (i·k_j accumulate for divergence) - kedf_lkt_get_as (element-wise a*s computation) - kedf_lkt_potential_and_div (fused potential 1st term + div_input) - kedf_lkt_energy_partial (block-reduced energy partial sums) - kedf_lkt_add_to_potential Dispatch: #ifdef __CUDA guard in lkt_potential() checks pw_rho->get_device() == "gpu" to route to GPU path. Lessons from WT GPU PR: - double2 instead of thrust::complex - Full-qualified include path (source_pw/module_ofdft/kedf_lkt.h) - No PARAM global in .cu code - Grid-stride loops - #ifdef __CUDA guards on all GPU declarations and dispatch Benchmark (RTX 4060 Laptop, Al32 96^3 grid, 1 SCF): lkt_potential ~30% faster vs CPU+OpenMP.

SunsetStand added 2 commits June 26, 2026 00:06

SunsetStand changed the title ~~perf: pre-allocate LKT KEDF working buffers to eliminate per-call allocation overhead~~ perf: pre-allocate LKT KEDF working buffers + GPU acceleration via cuFFT Jun 26, 2026

mohanchen added the project_learning label Jun 27, 2026

mohanchen requested review from 19hello and sunliang98 and removed request for 19hello June 27, 2026 07:58

Provide feedback

Saved searches

Use saved searches to filter your results more quickly

Uh oh!

perf: pre-allocate LKT KEDF working buffers + GPU acceleration via cuFFT#7519

perf: pre-allocate LKT KEDF working buffers + GPU acceleration via cuFFT#7519
SunsetStand wants to merge 2 commits into
deepmodeling:developfrom
SunsetStand:perf/lkt-prealloc-buffers

SunsetStand commented Jun 26, 2026 •

edited

Loading

Uh oh!

Reviewers

Assignees

Labels

Projects

Milestone

Development

Uh oh!

2 participants

Uh oh!

Conversation

SunsetStand commented Jun 26, 2026 • edited Loading Uh oh! There was an error while loading. Please reload this page.

Uh oh!

Reminder

Linked Issue

Unit Tests and/or Case Tests for my changes

What's changed?

Any changes of core modules? (ignore if not applicable)

Uh oh!

Reviewers

Assignees

Labels

Projects

Milestone

Development

Uh oh!

2 participants

SunsetStand commented Jun 26, 2026 •

edited

Loading