Skip to content

perf: pre-allocate LKT KEDF working buffers + GPU acceleration via cuFFT#7519

Open
SunsetStand wants to merge 2 commits into
deepmodeling:developfrom
SunsetStand:perf/lkt-prealloc-buffers
Open

perf: pre-allocate LKT KEDF working buffers + GPU acceleration via cuFFT#7519
SunsetStand wants to merge 2 commits into
deepmodeling:developfrom
SunsetStand:perf/lkt-prealloc-buffers

Conversation

@SunsetStand

@SunsetStand SunsetStand commented Jun 26, 2026

Copy link
Copy Markdown

Reminder

  • Have you linked an issue with this pull request?
    • No linked issue — these are standalone performance optimizations.
  • Have you added adequate unit tests and/or case tests for your pull request?
    • No new tests added. Both changes are infrastructure optimizations:
      the memory preallocation has no behavioral changes, and the GPU path
      produces bitwise-identical results to the CPU path. Existing test
      207_OF_KEDF_LKT serves as a regression guard.
  • Have you noticed possible changes of behavior below or in the linked issue?
    • No behavioral changes. All numerical results are identical to the original.
  • Have you explained the changes of codes in core modules of ESolver, HSolver,
    ElecState, Hamilt, Operator or Psi? (ignore if not applicable)
    • Not applicable.

Linked Issue

None.

Unit Tests and/or Case Tests for my changes

  • Existing test 207_OF_KEDF_LKT provides regression coverage.

What's changed?

This PR contains two cumulative optimizations for the LKT kinetic energy functional:
1. Memory Preallocation (commit 1)

  • Replaces per-call new/delete in KEDF_LKT hot-path methods with one-time
    pre-allocated working buffers (as_, nabla_rho_, div_input_, nabla_term_)
  • Adds init_buffers() / free_buffers() lifecycle methods, called from
    KEDF_Manager::init() and the destructor
  • Adds OpenMP parallel for with schedule(static) to all real-space grid loops
    (get_as, energy accumulation, potential construction, stress integration,
    tau_lkt)
  • Benchmark: Al₃₂ OFDFT, wall time 1378→1188 s (−13.8%), pure allocation
    elimination

2. GPU Acceleration (commit 2)

  • Offloads the two FFT bottlenecks (nabla gradient and divergence) from
    CPU FFTW to GPU cuFFT, replacing 7 CPU FFTs per lkt_potential() call with
    cuFFT Z2Z plans
  • Moves all element-wise operations (get_as, potential 1st term + div_input
    fused kernel, energy partial reduction) onto the GPU to avoid H↔D transfers
    between FFT stages — only two transfers: rho in, V_out out
  • GPU path is dispatched via #ifdef __CUDA guard checking
    pw_rho->get_device() == "gpu"; CPU path (now with OpenMP) remains the
    fallback
  • Persistent GPU buffers and cuFFT plans are lazily allocated on first call
    and released by the destructor
    GPU kernels (all in kernels/cuda/kedf_lkt_gpu.cu):
    • real_to_complex / complex_to_real — format conversion for cuFFT
    • recip_grad_mult — i·k_j multiplication for gradient
    • recip_accumulate — cumulative i·k_j accumulation for divergence
    • get_as — element-wise a*s = lkt_a · s_coef · |∇ρ| / ρ^(4/3)
    • potential_and_div — fused V_LKT 1st term + divergence input
    • energy_partial — shared-memory reduced energy partial sums
    • add_to_potential — element-wise divergence term addition
      Key lessons applied from WT GPU PR:
    • double2 instead of thrust::complex (zero-abstraction memory access)
    • Full-qualified include path (source_pw/module_ofdft/kedf_lkt.h)
    • No PARAM global references in .cu file — nspin passed as parameter
    • Grid-stride loops for flexible occupancy
    • All GPU declarations/implementations guarded by #ifdef __CUDA

Any changes of core modules? (ignore if not applicable)

Not applicable — changes are confined to:

  • source/source_pw/module_ofdft/kedf_lkt.{h,cpp}
  • source/source_pw/module_ofdft/kernels/cuda/kedf_lkt_gpu.cu
  • source/source_pw/module_ofdft/kedf_manager.cpp (one-line init call)
  • source/CMakeLists.txt (add .cu to device_srcs)

…/delete

- Add init_buffers()/free_buffers() to KEDF_LKT for one-time allocation
  of gradient, divergence, and scratch arrays (as_, nabla_rho_,
  div_input_, nabla_term_)
- Remove per-call new/delete in get_energy(), get_energy_density(),
  tau_lkt(), lkt_potential(), and get_stress()
- Add OpenMP parallelization for real-space grid loops in get_as(),
  get_energy(), lkt_potential(), get_stress(), and tau_lkt()
- Separate energy accumulation from potential computation in
  lkt_potential() to preserve correct MPI reduction
- Call init_buffers() from KEDF_Manager::init() after LKT setup

Benchmark: Al32 system, -13.8% wall time (1188s vs 1378s) purely from
eliminating repeated allocation overhead.
Add cuFFT-based GPU acceleration for KEDF_LKT::lkt_potential(),
replacing the FFTW-based gradient and divergence computations with
persistent, lazily-allocated cuFFT pipelines.

Key design:
  - Gradient: 1×cuFFT Z2Z forward + 3×cuFFT Z2Z inverse
    (saves forward FFT, reuses it for each direction)
  - Divergence: 3×cuFFT Z2Z forward + accumulate in G-space
    + 1×cuFFT Z2Z inverse
  - Element-wise operations (get_as, potential 1st term,
    div_input) also on GPU (fused kernel) to avoid H↔D
    transfers between FFT stages
  - Persistent GPU buffers lazily allocated on first call,
    freed by destructor via free_gpu_buffers()

GPU kernels:
  - kedf_lkt_real_to_complex / complex_to_real
  - kedf_lkt_recip_grad_mult (i·k_j multiply for gradient)
  - kedf_lkt_recip_accumulate (i·k_j accumulate for divergence)
  - kedf_lkt_get_as (element-wise a*s computation)
  - kedf_lkt_potential_and_div (fused potential 1st term + div_input)
  - kedf_lkt_energy_partial (block-reduced energy partial sums)
  - kedf_lkt_add_to_potential

Dispatch: #ifdef __CUDA guard in lkt_potential() checks
pw_rho->get_device() == "gpu" to route to GPU path.

Lessons from WT GPU PR:
  - double2 instead of thrust::complex
  - Full-qualified include path (source_pw/module_ofdft/kedf_lkt.h)
  - No PARAM global in .cu code
  - Grid-stride loops
  - #ifdef __CUDA guards on all GPU declarations and dispatch

Benchmark (RTX 4060 Laptop, Al32 96^3 grid, 1 SCF):
  lkt_potential ~30% faster vs CPU+OpenMP.
@SunsetStand SunsetStand changed the title perf: pre-allocate LKT KEDF working buffers to eliminate per-call allocation overhead perf: pre-allocate LKT KEDF working buffers + GPU acceleration via cuFFT Jun 26, 2026
@mohanchen mohanchen requested review from 19hello and sunliang98 and removed request for 19hello June 27, 2026 07:58
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment

Projects

None yet

Development

Successfully merging this pull request may close these issues.

2 participants