perf: pre-allocate LKT KEDF working buffers + GPU acceleration via cuFFT#7519
Open
SunsetStand wants to merge 2 commits into
Open
perf: pre-allocate LKT KEDF working buffers + GPU acceleration via cuFFT#7519SunsetStand wants to merge 2 commits into
SunsetStand wants to merge 2 commits into
Conversation
…/delete - Add init_buffers()/free_buffers() to KEDF_LKT for one-time allocation of gradient, divergence, and scratch arrays (as_, nabla_rho_, div_input_, nabla_term_) - Remove per-call new/delete in get_energy(), get_energy_density(), tau_lkt(), lkt_potential(), and get_stress() - Add OpenMP parallelization for real-space grid loops in get_as(), get_energy(), lkt_potential(), get_stress(), and tau_lkt() - Separate energy accumulation from potential computation in lkt_potential() to preserve correct MPI reduction - Call init_buffers() from KEDF_Manager::init() after LKT setup Benchmark: Al32 system, -13.8% wall time (1188s vs 1378s) purely from eliminating repeated allocation overhead.
Add cuFFT-based GPU acceleration for KEDF_LKT::lkt_potential(),
replacing the FFTW-based gradient and divergence computations with
persistent, lazily-allocated cuFFT pipelines.
Key design:
- Gradient: 1×cuFFT Z2Z forward + 3×cuFFT Z2Z inverse
(saves forward FFT, reuses it for each direction)
- Divergence: 3×cuFFT Z2Z forward + accumulate in G-space
+ 1×cuFFT Z2Z inverse
- Element-wise operations (get_as, potential 1st term,
div_input) also on GPU (fused kernel) to avoid H↔D
transfers between FFT stages
- Persistent GPU buffers lazily allocated on first call,
freed by destructor via free_gpu_buffers()
GPU kernels:
- kedf_lkt_real_to_complex / complex_to_real
- kedf_lkt_recip_grad_mult (i·k_j multiply for gradient)
- kedf_lkt_recip_accumulate (i·k_j accumulate for divergence)
- kedf_lkt_get_as (element-wise a*s computation)
- kedf_lkt_potential_and_div (fused potential 1st term + div_input)
- kedf_lkt_energy_partial (block-reduced energy partial sums)
- kedf_lkt_add_to_potential
Dispatch: #ifdef __CUDA guard in lkt_potential() checks
pw_rho->get_device() == "gpu" to route to GPU path.
Lessons from WT GPU PR:
- double2 instead of thrust::complex
- Full-qualified include path (source_pw/module_ofdft/kedf_lkt.h)
- No PARAM global in .cu code
- Grid-stride loops
- #ifdef __CUDA guards on all GPU declarations and dispatch
Benchmark (RTX 4060 Laptop, Al32 96^3 grid, 1 SCF):
lkt_potential ~30% faster vs CPU+OpenMP.
This file contains hidden or bidirectional Unicode text that may be interpreted or compiled differently than what appears below. To review, open the file in an editor that reveals hidden Unicode characters.
Learn more about bidirectional Unicode characters
Sign up for free
to join this conversation on GitHub.
Already have an account?
Sign in to comment
Add this suggestion to a batch that can be applied as a single commit.This suggestion is invalid because no changes were made to the code.Suggestions cannot be applied while the pull request is closed.Suggestions cannot be applied while viewing a subset of changes.Only one suggestion per line can be applied in a batch.Add this suggestion to a batch that can be applied as a single commit.Applying suggestions on deleted lines is not supported.You must change the existing code in this line in order to create a valid suggestion.Outdated suggestions cannot be applied.This suggestion has been applied or marked resolved.Suggestions cannot be applied from pending reviews.Suggestions cannot be applied on multi-line comments.Suggestions cannot be applied while the pull request is queued to merge.Suggestion cannot be applied right now. Please check back later.
Reminder
the memory preallocation has no behavioral changes, and the GPU path
produces bitwise-identical results to the CPU path. Existing test
207_OF_KEDF_LKT serves as a regression guard.
ElecState, Hamilt, Operator or Psi? (ignore if not applicable)
Linked Issue
None.
Unit Tests and/or Case Tests for my changes
What's changed?
This PR contains two cumulative optimizations for the LKT kinetic energy functional:
1. Memory Preallocation (commit 1)
new/deleteinKEDF_LKThot-path methods with one-timepre-allocated working buffers (
as_,nabla_rho_,div_input_,nabla_term_)init_buffers()/free_buffers()lifecycle methods, called fromKEDF_Manager::init()and the destructorparallel forwithschedule(static)to all real-space grid loops(
get_as, energy accumulation, potential construction, stress integration,tau_lkt)elimination
2. GPU Acceleration (commit 2)
nablagradient anddivergence) fromCPU FFTW to GPU cuFFT, replacing 7 CPU FFTs per
lkt_potential()call withcuFFT Z2Z plans
get_as, potential 1st term + div_inputfused kernel, energy partial reduction) onto the GPU to avoid H↔D transfers
between FFT stages — only two transfers: rho in, V_out out
#ifdef __CUDAguard checkingpw_rho->get_device() == "gpu"; CPU path (now with OpenMP) remains thefallback
and released by the destructor
GPU kernels (all in
kernels/cuda/kedf_lkt_gpu.cu):real_to_complex/complex_to_real— format conversion for cuFFTrecip_grad_mult— i·k_j multiplication for gradientrecip_accumulate— cumulative i·k_j accumulation for divergenceget_as— element-wise a*s = lkt_a · s_coef · |∇ρ| / ρ^(4/3)potential_and_div— fused V_LKT 1st term + divergence inputenergy_partial— shared-memory reduced energy partial sumsadd_to_potential— element-wise divergence term additionKey lessons applied from WT GPU PR:
double2instead ofthrust::complex(zero-abstraction memory access)source_pw/module_ofdft/kedf_lkt.h)PARAMglobal references in.cufile — nspin passed as parameter#ifdef __CUDAAny changes of core modules? (ignore if not applicable)
Not applicable — changes are confined to:
source/source_pw/module_ofdft/kedf_lkt.{h,cpp}source/source_pw/module_ofdft/kernels/cuda/kedf_lkt_gpu.cusource/source_pw/module_ofdft/kedf_manager.cpp(one-line init call)source/CMakeLists.txt(add .cu to device_srcs)