perf: avoid redundant dp.compute() across MPI ranks #7524

Open
curry-sthuc wants to merge 2 commits into deepmodeling:develop from curry-sthuc:feature/mpi-openmp-cuda-accel
@curry-sthuc

Problem

In ESolver_DP::runner(), every MPI rank calls dp.compute() independently with identical input, producing the same energy, forces, and virial. With N ranks, this results in N-fold redundant computation, and multiple concurrent deepmd inference calls cause CPU contention.

Solution

Only rank 0 calls dp.compute(), then broadcasts the energy, forces, and virial to the other ranks via MPI_Bcast. The change is wrapped in #ifdef __MPI guards, so serial builds are unaffected.
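
For readers who want the pattern in isolation, here is a minimal sketch of "compute on rank 0, then broadcast". It is not the code from this PR: `DpResult`, `fake_compute`, and `compute_on_root_and_share` are hypothetical names, and `fake_compute` is only a stand-in for dp.compute(). It assumes the caller has already called MPI_Init when built with `__MPI` defined; without that macro it compiles as a plain serial program.

```cpp
#include <cassert>
#include <cstddef>
#include <vector>
#ifdef __MPI
#include <mpi.h>
#endif

// Hypothetical container for the outputs of one inference call.
struct DpResult {
    double energy = 0.0;
    std::vector<double> force;   // 3 * natoms entries
    std::vector<double> virial;  // 9 entries
};

// Stand-in for dp.compute(); NOT the real DeePMD-kit API.
// It fills the result with simple deterministic values.
inline void fake_compute(DpResult& out,
                         const std::vector<double>& coord,
                         const std::vector<double>& /*cell*/) {
    assert(coord.size() % 3 == 0);
    out.energy = static_cast<double>(coord.size() / 3);
    for (std::size_t i = 0; i < coord.size(); ++i) {
        out.force[i] = -coord[i];
    }
}

// Rank 0 runs the (expensive) model; the other ranks only receive.
// Non-root ranks may pass empty coord/cell vectors, because they
// never read them.
inline DpResult compute_on_root_and_share(int rank,
                                          const std::vector<double>& coord,
                                          const std::vector<double>& cell,
                                          std::size_t natoms) {
    DpResult res;
    // Every rank allocates the receive buffers with the same sizes,
    // since MPI_Bcast writes into them on non-root ranks.
    res.force.assign(3 * natoms, 0.0);
    res.virial.assign(9, 0.0);

    if (rank == 0) {
        assert(coord.size() == 3 * natoms);
        fake_compute(res, coord, cell);
    }

#ifdef __MPI
    // MPI_Bcast is collective: every rank in the communicator must call it.
    MPI_Bcast(&res.energy, 1, MPI_DOUBLE, 0, MPI_COMM_WORLD);
    MPI_Bcast(res.force.data(), static_cast<int>(res.force.size()),
              MPI_DOUBLE, 0, MPI_COMM_WORLD);
    MPI_Bcast(res.virial.data(), static_cast<int>(res.virial.size()),
              MPI_DOUBLE, 0, MPI_COMM_WORLD);
#endif
    return res;
}
```

In an MPI build, each rank would obtain its rank from MPI_Comm_rank and call `compute_on_root_and_share` at the same point in the MD step, so the three broadcasts line up across ranks. Passing `natoms` separately, rather than deriving it from `coord`, is what lets non-root ranks skip building the input vectors.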

Performance (864 Al atoms, 100 MD steps)

| np | Before | After |
|----|--------|-------|
| 1  | 34s    | 33s   |
| 2  | 39s    | 33s   |
| 4  | 55s    | 23s   |
| 8  | 102s   | 28s   |

Checklist

  • No linked issue (standalone optimization)
  • No new tests needed: computation results are mathematically identical
  • No behavioral changes
  • No core module changes

curry-sthuc added 2 commits June 26, 2026 16:23
Only rank 0 calls dp.compute() and broadcasts results to all ranks
via MPI_Bcast, avoiding N-fold redundant computation and CPU contention.
Only rank 0 needs coord and cell vectors for dp.compute(). Non-rank-0 ranks receive results via MPI_Bcast and never use these vectors.
3 participants