feat(pt): add custom save behaviors by OutisLi · Pull Request #5589 · deepmodeling/deepmd-kit

OutisLi · 2026-06-25T15:37:16Z

Summary by CodeRabbit

New Features
- Added training.save_dir for periodic checkpoints and validating.save_best_dir for best-validation checkpoints.
- Added training.ckpt_keep_ratio for ratio-based sliding-window checkpoint retention.
Bug Fixes
- Improved checkpoint filename/“latest” aliasing and symlink/pointer behavior for periodic and EMA checkpoints when save_dir is set.
- Ensured “best” checkpoints are written only to the configured best-checkpoint directory.
- Made the validator create its checkpoint directory eagerly during initialization.
Documentation
- Documented save_dir and ckpt_keep_ratio; updated the training advanced guide and example config.
Tests
- Added unit tests for retention rounding/edge cases and filesystem tests for redirecting checkpoints and custom best-checkpoint locations.
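
As a rough illustration of how the new options might sit together in a training input, here is a minimal sketch written as a Python dict that mirrors the JSON input. The key names `save_dir`, `ckpt_keep_ratio`, `save_best_dir`, and `full_validation` come from this PR. The directory names, the numeric values, and the placement of `validating` as a top-level section are assumptions made for illustration only.

```python
import json

# Hypothetical input fragment; values and directory names are illustrative only.
config = {
    "training": {
        "numb_steps": 100000,
        "save_freq": 1000,
        "save_dir": "checkpoints",  # periodic checkpoints are written here
        "ckpt_keep_ratio": 0.1,  # fraction in (0, 1) of the run's checkpoints to retain
    },
    "validating": {
        "full_validation": True,  # best checkpoints are only written when this is on
        "save_best_dir": "ckpt_best",  # best-validation checkpoints are written here
    },
}

print(json.dumps(config, indent=2))
```

Setting `save_best_dir` without enabling full validation has no effect, which is the point raised in review about the example config.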

coderabbitai · 2026-06-25T15:46:58Z

No actionable comments were generated in the recent review. 🎉

ℹ️ Recent review info

⚙️ Run configuration

Configuration used: Repository UI

Review profile: CHILL

Plan: Pro

Run ID: 868c797c-78e7-4f5a-aa6d-189ab65a291e

📥 Commits

Reviewing files that changed from the base of the PR and between b0a4b15 and bcc11fa.

📒 Files selected for processing (2)

deepmd/utils/argcheck.py
doc/train/training-advanced.md

✅ Files skipped from review due to trivial changes (1)

doc/train/training-advanced.md

🚧 Files skipped from review as they are similar to previous changes (1)

deepmd/utils/argcheck.py

📝 Walkthrough

PyTorch training now supports configurable checkpoint output and retention, and full validation can write best checkpoints to a separate directory. Checkpoint path construction, symlink/pointer handling, configuration schema, docs, examples, and tests were updated.

Changes

Checkpoint path and retention updates

Layer / File(s)	Summary
Utilities and config contract `deepmd/pt/train/utils.py`, `deepmd/utils/argcheck.py`, `doc/train/training-advanced.md`, `examples/water/dpa4/input.json`, `source/tests/pt/test_train_utils.py`	New checkpoint helpers and training/validation config fields are added for `save_dir`, `ckpt_keep_ratio`, and `save_best_dir`, with docs, an example config, and helper tests updated.
Trainer retention setup `deepmd/pt/train/training.py`, `source/tests/pt/test_training.py`	`Trainer` now resolves `save_dir`, derives the keep count from `ckpt_keep_ratio` after `num_steps` is known, and updates regular and EMA retention limits.
Best checkpoint directory wiring `deepmd/pt/train/training.py`, `deepmd/pt/train/validation.py`, `source/tests/pt/test_validation.py`	`resolve_best_checkpoint_dir` is used for full-validation checkpoint directories, `FullValidator` creates the directory during initialization, and tests cover custom best-checkpoint locations.
Checkpoint save paths and symlinks `deepmd/pt/train/training.py`, `source/tests/pt/test_training.py`	Periodic, final, and zero-step checkpoint writes now use `latest_checkpoint_path(..., save_dir)`, and tests verify the resulting files and symlinks.

Sequence Diagram(s)

sequenceDiagram
  participant Trainer
  participant latest_checkpoint_path
  participant save_dir
  participant checkpoint_file
  Trainer->>latest_checkpoint_path: resolve prefix, step, and save_dir
  latest_checkpoint_path-->>Trainer: checkpoint path
  Trainer->>save_dir: write periodic checkpoint file
  Trainer->>checkpoint_file: update pointer to the resolved path

sequenceDiagram
  participant Trainer
  participant resolve_best_checkpoint_dir
  participant FullValidator
  participant checkpoint_dir
  Trainer->>resolve_best_checkpoint_dir: resolve validating.save_best_dir or save_ckpt parent
  resolve_best_checkpoint_dir-->>Trainer: checkpoint_dir
  Trainer->>FullValidator: create validator with checkpoint_dir
  FullValidator->>checkpoint_dir: mkdir(parents=True, exist_ok=True)
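
The second diagram can be sketched roughly as follows. This is not the actual `resolve_best_checkpoint_dir` from the PR; `resolve_best_dir_sketch` is a hypothetical stand-in that only assumes the fallback described in the diagram (use `save_best_dir` when given, otherwise the parent of `save_ckpt`), and the paths are made up.

```python
import tempfile
from pathlib import Path
from typing import Optional


def resolve_best_dir_sketch(save_best_dir: Optional[str], save_ckpt: str) -> Path:
    """Hypothetical stand-in: prefer save_best_dir, else the parent of save_ckpt."""
    if save_best_dir is not None:
        return Path(save_best_dir)
    return Path(save_ckpt).parent


with tempfile.TemporaryDirectory() as tmp:
    default_dir = resolve_best_dir_sketch(None, f"{tmp}/run/model.ckpt")
    custom_dir = resolve_best_dir_sketch(f"{tmp}/best/nested", f"{tmp}/run/model.ckpt")
    for directory in (default_dir, custom_dir):
        # Same call as in the diagram; safe to repeat because of exist_ok=True.
        directory.mkdir(parents=True, exist_ok=True)
        directory.mkdir(parents=True, exist_ok=True)
    created = (default_dir.is_dir(), custom_dir.is_dir())
    names = (default_dir.name, custom_dir.name)

print(created, names)
```

Creating the directory at initialization means a bad path fails early rather than at the first best-checkpoint write.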

Estimated code review effort

🎯 4 (Complex) | ⏱️ ~60 minutes

Possibly related PRs

deepmodeling/deepmd-kit#5420: Shares the PyTorch checkpoint and EMA save/restore path area in deepmd/pt/train/training.py.

Suggested reviewers

njzjz
wanghan-iapcm
iProzd

🚥 Pre-merge checks | ✅ 4 | ❌ 1

❌ Failed checks (1 warning)

Check name	Status	Explanation	Resolution
Docstring Coverage	⚠️ Warning	Docstring coverage is 35.71% which is insufficient. The required threshold is 80.00%.	Write docstrings for the functions missing them to satisfy the coverage threshold.

✅ Passed checks (4 passed)

Check name	Status	Explanation
Description Check	✅ Passed	Check skipped - CodeRabbit’s high-level summary is enabled.
Title check	✅ Passed	The title is concise and accurately summarizes the main change: customizable checkpoint save behavior in PyTorch training.
Linked Issues check	✅ Passed	Check skipped because no linked issues were found for this pull request.
Out of Scope Changes check	✅ Passed	Check skipped because no linked issues were found for this pull request.

coderabbitai

Actionable comments posted: 4

🤖 Prompt for all review comments with AI agents

Verify each finding against current code. Fix only still-valid issues, skip the
rest with a brief reason, keep changes minimal, and validate.

Inline comments:
In `@deepmd/pt/train/utils.py`:
- Around line 283-284: The checkpoint retention calculation in the helper that
returns the keep count is undercounting because it ignores the final off-cadence
checkpoint written by Trainer.run() at num_steps. Update the logic around
total_periodic_ckpts/ckpt_keep_ratio so it accounts for the extra terminal
checkpoint (for example, by including the final step in the total when num_steps
is not an exact multiple of save_freq), and keep the existing max(1, ...)
safeguards intact.

In `@examples/water/dpa4/input.json`:
- Around line 123-124: The save_best_dir setting is unused in this example
because the validation path that triggers best-checkpoint saving is never
enabled. Update the input in this example by either turning on the
validating.full_validation flow so ckpt_best can be created, or remove the
save_best_dir field from the example to avoid misleading users; make the change
in the example configuration where tf32_infer and save_best_dir are defined.

In `@source/tests/pt/test_training.py`:
- Around line 967-968: Add the standard training test timeout guard to the new
validation test so it cannot hang CI; decorate
test_full_validation_save_best_dir with `@TRAINING_TEST_TIMEOUT` alongside the
existing `@patch` on FullValidator.evaluate_all_systems, matching the pattern used
by other training tests that call trainer.run().
- Around line 1228-1231: The checkpoint alias test is incorrectly asserting that
the prefix files are symlinks, which breaks on platforms where
symlink_prefix_files() falls back to copying. Update the test in the
checkpoint-saving area to validate the alias by checking that the Path resolves
to the expected target file, without requiring is_symlink(), using the existing
save_ckpt and ema_save_ckpt references so the test remains cross-platform.
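
One way to make the alias assertion portable is sketched below, under the assumption that an alias is either a symlink or a plain copy. `make_alias` and `alias_matches` are hypothetical helpers, not functions from deepmd-kit, and the file names are invented for the example.

```python
import shutil
import tempfile
from pathlib import Path


def make_alias(target: Path, alias: Path) -> None:
    """Hypothetical stand-in: symlink when possible, otherwise copy the file."""
    try:
        alias.symlink_to(target.resolve())
    except (OSError, NotImplementedError):
        shutil.copyfile(target, alias)


def alias_matches(alias: Path, target: Path) -> bool:
    """True when alias points at target, or holds identical bytes if it is a copy."""
    if alias.is_symlink():
        return alias.resolve() == target.resolve()
    return alias.read_bytes() == target.read_bytes()


with tempfile.TemporaryDirectory() as tmp:
    root = Path(tmp)
    target = root / "ckpts" / "model.ckpt-5.pt"
    target.parent.mkdir(parents=True)
    target.write_bytes(b"weights")
    alias = root / "model.ckpt.pt"
    make_alias(target, alias)
    ok = alias_matches(alias, target)

print(ok)
```

The check passes on either branch, so the test no longer depends on whether the platform allows symlinks.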

ℹ️ Review info

⚙️ Run configuration

Configuration used: Repository UI

Review profile: CHILL

Plan: Pro

Run ID: 40ff16a5-1016-4465-ba48-a71de2e87b50

📥 Commits

Reviewing files that changed from the base of the PR and between 5733301 and fc780b1.

📒 Files selected for processing (9)

deepmd/pt/train/training.py
deepmd/pt/train/utils.py
deepmd/pt/train/validation.py
deepmd/utils/argcheck.py
doc/train/training-advanced.md
examples/water/dpa4/input.json
source/tests/pt/test_train_utils.py
source/tests/pt/test_training.py
source/tests/pt/test_validation.py

njzjz-bot

Thanks for adding the checkpoint save directory and ratio-based retention knobs. I found a few issues worth fixing before merge:

- ckpt_keep_ratio currently under-counts when the final checkpoint is off-cadence. In resolve_keep_ckpt_count(), total_periodic_ckpts = num_steps // save_freq ignores the final checkpoint that Trainer.run() still writes when num_steps % save_freq != 0. For example, num_steps=5, save_freq=2, ckpt_keep_ratio=0.5 produces checkpoints at steps 2, 4, and 5, but the helper returns ceil(0.5 * (5 // 2)) = 1; the documented formula ceil(ckpt_keep_ratio * numb_steps / save_freq) would keep 2. Please account for the terminal checkpoint, e.g. use ceil(num_steps / save_freq) (with the existing minimum-one guard).
- The new save_best_dir in examples/water/dpa4/input.json is misleading unless full validation is enabled. Since validating.full_validation defaults to false, ckpt_best will not actually be used by this example. Either enable the full-validation flow in the example or omit save_best_dir there.
- The new test_save_dir_redirects_checkpoints_with_local_symlinks assumes Path(...).is_symlink(), but symlink_prefix_files() copies files on Windows. If these tests are expected to be portable, please avoid requiring symlinks in the assertion (or explicitly scope the test/docs to non-Windows behavior).
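
The arithmetic in the first point can be checked with a short sketch. It assumes only what the comment states: a checkpoint is written every `save_freq` steps and once more at `num_steps`. `checkpoint_steps` is a hypothetical helper written for this example, not code from the PR.

```python
from math import ceil


def checkpoint_steps(num_steps: int, save_freq: int) -> list:
    """Steps with a checkpoint: every save_freq steps, plus the final step."""
    steps = list(range(save_freq, num_steps + 1, save_freq))
    if not steps or steps[-1] != num_steps:
        steps.append(num_steps)  # off-cadence terminal checkpoint
    return steps


num_steps, save_freq, ratio = 5, 2, 0.5
steps = checkpoint_steps(num_steps, save_freq)

# Floor-based total ignores the terminal checkpoint at step 5.
floor_keep = max(1, ceil(ratio * (num_steps // save_freq)))
# Ceil-based total counts it.
ceil_keep = max(1, ceil(ratio * ceil(num_steps / save_freq)))

print(steps)       # [2, 4, 5]
print(floor_keep)  # 1
print(ceil_keep)   # 2
```

With three checkpoints actually on disk, keeping half of them should retain two, which only the ceil-based total delivers.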

Reviewed by OpenClaw 2026.6.8 (model: custom-chat-jinzhezeng-group/gpt-5.5).

codecov · 2026-06-25T17:31:05Z

Codecov Report

❌ Patch coverage is 93.18182% with 3 lines in your changes missing coverage. Please review.
✅ Project coverage is 82.30%. Comparing base (5733301) to head (bcc11fa).
⚠️ Report is 4 commits behind head on master.

Files with missing lines	Patch %	Lines
deepmd/pt/train/training.py	88.00%	3 Missing ⚠️

Additional details and impacted files

@@            Coverage Diff             @@
##           master    #5589      +/-   ##
==========================================
+ Coverage   82.27%   82.30%   +0.02%     
==========================================
  Files         887      887              
  Lines      100331   100481     +150     
  Branches     4060     4060              
==========================================
+ Hits        82550    82700     +150     
+ Misses      16320    16317       -3     
- Partials     1461     1464       +3

wanghan-iapcm · 2026-06-26T10:39:59Z

Non-blocking doc nit: the documented ckpt_keep_ratio formula doesn't match the code.

Both the argcheck help text and the user doc state a single ceil:

ceil(ckpt_keep_ratio * numb_steps / save_freq)

but resolve_keep_ckpt_count computes a nested ceil — ceil(ratio * ceil(num_steps / save_freq)):

deepmd-kit/deepmd/pt/train/utils.py

Lines 283 to 285 in b0a4b15

    
        return None
    total_ckpts = max(1, ceil(num_steps / save_freq))
    return max(1, ceil(ckpt_keep_ratio * total_ckpts))

These differ whenever numb_steps isn't a multiple of save_freq. E.g. numb_steps=11, save_freq=10, ratio=0.9: the code keeps 2, the documented formula gives 1. The code is correct; only the docs are off.

Suggest updating both to the nested form ceil(ckpt_keep_ratio * ceil(numb_steps / save_freq)) (the resolve_keep_ckpt_count docstring already states it correctly):

deepmd-kit/deepmd/utils/argcheck.py

Lines 4997 to 5002 in b0a4b15

    
    doc_ckpt_keep_ratio = (
        "An alternative to `max_ckpt_keep` that sets the number of retained "
        "checkpoints as a fraction in (0, 1) of the run: the most recent "
        "`ceil(ckpt_keep_ratio * numb_steps / save_freq)` checkpoints are kept. "
        "When set, it overrides `max_ckpt_keep` and `ema_ckpt_keep`."
    )

deepmd-kit/doc/train/training-advanced.md

Line 107 in b0a4b15

    
           - {ref}`ckpt_keep_ratio <training/ckpt_keep_ratio>` An alternative to `max_ckpt_keep` (PyTorch backend) that keeps a sliding window of `ceil(ckpt_keep_ratio * numb_steps / save_freq)` most recent checkpoints, i.e. the final `ckpt_keep_ratio` fraction of the run by step. It overrides `max_ckpt_keep` (and `ema_ckpt_keep`) when set, and works the same whether the run length is given by `numb_steps` or `numb_epoch`.
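
To make the mismatch concrete, the two formulas can be compared side by side. This is a minimal sketch: `documented_keep` and `nested_keep` are names invented here, and `nested_keep` only mirrors the three quoted lines from `utils.py`, not the full `resolve_keep_ckpt_count` helper.

```python
from math import ceil


def documented_keep(ratio: float, numb_steps: int, save_freq: int) -> int:
    """Single ceil, as written in the help text and user doc."""
    return ceil(ratio * numb_steps / save_freq)


def nested_keep(ratio: float, numb_steps: int, save_freq: int) -> int:
    """Nested ceil, mirroring the quoted lines from utils.py."""
    total_ckpts = max(1, ceil(numb_steps / save_freq))
    return max(1, ceil(ratio * total_ckpts))


# Off-cadence run: the two formulas disagree.
print(documented_keep(0.9, 11, 10))  # 1
print(nested_keep(0.9, 11, 10))      # 2

# On-cadence run: they agree.
print(documented_keep(0.9, 20, 10), nested_keep(0.9, 20, 10))  # 2 2
```

The formulas coincide whenever `numb_steps` is a multiple of `save_freq`, which is why the discrepancy is easy to miss.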

wanghan-iapcm

see my non-blocking comment

OutisLi · 2026-06-26T14:30:43Z

> see my non-blocking comment

done

OutisLi added 2 commits June 25, 2026 23:36

feat(pt): add save_dir to set specific ckpt saving folder

a7c1635

feat(pt): add ckpt_keep_ratio to set max_ckpt_keep automatically

fc780b1

dosubot Bot added the new feature label Jun 25, 2026

OutisLi requested review from njzjz and wanghan-iapcm June 25, 2026 15:38

github-actions Bot added Python Docs Examples labels Jun 25, 2026

coderabbitai Bot reviewed Jun 25, 2026

Comment thread deepmd/pt/train/utils.py Outdated

Comment thread examples/water/dpa4/input.json

Comment thread source/tests/pt/test_training.py

Comment thread source/tests/pt/test_training.py Outdated

njzjz-bot reviewed Jun 25, 2026

fix

b0a4b15

wanghan-iapcm reviewed Jun 26, 2026

doc

bcc11fa

OutisLi requested a review from wanghan-iapcm June 26, 2026 14:33

njzjz-bot reviewed Jun 27, 2026

Comment thread examples/water/dpa4/input.json

wanghan-iapcm approved these changes Jun 27, 2026

OutisLi added this pull request to the merge queue Jun 28, 2026

Merged via the queue into deepmodeling:master with commit a9bcbc5 Jun 28, 2026
70 checks passed

OutisLi deleted the pr/save branch June 28, 2026 07:59

3 participants