Skip to content

feat(pt): add custom save behaviors#5589

Merged
OutisLi merged 4 commits into
deepmodeling:masterfrom
OutisLi:pr/save
Jun 28, 2026
Merged

feat(pt): add custom save behaviors#5589
OutisLi merged 4 commits into
deepmodeling:masterfrom
OutisLi:pr/save

Conversation

@OutisLi

@OutisLi OutisLi commented Jun 25, 2026

Copy link
Copy Markdown
Collaborator

Summary by CodeRabbit

  • New Features

    • Added training.save_dir for periodic checkpoints and validating.save_best_dir for best-validation checkpoints.
    • Added training.ckpt_keep_ratio for ratio-based sliding-window checkpoint retention.
  • Bug Fixes

    • Improved checkpoint filename/“latest” aliasing and symlink/pointer behavior for periodic and EMA checkpoints when save_dir is set.
    • Ensured “best” checkpoints are written only to the configured best-checkpoint directory.
    • Eagerly creates the validator checkpoint directory during initialization.
  • Documentation

    • Documented save_dir and ckpt_keep_ratio; updated the training advanced guide and example config.
  • Tests

    • Added unit tests for retention rounding/edge cases and filesystem tests for redirecting checkpoints and custom best-checkpoint locations.

@coderabbitai

coderabbitai Bot commented Jun 25, 2026

Copy link
Copy Markdown
Contributor

Review Change Stack

No actionable comments were generated in the recent review. 🎉

ℹ️ Recent review info
⚙️ Run configuration

Configuration used: Repository UI

Review profile: CHILL

Plan: Pro

Run ID: 868c797c-78e7-4f5a-aa6d-189ab65a291e

📥 Commits

Reviewing files that changed from the base of the PR and between b0a4b15 and bcc11fa.

📒 Files selected for processing (2)
  • deepmd/utils/argcheck.py
  • doc/train/training-advanced.md
✅ Files skipped from review due to trivial changes (1)
  • doc/train/training-advanced.md
🚧 Files skipped from review as they are similar to previous changes (1)
  • deepmd/utils/argcheck.py

📝 Walkthrough

Walkthrough

PyTorch training now supports configurable checkpoint output and retention, and full validation can write best checkpoints to a separate directory. Checkpoint path construction, symlink/pointer handling, configuration schema, docs, examples, and tests were updated.

Changes

Checkpoint path and retention updates

Layer / File(s) Summary
Utilities and config contract
deepmd/pt/train/utils.py, deepmd/utils/argcheck.py, doc/train/training-advanced.md, examples/water/dpa4/input.json, source/tests/pt/test_train_utils.py
New checkpoint helpers and training/validation config fields are added for save_dir, ckpt_keep_ratio, and save_best_dir, with docs, an example config, and helper tests updated.
Trainer retention setup
deepmd/pt/train/training.py, source/tests/pt/test_training.py
Trainer now resolves save_dir, derives the keep count from ckpt_keep_ratio after num_steps is known, and updates regular and EMA retention limits.
Best checkpoint directory wiring
deepmd/pt/train/training.py, deepmd/pt/train/validation.py, source/tests/pt/test_validation.py
resolve_best_checkpoint_dir is used for full-validation checkpoint directories, FullValidator creates the directory during initialization, and tests cover custom best-checkpoint locations.
Checkpoint save paths and symlinks
deepmd/pt/train/training.py, source/tests/pt/test_training.py
Periodic, final, and zero-step checkpoint writes now use latest_checkpoint_path(..., save_dir), and tests verify the resulting files and symlinks.

Sequence Diagram(s)

sequenceDiagram
  participant Trainer
  participant latest_checkpoint_path
  participant save_dir
  participant checkpoint_file
  Trainer->>latest_checkpoint_path: resolve prefix, step, and save_dir
  latest_checkpoint_path-->>Trainer: checkpoint path
  Trainer->>save_dir: write periodic checkpoint file
  Trainer->>checkpoint_file: update pointer to the resolved path
Loading
sequenceDiagram
  participant Trainer
  participant resolve_best_checkpoint_dir
  participant FullValidator
  participant checkpoint_dir
  Trainer->>resolve_best_checkpoint_dir: resolve validating.save_best_dir or save_ckpt parent
  resolve_best_checkpoint_dir-->>Trainer: checkpoint_dir
  Trainer->>FullValidator: create validator with checkpoint_dir
  FullValidator->>checkpoint_dir: mkdir(parents=True, exist_ok=True)
Loading

Estimated code review effort

🎯 4 (Complex) | ⏱️ ~60 minutes

Possibly related PRs

Suggested reviewers

  • njzjz
  • wanghan-iapcm
  • iProzd
🚥 Pre-merge checks | ✅ 4 | ❌ 1

❌ Failed checks (1 warning)

Check name Status Explanation Resolution
Docstring Coverage ⚠️ Warning Docstring coverage is 35.71% which is insufficient. The required threshold is 80.00%. Write docstrings for the functions missing them to satisfy the coverage threshold.
✅ Passed checks (4 passed)
Check name Status Explanation
Description Check ✅ Passed Check skipped - CodeRabbit’s high-level summary is enabled.
Title check ✅ Passed The title is concise and accurately summarizes the main change: customizable checkpoint save behavior in PyTorch training.
Linked Issues check ✅ Passed Check skipped because no linked issues were found for this pull request.
Out of Scope Changes check ✅ Passed Check skipped because no linked issues were found for this pull request.
✨ Finishing Touches
🧪 Generate unit tests (beta)
  • Create PR with unit tests

Thanks for using CodeRabbit! It's free for OSS, and your support helps us grow. If you like it, consider giving us a shout-out.

❤️ Share

Comment @coderabbitai help to get the list of available commands.

@coderabbitai coderabbitai Bot left a comment

Copy link
Copy Markdown
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Actionable comments posted: 4

🤖 Prompt for all review comments with AI agents
Verify each finding against current code. Fix only still-valid issues, skip the
rest with a brief reason, keep changes minimal, and validate.

Inline comments:
In `@deepmd/pt/train/utils.py`:
- Around line 283-284: The checkpoint retention calculation in the helper that
returns the keep count is undercounting because it ignores the final off-cadence
checkpoint written by Trainer.run() at num_steps. Update the logic around
total_periodic_ckpts/ckpt_keep_ratio so it accounts for the extra terminal
checkpoint (for example, by including the final step in the total when num_steps
is not an exact multiple of save_freq), and keep the existing max(1, ...)
safeguards intact.

In `@examples/water/dpa4/input.json`:
- Around line 123-124: The save_best_dir setting is unused in this example
because the validation path that triggers best-checkpoint saving is never
enabled. Update the input in this example by either turning on the
validating.full_validation flow so ckpt_best can be created, or remove the
save_best_dir field from the example to avoid misleading users; make the change
in the example configuration where tf32_infer and save_best_dir are defined.

In `@source/tests/pt/test_training.py`:
- Around line 967-968: Add the standard training test timeout guard to the new
validation test so it cannot hang CI; decorate
test_full_validation_save_best_dir with `@TRAINING_TEST_TIMEOUT` alongside the
existing `@patch` on FullValidator.evaluate_all_systems, matching the pattern used
by other training tests that call trainer.run().
- Around line 1228-1231: The checkpoint alias test is incorrectly asserting that
the prefix files are symlinks, which breaks on platforms where
symlink_prefix_files() falls back to copying. Update the test in the
checkpoint-saving area to validate the alias by checking that the Path resolves
to the expected target file, without requiring is_symlink(), using the existing
save_ckpt and ema_save_ckpt references so the test remains cross-platform.
🪄 Autofix (Beta)

Fix all unresolved CodeRabbit comments on this PR:

  • Push a commit to this branch (recommended)
  • Create a new PR with the fixes

ℹ️ Review info
⚙️ Run configuration

Configuration used: Repository UI

Review profile: CHILL

Plan: Pro

Run ID: 40ff16a5-1016-4465-ba48-a71de2e87b50

📥 Commits

Reviewing files that changed from the base of the PR and between 5733301 and fc780b1.

📒 Files selected for processing (9)
  • deepmd/pt/train/training.py
  • deepmd/pt/train/utils.py
  • deepmd/pt/train/validation.py
  • deepmd/utils/argcheck.py
  • doc/train/training-advanced.md
  • examples/water/dpa4/input.json
  • source/tests/pt/test_train_utils.py
  • source/tests/pt/test_training.py
  • source/tests/pt/test_validation.py

Comment thread deepmd/pt/train/utils.py Outdated
Comment thread examples/water/dpa4/input.json
Comment thread source/tests/pt/test_training.py
Comment thread source/tests/pt/test_training.py Outdated

@njzjz-bot njzjz-bot left a comment

Copy link
Copy Markdown
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Thanks for adding the checkpoint save directory and ratio-based retention knobs. I found a few issues worth fixing before merge:

  1. ckpt_keep_ratio currently under-counts when the final checkpoint is off-cadence. In resolve_keep_ckpt_count(), total_periodic_ckpts = num_steps // save_freq ignores the final checkpoint that Trainer.run() still writes when num_steps % save_freq != 0. For example, num_steps=5, save_freq=2, ckpt_keep_ratio=0.5 produces checkpoints at steps 2, 4, and 5, but the helper returns ceil(0.5 * (5 // 2)) = 1; the documented formula ceil(ckpt_keep_ratio * numb_steps / save_freq) would keep 2. Please account for the terminal checkpoint, e.g. use ceil(num_steps / save_freq) (with the existing minimum-one guard).

  2. The new save_best_dir in examples/water/dpa4/input.json is misleading unless full validation is enabled. Since validating.full_validation defaults to false, ckpt_best will not actually be used by this example. Either enable the full-validation flow in the example or omit save_best_dir there.

  3. The new test_save_dir_redirects_checkpoints_with_local_symlinks assumes Path(...).is_symlink(), but symlink_prefix_files() copies files on Windows. If these tests are expected to be portable, please avoid requiring symlinks in the assertion (or explicitly scope the test/docs to non-Windows behavior).

Reviewed by OpenClaw 2026.6.8 (model: custom-chat-jinzhezeng-group/gpt-5.5).

@codecov

codecov Bot commented Jun 25, 2026

Copy link
Copy Markdown

Codecov Report

❌ Patch coverage is 93.18182% with 3 lines in your changes missing coverage. Please review.
✅ Project coverage is 82.30%. Comparing base (5733301) to head (bcc11fa).
⚠️ Report is 4 commits behind head on master.

Files with missing lines Patch % Lines
deepmd/pt/train/training.py 88.00% 3 Missing ⚠️
Additional details and impacted files
@@            Coverage Diff             @@
##           master    #5589      +/-   ##
==========================================
+ Coverage   82.27%   82.30%   +0.02%     
==========================================
  Files         887      887              
  Lines      100331   100481     +150     
  Branches     4060     4060              
==========================================
+ Hits        82550    82700     +150     
+ Misses      16320    16317       -3     
- Partials     1461     1464       +3     

☔ View full report in Codecov by Harness.
📢 Have feedback on the report? Share it here.

🚀 New features to boost your workflow:
  • ❄️ Test Analytics: Detect flaky tests, report on failures, and find test suite problems.
  • 📦 JS Bundle Analysis: Save yourself from yourself by tracking and limiting bundle sizes in JS merges.

@wanghan-iapcm

Copy link
Copy Markdown
Collaborator

Non-blocking doc nit: the documented ckpt_keep_ratio formula doesn't match the code.

Both the argcheck help text and the user doc state a single ceil:

ceil(ckpt_keep_ratio * numb_steps / save_freq)

but resolve_keep_ckpt_count computes a nested ceil — ceil(ratio * ceil(num_steps / save_freq)):

return None
total_ckpts = max(1, ceil(num_steps / save_freq))
return max(1, ceil(ckpt_keep_ratio * total_ckpts))

These differ whenever numb_steps isn't a multiple of save_freq. E.g. numb_steps=11, save_freq=10, ratio=0.9: the code keeps 2, the documented formula gives 1. The code is correct; only the docs are off.

Suggest updating both to the nested form ceil(ckpt_keep_ratio * ceil(numb_steps / save_freq)) (the resolve_keep_ckpt_count docstring already states it correctly):

doc_ckpt_keep_ratio = (
"An alternative to `max_ckpt_keep` that sets the number of retained "
"checkpoints as a fraction in (0, 1) of the run: the most recent "
"`ceil(ckpt_keep_ratio * numb_steps / save_freq)` checkpoints are kept. "
"When set, it overrides `max_ckpt_keep` and `ema_ckpt_keep`."
)

- {ref}`ckpt_keep_ratio <training/ckpt_keep_ratio>` An alternative to `max_ckpt_keep` (PyTorch backend) that keeps a sliding window of `ceil(ckpt_keep_ratio * numb_steps / save_freq)` most recent checkpoints, i.e. the final `ckpt_keep_ratio` fraction of the run by step. It overrides `max_ckpt_keep` (and `ema_ckpt_keep`) when set, and works the same whether the run length is given by `numb_steps` or `numb_epoch`.

@wanghan-iapcm wanghan-iapcm left a comment

Copy link
Copy Markdown
Collaborator

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

see my non-blocking comment

@OutisLi

OutisLi commented Jun 26, 2026

Copy link
Copy Markdown
Collaborator Author

see my non-blocking comment

done

@OutisLi OutisLi requested a review from wanghan-iapcm June 26, 2026 14:33
Comment thread examples/water/dpa4/input.json
@OutisLi OutisLi added this pull request to the merge queue Jun 28, 2026
Merged via the queue into deepmodeling:master with commit a9bcbc5 Jun 28, 2026
70 checks passed
@OutisLi OutisLi deleted the pr/save branch June 28, 2026 07:59
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment

Projects

None yet

Development

Successfully merging this pull request may close these issues.

3 participants