feat(pt): add custom save behaviors#5589
Conversation
|
No actionable comments were generated in the recent review. 🎉 ℹ️ Recent review info⚙️ Run configurationConfiguration used: Repository UI Review profile: CHILL Plan: Pro Run ID: 📒 Files selected for processing (2)
✅ Files skipped from review due to trivial changes (1)
🚧 Files skipped from review as they are similar to previous changes (1)
📝 WalkthroughWalkthroughPyTorch training now supports configurable checkpoint output and retention, and full validation can write best checkpoints to a separate directory. Checkpoint path construction, symlink/pointer handling, configuration schema, docs, examples, and tests were updated. ChangesCheckpoint path and retention updates
Sequence Diagram(s)sequenceDiagram
participant Trainer
participant latest_checkpoint_path
participant save_dir
participant checkpoint_file
Trainer->>latest_checkpoint_path: resolve prefix, step, and save_dir
latest_checkpoint_path-->>Trainer: checkpoint path
Trainer->>save_dir: write periodic checkpoint file
Trainer->>checkpoint_file: update pointer to the resolved path
sequenceDiagram
participant Trainer
participant resolve_best_checkpoint_dir
participant FullValidator
participant checkpoint_dir
Trainer->>resolve_best_checkpoint_dir: resolve validating.save_best_dir or save_ckpt parent
resolve_best_checkpoint_dir-->>Trainer: checkpoint_dir
Trainer->>FullValidator: create validator with checkpoint_dir
FullValidator->>checkpoint_dir: mkdir(parents=True, exist_ok=True)
Estimated code review effort🎯 4 (Complex) | ⏱️ ~60 minutes Possibly related PRs
Suggested reviewers
🚥 Pre-merge checks | ✅ 4 | ❌ 1❌ Failed checks (1 warning)
✅ Passed checks (4 passed)
✨ Finishing Touches🧪 Generate unit tests (beta)
Thanks for using CodeRabbit! It's free for OSS, and your support helps us grow. If you like it, consider giving us a shout-out. Comment |
There was a problem hiding this comment.
Actionable comments posted: 4
🤖 Prompt for all review comments with AI agents
Verify each finding against current code. Fix only still-valid issues, skip the
rest with a brief reason, keep changes minimal, and validate.
Inline comments:
In `@deepmd/pt/train/utils.py`:
- Around line 283-284: The checkpoint retention calculation in the helper that
returns the keep count is undercounting because it ignores the final off-cadence
checkpoint written by Trainer.run() at num_steps. Update the logic around
total_periodic_ckpts/ckpt_keep_ratio so it accounts for the extra terminal
checkpoint (for example, by including the final step in the total when num_steps
is not an exact multiple of save_freq), and keep the existing max(1, ...)
safeguards intact.
In `@examples/water/dpa4/input.json`:
- Around line 123-124: The save_best_dir setting is unused in this example
because the validation path that triggers best-checkpoint saving is never
enabled. Update the input in this example by either turning on the
validating.full_validation flow so ckpt_best can be created, or remove the
save_best_dir field from the example to avoid misleading users; make the change
in the example configuration where tf32_infer and save_best_dir are defined.
In `@source/tests/pt/test_training.py`:
- Around line 967-968: Add the standard training test timeout guard to the new
validation test so it cannot hang CI; decorate
test_full_validation_save_best_dir with `@TRAINING_TEST_TIMEOUT` alongside the
existing `@patch` on FullValidator.evaluate_all_systems, matching the pattern used
by other training tests that call trainer.run().
- Around line 1228-1231: The checkpoint alias test is incorrectly asserting that
the prefix files are symlinks, which breaks on platforms where
symlink_prefix_files() falls back to copying. Update the test in the
checkpoint-saving area to validate the alias by checking that the Path resolves
to the expected target file, without requiring is_symlink(), using the existing
save_ckpt and ema_save_ckpt references so the test remains cross-platform.
🪄 Autofix (Beta)
Fix all unresolved CodeRabbit comments on this PR:
- Push a commit to this branch (recommended)
- Create a new PR with the fixes
ℹ️ Review info
⚙️ Run configuration
Configuration used: Repository UI
Review profile: CHILL
Plan: Pro
Run ID: 40ff16a5-1016-4465-ba48-a71de2e87b50
📒 Files selected for processing (9)
deepmd/pt/train/training.pydeepmd/pt/train/utils.pydeepmd/pt/train/validation.pydeepmd/utils/argcheck.pydoc/train/training-advanced.mdexamples/water/dpa4/input.jsonsource/tests/pt/test_train_utils.pysource/tests/pt/test_training.pysource/tests/pt/test_validation.py
njzjz-bot
left a comment
There was a problem hiding this comment.
Thanks for adding the checkpoint save directory and ratio-based retention knobs. I found a few issues worth fixing before merge:
-
ckpt_keep_ratiocurrently under-counts when the final checkpoint is off-cadence. Inresolve_keep_ckpt_count(),total_periodic_ckpts = num_steps // save_freqignores the final checkpoint thatTrainer.run()still writes whennum_steps % save_freq != 0. For example,num_steps=5,save_freq=2,ckpt_keep_ratio=0.5produces checkpoints at steps 2, 4, and 5, but the helper returnsceil(0.5 * (5 // 2)) = 1; the documented formulaceil(ckpt_keep_ratio * numb_steps / save_freq)would keep 2. Please account for the terminal checkpoint, e.g. useceil(num_steps / save_freq)(with the existing minimum-one guard). -
The new
save_best_dirinexamples/water/dpa4/input.jsonis misleading unless full validation is enabled. Sincevalidating.full_validationdefaults to false,ckpt_bestwill not actually be used by this example. Either enable the full-validation flow in the example or omitsave_best_dirthere. -
The new
test_save_dir_redirects_checkpoints_with_local_symlinksassumesPath(...).is_symlink(), butsymlink_prefix_files()copies files on Windows. If these tests are expected to be portable, please avoid requiring symlinks in the assertion (or explicitly scope the test/docs to non-Windows behavior).
Reviewed by OpenClaw 2026.6.8 (model: custom-chat-jinzhezeng-group/gpt-5.5).
Codecov Report❌ Patch coverage is
Additional details and impacted files@@ Coverage Diff @@
## master #5589 +/- ##
==========================================
+ Coverage 82.27% 82.30% +0.02%
==========================================
Files 887 887
Lines 100331 100481 +150
Branches 4060 4060
==========================================
+ Hits 82550 82700 +150
+ Misses 16320 16317 -3
- Partials 1461 1464 +3 ☔ View full report in Codecov by Harness. 🚀 New features to boost your workflow:
|
|
Non-blocking doc nit: the documented Both the argcheck help text and the user doc state a single ceil:
but deepmd-kit/deepmd/pt/train/utils.py Lines 283 to 285 in b0a4b15 These differ whenever Suggest updating both to the nested form deepmd-kit/deepmd/utils/argcheck.py Lines 4997 to 5002 in b0a4b15 deepmd-kit/doc/train/training-advanced.md Line 107 in b0a4b15 |
wanghan-iapcm
left a comment
There was a problem hiding this comment.
see my non-blocking comment
done |
Summary by CodeRabbit
New Features
training.save_dirfor periodic checkpoints andvalidating.save_best_dirfor best-validation checkpoints.training.ckpt_keep_ratiofor ratio-based sliding-window checkpoint retention.Bug Fixes
save_diris set.Documentation
save_dirandckpt_keep_ratio; updated the training advanced guide and example config.Tests