Add opt-in lazy pruning to CloudPath.walk #574

Merged: pjbull merged 6 commits into master from pjbull-iterative-walk-518 on Jun 28, 2026

@pjbull (Member) commented Jun 27, 2026

Closes #518

Summary

Adds a lazy keyword argument to CloudPath.walk so callers can opt into os.walk-style in-place pruning of dirnames to skip fetching subtrees they don't need.

  • lazy=False (default): unchanged, fast behavior. The whole subtree is fetched with a single recursive listing, then walked. Pruning dirnames still controls which directories are yielded, but it saves no requests because the tree has already been fetched.
  • lazy=True: lists each directory on demand. When top_down=True, mutating dirnames in place prunes those subtrees so their contents are never listed — matching os.walk / pathlib.Path.walk and avoiding network requests for large, sparsely-traversed trees.
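
The pruning idiom itself is the standard os.walk one. As a minimal sketch that runs without any cloud credentials, the example below demonstrates it with os.walk on a temporary local directory; the helper name walk_pruned and the directory names are made up for illustration.

```python
import os
import tempfile


def walk_pruned(root, skip):
    """Collect file names under root, never descending into directories named in skip."""
    seen = []
    for dirpath, dirnames, filenames in os.walk(root, topdown=True):
        # Slice assignment mutates the list the walker holds on to.
        # Rebinding dirnames to a new list would not prune anything.
        dirnames[:] = [d for d in dirnames if d not in skip]
        seen.extend(filenames)
    return sorted(seen)


with tempfile.TemporaryDirectory() as root:
    # Build root/keep/keep.txt and root/skip_me/skip_me.txt.
    for sub in ("keep", "skip_me"):
        os.makedirs(os.path.join(root, sub))
        with open(os.path.join(root, sub, f"{sub}.txt"), "w") as fh:
            fh.write("x")
    result = walk_pruned(root, skip={"skip_me"})

print(result)  # ['keep.txt']
```

Per the description above, the same loop body should apply to `some_cloud_dir.walk(top_down=True, lazy=True)`, where some_cloud_dir stands for any CloudPath pointing at a directory; in that mode the pruned subtrees are also never listed.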

The lazy implementation is iterative (explicit stack), mirroring CPython's pathlib.Path.walk, to avoid recursion limits and per-level yield from overhead.
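
To illustrate the explicit-stack shape, here is a rough sketch over an in-memory tree. It is not the code in this PR: TREE, lazy_walk, and counting_list_dir are hypothetical stand-ins, and list_dir plays the role of one non-recursive listing request.

```python
# Hypothetical in-memory "bucket": directory -> (subdirectory names, file names).
TREE = {
    "root": (["a", "b"], ["r.txt"]),
    "root/a": (["deep"], ["a.txt"]),
    "root/a/deep": ([], ["d.txt"]),
    "root/b": ([], ["b.txt"]),
}


def lazy_walk(top, list_dir, top_down=True):
    """Walk with an explicit stack, listing each directory only when it is reached."""
    stack = [top]
    while stack:
        item = stack.pop()
        if isinstance(item, tuple):
            # A deferred (dirpath, dirnames, filenames) entry for bottom-up order.
            yield item
            continue
        dirnames, filenames = list_dir(item)  # one listing per visited directory
        dirnames, filenames = list(dirnames), list(filenames)  # copies the caller may mutate
        entry = (item, dirnames, filenames)
        if top_down:
            yield entry  # the caller can prune dirnames before children are pushed
        else:
            stack.append(entry)  # popped, and so yielded, after all descendants
        # Reversed so that children are popped in listing order.
        stack.extend(f"{item}/{d}" for d in reversed(dirnames))


calls = []


def counting_list_dir(path):
    calls.append(path)
    return TREE[path]


visited = []
for dirpath, dirnames, filenames in lazy_walk("root", counting_list_dir):
    visited.append(dirpath)
    if dirpath == "root":
        dirnames[:] = [d for d in dirnames if d != "a"]  # prune the "a" subtree

print(visited)  # ['root', 'root/b']
print(calls)    # ['root', 'root/b']
```

Because children are pushed only after the caller has seen the entry, a pruned directory never reaches the stack, so neither "root/a" nor "root/a/deep" is ever listed.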

Why opt-in rather than always lazy

An always-lazy walk regressed full traversals by 8-28x on real backends because it makes one sequential listing request per directory. Keeping the eager walk as the default preserves performance for the common "walk everything" case while still offering the pruning win as an opt-in.
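
For contrast, a simplified model of the eager mode: one recursive listing returns every key, the tree is rebuilt locally, and pruning only changes what is yielded. This is a sketch under assumptions, not cloudpathlib's implementation; KEYS, eager_walk, and recursive_list are hypothetical names, and keys are assumed to be "/"-separated and to start with the top prefix.

```python
from collections import defaultdict

# Hypothetical flat result of a single recursive listing (file keys only).
KEYS = ["root/r.txt", "root/a/a.txt", "root/a/deep/d.txt", "root/b/b.txt"]


def eager_walk(top, recursive_list):
    """Fetch everything under top with one call, then walk the local tree."""
    keys = recursive_list(top)  # the only request made
    dirs = defaultdict(set)
    files = defaultdict(list)
    for key in keys:
        parts = key.split("/")
        # Register each ancestor directory between top and the file.
        for i in range(1, len(parts) - 1):
            dirs["/".join(parts[:i])].add(parts[i])
        files["/".join(parts[:-1])].append(parts[-1])
    stack = [top]
    while stack:
        current = stack.pop()
        dirnames = sorted(dirs[current])
        yield current, dirnames, sorted(files[current])
        stack.extend(f"{current}/{d}" for d in reversed(dirnames))


requests = []


def recursive_list(prefix):
    requests.append(prefix)
    return [k for k in KEYS if k.startswith(prefix + "/")]


yielded = []
for dirpath, dirnames, filenames in eager_walk("root", recursive_list):
    yielded.append(dirpath)
    dirnames[:] = [d for d in dirnames if d != "a"]  # affects output only

print(yielded)        # ['root', 'root/b']
print(len(requests))  # 1
```

The request count stays at one whether or not the caller prunes, which is why the eager mode is fast for full traversals and gains nothing from pruning.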

Performance (live S3, performance_tests/5_100_12, 7272 files, ~606 dirs)

| Mode | Mean time | Directory list calls | Files visited |
|---|---|---|---|
| walk() default (eager) | 1.24s | 1 | 7272 |
| walk(lazy=True) no pruning | 33.19s | 606 | 7272 |
| walk(lazy=True) + pruning | 0.36s | 6 | 72 |

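The default matches pre-change performance (single recursive listing). lazy=True without pruning is about 27x slower than the default on this tree; with pruning it collapses 606 list calls to 6 and visits 72 files instead of 7272.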

Notes

  • Raise-on-error remains the default (cloudpathlib raises by default, whereas pathlib ignores errors); on_error is honored in both modes.
  • follow_symlinks is accepted for signature parity but ignored (no cloud symlinks).
  • Azure mock updated to yield BlobPrefix for directories in non-recursive listings, matching the real SDK.
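
The on_error note can be pictured with a small sketch. It assumes the usual convention that on_error=None means "re-raise" and that a callback receives the exception instance; list_with_on_error and failing_list_dir are hypothetical helpers, not cloudpathlib functions.

```python
def list_with_on_error(path, list_dir, on_error=None):
    """Return list_dir(path), or None if listing failed and on_error handled it."""
    try:
        return list_dir(path)
    except Exception as exc:
        if on_error is None:
            raise  # assumed default: propagate the listing error
        on_error(exc)  # the caller decides what to do with the error
        return None


def failing_list_dir(path):
    raise PermissionError(f"cannot list {path}")


# With a callback, the error is reported and the walk could continue.
errors = []
handled = list_with_on_error("root/private", failing_list_dir, on_error=errors.append)
print(handled, len(errors))  # None 1

# Without a callback, the error propagates to the caller.
try:
    list_with_on_error("root/private", failing_list_dir)
    raised = False
except PermissionError:
    raised = True
print(raised)  # True
```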

Testing

  • Comprehensive walk tests covering both modes across all rigs: lazy-vs-eager equivalence, lazy pruning skips both the yield and the listing, eager pruning affects output only, self-skip, and on_error behavior (parametrized over lazy).
  • Full suite passes locally (1089 passed, 5 skipped); black/flake8/mypy clean.
  • Live S3 perf measured before/after (table above).

Replace the pre-built tree approach with an on-demand, per-directory
listing using _list_dir(recursive=False). When top_down=True, callers
can modify dirnames in-place to prune branches—matching standard
os.walk/Path.walk behavior—and avoid fetching subtrees that are never
visited.

Changes:
- Remove _walk_results_from_tree and rewrite walk() via a new _walk()
  helper that recurses lazily one directory level at a time.
- Fix mock_azureblob.mock_item_paged to yield BlobPrefix for
  subdirectories in non-recursive listings, matching real Azure SDK
  walk_blobs() behavior.
- Add test_walk_dirnames_pruning to verify that pruned directories are
  never visited.

Co-authored-by: Copilot <223556219+Copilot@users.noreply.github.com>

codecov Bot commented Jun 27, 2026

Codecov Report

✅ All modified and coverable lines are covered by tests.
✅ Project coverage is 94.2%. Comparing base (35f88db) to head (98a071a).

Additional details and impacted files
@@           Coverage Diff            @@
##           master    #574     +/-   ##
========================================
+ Coverage    94.0%   94.2%   +0.2%     
========================================
  Files          28      28             
  Lines        2232    2259     +27     
========================================
+ Hits         2100    2130     +30     
+ Misses        132     129      -3     
Files with missing lines Coverage Δ
cloudpathlib/cloudpath.py 95.2% <100.0%> (+0.6%) ⬆️

…ror test

- Use _lstrip_path_root + Path.as_posix() instead of raw str() in
  test_walk_dirnames_pruning so Windows backslashes don't cause mismatches.
- Add test_walk_on_error that uses patch.object to simulate _list_dir
  failures, covering the on_error callback and bare-raise paths.

Co-authored-by: Copilot <223556219+Copilot@users.noreply.github.com>
…ver dir

Prevents PermissionError on Windows when the server thread still holds a
file handle at teardown time. Each test now only removes its own subdir
(server_dir / test_dir); the module-scoped http_server/https_server
fixtures remain responsible for final server_dir cleanup.

Fixes intermittent test_close_file_idempotent[/https_rig] errors on Windows CI.

Co-authored-by: Copilot <223556219+Copilot@users.noreply.github.com>
Replaces the recursive _walk helper with an explicit-stack implementation
modeled on CPython's pathlib.Path.walk. This avoids RecursionError on deeply
nested trees and the per-level 'yield from' overhead, while preserving lazy
one-directory-at-a-time listing and in-place dirnames pruning (issue #518).

Adds test_walk_skips_self to cover the 'directory lists itself' guard.

Co-authored-by: Copilot <223556219+Copilot@users.noreply.github.com>
…ssion

The always-lazy walk regressed full traversals 8-28x on real backends
because it made one sequential listing per directory. Keep the fast
eager walk (single recursive listing) as the default and gate the
pruning-capable per-directory listing behind lazy=True so callers
choose the tradeoff.

Co-authored-by: Copilot <223556219+Copilot@users.noreply.github.com>
@pjbull changed the title from "Make CloudPath.walk iterative for sparse traversal" to "Add opt-in lazy pruning to CloudPath.walk" Jun 27, 2026
@pjbull pjbull merged commit 5ed46d3 into master Jun 28, 2026
51 of 52 checks passed
@pjbull pjbull deleted the pjbull-iterative-walk-518 branch June 28, 2026 02:29

Linked issue (#518): Iterative walk for sparsely traversing large nested directories