Add opt-in lazy pruning to CloudPath.walk #574
Merged
Replace the pre-built tree approach with an on-demand, per-directory listing using _list_dir(recursive=False). When top_down=True, callers can modify dirnames in place to prune branches, matching standard os.walk/Path.walk behavior, and avoid fetching subtrees that are never visited.

Changes:
- Remove _walk_results_from_tree and rewrite walk() via a new _walk() helper that recurses lazily one directory level at a time.
- Fix mock_azureblob.mock_item_paged to yield BlobPrefix for subdirectories in non-recursive listings, matching real Azure SDK walk_blobs() behavior.
- Add test_walk_dirnames_pruning to verify that pruned directories are never visited.

Co-authored-by: Copilot <223556219+Copilot@users.noreply.github.com>
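The pruning convention referred to above is the one `os.walk` already follows on local filesystems. As a minimal sketch of that reference behavior (using `os.walk` on a temporary local tree, not cloudpathlib itself; the helper name `walk_pruned` and the directory names are made up for illustration):

```python
import os
import tempfile
from pathlib import Path


def walk_pruned(root, skip_names):
    """Return visited directories (relative, POSIX-style), pruning skip_names."""
    visited = []
    for dirpath, dirnames, _filenames in os.walk(root, topdown=True):
        visited.append(Path(dirpath).relative_to(root).as_posix())
        # Slice assignment mutates the list the walker holds; rebinding the
        # name (dirnames = [...]) would not prune anything.
        dirnames[:] = [d for d in dirnames if d not in skip_names]
    return visited


with tempfile.TemporaryDirectory() as tmp:
    root = Path(tmp)
    (root / "keep" / "inner").mkdir(parents=True)
    (root / "skip" / "deep").mkdir(parents=True)
    visited = sorted(walk_pruned(root, {"skip"}))
    print(visited)  # ['.', 'keep', 'keep/inner']
```

Neither `skip` nor `skip/deep` is ever entered, which is the property test_walk_dirnames_pruning is described as checking for cloud paths.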
Codecov Report
✅ All modified and coverable lines are covered by tests.
Additional details and impacted files
@@ Coverage Diff @@
## master #574 +/- ##
========================================
+ Coverage 94.0% 94.2% +0.2%
========================================
Files 28 28
Lines 2232 2259 +27
========================================
+ Hits 2100 2130 +30
+ Misses 132 129 -3
…ror test - Use _lstrip_path_root + Path.as_posix() instead of raw str() in test_walk_dirnames_pruning so Windows backslashes don't cause mismatches. - Add test_walk_on_error that uses patch.object to simulate _list_dir failures, covering the on_error callback and bare-raise paths. Co-authored-by: Copilot <223556219+Copilot@users.noreply.github.com>
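A rough sketch of how such an error test can be structured with `unittest.mock.patch.object`. The `FakeDir` class below is a hypothetical stand-in, not cloudpathlib code, and the choice of `OSError` is an assumption; the commit message does not say which exception type the real test raises.

```python
from unittest.mock import patch


class FakeDir:
    """Hypothetical stand-in for a cloud directory with a listing method."""

    def _list_dir(self):
        return ["a.txt"]

    def walk(self, on_error=None):
        try:
            entries = self._list_dir()
        except OSError as exc:  # assumption: listing failures surface as OSError
            if on_error is None:
                raise  # bare raise: no handler, so the error propagates
            on_error(exc)
            return
        yield self, [], entries


errors = []
with patch.object(FakeDir, "_list_dir", side_effect=OSError("boom")):
    # With a callback, the failure is reported and the walk yields nothing.
    results = list(FakeDir().walk(on_error=errors.append))
    # Without a callback, the same failure escapes to the caller.
    try:
        list(FakeDir().walk())
        raised = False
    except OSError:
        raised = True
```

Patching the method on the class rather than an instance means every object created inside the `with` block sees the failing listing.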
…ver dir Prevents PermissionError on Windows when the server thread still holds a file handle at teardown time. Each test now only removes its own subdir (server_dir / test_dir); the module-scoped http_server/https_server fixtures remain responsible for final server_dir cleanup. Fixes intermittent test_close_file_idempotent[/https_rig] errors on Windows CI. Co-authored-by: Copilot <223556219+Copilot@users.noreply.github.com>
Replaces the recursive _walk helper with an explicit-stack implementation modeled on CPython's pathlib.Path.walk. This avoids RecursionError on deeply nested trees and the per-level 'yield from' overhead, while preserving lazy one-directory-at-a-time listing and in-place dirnames pruning (issue #518). Adds test_walk_skips_self to cover the 'directory lists itself' guard. Co-authored-by: Copilot <223556219+Copilot@users.noreply.github.com>
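The explicit-stack shape can be sketched over an in-memory tree. This is an illustration of the technique only, not the merged code: `TREE`, `list_dir`, and `lazy_walk` are hypothetical names, keys are plain strings rather than CloudPath objects, and error handling is left out.

```python
from typing import Dict, Iterator, List, Tuple

# Hypothetical "bucket": directory key -> (subdirectory names, file names).
TREE: Dict[str, Tuple[List[str], List[str]]] = {
    "": (["a", "b"], ["root.txt"]),
    "a": (["x"], ["a1.txt"]),
    "a/x": ([], ["deep.txt"]),
    "b": ([], ["b1.txt"]),
}

list_calls: List[str] = []  # records every directory that was listed


def list_dir(key: str) -> Tuple[List[str], List[str]]:
    """Stand-in for one non-recursive listing request."""
    list_calls.append(key)
    dirs, files = TREE[key]
    return list(dirs), list(files)  # copies, so pruning never mutates TREE


def lazy_walk(top: str, top_down: bool = True) -> Iterator[Tuple[str, List[str], List[str]]]:
    stack: list = [top]
    while stack:
        item = stack.pop()
        if isinstance(item, tuple):
            yield item  # deferred bottom-up result
            continue
        dirnames, filenames = list_dir(item)
        if top_down:
            # The caller may mutate dirnames before the generator resumes.
            yield item, dirnames, filenames
        else:
            stack.append((item, dirnames, filenames))
        # Reversed so children are popped in listing order.
        stack.extend(f"{item}/{d}" if item else d for d in reversed(dirnames))


visited = []
for dirpath, dirnames, filenames in lazy_walk(""):
    visited.append(dirpath)
    if "a" in dirnames:
        dirnames.remove("a")  # prune: "a" and "a/x" are never listed

print(visited)     # ['', 'b']
print(list_calls)  # ['', 'b']
```

Because children are pushed only after the `yield` returns, whatever the caller removed from `dirnames` never reaches the stack, so no listing is issued for it. In bottom-up mode the directory's own result is pushed beneath its children, so it is yielded after them.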
…ssion The always-lazy walk regressed full traversals 8-28x on real backends because it made one sequential listing per directory. Keep the fast eager walk (single recursive listing) as the default and gate the pruning-capable per-directory listing behind lazy=True so callers choose the tradeoff. Co-authored-by: Copilot <223556219+Copilot@users.noreply.github.com>
Closes #518
Summary
Adds a `lazy` keyword argument to `CloudPath.walk` so callers can opt into `os.walk`-style in-place pruning of `dirnames` to skip fetching subtrees they don't need.
- `lazy=False` (default): unchanged, fast behavior. The whole subtree is fetched with a single recursive listing, then walked. Pruning `dirnames` affects which directories are yielded, but the tree has already been fetched.
- `lazy=True`: lists each directory on demand. When `top_down=True`, mutating `dirnames` in place prunes those subtrees so their contents are never listed, matching `os.walk`/`pathlib.Path.walk` and avoiding network requests for large, sparsely traversed trees.

The lazy implementation is iterative (explicit stack), mirroring CPython's `pathlib.Path.walk`, to avoid recursion limits and per-level `yield from` overhead.

Why opt-in rather than always lazy
An always-lazy walk regressed full traversals badly on real backends because it makes one sequential listing request per directory. Keeping the eager walk as the default preserves performance for the common "walk everything" case while still exposing the pruning win as an opt-in.
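The tradeoff can be made concrete with a toy request-count model. It assumes a uniform synthetic tree and treats the eager walk as a single recursive listing request (real backends paginate large listings, which this ignores); `build_tree` and `lazy_list_calls` are hypothetical helpers, not part of cloudpathlib.

```python
from typing import Callable, Dict, List


def build_tree(fanout: int, depth: int) -> Dict[str, List[str]]:
    """Return {directory: [child directories]} for a uniform tree."""
    tree: Dict[str, List[str]] = {}
    frontier = ["root"]
    for _ in range(depth):
        next_frontier: List[str] = []
        for d in frontier:
            children = [f"{d}/{i}" for i in range(fanout)]
            tree[d] = children
            next_frontier.extend(children)
        frontier = next_frontier
    for d in frontier:
        tree[d] = []  # leaf directories have no children
    return tree


def lazy_list_calls(
    tree: Dict[str, List[str]],
    root: str,
    keep: Callable[[str], bool] = lambda d: True,
) -> int:
    """Count per-directory listings when only `keep`-approved children are visited."""
    calls = 0
    stack = [root]
    while stack:
        d = stack.pop()
        calls += 1  # one listing request per visited directory
        stack.extend(c for c in tree[d] if keep(c))
    return calls


tree = build_tree(fanout=3, depth=3)  # 1 + 3 + 9 + 27 = 40 directories
full = lazy_list_calls(tree, "root")
pruned = lazy_list_calls(tree, "root", keep=lambda d: d.endswith("/0"))
print(full, pruned)  # 40 4
```

In this model a full lazy traversal costs one request per directory, issued sequentially, while the eager walk costs one request in total. Lazy listing only comes out ahead when pruning removes most of the tree.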
Performance (live S3, `performance_tests/5_100_12`, 7272 files, ~606 dirs)
- `walk()` default (eager)
- `walk(lazy=True)`, no pruning
- `walk(lazy=True)` + pruning

The default matches pre-change performance (single recursive listing). `lazy=True` with pruning collapses 606 list calls to 6.

Notes
- `on_error` is honored in both modes.
- `follow_symlinks` is accepted for signature parity but ignored (no cloud symlinks).
- `mock_azureblob.mock_item_paged` now yields `BlobPrefix` for directories in non-recursive listings, matching the real SDK.

Testing
- Tests cover `on_error` behavior (parametrized over `lazy`).
- `black`/`flake8`/`mypy` clean.