huggingface: add bucket scanning #5017

Open
julien-c wants to merge 8 commits into trufflesecurity:main from julien-c:huggingface-buckets

@julien-c julien-c commented Jun 4, 2026


HF recently shipped storage buckets (Xet-backed object storage)

  • new --bucket <namespace/name> flag, plus --include-buckets / --ignore-buckets / --skip-all-buckets; --org / --user scans pick up buckets automatically
  • buckets aren't git repos, so this is a separate scan path modeled on the S3 source: list via the tree API, download, chunk through handlers.HandleFile
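For reviewers skimming the scan path, here is a minimal sketch of the list, download, handle loop described above. It is not the code in this PR: `bucketFile`, `bucketSource`, `scanBucket`, `memSource`, and `bodySizes` are hypothetical names, and the `handle` callback only stands in for the chunking step (the real `handlers.HandleFile` signature is not reproduced here).

```go
package main

import (
	"fmt"
	"io"
	"sort"
	"strings"
)

// bucketFile is a hypothetical listing entry, not the PR's actual type.
type bucketFile struct {
	Path string
}

// bucketSource is a hypothetical stand-in for the HF API client.
type bucketSource interface {
	ListFiles(bucket string) ([]bucketFile, error)
	Download(bucket, path string) (io.ReadCloser, error)
}

// scanBucket lists every file in the bucket, downloads each one, and hands the
// body to handle. handle is a placeholder for the chunking step.
func scanBucket(src bucketSource, bucket string, handle func(path string, r io.Reader) error) error {
	files, err := src.ListFiles(bucket)
	if err != nil {
		return fmt.Errorf("listing %s: %w", bucket, err)
	}
	for _, f := range files {
		body, err := src.Download(bucket, f.Path)
		if err != nil {
			return fmt.Errorf("downloading %s: %w", f.Path, err)
		}
		err = handle(f.Path, body)
		body.Close() // close before the next download, whether or not handle failed
		if err != nil {
			return fmt.Errorf("handling %s: %w", f.Path, err)
		}
	}
	return nil
}

// memSource is an in-memory fake bucket (path -> contents) used for illustration.
type memSource map[string]string

func (m memSource) ListFiles(string) ([]bucketFile, error) {
	paths := make([]string, 0, len(m))
	for p := range m {
		paths = append(paths, p)
	}
	sort.Strings(paths) // deterministic order
	files := make([]bucketFile, len(paths))
	for i, p := range paths {
		files[i] = bucketFile{Path: p}
	}
	return files, nil
}

func (m memSource) Download(_ string, path string) (io.ReadCloser, error) {
	body, ok := m[path]
	if !ok {
		return nil, fmt.Errorf("no such file: %s", path)
	}
	return io.NopCloser(strings.NewReader(body)), nil
}

// bodySizes runs scanBucket with a handler that just counts bytes per file.
func bodySizes(src bucketSource, bucket string) (map[string]int64, error) {
	sizes := map[string]int64{}
	err := scanBucket(src, bucket, func(path string, r io.Reader) error {
		n, err := io.Copy(io.Discard, r)
		sizes[path] = n
		return err
	})
	return sizes, err
}

func main() {
	sizes, err := bodySizes(memSource{"a.txt": "hello", "conf/.env": "KEY=1"}, "ns/name")
	fmt.Println(sizes, err)
}
```

Passing the body as an `io.Reader` rather than a byte slice keeps the sketch streaming-friendly, which matters once files approach the size limit.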

Tested against a real public bucket (results identical to trufflehog filesystem on the same file), and a planted canary AWS key comes back as a verified finding with correct bucket metadata.

cc @dxa4481


Note

Medium Risk
New external HF API and file download path with token auth; follows existing S3-style patterns but increases network surface and scan scope for org/user runs.

Overview
Adds Hugging Face storage bucket support to the huggingface source so TruffleHog can scan object storage alongside models, datasets, and spaces.

CLI and config: New --bucket, --include-buckets, --ignore-buckets, and --skip-all-buckets flags (plus proto/engine wiring). Org/user scans enumerate buckets automatically unless skipped. Validation now accepts bucket as a scan target.
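To illustrate how the three filtering flags could interact, here is a small sketch. The precedence is an assumption, not something stated in this PR: skip-all wins, then the ignore list, then a non-empty include list acts as an allow-list. Matching is exact string comparison here; the real flags may support patterns. `bucketFilter` and `shouldScan` are hypothetical names.

```go
package main

import "fmt"

// bucketFilter is a hypothetical holder for the values of --include-buckets,
// --ignore-buckets, and --skip-all-buckets.
type bucketFilter struct {
	Include []string
	Ignore  []string
	SkipAll bool
}

// shouldScan reports whether a bucket (e.g. "namespace/name") would be scanned
// under the assumed precedence: SkipAll > Ignore > Include.
func (f bucketFilter) shouldScan(name string) bool {
	if f.SkipAll {
		return false
	}
	for _, ig := range f.Ignore {
		if ig == name {
			return false
		}
	}
	if len(f.Include) == 0 {
		return true // no allow-list: everything not ignored is scanned
	}
	for _, inc := range f.Include {
		if inc == name {
			return true
		}
	}
	return false
}

func main() {
	f := bucketFilter{Include: []string{"acme/logs"}, Ignore: []string{"acme/tmp"}}
	for _, b := range []string{"acme/logs", "acme/tmp", "acme/other"} {
		fmt.Println(b, f.shouldScan(b))
	}
}
```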

Scan behavior: Buckets are not git repos. The source lists files via the HF tree API (with Link pagination), downloads each file through a dedicated client (no whole-request timeout on body reads), skips files over 250MB, and chunks content via handlers.HandleFile with HuggingFace bucket metadata—similar to the S3 source path.
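The three mechanics in this paragraph (Link pagination, the size cutoff, and a client without a whole-request timeout) can be sketched as below. This is an illustration under assumptions, not the PR's code: `nextPageURL`, `tooLarge`, `newDownloadClient`, and `maxBucketFileSize` are hypothetical names, "250MB" is read as 250 MiB, and the 30-second header timeout is an arbitrary choice.

```go
package main

import (
	"fmt"
	"net/http"
	"strings"
	"time"
)

// maxBucketFileSize assumes "250MB" means 250 MiB; the PR may define it differently.
const maxBucketFileSize = 250 * 1024 * 1024

// tooLarge reports whether a file exceeds the cutoff and would be skipped.
func tooLarge(size int64) bool {
	return size > maxBucketFileSize
}

// nextPageURL extracts the rel="next" target from a Link header value such as
// `<https://host/page2>; rel="next"`. It returns "" when there is no next page.
// This is a simplified parser: it does not handle targets that themselves
// contain ',' or ';'.
func nextPageURL(link string) string {
	for _, part := range strings.Split(link, ",") {
		segs := strings.Split(part, ";")
		if len(segs) < 2 {
			continue
		}
		target := strings.TrimSpace(segs[0])
		if !strings.HasPrefix(target, "<") || !strings.HasSuffix(target, ">") {
			continue
		}
		for _, param := range segs[1:] {
			p := strings.TrimSpace(param)
			if p == `rel="next"` || p == "rel=next" {
				return target[1 : len(target)-1]
			}
		}
	}
	return ""
}

// newDownloadClient returns a client with Timeout left at zero, so reading a
// large body is not cut off by a whole-request deadline. A header timeout on
// the transport still bounds how long the server may take to start responding.
func newDownloadClient() *http.Client {
	tr := http.DefaultTransport.(*http.Transport).Clone()
	tr.ResponseHeaderTimeout = 30 * time.Second // assumed value
	return &http.Client{Transport: tr}
}

func main() {
	link := `<https://example.test/tree?cursor=abc>; rel="next"`
	fmt.Println(nextPageURL(link))
	fmt.Println(tooLarge(300 * 1024 * 1024))
	fmt.Println(newDownloadClient().Timeout)
}
```

`http.Client.Timeout` covers the time spent reading the response body, which is why a download client for large files leaves it unset and relies on narrower timeouts or context cancellation instead.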

API client: GetBucket, ListBucketsByAuthor, ListBucketFiles, and DownloadBucketFile with path-segment escaping for resolve URLs.
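A short sketch of what path-segment escaping means in practice: each segment of the object path is escaped on its own, so the `/` separators survive while characters like spaces and `#` do not break the URL. `escapePathSegments` is a hypothetical helper name, not necessarily what the PR uses, and how the result is joined into a resolve URL is left out.

```go
package main

import (
	"fmt"
	"net/url"
	"strings"
)

// escapePathSegments escapes each "/"-separated segment with url.PathEscape.
// Escaping the whole path at once would turn "/" into "%2F".
func escapePathSegments(p string) string {
	segs := strings.Split(p, "/")
	for i, s := range segs {
		segs[i] = url.PathEscape(s)
	}
	return strings.Join(segs, "/")
}

func main() {
	fmt.Println(escapePathSegments("logs/run 1/out#2.txt"))
	// prints: logs/run%201/out%232.txt
}
```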

Docs: README and man page are updated, along with minor comment fixes on ScanHuggingface.

Reviewed by Cursor Bugbot for commit 5a76b32. Bugbot is set up for automated code reviews on this repo.

@julien-c julien-c requested a review from a team June 4, 2026 16:05
@julien-c julien-c requested review from a team as code owners June 4, 2026 16:05
Comment thread pkg/sources/huggingface/client.go
Comment thread pkg/sources/huggingface/bucket.go Outdated
Comment thread pkg/sources/huggingface/client.go Outdated
@CLAassistant

CLAassistant commented Jun 4, 2026


CLA assistant check
All committers have signed the CLA.

@julien-c

This comment was marked as outdated.

Comment thread pkg/sources/huggingface/client.go Outdated

@cursor cursor Bot left a comment


Cursor Bugbot has reviewed your changes and found 1 potential issue.

There are 2 total unresolved issues (including 1 from previous review).

Reviewed by Cursor Bugbot for commit e4f90eb.

Comment thread pkg/sources/huggingface/bucket.go

4 participants