Fix: Resolve known dedup issues in notifierWorker#5028
Fix: Resolve known dedup issues in notifierWorker#5028MuneebUllahKhan222 wants to merge 6 commits into
Conversation
|
|
||
| cache, err := lru.New[string, detectorspb.DecoderType](cacheSize) | ||
| // The cache size is set to 10000 entries, which is a balance between memory usage and the need for effective deduplication. | ||
| // On average a cache entry would be between 500-1000 bytes, so this would use around 5-10 MB of memory. |
There was a problem hiding this comment.
One thing you could consider is hashing; MD5 is fast and we wouldn't be worried about collisions here. A 1K cache key is biiiiiiiig lol
There was a problem hiding this comment.
Thanks for the MD5 suggestion, I hadn't considered it, but it was a great idea, so I went ahead and implemented it.
Although a cache size of 10,000 wouldn't really be an issue anymore since we're now hashing the keys (10,000 entries would only require about 160 KB for the hashes themselves), I've reduced the cache size to a more conservative value of 5,000 entries for now.
If we find that we need a larger cache in the future, we can always increase it.
camgunz
left a comment
There was a problem hiding this comment.
LGTM; but see what @trufflesecurity/scanning says
There was a problem hiding this comment.
Cursor Bugbot has reviewed your changes and found 1 potential issue.
❌ Bugbot Autofix is OFF. To automatically fix reported issues with cloud agents, have a team admin enable autofix in the Cursor dashboard.
Reviewed by Cursor Bugbot for commit c29e20e. Configure here.
rosecodym
left a comment
There was a problem hiding this comment.
I think this is ok. I'm a little confused by why it was originally written the way it was, though - it's not that there's no obvious purpose, its that it was apparently known to be wrong at that time but written incorrectly anyway. Maybe we didn't have rich enough metadata then to make a proper dedup key?

Problem
notifierWorkercould report the same credential multiple times due to a broken deduplication condition.Old deduplication condition
The condition guarding the deduplication cache only suppressed duplicates when the decoder type differed between two results for the same key:
When the decoder type was the same, the common case
val != result.DecoderTypeevaluated tofalse, so every occurrence passed through even if the key was already present in the cache.This commonly occurred for results produced from the peek-overlap between consecutive chunks, where the same credential at a chunk boundary appeared in both:
As a result, the same finding could be reported multiple times.
Cache eviction
Even if the deduplication condition had been working correctly, the cache size of 512 entries was small enough that entries could be evicted during larger scans, allowing previously seen findings to be re-admitted and reported again.
Fix
Simplify the deduplication logic to suppress any result whose key is already present in the cache:
This is safe because the deduplication key already contains sufficient source-specific metadata, including:
As a result, genuinely distinct occurrences of the same credential generate different keys and will not be incorrectly suppressed.
Additional Changes
Increase cache size
Increase the deduplication cache size from 512 to 5,000 entries to reduce eviction-driven re-admission during larger scans.
Store MD5 hashes instead of full keys
The cache now stores MD5 hashes of deduplication keys rather than the full key strings.
Since an MD5 hash is 16 bytes, a cache containing 5,000 entries requires approximately:
or roughly 80 KB of storage for the cached hashes themselves, making the larger cache size a reasonable memory trade-off.
Remove unused cache value
Change the cache value type from:
to:
Checklist:
make test-community)?make lintthis requires golangci-lint)?Note
Medium Risk
Changes which findings are emitted (fewer duplicates) in the core scan output path; MD5 key hashing is a theoretical collision risk but keys already encode location metadata.
Overview
Fixes duplicate secret notifications in
notifierWorkerby correcting LRU deduplication and sizing the cache for large scans.Dedup logic: Previously, cache hits only skipped emission when the stored decoder type differed from the new result, so same-decoder repeats (e.g. peek-overlap at chunk boundaries) still dispatched. The guard is now any cache hit skips the result. Keys still include detector name/type, raw values, and
SourceMetadata, so the same credential at different locations remains distinct.Cache: Size increases from 512 → 5,000. Entries store MD5 hashes of the dedupe key (value type
struct{}instead ofDecoderType) to cut memory while keeping large raw/metadata out of the LRU.Tests: Removes the
"overlap, retain false positives"case that expected two overlapping passthrough hits to both count as unverified secrets.Reviewed by Cursor Bugbot for commit bafce7b. Bugbot is set up for automated code reviews on this repo. Configure here.