Skip to content

Fix: Resolve known dedup issues in notifierWorker#5028

Open
MuneebUllahKhan222 wants to merge 6 commits into
mainfrom
INS-499
Open

Fix: Resolve known dedup issues in notifierWorker#5028
MuneebUllahKhan222 wants to merge 6 commits into
mainfrom
INS-499

Conversation

@MuneebUllahKhan222

@MuneebUllahKhan222 MuneebUllahKhan222 commented Jun 10, 2026

Copy link
Copy Markdown
Contributor

Problem

notifierWorker could report the same credential multiple times due to a broken deduplication condition.

Old deduplication condition

The condition guarding the deduplication cache only suppressed duplicates when the decoder type differed between two results for the same key:

if val, ok := e.dedupeCache.Get(key); ok && (val != result.DecoderType) {
    continue
}

When the decoder type was the same, the common case val != result.DecoderType evaluated to false, so every occurrence passed through even if the key was already present in the cache.

This commonly occurred for results produced from the peek-overlap between consecutive chunks, where the same credential at a chunk boundary appeared in both:

  • Chunk N (as peek data)
  • Chunk N+1 (as main content)

As a result, the same finding could be reported multiple times.

Cache eviction

Even if the deduplication condition had been working correctly, the cache size of 512 entries was small enough that entries could be evicted during larger scans, allowing previously seen findings to be re-admitted and reported again.

Fix

Simplify the deduplication logic to suppress any result whose key is already present in the cache:

if _, ok := e.dedupeCache.Get(key); ok {
    continue
}

This is safe because the deduplication key already contains sufficient source-specific metadata, including:

  • File path
  • Line numbers
  • Commit hash
  • Other location-specific metadata

As a result, genuinely distinct occurrences of the same credential generate different keys and will not be incorrectly suppressed.

Additional Changes

Increase cache size

Increase the deduplication cache size from 512 to 5,000 entries to reduce eviction-driven re-admission during larger scans.

Store MD5 hashes instead of full keys

The cache now stores MD5 hashes of deduplication keys rather than the full key strings.

Since an MD5 hash is 16 bytes, a cache containing 5,000 entries requires approximately:

5,000 × 16 bytes = 80,000 bytes

or roughly 80 KB of storage for the cached hashes themselves, making the larger cache size a reasonable memory trade-off.

Remove unused cache value

Change the cache value type from:

lru.Cache[string, detectorspb.DecoderType]

to:

lru.Cache[string, struct{}]

Checklist:

  • Tests passing (make test-community)?
  • Lint passing (make lint this requires golangci-lint)?

Note

Medium Risk
Changes which findings are emitted (fewer duplicates) in the core scan output path; MD5 key hashing is a theoretical collision risk but keys already encode location metadata.

Overview
Fixes duplicate secret notifications in notifierWorker by correcting LRU deduplication and sizing the cache for large scans.

Dedup logic: Previously, cache hits only skipped emission when the stored decoder type differed from the new result, so same-decoder repeats (e.g. peek-overlap at chunk boundaries) still dispatched. The guard is now any cache hit skips the result. Keys still include detector name/type, raw values, and SourceMetadata, so the same credential at different locations remains distinct.

Cache: Size increases from 512 → 5,000. Entries store MD5 hashes of the dedupe key (value type struct{} instead of DecoderType) to cut memory while keeping large raw/metadata out of the LRU.

Tests: Removes the "overlap, retain false positives" case that expected two overlapping passthrough hits to both count as unverified secrets.

Reviewed by Cursor Bugbot for commit bafce7b. Bugbot is set up for automated code reviews on this repo. Configure here.

@MuneebUllahKhan222 MuneebUllahKhan222 requested a review from a team June 10, 2026 11:14
@MuneebUllahKhan222 MuneebUllahKhan222 requested a review from a team as a code owner June 10, 2026 11:14
Comment thread pkg/engine/engine.go
Comment thread pkg/engine/engine.go Outdated

cache, err := lru.New[string, detectorspb.DecoderType](cacheSize)
// The cache size is set to 10000 entries, which is a balance between memory usage and the need for effective deduplication.
// On average a cache entry would be between 500-1000 bytes, so this would use around 5-10 MB of memory.

Copy link
Copy Markdown
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

One thing you could consider is hashing; MD5 is fast and we wouldn't be worried about collisions here. A 1K cache key is biiiiiiiig lol

Copy link
Copy Markdown
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Thanks for the MD5 suggestion, I hadn't considered it, but it was a great idea, so I went ahead and implemented it.

Although a cache size of 10,000 wouldn't really be an issue anymore since we're now hashing the keys (10,000 entries would only require about 160 KB for the hashes themselves), I've reduced the cache size to a more conservative value of 5,000 entries for now.

If we find that we need a larger cache in the future, we can always increase it.

@camgunz camgunz left a comment

Copy link
Copy Markdown
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

LGTM; but see what @trufflesecurity/scanning says

@cursor cursor Bot left a comment

Copy link
Copy Markdown
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Cursor Bugbot has reviewed your changes and found 1 potential issue.

Fix All in Cursor

❌ Bugbot Autofix is OFF. To automatically fix reported issues with cloud agents, have a team admin enable autofix in the Cursor dashboard.

Reviewed by Cursor Bugbot for commit c29e20e. Configure here.

Comment thread pkg/engine/engine_test.go Outdated

@rosecodym rosecodym left a comment

Copy link
Copy Markdown
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I think this is ok. I'm a little confused by why it was originally written the way it was, though - it's not that there's no obvious purpose, its that it was apparently known to be wrong at that time but written incorrectly anyway. Maybe we didn't have rich enough metadata then to make a proper dedup key?

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment

Labels

None yet

Projects

None yet

Development

Successfully merging this pull request may close these issues.

3 participants