[feat] - Optimize detector performance by reducing data passed to regex #2812
base: main
Conversation
While I think this is a good idea, it would be a mistake to set a default max length. I can think of several detectors that can easily exceed this — JWT, private key, GCP, and Docker (#2677) to name a few. Because there are hundreds of detectors, the safer approach would be to make this opt-in. This would also interfere with detectors that require multiple parts (e.g., client ID & secret, username & password, secret & URL). Edit: Incidentally, this would likely solve #2739.
@@ -36,7 +36,7 @@ func TestAlchemy_Pattern(t *testing.T) {
	for _, test := range tests {
		t.Run(test.name, func(t *testing.T) {
			chunkSpecificDetectors := make(map[ahocorasick.DetectorKey]detectors.Detector, 2)
-			ahoCorasickCore.PopulateMatchingDetectors(test.input, chunkSpecificDetectors)
+			ahoCorasickCore.FindDetectorMatches(test.input, chunkSpecificDetectors)
I don't think this would compile anymore since the signature is now:
FindDetectorMatches(chunkData string) []DetectorMatch
It would need to be changed to:
-chunkSpecificDetectors := make(map[ahocorasick.DetectorKey]detectors.Detector, 2)
-ahoCorasickCore.FindDetectorMatches(test.input, chunkSpecificDetectors)
+chunkSpecificDetectors := ahoCorasickCore.FindDetectorMatches(test.input)
Incidentally, I think the TestX_Pattern tests should be put into a common test module instead of copied & pasted between tests.
pkg/detectors/detectors.go (outdated)
// MaxSecretSizeProvider is an optional interface that a detector can implement to
// provide a custom max size for the secret it finds.
type MaxSecretSizeProvider interface {
	ProvideMaxSecretSize() int64
How about just MaxSecretSize?
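For reference, here is a minimal, self-contained sketch of how a detector might opt in under the interface proposed in this PR. The interface and constant names follow the PR description; the detector type and the 4096-byte value are hypothetical stand-ins:

```go
package main

import "fmt"

// MaxSecretSizeProvider is the optional interface proposed in this PR.
type MaxSecretSizeProvider interface {
	ProvideMaxSecretSize() int64
}

// defaultMaxMatchLength is the PR's default match window.
const defaultMaxMatchLength int64 = 300

// fakePrivateKeyDetector stands in for a real detector whose secrets can
// be large (e.g. PEM-encoded private keys). The 4096 value is illustrative.
type fakePrivateKeyDetector struct{}

func (fakePrivateKeyDetector) ProvideMaxSecretSize() int64 { return 4096 }

// maxSizeFor returns the detector-specific max size when the detector
// implements the interface, otherwise the default.
func maxSizeFor(d any) int64 {
	if p, ok := d.(MaxSecretSizeProvider); ok {
		return p.ProvideMaxSecretSize()
	}
	return defaultMaxMatchLength
}

func main() {
	fmt.Println(maxSizeFor(fakePrivateKeyDetector{})) // 4096
	fmt.Println(maxSizeFor(struct{}{}))               // 300
}
```

The type assertion keeps the change backward compatible: detectors that never heard of the interface silently get the default.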
pkg/engine/engine.go (outdated)
matchedBytes := data.detector.Matches(data.chunk.Data)
for _, match := range matchedBytes {
Does mergeMatches aid in de-duplication at all? Some of my recent changes have been around de-duplicating results in chunks to prevent making the same network calls 2/10/20 times. Admittedly, a better solution would be #2262 rather than caching matches in a given chunk.
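For context on what span merging buys: collapsing overlapping or adjacent keyword spans prevents handing the same bytes of a chunk to a detector twice, which is related to (but distinct from) de-duplicating a detector's results. A rough standalone sketch of the merge idea — my own minimal version, not the PR's mergeMatches code:

```go
package main

import (
	"fmt"
	"sort"
)

// matchSpan mirrors the PR's notion of a keyword match window (byte offsets).
type matchSpan struct{ start, end int64 }

// mergeSpans merges overlapping or touching spans so each byte range is
// scanned at most once. Sketch only; the PR's implementation may differ.
func mergeSpans(spans []matchSpan) []matchSpan {
	if len(spans) == 0 {
		return nil
	}
	sort.Slice(spans, func(i, j int) bool { return spans[i].start < spans[j].start })
	merged := []matchSpan{spans[0]}
	for _, s := range spans[1:] {
		last := &merged[len(merged)-1]
		if s.start <= last.end { // overlapping or adjacent: extend
			if s.end > last.end {
				last.end = s.end
			}
		} else {
			merged = append(merged, s)
		}
	}
	return merged
}

func main() {
	fmt.Println(mergeSpans([]matchSpan{{0, 300}, {120, 420}, {900, 1200}}))
	// [{0 420} {900 1200}]
}
```

Even with merged spans, two distinct spans can still yield identical extracted secrets, so result-level de-duplication (as in #2262) would remain a separate concern.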
pkg/engine/engine_test.go (outdated)
for _, chunkSize := range chunkSizes {
	b.Run(fmt.Sprintf("ChunkSize_%d", chunkSize), func(b *testing.B) {
		b.ReportAllocs()
		b.SetBytes(int64(dataSize))
Update this to chunkSize so the benchmarks report the correct throughput.
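To illustrate the fix, here is a standalone sketch using testing.Benchmark (so it runs outside `go test`); scanChunk is a placeholder for the real detector work being measured:

```go
package main

import (
	"fmt"
	"testing"
)

// scanChunk stands in for the per-chunk detector work being benchmarked.
func scanChunk(data []byte) int { return len(data) }

func main() {
	for _, chunkSize := range []int{512, 1024} {
		data := make([]byte, chunkSize)
		r := testing.Benchmark(func(b *testing.B) {
			b.ReportAllocs()
			// Use chunkSize (not the full dataSize) so the reported
			// MB/s reflects the bytes actually processed per op.
			b.SetBytes(int64(chunkSize))
			for i := 0; i < b.N; i++ {
				scanChunk(data)
			}
		})
		fmt.Printf("ChunkSize_%d: %s\n", chunkSize, r)
	}
}
```

With SetBytes given the wrong size, the benchmark still runs but its MB/s figure is scaled by the wrong factor, which is easy to misread when comparing chunk sizes.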
I think we can increase the default to 1024 bytes for the multi-part credential case. Should be adequate in most cases. I'd like to see this optimization on by default, but we can provide an opt-out flag.
1024 bytes would be safer, but could definitely still miss valid secrets. I think there'd be tremendous value in at least having a (hidden?) flag that runs both and checks for missed results (kind of like https://github.com/github/scientist), rather than binary on/off. Otherwise it can be tricky to identify affected detectors, which has been an issue with the verification overlap change.
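The dual-run idea above could be sketched generically: run both extraction strategies on each chunk and surface secrets that only the old (full-chunk) path finds. This is a hypothetical helper, not an existing flag or API:

```go
package main

import "fmt"

// compareStrategies runs the old (full-chunk) and new (truncated) scan
// functions over a chunk and returns results found only by the old path —
// i.e. secrets the optimization would miss. Inspired by the
// github.com/github/scientist style of experiments; names are illustrative.
func compareStrategies(chunk string, oldScan, newScan func(string) []string) (missed []string) {
	newSet := make(map[string]struct{})
	for _, r := range newScan(chunk) {
		newSet[r] = struct{}{}
	}
	for _, r := range oldScan(chunk) {
		if _, ok := newSet[r]; !ok {
			missed = append(missed, r)
		}
	}
	return missed
}

func main() {
	oldScan := func(string) []string { return []string{"a", "b", "c"} }
	newScan := func(string) []string { return []string{"a", "c"} }
	fmt.Println(compareStrategies("chunk", oldScan, newScan)) // [b]
}
```

Logging the missed results with the detector name would make it straightforward to identify which detectors need a larger max size (or a MaxSecretSizeProvider implementation).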
Description:

This PR introduces optimizations to improve the performance of detectors by reducing the amount of data passed to the regex within the FromData method. The changes leverage the knowledge of keyword positions to extract relevant portions of the chunk data, where the secret is likely to reside.

Key changes:
- Introduced a DetectorMatch struct to represent a detected pattern's metadata, including the detector key, detector instance, and a slice of matchSpan structs representing the start and end offsets of matched keywords within the chunk.
- Added a matchSpan struct to represent a single occurrence of a matched keyword, containing the start and end byte offsets within the chunk.
- Implemented a Matches method for DetectorMatch to extract the relevant portions of the chunk data based on the start and end positions of each match. The end position is determined by taking the minimum of the keyword position + maxMatchLength (set to 300) and the length of the chunk data.
- Introduced the FindDetectorMatches function (previously PopulateMatchingDetectors) to return a slice of DetectorMatch instances, each containing the detector key, the detector, and a slice of matches. Adjacent or overlapping matches are merged using the mergeMatches function to avoid duplicating or overlapping the matched portions of the chunk data.
- Updated the detection logic to use the Matches method of DetectorMatch to extract the relevant portions of the chunk data before passing them to the FromData method of the detector.
- Introduced a MaxSecretSizeProvider interface that detectors can optionally implement to provide a custom maximum size for the secrets they detect. The interface includes a single method, ProvideMaxSecretSize() int64, that returns the maximum size of the secret the detector expects to find.
- As part of the FindDetectorMatches function, each detector is checked for the MaxSecretSizeProvider interface. If implemented, the ProvideMaxSecretSize method is called to obtain the detector-specific maximum secret size, which is used to determine the end position of the match span. If the interface is not implemented, the default maxMatchLength constant is used.
- Implemented the MaxSecretSizeProvider interface in the relevant detectors (PrivateKeyDetector, GCPDetector, and GCPApplicationDefaultCredentialsDetector) and provided appropriate values for the maximum secret size based on the expected size of the secrets they detect. I might be missing some detectors that should implement this interface... I just can't think of them right now 😞

The optimization is based on the assumption that most secrets shouldn't exceed a certain length from the keyword's position. By default, the maxMatchLength constant is set to 300 characters. However, detectors that require a larger or smaller max size can implement the MaxSecretSizeProvider interface and provide their own value through the ProvideMaxSecretSize method.

These changes significantly reduce the amount of data the regex within FromData has to process, leading to improved detector performance while still ensuring accurate secret detection. The introduction of the MaxSecretSizeProvider interface allows for flexibility in handling different secret sizes based on the specific requirements of each detector.

Sequence Diagram
Benchmarks
Benchmark assessing the performance of FromData with verification disabled across various chunk sizes.
orange: old chunk size (10kB)
green: new chunk size (512B overestimate for most detectors)
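The gains above come from the span-extraction rule described in this PR: the end offset is the minimum of the keyword position plus the max length and the chunk length. A minimal sketch of that rule (my own paraphrase, not the PR's exact code):

```go
package main

import "fmt"

// maxMatchLength is the PR's default match window.
const maxMatchLength = 300

// spanFor computes the byte range of chunk data handed to a detector's
// regex for a keyword found at index pos: [pos, min(pos+maxLen, len(data))).
// Sketch of the rule from the PR description, not its exact implementation.
func spanFor(data []byte, pos, maxLen int) (start, end int) {
	start = pos
	end = pos + maxLen
	if end > len(data) {
		end = len(data) // clamp to the chunk boundary
	}
	return start, end
}

func main() {
	data := make([]byte, 1000)
	s, e := spanFor(data, 100, maxMatchLength)
	fmt.Println(s, e) // 100 400
	s, e = spanFor(data, 900, maxMatchLength)
	fmt.Println(s, e) // 900 1000
}
```

Because only `end - start` bytes (at most maxMatchLength, or a detector's custom max) reach the regex instead of the full 10kB chunk, the per-chunk regex cost drops roughly in proportion.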
Checklist:
- Tests (make test-community)?
- Lint (make lint; this requires golangci-lint)?