
[feat] - Optimize detector performance by reducing data passed to regex #2812

Draft · wants to merge 19 commits into base: main
Conversation

ahrav (Collaborator) commented May 9, 2024

Description:

This PR introduces optimizations to improve the performance of detectors by reducing the amount of data passed to the regex within the FromData method. The changes leverage the knowledge of keyword positions to extract relevant portions of the chunk data, where the secret is likely to reside.

Key changes:

  1. Introduced DetectorMatch struct to represent a detected pattern’s metadata, including the detector key, detector instance, and a slice of matchSpan structs representing the start and end offsets of matched keywords within the chunk.

  2. Added matchSpan struct to represent a single occurrence of a matched keyword, containing the start and end byte offsets within the chunk.

  3. Implemented Matches method for DetectorMatch to extract the relevant portions of the chunk data based on the start and end positions of each match. The end position is determined by taking the minimum of the keyword position + maxMatchLength (set to 300) and the length of the chunk data.

  4. Introduced FindDetectorMatches function (previously PopulateMatchingDetectors) to return a slice of DetectorMatch instances, each containing the detector key, detector, and a slice of matches. Adjacent or overlapping matches are merged using the mergeMatches function to avoid duplicating or overlapping the matched portions of the chunk data.

  5. Updated the detection logic to use the Matches method of DetectorMatch to extract the relevant portions of the chunk data before passing them to the FromData method of the detector.

  6. Introduced MaxSecretSizeProvider interface that detectors can optionally implement to provide a custom maximum size for the secrets they detect. The interface includes a single method ProvideMaxSecretSize() int64 that returns the maximum size of the secret the detector expects to find.

  7. FindDetectorMatches checks whether a detector implements the MaxSecretSizeProvider interface. If it does, ProvideMaxSecretSize is called to obtain the detector-specific maximum secret size, which determines the end position of the match span; otherwise, the default maxMatchLength constant is used.

  8. Implemented the MaxSecretSizeProvider interface in the relevant detectors (PrivateKeyDetector, GCPDetector, and GCPApplicationDefaultCredentialsDetector) and provided appropriate values for the maximum secret size based on the expected size of the secrets they detect. I might be missing some detectors that should implement this interface... I just can't think of them right now 😞

The optimization is based on the assumption that most secrets shouldn’t exceed a certain length from the keyword’s position. By default, the maxMatchLength constant is set to 300 characters. However, detectors that require a larger or smaller max size can implement the MaxSecretSizeProvider interface and provide their own value through the ProvideMaxSecretSize method.

These changes significantly reduce the amount of data the regex within FromData has to process, leading to improved detector performance while still ensuring accurate secret detection. The introduction of the MaxSecretSizeProvider interface allows for flexibility in handling different secret sizes based on the specific requirements of each detector.

Sequence Diagram

[Sequence diagram image: optimized engine regex-2024-05-09-003033]

Benchmarks

Benchmark assessing the performance of FromData with verification disabled across various chunk sizes.

orange: old chunk size (10kB)
green: new chunk size (512B overestimate for most detectors)
[Screenshot: benchmark comparison, 2024-05-08]

Checklist:

  • Tests passing (make test-community)?
  • Lint passing (make lint; this requires golangci-lint)?

rgmz (Contributor) commented May 9, 2024

> The end position is determined by taking the minimum of the keyword position + maxMatchLength (set to 300) and the length of the chunk data.
>
> ...
>
> 8. Implemented the MaxSecretSizeProvider interface in the relevant detectors (PrivateKeyDetector, GCPDetector, and GCPApplicationDefaultCredentialsDetector) and provided appropriate values for the maximum secret size based on the expected size of the secrets they detect. I might be missing some detectors that should implement this interface... I just can't think of them right now 😞

While I think this is a good idea, it would be a mistake to set a default max length. I can think of several detectors that can easily exceed this — JWT, private key, GCP, and Docker (#2677) to name a few. Because there are hundreds of detectors, the safer approach would be to make this opt-in.

This would also interfere with detectors that require multiple parts (e.g., client ID & secret, username & password, secret & URL).


Edit: Incidentally, this would likely solve #2739.

@@ -36,7 +36,7 @@ func TestAlchemy_Pattern(t *testing.T) {
 for _, test := range tests {
 	t.Run(test.name, func(t *testing.T) {
 		chunkSpecificDetectors := make(map[ahocorasick.DetectorKey]detectors.Detector, 2)
-		ahoCorasickCore.PopulateMatchingDetectors(test.input, chunkSpecificDetectors)
+		ahoCorasickCore.FindDetectorMatches(test.input, chunkSpecificDetectors)
rgmz (Contributor) commented May 9, 2024

I don't think this would compile anymore since the signature is now:

FindDetectorMatches(chunkData string) []DetectorMatch

It would need to be changed to:

-chunkSpecificDetectors := make(map[ahocorasick.DetectorKey]detectors.Detector, 2)
-ahoCorasickCore.FindDetectorMatches(test.input, chunkSpecificDetectors)
+chunkSpecificDetectors := ahoCorasickCore.FindDetectorMatches(test.input)

Incidentally, I think the TestX_Pattern should be put into a common test module instead of copied & pasted between tests.

// MaxSecretSizeProvider is an optional interface that a detector can implement to
// provide a custom max size for the secret it finds.
type MaxSecretSizeProvider interface {
	ProvideMaxSecretSize() int64
}
A contributor commented:

How about just MaxSecretSize?

Comment on lines 810 to 811
matchedBytes := data.detector.Matches(data.chunk.Data)
for _, match := range matchedBytes {
A contributor commented:

Does mergeMatches aid in de-duplication at all? Some of my recent changes have been around de-duplicating results in chunks to prevent making the same network calls 2/10/20 times. Admittedly, a better solution would be #2262 rather than caching matches in a given chunk.

rgmz mentioned this pull request May 15, 2024
for _, chunkSize := range chunkSizes {
b.Run(fmt.Sprintf("ChunkSize_%d", chunkSize), func(b *testing.B) {
b.ReportAllocs()
b.SetBytes(int64(dataSize))
A contributor commented:

Update this to chunkSize so the reported benchmark throughput is accurate.

ahrav (Collaborator, author) commented:

I made a mistake; the screenshot above was from a detector benchmark in aws_test.go. This benchmark should compare the performance of FindDetectorMatches across different chunk sizes.

Here is the correct benchmark:
[Screenshot: corrected benchmark results, 2024-05-16]

dustin-decker (Contributor) commented:

> The end position is determined by taking the minimum of the keyword position + maxMatchLength (set to 300) and the length of the chunk data.
>
> ...
>
> 8. Implemented the MaxSecretSizeProvider interface in the relevant detectors (PrivateKeyDetector, GCPDetector, and GCPApplicationDefaultCredentialsDetector) and provided appropriate values for the maximum secret size based on the expected size of the secrets they detect. I might be missing some detectors that should implement this interface... I just can't think of them right now 😞
>
> While I think this is a good idea, it would be a mistake to set a default max length. I can think of several detectors that can easily exceed this — JWT, private key, GCP, and Docker (#2677) to name a few. Because there are hundreds of detectors, the safer approach would be to make this opt-in.
>
> This would also interfere with detectors that require multiple parts (e.g., client ID & secret, username & password, secret & URL).
>
> Edit: Incidentally, this would likely solve #2739.

I think we can increase the default to 1024 bytes for the multi-part credential case. Should be adequate in most cases. I'd like to see this optimization on by default, but we can provide an opt-out flag.

rgmz (Contributor) commented May 16, 2024

> I think we can increase the default to 1024 bytes for the multi-part credential case. Should be adequate in most cases. I'd like to see this optimization on by default, but we can provide an opt-out flag.

1024 bytes would be safer, but could definitely still miss valid secrets.

I think there'd be tremendous value in at least having a (hidden?) flag that runs both and checks for missed results (kind of like https://github.com/github/scientist), rather than binary on/off. Otherwise it can be tricky to identify affected detectors, which has been an issue with the verification overlap change.
