
[feat] - Optimize detector performance by reducing data passed to regex #2812

Draft · wants to merge 19 commits into base: main
Conversation

ahrav (Collaborator) commented May 9, 2024

Description:

This PR introduces optimizations to improve the performance of detectors by reducing the amount of data passed to the regex within the FromData method. The changes leverage the knowledge of keyword positions to extract relevant portions of the chunk data, where the secret is likely to reside.

Key changes:

  1. Introduced DetectorMatch struct to represent a detected pattern’s metadata, including the detector key, detector instance, and a slice of matchSpan structs representing the start and end offsets of matched keywords within the chunk.

  2. Added matchSpan struct to represent a single occurrence of a matched keyword, containing the start and end byte offsets within the chunk.

  3. Implemented Matches method for DetectorMatch to extract the relevant portions of the chunk data based on the start and end positions of each match. The end position is determined by taking the minimum of the keyword position + maxMatchLength (set to 300) and the length of the chunk data.

  4. Introduced FindDetectorMatches function (previously PopulateMatchingDetectors) to return a slice of DetectorMatch instances, each containing the detector key, detector, and a slice of matches. Adjacent or overlapping matches are merged using the mergeMatches function to avoid duplicating or overlapping the matched portions of the chunk data.

  5. Updated the detection logic to use the Matches method of DetectorMatch to extract the relevant portions of the chunk data before passing them to the FromData method of the detector.

  6. Introduced MaxSecretSizeProvider interface that detectors can optionally implement to provide a custom maximum size for the secrets they detect. The interface includes a single method ProvideMaxSecretSize() int64 that returns the maximum size of the secret the detector expects to find.

  7. FindDetectorMatches checks whether a detector implements the MaxSecretSizeProvider interface. If it does, ProvideMaxSecretSize is called to obtain the detector-specific maximum secret size, which determines the end position of the match span; otherwise, the default maxMatchLength constant is used.

  8. Implemented the MaxSecretSizeProvider interface in the relevant detectors (PrivateKeyDetector, GCPDetector, and GCPApplicationDefaultCredentialsDetector) and provided appropriate values for the maximum secret size based on the expected size of the secrets they detect. I might be missing some detectors that should implement this interface... I just can't think of them right now 😞

The optimization is based on the assumption that most secrets shouldn’t exceed a certain length from the keyword’s position. By default, the maxMatchLength constant is set to 300 characters. However, detectors that require a larger or smaller max size can implement the MaxSecretSizeProvider interface and provide their own value through the ProvideMaxSecretSize method.

These changes significantly reduce the amount of data the regex within FromData has to process, leading to improved detector performance while still ensuring accurate secret detection. The introduction of the MaxSecretSizeProvider interface allows for flexibility in handling different secret sizes based on the specific requirements of each detector.

Sequence Diagram

[Sequence diagram image: optimized engine regex-2024-05-09-003033]

Benchmarks

Benchmark assessing the performance of FromData with verification disabled across various chunk sizes.

orange: old chunk size (10kB)
green: new chunk size (512B overestimate for most detectors)
[Screenshot: benchmark comparison, 2024-05-08]

Checklist:

  • Tests passing (make test-community)?
  • Lint passing (make lint; this requires golangci-lint)?

rgmz (Contributor) commented May 9, 2024

> The end position is determined by taking the minimum of the keyword position + maxMatchLength (set to 300) and the length of the chunk data.
>
> ...
>
> 8. Implemented the MaxSecretSizeProvider interface in the relevant detectors (PrivateKeyDetector, GCPDetector, and GCPApplicationDefaultCredentialsDetector) and provided appropriate values for the maximum secret size based on the expected size of the secrets they detect. I might be missing some detectors that should implement this interface... I just can't think of them right now 😞

While I think this is a good idea, it would be a mistake to set a default max length. I can think of several detectors that can easily exceed this — JWT, private key, GCP, and Docker (#2677) to name a few. Because there are hundreds of detectors, the safer approach would be to make this opt-in.

This would also interfere with detectors that require multiple parts (e.g., client ID & secret, username & password, secret & URL).


Edit: Incidentally, this would likely solve #2739.

@@ -36,7 +36,7 @@ func TestAlchemy_Pattern(t *testing.T) {
 for _, test := range tests {
 	t.Run(test.name, func(t *testing.T) {
 		chunkSpecificDetectors := make(map[ahocorasick.DetectorKey]detectors.Detector, 2)
-		ahoCorasickCore.PopulateMatchingDetectors(test.input, chunkSpecificDetectors)
+		ahoCorasickCore.FindDetectorMatches(test.input, chunkSpecificDetectors)
rgmz (Contributor) commented May 9, 2024

I don't think this would compile anymore since the signature is now:

FindDetectorMatches(chunkData string) []DetectorMatch

It would need to be changed to:

-chunkSpecificDetectors := make(map[ahocorasick.DetectorKey]detectors.Detector, 2)
-ahoCorasickCore.FindDetectorMatches(test.input, chunkSpecificDetectors)
+chunkSpecificDetectors := ahoCorasickCore.FindDetectorMatches(test.input)

Incidentally, I think the TestX_Pattern should be put into a common test module instead of copied & pasted between tests.

// MaxSecretSizeProvider is an optional interface that a detector can implement to
// provide a custom max size for the secret it finds.
type MaxSecretSizeProvider interface {
	ProvideMaxSecretSize() int64
}
A contributor commented:

How about just MaxSecretSize?

Comment on lines 810 to 811
matchedBytes := data.detector.Matches(data.chunk.Data)
for _, match := range matchedBytes {
A contributor commented:

Does mergeMatches aid in de-duplication at all? Some of my recent changes have been around de-duplicating results in chunks to prevent making the same network calls 2/10/20 times. Admittedly, a better solution would be #2262 rather than caching matches in a given chunk.

rgmz mentioned this pull request May 15, 2024
for _, chunkSize := range chunkSizes {
b.Run(fmt.Sprintf("ChunkSize_%d", chunkSize), func(b *testing.B) {
b.ReportAllocs()
b.SetBytes(int64(dataSize))
A contributor commented:

Update this to chunkSize so the reported benchmark throughput is accurate.

ahrav (Collaborator, author) commented:

I made a mistake; the screenshot above was from a detector benchmark in aws_test.go. This benchmark should compare the performance of FindDetectorMatches across different chunk sizes.

Here is the correct benchmark:
[Screenshot: corrected benchmark results, 2024-05-16]

dustin-decker (Contributor) commented:

> The end position is determined by taking the minimum of the keyword position + maxMatchLength (set to 300) and the length of the chunk data.
>
> ...
>
> 8. Implemented the MaxSecretSizeProvider interface in the relevant detectors (PrivateKeyDetector, GCPDetector, and GCPApplicationDefaultCredentialsDetector) and provided appropriate values for the maximum secret size based on the expected size of the secrets they detect. I might be missing some detectors that should implement this interface... I just can't think of them right now 😞
>
> While I think this is a good idea, it would be a mistake to set a default max length. I can think of several detectors that can easily exceed this — JWT, private key, GCP, and Docker (#2677) to name a few. Because there are hundreds of detectors, the safer approach would be to make this opt-in.
>
> This would also interfere with detectors that require multiple parts (e.g., client ID & secret, username & password, secret & URL).
>
> Edit: Incidentally, this would likely solve #2739.

I think we can increase the default to 1024 bytes for the multi-part credential case. Should be adequate in most cases. I'd like to see this optimization on by default, but we can provide an opt-out flag.

rgmz (Contributor) commented May 16, 2024

> I think we can increase the default to 1024 bytes for the multi-part credential case. Should be adequate in most cases. I'd like to see this optimization on by default, but we can provide an opt-out flag.

1024 bytes would be safer, but could definitely still miss valid secrets.

I think there'd be tremendous value in at least having a (hidden?) flag that runs both and checks for missed results (kind of like https://github.com/github/scientist), rather than binary on/off. Otherwise it can be tricky to identify affected detectors, which has been an issue with the verification overlap change.
