fix: flaky results: same start and end marker #413

M4tteoP · 2024-12-09T22:32:46Z

I'm investigating some flakiness and found that the same marker might be retrieved for both the starter and the final one. When this happens, the function responsible for retrieving all the marked lines can not locate the starter marker. As a result, it uses all the logs up to the beginning of the file and processes IDs from lines belonging to previous tests, leading to incorrect results.
My idea is that it can happen when go-ftw is too fast and reads again the same line before the engine/connector manages to write/flush the new logs (which include the final marker) into the file.

From a visual perspective (Note: The read is performed backward.):

] Expected scenario:
start marker: 8888-8888
final marker: 9999-9999

...
log id:10
8888-8888
log id:11
9999-9999

We look for the final marker
Se start collecting log lines
We stop when reaching the start marker
log id:10 line is not collected, being out of the markers.

] Assuming the case that both the start marker and final marker have the same value (equal to the start marker):
start marker: 8888-8888
final marker: 8888-8888

...
log id:10
8888-8888
log id:11
9999-9999

We look for the final marker, in that case, the line 8888-8888 matches
From there we start collecting logs
log id:11 is not collected, but rather log id:10 and all the previous ones are collected.

The PR proposes to:

Add a debug log when the start marker is not found and unbounded logs (up to the beginning of the file) are used
Introduce a retry mechanism that, when looking for the final marker, repeats the seeking of the log file if the final marker found is equal to the start maker.
Do not introduce changes so far if the final marker is found equal to the start marker after some attempts, but print a verbose Warn log line.

It would allow us to have more visibility on the issue and bring in some more resiliency. Anyways I think that it is not expected to have the same start and final marker line, therefore I think adding a check would be beneficial.

I will follow with a test/reproducer whenever I manage to reproduce it in an easy and consistent way

theseion · 2024-12-10T06:48:05Z

What you want already exists:

readLimit limits the amount of backwards scanning to avoid reading the entire log file if it is large
if the start marker can't be found, another request is sent to try and flush the log and then the search is repeated, up to maxMarkerRetries

CheckLogForMarker returns the found line in each case and this line is then used as start / end marker. Even if they are identical, the search logic in getMarkedLines should ensure that finding the lines between the markers works.

I see one possible issue: when the log is not flushed immediately, markAndFlush will produce multiple marker lines in the log. If the start marker line is identical to these (e.g., there's no differing timestamp) then the these lines would match the start marker line and there wouldn't exist any other log lines in between. Example:

<marker ID>
log 1
log 2
<marker ID>
<marker ID>

We can solve this problem by using a suffix to the UUID (I'd like to keep a single UUID to make it easier to identify the matching markers when reading the log manually). E.g., the start marker could be <UUID>-s and the end marker <UUID>-e.

M4tteoP · 2024-12-10T10:28:57Z

Thanks Max!

readLimit limits the amount of backwards scanning to avoid reading the entire log file if it is large

In this scenario readLimit would just reduce the amount of unbounded log lines read, but, in the first place, we are reading the wrong log lines. So, with readLimit we would just read fewer wrong lines.

if the start marker can't be found, another request is sent to try and flush the log and then the search is repeated, up to maxMarkerRetries.

This logic does no help in this scenario: the start marker is correctly found, so no retries are performed.

CheckLogForMarker returns the found line in each case and this line is then used as start / end marker. Even if they are identical, the search logic in getMarkedLines should ensure that finding the lines between the markers works.

If I'm getting the code right, getMarkedLines search logic is not working properly with two identical markers. So we should:

Never give to getMarkedLines two identical markers.
Or adapt getMarkedLines to not really look for the identical markers, but just rely on the UUID (but then there is no point on passing to that function the two marker lines because it will have to find them on its own).

Going step by step:

markAndFlush is called, a request is performed to log the starter marker. Now logs will be like this:

...
log 0   <-- old log belonging to the previous request
[msg "X-CRS-Test 7d16c799-954f-4338-81c9-fb39b32a9521"][unique_id "WWAVDNpnKkXiHleWcdD"]

markAndFlush internally calls CheckLogForMarker which looks, starting from the end, for a line with 7d16c799-954f-4338-81c9-fb39b32a9521. The line is found and identified as the start marker line. No retries are performed.
RunStage stores the start marker via SetStartMarker. The start marker now is [msg "X-CRS-Test 7d16c799-954f-4338-81c9-fb39b32a9521"][unique_id "WWAVDNpnKkXiHleWcdD"].
RunStage performs the real request that might or might not trigger some rules and the related logs will be printed.
markAndFlush is called again to print the end marker. The request is performed correctly, but the Engine is not fast enough on writing rules logs and the end marker.
markAndFlush internally calls CheckLogForMarker when logs are looking like the following:

...
log 0 
[msg "X-CRS-Test 7d16c799-954f-4338-81c9-fb39b32a9521"][unique_id "WWAVDNpnKkXiHleWcdD"]
log 1
log 2

We can notably see that we don't have yet a trailing marker with 7d16c799-954f-4338-81c9-fb39b32a9521 but a different unique_id.

CheckLogForMarker looks again for a line with 7d16c799-954f-4338-81c9-fb39b32a9521 and within readLimit lines, it finds [msg "X-CRS-Test 7d16c799-954f-4338-81c9-fb39b32a9521"][unique_id "WWAVDNpnKkXiHleWcdD"]
RunStage stores the final marker via SetEndMarker using the same line of the start marker.
Start marker: [msg "X-CRS-Test 7d16c799-954f-4338-81c9-fb39b32a9521"][unique_id "WWAVDNpnKkXiHleWcdD"]
End marker: [msg "X-CRS-Test 7d16c799-954f-4338-81c9-fb39b32a9521"][unique_id "WWAVDNpnKkXiHleWcdD"]
At this point getMarkedLines is called, providing start and end marker.
(not so relevant, this point might not happen and the outcome would be the same) In the meanwhile the Engine might managed to write the whole logs. So they might now look like the following:

...
log 0 
[msg "X-CRS-Test 7d16c799-954f-4338-81c9-fb39b32a9521"][unique_id "WWAVDNpnKkXiHleWcdD"]
log 1
log 2
log 3
[msg "X-CRS-Test 7d16c799-954f-4338-81c9-fb39b32a9521"][unique_id "gfrFHrMGygNvrCkkWRY"]

getMarkedLines starts from the bottom and looks for the end marker. The real end marker (gfrFHrMGygNvrCkkWRY) and all the logs related to this request are skipped. It finally finds the end marker in what actually is the start marker.

...
✅ log 0 
⬆ [msg "X-CRS-Test 7d16c799-954f-4338-81c9-fb39b32a9521"][unique_id "WWAVDNpnKkXiHleWcdD"]
❌ log 1
❌ log 2
❌ log 3
❌ [msg "X-CRS-Test 7d16c799-954f-4338-81c9-fb39b32a9521"][unique_id "gfrFHrMGygNvrCkkWRY"]

From WWAVDNpnKkXiHleWcdD it starts reading old log lines (such as log 0) until:

it finds the start marker: but it will never happend, because the start marker has been misused as end marker
or it reaches readLimit reading readLimit wrong log lines
or it reaches the beginning of the document.

All these log lines are then used to extract IDs and computing wrong outcomes. The correct logs to be used (log 1, log 2, log 3) are not taken into account

theseion · 2024-12-11T05:47:10Z

Thanks for the analysis. You're right, in that case both markers would be the same line. Both cases, the one you described and the one I came up with, would be fixed by using identifying suffixes for the start and end markers, since CheckLogForMarker could no longer confuse the two.

M4tteoP · 2024-12-11T15:33:02Z

Thank you! It sounds like a very reasonable idea to have the same uuid just followed by a suffix.
I have:

Implemented the suffix idea
Removed the SeekAttempts
Kept some more debug verbosity when markers are not found
Added some logic to not consider as marked lines all the lines found before finding the end marker (backward read). The tests were adding one or more empty lines, in real-case scenarios I guess there could be logs that we should just ignore

theseion

We don't have a test yet, that specifically reproduces the issue, right? Could you add one?

theseion · 2024-12-12T07:37:25Z

runner/run.go

@@ -166,7 +173,8 @@ func RunStage(runContext *TestRunContext, ftwCheck *check.FTWCheck, testCase sch
 	}

 	if notRunningInCloudMode(ftwCheck) {
-		startMarker, err := markAndFlush(runContext, &testInput, stageId)
+		startUUID := stageId + startUUIDSuffix


Suggested change

startUUID := stageId + startUUIDSuffix

startId := stageId + startUUIDSuffix

theseion · 2024-12-12T07:37:44Z

runner/run.go

+	startUUIDSuffix = "-s"
+	endUUIDSuffix   = "-e"


Suggested change

startUUIDSuffix = "-s"

endUUIDSuffix = "-e"

startUuidSuffix = "-s"

endUuidSuffix = "-e"

theseion · 2024-12-12T07:38:02Z

runner/run.go

@@ -190,7 +198,8 @@ func RunStage(runContext *TestRunContext, ftwCheck *check.FTWCheck, testCase sch
 	}

 	if notRunningInCloudMode(ftwCheck) {
-		endMarker, err := markAndFlush(runContext, &testInput, stageId)
+		endUUID := stageId + endUUIDSuffix


Suggested change

endUUID := stageId + endUUIDSuffix

endId := stageId + endUUIDSuffix

theseion · 2024-12-12T07:38:49Z

waflog/read.go

-			endFound = true
+
+		if !endFound {
+			// Skip lines until we find the end marker. Reading backwards, the line we are looking for are


Suggested change

// Skip lines until we find the end marker. Reading backwards, the line we are looking for are

// Skip lines until we find the end marker. Reading backwards, the lines we are looking for are

theseion · 2024-12-12T07:41:36Z

waflog/read.go

 			break
 		}

 		saneCopy := make([]byte, len(line))
 		copy(saneCopy, line)
 		ll.markedLines = append(ll.markedLines, saneCopy)
 	}
+	if !startFound {
+		log.Debug().Msgf("starter marker not found while getting marked lines")


Suggested change

log.Debug().Msgf("starter marker not found while getting marked lines")

log.Debug().Msg("starter marker not found while getting marked lines")

theseion · 2024-12-12T07:42:31Z

waflog/read.go


 	return ll.markedLines
 }

 // CheckLogForMarker reads the log file and searches for a marker line.
 // stageID is the ID of the current stage, which is part of the marker line
+// At this point, stageID includes also the start or end suffix.
 // readLimit is the maximum numbers of lines to check
 func (ll *FTWLogLines) CheckLogForMarker(stageID string, readLimit uint) []byte {


Suggested change

func (ll *FTWLogLines) CheckLogForMarker(stageID string, readLimit uint) []byte {

func (ll *FTWLogLines) CheckLogForMarker(markerId string, readLimit uint) []byte {

theseion · 2024-12-12T07:43:18Z

waflog/read.go

 // stageID is the ID of the current stage, which is part of the marker line
+// At this point, stageID includes also the start or end suffix.


Suggested change

// stageID is the ID of the current stage, which is part of the marker line

// At this point, stageID includes also the start or end suffix.

// markerId is the ID of the current stage + suffix (for start / end), which is part of the marker line

theseion · 2024-12-12T07:43:56Z

waflog/read_test.go

@@ -201,12 +201,15 @@ func (s *readTestSuite) TestFTWLogLines_Contains() {
 	s.NotNil(cfg)

 	stageID := "dead-beaf-deadbeef-deadbeef-dead"


Suggested change

stageID := "dead-beaf-deadbeef-deadbeef-dead"

stageId := "dead-beaf-deadbeef-deadbeef-dead"

theseion · 2024-12-12T07:44:54Z

waflog/read_test.go

@@ -368,12 +373,13 @@ func (s *readTestSuite) TestFindAllIdsInLogs() {
 	s.NotNil(cfg)

 	stageID := "dead-beaf-deadbeef-deadbeef-dead"


Suggested change

stageID := "dead-beaf-deadbeef-deadbeef-dead"

stageId := "dead-beaf-deadbeef-deadbeef-dead"

M4tteoP changed the title ~~Flaky start end marker~~ Flaky results: same start and end marker Dec 9, 2024

M4tteoP changed the title ~~Flaky results: same start and end marker~~ fix: flaky results: same start and end marker Dec 9, 2024

M4tteoP marked this pull request as ready for review December 9, 2024 22:39

M4tteoP added 4 commits December 11, 2024 11:08

reduce flakiness

b67a054

space

3e31033

space

2cc5a08

true

0d3cd46

M4tteoP force-pushed the flaky_start_end_marker branch from 1adaf21 to 22f5749 Compare December 11, 2024 15:15

start and end marker

5b3386a

M4tteoP force-pushed the flaky_start_end_marker branch from 0f798b0 to 5b3386a Compare December 11, 2024 15:32

M4tteoP added 2 commits December 11, 2024 16:35

fix

8a8ec48

less diffs

8c93235

theseion requested changes Dec 12, 2024

View reviewed changes

Provide feedback

Saved searches

Use saved searches to filter your results more quickly

fix: flaky results: same start and end marker #413

fix: flaky results: same start and end marker #413

M4tteoP commented Dec 9, 2024 •

edited

Loading

theseion commented Dec 10, 2024

M4tteoP commented Dec 10, 2024 •

edited

Loading

theseion commented Dec 11, 2024

M4tteoP commented Dec 11, 2024

theseion left a comment

theseion Dec 12, 2024

theseion Dec 12, 2024

theseion Dec 12, 2024

theseion Dec 12, 2024

theseion Dec 12, 2024

theseion Dec 12, 2024

theseion Dec 12, 2024

theseion Dec 12, 2024

theseion Dec 12, 2024

	startUUID := stageId + startUUIDSuffix
	startId := stageId + startUUIDSuffix

	endUUID := stageId + endUUIDSuffix
	endId := stageId + endUUIDSuffix

	// Skip lines until we find the end marker. Reading backwards, the line we are looking for are
	// Skip lines until we find the end marker. Reading backwards, the lines we are looking for are

	log.Debug().Msgf("starter marker not found while getting marked lines")
	log.Debug().Msg("starter marker not found while getting marked lines")

	func (ll *FTWLogLines) CheckLogForMarker(stageID string, readLimit uint) []byte {
	func (ll *FTWLogLines) CheckLogForMarker(markerId string, readLimit uint) []byte {

		// stageID is the ID of the current stage, which is part of the marker line
		// At this point, stageID includes also the start or end suffix.

	// stageID is the ID of the current stage, which is part of the marker line
	// At this point, stageID includes also the start or end suffix.
	// markerId is the ID of the current stage + suffix (for start / end), which is part of the marker line

		@@ -201,12 +201,15 @@ func (s *readTestSuite) TestFTWLogLines_Contains() {
		s.NotNil(cfg)

		stageID := "dead-beaf-deadbeef-deadbeef-dead"

		@@ -368,12 +373,13 @@ func (s *readTestSuite) TestFindAllIdsInLogs() {
		s.NotNil(cfg)

		stageID := "dead-beaf-deadbeef-deadbeef-dead"

fix: flaky results: same start and end marker #413

Are you sure you want to change the base?

fix: flaky results: same start and end marker #413

Conversation

M4tteoP commented Dec 9, 2024 • edited Loading

theseion commented Dec 10, 2024

M4tteoP commented Dec 10, 2024 • edited Loading

theseion commented Dec 11, 2024

M4tteoP commented Dec 11, 2024

theseion left a comment

Choose a reason for hiding this comment

Choose a reason for hiding this comment

Choose a reason for hiding this comment

Choose a reason for hiding this comment

Choose a reason for hiding this comment

Choose a reason for hiding this comment

Choose a reason for hiding this comment

Choose a reason for hiding this comment

Choose a reason for hiding this comment

Choose a reason for hiding this comment

M4tteoP commented Dec 9, 2024 •

edited

Loading

M4tteoP commented Dec 10, 2024 •

edited

Loading