Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

fix: flaky results: same start and end marker #413

Open
wants to merge 7 commits into
base: main
Choose a base branch
from

Conversation

M4tteoP
Copy link
Member

@M4tteoP M4tteoP commented Dec 9, 2024

I'm investigating some flakiness and found that the same marker might be retrieved for both the starter and the final one. When this happens, the function responsible for retrieving all the marked lines can not locate the starter marker. As a result, it uses all the logs up to the beginning of the file and processes IDs from lines belonging to previous tests, leading to incorrect results.
My idea is that it can happen when go-ftw is too fast and reads again the same line before the engine/connector manages to write/flush the new logs (which include the final marker) into the file.

From a visual perspective (Note: The read is performed backward.):

] Expected scenario:
start marker: 8888-8888
final marker: 9999-9999

...
log id:10
8888-8888
log id:11
9999-9999
  1. We look for the final marker
  2. Se start collecting log lines
  3. We stop when reaching the start marker
  4. log id:10 line is not collected, being out of the markers.

] Assuming the case that both the start marker and final marker have the same value (equal to the start marker):
start marker: 8888-8888
final marker: 8888-8888

...
log id:10
8888-8888
log id:11
9999-9999
  1. We look for the final marker, in that case, the line 8888-8888 matches
  2. From there we start collecting logs
  3. log id:11 is not collected, but rather log id:10 and all the previous ones are collected.

The PR proposes to:

  1. Add a debug log when the start marker is not found and unbounded logs (up to the beginning of the file) are used
  2. Introduce a retry mechanism that, when looking for the final marker, repeats the seeking of the log file if the final marker found is equal to the start maker.
  3. Do not introduce changes so far if the final marker is found equal to the start marker after some attempts, but print a verbose Warn log line.

It would allow us to have more visibility on the issue and bring in some more resiliency. Anyways I think that it is not expected to have the same start and final marker line, therefore I think adding a check would be beneficial.

I will follow with a test/reproducer whenever I manage to reproduce it in an easy and consistent way

@M4tteoP M4tteoP changed the title Flaky start end marker Flaky results: same start and end marker Dec 9, 2024
@M4tteoP M4tteoP changed the title Flaky results: same start and end marker fix: flaky results: same start and end marker Dec 9, 2024
@M4tteoP M4tteoP marked this pull request as ready for review December 9, 2024 22:39
@theseion
Copy link
Collaborator

What you want already exists:

  1. readLimit limits the amount of backwards scanning to avoid reading the entire log file if it is large
  2. if the start marker can't be found, another request is sent to try and flush the log and then the search is repeated, up to maxMarkerRetries

CheckLogForMarker returns the found line in each case and this line is then used as start / end marker. Even if they are identical, the search logic in getMarkedLines should ensure that finding the lines between the markers works.

I see one possible issue: when the log is not flushed immediately, markAndFlush will produce multiple marker lines in the log. If the start marker line is identical to these (e.g., there's no differing timestamp) then the these lines would match the start marker line and there wouldn't exist any other log lines in between. Example:

<marker ID>
log 1
log 2
<marker ID>
<marker ID>

We can solve this problem by using a suffix to the UUID (I'd like to keep a single UUID to make it easier to identify the matching markers when reading the log manually). E.g., the start marker could be <UUID>-s and the end marker <UUID>-e.

@M4tteoP
Copy link
Member Author

M4tteoP commented Dec 10, 2024

Thanks Max!

  1. readLimit limits the amount of backwards scanning to avoid reading the entire log file if it is large

In this scenario readLimit would just reduce the amount of unbounded log lines read, but, in the first place, we are reading the wrong log lines. So, with readLimit we would just read fewer wrong lines.

  1. if the start marker can't be found, another request is sent to try and flush the log and then the search is repeated, up to maxMarkerRetries.

This logic does no help in this scenario: the start marker is correctly found, so no retries are performed.

CheckLogForMarker returns the found line in each case and this line is then used as start / end marker. Even if they are identical, the search logic in getMarkedLines should ensure that finding the lines between the markers works.

If I'm getting the code right, getMarkedLines search logic is not working properly with two identical markers. So we should:

  • Never give to getMarkedLines two identical markers.
  • Or adapt getMarkedLines to not really look for the identical markers, but just rely on the UUID (but then there is no point on passing to that function the two marker lines because it will have to find them on its own).

Going step by step:

  1. markAndFlush is called, a request is performed to log the starter marker. Now logs will be like this:
...
log 0   <-- old log belonging to the previous request
[msg "X-CRS-Test 7d16c799-954f-4338-81c9-fb39b32a9521"][unique_id "WWAVDNpnKkXiHleWcdD"]
  1. markAndFlush internally calls CheckLogForMarker which looks, starting from the end, for a line with 7d16c799-954f-4338-81c9-fb39b32a9521. The line is found and identified as the start marker line. No retries are performed.
  2. RunStage stores the start marker via SetStartMarker. The start marker now is [msg "X-CRS-Test 7d16c799-954f-4338-81c9-fb39b32a9521"][unique_id "WWAVDNpnKkXiHleWcdD"].
  3. RunStage performs the real request that might or might not trigger some rules and the related logs will be printed.
  4. markAndFlush is called again to print the end marker. The request is performed correctly, but the Engine is not fast enough on writing rules logs and the end marker.
  5. markAndFlush internally calls CheckLogForMarker when logs are looking like the following:
...
log 0 
[msg "X-CRS-Test 7d16c799-954f-4338-81c9-fb39b32a9521"][unique_id "WWAVDNpnKkXiHleWcdD"]
log 1
log 2

We can notably see that we don't have yet a trailing marker with 7d16c799-954f-4338-81c9-fb39b32a9521 but a different unique_id.

  1. CheckLogForMarker looks again for a line with 7d16c799-954f-4338-81c9-fb39b32a9521 and within readLimit lines, it finds [msg "X-CRS-Test 7d16c799-954f-4338-81c9-fb39b32a9521"][unique_id "WWAVDNpnKkXiHleWcdD"]

  2. RunStage stores the final marker via SetEndMarker using the same line of the start marker.
    Start marker: [msg "X-CRS-Test 7d16c799-954f-4338-81c9-fb39b32a9521"][unique_id "WWAVDNpnKkXiHleWcdD"]
    End marker: [msg "X-CRS-Test 7d16c799-954f-4338-81c9-fb39b32a9521"][unique_id "WWAVDNpnKkXiHleWcdD"]

  3. At this point getMarkedLines is called, providing start and end marker.

  4. (not so relevant, this point might not happen and the outcome would be the same) In the meanwhile the Engine might managed to write the whole logs. So they might now look like the following:

...
log 0 
[msg "X-CRS-Test 7d16c799-954f-4338-81c9-fb39b32a9521"][unique_id "WWAVDNpnKkXiHleWcdD"]
log 1
log 2
log 3
[msg "X-CRS-Test 7d16c799-954f-4338-81c9-fb39b32a9521"][unique_id "gfrFHrMGygNvrCkkWRY"]
  1. getMarkedLines starts from the bottom and looks for the end marker. The real end marker (gfrFHrMGygNvrCkkWRY) and all the logs related to this request are skipped. It finally finds the end marker in what actually is the start marker.
...
✅ log 0 
⬆ [msg "X-CRS-Test 7d16c799-954f-4338-81c9-fb39b32a9521"][unique_id "WWAVDNpnKkXiHleWcdD"]
❌ log 1
❌ log 2
❌ log 3
❌ [msg "X-CRS-Test 7d16c799-954f-4338-81c9-fb39b32a9521"][unique_id "gfrFHrMGygNvrCkkWRY"]
  1. From WWAVDNpnKkXiHleWcdD it starts reading old log lines (such as log 0) until:
  • it finds the start marker: but it will never happend, because the start marker has been misused as end marker
  • or it reaches readLimit reading readLimit wrong log lines
  • or it reaches the beginning of the document.
  1. All these log lines are then used to extract IDs and computing wrong outcomes. The correct logs to be used (log 1, log 2, log 3) are not taken into account

@theseion
Copy link
Collaborator

Thanks for the analysis. You're right, in that case both markers would be the same line. Both cases, the one you described and the one I came up with, would be fixed by using identifying suffixes for the start and end markers, since CheckLogForMarker could no longer confuse the two.

@M4tteoP M4tteoP force-pushed the flaky_start_end_marker branch from 1adaf21 to 22f5749 Compare December 11, 2024 15:15
@M4tteoP M4tteoP force-pushed the flaky_start_end_marker branch from 0f798b0 to 5b3386a Compare December 11, 2024 15:32
@M4tteoP
Copy link
Member Author

M4tteoP commented Dec 11, 2024

Thank you! It sounds like a very reasonable idea to have the same uuid just followed by a suffix.
I have:

  • Implemented the suffix idea
  • Removed the SeekAttempts
  • Kept some more debug verbosity when markers are not found
  • Added some logic to not consider as marked lines all the lines found before finding the end marker (backward read). The tests were adding one or more empty lines, in real-case scenarios I guess there could be logs that we should just ignore

Copy link
Collaborator

@theseion theseion left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

We don't have a test yet, that specifically reproduces the issue, right? Could you add one?

@@ -166,7 +173,8 @@ func RunStage(runContext *TestRunContext, ftwCheck *check.FTWCheck, testCase sch
}

if notRunningInCloudMode(ftwCheck) {
startMarker, err := markAndFlush(runContext, &testInput, stageId)
startUUID := stageId + startUUIDSuffix
Copy link
Collaborator

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Suggested change
startUUID := stageId + startUUIDSuffix
startId := stageId + startUUIDSuffix

Comment on lines +28 to +29
startUUIDSuffix = "-s"
endUUIDSuffix = "-e"
Copy link
Collaborator

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Suggested change
startUUIDSuffix = "-s"
endUUIDSuffix = "-e"
startUuidSuffix = "-s"
endUuidSuffix = "-e"

@@ -190,7 +198,8 @@ func RunStage(runContext *TestRunContext, ftwCheck *check.FTWCheck, testCase sch
}

if notRunningInCloudMode(ftwCheck) {
endMarker, err := markAndFlush(runContext, &testInput, stageId)
endUUID := stageId + endUUIDSuffix
Copy link
Collaborator

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Suggested change
endUUID := stageId + endUUIDSuffix
endId := stageId + endUUIDSuffix

endFound = true

if !endFound {
// Skip lines until we find the end marker. Reading backwards, the line we are looking for are
Copy link
Collaborator

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Suggested change
// Skip lines until we find the end marker. Reading backwards, the line we are looking for are
// Skip lines until we find the end marker. Reading backwards, the lines we are looking for are

break
}

saneCopy := make([]byte, len(line))
copy(saneCopy, line)
ll.markedLines = append(ll.markedLines, saneCopy)
}
if !startFound {
log.Debug().Msgf("starter marker not found while getting marked lines")
Copy link
Collaborator

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Suggested change
log.Debug().Msgf("starter marker not found while getting marked lines")
log.Debug().Msg("starter marker not found while getting marked lines")


return ll.markedLines
}

// CheckLogForMarker reads the log file and searches for a marker line.
// stageID is the ID of the current stage, which is part of the marker line
// At this point, stageID includes also the start or end suffix.
// readLimit is the maximum numbers of lines to check
func (ll *FTWLogLines) CheckLogForMarker(stageID string, readLimit uint) []byte {
Copy link
Collaborator

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Suggested change
func (ll *FTWLogLines) CheckLogForMarker(stageID string, readLimit uint) []byte {
func (ll *FTWLogLines) CheckLogForMarker(markerId string, readLimit uint) []byte {

Comment on lines 192 to +193
// stageID is the ID of the current stage, which is part of the marker line
// At this point, stageID includes also the start or end suffix.
Copy link
Collaborator

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Suggested change
// stageID is the ID of the current stage, which is part of the marker line
// At this point, stageID includes also the start or end suffix.
// markerId is the ID of the current stage + suffix (for start / end), which is part of the marker line

@@ -201,12 +201,15 @@ func (s *readTestSuite) TestFTWLogLines_Contains() {
s.NotNil(cfg)

stageID := "dead-beaf-deadbeef-deadbeef-dead"
Copy link
Collaborator

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Suggested change
stageID := "dead-beaf-deadbeef-deadbeef-dead"
stageId := "dead-beaf-deadbeef-deadbeef-dead"

@@ -368,12 +373,13 @@ func (s *readTestSuite) TestFindAllIdsInLogs() {
s.NotNil(cfg)

stageID := "dead-beaf-deadbeef-deadbeef-dead"
Copy link
Collaborator

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Suggested change
stageID := "dead-beaf-deadbeef-deadbeef-dead"
stageId := "dead-beaf-deadbeef-deadbeef-dead"

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
None yet
Projects
None yet
Development

Successfully merging this pull request may close these issues.

2 participants