
Investigate queue backlog #3978

Open
jonathanmetzman opened this issue May 7, 2024 · 8 comments · Fixed by #3980

Comments

@jonathanmetzman
Collaborator

There is a 100k unacked-message backlog.
Interestingly, many of these tasks appear to have been taken but never finished.

@jonathanmetzman
Collaborator Author

Two instances that have been running their tasks for weeks are stuck at the same step: "Testing for crash".

@jonathanmetzman
Collaborator Author

I've looked at two instances suffering from this issue. Their stacktraces looked roughly similar:

(gdb) bt                                                                                                                                                                                                                                                                                  
#0  0x00005cbe7d5d4b34 in sre_ucs1_match (state=state@entry=0x7ffee26f5bd0, pattern=pattern@entry=0x5cbe7fc7c014, toplevel=toplevel@entry=0) at ./Modules/sre_lib.h:590                                                                                                                   
#1  0x00005cbe7d5db30d in sre_ucs1_search (pattern=<optimized out>, state=0x7ffee26f5bd0) at ./Modules/sre_lib.h:1443                                                                                                                                                                     
#2  sre_search (state=state@entry=0x7ffee26f5bd0, pattern=pattern@entry=0x5cbe7fc7bfe8) at ./Modules/_sre.c:578                                                                                                                                                                           
#3  0x00005cbe7d5dd414 in pattern_subx (self=self@entry=0x5cbe7fc7bf90, ptemplate=<optimized out>, string=0x5cbe840925b0, count=0, subn=subn@entry=0) at ./Modules/_sre.c:1060                                                                                                            
#4  0x00005cbe7d5ddbf5 in _sre_SRE_Pattern_sub_impl (count=<optimized out>, string=<optimized out>, repl=<optimized out>, self=0x5cbe7fc7bf90) at ./Modules/_sre.c:1181                                                                                                                   
#5  _sre_SRE_Pattern_sub (self=0x5cbe7fc7bf90, args=<optimized out>, nargs=<optimized out>, kwnames=<optimized out>) at ./Modules/clinic/_sre.c.h:416                                                                                                                                     
#6  0x00005cbe7d4b9ea7 in _PyMethodDef_RawFastCallKeywords (method=0x5cbe7d741320 <pattern_methods+96>, self=self@entry=0x5cbe7fc7bf90, args=args@entry=0x7c43d107cf10, nargs=nargs@entry=3, kwnames=kwnames@entry=0x0) at Objects/call.c:660                                             
#7  0x00005cbe7d6346be in _PyMethodDescr_FastCallKeywords (descrobj=descrobj@entry=0x7c43d311a550, args=0x7c43d107cf08, nargs=nargs@entry=4, kwnames=kwnames@entry=0x0) at Objects/descrobject.c:288                                                                                      
#8  0x00005cbe7d4a01b2 in call_function (pp_stack=pp_stack@entry=0x7ffee26f5e90, oparg=<optimized out>, kwnames=kwnames@entry=0x0) at Python/ceval.c:4593                                                                                                                                 
#9  0x00005cbe7d4a119c in _PyEval_EvalFrameDefault (f=<optimized out>, throwflag=<optimized out>) at Python/ceval.c:3110                                                                                                                                                                  
#10 0x00005cbe7d568560 in PyEval_EvalFrameEx (throwflag=0, f=0x7c43d107cd70) at Python/ceval.c:547

This looks like Python stuck in a regex match that effectively never finishes, likely catastrophic backtracking on a pathological input.
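
For reference, here is a minimal, self-contained sketch of the failure mode, using a toy pattern (not the actual ClusterFuzz regexes): a regex with nested quantifiers can take exponential time on a long, almost-matching line, which from the outside looks exactly like a hang.

import re
import time

# Toy pattern with nested quantifiers, not the actual ClusterFuzz ASSERT regex.
pattern = re.compile(r"^(\w+\s?)*$")

for n in (18, 20, 22, 24):
  # The trailing "!" makes the match fail only after every possible way of
  # splitting the run of "a"s has been tried.
  line = "a" * n + "!"
  start = time.perf_counter()
  pattern.match(line)
  print(n, f"{time.perf_counter() - start:.3f}s")  # roughly doubles per extra character

A line that is hundreds of kilobytes long would, in practice, never finish.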

@jonathanmetzman
Collaborator Author

Got a stacktrace from a third instance:

Traceback (most recent call last):
  File "/mnt/scratch0/clusterfuzz/src/python/bot/startup/run_bot.py", line 249, in <module>
    main()
  File "/mnt/scratch0/clusterfuzz/src/python/bot/startup/run_bot.py", line 212, in main
    error_stacktrace, clean_exit, task_payload = task_loop()
  File "/mnt/scratch0/clusterfuzz/src/python/bot/startup/run_bot.py", line 146, in task_loop
    commands.process_command(task)
  File "/mnt/scratch0/clusterfuzz/src/clusterfuzz/_internal/bot/tasks/commands.py", line 249, in process_command
    task.high_end, task.is_command_override)
  File "/mnt/scratch0/clusterfuzz/src/clusterfuzz/_internal/bot/tasks/commands.py", line 159, in wrapper
    return func(task_name, task_argument, job_name, *args, **kwargs)
  File "/mnt/scratch0/clusterfuzz/src/clusterfuzz/_internal/bot/tasks/commands.py", line 431, in process_command_impl
    preprocess)
  File "/mnt/scratch0/clusterfuzz/src/clusterfuzz/_internal/bot/tasks/commands.py", line 218, in run_command
    result = task.execute(task_argument, job_name, uworker_env)
  File "/mnt/scratch0/clusterfuzz/src/clusterfuzz/_internal/bot/tasks/task_types.py", line 127, in execute
    self.execute_locally(task_argument, job_type, uworker_env)
  File "/mnt/scratch0/clusterfuzz/src/clusterfuzz/_internal/bot/tasks/task_types.py", line 63, in execute_locally
    uworker_output = utasks.uworker_main_no_io(self.module, uworker_input)
  File "/mnt/scratch0/clusterfuzz/src/clusterfuzz/_internal/bot/tasks/utasks/__init__.py", line 194, in uworker_main_no_io
    uworker_output = utask_module.utask_main(uworker_input)
  File "/mnt/scratch0/clusterfuzz/src/clusterfuzz/_internal/bot/tasks/utasks/analyze_task.py", line 368, in utask_main
    fuzz_target, testcase, testcase_file_path, test_timeout)
  File "/mnt/scratch0/clusterfuzz/src/clusterfuzz/_internal/bot/tasks/utasks/analyze_task.py", line 197, in test_for_crash_with_retries
    compare_crash=False)
  File "/mnt/scratch0/clusterfuzz/src/clusterfuzz/_internal/bot/testcase_manager.py", line 801, in test_for_crash_with_retries
    testcase.flaky_stack)
  File "/mnt/scratch0/clusterfuzz/src/clusterfuzz/_internal/bot/testcase_manager.py", line 688, in reproduce_with_retries
    state = self._get_crash_state(round_number, crash_result)
  File "/mnt/scratch0/clusterfuzz/src/clusterfuzz/_internal/bot/testcase_manager.py", line 664, in _get_crash_state
    state = crash_result.get_symbolized_data()
  File "/mnt/scratch0/clusterfuzz/src/clusterfuzz/_internal/crash_analysis/crash_result.py", line 48, in get_symbolized_data
    self.output, symbolize_flag=True)
  File "/mnt/scratch0/clusterfuzz/src/clusterfuzz/_internal/crash_analysis/stack_parsing/stack_analyzer.py", line 113, in get_crash_data
    result = stack_parser.parse(crash_stacktrace_without_inlines)
  File "/mnt/scratch0/clusterfuzz/src/clusterfuzz/stacktraces/__init__.py", line 472, in parse
    self.match_assert(line, state, ASSERT_REGEX_GLIBC_SUFFIXED)
  File "/mnt/scratch0/clusterfuzz/src/clusterfuzz/stacktraces/__init__.py", line 322, in match_assert
    regex, line, state, new_type='ASSERT', new_frame_count=1)
  File "/mnt/scratch0/clusterfuzz/src/clusterfuzz/stacktraces/__init__.py", line 184, in update_state_on_match
    match = compiled_regex.match(line)
KeyboardInterrupt
root@clusterfuzz-

@jonathanmetzman
Collaborator Author

I think a few hundred bots are blocked by this.

@jonathanmetzman
Collaborator Author

Another thing we should have is better task killing: a bot should kill any process that has been running a single task for too long.
@oliverchang Do you know if we already have this feature? I think we do, right?
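
If we don't, a minimal sketch of the idea, assuming tasks can be run as child processes (run_task_with_deadline and TASK_HARD_TIMEOUT are made-up names, not existing ClusterFuzz code):

import subprocess

TASK_HARD_TIMEOUT = 2 * 60 * 60  # hypothetical hard cap per task, in seconds

def run_task_with_deadline(cmd):
  """Runs a task command in a child process, killing it past the deadline."""
  proc = subprocess.Popen(cmd)
  try:
    return proc.wait(timeout=TASK_HARD_TIMEOUT)
  except subprocess.TimeoutExpired:
    # SIGKILL still works when the child is busy inside C code (e.g. a regex
    # match), where an in-process timer or signal handler may never fire.
    proc.kill()
    proc.wait()
    return None  # caller would treat None as "timed out" and reschedule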

@jonathanmetzman
Collaborator Author

Another lesson we can learn is to add alerts for when queue backlogs grow too large. A large backlog is both a symptom of other problems and a problem in and of itself.
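
For example, a periodic check of the Pub/Sub backlog metric in Cloud Monitoring could page us well before the backlog hits 100k. A rough sketch, assuming the google-cloud-monitoring client library; the 50k threshold and the print-as-alert are placeholders:

import time

from google.cloud import monitoring_v3

BACKLOG_THRESHOLD = 50_000  # placeholder paging threshold

def check_queue_backlog(project_id):
  """Checks every subscription's unacked-message count over the last 10 min."""
  client = monitoring_v3.MetricServiceClient()
  now = int(time.time())
  interval = monitoring_v3.TimeInterval(
      {"end_time": {"seconds": now}, "start_time": {"seconds": now - 600}})
  results = client.list_time_series(
      request={
          "name": f"projects/{project_id}",
          "filter": ('metric.type = "pubsub.googleapis.com/subscription/'
                     'num_undelivered_messages"'),
          "interval": interval,
          "view": monitoring_v3.ListTimeSeriesRequest.TimeSeriesView.FULL,
      })
  for series in results:
    backlog = series.points[0].value.int64_value  # points come back newest-first
    if backlog > BACKLOG_THRESHOLD:
      sub = series.resource.labels["subscription_id"]
      print(f"ALERT: {sub} backlog is {backlog} unacked messages")  # would page on-call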

jonathanmetzman added a commit that referenced this issue May 8, 2024
Enormous stacktraces containing a giant array on a single line are causing bots to freeze.
Although fuzztest really should not be printing an input this large,
let's try to be resilient when it misbehaves.

Fixes: #3978
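
For reference, one way to be resilient (a sketch of the general idea, not necessarily what #3980 does) is to cap the length of any single line before the stack-parsing regexes ever see it:

MAX_LINE_LENGTH = 10_000  # placeholder cap; the real limit would need tuning

def truncate_long_lines(stacktrace):
  """Trims giant single lines (e.g. a serialized fuzztest input) so the
  stack-parsing regexes never run over pathologically long strings."""
  return "\n".join(
      line if len(line) <= MAX_LINE_LENGTH
      else line[:MAX_LINE_LENGTH] + " ...<truncated>"
      for line in stacktrace.splitlines())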
@jonathanmetzman
Collaborator Author

This has been mitigated in ClusterFuzz, and fuzztest no longer produces output like this.

@jonathanmetzman
Collaborator Author

But we should still deal with backlogs better.
