
Investigate queue backlog #3978

Open
jonathanmetzman opened this issue May 7, 2024 · 8 comments · Fixed by #3980

Comments

@jonathanmetzman
Collaborator

There is a 100k unacked-message backlog.
Interestingly, many of these tasks appear to have been taken but never finished.

@jonathanmetzman
Collaborator Author

Two instances that have been running their tasks for weeks are stuck at the same step: "Testing for crash".

@jonathanmetzman
Collaborator Author

I've looked at two instances suffering from this issue. Their stacktraces looked roughly similar:

(gdb) bt                                                                                                                                                                                                                                                                                  
#0  0x00005cbe7d5d4b34 in sre_ucs1_match (state=state@entry=0x7ffee26f5bd0, pattern=pattern@entry=0x5cbe7fc7c014, toplevel=toplevel@entry=0) at ./Modules/sre_lib.h:590                                                                                                                   
#1  0x00005cbe7d5db30d in sre_ucs1_search (pattern=<optimized out>, state=0x7ffee26f5bd0) at ./Modules/sre_lib.h:1443                                                                                                                                                                     
#2  sre_search (state=state@entry=0x7ffee26f5bd0, pattern=pattern@entry=0x5cbe7fc7bfe8) at ./Modules/_sre.c:578                                                                                                                                                                           
#3  0x00005cbe7d5dd414 in pattern_subx (self=self@entry=0x5cbe7fc7bf90, ptemplate=<optimized out>, string=0x5cbe840925b0, count=0, subn=subn@entry=0) at ./Modules/_sre.c:1060                                                                                                            
#4  0x00005cbe7d5ddbf5 in _sre_SRE_Pattern_sub_impl (count=<optimized out>, string=<optimized out>, repl=<optimized out>, self=0x5cbe7fc7bf90) at ./Modules/_sre.c:1181                                                                                                                   
#5  _sre_SRE_Pattern_sub (self=0x5cbe7fc7bf90, args=<optimized out>, nargs=<optimized out>, kwnames=<optimized out>) at ./Modules/clinic/_sre.c.h:416                                                                                                                                     
#6  0x00005cbe7d4b9ea7 in _PyMethodDef_RawFastCallKeywords (method=0x5cbe7d741320 <pattern_methods+96>, self=self@entry=0x5cbe7fc7bf90, args=args@entry=0x7c43d107cf10, nargs=nargs@entry=3, kwnames=kwnames@entry=0x0) at Objects/call.c:660                                             
#7  0x00005cbe7d6346be in _PyMethodDescr_FastCallKeywords (descrobj=descrobj@entry=0x7c43d311a550, args=0x7c43d107cf08, nargs=nargs@entry=4, kwnames=kwnames@entry=0x0) at Objects/descrobject.c:288                                                                                      
#8  0x00005cbe7d4a01b2 in call_function (pp_stack=pp_stack@entry=0x7ffee26f5e90, oparg=<optimized out>, kwnames=kwnames@entry=0x0) at Python/ceval.c:4593                                                                                                                                 
#9  0x00005cbe7d4a119c in _PyEval_EvalFrameDefault (f=<optimized out>, throwflag=<optimized out>) at Python/ceval.c:3110                                                                                                                                                                  
#10 0x00005cbe7d568560 in PyEval_EvalFrameEx (throwflag=0, f=0x7c43d107cd70) at Python/ceval.c:547

This looks like Python stuck in a regex match that effectively never finishes, likely catastrophic backtracking on a pathological input.
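
For reference, here is a minimal, self-contained sketch of the failure mode, using a toy pattern (not the actual ClusterFuzz regexes): a regex with nested quantifiers can take exponential time on a long, almost-matching line, which from the outside looks exactly like a hang.

import re
import time

# Toy pattern with nested quantifiers, not the actual ClusterFuzz ASSERT regex.
pattern = re.compile(r"^(\w+\s?)*$")

for n in (18, 20, 22, 24):
  # The trailing "!" makes the match fail only after every possible way of
  # splitting the run of "a"s has been tried.
  line = "a" * n + "!"
  start = time.perf_counter()
  pattern.match(line)
  print(n, f"{time.perf_counter() - start:.3f}s")  # roughly doubles per extra character

A line that is hundreds of kilobytes long would, in practice, never finish.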

@jonathanmetzman
Collaborator Author

Got a stacktrace from a third instance:

Traceback (most recent call last):
  File "/mnt/scratch0/clusterfuzz/src/python/bot/startup/run_bot.py", line 249, in <module>
    main()
  File "/mnt/scratch0/clusterfuzz/src/python/bot/startup/run_bot.py", line 212, in main
    error_stacktrace, clean_exit, task_payload = task_loop()
  File "/mnt/scratch0/clusterfuzz/src/python/bot/startup/run_bot.py", line 146, in task_loop
    commands.process_command(task)
  File "/mnt/scratch0/clusterfuzz/src/clusterfuzz/_internal/bot/tasks/commands.py", line 249, in process_command
    task.high_end, task.is_command_override)
  File "/mnt/scratch0/clusterfuzz/src/clusterfuzz/_internal/bot/tasks/commands.py", line 159, in wrapper
    return func(task_name, task_argument, job_name, *args, **kwargs)
  File "/mnt/scratch0/clusterfuzz/src/clusterfuzz/_internal/bot/tasks/commands.py", line 431, in process_command_impl
    preprocess)
  File "/mnt/scratch0/clusterfuzz/src/clusterfuzz/_internal/bot/tasks/commands.py", line 218, in run_command
    result = task.execute(task_argument, job_name, uworker_env)
  File "/mnt/scratch0/clusterfuzz/src/clusterfuzz/_internal/bot/tasks/task_types.py", line 127, in execute
    self.execute_locally(task_argument, job_type, uworker_env)
  File "/mnt/scratch0/clusterfuzz/src/clusterfuzz/_internal/bot/tasks/task_types.py", line 63, in execute_locally
    uworker_output = utasks.uworker_main_no_io(self.module, uworker_input)
  File "/mnt/scratch0/clusterfuzz/src/clusterfuzz/_internal/bot/tasks/utasks/__init__.py", line 194, in uworker_main_no_io
    uworker_output = utask_module.utask_main(uworker_input)
  File "/mnt/scratch0/clusterfuzz/src/clusterfuzz/_internal/bot/tasks/utasks/analyze_task.py", line 368, in utask_main
    fuzz_target, testcase, testcase_file_path, test_timeout)
  File "/mnt/scratch0/clusterfuzz/src/clusterfuzz/_internal/bot/tasks/utasks/analyze_task.py", line 197, in test_for_crash_with_retries
    compare_crash=False)
  File "/mnt/scratch0/clusterfuzz/src/clusterfuzz/_internal/bot/testcase_manager.py", line 801, in test_for_crash_with_retries
    testcase.flaky_stack)
  File "/mnt/scratch0/clusterfuzz/src/clusterfuzz/_internal/bot/testcase_manager.py", line 688, in reproduce_with_retries
    state = self._get_crash_state(round_number, crash_result)
  File "/mnt/scratch0/clusterfuzz/src/clusterfuzz/_internal/bot/testcase_manager.py", line 664, in _get_crash_state
    state = crash_result.get_symbolized_data()
  File "/mnt/scratch0/clusterfuzz/src/clusterfuzz/_internal/crash_analysis/crash_result.py", line 48, in get_symbolized_data
    self.output, symbolize_flag=True)
  File "/mnt/scratch0/clusterfuzz/src/clusterfuzz/_internal/crash_analysis/stack_parsing/stack_analyzer.py", line 113, in get_crash_data
    result = stack_parser.parse(crash_stacktrace_without_inlines)
  File "/mnt/scratch0/clusterfuzz/src/clusterfuzz/stacktraces/__init__.py", line 472, in parse
    self.match_assert(line, state, ASSERT_REGEX_GLIBC_SUFFIXED)
  File "/mnt/scratch0/clusterfuzz/src/clusterfuzz/stacktraces/__init__.py", line 322, in match_assert
    regex, line, state, new_type='ASSERT', new_frame_count=1)
  File "/mnt/scratch0/clusterfuzz/src/clusterfuzz/stacktraces/__init__.py", line 184, in update_state_on_match
    match = compiled_regex.match(line)
KeyboardInterrupt
root@clusterfuzz-

@jonathanmetzman
Collaborator Author

I think a few hundred bots are blocked by this.

@jonathanmetzman
Collaborator Author

Another thing we should have is better task killing: a bot should kill any process that has been running a single task for too long.
@oliverchang Do you know if we already have this feature? I think we do, right?
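
If we don't, a minimal sketch of the idea, assuming tasks can be run as child processes (run_task_with_deadline and TASK_HARD_TIMEOUT are made-up names, not existing ClusterFuzz code):

import subprocess

TASK_HARD_TIMEOUT = 2 * 60 * 60  # hypothetical hard cap per task, in seconds

def run_task_with_deadline(cmd):
  """Runs a task command in a child process, killing it past the deadline."""
  proc = subprocess.Popen(cmd)
  try:
    return proc.wait(timeout=TASK_HARD_TIMEOUT)
  except subprocess.TimeoutExpired:
    # SIGKILL still works when the child is busy inside C code (e.g. a regex
    # match), where an in-process timer or signal handler may never fire.
    proc.kill()
    proc.wait()
    return None  # caller would treat None as "timed out" and reschedule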

@jonathanmetzman
Collaborator Author

Another lesson we can learn is to add alerts for when queue backlogs grow too large. A large backlog is both a symptom of other problems and a problem in and of itself.
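
For example, a periodic check of the Pub/Sub backlog metric in Cloud Monitoring could page us well before the backlog hits 100k. A rough sketch, assuming the google-cloud-monitoring client library; the 50k threshold and the print-as-alert are placeholders:

import time

from google.cloud import monitoring_v3

BACKLOG_THRESHOLD = 50_000  # placeholder paging threshold

def check_queue_backlog(project_id):
  """Checks every subscription's unacked-message count over the last 10 min."""
  client = monitoring_v3.MetricServiceClient()
  now = int(time.time())
  interval = monitoring_v3.TimeInterval(
      {"end_time": {"seconds": now}, "start_time": {"seconds": now - 600}})
  results = client.list_time_series(
      request={
          "name": f"projects/{project_id}",
          "filter": ('metric.type = "pubsub.googleapis.com/subscription/'
                     'num_undelivered_messages"'),
          "interval": interval,
          "view": monitoring_v3.ListTimeSeriesRequest.TimeSeriesView.FULL,
      })
  for series in results:
    backlog = series.points[0].value.int64_value  # points come back newest-first
    if backlog > BACKLOG_THRESHOLD:
      sub = series.resource.labels["subscription_id"]
      print(f"ALERT: {sub} backlog is {backlog} unacked messages")  # would page on-call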

jonathanmetzman added a commit that referenced this issue May 8, 2024
Enormous stacktraces containing a giant array on a single line are causing bots to freeze.
Although fuzztest really should not be printing an input this large,
let's try to be resilient when it misbehaves.

Fixes: #3978
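
For reference, one way to be resilient (a sketch of the general idea, not necessarily what #3980 does) is to cap the length of any single line before the stack-parsing regexes ever see it:

MAX_LINE_LENGTH = 10_000  # placeholder cap; the real limit would need tuning

def truncate_long_lines(stacktrace):
  """Trims giant single lines (e.g. a serialized fuzztest input) so the
  stack-parsing regexes never run over pathologically long strings."""
  return "\n".join(
      line if len(line) <= MAX_LINE_LENGTH
      else line[:MAX_LINE_LENGTH] + " ...<truncated>"
      for line in stacktrace.splitlines())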
@jonathanmetzman
Collaborator Author

This has been mitigated in ClusterFuzz, and fuzztest no longer produces output like this.

@jonathanmetzman
Collaborator Author

But we should still deal with backlogs better.
