Thread Pool Deadlock While Task Wait #1040

COM8 · 2024-04-07T17:43:42Z

Description

There are cases where a cpr::ThreadPool can deadlock. They are caused by a race condition involving task_cond.

A thread can end up waiting for task_cond to be notified at a point where nothing is left to notify it, for example when the caller invokes cpr::ThreadPool::Wait().
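
For illustration, here is a minimal sketch of the usual way to avoid this kind of lost wakeup. It is not cpr's implementation: TinyQueue, Push, Pop, Stop and RunDemo are hypothetical names made up for this example. The idea is that the waiter uses a predicate covering every reason to wake up, and that every writer changes the shared state while holding the same mutex before notifying.

```cpp
#include <cassert>
#include <condition_variable>
#include <mutex>
#include <queue>
#include <thread>

// Hypothetical miniature task queue, NOT cpr::ThreadPool.
struct TinyQueue {
    std::mutex mtx;
    std::condition_variable cond;
    std::queue<int> tasks;
    bool stopping = false;

    void Push(int value) {
        {
            // State changes happen under the mutex so a waiter cannot
            // check the predicate and then miss the notification.
            std::lock_guard<std::mutex> lock(mtx);
            tasks.push(value);
        }
        cond.notify_one();
    }

    void Stop() {
        {
            std::lock_guard<std::mutex> lock(mtx);
            stopping = true;
        }
        cond.notify_all();
    }

    // Blocks until a task is available or the queue is stopped.
    // Returns false once the queue is stopped and fully drained.
    bool Pop(int& out) {
        std::unique_lock<std::mutex> lock(mtx);
        // The predicate covers both wake-up reasons, so a notification
        // sent before the wait started is not lost.
        cond.wait(lock, [this] { return stopping || !tasks.empty(); });
        if (tasks.empty()) {
            return false;
        }
        out = tasks.front();
        tasks.pop();
        return true;
    }
};

// Starts one worker that sums everything pushed, then stops the queue.
int RunDemo() {
    TinyQueue queue;
    int sum = 0;
    std::thread worker([&] {
        int value = 0;
        while (queue.Pop(value)) {
            sum += value;
        }
    });
    for (int i = 1; i <= 100; ++i) {
        queue.Push(i);
    }
    queue.Stop();
    worker.join(); // join() makes the worker's writes to sum visible here
    return sum;
}
```

Calling RunDemo() pushes the numbers 1 to 100 and returns their sum once the worker has drained the queue and exited.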

Example/How to Reproduce

Run the currently disabled ThreadPoolTests from within #1035 over and over again. From time to time they will get stuck.

Possible Fix

No response

Where did you get it from?

GitHub (branch e.g. master)

Additional Context/Your Environment

Fedora 38, #1035

baderouaich · 2024-04-07T18:27:28Z

I ran into the ThreadPool getting stuck as well: it never stopped. The issue seems to be related to the scope of the status_lock. Look at the Stop() method:

int ThreadPool::Stop() {
    // status_lock is held for the entire scope of Stop(), so the mutex is
    // still held during the notifications (and the joins below). The woken
    // threads get no chance to acquire it in the thread loop.
    std::unique_lock status_lock(status_wait_mutex);

    if (STOP == status) {
        return -1;
    }
    status = STOP;
    status_wait_cond.notify_all();
    task_cond.notify_all();
    
    for (auto& i : threads) {
        if (i.thread->joinable()) {
            i.thread->join();
        }
    }

    threads.clear();
    cur_thread_num = 0;
    idle_thread_num = 0;
    
    return 0;
}

I think it should be:

int ThreadPool::Stop() {
    {
        // Lock only this scope so the mutex is not held during the
        // notification process, allowing the waiting threads to acquire
        // it immediately after being awakened.
        std::unique_lock status_lock(status_wait_mutex);
        if (STOP == status) {
            return -1;
        }
        status = STOP;
    }
    status_wait_cond.notify_all();
    task_cond.notify_all();

    ...
}

I tried that locally, and so far the thread pool waits and stops normally. I will test more use cases and report back if another issue turns up.
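
To make the scoped-lock idea easy to try in isolation, here is a small self-contained sketch. It is not the cpr code: MiniPool and its members are hypothetical names, and the workers do nothing except wait for the stop flag. It only shows the shape of a Stop() that releases the mutex before notifying and joining.

```cpp
#include <cassert>
#include <condition_variable>
#include <mutex>
#include <thread>
#include <vector>

// Hypothetical miniature pool, NOT cpr::ThreadPool.
class MiniPool {
  public:
    explicit MiniPool(int thread_count) {
        for (int i = 0; i < thread_count; ++i) {
            threads_.emplace_back([this] { Worker(); });
        }
    }

    ~MiniPool() {
        Stop();
    }

    // Returns 0 on the first call and -1 if the pool was already stopped.
    int Stop() {
        {
            // Hold the mutex only while changing the state.
            std::lock_guard<std::mutex> lock(mutex_);
            if (stopped_) {
                return -1;
            }
            stopped_ = true;
        }
        // The mutex is released here, so woken workers can reacquire it
        // inside wait() and see the stop flag.
        cond_.notify_all();
        for (auto& thread : threads_) {
            if (thread.joinable()) {
                thread.join();
            }
        }
        threads_.clear();
        return 0;
    }

  private:
    void Worker() {
        std::unique_lock<std::mutex> lock(mutex_);
        // wait() needs to reacquire mutex_ before it can return.
        cond_.wait(lock, [this] { return stopped_; });
    }

    std::mutex mutex_;
    std::condition_variable cond_;
    bool stopped_ = false;
    std::vector<std::thread> threads_;
};
```

If Stop() kept the lock for its whole body in this sketch, the workers could never return from wait(), because wait() has to reacquire mutex_ while the stopping thread is blocked in join().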

COM8 added Bug 🐛 Needs Investigation 🔍 labels Apr 7, 2024

COM8 added this to the CPR 1.11.0 milestone Apr 7, 2024

COM8 mentioned this issue Apr 7, 2024

cpr::ThreadPool high CPU usage when Paused #1035

Closed

KeeJef mentioned this issue Jul 31, 2024

Session User Engagement Report oxen-io/oxen-improvement-proposals#60

Open

COM8 modified the milestones: CPR 1.11.0, CPR 1.12.0 Oct 14, 2024
