Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

[OS] Process.open_files() unsafe API usage can lead to deadlock #1967

Open
gabriellandau opened this issue Jul 22, 2021 · 0 comments · May be fixed by #2190
Open

[OS] Process.open_files() unsafe API usage can lead to deadlock #1967

gabriellandau opened this issue Jul 22, 2021 · 0 comments · May be fixed by #2190

Comments

@gabriellandau
Copy link

Summary

  • OS: Windows
  • Architecture: All
  • Psutil version: 5.8.0
  • Python version: 3.8.10
  • Type: performance

Description

The design of Process.open_files() has a serious design flaw that can lead to a deadlock. Per the psutil documentation:

Warning on Windows this method is not reliable due to some limitations of the underlying Windows API which may hang when retrieving certain file handles. In order to work around that psutil spawns a thread to determine the file handle name and kills it if it’s not responding after 100ms. That implies that this method on Windows is not guaranteed to enumerate all regular file handles (see issue 597). Tools like ProcessHacker has the same limitation.

Beyond this risk of missing data, this API also carries a deadlock risk.

The thread monitoring/termination logic is implemented here:

// If the thread hangs, kill it and cleanup.
if (dwWait == WAIT_TIMEOUT) {
psutil_debug(
"get handle name thread timed out after %i ms", THREAD_TIMEOUT);
if (TerminateThread(hThread, 0) == 0) {
PyErr_SetFromOSErrnoWithSyscall("TerminateThread");
CloseHandle(hThread);
return 1;
}
CloseHandle(hThread);
return 0;
}

The danger is that if the target thread is terminated while it is holding a lock, it can deadlock the entire process. MSDN warns about this here:

TerminateThread is a dangerous function that should only be used in the most extreme cases. You should call TerminateThread only if you know exactly what the target thread is doing, and you control all of the code that the target thread could possibly be running at the time of the termination. For example, TerminateThread can result in the following problems:

  • If the target thread owns a critical section, the critical section will not be released.
  • If the target thread is allocating memory from the heap, the heap lock will not be released.

In our environment, we were seeing deadlocks in python.exe where something had taken the loader lock then died. Unfortunately the thread was gone by the time we could inspect it with a debugger, leading to cryptic deadlocks, but !locks indicated that the loader lock was owned by a non-existent thread.

0:000> !locks

CritSec ntdll!LdrpLoaderLock+0 at 00000000772580d8
WaiterWoken        No
LockCount          203
RecursionCount     1
OwningThread       4c4
EntryCount         0
ContentionCount    d0
*** Locked

Scanned 687 critical sections

There was no thread 4c4 when we were able to dump that process.

Note that the loader lock is not the only lock that can cause a deadlock. In fact, many times when this deadlock occurs, !locks will not indicate anything abnormal.

Reproduction

To reproduce this, do the following on Windows (any version, really):

  1. Start the process that will hang.
start /low py -c "import os,psutil; print('Wait for PID %u CPU to drop to 0%% in Task Manager' % os.getpid()); [[p.as_dict() for p in psutil.process_iter()] for x in range(99999)]"
  1. Open Task Manager and find that process. It should be using most of one CPU core. For example, it uses approximately 8% on my 6C/12T machine.
  2. Now we need to slow down that process to increase the chances of the race condition occurring. Run the following to max out all your CPU cores:
py -c "import os,multiprocessing; [os.system('start python -c \"while True: pass\"') for i in range(multiprocessing.cpu_count())]"
  1. Wait a few seconds. The process from step 1 should drop to 0% CPU. You can now kill all the processes started in step 3.
  2. If the process from step 1 resumes using CPU, repeat steps 3 & 4 a few times. The reproduction is probabilistic but is very repeatable on my machine.

Using PyExt for symbols, we can see the call stack hung in open_files:

0:000> !pyext.pysymfix
Current symbol path: SRV*C:\Symbols*https://msdl.microsoft.com/download/symbols;srv*http://pythonsymbols.sdcline.com/symbols
Loading symbols...

0:000> .shell -ci "~* kP" findstr "qualname"
			struct _object * qualname = 0x000001d6`63ee9670Type: str 'Process.open_files')+0x228 [C:\A\34\s\Python\ceval.c @ 4298] 
			struct _object * qualname = 0x000001d6`63c92270Type: str 'Process.as_dict')+0x228 [C:\A\34\s\Python\ceval.c @ 4298] 
			struct _object * qualname = 0x000001d6`63a980d0Type: str '<listcomp>.<listcomp>')+0x228 [C:\A\34\s\Python\ceval.c @ 4298] 
			struct _object * qualname = 0x00000000`00000000)+0x228 [C:\A\34\s\Python\ceval.c @ 4298] 
.shell: Process exited
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Projects
None yet
Development

Successfully merging a pull request may close this issue.

1 participant