Created attachment 1238818
script to help reproducing
Description of problem:
At random, some unattended installations were hanging.
The customer has done a lot of work and found a possible (probable) root cause, a way to reproduce and a possible patch. They'd like to know our take on it and a plan for its inclusion.
Version-Release number of selected component (if applicable):
Steps to Reproduce:
Based on backtraces collected with GDB and on the Anaconda source code, I wrote a minimal working example which demonstrates a bug in Python.
Could you confirm that you can reproduce it in your setup?
pkill -9 anaconda
copy mwe.py from the attachment to the VM
while true; do python mwe.py ; done
after a few iterations the script should hang
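The shell loop above relies on someone noticing that the script has stopped. A minimal sketch of a driver that reports the hang automatically is shown here; the function name `run_until_hang`, the attempt count and the 30 second timeout are assumptions for illustration, and the attached mwe.py itself is not reproduced.

```python
import subprocess


def run_until_hang(cmd, attempts=50, timeout=30.0):
    """Run cmd repeatedly.

    Returns the 1-based number of the first attempt that did not finish
    within `timeout` seconds, or None if every attempt finished.
    """
    for attempt in range(1, attempts + 1):
        try:
            # subprocess.run() kills the child when the timeout expires.
            subprocess.run(cmd, timeout=timeout)
        except subprocess.TimeoutExpired:
            return attempt
    return None


if __name__ == "__main__":
    # Hypothetical invocation matching the manual loop above.
    hung_at = run_until_hang(["python", "mwe.py"])
    print("hang detected at attempt:", hung_at)
```

Note that the timed-out process is killed by this driver, so it suits counting hangs rather than collecting a backtrace from the stuck process.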
With minor modifications I can reproduce it with Python 2.7.12 from the latest Fedora.
I also wrote and tested a workaround for anaconda.
Add the following lines to util.py after line 180, before 'return subprocess.Popen(argv,':
from distutils import spawn
argv = spawn.find_executable(argv)
With this patch anaconda doesn't hang in my tests.
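The idea behind the workaround, as I read it, is to resolve the program to a full path in the parent so that the child does not have to try (and fail) exec calls along PATH. A minimal sketch of that idea follows. It is not the anaconda code: `popen_resolved` is a hypothetical helper, it assumes argv is a list whose first element is the program name, and it uses `shutil.which` from Python 3 as a stand-in for `distutils.spawn.find_executable` used in the patch.

```python
import shutil
import subprocess


def popen_resolved(argv, **kwargs):
    """Resolve argv[0] to a full path in the parent, then spawn it."""
    full_path = shutil.which(argv[0])
    if full_path is None:
        # Report the missing program here, in the parent, instead of
        # letting exec fail in the forked child.
        raise FileNotFoundError(argv[0])
    return subprocess.Popen([full_path] + list(argv[1:]), **kwargs)


if __name__ == "__main__":
    proc = popen_resolved(["sh", "-c", "exit 0"])
    print(proc.args[0], proc.wait())
```

Raising in the parent when the lookup fails also keeps a missing program from being passed to Popen as None.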
Actual results:
Installation does not complete in certain scenarios
Expected results:
Installation always completes
The steps the customer took before arriving at the patch:
I found a reproduction scenario and a possible root cause of the issue.
IMO the root cause of the issue is the implementation of subprocess.Popen, which is not safe in multithreaded programs.
Python should not call strerror() after fork(), because another thread may hold __libc_setlocale_lock when fork() is called.
AnaStorageThread uses the pyudev module, which calls __wcsmbs_load_conv, which takes __libc_setlocale_lock (please see the attached backtrace).
After that, execution switches back to the main thread.
The main thread creates AnaTimeInit; this thread calls subprocess.Popen, which calls fork().
The child process has a copy of __libc_setlocale_lock in the locked state. Python tries execve(), which fails, and then calls strerror(), which hangs waiting for __libc_setlocale_lock.
In the child, __libc_setlocale_lock will never be unlocked, because AnaStorageThread and the child of AnaTimeInit are separate processes.
Please see the attached illustration.
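The mechanism described above can be imitated at the Python level. The sketch below is only an analogue, assuming a Unix system: a `threading.Lock` stands in for __libc_setlocale_lock, the `holder` thread stands in for AnaStorageThread, and `os.fork()` stands in for the fork() inside Popen. All names are hypothetical, and the child uses a timeout so that the example reports the problem instead of hanging.

```python
import os
import threading
import time


def child_can_take_lock(hold_seconds=1.0, child_timeout=0.2):
    """Fork while another thread holds a lock; return True if the
    child manages to acquire its copy of that lock."""
    lock = threading.Lock()       # stand-in for __libc_setlocale_lock
    held = threading.Event()

    def holder():                 # stand-in for AnaStorageThread
        with lock:
            held.set()
            time.sleep(hold_seconds)

    thread = threading.Thread(target=holder)
    thread.start()
    held.wait()                   # fork only once the lock is taken

    pid = os.fork()               # stand-in for the fork() in Popen
    if pid == 0:
        # Child: only the forking thread exists here, so nothing can
        # release the copied lock. Time out instead of hanging.
        acquired = lock.acquire(timeout=child_timeout)
        os._exit(0 if acquired else 1)

    _, status = os.waitpid(pid, 0)
    thread.join()
    return os.WEXITSTATUS(status) == 0


if __name__ == "__main__":
    print("child acquired lock:", child_can_take_lock())
```

On Linux this is expected to report that the child could not take the lock, even though the parent's holder thread releases its own copy shortly afterwards.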
Steps to reproduce:
1. Create a VM with only 1 vCPU (I didn't test with more than one vCPU)
2. Start the installation in text mode with sshd enabled
3. pkill -9 anaconda
4. edit /sbin/anaconda
a) comment out line 1271
# atexit.register(exitHandler, ksdata.reboot, anaconda.storage, anaconda.payload)
b) comment out lines 1314, 1315
c) add line 1316
5. copy and install debuginfo packages
6. copy ks.cfg from attachment to /root/ks.cfg
7. copy file anadbg from attachment to /root/anadbg
8. run gdb -P /root/anadbg
After a few minutes this command should fail or hang; re-execute it until it hangs. On my setup 4 of 10 attempts hang.
The backtrace of the hung process is the same as the backtraces of the previously captured processes.
Created attachment 1266040
backtrace of subprocess
I seem to have just hit a similar issue (strerror called from Popen). This time the program to exec is multipath, located in /usr/sbin/multipath, while the env PATH starts with /usr/bin.
I'll also attach a backtrace of the anaconda process stuck in read() from the subprocess.
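To illustrate why the PATH order matters here: a PATH-walking exec tries each directory in turn, and every miss before the hit is a failed exec attempt in the child. The sketch below only lists the candidates that would be tried; `exec_attempts` is a hypothetical helper and it does not exec anything.

```python
import os


def exec_attempts(program, path):
    """List the full paths a PATH-walking exec would try, in order,
    stopping at the first one that is an executable file."""
    attempts = []
    for directory in path.split(os.pathsep):
        candidate = os.path.join(directory, program)
        attempts.append(candidate)
        if os.path.isfile(candidate) and os.access(candidate, os.X_OK):
            break
    return attempts


if __name__ == "__main__":
    # With the layout described in this comment, the /usr/bin candidate
    # would be tried (and fail) before the /usr/sbin one.
    print(exec_attempts("multipath", "/usr/bin:/usr/sbin"))
```

Every entry before the last one in the returned list corresponds to an exec that would fail, which is the point where the error path described in this report is reached.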
Created attachment 1266042
backtrace of hanging anaconda process
Since the problem described in this bug report should be
resolved in a recent advisory, it has been closed with a
resolution of ERRATA.
For information on the advisory, and where to find the updated
files, follow the link below.
If the solution does not work for you, open a new bug report.