Created attachment 1238818
script to help reproducing
Description of problem:
At random, some unattended installations were hanging.
The customer has done a lot of work and found a possible (probable) root cause, a way to reproduce and a possible patch. They'd like to know our take on it and a plan for its inclusion.
Version-Release number of selected component (if applicable):
How reproducible:
100%
Steps to Reproduce:
[Customer comment:]
Based on backtraces collected with GDB and on the Anaconda source code, I wrote a minimal working example which demonstrates a bug in Python.
Could you confirm that you can reproduce it in your setup?
Reproduction:
pkill -9 anaconda
copy mwe.py from the attachment to the VM
execute:
while true; do python mwe.py ; done
After a few iterations the script should hang.
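Because the shell loop above simply stops making progress when the hang occurs, a small driver with a timeout can make the failing iteration explicit. The following is a minimal sketch, not part of the customer's attachment: the helper name first_hang and its parameters are hypothetical, it is written for Python 3, and it assumes the reproducer can be started as an ordinary command.

```python
import subprocess


def first_hang(cmd, attempts=100, timeout=30.0):
    """Run cmd up to `attempts` times.

    Return the 1-based index of the first run that does not finish within
    `timeout` seconds, or None if every run finishes in time.
    (Hypothetical helper for illustration only.)
    """
    for i in range(1, attempts + 1):
        try:
            # subprocess.run() kills the direct child when the timeout expires.
            subprocess.run(cmd, timeout=timeout)
        except subprocess.TimeoutExpired:
            return i
    return None
```

For example, first_hang(["python", "mwe.py"]) would report the iteration at which the reproducer stopped responding. Only the direct child is killed on timeout, so a stuck grandchild process may be left behind and need manual cleanup.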
With minor modifications I can reproduce it with Python 2.7.12 from the latest Fedora.
I also wrote and tested a workaround for anaconda.
Add the following lines to util.py after line 180, before 'return subprocess.Popen(argv,':
from distutils import spawn
argv[0] = spawn.find_executable(argv[0])
With this patch anaconda doesn't hang in my tests.
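The apparent idea of the workaround is that an absolute argv[0] lets the first execve() in the child succeed, so the failing-execve error path described in this report is not reached. A minimal standalone sketch of the same idea follows. It is not anaconda's code: the helper name start_program is hypothetical, and it uses shutil.which from Python 3 as a stand-in for distutils.spawn.find_executable so that it runs on current interpreters.

```python
import shutil
import subprocess


def start_program(argv, **popen_kwargs):
    """Resolve argv[0] to a full path in the parent, then spawn the program.

    Hypothetical helper for illustration; anaconda's util.py is structured
    differently.
    """
    argv = list(argv)  # do not modify the caller's list
    resolved = shutil.which(argv[0])
    if resolved is not None:
        # Only replace argv[0] when the lookup succeeded; otherwise let
        # Popen report the missing executable as usual.
        argv[0] = resolved
    return subprocess.Popen(argv, **popen_kwargs)
```

Note that both find_executable and shutil.which return None when nothing is found, so the guard above avoids passing None to Popen; the two-line patch as quoted does not include such a check.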
Actual results:
Installation does not complete in certain scenarios
Expected results:
Installation always completes
Additional info:
The steps the customer took before arriving at the patch:
I found a reproduction scenario and a possible root cause of the issue.
IMO the root cause is the implementation of subprocess.Popen, which is not safe in multithreaded programs.
Python should not call strerror() after fork(), because another thread may hold __libc_setlocale_lock when fork() is called.
AnaStorageThread uses the pyudev module, which calls __wcsmbs_load_conv, which takes __libc_setlocale_lock (please look at the attached backtrace).
After that, context is switched back to the main thread.
The main thread creates AnaTimeInit, and this thread calls subprocess.Popen, which calls fork().
The child process has a copy of __libc_setlocale_lock in the locked state. Python tries execve(), which fails, and then calls strerror(), which hangs waiting for __libc_setlocale_lock.
__libc_setlocale_lock will never be unlocked, because AnaStorageThread and the child of AnaTimeInit are separate processes.
Please look at the attached illustration.
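The general hazard described above (a lock held by another thread at fork() time stays locked forever in the child) can be shown in pure Python. This is an analogy only, written for Python 3 on a POSIX system: a threading.Lock stands in for __libc_setlocale_lock, a helper thread stands in for AnaStorageThread, and the function name child_sees_locked_lock is hypothetical. It does not reproduce the strerror() hang itself.

```python
import os
import threading
import warnings


def child_sees_locked_lock(timeout=0.5):
    """Return True if a forked child finds the lock still held."""
    lock = threading.Lock()        # stand-in for __libc_setlocale_lock
    held = threading.Event()
    release = threading.Event()

    def holder():                  # stand-in for AnaStorageThread
        with lock:
            held.set()
            release.wait()         # keep the lock until the parent says so

    t = threading.Thread(target=holder)
    t.start()
    held.wait()                    # fork only once the lock is really held

    with warnings.catch_warnings():
        # Newer Python versions warn about fork() in a multithreaded
        # process; that hazard is exactly what is being demonstrated.
        warnings.simplefilter("ignore", DeprecationWarning)
        pid = os.fork()

    if pid == 0:
        # Child: only the forking thread exists here, so no thread is left
        # that could release the copied lock. Use a timeout so the demo
        # exits instead of hanging the way the real child does.
        got_it = lock.acquire(timeout=timeout)
        os._exit(0 if got_it else 1)

    _, status = os.waitpid(pid, 0)
    release.set()                  # let the holder thread finish
    t.join()
    return os.WEXITSTATUS(status) == 1
```

In the reported hang there is no timeout, so the child blocks forever and the parent, which is waiting on the child, never continues.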
Steps to reproduce:
1. Create a VM with only 1 vCPU (I didn't test with more than one vCPU)
2. Start the installation in text mode with sshd enabled
3. pkill -9 anaconda
4. edit /sbin/anaconda
a) comment out line 1271
# atexit.register(exitHandler, ksdata.reboot, anaconda.storage, anaconda.payload)
b) comment out lines 1314, 1315
# anaconda._intf.setup(ksdata)
# anaconda._intf.run()
c) add line 1316
threadMgr.wait_all()
5. copy and install the debuginfo packages
anaconda-debuginfo-21.48.22.93-1.el7.x86_64.rpm
glib2-debuginfo-2.46.2-4.el7.x86_64.rpm
glibc-debuginfo-2.17-157.el7.x86_64.rpm
glibc-debuginfo-common-2.17-157.el7.x86_64.rpm
gobject-introspection-debuginfo-1.42.0-1.el7.x86_64.rpm
pygobject2-debuginfo-2.28.6-11.el7.x86_64.rpm
python-debuginfo-2.7.5-48.el7.x86_64.rpm
6. copy ks.cfg from the attachment to /root/ks.cfg
7. copy the file anadbg from the attachment to /root/anadbg
8. run gdb -P /root/anadbg
After a few minutes this command should fail or hang; re-execute it until it hangs. On my setup, 4 of 10 attempts hang.
The backtrace of the hung process is the same as the backtraces of the previously captured processes.
Since the problem described in this bug report should be
resolved in a recent advisory, it has been closed with a
resolution of ERRATA.
For information on the advisory, and where to find the updated
files, follow the link below.
If the solution does not work for you, open a new bug report.
https://access.redhat.com/errata/RHBA-2017:2293