Bug 2073510
| Summary: | Installation sometimes hangs before Verifying: stage of dnf transaction | ||||||
|---|---|---|---|---|---|---|---|
| Product: | Red Hat Enterprise Linux 9 | Reporter: | Jan Stodola <jstodola> | ||||
| Component: | dnf | Assignee: | Jaroslav Mracek <jmracek> | ||||
| Status: | CLOSED INSUFFICIENT_DATA | QA Contact: | swm-qe | ||||
| Severity: | unspecified | Docs Contact: | Sagar Dubewar <sdubewar> | ||||
| Priority: | medium | ||||||
| Version: | 9.0 | CC: | bstinson, gfialova, james.antill, jikortus, jwboyer, pasik, rvykydal, torsava | ||||
| Target Milestone: | rc | Keywords: | Triaged | ||||
| Target Release: | --- | Flags: | pm-rhel:
mirror+
|
||||
| Hardware: | Unspecified | ||||||
| OS: | Unspecified | ||||||
| Whiteboard: | |||||||
| Fixed In Version: | Doc Type: | Known Issue | |||||
| Doc Text: |
.The Installation process sometimes becomes unresponsive
When you install RHEL, the installation process sometimes becomes unresponsive. The `/tmp/packaging.log` file displays the following message at the end:
----
10:20:56,416 DDEBUG dnf: RPM transaction over.
----
To work around this problem, restart the installation process.
|
Story Points: | --- | ||||
| Clone Of: | Environment: | ||||||
| Last Closed: | 2022-10-25 16:06:15 UTC | Type: | Bug | ||||
| Regression: | --- | Mount Type: | --- | ||||
| Documentation: | --- | CRM: | |||||
| Verified Versions: | Category: | --- | |||||
| oVirt Team: | --- | RHEL 7.3 requirements from Atomic Host: | |||||
| Cloudforms Team: | --- | Target Upstream Version: | |||||
| Embargoed: | |||||||
| Bug Depends On: | |||||||
| Bug Blocks: | 1942219, 2027125, 2091421 | ||||||
| Attachments: |
|
||||||
|
Description
Jan Stodola
2022-04-08 16:09:49 UTC
Thank you very much for your report. According to investigation in BZ2054230 it stops on lock in `names = set([i.name for i in q])` (base.py line 1159). When the line is replaced by ` ``` names = set() for pkg in q: name = pkg.name names.add(name) ``` it stops on lock in `for p in g.getPackages()` (base.py line 1162). There is also one unique part of Anaconda. It uses `multiprocessing` to run dnf (anaconda/pyanaconda/modules/payloads/payload/dnf/dnf_manager.py) ``` process = multiprocessing.Process( target=self._run_transaction, args=(self._base, display) ) # Start the transaction. log.debug("Starting the transaction process...") process.start() ``` The problem is that the issue is hard to reproduce and I also believe that the issue cannot be related to any recent change in DNF therefore I would like to ask the Python team to investigate the issue or provide a suggestion that will help us with fixing it. Hi Jan, could you provide a core dump from the affected installation? Perhaps we could identify the issue there. Even better for us would be to get access to a machine that has run into the issue. However, we don't have the means to realistically run 200 RHEL 9 installations waiting for the bug. Would you be able to get us access to such a machine? These machines have no debug symbols for python3.9, dnf is not working, and even if I copy python-gdb.py manually, gdb fails to rebuild the Python traceback. Oh.
I used a different approach: call _Py_DumpTracebackThreads().
Step 1: get the gdb stdout
$ ps ax|grep gdb
513638 pts/2 Sl+ 0:00 gdb -p 513545
$ readlink /proc/513638/fd/1
/dev/pts/2
=> gdb stdout is /dev/pts/2
Step 2: open the Python process in gdb and type
(a) Open the gdb stdout:
(gdb) p (int)open("/dev/pts/2", 1)
$1 = 56
(b) Call _Py_DumpTracebackThreads() with this file descriptor
(gdb) p (const char*)_Py_DumpTracebackThreads(46, 0, 0)
Current thread 0x00007f58da2c1740 (most recent call first):
File "/usr/lib/python3.9/site-packages/gi/overrides/GLib.py", line 497 in run
(...)
$2 = 0x0
(c) Close the file descriptor
(gdb) p (int)close(46)
$3 = 0
I don't know dnf at all so I tried to pick the most interesting python3.9 processes used by anaconda.
I looked at the hierarchy of these 6 processes:
(1) anaconda,2077 (4 threads): multiprocessing: pyanaconda/payload/dnf/payload.py:914: install()
* CHILD: (2) {anaconda},2101 (1 thread): poll() <= main() of gi/overrides/Gtk.py
* CHILD: (3) {anaconda},2102 (1 thread): poll() <= main() of gi/overrides/Gtk.py
* CHILD: (4) {anaconda},2617 (1 thread): multiprocessing _recv() <= install() of payload/dnf/payload.py
* CHILD: (5) anaconda,3065 (2 threads):
* Thread 1: g_mutex_lock_slowpath() <= g_signal_handlers_destroy()
* Thread 2: lock_PyThread_acquire_lock() <= do_transaction() of pyanaconda/payload/dnf/utils.py
* CHILD: (6) {anaconda},3069 (1 thread): multiprocessing queue _feed() -> lock_PyThread_acquire_lock()
So 3 processes are doing multiprocessing things in Python:
* 2077: multiprocessing: pyanaconda/payload/dnf/payload.py:914: install()
* 2617: multiprocessing _recv() <= install() of payload/dnf/payload.py
* 3069: multiprocessing queue _feed() -> lock_PyThread_acquire_lock()
It seems like the logical chain of processes is: 2077 => 2617 => 3069.
So the process 3069 is the end of the chain and it's stuck in queue _feed(): it calls nwait() at multiprocessing/queues.py:233 which is self._notempty.wait().
I don't understand how this queue is supposed to be filled, since the process has a single thread which is waiting until this queue is not empty.
The strange part is the _Py_DumpTracebackThreads() still shows Python threads which no longer exists: gdb lists 1 thread, _Py_DumpTracebackThreads() shows 2 threads. I guess that they are zombie from the forked parent process, and Python fails to clear these zombies threads.
Attached tracebacks.txt contains the traceback of these processes.
Created attachment 1874196 [details]
tracebacks.txt
Oops, copy/paste issue. My steps to debug are not correct. You must copy/paste the file descriptor created by open(). For example, here it's 46, so you must pass 46 to _Py_DumpTracebackThreads() and close().
(a) Open the gdb stdout:
(gdb) p (int)open("/dev/pts/2", 1)
$1 = 46
(b) Call _Py_DumpTracebackThreads() with this file descriptor
(gdb) p (const char*)_Py_DumpTracebackThreads(46, 0, 0)
Current thread 0x00007f58da2c1740 (most recent call first):
File "/usr/lib/python3.9/site-packages/gi/overrides/GLib.py", line 497 in run
(...)
$2 = 0x0
(c) Close the file descriptor
(gdb) p (int)close(46)
$3 = 0
Hi, I had a look at on the "stuck" machine and collected traceback of "stuck" different anaconda processes. I don't know how to make progress on that issue. Is there a way to reproduce the issue? Tell me if you need my help to investigate the issue. I reassign the issue to dnf. I am really sorry but without solid reproducer we cannot do much. Right now it is a candidate to be closed as insufficient data. Hi Sagar, the draft looks fine, thanks. Jaroslav (and others), I've tried to run anaconda on an installed system (RHEL-9.0-RC) and installing to an image file: # anaconda --image /var/tmp/disk.img --kickstart /var/tmp/ks.cfg --text There was about 3000 installations in total executed on ~15 machines, but I wasn't able to reproduce this bug with this method. Some of the installation were executed in text mode, some in vnc mode. Please, do not close this bug, it needs to be debugged and fixed, probably in cooperation with the anaconda developers. I was able to reproduce with RHEL9 Beta. I was not able to reproduce with RHEL9 Alpha but should do more runs. I am really sorry, without a reproducer we cannot do much. Please feel free to reopen the issue with additional information. |