RHEL Engineering is moving the tracking of its product development work on RHEL 6 through RHEL 9 to Red Hat Jira (issues.redhat.com). If you're a Red Hat customer, please continue to file support cases via the Red Hat customer portal. If you're not, please head to the "RHEL project" in Red Hat Jira and file new tickets here. Individual Bugzilla bugs in the statuses "NEW", "ASSIGNED", and "POST" are being migrated throughout September 2023. Bugs of Red Hat partners with an assigned Engineering Partner Manager (EPM) are migrated in late September as per pre-agreed dates. Bugs against components "kernel", "kernel-rt", and "kpatch" are only migrated if still in "NEW" or "ASSIGNED". If you cannot log in to RH Jira, please consult article #7032570. That failing, please send an e-mail to the RH Jira admins at rh-issues@redhat.com to troubleshoot your issue as a user management inquiry. The email creates a ServiceNow ticket with Red Hat. Individual Bugzilla bugs that are migrated will be moved to status "CLOSED", resolution "MIGRATED", and set with "MigratedToJIRA" in "Keywords". The link to the successor Jira issue will be found under "Links", have a little "two-footprint" icon next to it, and direct you to the "RHEL project" in Red Hat Jira (issue links are of type "https://issues.redhat.com/browse/RHEL-XXXX", where "X" is a digit). This same link will be available in a blue banner at the top of the page informing you that that bug has been migrated.
Bug 2073510 - Installation sometimes hangs before Verifying: stage of dnf transaction
Summary: Installation sometimes hangs before Verifying: stage of dnf transaction
Keywords:
Status: CLOSED INSUFFICIENT_DATA
Alias: None
Product: Red Hat Enterprise Linux 9
Classification: Red Hat
Component: dnf
Version: 9.0
Hardware: Unspecified
OS: Unspecified
medium
unspecified
Target Milestone: rc
: ---
Assignee: Jaroslav Mracek
QA Contact: swm-qe
Sagar Dubewar
URL:
Whiteboard:
Depends On:
Blocks: 1942219 2027125 2091421
TreeView+ depends on / blocked
 
Reported: 2022-04-08 16:09 UTC by Jan Stodola
Modified: 2022-10-25 16:06 UTC (History)
7 users (show)

Fixed In Version:
Doc Type: Known Issue
Doc Text:
.The Installation process sometimes becomes unresponsive When you install RHEL, the installation process sometimes becomes unresponsive. The `/tmp/packaging.log` file displays the following message at the end: ---- 10:20:56,416 DDEBUG dnf: RPM transaction over. ---- To workaround this problem, restart the installation process.
Clone Of:
Environment:
Last Closed: 2022-10-25 16:06:15 UTC
Type: Bug
Target Upstream Version:
Embargoed:


Attachments (Terms of Use)
tracebacks.txt (20.57 KB, text/plain)
2022-04-21 16:42 UTC, Victor Stinner
no flags Details


Links
System ID Private Priority Status Summary Last Updated
Red Hat Issue Tracker RHELPLAN-118420 0 None None None 2022-04-08 16:18:08 UTC

Description Jan Stodola 2022-04-08 16:09:49 UTC
Description of problem:
Anaconda sometimes doesn't complete the package installation and verification phase and stops. The last message in packaging.log is:

...
10:20:55,019 INF packaging: Configuring (running scriptlet for): rootfiles-8.1-31.el9.noarch 1628613242 94e7ec73e7998cb2f94f5306fded9be2bc175e5a4c912fd3744f2ca18f2afce1
10:20:56,163 INF packaging: Configuring (running scriptlet for): rootfiles-8.1-31.el9.noarch 1628613242 94e7ec73e7998cb2f94f5306fded9be2bc175e5a4c912fd3744f2ca18f2afce1
10:20:56,212 INF dnf.rpm: Running in chroot, ignoring request.

10:20:56,283 INF packaging: Configuring (running scriptlet for): rootfiles-8.1-31.el9.noarch 1628613242 94e7ec73e7998cb2f94f5306fded9be2bc175e5a4c912fd3744f2ca18f2afce1
10:20:56,416 DDEBUG dnf: RPM transaction over.



What is expected at this point is a package verification, like in this case:
...
14:24:00,364 INF packaging: Configuring (running scriptlet for): rootfiles-8.1-31.el9.noarch 1628613242 94e7ec73e7998cb2f94f5306fded9be2bc175e5a4c912fd3744f2ca18f2afce1
14:24:01,594 INF packaging: Configuring (running scriptlet for): rootfiles-8.1-31.el9.noarch 1628613242 94e7ec73e7998cb2f94f5306fded9be2bc175e5a4c912fd3744f2ca18f2afce1
14:24:01,640 INF dnf.rpm: Running in chroot, ignoring request.

14:24:01,710 INF packaging: Configuring (running scriptlet for): rootfiles-8.1-31.el9.noarch 1628613242 94e7ec73e7998cb2f94f5306fded9be2bc175e5a4c912fd3744f2ca18f2afce1
14:24:01,808 DDEBUG dnf: RPM transaction over.
14:24:02,075 INF packaging: Verifying: NetworkManager-1:1.36.0-4.el9_0.aarch64 1648805902 d45b2f054bc32f77671b29dea0d3aa8307b31c7ef891126ef2d87ae4fad07f4e
14:24:02,080 INF packaging: Verifying: NetworkManager-libnm-1:1.36.0-4.el9_0.aarch64 1648805902 7d552bca978bb4a7a6db22b19eed772751052be92981b0958d198839d8ea6a8e
14:24:02,081 INF packaging: Verifying: NetworkManager-team-1:1.36.0-4.el9_0.aarch64 1648805902 881de45f4299f8856fc97d0c2b05d025dfbdc6e37d000d83a6ab94acf964edc9
...

Version-Release number of selected component (if applicable):
RHEL-9.0.0-20220406.1 or RHEL-9.1.0-20220408.0
dnf-4.10.0-5.el9_0.noarch
anaconda-34.25.0.29-1.el9_0

How reproducible:
Sometimes. Maybe 1 out of 200 installations.
Seen on s390x, aarch64 and x86_64.

Steps to Reproduce:
1. Unclear, run a RHEL-9 installation

Actual results:
Installation hangs during the package installation process.

Expected results:
Successful installation.

Additional info:
This problem looks the same as a Fedora bug 2054230.

Comment 4 Jaroslav Mracek 2022-04-12 09:57:47 UTC
Thank you very much for your report. According to investigation in BZ2054230 it stops on lock in `names = set([i.name for i in q])` (base.py line 1159). When the line is replaced by `
```
names = set()
for pkg in q:
    name = pkg.name
    names.add(name)
```

it stops on lock in `for p in g.getPackages()` (base.py line 1162).

There is also one unique part of Anaconda. It uses `multiprocessing` to run dnf (anaconda/pyanaconda/modules/payloads/payload/dnf/dnf_manager.py)

```
process = multiprocessing.Process(
            target=self._run_transaction,
            args=(self._base, display)
        )

        # Start the transaction.
        log.debug("Starting the transaction process...")
        process.start()
```
The problem is that the issue is hard to reproduce and I also believe that the issue cannot be related to any recent change in DNF therefore I would like to ask the Python team to investigate the issue or provide a suggestion that will help us with fixing it.

Comment 5 Tomas Orsava 2022-04-20 12:27:11 UTC
Hi Jan,
could you provide a core dump from the affected installation? Perhaps we could identify the issue there.

Even better for us would be to get access to a machine that has run into the issue. However, we don't have the means to realistically run 200 RHEL 9 installations waiting for the bug. Would you be able to get us access to such a machine?

Comment 9 Victor Stinner 2022-04-21 16:33:58 UTC
These machines have no debug symbols for python3.9, dnf is not working, and even if I copy python-gdb.py manually, gdb fails to rebuild the Python traceback. Oh.

I used a different approach: call _Py_DumpTracebackThreads().

Step 1: get the gdb stdout

$ ps ax|grep gdb
 513638 pts/2    Sl+    0:00 gdb -p 513545
$ readlink /proc/513638/fd/1
/dev/pts/2

=> gdb stdout is /dev/pts/2

Step 2: open the Python process in gdb and type

(a) Open the gdb stdout:

(gdb) p (int)open("/dev/pts/2", 1)
$1 = 56

(b) Call _Py_DumpTracebackThreads() with this file descriptor

(gdb) p (const char*)_Py_DumpTracebackThreads(46, 0, 0)
Current thread 0x00007f58da2c1740 (most recent call first):
  File "/usr/lib/python3.9/site-packages/gi/overrides/GLib.py", line 497 in run
  (...)
$2 = 0x0

(c) Close the file descriptor

(gdb) p (int)close(46)
$3 = 0

Comment 10 Victor Stinner 2022-04-21 16:41:46 UTC
I don't know dnf at all so I tried to pick the most interesting python3.9 processes used by anaconda.

I looked at the hierarchy of these 6 processes:

(1) anaconda,2077 (4 threads): multiprocessing: pyanaconda/payload/dnf/payload.py:914: install()

* CHILD: (2) {anaconda},2101 (1 thread): poll() <= main() of gi/overrides/Gtk.py
* CHILD: (3) {anaconda},2102 (1 thread): poll() <= main() of gi/overrides/Gtk.py
* CHILD: (4) {anaconda},2617 (1 thread): multiprocessing _recv() <= install() of payload/dnf/payload.py
* CHILD: (5) anaconda,3065 (2 threads):

  * Thread 1: g_mutex_lock_slowpath() <= g_signal_handlers_destroy()
  * Thread 2: lock_PyThread_acquire_lock() <= do_transaction() of pyanaconda/payload/dnf/utils.py
  * CHILD: (6) {anaconda},3069 (1 thread): multiprocessing queue _feed() -> lock_PyThread_acquire_lock()

So 3 processes are doing multiprocessing things in Python:

* 2077: multiprocessing: pyanaconda/payload/dnf/payload.py:914: install()
* 2617: multiprocessing _recv() <= install() of payload/dnf/payload.py
* 3069: multiprocessing queue _feed() -> lock_PyThread_acquire_lock()

It seems like the logical chain of processes is: 2077 => 2617 => 3069.

So the process 3069 is the end of the chain and it's stuck in queue _feed(): it calls nwait() at multiprocessing/queues.py:233 which is self._notempty.wait().

I don't understand how this queue is supposed to be filled, since the process has a single thread which is waiting until this queue is not empty.


The strange part is the _Py_DumpTracebackThreads() still shows Python threads which no longer exists: gdb lists 1 thread, _Py_DumpTracebackThreads() shows 2 threads. I guess that they are zombie from the forked parent process, and Python fails to clear these zombies threads.


Attached tracebacks.txt contains the traceback of these processes.

Comment 11 Victor Stinner 2022-04-21 16:42:36 UTC
Created attachment 1874196 [details]
tracebacks.txt

Comment 12 Victor Stinner 2022-04-21 16:44:13 UTC
Oops, copy/paste issue. My steps to debug are not correct. You must copy/paste the file descriptor created by open(). For example, here it's 46, so you must pass 46 to _Py_DumpTracebackThreads() and close().


(a) Open the gdb stdout:

(gdb) p (int)open("/dev/pts/2", 1)
$1 = 46

(b) Call _Py_DumpTracebackThreads() with this file descriptor

(gdb) p (const char*)_Py_DumpTracebackThreads(46, 0, 0)
Current thread 0x00007f58da2c1740 (most recent call first):
  File "/usr/lib/python3.9/site-packages/gi/overrides/GLib.py", line 497 in run
  (...)
$2 = 0x0

(c) Close the file descriptor

(gdb) p (int)close(46)
$3 = 0

Comment 13 Victor Stinner 2022-04-25 07:17:43 UTC
Hi, I had a look at on the "stuck" machine and collected traceback of "stuck" different anaconda processes. I don't know how to make progress on that issue. Is there a way to reproduce the issue? Tell me if you need my help to investigate the issue. I reassign the issue to dnf.

Comment 14 Jaroslav Mracek 2022-05-04 11:39:19 UTC
I am really sorry but without solid reproducer we cannot do much. Right now it is a candidate to be closed as insufficient data.

Comment 16 Jan Stodola 2022-05-05 08:54:39 UTC
Hi Sagar, the draft looks fine, thanks.


Jaroslav (and others), I've tried to run anaconda on an installed system (RHEL-9.0-RC) and installing to an image file:

# anaconda --image /var/tmp/disk.img --kickstart /var/tmp/ks.cfg --text

There was about 3000 installations in total executed on ~15 machines, but I wasn't able to reproduce this bug with this method. Some of the installation were executed in text mode, some in vnc mode.
Please, do not close this bug, it needs to be debugged and fixed, probably in cooperation with the anaconda developers.

Comment 19 Radek Vykydal 2022-05-17 11:50:53 UTC
I was able to reproduce with RHEL9 Beta.
I was not able to reproduce with RHEL9 Alpha but should do more runs.

Comment 22 Jaroslav Mracek 2022-10-25 16:06:15 UTC
I am really sorry, without a reproducer we cannot do much. Please feel free to reopen the issue with additional information.


Note You need to log in before you can comment on or make changes to this bug.