Bug 2073510

Summary:

Installation sometimes hangs before Verifying: stage of dnf transaction

Product:

Red Hat Enterprise Linux 9

Reporter:

Jan Stodola <jstodola>

Component:

dnf

Assignee:

Jaroslav Mracek <jmracek>

Status:

CLOSED INSUFFICIENT_DATA

QA Contact:

swm-qe

Severity:

unspecified

Docs Contact:

Sagar Dubewar <sdubewar>

Priority:

medium

Version:

9.0

CC:

bstinson, gfialova, james.antill, jikortus, jwboyer, pasik, rvykydal, torsava

Target Milestone:

Keywords:

Triaged

Target Release:

---

Flags:

pm-rhel: mirror+

Hardware:

Unspecified

OS:

Unspecified

Whiteboard:

Fixed In Version:

Doc Type:

Known Issue

Doc Text:

.The Installation process sometimes becomes unresponsive When you install RHEL, the installation process sometimes becomes unresponsive. The `/tmp/packaging.log` file displays the following message at the end: ---- 10:20:56,416 DDEBUG dnf: RPM transaction over. ---- To work around this problem, restart the installation process.

Story Points:

---

Clone Of:

Environment:

Last Closed:

2022-10-25 16:06:15 UTC

Type:

Bug

Regression:

---

Mount Type:

---

Documentation:

---

CRM:

Verified Versions:

Category:

---

oVirt Team:

---

RHEL 7.3 requirements from Atomic Host:

Cloudforms Team:

---

Target Upstream Version:

Embargoed:

Bug Depends On:

Bug Blocks:

1942219, 2027125, 2091421

Attachments:

Description	Flags
tracebacks.txt	none

Description Jan Stodola 2022-04-08 16:09:49 UTC

Description of problem:
Anaconda sometimes doesn't complete the package installation and verification phase and stops. The last message in packaging.log is:

...
10:20:55,019 INF packaging: Configuring (running scriptlet for): rootfiles-8.1-31.el9.noarch 1628613242 94e7ec73e7998cb2f94f5306fded9be2bc175e5a4c912fd3744f2ca18f2afce1
10:20:56,163 INF packaging: Configuring (running scriptlet for): rootfiles-8.1-31.el9.noarch 1628613242 94e7ec73e7998cb2f94f5306fded9be2bc175e5a4c912fd3744f2ca18f2afce1
10:20:56,212 INF dnf.rpm: Running in chroot, ignoring request.

10:20:56,283 INF packaging: Configuring (running scriptlet for): rootfiles-8.1-31.el9.noarch 1628613242 94e7ec73e7998cb2f94f5306fded9be2bc175e5a4c912fd3744f2ca18f2afce1
10:20:56,416 DDEBUG dnf: RPM transaction over.



What is expected at this point is a package verification, like in this case:
...
14:24:00,364 INF packaging: Configuring (running scriptlet for): rootfiles-8.1-31.el9.noarch 1628613242 94e7ec73e7998cb2f94f5306fded9be2bc175e5a4c912fd3744f2ca18f2afce1
14:24:01,594 INF packaging: Configuring (running scriptlet for): rootfiles-8.1-31.el9.noarch 1628613242 94e7ec73e7998cb2f94f5306fded9be2bc175e5a4c912fd3744f2ca18f2afce1
14:24:01,640 INF dnf.rpm: Running in chroot, ignoring request.

14:24:01,710 INF packaging: Configuring (running scriptlet for): rootfiles-8.1-31.el9.noarch 1628613242 94e7ec73e7998cb2f94f5306fded9be2bc175e5a4c912fd3744f2ca18f2afce1
14:24:01,808 DDEBUG dnf: RPM transaction over.
14:24:02,075 INF packaging: Verifying: NetworkManager-1:1.36.0-4.el9_0.aarch64 1648805902 d45b2f054bc32f77671b29dea0d3aa8307b31c7ef891126ef2d87ae4fad07f4e
14:24:02,080 INF packaging: Verifying: NetworkManager-libnm-1:1.36.0-4.el9_0.aarch64 1648805902 7d552bca978bb4a7a6db22b19eed772751052be92981b0958d198839d8ea6a8e
14:24:02,081 INF packaging: Verifying: NetworkManager-team-1:1.36.0-4.el9_0.aarch64 1648805902 881de45f4299f8856fc97d0c2b05d025dfbdc6e37d000d83a6ab94acf964edc9
...

Version-Release number of selected component (if applicable):
RHEL-9.0.0-20220406.1 or RHEL-9.1.0-20220408.0
dnf-4.10.0-5.el9_0.noarch
anaconda-34.25.0.29-1.el9_0

How reproducible:
Sometimes. Maybe 1 out of 200 installations.
Seen on s390x, aarch64 and x86_64.

Steps to Reproduce:
1. Unclear, run a RHEL-9 installation

Actual results:
Installation hangs during the package installation process.

Expected results:
Successful installation.

Additional info:
This problem looks the same as a Fedora bug 2054230.

Comment 4 Jaroslav Mracek 2022-04-12 09:57:47 UTC

Thank you very much for your report. According to investigation in BZ2054230 it stops on lock in `names = set([i.name for i in q])` (base.py line 1159). When the line is replaced by `
```
names = set()
for pkg in q:
    name = pkg.name
    names.add(name)
```

it stops on lock in `for p in g.getPackages()` (base.py line 1162).

There is also one unique part of Anaconda. It uses `multiprocessing` to run dnf (anaconda/pyanaconda/modules/payloads/payload/dnf/dnf_manager.py)

```
process = multiprocessing.Process(
            target=self._run_transaction,
            args=(self._base, display)
        )

        # Start the transaction.
        log.debug("Starting the transaction process...")
        process.start()
```
The problem is that the issue is hard to reproduce and I also believe that the issue cannot be related to any recent change in DNF therefore I would like to ask the Python team to investigate the issue or provide a suggestion that will help us with fixing it.

Comment 5 Tomas Orsava 2022-04-20 12:27:11 UTC

Hi Jan,
could you provide a core dump from the affected installation? Perhaps we could identify the issue there.

Even better for us would be to get access to a machine that has run into the issue. However, we don't have the means to realistically run 200 RHEL 9 installations waiting for the bug. Would you be able to get us access to such a machine?

Comment 9 Victor Stinner 2022-04-21 16:33:58 UTC

These machines have no debug symbols for python3.9, dnf is not working, and even if I copy python-gdb.py manually, gdb fails to rebuild the Python traceback. Oh.

I used a different approach: call _Py_DumpTracebackThreads().

Step 1: get the gdb stdout

$ ps ax|grep gdb
 513638 pts/2    Sl+    0:00 gdb -p 513545
$ readlink /proc/513638/fd/1
/dev/pts/2

=> gdb stdout is /dev/pts/2

Step 2: open the Python process in gdb and type

(a) Open the gdb stdout:

(gdb) p (int)open("/dev/pts/2", 1)
$1 = 56

(b) Call _Py_DumpTracebackThreads() with this file descriptor

(gdb) p (const char*)_Py_DumpTracebackThreads(46, 0, 0)
Current thread 0x00007f58da2c1740 (most recent call first):
  File "/usr/lib/python3.9/site-packages/gi/overrides/GLib.py", line 497 in run
  (...)
$2 = 0x0

(c) Close the file descriptor

(gdb) p (int)close(46)
$3 = 0

Comment 10 Victor Stinner 2022-04-21 16:41:46 UTC

I don't know dnf at all so I tried to pick the most interesting python3.9 processes used by anaconda.

I looked at the hierarchy of these 6 processes:

(1) anaconda,2077 (4 threads): multiprocessing: pyanaconda/payload/dnf/payload.py:914: install()

* CHILD: (2) {anaconda},2101 (1 thread): poll() <= main() of gi/overrides/Gtk.py
* CHILD: (3) {anaconda},2102 (1 thread): poll() <= main() of gi/overrides/Gtk.py
* CHILD: (4) {anaconda},2617 (1 thread): multiprocessing _recv() <= install() of payload/dnf/payload.py
* CHILD: (5) anaconda,3065 (2 threads):

  * Thread 1: g_mutex_lock_slowpath() <= g_signal_handlers_destroy()
  * Thread 2: lock_PyThread_acquire_lock() <= do_transaction() of pyanaconda/payload/dnf/utils.py
  * CHILD: (6) {anaconda},3069 (1 thread): multiprocessing queue _feed() -> lock_PyThread_acquire_lock()

So 3 processes are doing multiprocessing things in Python:

* 2077: multiprocessing: pyanaconda/payload/dnf/payload.py:914: install()
* 2617: multiprocessing _recv() <= install() of payload/dnf/payload.py
* 3069: multiprocessing queue _feed() -> lock_PyThread_acquire_lock()

It seems like the logical chain of processes is: 2077 => 2617 => 3069.

So the process 3069 is the end of the chain and it's stuck in queue _feed(): it calls nwait() at multiprocessing/queues.py:233 which is self._notempty.wait().

I don't understand how this queue is supposed to be filled, since the process has a single thread which is waiting until this queue is not empty.


The strange part is the _Py_DumpTracebackThreads() still shows Python threads which no longer exists: gdb lists 1 thread, _Py_DumpTracebackThreads() shows 2 threads. I guess that they are zombie from the forked parent process, and Python fails to clear these zombies threads.


Attached tracebacks.txt contains the traceback of these processes.

Comment 11 Victor Stinner 2022-04-21 16:42:36 UTC

Created attachment 1874196 [details]
tracebacks.txt

Comment 12 Victor Stinner 2022-04-21 16:44:13 UTC

Oops, copy/paste issue. My steps to debug are not correct. You must copy/paste the file descriptor created by open(). For example, here it's 46, so you must pass 46 to _Py_DumpTracebackThreads() and close().


(a) Open the gdb stdout:

(gdb) p (int)open("/dev/pts/2", 1)
$1 = 46

(b) Call _Py_DumpTracebackThreads() with this file descriptor

(gdb) p (const char*)_Py_DumpTracebackThreads(46, 0, 0)
Current thread 0x00007f58da2c1740 (most recent call first):
  File "/usr/lib/python3.9/site-packages/gi/overrides/GLib.py", line 497 in run
  (...)
$2 = 0x0

(c) Close the file descriptor

(gdb) p (int)close(46)
$3 = 0

Comment 13 Victor Stinner 2022-04-25 07:17:43 UTC

Hi, I had a look at on the "stuck" machine and collected traceback of "stuck" different anaconda processes. I don't know how to make progress on that issue. Is there a way to reproduce the issue? Tell me if you need my help to investigate the issue. I reassign the issue to dnf.

Comment 14 Jaroslav Mracek 2022-05-04 11:39:19 UTC

I am really sorry but without solid reproducer we cannot do much. Right now it is a candidate to be closed as insufficient data.

Comment 16 Jan Stodola 2022-05-05 08:54:39 UTC

Hi Sagar, the draft looks fine, thanks.


Jaroslav (and others), I've tried to run anaconda on an installed system (RHEL-9.0-RC) and installing to an image file:

# anaconda --image /var/tmp/disk.img --kickstart /var/tmp/ks.cfg --text

There was about 3000 installations in total executed on ~15 machines, but I wasn't able to reproduce this bug with this method. Some of the installation were executed in text mode, some in vnc mode.
Please, do not close this bug, it needs to be debugged and fixed, probably in cooperation with the anaconda developers.

Comment 19 Radek Vykydal 2022-05-17 11:50:53 UTC

I was able to reproduce with RHEL9 Beta.
I was not able to reproduce with RHEL9 Alpha but should do more runs.

Comment 22 Jaroslav Mracek 2022-10-25 16:06:15 UTC

I am really sorry, without a reproducer we cannot do much. Please feel free to reopen the issue with additional information.