Description Ricardo Ramos Thomas
2023-02-09 10:29:07 UTC
Description of problem:
Excerpt of ps fauxww on ltctrl97001: we observe a dumb-init process, pid 885635, that is the parent of libvirtd, pid 885659.
root 885635 0.0 0.0 4240 884 ? Ss 13:16 0:00 \_ dumb-init --single-child -- kolla_start
root 885659 0.4 0.0 1901808 52032 ? Sl 13:16 0:07 \_ /usr/sbin/libvirtd
cgls.txt.gz is attached. If you search it for these pids, you will find that dumb-init runs as
└─machine.slice
├─libpod-d8247017736e7533441aad6729a01764213cabb84a1625c9e35dd33b68e01c5c.scope
│ ├─12596 dumb-init --single-child -- kolla_start
│ └─12730 /usr/sbin/crond -s -n
├─libpod-907173ef3272b38ef6cba67406c04fd0190ef7d011becd40e7ef795e5ecd0b24.scope
│ └─885635 dumb-init --single-child -- kolla_start
├─libpod-685b98531e562daa308b8095ad810b70c2036c9ed03c16e95d723eba033e5c58.scope
and libvirtd as
├─system.slice
│ ├─rngd.service
│ │ └─7833 /usr/sbin/rngd -f --fill-watermark=0 -x pkcs11 -x nist
[ ... ]
│ ├─certmonger.service
│ │ └─10308 /usr/sbin/certmonger -S -p /run/certmonger.pid -n -d2
│ ├─run-r8ad7517aa0ee417e9c80e6a8c2cee7f2.scope
│ │ └─885659 /usr/sbin/libvirtd
│ ├─snmpd.service
│ │ └─11011 /usr/sbin/snmpd -LS0-5d -f
which means that libvirtd is not running in the same control group as dumb-init.
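The same comparison can be made without the full cgls dump by reading /proc/<pid>/cgroup for both processes. The following is a minimal sketch, not part of the original report: the function names are hypothetical, and it assumes the systemd hierarchy appears either as a "name=systemd" line (cgroup v1) or as the "0::" unified line (cgroup v2).

```python
from pathlib import Path
from typing import Optional


def systemd_cgroup(cgroup_text: str) -> Optional[str]:
    """Return the systemd cgroup path found in /proc/<pid>/cgroup content."""
    unified = None
    for line in cgroup_text.splitlines():
        # Each line has the form "hierarchy-ID:controller-list:cgroup-path".
        parts = line.strip().split(":", 2)
        if len(parts) != 3:
            continue
        _hierarchy_id, controllers, path = parts
        if controllers == "name=systemd":
            return path          # cgroup v1 named hierarchy used by systemd
        if controllers == "":
            unified = path       # cgroup v2 unified hierarchy ("0::/...")
    return unified


def same_cgroup(text_a: str, text_b: str) -> bool:
    """True only if both processes report the same systemd cgroup path."""
    a, b = systemd_cgroup(text_a), systemd_cgroup(text_b)
    return a is not None and a == b


def read_cgroup(pid: int) -> str:
    """Read the raw cgroup membership of a live process (Linux only)."""
    return Path(f"/proc/{pid}/cgroup").read_text()
```

On the affected host, `same_cgroup(read_cgroup(885635), read_cgroup(885659))` would be expected to return False if the cgls output above is accurate, since dumb-init sits in a libpod scope under machine.slice and libvirtd in a run-r scope under system.slice.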
When we manually kill dumb-init (kill -9 $pid) and check ps fauxww, we see that libvirtd has moved up and now hangs directly below init / systemd.
(The PIDs differ in the sample below because I have lost the original paste.)
~~~
root 805617 0.0 0.0 4240 880 ? Ss 2022 0:00 \_ dumb-init --single-child -- kolla_start
root 805629 0.1 0.0 149692 12344 ? S 2022 132:04 \_ /usr/sbin/virtlogd --config /etc/libvirt/virtlogd.conf
root 806944 0.7 0.0 4380304 98480 ? Sl 2022 833:29 /usr/sbin/libvirtd
~~~
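The systemd service file sets KillMode to 'control-group', which means systemd kills every process remaining in the unit's control group when it stops the unit.
libvirtd is unharmed, which is consistent with it running in a different control group than dumb-init.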
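Whether libvirtd has been reparented after dumb-init dies can also be checked from the PPid field of /proc/<pid>/status. The following is a minimal sketch with hypothetical function names. It assumes the orphan ends up under PID 1, as in the ps output above; on systems with a child subreaper the new parent could be a different process.

```python
from pathlib import Path


def parent_pid(status_text: str) -> int:
    """Extract the parent PID from /proc/<pid>/status content."""
    for line in status_text.splitlines():
        if line.startswith("PPid:"):
            return int(line.split()[1])
    raise ValueError("no PPid field found")


def is_reparented_to_init(status_text: str) -> bool:
    """True if the process now hangs directly below PID 1 (init / systemd)."""
    return parent_pid(status_text) == 1


def read_status(pid: int) -> str:
    """Read the raw status of a live process (Linux only)."""
    return Path(f"/proc/{pid}/status").read_text()
```

For example, `is_reparented_to_init(read_status(pid_of_libvirtd))` could be run before and after killing dumb-init, where `pid_of_libvirtd` is a placeholder for the actual pid.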
What we have reproduced here matches exactly what we see in production, so we suspect that dumb-init is crashing and systemd is not cleaning up properly before it restarts the service.
Version-Release number of selected component (if applicable):
RHOSP 16.2.3 (Train)
How reproducible:
Steps to Reproduce:
1. manually kill dumb-init
Actual results:
The nova libvirt container never recovers when dumb-init dies.
Expected results:
The nova libvirt container recovers after dumb-init dies.
Additional info:
SOS reports are available.
Instead of proposing a prestop exec command as a permanent solution, we decided to remove the dumb-init entrypoint from the nova libvirt container image.
Since the libvirt container already runs in the host PID namespace, it needs no artificial init, so the parent process becomes
the podman/conmon shim (the container main process). Without the dumb-init proxy process in the middle, libvirtd should no longer be able to reparent.
Since the problem described in this bug report should be
resolved in a recent advisory, it has been closed with a
resolution of ERRATA.
For information on the advisory (Red Hat OpenStack Platform 16.2.6 (Train) bug fix and enhancement advisory), and where to find the updated
files, follow the link below.
If the solution does not work for you, open a new bug report.
https://access.redhat.com/errata/RHBA-2023:6307