Note: This bug is displayed in read-only format because the product is no longer active in Red Hat Bugzilla.

Bug 2168530

Summary: [OSP16.2] nova libvirt container never recover when dumb-init dies.
Product: Red Hat OpenStack
Reporter: Ricardo Ramos Thomas <riramos>
Component: openstack-tripleo-common
Assignee: Nobody <nobody>
Status: CLOSED ERRATA
QA Contact: David Rosenfeld <drosenfe>
Severity: medium
Docs Contact:
Priority: medium
Version: 16.2 (Train)
CC: alifshit, bdobreli, dasmith, drosenfe, eglynn, jhakimra, kchamart, ksambor, mburns, sbauza, sgordon, shtiwari, slinaber, vromanso
Target Milestone: z6
Keywords: Triaged
Target Release: 16.2 (Train on RHEL 8.4)
Hardware: Unspecified
OS: Linux
Whiteboard:
Fixed In Version: openstack-tripleo-common-11.7.1-2.20230711015219.6371211.el8ost
Doc Type: If docs needed, set a value
Doc Text:
Story Points: ---
Clone Of:
Environment:
Last Closed: 2023-11-08 19:18:31 UTC
Type: Bug
Regression: ---
Mount Type: ---
Documentation: ---
CRM:
Verified Versions:
Category: ---
oVirt Team: ---
RHEL 7.3 requirements from Atomic Host:
Cloudforms Team: ---
Target Upstream Version:
Embargoed:

Description Ricardo Ramos Thomas 2023-02-09 10:29:07 UTC
Description of problem:

In this excerpt of ps fauxww on ltctrl97001, we observe a dumb-init process, PID 885635, that is the parent of libvirtd, PID 885659.


root      885635  0.0  0.0   4240   884 ?        Ss   13:16   0:00  \_ dumb-init --single-child -- kolla_start
root      885659  0.4  0.0 1901808 52032 ?       Sl   13:16   0:07      \_ /usr/sbin/libvirtd

If you search the attached cgls.txt.gz for these PIDs, you will find that dumb-init runs as


└─machine.slice
  ├─libpod-d8247017736e7533441aad6729a01764213cabb84a1625c9e35dd33b68e01c5c.scope
  │ ├─12596 dumb-init --single-child -- kolla_start
  │ └─12730 /usr/sbin/crond -s -n
  ├─libpod-907173ef3272b38ef6cba67406c04fd0190ef7d011becd40e7ef795e5ecd0b24.scope
  │ └─885635 dumb-init --single-child -- kolla_start
  ├─libpod-685b98531e562daa308b8095ad810b70c2036c9ed03c16e95d723eba033e5c58.scope
and libvirtd runs as


├─system.slice
│ ├─rngd.service
│ │ └─7833 /usr/sbin/rngd -f --fill-watermark=0 -x pkcs11 -x nist
[ ... ]
│ ├─certmonger.service
│ │ └─10308 /usr/sbin/certmonger -S -p /run/certmonger.pid -n -d2
│ ├─run-r8ad7517aa0ee417e9c80e6a8c2cee7f2.scope
│ │ └─885659 /usr/sbin/libvirtd
│ ├─snmpd.service
│ │ └─11011 /usr/sbin/snmpd -LS0-5d -f


This means that libvirtd is not running in the same control group as dumb-init.
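The same comparison can be made per process from /proc/<pid>/cgroup rather than from the full cgls listing. The following is a minimal sketch, not part of the original report: the function names are hypothetical, the sample strings are constructed from the scope names shown above, and the name=systemd hierarchy is an assumption that fits a cgroup v1 host (on a cgroup v2 host the controller field is empty, so pass controller="").

```python
from pathlib import Path

# Sample contents of /proc/<pid>/cgroup, built from the scope names in the
# listing above. The "1:name=systemd:" prefix is an assumption (cgroup v1).
DUMB_INIT_CGROUP = (
    "1:name=systemd:/machine.slice/"
    "libpod-907173ef3272b38ef6cba67406c04fd0190ef7d011becd40e7ef795e5ecd0b24.scope\n"
)
LIBVIRTD_CGROUP = (
    "1:name=systemd:/system.slice/run-r8ad7517aa0ee417e9c80e6a8c2cee7f2.scope\n"
)


def parse_cgroup(text):
    """Parse /proc/<pid>/cgroup contents into {controller-list: cgroup-path}."""
    entries = {}
    for line in text.splitlines():
        if not line.strip():
            continue
        # Each line has the form "hierarchy-ID:controller-list:cgroup-path".
        _hier_id, controllers, path = line.split(":", 2)
        entries[controllers] = path
    return entries


def same_cgroup(text_a, text_b, controller="name=systemd"):
    """Return True if both processes share a cgroup path for `controller`."""
    a = parse_cgroup(text_a).get(controller)
    b = parse_cgroup(text_b).get(controller)
    return a is not None and a == b


def read_cgroup(pid):
    """Read /proc/<pid>/cgroup for a live process (Linux only)."""
    return Path(f"/proc/{pid}/cgroup").read_text()


if __name__ == "__main__":
    # With the sample data, dumb-init and libvirtd are in different cgroups.
    print(same_cgroup(DUMB_INIT_CGROUP, LIBVIRTD_CGROUP))
```

On a live host, same_cgroup(read_cgroup(885635), read_cgroup(885659)) would perform the check for the two PIDs from the ps excerpt.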


When we manually kill dumb-init (kill -09 $pid) and check ps -fauxww, we see that libvirtd has moved upwards and now hangs directly below init / systemd.
(The PIDs in the sample below are different because I have lost the original paste.)


~~~
root      805617  0.0  0.0   4240   880 ?        Ss    2022   0:00  \_ dumb-init --single-child -- kolla_start
root      805629  0.1  0.0 149692 12344 ?        S     2022 132:04      \_ /usr/sbin/virtlogd --config /etc/libvirt/virtlogd.conf
root      806944  0.7  0.0 4380304 98480 ?       Sl    2022 833:29 /usr/sbin/libvirtd
~~~
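A process that "hangs directly below init" has a parent PID of 1, which can be checked without reading the ps tree. The sketch below is illustrative only: the helper names and the sample stat lines are hypothetical, and it assumes that orphaned processes are adopted by PID 1 rather than by another subreaper.

```python
from pathlib import Path


def parse_ppid(stat_text):
    """Extract the parent PID from the contents of /proc/<pid>/stat.

    The second field (comm) is wrapped in parentheses and may itself contain
    spaces or ')' characters, so split after the last ')' instead of on
    whitespace from the start of the line.
    """
    rest = stat_text[stat_text.rindex(")") + 1:].split()
    # rest[0] is the process state, rest[1] is the parent PID.
    return int(rest[1])


def is_reparented_to_init(stat_text):
    """Return True if the process's parent is PID 1 (init / systemd)."""
    return parse_ppid(stat_text) == 1


def read_stat(pid):
    """Read /proc/<pid>/stat for a live process (Linux only)."""
    return Path(f"/proc/{pid}/stat").read_text()


if __name__ == "__main__":
    # Hypothetical, truncated stat lines modelled on the PIDs in this report.
    before = "885659 (libvirtd) S 885635 885659 885659 0 -1"
    after = "806944 (libvirtd) S 1 806944 806944 0 -1"
    print(is_reparented_to_init(before), is_reparented_to_init(after))
```

Calling is_reparented_to_init(read_stat(pid)) on the libvirtd PID before and after killing dumb-init would show the change in parent.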


The systemd service file sets KillMode to 'control-group', which means systemd will kill everything in the service's control group.
Because libvirtd runs in a different control group, it is unharmed.
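The effect can be illustrated with a toy model of "kill everything in the control group". This is only a sketch and does not reproduce systemd's implementation: the function name is hypothetical, and it assumes for illustration that the control group being cleaned up is the libpod scope that contains dumb-init.

```python
# PID -> cgroup path, taken from the cgls excerpts in this report.
PID_TO_CGROUP = {
    885635: "/machine.slice/libpod-907173ef3272b38ef6cba67406c04fd0190ef7d011becd40e7ef795e5ecd0b24.scope",
    885659: "/system.slice/run-r8ad7517aa0ee417e9c80e6a8c2cee7f2.scope",
}


def pids_in_cgroup(cgroup, pid_to_cgroup):
    """Return the PIDs located in `cgroup` or in any cgroup nested below it."""
    prefix = cgroup.rstrip("/") + "/"
    return sorted(
        pid
        for pid, path in pid_to_cgroup.items()
        if path == cgroup or path.startswith(prefix)
    )


if __name__ == "__main__":
    # Assumption: the cgroup that gets cleaned up is dumb-init's libpod scope.
    target = PID_TO_CGROUP[885635]
    killed = pids_in_cgroup(target, PID_TO_CGROUP)
    survivors = sorted(set(PID_TO_CGROUP) - set(killed))
    print("killed:", killed)        # dumb-init only
    print("survivors:", survivors)  # libvirtd is outside the cgroup
```

Under that assumption only dumb-init (885635) is selected, while libvirtd (885659) sits in a run-*.scope under system.slice and is never signalled.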


What we have reproduced here matches exactly what we see in production, so we suspect that dumb-init is crashing and systemd is not cleaning up properly before it restarts the service.


Version-Release number of selected component (if applicable):

RHOSP 16.2.3 (Train)

How reproducible:


Steps to Reproduce:
1. Manually kill dumb-init.

Actual results:

The nova libvirt container never recovers when dumb-init dies.

Expected results:

The nova libvirt container recovers after dumb-init dies.


Additional info:

SOS reports are available.

Comment 12 Bogdan Dobrelya 2023-03-23 16:16:50 UTC
Instead of proposing a prestop exec command as a permanent solution, we decided to remove the dumb-init entrypoint from the nova libvirt container image.
Since we already run the libvirt container in the host PID namespace, it needs no artificial init, so the parent process becomes the
podman/conmon shim (the container main process). Without the dumb-init proxy process in the middle, there should no longer be any possibility for libvirtd to be reparented.

Comment 16 Artom Lifshitz 2023-06-01 16:00:58 UTC
The upstream Wallaby change is merged; a backport to 16.2 is to be proposed.

Comment 25 errata-xmlrpc 2023-11-08 19:18:31 UTC
Since the problem described in this bug report should be
resolved in a recent advisory, it has been closed with a
resolution of ERRATA.

For information on the advisory (Red Hat OpenStack Platform 16.2.6 (Train) bug fix and enhancement advisory), and where to find the updated
files, follow the link below.

If the solution does not work for you, open a new bug report.

https://access.redhat.com/errata/RHBA-2023:6307