Description of problem:

Even if the conmon process that represents a podman container is killed for some reason, the processes inside the container keep running. systemd then detects the failure of the conmon process and tries to start the failed podman container, but the start fails because of the stale processes still running on the host.

~~~
[heat-admin@controller-0 ~]$ sudo pstree -p | grep nova-conductor
|-conmon(121860)-+-dumb-init(121893)---nova-conductor(121960)-+-nova-conductor(122511)
|                |                                            |-nova-conductor(122512)
|                |                                            |-nova-conductor(122513)
|                |                                            `-nova-conductor(122518)
[heat-admin@controller-0 ~]$ sudo systemctl status tripleo_nova_conductor
● tripleo_nova_conductor.service - nova_conductor container
   Loaded: loaded (/etc/systemd/system/tripleo_nova_conductor.service; enabled; vendor preset: disabled)
   Active: active (running) since Mon 2020-06-22 15:00:04 UTC; 1 day 10h ago
 Main PID: 121860 (conmon)
    Tasks: 0 (limit: 26213)
   Memory: 2.1M
   CGroup: /system.slice/tripleo_nova_conductor.service
           ‣ 121860 /usr/bin/conmon --api-version 1 -s -c 0edc910c83a41e393db9378e01b51df3223371077e169ce8d8c590840cebcf38 -u 0edc910c83a41e393db9378e01b51df3223371077e169ce8d8c590840cebcf38 -r /usr/bin/runc -b /var/lib/containers/storag>

Warning: Journal has been rotated since unit was started. Log output is incomplete or unavailable.
[heat-admin@controller-0 ~]$ sudo pstree -p | grep nova-conductor
|-conmon(121860)-+-dumb-init(121893)---nova-conductor(121960)-+-nova-conductor(122511)
|                |                                            |-nova-conductor(122512)
|                |                                            |-nova-conductor(122513)
|                |                                            `-nova-conductor(122518)
[heat-admin@controller-0 ~]$ sudo podman ps | grep nova_conductor
0edc910c83a4 undercloud-0.ctlplane.redhat.local:8787/rh-osbs/rhosp16-openstack-nova-conductor:20200416.1 kolla_start 34 hours ago Up 34 hours ago nova_conductor
[heat-admin@controller-0 ~]$ sudo kill -KILL 121860
[heat-admin@controller-0 ~]$ sudo pstree -p | grep nova-conductor
|-dumb-init(121893)---nova-conductor(121960)-+-nova-conductor(122511)
|                                            |-nova-conductor(122512)
|                                            |-nova-conductor(122513)
|                                            `-nova-conductor(122518)
[heat-admin@controller-0 ~]$ sudo systemctl status tripleo_nova_conductor
● tripleo_nova_conductor.service - nova_conductor container
   Loaded: loaded (/etc/systemd/system/tripleo_nova_conductor.service; enabled; vendor preset: disabled)
   Active: failed (Result: protocol) since Wed 2020-06-24 01:01:26 UTC; 8s ago
  Process: 4030 ExecStart=/usr/bin/podman start nova_conductor (code=exited, status=0/SUCCESS)
 Main PID: 121860 (code=killed, signal=KILL)

Jun 24 01:01:26 controller-0 systemd[1]: tripleo_nova_conductor.service: Service RestartSec=100ms expired, scheduling restart.
Jun 24 01:01:26 controller-0 systemd[1]: tripleo_nova_conductor.service: Scheduled restart job, restart counter is at 6.
Jun 24 01:01:26 controller-0 systemd[1]: Stopped nova_conductor container.
Jun 24 01:01:26 controller-0 systemd[1]: tripleo_nova_conductor.service: Start request repeated too quickly.
Jun 24 01:01:26 controller-0 systemd[1]: tripleo_nova_conductor.service: Failed with result 'protocol'.
Jun 24 01:01:26 controller-0 systemd[1]: Failed to start nova_conductor container.
[heat-admin@controller-0 ~]$ sudo podman ps | grep nova_conductor
0edc910c83a4 undercloud-0.ctlplane.redhat.local:8787/rh-osbs/rhosp16-openstack-nova-conductor:20200416.1 kolla_start 34 hours ago Up 34 hours ago nova_conductor
~~~

To recover from this situation, we need to stop the failed container manually and then start it from systemd.

~~~
[heat-admin@controller-0 ~]$ sudo podman stop nova_conductor
Error: timed out waiting for file /var/run/libpod/exits/0edc910c83a41e393db9378e01b51df3223371077e169ce8d8c590840cebcf38: internal libpod error
[heat-admin@controller-0 ~]$ sudo podman ps | grep nova_conductor
[heat-admin@controller-0 ~]$ sudo systemctl start tripleo_nova_conductor
[heat-admin@controller-0 ~]$ sudo systemctl status tripleo_nova_conductor
● tripleo_nova_conductor.service - nova_conductor container
   Loaded: loaded (/etc/systemd/system/tripleo_nova_conductor.service; enabled; vendor preset: disabled)
   Active: active (running) since Wed 2020-06-24 01:04:27 UTC; 4s ago
  Process: 21146 ExecStart=/usr/bin/podman start nova_conductor (code=exited, status=0/SUCCESS)
 Main PID: 21169 (conmon)
    Tasks: 0 (limit: 26213)
   Memory: 2.1M
   CGroup: /system.slice/tripleo_nova_conductor.service
           ‣ 21169 /usr/bin/conmon --api-version 1 -s -c 0edc910c83a41e393db9378e01b51df3223371077e169ce8d8c590840cebcf38 -u 0edc910c83a41e393db9378e01b51df3223371077e169ce8d8c590840cebcf38 -r /usr/bin/runc -b /var/lib/containers/storage>

Jun 24 01:04:26 controller-0 systemd[1]: Starting nova_conductor container...
Jun 24 01:04:27 controller-0 podman[21146]: 2020-06-24 01:04:27.220821736 +0000 UTC m=+0.429382454 container init 0edc910c83a41e393db9378e01b51df3223371077e169ce8d8c590840cebcf38 (image=undercloud-0.ctlplane.redhat.local:8787/rh-osbs/rho>
Jun 24 01:04:27 controller-0 podman[21146]: 2020-06-24 01:04:27.24996283 +0000 UTC m=+0.458523513 container start 0edc910c83a41e393db9378e01b51df3223371077e169ce8d8c590840cebcf38 (image=undercloud-0.ctlplane.redhat.local:8787/rh-osbs/rho>
Jun 24 01:04:27 controller-0 podman[21146]: nova_conductor
Jun 24 01:04:27 controller-0 systemd[1]: Started nova_conductor container.
~~~

Version-Release number of selected component (if applicable):

How reproducible:
Always

Steps to Reproduce:
1. Send SIGKILL to the conmon process of a podman container
2. Check the status of tripleo_<service name>

Actual results:
The service enters the failed state and the podman container is not restarted

Expected results:
The service returns to the active state and the podman container is restarted

Additional info:
After fixing the problem by stopping the podman container and restarting it via systemd, the processes run under a new conmon process as expected.

~~~
[heat-admin@controller-0 ~]$ sudo podman ps | grep nova_conductor
0edc910c83a4 undercloud-0.ctlplane.redhat.local:8787/rh-osbs/rhosp16-openstack-nova-conductor:20200416.1 kolla_start 34 hours ago Up About a minute ago nova_conductor
[heat-admin@controller-0 ~]$ sudo pstree -p | grep nova-conductor
|-conmon(21169)-+-dumb-init(21181)---nova-conductor(21206)-+-nova-conductor(21491)
|               |                                          |-nova-conductor(21492)
|               |                                          |-nova-conductor(21493)
|               |                                          `-nova-conductor(21494)
~~~
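The manual recovery above can be wrapped in a small helper. This is only a sketch under stated assumptions: the function name `recover_container_service` and the `PODMAN` / `SYSTEMCTL` override variables are hypothetical (they are not part of podman or systemd) and exist so the sequence can be dry-run; run it as root, or set the overrides to `sudo podman` and `sudo systemctl`.

```shell
#!/bin/sh
# Sketch: recover a container service whose conmon process was killed.
# PODMAN and SYSTEMCTL are hypothetical override variables; they default
# to the real commands.
recover_container_service() {
    container="$1"   # container name, e.g. nova_conductor
    unit="$2"        # systemd unit, e.g. tripleo_nova_conductor

    # "podman stop" can report an error here (in this report it timed out
    # waiting for the exit file) although the container ends up stopped,
    # so its exit status is deliberately ignored.
    ${PODMAN:-podman} stop "$container" || true

    # Start through systemd so the unit tracks the new conmon process.
    ${SYSTEMCTL:-systemctl} start "$unit"
}
```

For the case in this report the call would be `recover_container_service nova_conductor tripleo_nova_conductor`.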
A fix was recently merged into podman that also sets ExecStopPost in the generated systemd unit files, so that the container processes are actually stopped even if the conmon process fails. I think we need to implement the same in tripleo-ansible, so that the generated systemd unit files include ExecStopPost.

https://github.com/containers/libpod/commit/e5c3432944245a740ed443803c654dcc9c3757f0
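To get an idea of which generated unit files would still be affected, one could scan for units that lack an ExecStopPost line. A minimal sketch, assuming the units sit in one directory and follow the `tripleo_*.service` naming seen in this report; the function name `units_missing_stop_post` is hypothetical.

```shell
#!/bin/sh
# Sketch: print every tripleo_*.service file in a directory that has no
# ExecStopPost= line.
units_missing_stop_post() {
    dir="$1"
    for f in "$dir"/tripleo_*.service; do
        # Skip the unexpanded glob pattern when nothing matches.
        [ -f "$f" ] || continue
        # Print the path only when no ExecStopPost= line is present.
        grep -q '^ExecStopPost=' "$f" || printf '%s\n' "$f"
    done
}
```

On a controller this would be invoked as `units_missing_stop_post /etc/systemd/system`.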
I tested a systemd unit file with ExecStopPost added

~~~
[heat-admin@controller-0 ~]$ sudo cat /etc/systemd/system/tripleo_nova_conductor.service
[Unit]
Description=nova_conductor container
After=paunch-container-shutdown.service
Wants=
[Service]
Restart=always
ExecStart=/usr/bin/podman start nova_conductor
ExecReload=/usr/bin/podman kill --signal HUP nova_conductor
ExecStop=/usr/bin/podman stop -t 10 nova_conductor
ExecStopPost=/usr/bin/podman stop -t 10 nova_conductor
KillMode=none
Type=forking
PIDFile=/var/run/nova_conductor.pid
[Install]
WantedBy=multi-user.target
[heat-admin@controller-0 ~]$ sudo systemctl daemon-reload
~~~

and confirmed that it didn't affect normal stop/start operation

~~~
[heat-admin@controller-0 ~]$ sudo systemctl status tripleo_nova_conductor
● tripleo_nova_conductor.service - nova_conductor container
   Loaded: loaded (/etc/systemd/system/tripleo_nova_conductor.service; enabled; vendor preset: disabled)
   Active: active (running) since Wed 2020-06-24 01:04:27 UTC; 1h 6min ago
 Main PID: 21169 (conmon)
    Tasks: 0 (limit: 26213)
   Memory: 2.4M
   CGroup: /system.slice/tripleo_nova_conductor.service
           ‣ 21169 /usr/bin/conmon --api-version 1 -s -c 0edc910c83a41e393db9378e01b51df3223371077e169ce8d8c590840cebcf38 -u 0edc910c83a41e393db9378e01b51df3223371077e169ce8d8c590840cebcf38 -r /usr/bin/runc -b /var/lib/containers/storage>

Jun 24 01:04:26 controller-0 systemd[1]: Starting nova_conductor container...
Jun 24 01:04:27 controller-0 podman[21146]: 2020-06-24 01:04:27.220821736 +0000 UTC m=+0.429382454 container init 0edc910c83a41e393db9378e01b51df3223371077e169ce8d8c590840cebcf38 (image=undercloud-0.ctlplane.redhat.local:8787/rh-osbs/rho>
Jun 24 01:04:27 controller-0 podman[21146]: 2020-06-24 01:04:27.24996283 +0000 UTC m=+0.458523513 container start 0edc910c83a41e393db9378e01b51df3223371077e169ce8d8c590840cebcf38 (image=undercloud-0.ctlplane.redhat.local:8787/rh-osbs/rho>
Jun 24 01:04:27 controller-0 podman[21146]: nova_conductor
Jun 24 01:04:27 controller-0 systemd[1]: Started nova_conductor container.
Jun 24 02:10:54 controller-0 systemd[1]: Reloading nova_conductor container.
Jun 24 02:10:54 controller-0 podman[391867]: 2020-06-24 02:10:54.445391064 +0000 UTC m=+0.083626342 container kill 0edc910c83a41e393db9378e01b51df3223371077e169ce8d8c590840cebcf38 (image=undercloud-0.ctlplane.redhat.local:8787/rh-osbs/rh>
Jun 24 02:10:54 controller-0 podman[391867]: 0edc910c83a41e393db9378e01b51df3223371077e169ce8d8c590840cebcf38
Jun 24 02:10:54 controller-0 systemd[1]: Reloaded nova_conductor container.
[heat-admin@controller-0 ~]$ sudo systemctl stop tripleo_nova_conductor
[heat-admin@controller-0 ~]$ sudo systemctl status tripleo_nova_conductor
● tripleo_nova_conductor.service - nova_conductor container
   Loaded: loaded (/etc/systemd/system/tripleo_nova_conductor.service; enabled; vendor preset: disabled)
   Active: inactive (dead) since Wed 2020-06-24 02:11:22 UTC; 2s ago
  Process: 394597 ExecStopPost=/usr/bin/podman stop -t 10 nova_conductor (code=exited, status=0/SUCCESS)
  Process: 393786 ExecStop=/usr/bin/podman stop -t 10 nova_conductor (code=exited, status=0/SUCCESS)
 Main PID: 21169 (code=exited, status=0/SUCCESS)

Jun 24 02:10:54 controller-0 systemd[1]: Reloading nova_conductor container.
Jun 24 02:10:54 controller-0 podman[391867]: 2020-06-24 02:10:54.445391064 +0000 UTC m=+0.083626342 container kill 0edc910c83a41e393db9378e01b51df3223371077e169ce8d8c590840cebcf38 (image=undercloud-0.ctlplane.redhat.local:8787/rh-osbs/rh>
Jun 24 02:10:54 controller-0 podman[391867]: 0edc910c83a41e393db9378e01b51df3223371077e169ce8d8c590840cebcf38
Jun 24 02:10:54 controller-0 systemd[1]: Reloaded nova_conductor container.
Jun 24 02:11:18 controller-0 systemd[1]: Stopping nova_conductor container...
Jun 24 02:11:22 controller-0 podman[393786]: 2020-06-24 02:11:22.059051776 +0000 UTC m=+3.421731728 container died 0edc910c83a41e393db9378e01b51df3223371077e169ce8d8c590840cebcf38 (image=undercloud-0.ctlplane.redhat.local:8787/rh-osbs/rh>
Jun 24 02:11:22 controller-0 podman[393786]: 2020-06-24 02:11:22.060230275 +0000 UTC m=+3.422910246 container stop 0edc910c83a41e393db9378e01b51df3223371077e169ce8d8c590840cebcf38 (image=undercloud-0.ctlplane.redhat.local:8787/rh-osbs/rh>
Jun 24 02:11:22 controller-0 podman[393786]: 0edc910c83a41e393db9378e01b51df3223371077e169ce8d8c590840cebcf38
Jun 24 02:11:22 controller-0 podman[394597]: 0edc910c83a41e393db9378e01b51df3223371077e169ce8d8c590840cebcf38
Jun 24 02:11:22 controller-0 systemd[1]: Stopped nova_conductor container.
[heat-admin@controller-0 ~]$ sudo systemctl start tripleo_nova_conductor
[heat-admin@controller-0 ~]$ sudo systemctl status tripleo_nova_conductor
● tripleo_nova_conductor.service - nova_conductor container
   Loaded: loaded (/etc/systemd/system/tripleo_nova_conductor.service; enabled; vendor preset: disabled)
   Active: active (running) since Wed 2020-06-24 02:11:43 UTC; 47s ago
  Process: 394597 ExecStopPost=/usr/bin/podman stop -t 10 nova_conductor (code=exited, status=0/SUCCESS)
  Process: 393786 ExecStop=/usr/bin/podman stop -t 10 nova_conductor (code=exited, status=0/SUCCESS)
  Process: 396438 ExecStart=/usr/bin/podman start nova_conductor (code=exited, status=0/SUCCESS)
 Main PID: 396524 (conmon)
    Tasks: 0 (limit: 26213)
   Memory: 1.8M
   CGroup: /system.slice/tripleo_nova_conductor.service
           ‣ 396524 /usr/bin/conmon --api-version 1 -s -c 0edc910c83a41e393db9378e01b51df3223371077e169ce8d8c590840cebcf38 -u 0edc910c83a41e393db9378e01b51df3223371077e169ce8d8c590840cebcf38 -r /usr/bin/runc -b /var/lib/containers/storag>

Jun 24 02:11:43 controller-0 systemd[1]: Starting nova_conductor container...
Jun 24 02:11:43 controller-0 podman[396438]: 2020-06-24 02:11:43.625492561 +0000 UTC m=+0.475655869 container init 0edc910c83a41e393db9378e01b51df3223371077e169ce8d8c590840cebcf38 (image=undercloud-0.ctlplane.redhat.local:8787/rh-osbs/rh>
Jun 24 02:11:43 controller-0 podman[396438]: 2020-06-24 02:11:43.641841944 +0000 UTC m=+0.492005326 container start 0edc910c83a41e393db9378e01b51df3223371077e169ce8d8c590840cebcf38 (image=undercloud-0.ctlplane.redhat.local:8787/rh-osbs/r>
Jun 24 02:11:43 controller-0 podman[396438]: nova_conductor
Jun 24 02:11:43 controller-0 systemd[1]: Started nova_conductor container.
~~~

and now systemd can restart the container whose conmon process was killed.
~~~
[heat-admin@controller-0 ~]$ sudo podman ps | grep nova_conductor
0edc910c83a4 undercloud-0.ctlplane.redhat.local:8787/rh-osbs/rhosp16-openstack-nova-conductor:20200416.1 kolla_start 35 hours ago Up About a minute ago nova_conductor
[heat-admin@controller-0 ~]$ sudo pstree -p | grep nova-conductor
|-conmon(396524)-+-dumb-init(396541)---nova-conductor(396589)-+-nova-conductor(397076)
|                |                                            |-nova-conductor(397077)
|                |                                            |-nova-conductor(397078)
|                |                                            `-nova-conductor(397079)
[heat-admin@controller-0 ~]$ sudo kill -KILL 396524
[heat-admin@controller-0 ~]$ sudo pstree -p | grep nova-conductor
|-dumb-init(396541)---nova-conductor(396589)---nova-conductor(397076)
[heat-admin@controller-0 ~]$ sudo pstree -p | grep nova-conductor
[heat-admin@controller-0 ~]$ sudo pstree -p | grep nova-conductor
|-conmon(406733)-+-dumb-init(406746)---nova-conductor(406765)
[heat-admin@controller-0 ~]$ sudo systemctl status tripleo_nova_conductor
● tripleo_nova_conductor.service - nova_conductor container
   Loaded: loaded (/etc/systemd/system/tripleo_nova_conductor.service; enabled; vendor preset: disabled)
   Active: active (running) since Wed 2020-06-24 02:13:29 UTC; 11s ago
  Process: 405610 ExecStopPost=/usr/bin/podman stop -t 10 nova_conductor (code=exited, status=125)
  Process: 393786 ExecStop=/usr/bin/podman stop -t 10 nova_conductor (code=exited, status=0/SUCCESS)
  Process: 406710 ExecStart=/usr/bin/podman start nova_conductor (code=exited, status=0/SUCCESS)
 Main PID: 406733 (conmon)
    Tasks: 0 (limit: 26213)
   Memory: 1.9M
   CGroup: /system.slice/tripleo_nova_conductor.service
           ‣ 406733 /usr/bin/conmon --api-version 1 -s -c 0edc910c83a41e393db9378e01b51df3223371077e169ce8d8c590840cebcf38 -u 0edc910c83a41e393db9378e01b51df3223371077e169ce8d8c590840cebcf38 -r /usr/bin/runc -b /var/lib/containers/storag>

Jun 24 02:13:28 controller-0 systemd[1]: Starting nova_conductor container...
Jun 24 02:13:29 controller-0 podman[406710]: 2020-06-24 02:13:29.338664709 +0000 UTC m=+0.428971832 container init 0edc910c83a41e393db9378e01b51df3223371077e169ce8d8c590840cebcf38 (image=undercloud-0.ctlplane.redhat.local:8787/rh-osbs/rh>
Jun 24 02:13:29 controller-0 podman[406710]: 2020-06-24 02:13:29.355447918 +0000 UTC m=+0.445755041 container start 0edc910c83a41e393db9378e01b51df3223371077e169ce8d8c590840cebcf38 (image=undercloud-0.ctlplane.redhat.local:8787/rh-osbs/r>
Jun 24 02:13:29 controller-0 podman[406710]: nova_conductor
Jun 24 02:13:29 controller-0 systemd[1]: Started nova_conductor container.
[heat-admin@controller-0 ~]$ sudo podman ps | grep nova_conductor
0edc910c83a4 undercloud-0.ctlplane.redhat.local:8787/rh-osbs/rhosp16-openstack-nova-conductor:20200416.1 kolla_start 35 hours ago Up 2 minutes ago nova_conductor
[heat-admin@controller-0 ~]$ sudo pstree -p | grep nova-conductor
|-conmon(406733)-+-dumb-init(406746)---nova-conductor(406765)-+-nova-conductor(407012)
|                |                                            |-nova-conductor(407013)
|                |                                            |-nova-conductor(407014)
|                |                                            `-nova-conductor(407015)
[heat-admin@controller-0 ~]$
~~~
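The check done by hand above (kill conmon, then watch for the unit to come back) could be scripted by polling `systemctl is-active`. A rough sketch only: the function name `wait_for_active`, the default 30-second timeout, and the `SYSTEMCTL` override variable are assumptions made for illustration, not something taken from the fix.

```shell
#!/bin/sh
# Sketch: wait until a unit reports "active", polling once per second.
# SYSTEMCTL is a hypothetical override variable; it defaults to systemctl.
wait_for_active() {
    unit="$1"
    timeout="${2:-30}"   # seconds to wait; 30 is an arbitrary default
    elapsed=0
    while [ "$elapsed" -lt "$timeout" ]; do
        # "systemctl is-active --quiet" exits 0 only when the unit is active.
        if ${SYSTEMCTL:-systemctl} is-active --quiet "$unit"; then
            return 0
        fi
        sleep 1
        elapsed=$((elapsed + 1))
    done
    return 1
}
```

After `sudo kill -KILL <conmon PID>`, running `wait_for_active tripleo_nova_conductor 60` as root would return 0 once systemd has restarted the container. Because the unit can still report active for a moment right after the kill, a short pause before polling may be needed.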
Moving to the "paunch" component. I'm not sure whether tripleo-ansible will need a patch; I think we moved to podman-managed systemd units in newer versions. Paunch is used at least in 16.0 and 16.1.
Thank you for the information, Cédric. I'll submit a patch to paunch as well.
Since the problem described in this bug report should be resolved in a recent advisory, it has been closed with a resolution of ERRATA. For information on the advisory (Red Hat OpenStack Platform 16.1 bug fix and enhancement advisory), and where to find the updated files, follow the link below. If the solution does not work for you, open a new bug report. https://access.redhat.com/errata/RHEA-2020:4284