Bug 1901754
| Summary: | paunch deletes systemd service file and does not regenerate it when it fails to stop/remove containers | | |
|---|---|---|---|
| Product: | Red Hat OpenStack | Reporter: | Takashi Kajinami <tkajinam> |
| Component: | python-paunch | Assignee: | Alex Schultz <aschultz> |
| Status: | CLOSED ERRATA | QA Contact: | David Rosenfeld <drosenfe> |
| Severity: | urgent | Docs Contact: | |
| Priority: | high | | |
| Version: | 16.1 (Train) | CC: | aschultz, cjeanner, jhajyahy, kecarter, knoha |
| Target Milestone: | z6 | Keywords: | Triaged |
| Target Release: | 16.1 (Train on RHEL 8.2) | | |
| Hardware: | Unspecified | | |
| OS: | Unspecified | | |
| Whiteboard: | | | |
| Fixed In Version: | python-paunch-5.3.3-1.20210310104358.ed2c015.el8ost | Doc Type: | If docs needed, set a value |
| Doc Text: | | Story Points: | --- |
| Clone Of: | | Environment: | |
| Last Closed: | 2021-05-26 13:49:37 UTC | Type: | Bug |
| Regression: | --- | Mount Type: | --- |
| Documentation: | --- | CRM: | |
| Verified Versions: | | Category: | --- |
| oVirt Team: | --- | RHEL 7.3 requirements from Atomic Host: | |
| Cloudforms Team: | --- | Target Upstream Version: | |
| Embargoed: | | | |
*** Bug 1901753 has been marked as a duplicate of this bug. ***
I confirmed that the containers paunch failed to stop (nova_libvirt on compute-4 and nova_virtlogd on compute-5)
have a TRIPLEO_CONFIG_HASH different from the one in the container-startup-config JSON.
This shows that paunch didn't run the steps to create containers with the new config hash...
compute-4
$ sudo podman inspect nova_libvirt | jq .[0].Config.Env
[
...
"TRIPLEO_CONFIG_HASH=625b1a9cec416ffb7510a575bb82a938",
...
]
/var/lib/tripleo-config/container-startup-config/step_3/hashed-nova_libvirt.json
~~~
{
"cpuset_cpus": "all",
"depends_on": [
"tripleo_nova_virtlogd.service"
],
"environment": {
"KOLLA_CONFIG_STRATEGY": "COPY_ALWAYS",
"TRIPLEO_CONFIG_HASH": "a7a350bde34fee59724c7cf52366877d"
},
...
~~~
$ sudo podman inspect nova_virtlogd | jq .[0].Config.Env
[
...
"TRIPLEO_CONFIG_HASH=a7a350bde34fee59724c7cf52366877d",
...
]
/var/lib/tripleo-config/container-startup-config/step_3/hashed-nova_virtlogd.json
~~~
{
"environment": {
"KOLLA_CONFIG_STRATEGY": "COPY_ALWAYS",
"TRIPLEO_CONFIG_HASH": "a7a350bde34fee59724c7cf52366877d"
},
...
~~~
compute-5
$ sudo podman inspect nova_libvirt | jq .[0].Config.Env
[
...
"TRIPLEO_CONFIG_HASH=c2021708da4df13f562f4ae13a4c9fc4",
...
]
/var/lib/tripleo-config/container-startup-config/step_3/hashed-nova_libvirt.json
~~~
{
"cpuset_cpus": "all",
"depends_on": [
"tripleo_nova_virtlogd.service"
],
"environment": {
"KOLLA_CONFIG_STRATEGY": "COPY_ALWAYS",
"TRIPLEO_CONFIG_HASH": "c2021708da4df13f562f4ae13a4c9fc4"
},
...
~~~
$ sudo podman inspect nova_virtlogd | jq .[0].Config.Env
[
...
"TRIPLEO_CONFIG_HASH=b30b8d8e81bf50209db634e0bed48c55",
...
]
/var/lib/tripleo-config/container-startup-config/step_3/hashed-nova_virtlogd.json
~~~
{
"environment": {
"KOLLA_CONFIG_STRATEGY": "COPY_ALWAYS",
"TRIPLEO_CONFIG_HASH": "c2021708da4df13f562f4ae13a4c9fc4"
},
...
~~~
So according to my observation so far, it seems that the "systemctl stop <service>" command doesn't fail even when "podman stop" fails because of a timeout, and this can cause failures in subsequent podman commands. As a result, only the systemd service file gets deleted, while the container is left running without being stopped...

To verify this, I replaced the ExecStop command with a dummy script which returns 125.
~~~
[root@compute-1 ~]# cat /root/test.sh
#!/bin/bash
exit 125
[root@compute-1 ~]# cat /etc/systemd/system/tripleo_logrotate_crond.service
[Unit]
Description=logrotate_crond container
After=paunch-container-shutdown.service
Wants=
[Service]
Restart=always
ExecStart=/usr/bin/podman start logrotate_crond
ExecReload=/usr/bin/podman kill --signal HUP logrotate_crond
ExecStop=/bin/bash /root/test.sh
KillMode=none
Type=forking
PIDFile=/var/run/logrotate_crond.pid
[Install]
WantedBy=multi-user.target
[root@compute-1 ~]#
~~~
However, systemctl stop doesn't return an error code.
~~~
[root@compute-1 ~]# sudo systemctl daemon-reload
[root@compute-1 ~]# sudo systemctl status tripleo_logrotate_crond.service
● tripleo_logrotate_crond.service - logrotate_crond container
Loaded: loaded (/etc/systemd/system/tripleo_logrotate_crond.service; enabled; vendor preset: disabled)
Active: active (running) since Wed 2020-12-09 04:10:00 UTC; 23h ago
Main PID: 2671 (conmon)
Tasks: 0 (limit: 101106)
Memory: 1.7M
CGroup: /system.slice/tripleo_logrotate_crond.service
‣ 2671 /usr/bin/conmon --api-version 1 -s -c e683b161a92beef0b1a66bebb01fdf99989893f904233a1d8701567547bab6f7 -u e683b161a92beef0b1a66bebb01fdf99989893f904233a1d8701567547bab6f7 -r /usr/bin/runc -b /var/lib/containers/storage/>
Dec 09 04:09:58 compute-1 systemd[1]: Starting logrotate_crond container...
Dec 09 04:10:00 compute-1 podman[2462]: 2020-12-09 04:10:00.305730294 +0000 UTC m=+1.282243334 container init e683b161a92beef0b1a66bebb01fdf99989893f904233a1d8701567547bab6f7 (image=undercloud-0.ctlplane.redhat.local:8787/rh-osbs/rhosp16>
Dec 09 04:10:00 compute-1 podman[2462]: 2020-12-09 04:10:00.334236854 +0000 UTC m=+1.310749885 container start e683b161a92beef0b1a66bebb01fdf99989893f904233a1d8701567547bab6f7 (image=undercloud-0.ctlplane.redhat.local:8787/rh-osbs/rhosp1>
Dec 09 04:10:00 compute-1 podman[2462]: logrotate_crond
Dec 09 04:10:00 compute-1 systemd[1]: Started logrotate_crond container.
[root@compute-1 ~]# sudo systemctl stop tripleo_logrotate_crond.service
[root@compute-1 ~]# echo $?
0
[root@compute-1 ~]# sudo systemctl status tripleo_logrotate_crond.service
● tripleo_logrotate_crond.service - logrotate_crond container
Loaded: loaded (/etc/systemd/system/tripleo_logrotate_crond.service; enabled; vendor preset: disabled)
Active: failed (Result: exit-code) since Thu 2020-12-10 03:20:54 UTC; 14s ago
Process: 252506 ExecStop=/bin/bash /root/test.sh (code=exited, status=125)
Main PID: 2671
Dec 09 04:09:58 compute-1 systemd[1]: Starting logrotate_crond container...
Dec 09 04:10:00 compute-1 podman[2462]: 2020-12-09 04:10:00.305730294 +0000 UTC m=+1.282243334 container init e683b161a92beef0b1a66bebb01fdf99989893f904233a1d8701567547bab6f7 (image=undercloud-0.ctlplane.redhat.local:8787/rh-osbs/rhosp16>
Dec 09 04:10:00 compute-1 podman[2462]: 2020-12-09 04:10:00.334236854 +0000 UTC m=+1.310749885 container start e683b161a92beef0b1a66bebb01fdf99989893f904233a1d8701567547bab6f7 (image=undercloud-0.ctlplane.redhat.local:8787/rh-osbs/rhosp1>
Dec 09 04:10:00 compute-1 podman[2462]: logrotate_crond
Dec 09 04:10:00 compute-1 systemd[1]: Started logrotate_crond container.
Dec 10 03:20:54 compute-1 systemd[1]: Stopping logrotate_crond container...
Dec 10 03:20:54 compute-1 systemd[1]: tripleo_logrotate_crond.service: Control process exited, code=exited status=125
Dec 10 03:20:54 compute-1 systemd[1]: tripleo_logrotate_crond.service: Failed with result 'exit-code'.
Dec 10 03:20:54 compute-1 systemd[1]: Stopped logrotate_crond container.
[root@compute-1 ~]# sudo podman ps | grep logrotate
e683b161a92b undercloud-0.ctlplane.redhat.local:8787/rh-osbs/rhosp16-openstack-cron:16.1_20201020.1 kolla_start 2 weeks ago Up 23 hours ago logrotate_crond
~~~
Even if I add ExecStopPost to simulate the content of the service files generated by paunch, systemctl stop still returns 0.
~~~
[root@compute-1 ~]# cat tripleo_logrotate_crond.service
[Unit]
Description=logrotate_crond container
After=paunch-container-shutdown.service
Wants=
[Service]
Restart=always
ExecStart=/usr/bin/podman start logrotate_crond
ExecReload=/usr/bin/podman kill --signal HUP logrotate_crond
ExecStop=/usr/bin/podman stop -t 10 logrotate_crond
ExecStopPost=/usr/bin/podman stop -t 10 logrotate_crond
KillMode=none
Type=forking
PIDFile=/var/run/logrotate_crond.pid
[Install]
[root@compute-1 ~]# sudo systemctl daemon-reload
[root@compute-1 ~]# sudo systemctl status tripleo_logrotate_crond.service
● tripleo_logrotate_crond.service - logrotate_crond container
Loaded: loaded (/etc/systemd/system/tripleo_logrotate_crond.service; enabled; vendor preset: disabled)
Active: active (running) since Thu 2020-12-10 03:24:49 UTC; 37s ago
Main PID: 253292 (conmon)
Tasks: 0 (limit: 101106)
Memory: 1.6M
CGroup: /system.slice/tripleo_logrotate_crond.service
‣ 253292 /usr/bin/conmon --api-version 1 -s -c e683b161a92beef0b1a66bebb01fdf99989893f904233a1d8701567547bab6f7 -u e683b161a92beef0b1a66bebb01fdf99989893f904233a1d8701567547bab6f7 -r /usr/bin/runc -b /var/lib/containers/storag>
Dec 10 03:24:48 compute-1 systemd[1]: Starting logrotate_crond container...
Dec 10 03:24:49 compute-1 podman[253247]: 2020-12-10 03:24:49.203373271 +0000 UTC m=+0.227471808 container init e683b161a92beef0b1a66bebb01fdf99989893f904233a1d8701567547bab6f7 (image=undercloud-0.ctlplane.redhat.local:8787/rh-osbs/rhosp>
Dec 10 03:24:49 compute-1 podman[253247]: 2020-12-10 03:24:49.218287972 +0000 UTC m=+0.242386484 container start e683b161a92beef0b1a66bebb01fdf99989893f904233a1d8701567547bab6f7 (image=undercloud-0.ctlplane.redhat.local:8787/rh-osbs/rhos>
Dec 10 03:24:49 compute-1 podman[253247]: logrotate_crond
Dec 10 03:24:49 compute-1 systemd[1]: Started logrotate_crond container.
[root@compute-1 ~]# systemctl stop tripleo_logrotate_crond.service
[root@compute-1 ~]# echo $?
0
[root@compute-1 ~]# systemctl status tripleo_logrotate_crond.service
● tripleo_logrotate_crond.service - logrotate_crond container
Loaded: loaded (/etc/systemd/system/tripleo_logrotate_crond.service; enabled; vendor preset: disabled)
Active: failed (Result: exit-code) since Thu 2020-12-10 03:27:06 UTC; 24s ago
Process: 253453 ExecStopPost=/bin/bash /root/test.sh (code=exited, status=125)
Process: 253451 ExecStop=/bin/bash /root/test.sh (code=exited, status=125)
Main PID: 253292
Dec 10 03:24:49 compute-1 podman[253247]: 2020-12-10 03:24:49.203373271 +0000 UTC m=+0.227471808 container init e683b161a92beef0b1a66bebb01fdf99989893f904233a1d8701567547bab6f7 (image=undercloud-0.ctlplane.redhat.local:8787/rh-osbs/rhosp>
Dec 10 03:24:49 compute-1 podman[253247]: 2020-12-10 03:24:49.218287972 +0000 UTC m=+0.242386484 container start e683b161a92beef0b1a66bebb01fdf99989893f904233a1d8701567547bab6f7 (image=undercloud-0.ctlplane.redhat.local:8787/rh-osbs/rhos>
Dec 10 03:24:49 compute-1 podman[253247]: logrotate_crond
Dec 10 03:24:49 compute-1 systemd[1]: Started logrotate_crond container.
Dec 10 03:25:36 compute-1 systemd[1]: Stopping logrotate_crond container...
Dec 10 03:25:36 compute-1 systemd[1]: tripleo_logrotate_crond.service: Control process exited, code=exited status=125
Dec 10 03:25:36 compute-1 systemd[1]: tripleo_logrotate_crond.service: Control process exited, code=exited status=125
Dec 10 03:27:06 compute-1 systemd[1]: tripleo_logrotate_crond.service: State 'stop-post' timed out. Terminating.
Dec 10 03:27:06 compute-1 systemd[1]: tripleo_logrotate_crond.service: Failed with result 'exit-code'.
Dec 10 03:27:06 compute-1 systemd[1]: Stopped logrotate_crond container.
[root@compute-1 ~]# sudo podman ps | grep logrotate
e683b161a92b undercloud-0.ctlplane.redhat.local:8787/rh-osbs/rhosp16-openstack-cron:16.1_20201020.1 kolla_start 2 weeks ago Up 2 minutes ago logrotate_crond
~~~
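A practical implication of the experiments above: the exit code of `systemctl stop` cannot be trusted, but systemd does record the failure in the unit's `Result` property, which `systemctl show -p Result <unit>` prints as e.g. `Result=exit-code`. Below is a minimal sketch of checking that property instead of the exit code; the helper name is hypothetical and only the output parsing is shown, this is not what paunch actually does:

```python
# Sketch: detect a failed ExecStop even when `systemctl stop` exits 0.
# `systemctl show -p Result,ExecMainStatus <unit>` prints lines such as
# "Result=exit-code"; "Result=success" means the unit stopped cleanly.

def stop_really_succeeded(show_output: str) -> bool:
    """Return True only if systemd recorded a clean stop for the unit."""
    props = dict(
        line.split("=", 1)
        for line in show_output.strip().splitlines()
        if "=" in line
    )
    return props.get("Result") == "success"

# In a deployment tool this would be combined with the actual calls,
# e.g. (illustrative only):
#   subprocess.run(["systemctl", "stop", unit], check=True)
#   out = subprocess.check_output(
#       ["systemctl", "show", "-p", "Result", unit], text=True)
#   if not stop_really_succeeded(out): ...retry or abort...
```

Fed the properties from the failing run above (`Result=exit-code`, status 125), such a check would report the stop failure that the exit code hides.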
So this is the scenario which results in the problem:
- For some reason a container doesn't stop within the specified timeout (10 seconds)
- podman stop fails with return code 125, but it doesn't forcefully stop the container
- systemd doesn't detect the failure to stop the container, and it doesn't stop the remaining container
- the subsequent steps perform podman stop/rm/rm -f, but don't check whether these steps complete successfully

I confirmed that changing the kill mode from none to control-group makes systemd forcefully stop the remaining container (maybe by SIGTERM or SIGKILL?), but I'm not sure whether this is the right thing to do.

Hello Takashi,
it would be interesting to understand why those two containers aren't stopping as they should. My best guess here is: there are a lot of VMs running on the node, and nova_libvirt is just unable to stop in the delay - this is also the same reason nova_virtlogd isn't stopped and ends in the same weird black hole. I think we should first check with Compute for this case, then we should be able to find a better way within paunch. Maybe the way VMs are managed within nova_libvirt is the issue? For instance, Neutron isn't doing the same thing: they start dedicated containers when a new network action is done (i.e. new dhcp/subnet). Might be a path to explore so that all the VMs don't depend on one unique container? Fact is, if we kill the container it might lead to data corruption within the VMs... Not sure this is something we actually want to see, do we?
Cheers,
C.

Hi Cédric,
> it would be interesting to understand why those two containers aren't stopping as they should. My best guess here is: there are a lot of VMs running on the node, and nova_libvirt is just unable to stop in the delay - this is also the same reason nova_virtlogd isn't stopped and ends in the same weird black hole.
I checked the logs of libvirt and messages around the time when podman couldn't stop these containers, but I didn't see any errors or interesting log entries so far.
All I could confirm so far is that podman failed to stop these containers, returning 125 (I guess it's a timeout), and that these containers were left running, without the processes inside receiving any stop signals.
A customer stopped these containers manually later, and they didn't see any issues at that time, so that "hang" seems to be something temporary.
Also, this problem happened not only with these libvirt containers but also with the neutron-ovs-agent container, so I'm afraid this is not specific to Compute.
Hello Takashi,
humpf. "great"..... seems to point to yet another podman thing. IIRC there were other issues with the timeout, and we already had to raise it a bit. Maybe that's "just" it - but I really doubt this is a way we want to follow... I'll call for help within DF then (thank you needinfo()) - on my side I don't really see what we can do without any risk :/. But seeing the whole thing, there might be an issue with the way systemd catches that 125 code, preventing paunch from actually getting it and retrying... Probably a thing to dig into.
@Alex, @Kevin, any thoughts or ideas?
Cheers,
C.

We'll have to look further into the error handling. We likely don't want to just fail the entire paunch execution, because we want to manage whatever we can in a given step. That being said, this should have been a deploy failure, so this will likely need more code/testing. I thought we had a different patch to bump timeouts or make them configurable to address this kind of problem, however.

So we specifically set KillMode to none due to race conditions. See https://review.opendev.org/c/openstack/paunch/+/645550 I'm not certain if switching to control-group has the same problem as process. So I don't think we address the podman failure itself for a timeout. The timeout value is configurable at a service level, but I don't think it's tunable by the end user at this time.

I checked, and KillMode=control-group has the same race condition as process, so we cannot switch to it. I think the solution here would be to catch this failure better and not remove the systemd unit if the service is still running. Additionally, we should ensure that we do fail when this condition occurs.

They have the wrong version of podman installed, so the containers are not able to be managed/renamed/etc.
They should have 1.6.4 but they have 1.9.3:
~~~
podman-1.9.3-2.module+el8.2.1+6867+366c07d6.x86_64 Sun Aug 2 15:15:23 2020
podman-docker-1.9.3-2.module+el8.2.1+6867+366c07d6.noarch Sun Aug 2 15:15:26 2020
~~~
The rename is occurring because the container cannot be removed. But the container has an improper state, so it never works.
~~~
2020-12-01 18:11:31.413 410894 ERROR paunch [ ] Error: cannot remove container f5a01ec7b2ee86c53f2d89d76a926b06ee9b93643e63ee8a778e2357c95c60d9 as it is running - running or paused containers cannot be removed without force: container state improper
~~~

Thanks Alex, for pointing out the wrong podman version. Sorry I missed that in my analysis...
I'll ask the customer to downgrade the podman package to have the correct version.
> The rename is occurring because the container cannot be removed. But the container has an improper state so it never works.
IIUC paunch is not renaming the container but removing it, to replace it with the newly created one.
What I've observed so far is that the container is still running and doesn't accept stop operations, via systemd or directly,
and is still running when paunch tries to remove it with "podman rm" or "podman rm -f".
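The fix proposed upthread — catch the failure and keep the systemd unit in place while the service is still running — could be sketched as a small guard on the container state. The function name and the idea of feeding it the output of `podman inspect --format '{{.State.Status}}'` are illustrative assumptions, not paunch's actual code:

```python
# Sketch of a guard before unit-file removal: only delete the systemd
# unit once the container has actually left the running/paused states.
# The state string is what `podman inspect --format '{{.State.Status}}'`
# would print, e.g. "running", "paused", "exited". The set of "alive"
# states is an assumption for this sketch.

STILL_ALIVE = {"running", "paused", "stopping"}

def safe_to_remove_unit(container_state: str) -> bool:
    """True if the container is stopped or gone, so the systemd unit
    file can be removed without orphaning a live container."""
    return container_state.strip().lower() not in STILL_ALIVE
```

With such a check, the failure mode in this bug (unit file deleted while nova_libvirt kept running) would instead surface as a deploy error.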
Yeah, I looked at the paunch code, and we actually remove the systemd unit prior to stopping it, because they'll conflict when trying to operate on the container. The bit of code from the log is actually from the rename functions and not part of the actual service management. I've got some additional patches to improve error handling when these conditions occur, but it's likely that manual intervention will be needed if this condition occurs. If systemctl doesn't return an error, there's not much we can do to detect this state.

Set qe_test_coverage to - because the root cause of the problem was using an incorrect podman version. That should not be automated. A commit was made to handle a corner-case error condition. A way to force creation of that error manually or through automation hasn't been found.

Paused the nova_libvirt container, redeployed the OC, and verified tripleo_nova_libvirt.service exists.
Removed the nova_libvirt container, redeployed the OC, and verified tripleo_nova_libvirt.service exists.

Since the problem described in this bug report should be resolved in a recent advisory, it has been closed with a resolution of ERRATA. For information on the advisory (Red Hat OpenStack Platform 16.1.6 bug fix and enhancement advisory), and where to find the updated files, follow the link below. If the solution does not work for you, open a new bug report.
https://access.redhat.com/errata/RHBA-2021:2097
Description of problem:
We noticed that some systemd service files are missing from overcloud nodes after a successful stack update.
For example, compute-4 doesn't have the tripleo_nova_libvirt service after the stack update
~~~
[heat-admin@compute-4 ~]$ sudo ls /etc/systemd/system
bacula-fd.service nfs-mountd.service.requires tripleo_ceilometer_agent_compute.service.requires tripleo_nova_compute.service
basic.target.wants nfs-server.service.requires tripleo_ceilometer_agent_compute_healthcheck.service tripleo_nova_compute.service.requires
cloud-init.target.wants remote-fs.target.wants tripleo_ceilometer_agent_compute_healthcheck.timer tripleo_nova_compute_healthcheck.service
ctrl-alt-del.target rpc-gssd.service.requires tripleo_logrotate_crond.service tripleo_nova_compute_healthcheck.timer
dbus-org.freedesktop.nm-dispatcher.service rpc-statd-notify.service.requires tripleo_logrotate_crond.service.requires tripleo_nova_migration_target.service
dbus-org.freedesktop.resolve1.service rpc-statd.service.requires tripleo_logrotate_crond_healthcheck.service tripleo_nova_migration_target.service.requires
dbus-org.freedesktop.timedate1.service sockets.target.wants tripleo_logrotate_crond_healthcheck.timer tripleo_nova_migration_target_healthcheck.service
default.target sysinit.target.wants tripleo_neutron_ovs_agent.service tripleo_nova_migration_target_healthcheck.timer
getty.target.wants syslog.service tripleo_neutron_ovs_agent.service.requires tripleo_nova_virtlogd.service
ksm.service sysstat.service.wants tripleo_neutron_ovs_agent_healthcheck.service tripleo_nova_virtlogd.service.requires
ksmtuned.service systemd-timedated.service tripleo_neutron_ovs_agent_healthcheck.timer tripleo_nova_virtlogd_healthcheck.service
multi-user.target.wants timers.target.wants tripleo_neutron_sriov_agent.service tripleo_nova_virtlogd_healthcheck.timer
network-online.target.wants tripleo-ip6tables.service tripleo_neutron_sriov_agent.service.requires
nfs-blkmap.service.requires tripleo-iptables.service tripleo_neutron_sriov_agent_healthcheck.service
nfs-idmapd.service.requires tripleo_ceilometer_agent_compute.service tripleo_neutron_sriov_agent_healthcheck.timer
[heat-admin@compute-4 ~]$
~~~
and compute-5 doesn't have the tripleo_nova_virtlogd service on the other hand.
~~~
[heat-admin@compute-5 ~]$ sudo ls /etc/systemd/system
bacula-fd.service nfs-mountd.service.requires tripleo_ceilometer_agent_compute.service.requires tripleo_nova_compute.service
basic.target.wants nfs-server.service.requires tripleo_ceilometer_agent_compute_healthcheck.service tripleo_nova_compute.service.requires
cloud-init.target.wants remote-fs.target.wants tripleo_ceilometer_agent_compute_healthcheck.timer tripleo_nova_compute_healthcheck.service
ctrl-alt-del.target rpc-gssd.service.requires tripleo_logrotate_crond.service tripleo_nova_compute_healthcheck.timer
dbus-org.freedesktop.nm-dispatcher.service rpc-statd-notify.service.requires tripleo_logrotate_crond.service.requires tripleo_nova_libvirt.service
dbus-org.freedesktop.resolve1.service rpc-statd.service.requires tripleo_logrotate_crond_healthcheck.service tripleo_nova_libvirt.service.requires
dbus-org.freedesktop.timedate1.service sockets.target.wants tripleo_logrotate_crond_healthcheck.timer tripleo_nova_libvirt_healthcheck.service
default.target sysinit.target.wants tripleo_neutron_ovs_agent.service tripleo_nova_libvirt_healthcheck.timer
getty.target.wants syslog.service tripleo_neutron_ovs_agent.service.requires tripleo_nova_migration_target.service
ksm.service sysstat.service.wants tripleo_neutron_ovs_agent_healthcheck.service tripleo_nova_migration_target.service.requires
ksmtuned.service systemd-timedated.service tripleo_neutron_ovs_agent_healthcheck.timer tripleo_nova_migration_target_healthcheck.service
multi-user.target.wants timers.target.wants tripleo_neutron_sriov_agent.service tripleo_nova_migration_target_healthcheck.timer
network-online.target.wants tripleo-ip6tables.service tripleo_neutron_sriov_agent.service.requires
nfs-blkmap.service.requires tripleo-iptables.service tripleo_neutron_sriov_agent_healthcheck.service
nfs-idmapd.service.requires tripleo_ceilometer_agent_compute.service tripleo_neutron_sriov_agent_healthcheck.timer
[heat-admin@compute-5 ~]$
~~~
In ansible.log we found the following outputs, which indicate that paunch failed to stop these containers for some reason. However, since paunch doesn't fail even if it fails to stop containers, all of these tasks reported ok:
~~~
2020-11-11 04:31:31,313 p=234567 u=mistral n=ansible | TASK [Wait for containers to start for step 3 using paunch] ********************
2020-11-11 04:31:31,313 p=234567 u=mistral n=ansible | Wednesday 11 November 2020 04:31:31 +0900 (0:00:02.158) 0:25:34.244 ****
...
2020-11-11 04:32:04,713 p=230357 u=mistral n=ansible | ok: [compute-4] => {"action": ["Applying config_id tripleo_step3"], ...
...
2020-11-11 04:32:04,816 p=230357 u=mistral n=ansible | ok: [compute-5] => {"action": ["Applying config_id tripleo_step3"], ...
...
2020-11-11 04:32:34,367 p=230357 u=mistral n=ansible | TASK [Debug output for task: Start containers for step 3] **********************
2020-11-11 04:32:34,367 p=230357 u=mistral n=ansible | Wednesday 11 November 2020 04:32:34 +0900 (0:01:03.053) 0:26:37.297 ****
....
2020-11-11 04:32:35,205 p=230357 u=mistral n=ansible | ok: [compute-4] => {
    "failed_when_result": false,
    "start_containers_outputs.stdout_lines | default([]) | union(start_containers_outputs.stderr_lines | default([]))": [
        "Error executing ['podman', 'stop', 'nova_libvirt']: returned 125",
        "Error executing ['podman', 'rm', 'nova_libvirt']: returned 2",
        "Error removing container gracefully: nova_libvirt",
        "Error: cannot remove container 880a99e986801599a233793aeb07cf15bd0e60444451fc1381a29d12b68d261a as it is running - running or paused containers cannot be removed without force: container state improper",
        "",
        "Error executing ['podman', 'rm', '-f', 'nova_libvirt']: returned 125",
        "Error removing container: nova_libvirt",
        "Error: cannot remove container 880a99e986801599a233793aeb07cf15bd0e60444451fc1381a29d12b68d261a as it could not be stopped: given PIDs did not die within timeout"
    ]
}
2020-11-11 04:32:35,313 p=230357 u=mistral n=ansible | ok: [compute-5] => {
    "failed_when_result": false,
    "start_containers_outputs.stdout_lines | default([]) | union(start_containers_outputs.stderr_lines | default([]))": [
        "Error executing ['podman', 'stop', 'nova_virtlogd']: returned 125",
        "Error executing ['podman', 'rm', 'nova_virtlogd']: returned 2",
        "Error removing container gracefully: nova_virtlogd",
        "Error: cannot remove container f7c1be87a80b5ea434c3d4abe85cf1d527e33ad5cf83e0ffbbc72dd483305e6c as it is running - running or paused containers cannot be removed without force: container state improper",
        "",
        "Error executing ['podman', 'rm', '-f', 'nova_virtlogd']: returned 125",
        "Error removing container: nova_virtlogd",
        "Error: cannot remove container f7c1be87a80b5ea434c3d4abe85cf1d527e33ad5cf83e0ffbbc72dd483305e6c as it could not be stopped: given PIDs did not die within timeout"
    ]
}
...
~~~
The current problem is that paunch does not fail even if it fails to stop and delete some containers.
What is worse, it doesn't run the steps to create containers and systemd files if there are already containers running, which causes some systemd files to get deleted.

Version-Release number of selected component (if applicable):
The overcloud nodes have the following paunch packages installed.
paunch-services-5.3.3-0.20200527083422.16ae5e4.el8ost.noarch
python3-paunch-5.3.3-0.20200527083422.16ae5e4.el8ost.noarch

How reproducible:
The issue was observed once so far.

Steps to Reproduce:
1.
2.
3.

Actual results:
The deployment doesn't fail even though paunch fails to stop/delete containers, and some systemd files are deleted.

Expected results:
The deployment fails, or paunch doesn't delete systemd files.

Additional info:
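The error cascade in the logs above (podman stop returning 125, podman rm returning 2, podman rm -f returning 125, with the task still reported ok) is exactly where a return-code check was missing. Below is a minimal sketch of such a check, with the command runner injected so the logic can be exercised without podman; `remove_container` and `Runner` are hypothetical names for illustration, not paunch's real API:

```python
import subprocess
from typing import Callable, Sequence

# Sketch of explicit return-code handling around the stop/rm/rm -f
# cascade seen in the ansible log.

Runner = Callable[[Sequence[str]], int]

def podman_runner(cmd: Sequence[str]) -> int:
    """Execute a podman command and return its exit code."""
    return subprocess.run(cmd).returncode

def remove_container(name: str, run: Runner = podman_runner) -> bool:
    """Try to stop and remove a container; return False if it could not
    be removed, so the caller can fail the step instead of deleting the
    systemd unit anyway."""
    # Best-effort stop; in the log this returned 125 (timeout) and the
    # failure was silently ignored.
    run(["podman", "stop", name])
    # Graceful removal; returned 2 in the log ("container state improper").
    if run(["podman", "rm", name]) == 0:
        return True
    # Forced removal; returned 125 ("PIDs did not die within timeout").
    return run(["podman", "rm", "-f", name]) == 0
```

Replaying the failing return codes from the log through a fake runner makes `remove_container` report the failure that the original run swallowed, which is what would turn this silent condition into a deploy error.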