Bug 1901754

Summary: paunch deletes systemd service file and does not regenerate it when it fails to stop/remove containers
Product: Red Hat OpenStack Reporter: Takashi Kajinami <tkajinam>
Component: python-paunch Assignee: Alex Schultz <aschultz>
Status: CLOSED ERRATA QA Contact: David Rosenfeld <drosenfe>
Severity: urgent Docs Contact:
Priority: high    
Version: 16.1 (Train) CC: aschultz, cjeanner, jhajyahy, kecarter, knoha
Target Milestone: z6 Keywords: Triaged
Target Release: 16.1 (Train on RHEL 8.2)   
Hardware: Unspecified   
OS: Unspecified   
Whiteboard:
Fixed In Version: python-paunch-5.3.3-1.20210310104358.ed2c015.el8ost Doc Type: If docs needed, set a value
Doc Text:
Story Points: ---
Clone Of: Environment:
Last Closed: 2021-05-26 13:49:37 UTC Type: Bug
Regression: --- Mount Type: ---
Documentation: --- CRM:
Verified Versions: Category: ---
oVirt Team: --- RHEL 7.3 requirements from Atomic Host:
Cloudforms Team: --- Target Upstream Version:
Embargoed:

Description Takashi Kajinami 2020-11-26 00:17:16 UTC
Description of problem:

We noticed that some systemd service files are missing from the overcloud nodes after a successful stack update.

For example, compute-4 doesn't have the tripleo_nova_libvirt service after the stack update
~~~
[heat-admin@compute-4 ~]$ sudo ls /etc/systemd/system
bacula-fd.service                           nfs-mountd.service.requires               tripleo_ceilometer_agent_compute.service.requires     tripleo_nova_compute.service
basic.target.wants                          nfs-server.service.requires               tripleo_ceilometer_agent_compute_healthcheck.service  tripleo_nova_compute.service.requires
cloud-init.target.wants                     remote-fs.target.wants                    tripleo_ceilometer_agent_compute_healthcheck.timer    tripleo_nova_compute_healthcheck.service
ctrl-alt-del.target                         rpc-gssd.service.requires                 tripleo_logrotate_crond.service                       tripleo_nova_compute_healthcheck.timer
dbus-org.freedesktop.nm-dispatcher.service  rpc-statd-notify.service.requires         tripleo_logrotate_crond.service.requires              tripleo_nova_migration_target.service
dbus-org.freedesktop.resolve1.service       rpc-statd.service.requires                tripleo_logrotate_crond_healthcheck.service           tripleo_nova_migration_target.service.requires
dbus-org.freedesktop.timedate1.service      sockets.target.wants                      tripleo_logrotate_crond_healthcheck.timer             tripleo_nova_migration_target_healthcheck.service
default.target                              sysinit.target.wants                      tripleo_neutron_ovs_agent.service                     tripleo_nova_migration_target_healthcheck.timer
getty.target.wants                          syslog.service                            tripleo_neutron_ovs_agent.service.requires            tripleo_nova_virtlogd.service
ksm.service                                 sysstat.service.wants                     tripleo_neutron_ovs_agent_healthcheck.service         tripleo_nova_virtlogd.service.requires
ksmtuned.service                            systemd-timedated.service                 tripleo_neutron_ovs_agent_healthcheck.timer           tripleo_nova_virtlogd_healthcheck.service
multi-user.target.wants                     timers.target.wants                       tripleo_neutron_sriov_agent.service                   tripleo_nova_virtlogd_healthcheck.timer
network-online.target.wants                 tripleo-ip6tables.service                 tripleo_neutron_sriov_agent.service.requires
nfs-blkmap.service.requires                 tripleo-iptables.service                  tripleo_neutron_sriov_agent_healthcheck.service
nfs-idmapd.service.requires                 tripleo_ceilometer_agent_compute.service  tripleo_neutron_sriov_agent_healthcheck.timer
[heat-admin@compute-4 ~]$
~~~

and compute-5, on the other hand, doesn't have the tripleo_nova_virtlogd service.
~~~
[heat-admin@compute-5 ~]$ sudo ls /etc/systemd/system
bacula-fd.service                           nfs-mountd.service.requires               tripleo_ceilometer_agent_compute.service.requires     tripleo_nova_compute.service
basic.target.wants                          nfs-server.service.requires               tripleo_ceilometer_agent_compute_healthcheck.service  tripleo_nova_compute.service.requires
cloud-init.target.wants                     remote-fs.target.wants                    tripleo_ceilometer_agent_compute_healthcheck.timer    tripleo_nova_compute_healthcheck.service
ctrl-alt-del.target                         rpc-gssd.service.requires                 tripleo_logrotate_crond.service                       tripleo_nova_compute_healthcheck.timer
dbus-org.freedesktop.nm-dispatcher.service  rpc-statd-notify.service.requires         tripleo_logrotate_crond.service.requires              tripleo_nova_libvirt.service
dbus-org.freedesktop.resolve1.service       rpc-statd.service.requires                tripleo_logrotate_crond_healthcheck.service           tripleo_nova_libvirt.service.requires
dbus-org.freedesktop.timedate1.service      sockets.target.wants                      tripleo_logrotate_crond_healthcheck.timer             tripleo_nova_libvirt_healthcheck.service
default.target                              sysinit.target.wants                      tripleo_neutron_ovs_agent.service                     tripleo_nova_libvirt_healthcheck.timer
getty.target.wants                          syslog.service                            tripleo_neutron_ovs_agent.service.requires            tripleo_nova_migration_target.service
ksm.service                                 sysstat.service.wants                     tripleo_neutron_ovs_agent_healthcheck.service         tripleo_nova_migration_target.service.requires
ksmtuned.service                            systemd-timedated.service                 tripleo_neutron_ovs_agent_healthcheck.timer           tripleo_nova_migration_target_healthcheck.service
multi-user.target.wants                     timers.target.wants                       tripleo_neutron_sriov_agent.service                   tripleo_nova_migration_target_healthcheck.timer
network-online.target.wants                 tripleo-ip6tables.service                 tripleo_neutron_sriov_agent.service.requires
nfs-blkmap.service.requires                 tripleo-iptables.service                  tripleo_neutron_sriov_agent_healthcheck.service
nfs-idmapd.service.requires                 tripleo_ceilometer_agent_compute.service  tripleo_neutron_sriov_agent_healthcheck.timer
[heat-admin@compute-5 ~]$
~~~

In ansible.log we found the following output, which indicates that
paunch failed to stop these containers for some reason.
However, since paunch doesn't fail even when it fails to stop containers,
all of these tasks were reported as "ok":
~~~
2020-11-11 04:31:31,313 p=234567 u=mistral n=ansible | TASK [Wait for containers to start for step 3 using paunch] ********************
2020-11-11 04:31:31,313 p=234567 u=mistral n=ansible | Wednesday 11 November 2020  04:31:31 +0900 (0:00:02.158)       0:25:34.244 **** 
...
2020-11-11 04:32:04,713 p=230357 u=mistral n=ansible | ok: [compute-4] => {"action": ["Applying config_id tripleo_step3"], ...
...
2020-11-11 04:32:04,816 p=230357 u=mistral n=ansible | ok: [compute-5] => {"action": ["Applying config_id tripleo_step3"], ...
...
2020-11-11 04:32:34,367 p=230357 u=mistral n=ansible | TASK [Debug output for task: Start containers for step 3] **********************
2020-11-11 04:32:34,367 p=230357 u=mistral n=ansible | Wednesday 11 November 2020  04:32:34 +0900 (0:01:03.053)       0:26:37.297 **** 
....
2020-11-11 04:32:35,205 p=230357 u=mistral n=ansible | ok: [compute-4] => {
    "failed_when_result": false,
    "start_containers_outputs.stdout_lines | default([]) | union(start_containers_outputs.stderr_lines | default([]))": [
        "Error executing ['podman', 'stop', 'nova_libvirt']: returned 125",
        "Error executing ['podman', 'rm', 'nova_libvirt']: returned 2",
        "Error removing container gracefully: nova_libvirt",
        "Error: cannot remove container 880a99e986801599a233793aeb07cf15bd0e60444451fc1381a29d12b68d261a as it is running - running or paused containers cannot be removed without force: container state improper",
        "",
        "Error executing ['podman', 'rm', '-f', 'nova_libvirt']: returned 125",
        "Error removing container: nova_libvirt",
        "Error: cannot remove container 880a99e986801599a233793aeb07cf15bd0e60444451fc1381a29d12b68d261a as it could not be stopped: given PIDs did not die within timeout"
    ]
}
2020-11-11 04:32:35,313 p=230357 u=mistral n=ansible | ok: [compute-5] => {
    "failed_when_result": false,
    "start_containers_outputs.stdout_lines | default([]) | union(start_containers_outputs.stderr_lines | default([]))": [
        "Error executing ['podman', 'stop', 'nova_virtlogd']: returned 125",
        "Error executing ['podman', 'rm', 'nova_virtlogd']: returned 2",
        "Error removing container gracefully: nova_virtlogd",
        "Error: cannot remove container f7c1be87a80b5ea434c3d4abe85cf1d527e33ad5cf83e0ffbbc72dd483305e6c as it is running - running or paused containers cannot be removed without force: container state improper",
        "",
        "Error executing ['podman', 'rm', '-f', 'nova_virtlogd']: returned 125",
        "Error removing container: nova_virtlogd",
        "Error: cannot remove container f7c1be87a80b5ea434c3d4abe85cf1d527e33ad5cf83e0ffbbc72dd483305e6c as it could not be stopped: given PIDs did not die within timeout"
    ]
}
...
~~~

The current problem is that paunch does not fail even when it fails to stop and delete some containers.
What is worse, it doesn't run the steps that create the containers and systemd files when containers are already running,
which causes some systemd files to be deleted without being regenerated.
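
For illustration, the following is a minimal Python sketch (not the actual paunch code) of the behaviour we would expect: propagate podman stop/rm failures to the caller so the step fails, instead of silently continuing and later dropping the unit file.
~~~
#!/usr/bin/env python3
# Minimal sketch, NOT the actual paunch implementation: stop and remove a
# container, and raise an exception (so the calling step fails) instead of
# silently continuing when podman reports an error.
import subprocess


def run(cmd):
    """Run a command and return its exit code, logging failures."""
    proc = subprocess.run(cmd, capture_output=True, text=True)
    if proc.returncode != 0:
        print("Error executing %s: returned %d" % (cmd, proc.returncode))
        print(proc.stderr.strip())
    return proc.returncode


def stop_and_remove(container):
    # Try a graceful stop and removal first, then escalate to rm -f.
    run(['podman', 'stop', container])
    if run(['podman', 'rm', container]) != 0:
        if run(['podman', 'rm', '-f', container]) != 0:
            # The container is still around: fail loudly instead of moving
            # on and later deleting its systemd unit file.
            raise RuntimeError('failed to remove container %s' % container)


if __name__ == '__main__':
    stop_and_remove('nova_libvirt')
~~~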


Version-Release number of selected component (if applicable):

The overcloud nodes have the following paunch packages installed.
 paunch-services-5.3.3-0.20200527083422.16ae5e4.el8ost.noarch
 python3-paunch-5.3.3-0.20200527083422.16ae5e4.el8ost.noarch

How reproducible:

The issue was observed once so far.


Steps to Reproduce:
1. 
2.
3.

Actual results:
deployment doesn't fail even though paunch fails to stop/delete containers, and some systemd files are deleted

Expected results:
deployment fails or paunch doesn't delete systemd files

Additional info:

Comment 1 Takashi Kajinami 2020-11-26 00:19:23 UTC
*** Bug 1901753 has been marked as a duplicate of this bug. ***

Comment 2 Takashi Kajinami 2020-11-26 00:32:16 UTC
I confirmed that the containers paunch failed to stop (nova_libvirt on compute-4 and nova_virtlogd on compute-5)
have a TRIPLEO_CONFIG_HASH different from the one in the container-startup-config JSON.
This shows that paunch didn't run the steps to recreate these containers with the new config hash...


compute-4

$ sudo podman inspect nova_libvirt | jq .[0].Config.Env
[
  ...
  "TRIPLEO_CONFIG_HASH=625b1a9cec416ffb7510a575bb82a938",
  ...
]

/var/lib/tripleo-config/container-startup-config/step_3/hashed-nova_libvirt.json
~~~
{
  "cpuset_cpus": "all",
  "depends_on": [
    "tripleo_nova_virtlogd.service"
  ],
  "environment": {
    "KOLLA_CONFIG_STRATEGY": "COPY_ALWAYS",
    "TRIPLEO_CONFIG_HASH": "a7a350bde34fee59724c7cf52366877d"
  },
  ...
~~~


$ sudo podman inspect nova_virtlogd | jq .[0].Config.Env
[
  ...
  "TRIPLEO_CONFIG_HASH=a7a350bde34fee59724c7cf52366877d",
  ...
]

/var/lib/tripleo-config/container-startup-config/step_3/hashed-nova_virtlogd.json
~~~
{
  "environment": {
    "KOLLA_CONFIG_STRATEGY": "COPY_ALWAYS",
    "TRIPLEO_CONFIG_HASH": "a7a350bde34fee59724c7cf52366877d"
  },
  ...
~~~


compute-5


$ sudo podman inspect nova_libvirt | jq .[0].Config.Env
[
  ...
  "TRIPLEO_CONFIG_HASH=c2021708da4df13f562f4ae13a4c9fc4",
  ...
]

/var/lib/tripleo-config/container-startup-config/step_3/hashed-nova_libvirt.json
~~~
{
  "cpuset_cpus": "all",
  "depends_on": [
    "tripleo_nova_virtlogd.service"
  ],
  "environment": {
    "KOLLA_CONFIG_STRATEGY": "COPY_ALWAYS",
    "TRIPLEO_CONFIG_HASH": "c2021708da4df13f562f4ae13a4c9fc4"
  },
  ...
~~~


$ sudo podman inspect nova_virtlogd | jq .[0].Config.Env
[
  ...
  "TRIPLEO_CONFIG_HASH=b30b8d8e81bf50209db634e0bed48c55",
  ...
]

/var/lib/tripleo-config/container-startup-config/step_3/hashed-nova_virtlogd.json
~~~
{
  "environment": {
    "KOLLA_CONFIG_STRATEGY": "COPY_ALWAYS",
    "TRIPLEO_CONFIG_HASH": "c2021708da4df13f562f4ae13a4c9fc4"
  },
  ...
~~~
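
The manual comparison above could be automated with a small Python sketch like the one below (assuming the podman CLI and the paths/container names shown in this comment; this is not part of paunch).
~~~
#!/usr/bin/env python3
# Sketch of the check done by hand above: compare the TRIPLEO_CONFIG_HASH in a
# running container's environment with the hash in its startup-config JSON.
import json
import subprocess

CONTAINER = 'nova_libvirt'
CONFIG = ('/var/lib/tripleo-config/container-startup-config/'
          'step_3/hashed-%s.json' % CONTAINER)

# Hash baked into the running container (equivalent to podman inspect | jq).
out = subprocess.run(['podman', 'inspect', CONTAINER],
                     capture_output=True, text=True, check=True).stdout
env = json.loads(out)[0]['Config']['Env']
running = dict(e.split('=', 1) for e in env if '=' in e).get('TRIPLEO_CONFIG_HASH')

# Hash in the startup config that paunch should have applied.
with open(CONFIG) as f:
    expected = json.load(f)['environment']['TRIPLEO_CONFIG_HASH']

print('running :', running)
print('expected:', expected)
print('match' if running == expected else 'MISMATCH - container was not recreated')
~~~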

Comment 4 Takashi Kajinami 2020-12-09 09:06:21 UTC
According to my observations so far, it seems that the "systemctl stop <service>" command doesn't fail even when "podman stop" fails because of a timeout, and this can cause failures in the subsequent podman commands.
As a result, only the systemd service file gets deleted, while the container is left running without ever being stopped...
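
For reference, a quick way to confirm this mismatch on a node is sketched below (a hypothetical helper, run as root; it uses the tripleo_logrotate_crond unit from the next comment as an example).
~~~
#!/usr/bin/env python3
# Sketch: verify that "systemctl stop" reports success while the container
# it manages is still running. Run as root on the affected node.
import subprocess

UNIT = 'tripleo_logrotate_crond.service'
CONTAINER = 'logrotate_crond'

stop = subprocess.run(['systemctl', 'stop', UNIT])
print('systemctl stop exit code:', stop.returncode)

# Check whether the container is still actually running.
ps = subprocess.run(['podman', 'ps', '--filter', 'name=%s' % CONTAINER,
                     '--format', '{{.Names}}'], capture_output=True, text=True)
still_running = CONTAINER in ps.stdout.split()
print('container still running:', still_running)

if stop.returncode == 0 and still_running:
    print('systemd hid the stop failure; with KillMode=none the container is left alive')
~~~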

Comment 5 Takashi Kajinami 2020-12-10 03:28:45 UTC
I replaced the ExecStop command with a dummy script which returns 125.

~~~
[root@compute-1 ~]# cat /root/test.sh 
#!/bin/bash

exit 125
[root@compute-1 ~]# cat /etc/systemd/system/tripleo_logrotate_crond.service
[Unit]
Description=logrotate_crond container
After=paunch-container-shutdown.service
Wants=
[Service]
Restart=always
ExecStart=/usr/bin/podman start logrotate_crond
ExecReload=/usr/bin/podman kill --signal HUP logrotate_crond
ExecStop=/bin/bash /root/test.sh
KillMode=none
Type=forking
PIDFile=/var/run/logrotate_crond.pid

[Install]
WantedBy=multi-user.target
[root@compute-1 ~]#
~~~

However, systemctl stop doesn't return an error code.
~~~
[root@compute-1 ~]# sudo systemctl daemon-reload
[root@compute-1 ~]# sudo systemctl status tripleo_logrotate_crond.service
● tripleo_logrotate_crond.service - logrotate_crond container
   Loaded: loaded (/etc/systemd/system/tripleo_logrotate_crond.service; enabled; vendor preset: disabled)
   Active: active (running) since Wed 2020-12-09 04:10:00 UTC; 23h ago
 Main PID: 2671 (conmon)
    Tasks: 0 (limit: 101106)
   Memory: 1.7M
   CGroup: /system.slice/tripleo_logrotate_crond.service
           ‣ 2671 /usr/bin/conmon --api-version 1 -s -c e683b161a92beef0b1a66bebb01fdf99989893f904233a1d8701567547bab6f7 -u e683b161a92beef0b1a66bebb01fdf99989893f904233a1d8701567547bab6f7 -r /usr/bin/runc -b /var/lib/containers/storage/>

Dec 09 04:09:58 compute-1 systemd[1]: Starting logrotate_crond container...
Dec 09 04:10:00 compute-1 podman[2462]: 2020-12-09 04:10:00.305730294 +0000 UTC m=+1.282243334 container init e683b161a92beef0b1a66bebb01fdf99989893f904233a1d8701567547bab6f7 (image=undercloud-0.ctlplane.redhat.local:8787/rh-osbs/rhosp16>
Dec 09 04:10:00 compute-1 podman[2462]: 2020-12-09 04:10:00.334236854 +0000 UTC m=+1.310749885 container start e683b161a92beef0b1a66bebb01fdf99989893f904233a1d8701567547bab6f7 (image=undercloud-0.ctlplane.redhat.local:8787/rh-osbs/rhosp1>
Dec 09 04:10:00 compute-1 podman[2462]: logrotate_crond
Dec 09 04:10:00 compute-1 systemd[1]: Started logrotate_crond container.
[root@compute-1 ~]# sudo systemctl stop tripleo_logrotate_crond.service
[root@compute-1 ~]# echo $?
0
[root@compute-1 ~]# sudo systemctl status tripleo_logrotate_crond.service
● tripleo_logrotate_crond.service - logrotate_crond container
   Loaded: loaded (/etc/systemd/system/tripleo_logrotate_crond.service; enabled; vendor preset: disabled)
   Active: failed (Result: exit-code) since Thu 2020-12-10 03:20:54 UTC; 14s ago
  Process: 252506 ExecStop=/bin/bash /root/test.sh (code=exited, status=125)
 Main PID: 2671

Dec 09 04:09:58 compute-1 systemd[1]: Starting logrotate_crond container...
Dec 09 04:10:00 compute-1 podman[2462]: 2020-12-09 04:10:00.305730294 +0000 UTC m=+1.282243334 container init e683b161a92beef0b1a66bebb01fdf99989893f904233a1d8701567547bab6f7 (image=undercloud-0.ctlplane.redhat.local:8787/rh-osbs/rhosp16>
Dec 09 04:10:00 compute-1 podman[2462]: 2020-12-09 04:10:00.334236854 +0000 UTC m=+1.310749885 container start e683b161a92beef0b1a66bebb01fdf99989893f904233a1d8701567547bab6f7 (image=undercloud-0.ctlplane.redhat.local:8787/rh-osbs/rhosp1>
Dec 09 04:10:00 compute-1 podman[2462]: logrotate_crond
Dec 09 04:10:00 compute-1 systemd[1]: Started logrotate_crond container.
Dec 10 03:20:54 compute-1 systemd[1]: Stopping logrotate_crond container...
Dec 10 03:20:54 compute-1 systemd[1]: tripleo_logrotate_crond.service: Control process exited, code=exited status=125
Dec 10 03:20:54 compute-1 systemd[1]: tripleo_logrotate_crond.service: Failed with result 'exit-code'.
Dec 10 03:20:54 compute-1 systemd[1]: Stopped logrotate_crond container.
[root@compute-1 ~]# sudo podman ps | grep logrotate
e683b161a92b  undercloud-0.ctlplane.redhat.local:8787/rh-osbs/rhosp16-openstack-cron:16.1_20201020.1                        kolla_start  2 weeks ago  Up 23 hours ago         logrotate_crond
~~~

Even if I add an ExecStopPost command to simulate the content of the service files generated by paunch, systemctl stop still returns 0.

~~~
[root@compute-1 ~]# cat tripleo_logrotate_crond.service 
[Unit]
Description=logrotate_crond container
After=paunch-container-shutdown.service
Wants=
[Service]
Restart=always
ExecStart=/usr/bin/podman start logrotate_crond
ExecReload=/usr/bin/podman kill --signal HUP logrotate_crond
ExecStop=/usr/bin/podman stop -t 10 logrotate_crond
ExecStopPost=/usr/bin/podman stop -t 10 logrotate_crond
KillMode=none
Type=forking
PIDFile=/var/run/logrotate_crond.pid

[Install]
[root@compute-1 ~]# sudo systemctl daemon-reload
[root@compute-1 ~]# sudo systemctl status tripleo_logrotate_crond.service
● tripleo_logrotate_crond.service - logrotate_crond container
   Loaded: loaded (/etc/systemd/system/tripleo_logrotate_crond.service; enabled; vendor preset: disabled)
   Active: active (running) since Thu 2020-12-10 03:24:49 UTC; 37s ago
 Main PID: 253292 (conmon)
    Tasks: 0 (limit: 101106)
   Memory: 1.6M
   CGroup: /system.slice/tripleo_logrotate_crond.service
           ‣ 253292 /usr/bin/conmon --api-version 1 -s -c e683b161a92beef0b1a66bebb01fdf99989893f904233a1d8701567547bab6f7 -u e683b161a92beef0b1a66bebb01fdf99989893f904233a1d8701567547bab6f7 -r /usr/bin/runc -b /var/lib/containers/storag>

Dec 10 03:24:48 compute-1 systemd[1]: Starting logrotate_crond container...
Dec 10 03:24:49 compute-1 podman[253247]: 2020-12-10 03:24:49.203373271 +0000 UTC m=+0.227471808 container init e683b161a92beef0b1a66bebb01fdf99989893f904233a1d8701567547bab6f7 (image=undercloud-0.ctlplane.redhat.local:8787/rh-osbs/rhosp>
Dec 10 03:24:49 compute-1 podman[253247]: 2020-12-10 03:24:49.218287972 +0000 UTC m=+0.242386484 container start e683b161a92beef0b1a66bebb01fdf99989893f904233a1d8701567547bab6f7 (image=undercloud-0.ctlplane.redhat.local:8787/rh-osbs/rhos>
Dec 10 03:24:49 compute-1 podman[253247]: logrotate_crond
Dec 10 03:24:49 compute-1 systemd[1]: Started logrotate_crond container.
[root@compute-1 ~]# systemctl stop tripleo_logrotate_crond.service
[root@compute-1 ~]# echo $?
0
[root@compute-1 ~]# systemctl status tripleo_logrotate_crond.service
● tripleo_logrotate_crond.service - logrotate_crond container
   Loaded: loaded (/etc/systemd/system/tripleo_logrotate_crond.service; enabled; vendor preset: disabled)
   Active: failed (Result: exit-code) since Thu 2020-12-10 03:27:06 UTC; 24s ago
  Process: 253453 ExecStopPost=/bin/bash /root/test.sh (code=exited, status=125)
  Process: 253451 ExecStop=/bin/bash /root/test.sh (code=exited, status=125)
 Main PID: 253292

Dec 10 03:24:49 compute-1 podman[253247]: 2020-12-10 03:24:49.203373271 +0000 UTC m=+0.227471808 container init e683b161a92beef0b1a66bebb01fdf99989893f904233a1d8701567547bab6f7 (image=undercloud-0.ctlplane.redhat.local:8787/rh-osbs/rhosp>
Dec 10 03:24:49 compute-1 podman[253247]: 2020-12-10 03:24:49.218287972 +0000 UTC m=+0.242386484 container start e683b161a92beef0b1a66bebb01fdf99989893f904233a1d8701567547bab6f7 (image=undercloud-0.ctlplane.redhat.local:8787/rh-osbs/rhos>
Dec 10 03:24:49 compute-1 podman[253247]: logrotate_crond
Dec 10 03:24:49 compute-1 systemd[1]: Started logrotate_crond container.
Dec 10 03:25:36 compute-1 systemd[1]: Stopping logrotate_crond container...
Dec 10 03:25:36 compute-1 systemd[1]: tripleo_logrotate_crond.service: Control process exited, code=exited status=125
Dec 10 03:25:36 compute-1 systemd[1]: tripleo_logrotate_crond.service: Control process exited, code=exited status=125
Dec 10 03:27:06 compute-1 systemd[1]: tripleo_logrotate_crond.service: State 'stop-post' timed out. Terminating.
Dec 10 03:27:06 compute-1 systemd[1]: tripleo_logrotate_crond.service: Failed with result 'exit-code'.
Dec 10 03:27:06 compute-1 systemd[1]: Stopped logrotate_crond container.
[root@compute-1 ~]# sudo podman ps | grep logrotate
e683b161a92b  undercloud-0.ctlplane.redhat.local:8787/rh-osbs/rhosp16-openstack-cron:16.1_20201020.1                        kolla_start  2 weeks ago  Up 2 minutes ago         logrotate_crond
~~~

Comment 6 Takashi Kajinami 2020-12-10 08:25:55 UTC
So this is the scenario which results in the problem.

 - For some reason a container doesn't stop within the specified timeout (10 seconds)
 - podman stop fails with return code 125, but it doesn't forcefully stop the container
 - systemd doesn't detect the failure to stop the container, and it doesn't stop the container that is left running
 - the subsequent steps perform podman stop/rm/rm -f, but they don't check whether these commands succeed


I confirmed that changing the kill mode from none to control-group makes systemd forcefully stop the remaining container
(maybe by SIGTERM or SIGKILL?), but I'm not sure whether this is the right thing to do.

Comment 7 Cédric Jeanneret 2020-12-10 10:15:29 UTC
Hello Takashi,

it would be interesting to understand why those two containers aren't stopping as they should. My best guess here is: there are a lot of VMs running on the node, and nova_libvirt is just unable to stop within the timeout - this is also the same reason nova_virtlogd isn't stopped and ends up in the same weird black hole.

I think we should first check with Compute for this case, then we should be able to find a better way within paunch. Maybe the way VMs are managed within nova_libvirt is the issue? For instance, Neutron isn't doing the same thing, they start dedicated containers when a new network action is done (i.e. new dhcp/subnet). Might be a path to explore so that all the VMs don't depend on one unique container?

Fact is, if we kill the container it might lead to data corruption within the VMs... Not sure this is something we actually want to see, do we?

Cheers,

C.

Comment 8 Takashi Kajinami 2020-12-10 10:25:41 UTC
Hi Cédric,


> it would be interesting to understand why those two containers aren't stopping as they should. My best guess here is: there are a lot of VMs running on the node, and nova_libvirt is just unable to stop in the delay - this is also the same reason nova_virtlogd isn't stopped and ends in the same weird black hole.

I checked the libvirt logs and messages around the time when podman couldn't stop these containers, but I didn't see any errors or interesting logs so far.
All I could confirm so far is that podman failed to stop these containers, returning 125 (I guess it was a timeout), and the containers were left running without the processes inside receiving any stop signal.
The customer stopped these containers manually later and didn't see any issues at that time, so the "hang" seems to have been something temporary.

Also, this problem happened not only with these libvirt containers, but also with the neutron-ovs-agent container, so I'm afraid this is not very specific to Compute.

Comment 9 Cédric Jeanneret 2020-12-10 12:19:32 UTC
Hello Takashi,

humpf. "great"..... seems to point to yet another podman thing. IIRC there were other issues with the timeout, and we already had to raise it a bit. Maybe that's "just" it - but I really doubt this is a way we want to follow...

I'll call for help within DF then (thank you needinfo()) - on my side I don't really see what we can do without any risk :/. But seeing the whole thing, there might be an issue with the way systemd catches that 125 code, preventing paunch from actually getting it and retrying... Probably a thing to dig into.

@Alex, @Kevin, any thoughts or ideas?

Cheers,

C.

Comment 10 Alex Schultz 2020-12-10 16:29:16 UTC
We'll have to look further into the error handling.  We likely don't want to just fail the entire paunch execution because we want to manage whatever we can in a given step. That being said, this should have been a deploy failure so this will likely need more code/testing. I thought we had a different patch to bump timeouts or make them configurable to address this kind of problem however.

Comment 11 Alex Schultz 2020-12-11 14:50:55 UTC
So we specifically set KillMode to none due to race conditions. See https://review.opendev.org/c/openstack/paunch/+/645550  I'm not certain whether switching to control-group has the same problem as KillMode=process.

Comment 12 Alex Schultz 2020-12-11 15:49:07 UTC
So I don't think we address the podman timeout failure itself. The timeout value is configurable at a service level, but I don't think it's tuneable by the end user at this time. I checked, and KillMode=control-group has the same race condition as process, so we cannot switch to it. I think the solution here would be to catch this failure better and not remove the systemd unit if the service is still running. Additionally, we should ensure that we do fail when this condition occurs.
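
For illustration only, a rough sketch (not the merged fix) of such a guard: refuse to delete the unit file and raise an error while the container is still present, so the condition surfaces as a failure.
~~~
#!/usr/bin/env python3
# Sketch of the guard described above (not the actual patch): only delete the
# systemd unit file once the container is really gone, otherwise fail.
import os
import subprocess


def container_exists(name):
    # "podman container exists" exits 0 when the container is present.
    return subprocess.run(['podman', 'container', 'exists', name]).returncode == 0


def remove_unit(container, unit_dir='/etc/systemd/system'):
    unit = os.path.join(unit_dir, 'tripleo_%s.service' % container)
    if container_exists(container):
        # The container could not be stopped/removed: keep the unit file and
        # surface an error so the deployment step fails.
        raise RuntimeError('%s still exists, refusing to remove %s' % (container, unit))
    if os.path.exists(unit):
        os.remove(unit)
        subprocess.run(['systemctl', 'daemon-reload'], check=True)


if __name__ == '__main__':
    remove_unit('nova_libvirt')
~~~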

Comment 14 Alex Schultz 2020-12-11 18:09:21 UTC
They have the wrong version of podman installed so the containers are not able to be managed/renamed/etc.  They should have 1.6.4 but they have 1.9.3


podman-1.9.3-2.module+el8.2.1+6867+366c07d6.x86_64          Sun Aug  2 15:15:23 2020
podman-docker-1.9.3-2.module+el8.2.1+6867+366c07d6.noarch   Sun Aug  2 15:15:26 2020

Comment 15 Alex Schultz 2020-12-11 18:13:36 UTC
The rename is occurring because the container cannot be removed. But the container has an improper state so it never works.

2020-12-01 18:11:31.413 410894 ERROR paunch [  ] Error: cannot remove container f5a01ec7b2ee86c53f2d89d76a926b06ee9b93643e63ee8a778e2357c95c60d9 as it is running - running or paused containers cannot be removed without force: container state improper

Comment 16 Takashi Kajinami 2020-12-14 00:55:56 UTC
Thanks Alex, for pointing out the wrong podman version. Sorry I missed that in my analysis...
I'll ask the customer to downgrade the podman package to the correct version.

> The rename is occurring because the container cannot be removed. But the container has an improper state so it never works.

IIUC paunch is not renaming the container but removing it to replace it with the newly created one.
What I've observed so far is that the container is still running and doesn't accept a stop operation via systemd or directly,
and is still running when paunch tries to remove it with "podman rm" or "podman rm -f".

Comment 17 Alex Schultz 2020-12-14 14:30:28 UTC
Yeah, I looked at the paunch code and we actually remove the systemd unit prior to stopping it, because they'll conflict when trying to operate on the container. The bit of code from the log is actually from the rename functions and not part of the actual service management. I've got some additional patches to improve error handling when these conditions occur, but it's likely that manual intervention will be needed if this condition occurs. If systemctl doesn't return an error, there's not much we can do to detect this state.

Comment 25 David Rosenfeld 2021-04-22 16:40:47 UTC
set qe_test_coverage to - because the root cause of the problem was using an incorrect podman version. That should not be automated. A commit was made to handle a corner case error condition. A way to force creation of that error manually or through automation hasn't been found.

Comment 26 Jad Haj Yahya 2021-05-03 13:12:20 UTC
Paused nova_libvirt container, redeployed OC and verified tripleo_nova_libvirt.service exists
Removed nova_libvirt container, redeployed OC and verified tripleo_nova_libvirt.service exists

Comment 32 errata-xmlrpc 2021-05-26 13:49:37 UTC
Since the problem described in this bug report should be
resolved in a recent advisory, it has been closed with a
resolution of ERRATA.

For information on the advisory (Red Hat OpenStack Platform 16.1.6 bug fix and enhancement advisory), and where to find the updated
files, follow the link below.

If the solution does not work for you, open a new bug report.

https://access.redhat.com/errata/RHBA-2021:2097