Bug 1850303 - podman containers are not properly cleaned and restarted when their conmon process is killed
Keywords:
Status: CLOSED ERRATA
Alias: None
Product: Red Hat OpenStack
Classification: Red Hat
Component: python-paunch
Version: 16.0 (Train)
Hardware: Unspecified
OS: Unspecified
Priority: medium
Severity: medium
Target Milestone: ---
Target Release: ---
Assignee: Takashi Kajinami
QA Contact: nlevinki
URL:
Whiteboard:
Depends On:
Blocks:
Reported: 2020-06-24 01:05 UTC by Takashi Kajinami
Modified: 2023-12-15 18:17 UTC (History)

Fixed In Version: python-paunch-5.3.3-1.20200810143359.6f44509.el8ost
Doc Type: If docs needed, set a value
Doc Text:
Clone Of:
Environment:
Last Closed: 2020-10-28 15:38:11 UTC
Target Upstream Version:
Embargoed:



Links
System ID Private Priority Status Summary Last Updated
Launchpad 1884866 0 None None None 2020-06-24 02:25:43 UTC
OpenStack gerrit 738234 0 None MERGED Make sure failed containers get stopped by systemd 2021-02-18 20:39:52 UTC
Red Hat Issue Tracker OSP-30831 0 None None None 2023-12-15 18:17:59 UTC
Red Hat Product Errata RHEA-2020:4284 0 None None None 2020-10-28 15:38:31 UTC

Description Takashi Kajinami 2020-06-24 01:05:16 UTC
Description of problem:

When the conmon process that represents a podman container is killed for any reason,
the processes inside the container keep running.
systemd then detects the failure of the conmon process and tries to start the failed podman container,
but the start fails because the stale processes are still running on the host.

~~~
[heat-admin@controller-0 ~]$ sudo pstree -p | grep nova-conductor
           |-conmon(121860)-+-dumb-init(121893)---nova-conductor(121960)-+-nova-conductor(122511)
           |                |                                            |-nova-conductor(122512)
           |                |                                            |-nova-conductor(122513)
           |                |                                            `-nova-conductor(122518)
[heat-admin@controller-0 ~]$ sudo systemctl status tripleo_nova_conductor
● tripleo_nova_conductor.service - nova_conductor container
   Loaded: loaded (/etc/systemd/system/tripleo_nova_conductor.service; enabled; vendor preset: disabled)
   Active: active (running) since Mon 2020-06-22 15:00:04 UTC; 1 day 10h ago
 Main PID: 121860 (conmon)
    Tasks: 0 (limit: 26213)
   Memory: 2.1M
   CGroup: /system.slice/tripleo_nova_conductor.service
           ‣ 121860 /usr/bin/conmon --api-version 1 -s -c 0edc910c83a41e393db9378e01b51df3223371077e169ce8d8c590840cebcf38 -u 0edc910c83a41e393db9378e01b51df3223371077e169ce8d8c590840cebcf38 -r /usr/bin/runc -b /var/lib/containers/storag>

Warning: Journal has been rotated since unit was started. Log output is incomplete or unavailable.
[heat-admin@controller-0 ~]$ sudo pstree -p | grep nova-conductor
           |-conmon(121860)-+-dumb-init(121893)---nova-conductor(121960)-+-nova-conductor(122511)
           |                |                                            |-nova-conductor(122512)
           |                |                                            |-nova-conductor(122513)
           |                |                                            `-nova-conductor(122518)
[heat-admin@controller-0 ~]$ sudo podman ps | grep nova_conductor
0edc910c83a4  undercloud-0.ctlplane.redhat.local:8787/rh-osbs/rhosp16-openstack-nova-conductor:20200416.1      kolla_start           34 hours ago  Up 34 hours ago         nova_conductor
[heat-admin@controller-0 ~]$ sudo kill -KILL 121860
[heat-admin@controller-0 ~]$ sudo pstree -p | grep nova-conductor
           |-dumb-init(121893)---nova-conductor(121960)-+-nova-conductor(122511)
           |                                            |-nova-conductor(122512)
           |                                            |-nova-conductor(122513)
           |                                            `-nova-conductor(122518)
[heat-admin@controller-0 ~]$ sudo systemctl status tripleo_nova_conductor
● tripleo_nova_conductor.service - nova_conductor container
   Loaded: loaded (/etc/systemd/system/tripleo_nova_conductor.service; enabled; vendor preset: disabled)
   Active: failed (Result: protocol) since Wed 2020-06-24 01:01:26 UTC; 8s ago
  Process: 4030 ExecStart=/usr/bin/podman start nova_conductor (code=exited, status=0/SUCCESS)
 Main PID: 121860 (code=killed, signal=KILL)

Jun 24 01:01:26 controller-0 systemd[1]: tripleo_nova_conductor.service: Service RestartSec=100ms expired, scheduling restart.
Jun 24 01:01:26 controller-0 systemd[1]: tripleo_nova_conductor.service: Scheduled restart job, restart counter is at 6.
Jun 24 01:01:26 controller-0 systemd[1]: Stopped nova_conductor container.
Jun 24 01:01:26 controller-0 systemd[1]: tripleo_nova_conductor.service: Start request repeated too quickly.
Jun 24 01:01:26 controller-0 systemd[1]: tripleo_nova_conductor.service: Failed with result 'protocol'.
Jun 24 01:01:26 controller-0 systemd[1]: Failed to start nova_conductor container.
[heat-admin@controller-0 ~]$ sudo podman ps | grep nova_conductor
0edc910c83a4  undercloud-0.ctlplane.redhat.local:8787/rh-osbs/rhosp16-openstack-nova-conductor:20200416.1      kolla_start           34 hours ago  Up 34 hours ago         nova_conductor
~~~

To recover from this situation, we need to stop the failed container manually and then start it from systemd.

~~~
[heat-admin@controller-0 ~]$ sudo podman stop nova_conductor
Error: timed out waiting for file /var/run/libpod/exits/0edc910c83a41e393db9378e01b51df3223371077e169ce8d8c590840cebcf38: internal libpod error
[heat-admin@controller-0 ~]$ sudo podman ps | grep nova_conductor
[heat-admin@controller-0 ~]$ sudo systemctl start tripleo_nova_conductor
[heat-admin@controller-0 ~]$ sudo systemctl status tripleo_nova_conductor
● tripleo_nova_conductor.service - nova_conductor container
   Loaded: loaded (/etc/systemd/system/tripleo_nova_conductor.service; enabled; vendor preset: disabled)
   Active: active (running) since Wed 2020-06-24 01:04:27 UTC; 4s ago
  Process: 21146 ExecStart=/usr/bin/podman start nova_conductor (code=exited, status=0/SUCCESS)
 Main PID: 21169 (conmon)
    Tasks: 0 (limit: 26213)
   Memory: 2.1M
   CGroup: /system.slice/tripleo_nova_conductor.service
           ‣ 21169 /usr/bin/conmon --api-version 1 -s -c 0edc910c83a41e393db9378e01b51df3223371077e169ce8d8c590840cebcf38 -u 0edc910c83a41e393db9378e01b51df3223371077e169ce8d8c590840cebcf38 -r /usr/bin/runc -b /var/lib/containers/storage>

Jun 24 01:04:26 controller-0 systemd[1]: Starting nova_conductor container...
Jun 24 01:04:27 controller-0 podman[21146]: 2020-06-24 01:04:27.220821736 +0000 UTC m=+0.429382454 container init 0edc910c83a41e393db9378e01b51df3223371077e169ce8d8c590840cebcf38 (image=undercloud-0.ctlplane.redhat.local:8787/rh-osbs/rho>
Jun 24 01:04:27 controller-0 podman[21146]: 2020-06-24 01:04:27.24996283 +0000 UTC m=+0.458523513 container start 0edc910c83a41e393db9378e01b51df3223371077e169ce8d8c590840cebcf38 (image=undercloud-0.ctlplane.redhat.local:8787/rh-osbs/rho>
Jun 24 01:04:27 controller-0 podman[21146]: nova_conductor
Jun 24 01:04:27 controller-0 systemd[1]: Started nova_conductor container.
~~~


Version-Release number of selected component (if applicable):


How reproducible:
Always

Steps to Reproduce:
1. Send SIGKILL to the conmon process of a podman container
2. Check the status of tripleo_<service name>
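
The two steps above could be scripted roughly as follows. This is only a sketch: the function names `unit_main_pid` and `kill_conmon_and_check` are hypothetical, and it assumes that systemd tracks conmon as the unit's main PID (as the `systemctl status` output in the description shows) and that `systemctl show` supports `--value`. Run it as root, and only on a test node.

```shell
#!/bin/bash
# Hypothetical helpers, not part of paunch or tripleo.

# Print the main PID that systemd tracks for a unit ("0" if there is none).
unit_main_pid() {
    systemctl show --property MainPID --value "$1"
}

# Step 1: send SIGKILL to the unit's main process (assumed to be conmon).
# Step 2: wait a little, then print the unit state reported by systemd.
kill_conmon_and_check() {
    local unit="$1" wait_secs="${2:-15}" pid
    pid="$(unit_main_pid "$unit")"
    if [ -z "$pid" ] || [ "$pid" = "0" ]; then
        echo "no main PID found for $unit" >&2
        return 1
    fi
    kill -KILL "$pid"
    # Give systemd time to exhaust or complete its restart attempts.
    sleep "$wait_secs"
    systemctl is-active "$unit"
}
```

For example, `kill_conmon_and_check tripleo_nova_conductor` prints the state that `systemctl is-active` reports after the wait. On an affected system the expectation, based on the transcript above, is a failed unit.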

Actual results:
The service enters the failed state and the podman container is not restarted

Expected results:
The service returns to the active state and the podman container is restarted

Additional info:

Comment 1 Takashi Kajinami 2020-06-24 01:06:26 UTC
After fixing the problem by stopping the podman container and restarting it via systemd,
the processes are restarted under a new conmon process, as expected.

~~~
[heat-admin@controller-0 ~]$ sudo podman ps | grep nova_conductor
0edc910c83a4  undercloud-0.ctlplane.redhat.local:8787/rh-osbs/rhosp16-openstack-nova-conductor:20200416.1      kolla_start           34 hours ago  Up About a minute ago         nova_conductor
[heat-admin@controller-0 ~]$ sudo pstree -p | grep nova-conductor
           |-conmon(21169)-+-dumb-init(21181)---nova-conductor(21206)-+-nova-conductor(21491)
           |               |                                          |-nova-conductor(21492)
           |               |                                          |-nova-conductor(21493)
           |               |                                          `-nova-conductor(21494)
~~~

Comment 2 Takashi Kajinami 2020-06-24 02:09:38 UTC
A fix was recently merged into podman that adds ExecStopPost to the systemd unit files it generates,
so that container processes are actually stopped even if the conmon process fails.
I think we need to implement the same in tripleo-ansible, so that the generated systemd unit file has ExecStopPost.
 https://github.com/containers/libpod/commit/e5c3432944245a740ed443803c654dcc9c3757f0
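
As an illustration of the idea, the sketch below checks a unit file for `ExecStopPost=` and adds the directive through a systemd drop-in. The drop-in approach, the function names, and the file name `10-stop-post.conf` are assumptions made for this example; the change proposed in this comment is to have the unit file generator emit the directive itself. The `podman stop -t 10` command line mirrors the existing `ExecStop=` line of the tripleo_nova_conductor unit.

```shell
#!/bin/bash
# Hypothetical helpers for illustration only.

# Return 0 if the given unit (or drop-in) file declares ExecStopPost=.
unit_file_has_stop_post() {
    grep -q '^ExecStopPost=' "$1"
}

# Write a drop-in that adds ExecStopPost to <unit>.service for <container>.
# The third argument overrides the base directory (default: /etc/systemd/system).
write_stop_post_dropin() {
    local unit="$1" container="$2" base_dir="${3:-/etc/systemd/system}"
    local dropin_dir="${base_dir}/${unit}.service.d"
    mkdir -p "$dropin_dir"
    cat > "${dropin_dir}/10-stop-post.conf" <<EOF
[Service]
ExecStopPost=/usr/bin/podman stop -t 10 ${container}
EOF
}
```

After `write_stop_post_dropin tripleo_nova_conductor nova_conductor`, a `systemctl daemon-reload` is needed before systemd picks up the new directive.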

Comment 3 Takashi Kajinami 2020-06-24 02:16:37 UTC
I tested a systemd unit file with ExecStopPost added:

~~~
[heat-admin@controller-0 ~]$ sudo cat /etc/systemd/system/tripleo_nova_conductor.service
[Unit]
Description=nova_conductor container
After=paunch-container-shutdown.service
Wants=
[Service]
Restart=always
ExecStart=/usr/bin/podman start nova_conductor
ExecReload=/usr/bin/podman kill --signal HUP nova_conductor
ExecStop=/usr/bin/podman stop -t 10 nova_conductor
ExecStopPost=/usr/bin/podman stop -t 10 nova_conductor
KillMode=none
Type=forking
PIDFile=/var/run/nova_conductor.pid

[Install]
WantedBy=multi-user.target
[heat-admin@controller-0 ~]$ sudo systemctl daemon-reload
~~~

and confirmed that it didn't affect normal reload/stop/start operations:
~~~
[heat-admin@controller-0 ~]$ sudo systemctl status tripleo_nova_conductor
● tripleo_nova_conductor.service - nova_conductor container
   Loaded: loaded (/etc/systemd/system/tripleo_nova_conductor.service; enabled; vendor preset: disabled)
   Active: active (running) since Wed 2020-06-24 01:04:27 UTC; 1h 6min ago
 Main PID: 21169 (conmon)
    Tasks: 0 (limit: 26213)
   Memory: 2.4M
   CGroup: /system.slice/tripleo_nova_conductor.service
           ‣ 21169 /usr/bin/conmon --api-version 1 -s -c 0edc910c83a41e393db9378e01b51df3223371077e169ce8d8c590840cebcf38 -u 0edc910c83a41e393db9378e01b51df3223371077e169ce8d8c590840cebcf38 -r /usr/bin/runc -b /var/lib/containers/storage>

Jun 24 01:04:26 controller-0 systemd[1]: Starting nova_conductor container...
Jun 24 01:04:27 controller-0 podman[21146]: 2020-06-24 01:04:27.220821736 +0000 UTC m=+0.429382454 container init 0edc910c83a41e393db9378e01b51df3223371077e169ce8d8c590840cebcf38 (image=undercloud-0.ctlplane.redhat.local:8787/rh-osbs/rho>
Jun 24 01:04:27 controller-0 podman[21146]: 2020-06-24 01:04:27.24996283 +0000 UTC m=+0.458523513 container start 0edc910c83a41e393db9378e01b51df3223371077e169ce8d8c590840cebcf38 (image=undercloud-0.ctlplane.redhat.local:8787/rh-osbs/rho>
Jun 24 01:04:27 controller-0 podman[21146]: nova_conductor
Jun 24 01:04:27 controller-0 systemd[1]: Started nova_conductor container.
Jun 24 02:10:54 controller-0 systemd[1]: Reloading nova_conductor container.
Jun 24 02:10:54 controller-0 podman[391867]: 2020-06-24 02:10:54.445391064 +0000 UTC m=+0.083626342 container kill 0edc910c83a41e393db9378e01b51df3223371077e169ce8d8c590840cebcf38 (image=undercloud-0.ctlplane.redhat.local:8787/rh-osbs/rh>
Jun 24 02:10:54 controller-0 podman[391867]: 0edc910c83a41e393db9378e01b51df3223371077e169ce8d8c590840cebcf38
Jun 24 02:10:54 controller-0 systemd[1]: Reloaded nova_conductor container.
[heat-admin@controller-0 ~]$ sudo systemctl stop tripleo_nova_conductor
[heat-admin@controller-0 ~]$ sudo systemctl status tripleo_nova_conductor
● tripleo_nova_conductor.service - nova_conductor container
   Loaded: loaded (/etc/systemd/system/tripleo_nova_conductor.service; enabled; vendor preset: disabled)
   Active: inactive (dead) since Wed 2020-06-24 02:11:22 UTC; 2s ago
  Process: 394597 ExecStopPost=/usr/bin/podman stop -t 10 nova_conductor (code=exited, status=0/SUCCESS)
  Process: 393786 ExecStop=/usr/bin/podman stop -t 10 nova_conductor (code=exited, status=0/SUCCESS)
 Main PID: 21169 (code=exited, status=0/SUCCESS)

Jun 24 02:10:54 controller-0 systemd[1]: Reloading nova_conductor container.
Jun 24 02:10:54 controller-0 podman[391867]: 2020-06-24 02:10:54.445391064 +0000 UTC m=+0.083626342 container kill 0edc910c83a41e393db9378e01b51df3223371077e169ce8d8c590840cebcf38 (image=undercloud-0.ctlplane.redhat.local:8787/rh-osbs/rh>
Jun 24 02:10:54 controller-0 podman[391867]: 0edc910c83a41e393db9378e01b51df3223371077e169ce8d8c590840cebcf38
Jun 24 02:10:54 controller-0 systemd[1]: Reloaded nova_conductor container.
Jun 24 02:11:18 controller-0 systemd[1]: Stopping nova_conductor container...
Jun 24 02:11:22 controller-0 podman[393786]: 2020-06-24 02:11:22.059051776 +0000 UTC m=+3.421731728 container died 0edc910c83a41e393db9378e01b51df3223371077e169ce8d8c590840cebcf38 (image=undercloud-0.ctlplane.redhat.local:8787/rh-osbs/rh>
Jun 24 02:11:22 controller-0 podman[393786]: 2020-06-24 02:11:22.060230275 +0000 UTC m=+3.422910246 container stop 0edc910c83a41e393db9378e01b51df3223371077e169ce8d8c590840cebcf38 (image=undercloud-0.ctlplane.redhat.local:8787/rh-osbs/rh>
Jun 24 02:11:22 controller-0 podman[393786]: 0edc910c83a41e393db9378e01b51df3223371077e169ce8d8c590840cebcf38
Jun 24 02:11:22 controller-0 podman[394597]: 0edc910c83a41e393db9378e01b51df3223371077e169ce8d8c590840cebcf38
Jun 24 02:11:22 controller-0 systemd[1]: Stopped nova_conductor container.
[heat-admin@controller-0 ~]$ sudo systemctl start tripleo_nova_conductor
[heat-admin@controller-0 ~]$ sudo systemctl status tripleo_nova_conductor
● tripleo_nova_conductor.service - nova_conductor container
   Loaded: loaded (/etc/systemd/system/tripleo_nova_conductor.service; enabled; vendor preset: disabled)
   Active: active (running) since Wed 2020-06-24 02:11:43 UTC; 47s ago
  Process: 394597 ExecStopPost=/usr/bin/podman stop -t 10 nova_conductor (code=exited, status=0/SUCCESS)
  Process: 393786 ExecStop=/usr/bin/podman stop -t 10 nova_conductor (code=exited, status=0/SUCCESS)
  Process: 396438 ExecStart=/usr/bin/podman start nova_conductor (code=exited, status=0/SUCCESS)
 Main PID: 396524 (conmon)
    Tasks: 0 (limit: 26213)
   Memory: 1.8M
   CGroup: /system.slice/tripleo_nova_conductor.service
           ‣ 396524 /usr/bin/conmon --api-version 1 -s -c 0edc910c83a41e393db9378e01b51df3223371077e169ce8d8c590840cebcf38 -u 0edc910c83a41e393db9378e01b51df3223371077e169ce8d8c590840cebcf38 -r /usr/bin/runc -b /var/lib/containers/storag>

Jun 24 02:11:43 controller-0 systemd[1]: Starting nova_conductor container...
Jun 24 02:11:43 controller-0 podman[396438]: 2020-06-24 02:11:43.625492561 +0000 UTC m=+0.475655869 container init 0edc910c83a41e393db9378e01b51df3223371077e169ce8d8c590840cebcf38 (image=undercloud-0.ctlplane.redhat.local:8787/rh-osbs/rh>
Jun 24 02:11:43 controller-0 podman[396438]: 2020-06-24 02:11:43.641841944 +0000 UTC m=+0.492005326 container start 0edc910c83a41e393db9378e01b51df3223371077e169ce8d8c590840cebcf38 (image=undercloud-0.ctlplane.redhat.local:8787/rh-osbs/r>
Jun 24 02:11:43 controller-0 podman[396438]: nova_conductor
Jun 24 02:11:43 controller-0 systemd[1]: Started nova_conductor container.
~~~

and now systemd can restart a container whose conmon process was killed:
~~~
[heat-admin@controller-0 ~]$ sudo podman ps | grep nova_conductor
0edc910c83a4  undercloud-0.ctlplane.redhat.local:8787/rh-osbs/rhosp16-openstack-nova-conductor:20200416.1      kolla_start           35 hours ago  Up About a minute ago         nova_conductor
[heat-admin@controller-0 ~]$ sudo pstree -p | grep nova-conductor
           |-conmon(396524)-+-dumb-init(396541)---nova-conductor(396589)-+-nova-conductor(397076)
           |                |                                            |-nova-conductor(397077)
           |                |                                            |-nova-conductor(397078)
           |                |                                            `-nova-conductor(397079)
[heat-admin@controller-0 ~]$ sudo kill -KILL 396524
[heat-admin@controller-0 ~]$ sudo pstree -p | grep nova-conductor
           |-dumb-init(396541)---nova-conductor(396589)---nova-conductor(397076)
[heat-admin@controller-0 ~]$ sudo pstree -p | grep nova-conductor
[heat-admin@controller-0 ~]$ sudo pstree -p | grep nova-conductor
           |-conmon(406733)-+-dumb-init(406746)---nova-conductor(406765)
[heat-admin@controller-0 ~]$ sudo systemctl status tripleo_nova_conductor
● tripleo_nova_conductor.service - nova_conductor container
   Loaded: loaded (/etc/systemd/system/tripleo_nova_conductor.service; enabled; vendor preset: disabled)
   Active: active (running) since Wed 2020-06-24 02:13:29 UTC; 11s ago
  Process: 405610 ExecStopPost=/usr/bin/podman stop -t 10 nova_conductor (code=exited, status=125)
  Process: 393786 ExecStop=/usr/bin/podman stop -t 10 nova_conductor (code=exited, status=0/SUCCESS)
  Process: 406710 ExecStart=/usr/bin/podman start nova_conductor (code=exited, status=0/SUCCESS)
 Main PID: 406733 (conmon)
    Tasks: 0 (limit: 26213)
   Memory: 1.9M
   CGroup: /system.slice/tripleo_nova_conductor.service
           ‣ 406733 /usr/bin/conmon --api-version 1 -s -c 0edc910c83a41e393db9378e01b51df3223371077e169ce8d8c590840cebcf38 -u 0edc910c83a41e393db9378e01b51df3223371077e169ce8d8c590840cebcf38 -r /usr/bin/runc -b /var/lib/containers/storag>

Jun 24 02:13:28 controller-0 systemd[1]: Starting nova_conductor container...
Jun 24 02:13:29 controller-0 podman[406710]: 2020-06-24 02:13:29.338664709 +0000 UTC m=+0.428971832 container init 0edc910c83a41e393db9378e01b51df3223371077e169ce8d8c590840cebcf38 (image=undercloud-0.ctlplane.redhat.local:8787/rh-osbs/rh>
Jun 24 02:13:29 controller-0 podman[406710]: 2020-06-24 02:13:29.355447918 +0000 UTC m=+0.445755041 container start 0edc910c83a41e393db9378e01b51df3223371077e169ce8d8c590840cebcf38 (image=undercloud-0.ctlplane.redhat.local:8787/rh-osbs/r>
Jun 24 02:13:29 controller-0 podman[406710]: nova_conductor
Jun 24 02:13:29 controller-0 systemd[1]: Started nova_conductor container.
[heat-admin@controller-0 ~]$ sudo podman ps | grep nova_conductor
0edc910c83a4  undercloud-0.ctlplane.redhat.local:8787/rh-osbs/rhosp16-openstack-nova-conductor:20200416.1      kolla_start           35 hours ago  Up 2 minutes ago         nova_conductor
[heat-admin@controller-0 ~]$ sudo pstree -p | grep nova-conductor
           |-conmon(406733)-+-dumb-init(406746)---nova-conductor(406765)-+-nova-conductor(407012)
           |                |                                            |-nova-conductor(407013)
           |                |                                            |-nova-conductor(407014)
           |                |                                            `-nova-conductor(407015)
[heat-admin@controller-0 ~]$ 
~~~

Comment 4 Cédric Jeanneret 2020-06-24 06:04:41 UTC
Moving to the "paunch" component. I'm not sure whether tripleo-ansible will need any patch; I think we moved to podman-managed systemd units in newer versions. Paunch is used at least in 16.0 and 16.1.

Comment 5 Takashi Kajinami 2020-06-24 09:39:43 UTC
Thank you for the information, Cédric.
I'll submit a patch to paunch as well.

Comment 13 errata-xmlrpc 2020-10-28 15:38:11 UTC
Since the problem described in this bug report should be
resolved in a recent advisory, it has been closed with a
resolution of ERRATA.

For information on the advisory (Red Hat OpenStack Platform 16.1 bug fix and enhancement advisory), and where to find the updated
files, follow the link below.

If the solution does not work for you, open a new bug report.

https://access.redhat.com/errata/RHEA-2020:4284

