Description of problem:

Observed on an FFU job, but the problem can happen whenever the "podman cp" command is used.

When internal or public certificates are re-generated, we notify the impacted services by triggering a "config reload" when possible, to avoid restarting them entirely and incurring a temporary service disruption. Since the certificates are re-generated on the host, they must be injected into the running containers before the config reload takes place. We inject the file into the running container with "podman cp".

The "podman cp" command operates by temporarily mounting the container's overlay-fs, freezing (pausing) its execution, copying the file, then thawing execution and unmounting:

Feb 23 18:19:06 controller-0 platform-python[325270]: ansible-command Invoked with _raw_params=set -e#012podman cp /etc/pki/tls/private/overcloud_endpoint.pem 75ee35ce1b09:/etc/pki/tls/private/overcloud_endpoint.pem#012podman exec --user root 75ee35ce1b09 chgrp haproxy /etc/pki/tls/private/overcloud_endpoint.pem#012podman kill --signal=HUP 75ee35ce1b09#012 _uses_shell=True warn=True stdin_add_newline=True strip_empty_ends=True argv=None chdir=None executable=None creates=None removes=None stdin=None
Feb 23 18:19:06 controller-0 podman[325297]: 2021-02-23 18:19:06.132908296 +0000 UTC m=+0.086081887 container mount 75ee35ce1b0912645d42d46967a372cf0a3edf49a5d9d5d121ce60fe3acc6144 (image=undercloud-0.ctlplane.redhat.local:8787/rh-osbs/rhosp16-openstack-haproxy:16.1_20210205.1, name=haproxy-bundle-podman-0)
Feb 23 18:19:06 controller-0 podman[325297]: 2021-02-23 18:19:06.156749176 +0000 UTC m=+0.109922731 container pause 75ee35ce1b0912645d42d46967a372cf0a3edf49a5d9d5d121ce60fe3acc6144 (image=undercloud-0.ctlplane.redhat.local:8787/rh-osbs/rhosp16-openstack-haproxy:16.1_20210205.1, name=haproxy-bundle-podman-0)
Feb 23 18:19:06 controller-0 podman[325297]: 2021-02-23 18:19:06.583847725 +0000 UTC m=+0.537021270 container unpause 75ee35ce1b0912645d42d46967a372cf0a3edf49a5d9d5d121ce60fe3acc6144 (image=undercloud-0.ctlplane.redhat.local:8787/rh-osbs/rhosp16-openstack-haproxy:16.1_20210205.1, name=haproxy-bundle-podman-0)
Feb 23 18:19:06 controller-0 podman[325297]: 2021-02-23 18:19:06.585128533 +0000 UTC m=+0.538302139 container unmount 75ee35ce1b0912645d42d46967a372cf0a3edf49a5d9d5d121ce60fe3acc6144 (image=undercloud-0.ctlplane.redhat.local:8787/rh-osbs/rhosp16-openstack-haproxy:16.1_20210205.1, name=haproxy-bundle-podman-0)

During the short period of time where the freeze is happening, no "podman exec" or "podman stop" command can run: any attempt to do so returns an error.

Feb 24 02:05:58 controller-0 podman(haproxy-bundle-podman-0)[352861]: ERROR: Error: can only stop created or running containers. 75ee35ce1b0912645d42d46967a372cf0a3edf49a5d9d5d121ce60fe3acc6144 is in state paused: container state improper
Feb 24 02:05:58 controller-0 podman(haproxy-bundle-podman-0)[352861]: ERROR: Failed to stop container, haproxy-bundle-podman-0, based on image, cluster.common.tag/rhosp16-openstack-haproxy:pcmklatest.
Feb 24 02:05:58 controller-0 pacemaker-execd[66852]: notice: haproxy-bundle-podman-0_stop_0:352861:stderr [ ocf-exit-reason:Failed to stop container, haproxy-bundle-podman-0, based on image, cluster.common.tag/rhosp16-openstack-haproxy:pcmklatest. ]
Feb 24 02:05:58 controller-0 pacemaker-controld[66855]: notice: Result of stop operation for haproxy-bundle-podman-0 on controller-0: 1 (error)
Feb 24 02:05:58 controller-0 pacemaker-controld[66855]: notice: controller-0-haproxy-bundle-podman-0_stop_0:206 [ ocf-exit-reason:Failed to stop container, haproxy-bundle-podman-0, based on image, cluster.common.tag/rhosp16-openstack-haproxy:pcmklatest.\n ]
Feb 24 02:05:58 controller-0 pacemaker-controld[66855]: notice: Transition 232 aborted by operation haproxy-bundle-podman-0_stop_0 'modify' on controller-0: Event failed
Feb 24 02:05:58 controller-0 pacemaker-controld[66855]: notice: Transition 232 action 12 (haproxy-bundle-podman-0_stop_0 on controller-0): expected 'ok' but got 'error'
Feb 24 02:05:58 controller-0 pacemaker-controld[66855]: notice: Transition 232 (Complete=4, Pending=0, Fired=0, Skipped=0, Incomplete=3, Source=/var/lib/pacemaker/pengine/pe-input-199.bz2): Complete
Feb 24 02:05:58 controller-0 pacemaker-attrd[66853]: notice: Setting fail-count-haproxy-bundle-podman-0#stop_0[controller-0]: (unset) -> INFINITY
Feb 24 02:05:58 controller-0 pacemaker-attrd[66853]: notice: Setting last-failure-haproxy-bundle-podman-0#stop_0[controller-0]: (unset) -> 1614132358

Note that the error returned is generic, so there is no easy API to check whether the error was due to a paused container or not.

Unfortunately, using "podman cp" opens a time window during which some critical container operations can fail and destabilize the stack. For instance, pacemaker may fail to monitor the state of a container and decide to stop it. Worse, if the "podman stop" cannot run, pacemaker will consider it a stop failure and will end up fencing the node hosting the running container.

The certificate injection with "podman cp" is found at least in tripleo-heat-templates [1] and in puppet-tripleo [2].

[1] https://opendev.org/openstack/tripleo-heat-templates/src/branch/master/deployment/haproxy/haproxy-public-tls-inject.yaml
[2] e.g. https://opendev.org/openstack/puppet-tripleo/src/branch/master/files/certmonger-haproxy-refresh.sh, but other service files have it as well

Version-Release number of selected component (if applicable):

How reproducible:
Random (low chances)

Steps to Reproduce:
1. Deploy a TLS-e overcloud
2. Stop haproxy:
   pcs resource disable haproxy-bundle
3. Spam the certificate-inject script for haproxy (see the sketch below):
   /usr/bin/certmonger-haproxy-refresh.sh reload internal_api

Actual results:
Sometimes the stop fails because the container is in the paused state at the time the stop action occurs, and fencing is triggered.

Expected results:
No stop operation should fail.

Additional info:
This most directly impacts pacemaker, but we could argue that temporarily breaking "podman exec" or "podman stop" will likely impact other services as well.
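For illustration only (not part of the original report), the pause window can be observed by hand on a controller node. This minimal sketch assumes the haproxy container is named haproxy-bundle-podman-0 and reuses the certificate path from the logs above; any "podman stop" or "podman exec" issued while the state reads "paused" fails the same way pacemaker's stop did:

# Terminal 1: exercise the problematic "podman cp" injection in a loop
while true; do
  podman cp /etc/pki/tls/private/overcloud_endpoint.pem \
    haproxy-bundle-podman-0:/etc/pki/tls/private/overcloud_endpoint.pem
done

# Terminal 2: poll the container state and report when the pause window is hit
while true; do
  state=$(podman inspect --format '{{.State.Status}}' haproxy-bundle-podman-0)
  if [ "$state" = "paused" ]; then
    echo "pause window hit: a concurrent podman stop/exec would fail now"
  fi
done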
Instruction for testing:

On a controller node, with the haproxy container running:

1. Check one current certificate for haproxy (e.g. internal API)
# md5sum /etc/pki/tls/certs/haproxy/overcloud-haproxy-internal_api.pem
bf41675624ef5b748335bf4668d6f7c3 /etc/pki/tls/certs/haproxy/overcloud-haproxy-internal_api.pem

2. Generate a new certificate with certmonger, and wait a couple of seconds for the new cert to get injected in haproxy
# getcert resubmit -i haproxy-internal_api-cert
Resubmitting "haproxy-internal_api-cert" to "IPA".

3. Verify that a new certificate has been retrieved by certmonger on the host
# md5sum /etc/pki/tls/certs/haproxy/overcloud-haproxy-internal_api.pem
fa01cdc093402a172c16a2baa363ba99 /etc/pki/tls/certs/haproxy/overcloud-haproxy-internal_api.pem

4. Verify that the cert got injected in the container as expected
# podman exec $(podman ps -q --filter name=haproxy) md5sum /etc/pki/tls/certs/haproxy/overcloud-haproxy-internal_api.pem
fa01cdc093402a172c16a2baa363ba99 /etc/pki/tls/certs/haproxy/overcloud-haproxy-internal_api.pem

5. Verify that haproxy reloaded its config after the certificate injection
# journalctl -t haproxy
Aug 11 10:30:39 controller-0.redhat.local haproxy[536933]: Proxy cinder started.
Aug 11 10:30:39 controller-0.redhat.local haproxy[536933]: Proxy glance_api started.
Aug 11 10:30:39 controller-0.redhat.local haproxy[536933]: Proxy haproxy.stats started.
[...]

6. Double check that the cert injection uses "tar" instead of "podman cp" to fix the bz
# /usr/bin/certmonger-haproxy-refresh.sh reload internal_api
[...]
+ cat /etc/pki/tls/certs/haproxy/overcloud-haproxy-internal_api.crt /etc/ipa/ca.crt /etc/pki/tls/private/haproxy/overcloud-haproxy-internal_api.key
++ podman ps '--format={{.Names}}'
++ grep -w -E 'haproxy(-bundle-.*-[0-9]+)?'
+ haproxy_container_name=haproxy-bundle-podman-0
+ '[' reload == reload ']'
+ echo haproxy-bundle-podman-0
+ grep -q '^haproxy-bundle'
+ tar -c /etc/pki/tls/certs/haproxy/overcloud-haproxy-internal_api.pem
+ podman exec -i haproxy-bundle-podman-0 tar -C / -xv
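For reference, the essence of the fix seen in the trace above is a tar pipe into "podman exec", which streams the file into the running container without mounting its overlay-fs or pausing it. A minimal sketch of that pattern, reusing the certificate path and container name from the example above:

# tar on the host archives the file (stripping the leading "/"), and tar inside
# the container extracts it relative to /, recreating the same absolute path
cert=/etc/pki/tls/certs/haproxy/overcloud-haproxy-internal_api.pem
tar -c "$cert" | podman exec -i haproxy-bundle-podman-0 tar -C / -xv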
Verified:

Verified in version:
[heat-admin@controller-0 ~]$ rpm -qa|grep puppet-tripleo
puppet-tripleo-11.6.2-2.20210603175724.f642d04.el8ost.1.noarch

Steps were performed:

[root@controller-0 ~]# date
Sun Aug 22 20:06:30 UTC 2021

// Check md5sum of the current haproxy internal API certificate:
[root@controller-0 ~]# md5sum /etc/pki/tls/certs/haproxy/overcloud-haproxy-internal_api.pem
3e9187b760be94888e44e9e0f9f6eb4c /etc/pki/tls/certs/haproxy/overcloud-haproxy-internal_api.pem

// Generate a new certificate with certmonger:
[root@controller-0 ~]# getcert resubmit -i haproxy-internal_api-cert
Resubmitting "haproxy-internal_api-cert" to "IPA".

// Wait a few seconds and verify that the previous action caused a haproxy configuration reload:
[root@controller-0 ~]# journalctl -t haproxy
...
Aug 22 20:07:05 controller-0.redhat.local haproxy[536933]: Proxy cinder started.
Aug 22 20:07:05 controller-0.redhat.local haproxy[536933]: Proxy glance_api started.
Aug 22 20:07:05 controller-0.redhat.local haproxy[536933]: Proxy haproxy.stats started.
Aug 22 20:07:05 controller-0.redhat.local haproxy[536933]: Proxy heat_api started.
Aug 22 20:07:05 controller-0.redhat.local haproxy[536933]: Proxy heat_cfn started.
Aug 22 20:07:05 controller-0.redhat.local haproxy[536933]: Proxy horizon started.
Aug 22 20:07:05 controller-0.redhat.local haproxy[536933]: Proxy keystone_admin started.
Aug 22 20:07:05 controller-0.redhat.local haproxy[536933]: Proxy keystone_public started.
Aug 22 20:07:05 controller-0.redhat.local haproxy[536933]: Proxy mysql started.
Aug 22 20:07:05 controller-0.redhat.local haproxy[536933]: Proxy neutron started.
Aug 22 20:07:05 controller-0.redhat.local haproxy[536933]: Proxy nova_metadata started.
Aug 22 20:07:05 controller-0.redhat.local haproxy[536933]: Proxy nova_novncproxy started.
Aug 22 20:07:05 controller-0.redhat.local haproxy[536933]: Proxy nova_osapi started.
Aug 22 20:07:05 controller-0.redhat.local haproxy[536933]: Proxy placement started.
Aug 22 20:07:05 controller-0.redhat.local haproxy[536933]: Proxy redis started.
Aug 22 20:07:05 controller-0.redhat.local haproxy[536933]: Proxy swift_proxy_server started

// Verify that a new certificate has been retrieved by certmonger on the host:
[root@controller-0 ~]# md5sum /etc/pki/tls/certs/haproxy/overcloud-haproxy-internal_api.pem
a7ef9913940ffe3bc69dd4d4bb63ef2a /etc/pki/tls/certs/haproxy/overcloud-haproxy-internal_api.pem

// Verify that the new certificate was injected into the haproxy container as expected:
[root@controller-0 ~]# podman exec $(podman ps -q --filter name=haproxy) md5sum /etc/pki/tls/certs/haproxy/overcloud-haproxy-internal_api.pem
a7ef9913940ffe3bc69dd4d4bb63ef2a /etc/pki/tls/certs/haproxy/overcloud-haproxy-internal_api.pem

// Ensure that the certificate injection was done with "tar" and not the "podman cp" command:
[root@controller-0 ~]# date
Sun Aug 22 20:18:02 UTC 2021
[root@controller-0 ~]# /usr/bin/certmonger-haproxy-refresh.sh reload internal_api
...
++ grep -w -E 'haproxy(-bundle-.*-[0-9]+)?'
+ haproxy_container_name=haproxy-bundle-podman-0
+ '[' reload == reload ']'
+ echo haproxy-bundle-podman-0
+ grep -q '^haproxy-bundle'
+ tar -c /etc/pki/tls/certs/haproxy/overcloud-haproxy-internal_api.pem
+ podman exec -i haproxy-bundle-podman-0 tar -C / -xv
tar: Removing leading `/' from member names
etc/pki/tls/certs/haproxy/overcloud-haproxy-internal_api.pem
...

// Verify (not strictly necessary) that haproxy reloaded its configuration:
[root@controller-0 ~]# journalctl -t haproxy
...
Aug 22 20:18:16 controller-0.redhat.local haproxy[536933]: Proxy cinder started.
Aug 22 20:18:16 controller-0.redhat.local haproxy[536933]: Proxy glance_api started.
Aug 22 20:18:16 controller-0.redhat.local haproxy[536933]: Proxy haproxy.stats started.
Aug 22 20:18:16 controller-0.redhat.local haproxy[536933]: Proxy heat_api started.
Aug 22 20:18:16 controller-0.redhat.local haproxy[536933]: Proxy heat_cfn started.
...
Since the problem described in this bug report should be resolved in a recent advisory, it has been closed with a resolution of ERRATA. For information on the advisory (Red Hat OpenStack Platform (RHOSP) 16.2 enhancement advisory), and where to find the updated files, follow the link below. If the solution does not work for you, open a new bug report. https://access.redhat.com/errata/RHEA-2021:3483
*** Bug 1847320 has been marked as a duplicate of this bug. ***