+++ This bug was initially created as a clone of Bug #1935621 +++

Description of problem:

Observed on an FFU job, but the problem can happen whenever the "podman cp" command is used.

When internal or public certificates are re-generated, we notify the impacted services by triggering a "config reload" when possible, to avoid restarting them entirely and incurring a temporary service disruption. Since the certificates are re-generated on the host, they must be injected into the running containers before the config reload takes place. We inject the file into the running container with "podman cp".

The "podman cp" command operates by temporarily mounting the container's overlay-fs, freezing its execution, copying the file, then thawing execution and unmounting:

Feb 23 18:19:06 controller-0 platform-python[325270]: ansible-command Invoked with _raw_params=set -e#012podman cp /etc/pki/tls/private/overcloud_endpoint.pem 75ee35ce1b09:/etc/pki/tls/private/overcloud_endpoint.pem#012podman exec --user root 75ee35ce1b09 chgrp haproxy /etc/pki/tls/private/overcloud_endpoint.pem#012podman kill --signal=HUP 75ee35ce1b09#012 _uses_shell=True warn=True stdin_add_newline=True strip_empty_ends=True argv=None chdir=None executable=None creates=None removes=None stdin=None
Feb 23 18:19:06 controller-0 podman[325297]: 2021-02-23 18:19:06.132908296 +0000 UTC m=+0.086081887 container mount 75ee35ce1b0912645d42d46967a372cf0a3edf49a5d9d5d121ce60fe3acc6144 (image=undercloud-0.ctlplane.redhat.local:8787/rh-osbs/rhosp16-openstack-haproxy:16.1_20210205.1, name=haproxy-bundle-podman-0)
Feb 23 18:19:06 controller-0 podman[325297]: 2021-02-23 18:19:06.156749176 +0000 UTC m=+0.109922731 container pause 75ee35ce1b0912645d42d46967a372cf0a3edf49a5d9d5d121ce60fe3acc6144 (image=undercloud-0.ctlplane.redhat.local:8787/rh-osbs/rhosp16-openstack-haproxy:16.1_20210205.1, name=haproxy-bundle-podman-0)
Feb 23 18:19:06 controller-0 podman[325297]: 2021-02-23 18:19:06.583847725 +0000 UTC m=+0.537021270 container unpause 75ee35ce1b0912645d42d46967a372cf0a3edf49a5d9d5d121ce60fe3acc6144 (image=undercloud-0.ctlplane.redhat.local:8787/rh-osbs/rhosp16-openstack-haproxy:16.1_20210205.1, name=haproxy-bundle-podman-0)
Feb 23 18:19:06 controller-0 podman[325297]: 2021-02-23 18:19:06.585128533 +0000 UTC m=+0.538302139 container unmount 75ee35ce1b0912645d42d46967a372cf0a3edf49a5d9d5d121ce60fe3acc6144 (image=undercloud-0.ctlplane.redhat.local:8787/rh-osbs/rhosp16-openstack-haproxy:16.1_20210205.1, name=haproxy-bundle-podman-0)

During the short period of time when the freeze is in effect, no "podman exec" or "podman stop" command can run: any attempt to do so returns an error.

Feb 24 02:05:58 controller-0 podman(haproxy-bundle-podman-0)[352861]: ERROR: Error: can only stop created or running containers. 75ee35ce1b0912645d42d46967a372cf0a3edf49a5d9d5d121ce60fe3acc6144 is in state paused: container state improper
Feb 24 02:05:58 controller-0 podman(haproxy-bundle-podman-0)[352861]: ERROR: Failed to stop container, haproxy-bundle-podman-0, based on image, cluster.common.tag/rhosp16-openstack-haproxy:pcmklatest.
Feb 24 02:05:58 controller-0 pacemaker-execd[66852]: notice: haproxy-bundle-podman-0_stop_0:352861:stderr [ ocf-exit-reason:Failed to stop container, haproxy-bundle-podman-0, based on image, cluster.common.tag/rhosp16-openstack-haproxy:pcmklatest. ]
Feb 24 02:05:58 controller-0 pacemaker-controld[66855]: notice: Result of stop operation for haproxy-bundle-podman-0 on controller-0: 1 (error)
Feb 24 02:05:58 controller-0 pacemaker-controld[66855]: notice: controller-0-haproxy-bundle-podman-0_stop_0:206 [ ocf-exit-reason:Failed to stop container, haproxy-bundle-podman-0, based on image, cluster.common.tag/rhosp16-openstack-haproxy:pcmklatest.\n ]
Feb 24 02:05:58 controller-0 pacemaker-controld[66855]: notice: Transition 232 aborted by operation haproxy-bundle-podman-0_stop_0 'modify' on controller-0: Event failed
Feb 24 02:05:58 controller-0 pacemaker-controld[66855]: notice: Transition 232 action 12 (haproxy-bundle-podman-0_stop_0 on controller-0): expected 'ok' but got 'error'
Feb 24 02:05:58 controller-0 pacemaker-controld[66855]: notice: Transition 232 (Complete=4, Pending=0, Fired=0, Skipped=0, Incomplete=3, Source=/var/lib/pacemaker/pengine/pe-input-199.bz2): Complete
Feb 24 02:05:58 controller-0 pacemaker-attrd[66853]: notice: Setting fail-count-haproxy-bundle-podman-0#stop_0[controller-0]: (unset) -> INFINITY
Feb 24 02:05:58 controller-0 pacemaker-attrd[66853]: notice: Setting last-failure-haproxy-bundle-podman-0#stop_0[controller-0]: (unset) -> 1614132358

Note that the error returned is generic, so there's no easy API to check whether the error was due to a paused container or not.

Unfortunately, using "podman cp" opens a time window during which some critical container operations can fail and destabilize the stack. For instance, it may happen that pacemaker cannot monitor the state of a container and decides to stop it. Worse, if the "podman stop" cannot be run, pacemaker will consider it a stop failure and will end up fencing the node hosting the running container.

The certificate injection with "podman cp" is at least found in tripleo-heat-templates [1] and in puppet-tripleo [2].
[1] https://opendev.org/openstack/tripleo-heat-templates/src/branch/master/deployment/haproxy/haproxy-public-tls-inject.yaml
[2] e.g. https://opendev.org/openstack/puppet-tripleo/src/branch/master/files/certmonger-haproxy-refresh.sh, but other service files have it as well

Version-Release number of selected component (if applicable):

How reproducible:
Random (low chances)

Steps to Reproduce:
1. Deploy a TLS-e overcloud
2. Stop the haproxy pcs resource:
   pcs resource disable haproxy-bundle
3. Spam the certificate-inject script for haproxy:
   /usr/bin/certmonger-haproxy-refresh.sh reload internal_api

Actual results:
Sometimes the stop fails because the container is in the paused state at the moment the stop action occurs, and fencing is triggered.

Expected results:
No stop operation should fail.

Additional info:
This most directly impacts pacemaker, but we could argue that temporarily breaking "podman exec" or "podman stop" will likely impact other services as well.
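One conceivable mitigation on the stop side (an illustrative sketch only, not the fix that shipped): since "podman stop" fails while the container is momentarily paused by a concurrent "podman cp", a stop wrapper could poll the container state with "podman inspect" and retry briefly before declaring failure. The function name and the retry budget below are hypothetical.

```shell
#!/bin/sh
# Illustrative sketch: retry "podman stop" while the container is
# momentarily paused by a concurrent "podman cp". Function name and
# retry budget are hypothetical; not the fix that actually shipped.
stop_with_retry() {
    ctr=$1
    retries=${2:-5}
    while [ "$retries" -gt 0 ]; do
        state=$(podman inspect --format '{{.State.Status}}' "$ctr")
        if [ "$state" != "paused" ]; then
            # Container is not frozen by "podman cp"; stop can proceed.
            podman stop "$ctr"
            return $?
        fi
        # "podman cp" holds the pause for well under a second (see the
        # mount/pause/unpause timestamps above); wait and re-check
        # instead of failing immediately.
        sleep 1
        retries=$((retries - 1))
    done
    echo "container $ctr still paused after retries" >&2
    return 1
}
```

This would only paper over the race for the pacemaker resource agent; the quoted timestamps suggest the pause window is around half a second, so a couple of retries would normally be enough.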
Damien, can certmonger-rabbitmq-refresh.sh be fixed as it was for OSP16.2 in https://bugzilla.redhat.com/show_bug.cgi?id=1998917, please?
(In reply to Julia Marciano from comment #14)
> Damien,
> can certmonger-rabbitmq-refresh.sh be fixed as it was for OSP16.2 in
> https://bugzilla.redhat.com/show_bug.cgi?id=1998917, please?

I just created the cloned bz https://bugzilla.redhat.com/show_bug.cgi?id=1999702 for tracking the fix for certmonger-rabbitmq-refresh.sh in 16.1.
(In reply to Julia Marciano from comment #16)
> The new certificate isn't being copied to the container, it seems the 'tar'
> command doesn't succeed:
> // Run here a copy of the original /usr/bin/certmonger-haproxy-refresh.sh
> with 'set -x'
> [root@controller-0 ~]# date; /usr/bin/certmonger-haproxy-refresh.copy.sh reload internal_api
> Fri Sep 3 00:19:47 UTC 2021
> ...
> + tar -c /etc/pki/tls/certs/haproxy/overcloud-haproxy-internal_api.pem
> + podman exec -i haproxy-bundle-podman-0 tar -C / -xv
> tar: Removing leading `/' from member names
> tar: This does not look like a tar archive
> tar: Exiting with failure status due to previous errors
> ERRO[0000] read unixpacket @->/var/run/libpod/socket/1ac6566cb88b71e0cfb2f14f720126ee96d72753d52e826cf0c9bf12f0185a4a/attach: read: connection reset by peer
> Error: non zero exit code: 2: OCI runtime error
> ...

This is a separate issue. We forgot to patch some scripts in puppet-tripleo to work around the "podman cp" breakage on RHEL 8.2.
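For reference, the workaround pattern that the unpatched scripts were supposed to adopt streams a tar archive into the running container over "podman exec -i" (visible in the trace above), which avoids the mount/pause/unpause/unmount cycle of "podman cp" entirely. Below is a minimal sketch of that streaming copy; since no container runtime can be assumed here, a second local directory stands in for the container's filesystem, and the container-side invocation is shown only in a comment.

```shell
#!/bin/sh
# Sketch of the tar-streaming copy used to avoid "podman cp".
# In the real scripts the receiving "tar -x" runs inside the container,
# roughly:
#   tar -C /etc/pki/tls/certs/haproxy -cf - overcloud-haproxy-internal_api.pem \
#       | podman exec -i haproxy-bundle-podman-0 \
#             tar -C /etc/pki/tls/certs/haproxy -xf -
# Here a local directory stands in for the container's filesystem.
set -e

src_dir=$(mktemp -d)   # stands in for the host-side certificate directory
dst_dir=$(mktemp -d)   # stands in for the container-side directory

printf 'dummy certificate\n' > "$src_dir/overcloud-haproxy-internal_api.pem"

# Write the archive to stdout and unpack it from the receiver's stdin;
# the container is never mounted or paused at any point.
tar -C "$src_dir" -cf - overcloud-haproxy-internal_api.pem \
    | tar -C "$dst_dir" -xf -

cat "$dst_dir/overcloud-haproxy-internal_api.pem"   # prints "dummy certificate"

rm -rf "$src_dir" "$dst_dir"
```

Note the explicit "-f -" on both ends: relying on tar's default archive device is exactly the kind of fragility that produces "This does not look like a tar archive" when the sending side writes nothing to the pipe.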
Since the problem described in this bug report should be resolved in a recent advisory, it has been closed with a resolution of ERRATA.

For information on the advisory (Red Hat OpenStack Platform 16.1.7 (Train) bug fix and enhancement advisory), and where to find the updated files, follow the link below. If the solution does not work for you, open a new bug report.

https://access.redhat.com/errata/RHBA-2021:3762