Bug 1969461 - Injecting certificate with "podman cp" can break cluster monitoring and operation
Summary: Injecting certificate with "podman cp" can break cluster monitoring and operation
Keywords:
Status: CLOSED ERRATA
Alias: None
Product: Red Hat OpenStack
Classification: Red Hat
Component: puppet-tripleo
Version: 16.1 (Train)
Hardware: x86_64
OS: Linux
Priority: high
Severity: high
Target Milestone: z7
Target Release: 16.1 (Train on RHEL 8.2)
Assignee: Damien Ciabrini
QA Contact: David Rosenfeld
URL:
Whiteboard:
Depends On: 1935621
Blocks:
 
Reported: 2021-06-08 13:32 UTC by Luigi Toscano
Modified: 2022-06-07 12:09 UTC
CC List: 16 users

Fixed In Version: openstack-tripleo-heat-templates-11.3.2-1.20210720153312.el8ost puppet-tripleo-11.5.0-1.20210622133309.el8ost
Doc Type: If docs needed, set a value
Doc Text:
Clone Of: 1935621
Environment:
Last Closed: 2021-12-09 20:19:41 UTC
Target Upstream Version:
Embargoed:




Links
System ID Private Priority Status Summary Last Updated
Launchpad 1917868 0 None None None 2021-06-08 13:32:39 UTC
OpenStack gerrit 783913 0 None MERGED HA: inject public certificates without blocking container 2021-06-08 13:32:39 UTC
OpenStack gerrit 783949 0 None MERGED HA: inject public certificates without blocking container 2021-06-08 13:32:39 UTC
Red Hat Issue Tracker OSP-4470 0 None None None 2021-11-18 11:33:12 UTC
Red Hat Product Errata RHBA-2021:3762 0 None None None 2021-12-09 20:19:59 UTC

Description Luigi Toscano 2021-06-08 13:32:40 UTC
+++ This bug was initially created as a clone of Bug #1935621 +++

Description of problem:

Observed on an FFU job, but the problem can happen whenever the "podman cp" command is used.

When internal or public certificates are re-generated, we notify the
impacted services by triggering a "config reload" when possible, to
avoid restarting them entirely and incurring a temporary service
disruption.

Since the certificates are re-generated on the host, they must be
injected in running containers before the config reload takes place.
We inject the file into the running container with "podman cp".

The "podman cp" command operates by temporarily mounting the
container's overlay-fs, freezing its execution, copying the file,
then thawing execution and unmounting:

Feb 23 18:19:06 controller-0 platform-python[325270]: ansible-command Invoked with _raw_params=set -e#012podman cp /etc/pki/tls/private/overcloud_endpoint.pem 75ee35ce1b09:/etc/pki/tls/private/overcloud_endpoint.pem#012podman exec --user root 75ee35ce1b09 chgrp haproxy /etc/pki/tls/private/overcloud_endpoint.pem#012podman kill --signal=HUP 75ee35ce1b09#012 _uses_shell=True warn=True stdin_add_newline=True strip_empty_ends=True argv=None chdir=None executable=None creates=None removes=None stdin=None
Feb 23 18:19:06 controller-0 podman[325297]: 2021-02-23 18:19:06.132908296 +0000 UTC m=+0.086081887 container mount 75ee35ce1b0912645d42d46967a372cf0a3edf49a5d9d5d121ce60fe3acc6144 (image=undercloud-0.ctlplane.redhat.local:8787/rh-osbs/rhosp16-openstack-haproxy:16.1_20210205.1, name=haproxy-bundle-podman-0)
Feb 23 18:19:06 controller-0 podman[325297]: 2021-02-23 18:19:06.156749176 +0000 UTC m=+0.109922731 container pause 75ee35ce1b0912645d42d46967a372cf0a3edf49a5d9d5d121ce60fe3acc6144 (image=undercloud-0.ctlplane.redhat.local:8787/rh-osbs/rhosp16-openstack-haproxy:16.1_20210205.1, name=haproxy-bundle-podman-0)
Feb 23 18:19:06 controller-0 podman[325297]: 2021-02-23 18:19:06.583847725 +0000 UTC m=+0.537021270 container unpause 75ee35ce1b0912645d42d46967a372cf0a3edf49a5d9d5d121ce60fe3acc6144 (image=undercloud-0.ctlplane.redhat.local:8787/rh-osbs/rhosp16-openstack-haproxy:16.1_20210205.1, name=haproxy-bundle-podman-0)
Feb 23 18:19:06 controller-0 podman[325297]: 2021-02-23 18:19:06.585128533 +0000 UTC m=+0.538302139 container unmount 75ee35ce1b0912645d42d46967a372cf0a3edf49a5d9d5d121ce60fe3acc6144 (image=undercloud-0.ctlplane.redhat.local:8787/rh-osbs/rhosp16-openstack-haproxy:16.1_20210205.1, name=haproxy-bundle-podman-0)
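Decoded, the ansible task above (the #012 escapes are newlines) runs the following shell snippet against the haproxy bundle container:

set -e
podman cp /etc/pki/tls/private/overcloud_endpoint.pem 75ee35ce1b09:/etc/pki/tls/private/overcloud_endpoint.pem
podman exec --user root 75ee35ce1b09 chgrp haproxy /etc/pki/tls/private/overcloud_endpoint.pem
podman kill --signal=HUP 75ee35ce1b09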

During the small window while the container is frozen, no
"podman exec" or "podman stop" command can run: any attempt to do
so returns an error.

Feb 24 02:05:58 controller-0 podman(haproxy-bundle-podman-0)[352861]: ERROR: Error: can only stop created or running containers. 75ee35ce1b0912645d42d46967a372cf0a3edf49a5d9d5d121ce60fe3acc6144 is in state paused: container state improper
Feb 24 02:05:58 controller-0 podman(haproxy-bundle-podman-0)[352861]: ERROR: Failed to stop container, haproxy-bundle-podman-0, based on image, cluster.common.tag/rhosp16-openstack-haproxy:pcmklatest.
Feb 24 02:05:58 controller-0 pacemaker-execd[66852]: notice: haproxy-bundle-podman-0_stop_0:352861:stderr [ ocf-exit-reason:Failed to stop container, haproxy-bundle-podman-0, based on image, cluster.common.tag/rhosp16-openstack-haproxy:pcmklatest. ]
Feb 24 02:05:58 controller-0 pacemaker-controld[66855]: notice: Result of stop operation for haproxy-bundle-podman-0 on controller-0: 1 (error)
Feb 24 02:05:58 controller-0 pacemaker-controld[66855]: notice: controller-0-haproxy-bundle-podman-0_stop_0:206 [ ocf-exit-reason:Failed to stop container, haproxy-bundle-podman-0, based on image, cluster.common.tag/rhosp16-openstack-haproxy:pcmklatest.\n ]
Feb 24 02:05:58 controller-0 pacemaker-controld[66855]: notice: Transition 232 aborted by operation haproxy-bundle-podman-0_stop_0 'modify' on controller-0: Event failed
Feb 24 02:05:58 controller-0 pacemaker-controld[66855]: notice: Transition 232 action 12 (haproxy-bundle-podman-0_stop_0 on controller-0): expected 'ok' but got 'error'
Feb 24 02:05:58 controller-0 pacemaker-controld[66855]: notice: Transition 232 (Complete=4, Pending=0, Fired=0, Skipped=0, Incomplete=3, Source=/var/lib/pacemaker/pengine/pe-input-199.bz2): Complete
Feb 24 02:05:58 controller-0 pacemaker-attrd[66853]: notice: Setting fail-count-haproxy-bundle-podman-0#stop_0[controller-0]: (unset) -> INFINITY
Feb 24 02:05:58 controller-0 pacemaker-attrd[66853]: notice: Setting last-failure-haproxy-bundle-podman-0#stop_0[controller-0]: (unset) -> 1614132358

Note that the error returned is generic, so there's no easy API
to check whether the error was due to a paused container or not.
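(As an illustration only, not something the current resource agent does: one could query the container state explicitly around the failure, e.g.

podman inspect --format '{{.State.Status}}' haproxy-bundle-podman-0

which prints "paused" while the "podman cp" freeze is in effect. This is racy, though, since the pause window is short.)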


Unfortunately, using "podman cp" opens a time window during which some
critical container operations can fail and destabilize the
stack. For instance, pacemaker may fail to monitor the
state of a container and decide to stop it as a result. Worse, if the "podman
stop" cannot be run, pacemaker will consider it a stop failure and
will end up fencing the node hosting the running container.

Certificate injection with "podman cp" is found in at least
tripleo-heat-templates [1] and puppet-tripleo [2].

[1] https://opendev.org/openstack/tripleo-heat-templates/src/branch/master/deployment/haproxy/haproxy-public-tls-inject.yaml
[2] e.g. https://opendev.org/openstack/puppet-tripleo/src/branch/master/files/certmonger-haproxy-refresh.sh, but other service files have it as well
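The merged gerrit reviews linked above ("HA: inject public certificates without blocking container") address this injection path; judging from their title and from the pipeline quoted in comment 18, the general pattern is to stream the file through "podman exec" instead of pausing the container with "podman cp". A rough sketch of that pattern, with illustrative certificate path and container name:

tar -c /etc/pki/tls/certs/haproxy/overcloud-haproxy-internal_api.pem | podman exec -i haproxy-bundle-podman-0 tar -C / -xv
podman exec --user root haproxy-bundle-podman-0 chgrp haproxy /etc/pki/tls/certs/haproxy/overcloud-haproxy-internal_api.pem
podman kill --signal=HUP haproxy-bundle-podman-0

Because the copy happens inside an already-running exec session, the container's cgroup is never frozen, so concurrent "podman exec"/"podman stop" calls (e.g. from pacemaker) keep working.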


Version-Release number of selected component (if applicable):


How reproducible:
Random (low chances)

Steps to Reproduce:
1. Deploy a TLS-e overcloud

2. stop haproxy
pcs resource disable haproxy-bundle

3. spam the certificate-inject script for haproxy
/usr/bin/certmonger-haproxy-refresh.sh reload internal_api
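
A rough way to automate step 3 and widen the race window (illustrative only; the script path is the one shipped by puppet-tripleo):

while true; do
    /usr/bin/certmonger-haproxy-refresh.sh reload internal_api
done

If the bundle's stop action from step 2 lands inside one of the pause windows opened by "podman cp", it fails and fencing follows, as described under "Actual results" below.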


Actual results:
Sometimes the stop fails because the container is in the paused state at the time the stop action occurs, and fencing is triggered.

Expected results:
No stop operation should fail.

Additional info:

This most directly impacts pacemaker, but we could make the argument that temporarily breaking "podman exec" or "podman stop" will likely impact other services.

Comment 14 Julia Marciano 2021-08-30 21:23:47 UTC
Damien, 
can certmonger-rabbitmq-refresh.sh be fixed as it was for OSP16.2 in https://bugzilla.redhat.com/show_bug.cgi?id=1998917, please?

Comment 15 Damien Ciabrini 2021-08-31 15:41:26 UTC
(In reply to Julia Marciano from comment #14)
> Damien, 
> can certmonger-rabbitmq-refresh.sh be fixed as it was for OSP16.2 in
> https://bugzilla.redhat.com/show_bug.cgi?id=1998917, please?

I just created the cloned bz https://bugzilla.redhat.com/show_bug.cgi?id=1999702 for tracking the fix for certmonger-rabbitmq-refresh.sh in 16.1.

Comment 18 Michele Baldessari 2021-09-06 14:53:58 UTC
(In reply to Julia Marciano from comment #16)
> The new certificate isn't being copied to the container, it seems 'tar'
> command doesn't succeed:
> // Run here a copy of the original /usr/bin/certmonger-haproxy-refresh.sh
> with 'set -x'
> [root@controller-0 ~]# date;/usr/bin/certmonger-haproxy-refresh.copy.sh
> reload internal_api 
> Fri Sep  3 00:19:47 UTC 2021
> ...
> + tar -c /etc/pki/tls/certs/haproxy/overcloud-haproxy-internal_api.pem
> + podman exec -i haproxy-bundle-podman-0 tar -C / -xv
> tar: Removing leading `/' from member names
> tar: This does not look like a tar archive
> tar: Exiting with failure status due to previous errors
> ERRO[0000] read unixpacket
> @->/var/run/libpod/socket/
> 1ac6566cb88b71e0cfb2f14f720126ee96d72753d52e826cf0c9bf12f0185a4a/attach:
> read: connection reset by peer 
> Error: non zero exit code: 2: OCI runtime error
> ...

This is a separate issue. We forgot to patch some scripts in puppet-tripleo to work around the
"podman cp" breakage on RHEL 8.2.

Comment 34 errata-xmlrpc 2021-12-09 20:19:41 UTC
Since the problem described in this bug report should be
resolved in a recent advisory, it has been closed with a
resolution of ERRATA.

For information on the advisory (Red Hat OpenStack Platform 16.1.7 (Train) bug fix and enhancement advisory), and where to find the updated
files, follow the link below.

If the solution does not work for you, open a new bug report.

https://access.redhat.com/errata/RHBA-2021:3762

