Bug 1935621 - Injecting certificate with "podman cp" can break cluster monitoring and operation
Keywords:
Status: CLOSED ERRATA
Alias: None
Product: Red Hat OpenStack
Classification: Red Hat
Component: puppet-tripleo
Version: 16.2 (Train)
Hardware: x86_64
OS: Linux
Priority: high
Severity: high
Target Milestone: beta
Target Release: 16.2 (Train on RHEL 8.4)
Assignee: Damien Ciabrini
QA Contact: David Rosenfeld
URL:
Whiteboard:
Duplicates: 1847320 (view as bug list)
Depends On:
Blocks: 1969461
 
Reported: 2021-03-05 09:55 UTC by Damien Ciabrini
Modified: 2024-10-01 17:37 UTC
CC: 10 users

Fixed In Version: puppet-tripleo-11.6.2-2.20210428172107.5c76ddc.el8ost.2 openstack-tripleo-heat-templates-11.5.1-2.20210506094805.28ae385.el8ost.1
Doc Type: If docs needed, set a value
Doc Text:
Clone Of:
Clones: 1969461 (view as bug list)
Environment:
Last Closed: 2021-09-15 07:12:31 UTC
Target Upstream Version:
Embargoed:


Links
System ID Private Priority Status Summary Last Updated
Launchpad 1917868 0 None None None 2021-03-05 10:04:40 UTC
OpenStack gerrit 783913 0 None MERGED HA: inject public certificates without blocking container 2021-05-27 10:45:44 UTC
OpenStack gerrit 783949 0 None MERGED HA: inject public certificates without blocking container 2021-05-27 10:45:44 UTC
Red Hat Issue Tracker OSP-4213 0 None None None 2022-06-07 12:17:08 UTC
Red Hat Product Errata RHEA-2021:3483 0 None None None 2021-09-15 07:13:00 UTC

Description Damien Ciabrini 2021-03-05 09:55:37 UTC
Description of problem:

Observed on an FFU job, but the problem can happen whenever the "podman cp" command is used.

When internal or public certificates are re-generated, we notify the
impacted services by triggering a "config reload" when possible, to
avoid restarting them entirely and incurring a temporary service
disruption.

Since the certificates are re-generated on the host, they must be
injected in running containers before the config reload takes place.
We inject the file into the running container with "podman cp".

The "podman cp" command operates by temporarily mounting the
container's overlay-fs, freezing its execution, copying the file,
then thawing execution and unmounting:

Feb 23 18:19:06 controller-0 platform-python[325270]: ansible-command Invoked with _raw_params=set -e#012podman cp /etc/pki/tls/private/overcloud_endpoint.pem 75ee35ce1b09:/etc/pki/tls/private/overcloud_endpoint.pem#012podman exec --user root 75ee35ce1b09 chgrp haproxy /etc/pki/tls/private/overcloud_endpoint.pem#012podman kill --signal=HUP 75ee35ce1b09#012 _uses_shell=True warn=True stdin_add_newline=True strip_empty_ends=True argv=None chdir=None executable=None creates=None removes=None stdin=None
Feb 23 18:19:06 controller-0 podman[325297]: 2021-02-23 18:19:06.132908296 +0000 UTC m=+0.086081887 container mount 75ee35ce1b0912645d42d46967a372cf0a3edf49a5d9d5d121ce60fe3acc6144 (image=undercloud-0.ctlplane.redhat.local:8787/rh-osbs/rhosp16-openstack-haproxy:16.1_20210205.1, name=haproxy-bundle-podman-0)
Feb 23 18:19:06 controller-0 podman[325297]: 2021-02-23 18:19:06.156749176 +0000 UTC m=+0.109922731 container pause 75ee35ce1b0912645d42d46967a372cf0a3edf49a5d9d5d121ce60fe3acc6144 (image=undercloud-0.ctlplane.redhat.local:8787/rh-osbs/rhosp16-openstack-haproxy:16.1_20210205.1, name=haproxy-bundle-podman-0)
Feb 23 18:19:06 controller-0 podman[325297]: 2021-02-23 18:19:06.583847725 +0000 UTC m=+0.537021270 container unpause 75ee35ce1b0912645d42d46967a372cf0a3edf49a5d9d5d121ce60fe3acc6144 (image=undercloud-0.ctlplane.redhat.local:8787/rh-osbs/rhosp16-openstack-haproxy:16.1_20210205.1, name=haproxy-bundle-podman-0)
Feb 23 18:19:06 controller-0 podman[325297]: 2021-02-23 18:19:06.585128533 +0000 UTC m=+0.538302139 container unmount 75ee35ce1b0912645d42d46967a372cf0a3edf49a5d9d5d121ce60fe3acc6144 (image=undercloud-0.ctlplane.redhat.local:8787/rh-osbs/rhosp16-openstack-haproxy:16.1_20210205.1, name=haproxy-bundle-podman-0)

During the short period while the container is frozen, no
"podman exec" or "podman stop" command can run: any attempt to do
so returns an error.

Feb 24 02:05:58 controller-0 podman(haproxy-bundle-podman-0)[352861]: ERROR: Error: can only stop created or running containers. 75ee35ce1b0912645d42d46967a372cf0a3edf49a5d9d5d121ce60fe3acc6144 is in state paused: container state improper
Feb 24 02:05:58 controller-0 podman(haproxy-bundle-podman-0)[352861]: ERROR: Failed to stop container, haproxy-bundle-podman-0, based on image, cluster.common.tag/rhosp16-openstack-haproxy:pcmklatest.
Feb 24 02:05:58 controller-0 pacemaker-execd[66852]: notice: haproxy-bundle-podman-0_stop_0:352861:stderr [ ocf-exit-reason:Failed to stop container, haproxy-bundle-podman-0, based on image, cluster.common.tag/rhosp16-openstack-haproxy:pcmklatest. ]
Feb 24 02:05:58 controller-0 pacemaker-controld[66855]: notice: Result of stop operation for haproxy-bundle-podman-0 on controller-0: 1 (error)
Feb 24 02:05:58 controller-0 pacemaker-controld[66855]: notice: controller-0-haproxy-bundle-podman-0_stop_0:206 [ ocf-exit-reason:Failed to stop container, haproxy-bundle-podman-0, based on image, cluster.common.tag/rhosp16-openstack-haproxy:pcmklatest.\n ]
Feb 24 02:05:58 controller-0 pacemaker-controld[66855]: notice: Transition 232 aborted by operation haproxy-bundle-podman-0_stop_0 'modify' on controller-0: Event failed
Feb 24 02:05:58 controller-0 pacemaker-controld[66855]: notice: Transition 232 action 12 (haproxy-bundle-podman-0_stop_0 on controller-0): expected 'ok' but got 'error'
Feb 24 02:05:58 controller-0 pacemaker-controld[66855]: notice: Transition 232 (Complete=4, Pending=0, Fired=0, Skipped=0, Incomplete=3, Source=/var/lib/pacemaker/pengine/pe-input-199.bz2): Complete
Feb 24 02:05:58 controller-0 pacemaker-attrd[66853]: notice: Setting fail-count-haproxy-bundle-podman-0#stop_0[controller-0]: (unset) -> INFINITY
Feb 24 02:05:58 controller-0 pacemaker-attrd[66853]: notice: Setting last-failure-haproxy-bundle-podman-0#stop_0[controller-0]: (unset) -> 1614132358

Note that the error returned is generic, so there's no easy API
to check whether the error was due to a paused container or not.
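
For reference, the container state itself can still be queried out of band with "podman inspect"; what is missing is a structured error from the failed stop/exec call that a caller such as pacemaker could act on. A minimal check (container name purely illustrative):

# Prints the current container state, e.g. "running" or "paused"
podman inspect --format '{{.State.Status}}' haproxy-bundle-podman-0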


Unfortunately, using "podman cp" opens a time window during which some
critical container operations can fail and destabilize the
stack. For instance, pacemaker may be unable to monitor the
state of a container and decide to stop it. Worse, if the "podman
stop" cannot be run, pacemaker will treat it as a stop failure and
will end up fencing the node hosting the running container.

Certificate injection with "podman cp" is found at least in
tripleo-heat-templates [1] and in puppet-tripleo [2].

[1] https://opendev.org/openstack/tripleo-heat-templates/src/branch/master/deployment/haproxy/haproxy-public-tls-inject.yaml
[2] e.g. https://opendev.org/openstack/puppet-tripleo/src/branch/master/files/certmonger-haproxy-refresh.sh, but other service files have it as well
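
One way to avoid the pause window is to stream a tar archive into the running container via "podman exec" instead of calling "podman cp"; this is the approach shown later in the trace in comment 8. A minimal sketch, with the certificate path and container name taken from the haproxy case:

# Stream the renewed certificate into the running container without
# pausing it: tar on the host writes the archive to stdout, and tar
# inside the container extracts it relative to /.
cert=/etc/pki/tls/certs/haproxy/overcloud-haproxy-internal_api.pem
container=haproxy-bundle-podman-0
tar -c "$cert" | podman exec -i "$container" tar -C / -xv

# Then signal haproxy to reload its configuration and pick up the new cert.
podman kill --signal=HUP "$container"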


Version-Release number of selected component (if applicable):


How reproducible:
Random (low chances)

Steps to Reproduce:
1. Deploy a TLS-e overcloud

2. stop haproxy
pcs resource disable haproxy-bundle

3. Repeatedly run ("spam") the certificate-injection script for haproxy (see the loop sketch below)
/usr/bin/certmonger-haproxy-refresh.sh reload internal_api
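
A simple way to spam the script in step 3 is a shell loop (purely illustrative); running it while pacemaker attempts the stop maximizes the chance of overlapping with the pause window:

# Repeatedly trigger certificate injection so that, sooner or later,
# a stop/monitor action lands while the container is paused.
while true; do
    /usr/bin/certmonger-haproxy-refresh.sh reload internal_api
done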


Actual results:
Sometimes the stop fails because the container is in the paused state at the moment the stop action occurs, and fencing is triggered.

Expected results:
No stop operation should fail.

Additional info:

This most directly impacts pacemaker, but temporarily breaking "podman exec" or "podman stop" is likely to impact other services as well.

Comment 8 Damien Ciabrini 2021-08-11 10:32:22 UTC
Instruction for testing:

On a controller node, with haproxy container running.


1. check one current certificate for haproxy (e.g. internal API)
# md5sum /etc/pki/tls/certs/haproxy/overcloud-haproxy-internal_api.pem
bf41675624ef5b748335bf4668d6f7c3  /etc/pki/tls/certs/haproxy/overcloud-haproxy-internal_api.pem

2. generate a new certificate with certmonger, and wait a couple of seconds for the new cert to get injected in haproxy
# getcert resubmit -i haproxy-internal_api-cert
Resubmitting "haproxy-internal_api-cert" to "IPA".

3. verify that a new certificate has been retrieved by certmonger on the host
# md5sum /etc/pki/tls/certs/haproxy/overcloud-haproxy-internal_api.pem
fa01cdc093402a172c16a2baa363ba99  /etc/pki/tls/certs/haproxy/overcloud-haproxy-internal_api.pem

4. verify that the cert was injected into the container as expected
# podman exec $(podman ps -q --filter name=haproxy) md5sum /etc/pki/tls/certs/haproxy/overcloud-haproxy-internal_api.pem
fa01cdc093402a172c16a2baa363ba99  /etc/pki/tls/certs/haproxy/overcloud-haproxy-internal_api.pem

5. verify that haproxy reloaded its config after the certificate injection
# journalctl -t haproxy
Aug 11 10:30:39 controller-0.redhat.local haproxy[536933]: Proxy cinder started.
Aug 11 10:30:39 controller-0.redhat.local haproxy[536933]: Proxy glance_api started.
Aug 11 10:30:39 controller-0.redhat.local haproxy[536933]: Proxy haproxy.stats started.
[...]

6. double-check that the cert injection now uses "tar" instead of "podman cp" (the fix for this bz)
# /usr/bin/certmonger-haproxy-refresh.sh reload internal_api
[...]
+ cat /etc/pki/tls/certs/haproxy/overcloud-haproxy-internal_api.crt /etc/ipa/ca.crt /etc/pki/tls/private/haproxy/overcloud-haproxy-internal_api.key
++ podman ps '--format={{.Names}}'
++ grep -w -E 'haproxy(-bundle-.*-[0-9]+)?'
+ haproxy_container_name=haproxy-bundle-podman-0
+ '[' reload == reload ']'
+ echo haproxy-bundle-podman-0
+ grep -q '^haproxy-bundle'
+ tar -c /etc/pki/tls/certs/haproxy/overcloud-haproxy-internal_api.pem
+ podman exec -i haproxy-bundle-podman-0 tar -C / -xv
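
The host/container comparison in the steps above can also be scripted; a small sketch using the same paths and container lookup as shown above (the helper itself is illustrative):

# Compare the certificate checksum on the host with the copy inside
# the running haproxy container.
cert=/etc/pki/tls/certs/haproxy/overcloud-haproxy-internal_api.pem
container=$(podman ps -q --filter name=haproxy)
host_sum=$(md5sum "$cert" | awk '{print $1}')
container_sum=$(podman exec "$container" md5sum "$cert" | awk '{print $1}')
if [ "$host_sum" = "$container_sum" ]; then
    echo "OK: container certificate matches the host copy"
else
    echo "MISMATCH: renewed certificate was not injected"
fi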

Comment 9 Julia Marciano 2021-08-23 00:13:21 UTC
Verified:

Verified in version:
[heat-admin@controller-0 ~]$ rpm -qa|grep puppet-tripleo
puppet-tripleo-11.6.2-2.20210603175724.f642d04.el8ost.1.noarch

The following steps were performed:
[root@controller-0 ~]# date
Sun Aug 22 20:06:30 UTC 2021

// Check md5sum of the current haproxy internal api certificate:
[root@controller-0 ~]# md5sum /etc/pki/tls/certs/haproxy/overcloud-haproxy-internal_api.pem
3e9187b760be94888e44e9e0f9f6eb4c  /etc/pki/tls/certs/haproxy/overcloud-haproxy-internal_api.pem

// Generate a new certificate by certmonger:
[root@controller-0 ~]# getcert resubmit -i haproxy-internal_api-cert
Resubmitting "haproxy-internal_api-cert" to "IPA".

// Wait a few seconds and verify that the previous action caused haproxy to reload its configuration:
[root@controller-0 ~]# journalctl -t haproxy
...
Aug 22 20:07:05 controller-0.redhat.local haproxy[536933]: Proxy cinder started.
Aug 22 20:07:05 controller-0.redhat.local haproxy[536933]: Proxy glance_api started.
Aug 22 20:07:05 controller-0.redhat.local haproxy[536933]: Proxy haproxy.stats started.
Aug 22 20:07:05 controller-0.redhat.local haproxy[536933]: Proxy heat_api started.
Aug 22 20:07:05 controller-0.redhat.local haproxy[536933]: Proxy heat_cfn started.
Aug 22 20:07:05 controller-0.redhat.local haproxy[536933]: Proxy horizon started.
Aug 22 20:07:05 controller-0.redhat.local haproxy[536933]: Proxy keystone_admin started.
Aug 22 20:07:05 controller-0.redhat.local haproxy[536933]: Proxy keystone_public started.
Aug 22 20:07:05 controller-0.redhat.local haproxy[536933]: Proxy mysql started.
Aug 22 20:07:05 controller-0.redhat.local haproxy[536933]: Proxy neutron started.
Aug 22 20:07:05 controller-0.redhat.local haproxy[536933]: Proxy nova_metadata started.
Aug 22 20:07:05 controller-0.redhat.local haproxy[536933]: Proxy nova_novncproxy started.
Aug 22 20:07:05 controller-0.redhat.local haproxy[536933]: Proxy nova_osapi started.
Aug 22 20:07:05 controller-0.redhat.local haproxy[536933]: Proxy placement started.
Aug 22 20:07:05 controller-0.redhat.local haproxy[536933]: Proxy redis started.
Aug 22 20:07:05 controller-0.redhat.local haproxy[536933]: Proxy swift_proxy_server started

// Verify that a new certificate has been retrieved by certmonger on the host:
[root@controller-0 ~]# md5sum /etc/pki/tls/certs/haproxy/overcloud-haproxy-internal_api.pem
a7ef9913940ffe3bc69dd4d4bb63ef2a  /etc/pki/tls/certs/haproxy/overcloud-haproxy-internal_api.pem

// Verify that the new certificate was injected into the haproxy container as expected:
[root@controller-0 ~]# podman exec $(podman ps -q --filter name=haproxy) md5sum /etc/pki/tls/certs/haproxy/overcloud-haproxy-internal_api.pem
a7ef9913940ffe3bc69dd4d4bb63ef2a  /etc/pki/tls/certs/haproxy/overcloud-haproxy-internal_api.pem

// Ensure that the certificate injection was done using "tar" rather than the "podman cp" command:
[root@controller-0 ~]# date
Sun Aug 22 20:18:02 UTC 2021

[root@controller-0 ~]# /usr/bin/certmonger-haproxy-refresh.sh reload internal_api
...
++ grep -w -E 'haproxy(-bundle-.*-[0-9]+)?'
+ haproxy_container_name=haproxy-bundle-podman-0
+ '[' reload == reload ']'
+ echo haproxy-bundle-podman-0
+ grep -q '^haproxy-bundle'
+ tar -c /etc/pki/tls/certs/haproxy/overcloud-haproxy-internal_api.pem
+ podman exec -i haproxy-bundle-podman-0 tar -C / -xv
tar: Removing leading `/' from member names
etc/pki/tls/certs/haproxy/overcloud-haproxy-internal_api.pem
...

// Optionally, verify that haproxy reloaded its configuration:
[root@controller-0 ~]# journalctl -t haproxy
...
Aug 22 20:18:16 controller-0.redhat.local haproxy[536933]: Proxy cinder started.
Aug 22 20:18:16 controller-0.redhat.local haproxy[536933]: Proxy glance_api started.
Aug 22 20:18:16 controller-0.redhat.local haproxy[536933]: Proxy haproxy.stats started.
Aug 22 20:18:16 controller-0.redhat.local haproxy[536933]: Proxy heat_api started.
Aug 22 20:18:16 controller-0.redhat.local haproxy[536933]: Proxy heat_cfn started.
...

Comment 11 errata-xmlrpc 2021-09-15 07:12:31 UTC
Since the problem described in this bug report should be
resolved in a recent advisory, it has been closed with a
resolution of ERRATA.

For information on the advisory (Red Hat OpenStack Platform (RHOSP) 16.2 enhancement advisory), and where to find the updated
files, follow the link below.

If the solution does not work for you, open a new bug report.

https://access.redhat.com/errata/RHEA-2021:3483

Comment 12 Damien Ciabrini 2021-11-09 16:31:41 UTC
*** Bug 1847320 has been marked as a duplicate of this bug. ***

