Bug 1927017

Summary:

CCO does not relinquish leadership when restarting for proxy CA change

Product:

OpenShift Container Platform

Reporter:

Matthew Staebler <mstaeble>

Component:

Cloud Credential Operator

Assignee:

sumehta

Status:

CLOSED ERRATA

QA Contact:

wang lin <lwan>

Severity:

medium

Priority:

unspecified

Version:

4.7

CC:

arane, dgoodwin, gshereme, lwan, sumehta

Target Release:

4.8.0

Hardware:

Unspecified

OS:

Unspecified

Doc Type:

No Doc Update

Last Closed:

2021-07-27 22:43:10 UTC

Type:

Bug

Attachments:

Description	Flags
bootstrap gather	none
in-cluster CCO log	none
openshift-machine-api-aws cred request	none

Description Matthew Staebler 2021-02-09 20:37:17 UTC

Created attachment 1756045 [details]
bootstrap gather

When the CCO pod is restarted due to a proxy CA change, the old pod does not release its leader election lease. This results in a delay of about 8 minutes before the new pod can service credentials requests.

Old pod
> time="2021-02-09T20:13:57Z" level=info msg="became leader" id=6706a11c-b8b2-409f-b9bb-3d26b65685b3
> <snip>
> time="2021-02-09T20:19:18Z" level=info msg="Proxy CA configmap change detected, restarting pod" configmap=openshift-cloud-credential-operator/cco-trusted-ca controller=configmap

New pod
> time="2021-02-09T20:19:20Z" level=info msg="generated leader election ID" id=0c26d32c-3526-43e9-9884-e8aacf4ed20c
> I0209 20:19:20.003479       1 leaderelection.go:243] attempting to acquire leader lease  openshift-cloud-credential-operator/cloud-credential-operator-leader...
> time="2021-02-09T20:19:20Z" level=info msg="current leader: 6706a11c-b8b2-409f-b9bb-3d26b65685b3"

As a result, installation takes longer, because the worker machines cannot be started while the CCO is down.

Comment 1 Matthew Staebler 2021-02-09 20:41:47 UTC

Created attachment 1756047 [details]
in-cluster CCO log

From the in-cluster CCO:
> time="2021-02-09T20:21:36Z" level=info msg="generated leader election ID" id=7fca7590-766b-4709-a7be-104708a1260f
> I0209 20:21:36.858036       1 leaderelection.go:243] attempting to acquire leader lease  openshift-cloud-credential-operator/cloud-credential-operator-leader...
> time="2021-02-09T20:21:37Z" level=info msg="current leader: 6706a11c-b8b2-409f-b9bb-3d26b65685b3"
> I0209 20:30:05.713252       1 leaderelection.go:253] successfully acquired lease openshift-cloud-credential-operator/cloud-credential-operator-leader
> time="2021-02-09T20:30:05Z" level=info msg="became leader" id=7fca7590-766b-4709-a7be-104708a1260f

Comment 2 Matthew Staebler 2021-02-09 20:47:22 UTC

Created attachment 1756048 [details]
openshift-machine-api-aws cred request

Of particular interest to me is the openshift-machine-api-aws cred request. The cred request was created at 2021-02-09T20:12:17Z, but was not fulfilled until 2021-02-09T20:30:14Z.

> time="2021-02-09T20:30:14Z" level=info msg="secret created successfully" actuator=aws cr=openshift-cloud-credential-operator/openshift-machine-api-aws targetSecret=openshift-machine-api/aws-cloud-credentials
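For reference, the delays discussed so far can be recomputed from the timestamps in the logs above. The snippet below is only a small sketch for checking the arithmetic; the `elapsed` helper is a hypothetical name and is not part of the CCO.

```go
package main

import (
	"fmt"
	"time"
)

// elapsed is a hypothetical helper: it parses two RFC 3339 timestamps
// and returns the duration from start to end.
func elapsed(start, end string) (time.Duration, error) {
	s, err := time.Parse(time.RFC3339, start)
	if err != nil {
		return 0, err
	}
	e, err := time.Parse(time.RFC3339, end)
	if err != nil {
		return 0, err
	}
	return e.Sub(s), nil
}

func main() {
	// All timestamps are copied from the logs quoted in this report.
	intervals := []struct {
		label, start, end string
	}{
		{"in-cluster CCO start to lease acquired", "2021-02-09T20:21:36Z", "2021-02-09T20:30:05Z"},
		{"old pod restart to lease acquired", "2021-02-09T20:19:18Z", "2021-02-09T20:30:05Z"},
		{"cred request created to secret created", "2021-02-09T20:12:17Z", "2021-02-09T20:30:14Z"},
	}
	for _, iv := range intervals {
		d, err := elapsed(iv.start, iv.end)
		if err != nil {
			fmt.Println("parse error:", err)
			continue
		}
		fmt.Printf("%s: %s\n", iv.label, d)
	}
}
```

This prints 8m29s, 10m47s, and 17m57s respectively, so the in-cluster CCO waited about eight and a half minutes for the lease, and the openshift-machine-api-aws cred request went unfulfilled for almost 18 minutes.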

Comment 3 Devan Goodwin 2021-03-15 18:35:49 UTC

Via Abhinav: we are likely calling os.Exit when we detect the proxy CA change, which would bypass our leader election lease release.

Instead, we should have a global context that gets cancelled, so that the lease release can execute before the process exits.

Comment 6 wang lin 2021-03-25 07:40:09 UTC

The issue is not fixed. When I changed the proxy CA, the CCO pod did not restart.

Version:
$ oc get clusterversion
NAME      VERSION                             AVAILABLE   PROGRESSING   SINCE   STATUS
version   4.8.0-0.nightly-2021-03-24-200346   True        False         33m     Cluster version is 4.8.0-0.nightly-2021-03-24-200346

######logs######
time="2021-03-25T07:01:39Z" level=info msg="Proxy CA configmap change detected, restarting pod" configmap=openshift-cloud-credential-operator/cco-trusted-ca controller=configmap
time="2021-03-25T07:01:54Z" level=info msg="calculating metrics for all CredentialsRequests" controller=metrics
time="2021-03-25T07:01:54Z" level=info msg="reconcile complete" controller=metrics elapsed=2.498132ms
time="2021-03-25T07:01:55Z" level=info msg="reconciling clusteroperator status"
time="2021-03-25T07:01:55Z" level=info msg="clusteroperator status updated" controller=status
W0325 07:02:40.994149       1 warnings.go:67] admissionregistration.k8s.io/v1beta1 MutatingWebhookConfiguration is deprecated in v1.16+, unavailable in v1.22+; use admissionregistration.k8s.io/v1 MutatingWebhookConfiguration
time="2021-03-25T07:03:54Z" level=info msg="calculating metrics for all CredentialsRequests" controller=metrics
time="2021-03-25T07:03:54Z" level=info msg="reconcile complete" controller=metrics elapsed=2.38714ms
time="2021-03-25T07:05:54Z" level=info msg="calculating metrics for all CredentialsRequests" controller=metrics
time="2021-03-25T07:05:54Z" level=info msg="reconcile complete" controller=metrics elapsed=2.523944ms
time="2021-03-25T07:06:18Z" level=info msg="Proxy CA configmap change detected, restarting pod" configmap=openshift-cloud-credential-operator/cco-trusted-ca controller=configmap
time="2021-03-25T07:06:55Z" level=info msg="reconciling clusteroperator status"
time="2021-03-25T07:06:55Z" level=info msg="clusteroperator status updated" controller=status
time="2021-03-25T07:07:54Z" level=info msg="calculating metrics for all CredentialsRequests" controller=metrics
time="2021-03-25T07:07:54Z" level=info msg="reconcile complete" controller=metrics elapsed=2.539316ms
time="2021-03-25T07:09:54Z" level=info msg="calculating metrics for all CredentialsRequests" controller=metrics
time="2021-03-25T07:09:54Z" level=info msg="reconcile complete" controller=metrics elapsed=2.433577ms
time="2021-03-25T07:11:54Z" level=info msg="calculating metrics for all CredentialsRequests" controller=metrics
time="2021-03-25T07:11:54Z" level=info msg="reconcile complete" controller=metrics elapsed=2.361834ms
time="2021-03-25T07:11:55Z" level=info msg="reconciling clusteroperator status"

Comment 8 wang lin 2021-04-19 01:37:49 UTC

Verified on 4.8.0-0.nightly-2021-04-15-202330.

1. Change the proxy CA. The old pod detects the change and restarts.
Old pod:
I0416 12:38:46.968666       1 observer_polling.go:120] Observed file "/var/run/configmaps/trusted-ca-bundle/tls-ca-bundle.pem" has been modified (old="e0c433f773e598341811fd40d74c581fbfe04f864739e80ea763c9e1291f6c5d", new="7336a74c27fc8c30928e9c8f8f275ec4b656221a18ba15aab4aeda42f546da1d")
time="2021-04-16T12:38:46Z" level=info msg="Proxy CA configmap change detected, restarting pod"
time="2021-04-16T12:38:46Z" level=info msg="leader lost" id=9b649244-8804-4c66-aabd-a3a0b1953d34

2. The new pod becomes the leader immediately, without waiting 8 minutes.
New pod:
Copying system trust bundle
time="2021-04-16T12:38:47Z" level=info msg="setting up client for manager"
time="2021-04-16T12:38:47Z" level=info msg="running file observer" file=/var/run/configmaps/trusted-ca-bundle/tls-ca-bundle.pem
I0416 12:38:47.554489       1 observer_polling.go:159] Starting file observer
time="2021-04-16T12:38:47Z" level=info msg="generated leader election ID" id=7e23979a-7897-47c3-9ab2-ab47e084895a
I0416 12:38:47.556316       1 leaderelection.go:243] attempting to acquire leader lease  openshift-cloud-credential-operator/cloud-credential-operator-leader...
I0416 12:38:47.568094       1 leaderelection.go:253] successfully acquired lease openshift-cloud-credential-operator/cloud-credential-operator-leader
time="2021-04-16T12:38:47Z" level=info msg="became leader" id=7e23979a-7897-47c3-9ab2-ab47e084895a
time="2021-04-16T12:38:47Z" level=info msg="setting up manager"
I0416 12:38:48.618720       1 request.go:645] Throttling request took 1.046397293s, request: GET:https://172.30.0.1:443/apis/monitoring.coreos.com/v1?timeout=32s

Comment 11 errata-xmlrpc 2021-07-27 22:43:10 UTC

Since the problem described in this bug report should be resolved in a recent advisory, it has been closed with a resolution of ERRATA.

For information on the advisory (Moderate: OpenShift Container Platform 4.8.2 bug fix and security update), and where to find the updated files, follow the link below.

If the solution does not work for you, open a new bug report.

https://access.redhat.com/errata/RHSA-2021:2438