1927017 – CCO does not relinquish leadership when restarting for proxy CA change

Bug 1927017 - CCO does not relinquish leadership when restarting for proxy CA change

Summary: CCO does not relinquish leadership when restarting for proxy CA change

Keywords:
Status:	CLOSED ERRATA
Alias:	None
Product:	OpenShift Container Platform
Classification:	Red Hat
Component:	Cloud Credential Operator
Sub Component:
Version:	4.7
Hardware:	Unspecified
OS:	Unspecified
Priority:	unspecified
Severity:	medium
Target Milestone:	---
Target Release:	4.8.0
Assignee:	sumehta
QA Contact:	wang lin
Docs Contact:
URL:
Whiteboard:
Depends On:
Blocks:
TreeView+	depends on / blocked

Reported:	2021-02-09 20:37 UTC by Matthew Staebler
Modified:	2021-07-27 22:43 UTC (History)
CC List:	5 users (show)
Fixed In Version:
Doc Type:	No Doc Update
Doc Text:
Clone Of:
Environment:
Last Closed:	2021-07-27 22:43:10 UTC
Target Upstream Version:
Embargoed:

Attachments	(Terms of Use)
bootstrap gather (4.24 MB, application/gzip) 2021-02-09 20:37 UTC, Matthew Staebler	no flags	Details
in-cluster CCO log (63.63 KB, text/plain) 2021-02-09 20:41 UTC, Matthew Staebler	no flags	Details
openshift-machine-api-aws cred request (3.77 KB, text/plain) 2021-02-09 20:47 UTC, Matthew Staebler	no flags	Details
View All

Links
System	ID	Priority	Status	Summary	Last Updated
Github	openshift cloud-credential-operator pull 310	None	open	Bug 1927017: Context cancel on stopped leading	2021-03-17 17:50:36 UTC
Github	openshift cloud-credential-operator pull 320	None	open	[WIP] Bug 1927017: Drop config map controller	2021-04-08 16:08:42 UTC
Red Hat Product Errata	RHSA-2021:2438	None	None	None	2021-07-27 22:43:32 UTC

Description Matthew Staebler 2021-02-09 20:37:17 UTC

Created attachment 1756045 [details]
bootstrap gather

When the CCO pod is restarted due to a proxy CA change, the old pod does not relinquish its leader election. This results in an 8-minute delay before the new pod can service credentials requests.

Old pod
> time="2021-02-09T20:13:57Z" level=info msg="became leader" id=6706a11c-b8b2-409f-b9bb-3d26b65685b3
> <snip>
> time="2021-02-09T20:19:18Z" level=info msg="Proxy CA configmap change detected, restarting pod" configmap=openshift-cloud-credential-operator/cco-trusted-ca controller=configmap

New pod
> time="2021-02-09T20:19:20Z" level=info msg="generated leader election ID" id=0c26d32c-3526-43e9-9884-e8aacf4ed20c
> I0209 20:19:20.003479       1 leaderelection.go:243] attempting to acquire leader lease  openshift-cloud-credential-operator/cloud-credential-operator-leader...
> time="2021-02-09T20:19:20Z" level=info msg="current leader: 6706a11c-b8b2-409f-b9bb-3d26b65685b3"

The consequence of this is that the installation takes longer as the worker machines cannot be started during the CCO downtime.

Comment 1 Matthew Staebler 2021-02-09 20:41:47 UTC

Created attachment 1756047 [details]
in-cluster CCO log

From the in-cluster CCO,
> time="2021-02-09T20:21:36Z" level=info msg="generated leader election ID" id=7fca7590-766b-4709-a7be-104708a1260f
> I0209 20:21:36.858036       1 leaderelection.go:243] attempting to acquire leader lease  openshift-cloud-credential-operator/cloud-credential-operator-leader...
> time="2021-02-09T20:21:37Z" level=info msg="current leader: 6706a11c-b8b2-409f-b9bb-3d26b65685b3"
> I0209 20:30:05.713252       1 leaderelection.go:253] successfully acquired lease openshift-cloud-credential-operator/cloud-credential-operator-leader
> time="2021-02-09T20:30:05Z" level=info msg="became leader" id=7fca7590-766b-4709-a7be-104708a1260f

Comment 2 Matthew Staebler 2021-02-09 20:47:22 UTC

Created attachment 1756048 [details]
openshift-machine-api-aws cred request

Of particular interest to me is the openshift-machine-api-aws cred request. The cred request was created at 2021-02-09T20:12:17Z, but was not fulfilled until 2021-02-09T20:30:14Z.

> time="2021-02-09T20:30:14Z" level=info msg="secret created successfully" actuator=aws cr=openshift-cloud-credential-operator/openshift-machine-api-aws targetSecret=openshift-machine-api/aws-cloud-credentials

Comment 3 Devan Goodwin 2021-03-15 18:35:49 UTC

via Abhinav: we are likely using os.Exit when we detect the proxy CA change which would bypass our leader election lease release.

We should have a global context that gets cancelled instead to allow the lease release to execute.

Comment 6 wang lin 2021-03-25 07:40:09 UTC

The issue didn't fix. When I changed proxy CA, the cco pod didn't restart.

version: 
$ oc get clusterversion
NAME      VERSION                             AVAILABLE   PROGRESSING   SINCE   STATUS
version   4.8.0-0.nightly-2021-03-24-200346   True        False         33m     Cluster version is 4.8.0-0.nightly-2021-03-24-200346

######logs######
time="2021-03-25T07:01:39Z" level=info msg="Proxy CA configmap change detected, restarting pod" configmap=openshift-cloud-credential-operator/cco-trusted-ca controller=configmap
time="2021-03-25T07:01:54Z" level=info msg="calculating metrics for all CredentialsRequests" controller=metrics
time="2021-03-25T07:01:54Z" level=info msg="reconcile complete" controller=metrics elapsed=2.498132ms
time="2021-03-25T07:01:55Z" level=info msg="reconciling clusteroperator status"
time="2021-03-25T07:01:55Z" level=info msg="clusteroperator status updated" controller=status
W0325 07:02:40.994149       1 warnings.go:67] admissionregistration.k8s.io/v1beta1 MutatingWebhookConfiguration is deprecated in v1.16+, unavailable in v1.22+; use admissionregistration.k8s.io/v1 MutatingWebhookConfiguration
time="2021-03-25T07:03:54Z" level=info msg="calculating metrics for all CredentialsRequests" controller=metrics
time="2021-03-25T07:03:54Z" level=info msg="reconcile complete" controller=metrics elapsed=2.38714ms
time="2021-03-25T07:05:54Z" level=info msg="calculating metrics for all CredentialsRequests" controller=metrics
time="2021-03-25T07:05:54Z" level=info msg="reconcile complete" controller=metrics elapsed=2.523944ms
time="2021-03-25T07:06:18Z" level=info msg="Proxy CA configmap change detected, restarting pod" configmap=openshift-cloud-credential-operator/cco-trusted-ca controller=configmap
time="2021-03-25T07:06:55Z" level=info msg="reconciling clusteroperator status"
time="2021-03-25T07:06:55Z" level=info msg="clusteroperator status updated" controller=status
time="2021-03-25T07:07:54Z" level=info msg="calculating metrics for all CredentialsRequests" controller=metrics
time="2021-03-25T07:07:54Z" level=info msg="reconcile complete" controller=metrics elapsed=2.539316ms
time="2021-03-25T07:09:54Z" level=info msg="calculating metrics for all CredentialsRequests" controller=metrics
time="2021-03-25T07:09:54Z" level=info msg="reconcile complete" controller=metrics elapsed=2.433577ms
time="2021-03-25T07:11:54Z" level=info msg="calculating metrics for all CredentialsRequests" controller=metrics
time="2021-03-25T07:11:54Z" level=info msg="reconcile complete" controller=metrics elapsed=2.361834ms
time="2021-03-25T07:11:55Z" level=info msg="reconciling clusteroperator status"

Comment 8 wang lin 2021-04-19 01:37:49 UTC

Verified on 4.8.0-0.nightly-2021-04-15-202330.

1. Change proxy CA, the old pod will detect the changes and restart
old pod
I0416 12:38:46.968666       1 observer_polling.go:120] Observed file "/var/run/configmaps/trusted-ca-bundle/tls-ca-bundle.pem" has been modified (old="e0c433f773e598341811fd40d74c581fbfe04f864739e80ea763c9e1291f6c5d", new="7336a74c27fc8c30928e9c8f8f275ec4b656221a18ba15aab4aeda42f546da1d")
time="2021-04-16T12:38:46Z" level=info msg="Proxy CA configmap change detected, restarting pod"
time="2021-04-16T12:38:46Z" level=info msg="leader lost" id=9b649244-8804-4c66-aabd-a3a0b1953d34

2. The new pod can become the leader immediately without waiting 8 minutes.
new pod
Copying system trust bundle
time="2021-04-16T12:38:47Z" level=info msg="setting up client for manager"
time="2021-04-16T12:38:47Z" level=info msg="running file observer" file=/var/run/configmaps/trusted-ca-bundle/tls-ca-bundle.pem
I0416 12:38:47.554489       1 observer_polling.go:159] Starting file observer
time="2021-04-16T12:38:47Z" level=info msg="generated leader election ID" id=7e23979a-7897-47c3-9ab2-ab47e084895a
I0416 12:38:47.556316       1 leaderelection.go:243] attempting to acquire leader lease  openshift-cloud-credential-operator/cloud-credential-operator-leader...
I0416 12:38:47.568094       1 leaderelection.go:253] successfully acquired lease openshift-cloud-credential-operator/cloud-credential-operator-leader
time="2021-04-16T12:38:47Z" level=info msg="became leader" id=7e23979a-7897-47c3-9ab2-ab47e084895a
time="2021-04-16T12:38:47Z" level=info msg="setting up manager"
I0416 12:38:48.618720       1 request.go:645] Throttling request took 1.046397293s, request: GET:https://172.30.0.1:443/apis/monitoring.coreos.com/v1?timeout=32s

Comment 11 errata-xmlrpc 2021-07-27 22:43:10 UTC

Since the problem described in this bug report should be
resolved in a recent advisory, it has been closed with a
resolution of ERRATA.

For information on the advisory (Moderate: OpenShift Container Platform 4.8.2 bug fix and security update), and where to find the updated
files, follow the link below.

If the solution does not work for you, open a new bug report.

https://access.redhat.com/errata/RHSA-2021:2438

Note You need to log in before you can comment on or make changes to this bug.