Created attachment 1756045 [details]
bootstrap gather

When the CCO pod is restarted due to a proxy CA change, the old pod does not relinquish its leader election lease. This results in an 8-minute delay before the new pod can service CredentialsRequests.

Old pod:
> time="2021-02-09T20:13:57Z" level=info msg="became leader" id=6706a11c-b8b2-409f-b9bb-3d26b65685b3
> <snip>
> time="2021-02-09T20:19:18Z" level=info msg="Proxy CA configmap change detected, restarting pod" configmap=openshift-cloud-credential-operator/cco-trusted-ca controller=configmap

New pod:
> time="2021-02-09T20:19:20Z" level=info msg="generated leader election ID" id=0c26d32c-3526-43e9-9884-e8aacf4ed20c
> I0209 20:19:20.003479 1 leaderelection.go:243] attempting to acquire leader lease openshift-cloud-credential-operator/cloud-credential-operator-leader...
> time="2021-02-09T20:19:20Z" level=info msg="current leader: 6706a11c-b8b2-409f-b9bb-3d26b65685b3"

The consequence is that installation takes longer, as the worker machines cannot be started during the CCO downtime.
Created attachment 1756047 [details]
in-cluster CCO log

From the in-cluster CCO:

> time="2021-02-09T20:21:36Z" level=info msg="generated leader election ID" id=7fca7590-766b-4709-a7be-104708a1260f
> I0209 20:21:36.858036 1 leaderelection.go:243] attempting to acquire leader lease openshift-cloud-credential-operator/cloud-credential-operator-leader...
> time="2021-02-09T20:21:37Z" level=info msg="current leader: 6706a11c-b8b2-409f-b9bb-3d26b65685b3"
> I0209 20:30:05.713252 1 leaderelection.go:253] successfully acquired lease openshift-cloud-credential-operator/cloud-credential-operator-leader
> time="2021-02-09T20:30:05Z" level=info msg="became leader" id=7fca7590-766b-4709-a7be-104708a1260f
Created attachment 1756048 [details]
openshift-machine-api-aws cred request

Of particular interest to me is the openshift-machine-api-aws CredentialsRequest. It was created at 2021-02-09T20:12:17Z, but was not fulfilled until 2021-02-09T20:30:14Z:

> time="2021-02-09T20:30:14Z" level=info msg="secret created successfully" actuator=aws cr=openshift-cloud-credential-operator/openshift-machine-api-aws targetSecret=openshift-machine-api/aws-cloud-credentials
via Abhinav: we are likely calling os.Exit when we detect the proxy CA change, which bypasses the release of our leader election lease. We should instead have a global context that gets cancelled, so that the lease release can execute.
The issue is not fixed. When I changed the proxy CA, the CCO pod did not restart.

version:
$ oc get clusterversion
NAME      VERSION                             AVAILABLE   PROGRESSING   SINCE   STATUS
version   4.8.0-0.nightly-2021-03-24-200346   True        False         33m     Cluster version is 4.8.0-0.nightly-2021-03-24-200346

###### logs ######
time="2021-03-25T07:01:39Z" level=info msg="Proxy CA configmap change detected, restarting pod" configmap=openshift-cloud-credential-operator/cco-trusted-ca controller=configmap
time="2021-03-25T07:01:54Z" level=info msg="calculating metrics for all CredentialsRequests" controller=metrics
time="2021-03-25T07:01:54Z" level=info msg="reconcile complete" controller=metrics elapsed=2.498132ms
time="2021-03-25T07:01:55Z" level=info msg="reconciling clusteroperator status"
time="2021-03-25T07:01:55Z" level=info msg="clusteroperator status updated" controller=status
W0325 07:02:40.994149 1 warnings.go:67] admissionregistration.k8s.io/v1beta1 MutatingWebhookConfiguration is deprecated in v1.16+, unavailable in v1.22+; use admissionregistration.k8s.io/v1 MutatingWebhookConfiguration
time="2021-03-25T07:03:54Z" level=info msg="calculating metrics for all CredentialsRequests" controller=metrics
time="2021-03-25T07:03:54Z" level=info msg="reconcile complete" controller=metrics elapsed=2.38714ms
time="2021-03-25T07:05:54Z" level=info msg="calculating metrics for all CredentialsRequests" controller=metrics
time="2021-03-25T07:05:54Z" level=info msg="reconcile complete" controller=metrics elapsed=2.523944ms
time="2021-03-25T07:06:18Z" level=info msg="Proxy CA configmap change detected, restarting pod" configmap=openshift-cloud-credential-operator/cco-trusted-ca controller=configmap
time="2021-03-25T07:06:55Z" level=info msg="reconciling clusteroperator status"
time="2021-03-25T07:06:55Z" level=info msg="clusteroperator status updated" controller=status
time="2021-03-25T07:07:54Z" level=info msg="calculating metrics for all CredentialsRequests" controller=metrics
time="2021-03-25T07:07:54Z" level=info msg="reconcile complete" controller=metrics elapsed=2.539316ms
time="2021-03-25T07:09:54Z" level=info msg="calculating metrics for all CredentialsRequests" controller=metrics
time="2021-03-25T07:09:54Z" level=info msg="reconcile complete" controller=metrics elapsed=2.433577ms
time="2021-03-25T07:11:54Z" level=info msg="calculating metrics for all CredentialsRequests" controller=metrics
time="2021-03-25T07:11:54Z" level=info msg="reconcile complete" controller=metrics elapsed=2.361834ms
time="2021-03-25T07:11:55Z" level=info msg="reconciling clusteroperator status"
Verified on 4.8.0-0.nightly-2021-04-15-202330.

1. After changing the proxy CA, the old pod detects the change and restarts.

old pod:
I0416 12:38:46.968666 1 observer_polling.go:120] Observed file "/var/run/configmaps/trusted-ca-bundle/tls-ca-bundle.pem" has been modified (old="e0c433f773e598341811fd40d74c581fbfe04f864739e80ea763c9e1291f6c5d", new="7336a74c27fc8c30928e9c8f8f275ec4b656221a18ba15aab4aeda42f546da1d")
time="2021-04-16T12:38:46Z" level=info msg="Proxy CA configmap change detected, restarting pod"
time="2021-04-16T12:38:46Z" level=info msg="leader lost" id=9b649244-8804-4c66-aabd-a3a0b1953d34

2. The new pod becomes the leader immediately, without waiting 8 minutes.

new pod:
Copying system trust bundle
time="2021-04-16T12:38:47Z" level=info msg="setting up client for manager"
time="2021-04-16T12:38:47Z" level=info msg="running file observer" file=/var/run/configmaps/trusted-ca-bundle/tls-ca-bundle.pem
I0416 12:38:47.554489 1 observer_polling.go:159] Starting file observer
time="2021-04-16T12:38:47Z" level=info msg="generated leader election ID" id=7e23979a-7897-47c3-9ab2-ab47e084895a
I0416 12:38:47.556316 1 leaderelection.go:243] attempting to acquire leader lease openshift-cloud-credential-operator/cloud-credential-operator-leader...
I0416 12:38:47.568094 1 leaderelection.go:253] successfully acquired lease openshift-cloud-credential-operator/cloud-credential-operator-leader
time="2021-04-16T12:38:47Z" level=info msg="became leader" id=7e23979a-7897-47c3-9ab2-ab47e084895a
time="2021-04-16T12:38:47Z" level=info msg="setting up manager"
I0416 12:38:48.618720 1 request.go:645] Throttling request took 1.046397293s, request: GET:https://172.30.0.1:443/apis/monitoring.coreos.com/v1?timeout=32s
Since the problem described in this bug report should be resolved in a recent advisory, it has been closed with a resolution of ERRATA.

For information on the advisory (Moderate: OpenShift Container Platform 4.8.2 bug fix and security update), and where to find the updated files, follow the link below. If the solution does not work for you, open a new bug report.

https://access.redhat.com/errata/RHSA-2021:2438