Description of problem:
In 4.6 the leader election code in the CCO was reworked to write to etcd less often. However, testing did not cover modifications to the Deployment, where the new pod comes up before the old one terminates, determines that someone else is the leader, and then waits the full renew deadline of 270 seconds before checking again.

Version-Release number of selected component (if applicable):
4.6.0-fc.7

How reproducible:
Appears 100%, though timing may occasionally keep it from surfacing.

Steps to Reproduce:
1. Scale down the CVO so we can manually modify the CCO Deployment:
   kubectl scale -n openshift-cluster-version deployment.v1.apps/cluster-version-operator --replicas=0
2. kubectl edit deployment cloud-credential-operator
3. Add a label to spec.template.metadata.labels to trigger a new Deployment rollout.
4. Check the logs on the newly created pod.

Actual results:
time="2020-09-24T18:43:18Z" level=info msg="setting up client for manager"
time="2020-09-24T18:43:18Z" level=info msg="generated leader election ID" id=930ff8bb-b508-4cae-ad02-7c32a6429a1c
I0924 18:43:18.296341 1 leaderelection.go:243] attempting to acquire leader lease openshift-cloud-credential-operator/cloud-credential-operator-leader...
time="2020-09-24T18:43:18Z" level=info msg="current leader: 54d703bc-d52e-4620-b09d-05b7dc888e1e"

The pod stalls here for 270 seconds.

Expected results:
The new pod should be able to run immediately, since the old pod properly releases the lock when it is terminated.

Additional info:
This is likely related to the use of the default Deployment strategy, RollingUpdate, which brings up the new pod before the old one is terminated, causing this issue. Using the Recreate strategy, where old pods are terminated and then replaced, would fix the issue and seems like a valid choice for an operator.
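A minimal sketch of the change suggested in Additional info, applied to the CCO Deployment spec (assuming the CVO is scaled down so the edit is not reverted):

```yaml
# Sketch: switch the Deployment from the default RollingUpdate strategy
# to Recreate, so the old pod terminates (and releases its leader lease)
# before the new pod starts.
spec:
  strategy:
    type: Recreate
```

Note that any existing spec.strategy.rollingUpdate parameters must be removed in the same edit, since the API server rejects a Deployment that sets rollingUpdate while the strategy type is Recreate.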
The issue has been fixed.

test payload:

test result:
After the old pod is terminated, the new pod comes up promptly and does not need to wait the full renew deadline.

time="2020-10-09T05:18:53Z" level=info msg="setting up client for manager"
time="2020-10-09T05:18:53Z" level=info msg="generated leader election ID" id=c9a77536-ddb0-46ae-804c-43d23c3d5b9f
I1009 05:18:53.064847 1 leaderelection.go:243] attempting to acquire leader lease openshift-cloud-credential-operator/cloud-credential-operator-leader...
time="2020-10-09T05:18:53Z" level=info msg="became leader" id=c9a77536-ddb0-46ae-804c-43d23c3d5b9f
I1009 05:18:53.080342 1 leaderelection.go:253] successfully acquired lease openshift-cloud-credential-operator/cloud-credential-operator-leader
time="2020-10-09T05:18:53Z" level=info msg="setting up manager"
I1009 05:18:54.130507 1 request.go:645] Throttling request took 1.045619565s, request: GET:https://172.30.0.1:443/apis/admissionregistration.k8s.io/v1beta1?timeout=32s
time="2020-10-09T05:18:55Z" level=info msg="registering components"
time="2020-10-09T05:18:55Z" level=info msg="setting up scheme"
time="2020-10-09T05:18:55Z" level=info msg="setting up controller"
time="2020-10-09T05:18:55Z" level=info msg="Setting up secret annotator. Platform Type is AWS"
time="2020-10-09T05:18:55Z" level=info msg="setting up AWS pod identity controller"
time="2020-10-09T05:18:57Z" level=info msg="setting up AWS OIDC Discovery Endpoint Controller"
time="2020-10-09T05:19:00Z" level=info msg="initializing AWS actuator"
time="2020-10-09T05:19:00Z" level=info msg="starting the cmd"
time="2020-10-09T05:19:00Z" level=info msg="requeueing all CredentialsRequests"
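As a toy illustration (not the actual client-go leader-election code) of why terminating the old pod first removes the 270-second stall, the timing difference can be modeled with a hypothetical helper:

```python
# Toy model of leader-lease acquisition timing; acquire_delay is a
# hypothetical helper for illustration, not a real client-go API.
RENEW_DEADLINE_SECONDS = 270  # renew deadline from the bug report


def acquire_delay(old_leader_released: bool) -> int:
    """Seconds a new pod waits before it can acquire the leader lease.

    If the previous leader released the lease on shutdown (Recreate:
    the old pod exits before the new one starts), the new pod acquires
    immediately. If the old pod still holds the lease when the new pod
    starts (RollingUpdate), the new pod waits out the renew deadline.
    """
    return 0 if old_leader_released else RENEW_DEADLINE_SECONDS


# RollingUpdate: new pod starts while the old leader still holds the lease.
print(acquire_delay(old_leader_released=False))  # 270
# Recreate: old pod was terminated and released the lease first.
print(acquire_delay(old_leader_released=True))   # 0
```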
Since the problem described in this bug report should be resolved in a recent advisory, it has been closed with a resolution of ERRATA. For information on the advisory (OpenShift Container Platform 4.6 GA Images), and where to find the updated files, follow the link below. If the solution does not work for you, open a new bug report. https://access.redhat.com/errata/RHBA-2020:4196