Bug 1882505

Summary: CCO Leader Election Stalls on Deployment Rollouts
Product: OpenShift Container Platform
Reporter: Devan Goodwin <dgoodwin>
Component: Cloud Credential Operator
Assignee: Devan Goodwin <dgoodwin>
Status: CLOSED ERRATA
QA Contact: wang lin <lwan>
Severity: medium
Priority: medium
Version: 4.6
CC: lwan
Target Milestone: ---
Target Release: 4.6.0
Hardware: Unspecified
OS: Unspecified
Last Closed: 2020-10-27 16:45:23 UTC
Type: Bug

Description Devan Goodwin 2020-09-24 18:44:44 UTC
Description of problem:

In 4.6, the leader election code in the CCO was reworked to write to etcd less often.

However, testing did not cover Deployment rollouts: the new pod comes up before the old one terminates, determines that someone else is still the leader, and then waits the full renew deadline of 270 seconds before checking again.

Version-Release number of selected component (if applicable):

4.6.0-fc.7

How reproducible:

Appears to be 100%, though timing could occasionally keep it from surfacing.

Steps to Reproduce:
1. Scale down the CVO so we can manually modify the CCO Deployment: kubectl scale -n openshift-cluster-version deployment.v1.apps/cluster-version-operator --replicas=0
2. kubectl edit deployment cloud-credential-operator
3. Add a label to spec.template.metadata.labels to trigger a new Deployment rollout (a non-interactive patch sketch follows this list).
4. Check the logs on the newly created pod:
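
For step 3, a non-interactive alternative to kubectl edit is to patch a throwaway label onto the pod template; the label key and value here are arbitrary placeholders chosen for illustration:

# Assumes the CCO Deployment lives in openshift-cloud-credential-operator;
# any pod template label change triggers a new rollout.
kubectl -n openshift-cloud-credential-operator patch deployment cloud-credential-operator \
  --type merge \
  -p '{"spec":{"template":{"metadata":{"labels":{"rollout-test":"1"}}}}}'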


Actual results:

time="2020-09-24T18:43:18Z" level=info msg="setting up client for manager"
time="2020-09-24T18:43:18Z" level=info msg="generated leader election ID" id=930ff8bb-b508-4cae-ad02-7c32a6429a1c
I0924 18:43:18.296341       1 leaderelection.go:243] attempting to acquire leader lease  openshift-cloud-credential-operator/cloud-credential-operator-leader...
time="2020-09-24T18:43:18Z" level=info msg="current leader: 54d703bc-d52e-4620-b09d-05b7dc888e1e"

The pod will stall here for 270 seconds.
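
While the new pod is stalled, the identity it logs as the current leader can be cross-checked against the leader election lock object. This report does not say which lock kind the CCO uses, so both are shown:

# The holder identity is recorded either on a ConfigMap or on a coordination.k8s.io Lease,
# depending on which resource lock the operator is configured with.
kubectl -n openshift-cloud-credential-operator get configmap cloud-credential-operator-leader -o yaml
kubectl -n openshift-cloud-credential-operator get lease cloud-credential-operator-leader -o yaml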


Expected results:

The new pod should be able to run immediately, because the old pod properly releases the lock when it is terminated.

Additional info:

This is likely related to the use of the default Deployment strategy, RollingUpdate, which brings up the new pod before the old one is terminated, causing this issue. Using the Recreate strategy, where old pods are terminated before replacements start, would fix the issue and seems like a valid choice for an operator.
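
As a rough sketch (assuming the manifest change ultimately lands in the operator's payload; a by-hand patch like this only sticks while the CVO is scaled down, as in step 1 above), switching the strategy looks like:

# rollingUpdate must be cleared when changing the strategy type to Recreate.
kubectl -n openshift-cloud-credential-operator patch deployment cloud-credential-operator \
  --type merge \
  -p '{"spec":{"strategy":{"type":"Recreate","rollingUpdate":null}}}'

With Recreate, the old pod is fully terminated before its replacement starts, so the new pod no longer races the old one for the lock.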

Comment 3 wang lin 2020-10-09 05:24:50 UTC
The issue has been fixed.
test payload: 

test result:
After the old pod is terminated, the new pod comes up promptly and does not need to wait the full renew deadline.

time="2020-10-09T05:18:53Z" level=info msg="setting up client for manager"
time="2020-10-09T05:18:53Z" level=info msg="generated leader election ID" id=c9a77536-ddb0-46ae-804c-43d23c3d5b9f
I1009 05:18:53.064847       1 leaderelection.go:243] attempting to acquire leader lease  openshift-cloud-credential-operator/cloud-credential-operator-leader...
time="2020-10-09T05:18:53Z" level=info msg="became leader" id=c9a77536-ddb0-46ae-804c-43d23c3d5b9f
I1009 05:18:53.080342       1 leaderelection.go:253] successfully acquired lease openshift-cloud-credential-operator/cloud-credential-operator-leader
time="2020-10-09T05:18:53Z" level=info msg="setting up manager"
I1009 05:18:54.130507       1 request.go:645] Throttling request took 1.045619565s, request: GET:https://172.30.0.1:443/apis/admissionregistration.k8s.io/v1beta1?timeout=32s
time="2020-10-09T05:18:55Z" level=info msg="registering components"
time="2020-10-09T05:18:55Z" level=info msg="setting up scheme"
time="2020-10-09T05:18:55Z" level=info msg="setting up controller"
time="2020-10-09T05:18:55Z" level=info msg="Setting up secret annotator. Platform Type is AWS"
time="2020-10-09T05:18:55Z" level=info msg="setting up AWS pod identity controller"
time="2020-10-09T05:18:57Z" level=info msg="setting up AWS OIDC Discovery Endpoint Controller"
time="2020-10-09T05:19:00Z" level=info msg="initializing AWS actuator"
time="2020-10-09T05:19:00Z" level=info msg="starting the cmd"
time="2020-10-09T05:19:00Z" level=info msg="requeueing all CredentialsRequests"

Comment 5 errata-xmlrpc 2020-10-27 16:45:23 UTC
Since the problem described in this bug report should be
resolved in a recent advisory, it has been closed with a
resolution of ERRATA.

For information on the advisory (OpenShift Container Platform 4.6 GA Images), and where to find the updated
files, follow the link below.

If the solution does not work for you, open a new bug report.

https://access.redhat.com/errata/RHBA-2020:4196