Bug 1882505 - CCO Leader Election Stalls on Deployment Rollouts
Summary: CCO Leader Election Stalls on Deployment Rollouts
Keywords:
Status: CLOSED ERRATA
Alias: None
Product: OpenShift Container Platform
Classification: Red Hat
Component: Cloud Credential Operator
Version: 4.6
Hardware: Unspecified
OS: Unspecified
Priority: medium
Severity: medium
Target Milestone: ---
Target Release: 4.6.0
Assignee: Devan Goodwin
QA Contact: wang lin
URL:
Whiteboard:
Depends On:
Blocks:
 
Reported: 2020-09-24 18:44 UTC by Devan Goodwin
Modified: 2020-10-27 16:45 UTC
CC List: 1 user

Fixed In Version:
Doc Type: If docs needed, set a value
Doc Text:
Clone Of:
Environment:
Last Closed: 2020-10-27 16:45:23 UTC
Target Upstream Version:
Embargoed:




Links
System ID Private Priority Status Summary Last Updated
Github openshift cloud-credential-operator pull 251 0 None closed Bug 1882505: Fix stalled leader election after Deployment updates. 2021-02-04 07:28:33 UTC
Red Hat Product Errata RHBA-2020:4196 0 None None None 2020-10-27 16:45:45 UTC

Description Devan Goodwin 2020-09-24 18:44:44 UTC
Description of problem:

In 4.6 the leader election code in the CCO was reworked to write to etcd less frequently.

However, testing did not cover modifications to the Deployment: during a rollout the new pod comes up before the old one is terminated, determines that someone else is the leader, and then waits the full renew deadline of 270 seconds before checking again.
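For reference, a minimal sketch of the acquire/release behavior using the stock client-go leader election package (the CCO's actual 4.6 implementation differs; the lease name and namespace are taken from the logs below, the durations are illustrative):

package main

import (
	"context"
	"log"
	"os"
	"os/signal"
	"syscall"
	"time"

	metav1 "k8s.io/apimachinery/pkg/apis/meta/v1"
	"k8s.io/client-go/kubernetes"
	"k8s.io/client-go/rest"
	"k8s.io/client-go/tools/leaderelection"
	"k8s.io/client-go/tools/leaderelection/resourcelock"
)

func main() {
	cfg, err := rest.InClusterConfig()
	if err != nil {
		log.Fatal(err)
	}
	client := kubernetes.NewForConfigOrDie(cfg)
	id, _ := os.Hostname()

	// Lock name and namespace match the lease seen in the logs below.
	lock := &resourcelock.LeaseLock{
		LeaseMeta: metav1.ObjectMeta{
			Namespace: "openshift-cloud-credential-operator",
			Name:      "cloud-credential-operator-leader",
		},
		Client:     client.CoordinationV1(),
		LockConfig: resourcelock.ResourceLockConfig{Identity: id},
	}

	// Cancel the context on SIGTERM so the lock can be released during a rollout.
	ctx, stop := signal.NotifyContext(context.Background(), os.Interrupt, syscall.SIGTERM)
	defer stop()

	leaderelection.RunOrDie(ctx, leaderelection.LeaderElectionConfig{
		Lock: lock,
		// Long lease/renew values reduce etcd writes, but a candidate that
		// finds the lease still held by a terminated pod has to wait for it
		// to expire before it can take over.
		LeaseDuration: 300 * time.Second,
		RenewDeadline: 270 * time.Second,
		RetryPeriod:   30 * time.Second,
		// Releasing the lock on shutdown is what lets a replacement pod
		// acquire leadership immediately instead of waiting out the deadline.
		ReleaseOnCancel: true,
		Callbacks: leaderelection.LeaderCallbacks{
			OnStartedLeading: func(ctx context.Context) {
				log.Println("became leader")
				<-ctx.Done()
			},
			OnStoppedLeading: func() {
				log.Println("lost leadership")
			},
		},
	})
}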

Version-Release number of selected component (if applicable):

4.6.0-fc.7

How reproducible:

Appears to be 100%, though timing might occasionally keep it from surfacing.

Steps to Reproduce:
1. Scale down the CVO so we can manually modify the CCO Deployment: kubectl scale -n openshift-cluster-version deployment.v1.apps/cluster-version-operator --replicas=0
2. kubectl edit deployment cloud-credential-operator
3. Add a label to spec.template.metadata.labels to trigger a new Deployment rollout.
4. Check the logs on the newly created pod:


Actual results:

time="2020-09-24T18:43:18Z" level=info msg="setting up client for manager"
time="2020-09-24T18:43:18Z" level=info msg="generated leader election ID" id=930ff8bb-b508-4cae-ad02-7c32a6429a1c
I0924 18:43:18.296341       1 leaderelection.go:243] attempting to acquire leader lease  openshift-cloud-credential-operator/cloud-credential-operator-leader...
time="2020-09-24T18:43:18Z" level=info msg="current leader: 54d703bc-d52e-4620-b09d-05b7dc888e1e"

Pod will stall here for 270 seconds.


Expected results:

The new pod should be able to run immediately, because the old one properly releases the lock when it is terminated.

Additional info:

This is likely related to the use of the default Deployment strategy, RollingUpdate, which brings up the new pod before the old one is terminated, causing this issue. Using the Recreate strategy, where old pods are terminated before being replaced, would fix the issue and seems like a valid choice for an operator.
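Purely as an illustration of the Recreate suggestion above (not necessarily what the eventual fix does), a sketch of switching the operator Deployment's strategy via client-go; the out-of-cluster kubeconfig handling is an assumption:

package main

import (
	"context"
	"fmt"

	appsv1 "k8s.io/api/apps/v1"
	metav1 "k8s.io/apimachinery/pkg/apis/meta/v1"
	"k8s.io/client-go/kubernetes"
	"k8s.io/client-go/tools/clientcmd"
	"k8s.io/client-go/util/retry"
)

func main() {
	// Assumption: running outside the cluster with a local kubeconfig.
	cfg, err := clientcmd.BuildConfigFromFlags("", clientcmd.RecommendedHomeFile)
	if err != nil {
		panic(err)
	}
	deployments := kubernetes.NewForConfigOrDie(cfg).
		AppsV1().Deployments("openshift-cloud-credential-operator")

	// Switch the rollout strategy to Recreate so the old pod is fully
	// terminated (and its leader lock given up) before the new pod starts.
	err = retry.RetryOnConflict(retry.DefaultRetry, func() error {
		d, err := deployments.Get(context.TODO(), "cloud-credential-operator", metav1.GetOptions{})
		if err != nil {
			return err
		}
		d.Spec.Strategy = appsv1.DeploymentStrategy{Type: appsv1.RecreateDeploymentStrategyType}
		_, err = deployments.Update(context.TODO(), d, metav1.UpdateOptions{})
		return err
	})
	if err != nil {
		panic(err)
	}
	fmt.Println("cloud-credential-operator Deployment strategy set to Recreate")
}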

Comment 3 wang lin 2020-10-09 05:24:50 UTC
The issue has been fixed.
test payload: 

test result:
After the old pod is terminated, the new pod comes up quickly and does not need to wait the full renew deadline.

time="2020-10-09T05:18:53Z" level=info msg="setting up client for manager"
time="2020-10-09T05:18:53Z" level=info msg="generated leader election ID" id=c9a77536-ddb0-46ae-804c-43d23c3d5b9f
I1009 05:18:53.064847       1 leaderelection.go:243] attempting to acquire leader lease  openshift-cloud-credential-operator/cloud-credential-operator-leader...
time="2020-10-09T05:18:53Z" level=info msg="became leader" id=c9a77536-ddb0-46ae-804c-43d23c3d5b9f
I1009 05:18:53.080342       1 leaderelection.go:253] successfully acquired lease openshift-cloud-credential-operator/cloud-credential-operator-leader
time="2020-10-09T05:18:53Z" level=info msg="setting up manager"
I1009 05:18:54.130507       1 request.go:645] Throttling request took 1.045619565s, request: GET:https://172.30.0.1:443/apis/admissionregistration.k8s.io/v1beta1?timeout=32s
time="2020-10-09T05:18:55Z" level=info msg="registering components"
time="2020-10-09T05:18:55Z" level=info msg="setting up scheme"
time="2020-10-09T05:18:55Z" level=info msg="setting up controller"
time="2020-10-09T05:18:55Z" level=info msg="Setting up secret annotator. Platform Type is AWS"
time="2020-10-09T05:18:55Z" level=info msg="setting up AWS pod identity controller"
time="2020-10-09T05:18:57Z" level=info msg="setting up AWS OIDC Discovery Endpoint Controller"
time="2020-10-09T05:19:00Z" level=info msg="initializing AWS actuator"
time="2020-10-09T05:19:00Z" level=info msg="starting the cmd"
time="2020-10-09T05:19:00Z" level=info msg="requeueing all CredentialsRequests"

Comment 5 errata-xmlrpc 2020-10-27 16:45:23 UTC
Since the problem described in this bug report should be
resolved in a recent advisory, it has been closed with a
resolution of ERRATA.

For information on the advisory (OpenShift Container Platform 4.6 GA Images), and where to find the updated
files, follow the link below.

If the solution does not work for you, open a new bug report.

https://access.redhat.com/errata/RHBA-2020:4196

