Bug 2052700

Summary: kube-controller-manager should use configmap lease
Product: OpenShift Container Platform
Reporter: ravig <rgudimet>
Component: kube-controller-manager
Assignee: Filip Krepinsky <fkrepins>
Status: CLOSED ERRATA
QA Contact: RamaKasturi <knarra>
Severity: high
Priority: high
Version: 4.11
CC: jchaloup, knarra, maszulik, mfojtik, yinzhou
Flags: mfojtik: needinfo?
Target Release: 4.11.0
Hardware: Unspecified
OS: Unspecified
Whiteboard: EmergencyRequest
Doc Type: No Doc Update
Clone Of: 2052599
Last Closed: 2022-08-10 10:48:34 UTC
Bug Blocks: 2052599    

Comment 1 Michal Fojtik 2022-02-09 20:09:40 UTC
** A NOTE ABOUT USING URGENT **

This BZ has been set to urgent severity and priority. When a BZ is marked urgent priority, engineers are asked to stop whatever they are doing and put everything else on hold.
Please have a reasonable justification ready to discuss, and ensure your own management and engineering management are aware of and agree that this BZ is urgent. Keep in mind that urgent bugs are very expensive and have maximal management visibility.

NOTE: This bug was automatically assigned to an engineering manager with the severity reset to *unspecified* until the emergency is vetted and confirmed. Please do not manually override the severity.

** INFORMATION REQUIRED **

Please answer these questions before escalation to engineering:

1. Has a link to must-gather output been provided in this BZ? We cannot work without it. If must-gather fails to run, attach all relevant logs and provide the error message of must-gather. (A sketch of collection commands follows this list.)
2. Provide the output of "oc get clusteroperators -o yaml".
3. In case of degraded/unavailable operators, have all their logs and the logs of the operands been analyzed? [yes/no]
4. List the top 5 relevant errors from the logs of the operators and operands in (3).
5. Order the list of degraded/unavailable operators according to which is likely the cause of the failure of the other, root-cause at the top.
6. Explain why (5) is likely the right order and list the information used for that assessment.
7. Explain why Engineering is necessary to make progress.
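A minimal sketch of commands that could collect the data requested above (the destination file names, and the choice of the kube-controller-manager operator as the component of interest, are illustrative assumptions, not part of the template):

$ oc adm must-gather --dest-dir=./must-gather                                  # question 1
$ oc get clusteroperators -o yaml > clusteroperators.yaml                      # question 2
$ oc logs -n openshift-kube-controller-manager-operator deployment/kube-controller-manager-operator > kcm-operator.log   # questions 3-4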

Comment 4 RamaKasturi 2022-02-10 12:08:27 UTC
Verified with the build below; I see that 4.11 uses both the old configmap-based election and the new lease-based election.

A further check shows that the openshift-kube-controller-manager-operator namespace has both locks:

[knarra@knarra ~]$ oc get clusterversion
NAME      VERSION                              AVAILABLE   PROGRESSING   SINCE   STATUS
version   4.11.0-0.nightly-2022-02-10-031822   True        False         116m    Cluster version is 4.11.0-0.nightly-2022-02-10-031822

[knarra@knarra ~]$ oc get cm -n openshift-kube-controller-manager-operator | grep lock
kube-controller-manager-operator-lock     0      149m

[knarra@knarra ~]$ oc get lease -n openshift-kube-controller-manager-operator
NAME                                    HOLDER                                                                                   AGE
kube-controller-manager-operator-lock   kube-controller-manager-operator-85c56dc8f6-tzkph_da44af84-823a-49cf-8ba2-b257161de5db   149m

[knarra@knarra ~]$ oc get lease kube-controller-manager-operator-lock -n openshift-kube-controller-manager-operator -o yaml
apiVersion: coordination.k8s.io/v1
kind: Lease
metadata:
  creationTimestamp: "2022-02-10T09:26:30Z"
  name: kube-controller-manager-operator-lock
  namespace: openshift-kube-controller-manager-operator
  resourceVersion: "73303"
  uid: 58fcaeef-6414-4905-993a-4dc4dab79d87
spec:
  acquireTime: "2022-02-10T09:37:22.000000Z"
  holderIdentity: kube-controller-manager-operator-85c56dc8f6-tzkph_da44af84-823a-49cf-8ba2-b257161de5db
  leaseDurationSeconds: 137
  leaseTransitions: 2
  renewTime: "2022-02-10T11:57:05.095208Z"

Both the configmap lock and the lease lock are present.
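As an additional cross-check (a sketch; it assumes oc jsonpath dot-escaping and the standard client-go leader-election annotation key, control-plane.alpha.kubernetes.io/leader), the holder identity recorded in the configmap lock should match the lease's holderIdentity:

$ oc get cm kube-controller-manager-operator-lock -n openshift-kube-controller-manager-operator -o jsonpath='{.metadata.annotations.control-plane\.alpha\.kubernetes\.io/leader}'
$ oc get lease kube-controller-manager-operator-lock -n openshift-kube-controller-manager-operator -o jsonpath='{.spec.holderIdentity}'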

To confirm that this behavior is new in 4.11, compare it with OCP 4.9:

[knarra@knarra ~]$ oc get clusterversion
NAME      VERSION   AVAILABLE   PROGRESSING   SINCE   STATUS
version   4.9.21    True        False         112m    Cluster version is 4.9.21

[knarra@knarra ~]$  oc get cm -n openshift-kube-controller-manager-operator | grep lock
kube-controller-manager-operator-lock     0      18h

[knarra@knarra ~]$ oc get lease -n openshift-kube-controller-manager-operator
No resources found in openshift-kube-controller-manager-operator namespace.

Only the configmap lock exists.

To check the openshift-kube-controller-manager-operator logs for both locks: on 4.9 & 4.10, run the command below to set the log level to "TraceAll".
# oc edit kubecontrollermanager cluster
# change logLevel to "TraceAll"

Wait for the openshift-kube-controller-manager pods to restart, delete the pod in the openshift-kube-controller-manager-operator namespace, and wait for it to be recreated. Once it is up, check the operator logs for both locks (a command sketch follows below).
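A non-interactive sketch of the same steps (the pod label selector and the grep pattern are assumptions; note that the operator API also exposes spec.operatorLogLevel for the operator pod's own verbosity):

$ oc patch kubecontrollermanager cluster --type=merge -p '{"spec":{"logLevel":"TraceAll"}}'
$ oc delete pod -n openshift-kube-controller-manager-operator -l app=kube-controller-manager-operator
$ oc logs -n openshift-kube-controller-manager-operator deployment/kube-controller-manager-operator | grep -E 'Kind:"(ConfigMap|Lease)".*became leader'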

4.11 kube-controller-manager-operator pod logs:
====================================================
I0210 12:01:07.837798       1 leaderelection.go:258] successfully acquired lease openshift-kube-controller-manager-operator/kube-controller-manager-operator-lock
I0210 12:01:07.837893       1 event.go:285] Event(v1.ObjectReference{Kind:"ConfigMap", Namespace:"openshift-kube-controller-manager-operator", Name:"kube-controller-manager-operator-lock", UID:"007815b7-cd08-407f-a9f5-d4060d2ab4fa", APIVersion:"v1", ResourceVersion:"74784", FieldPath:""}): type: 'Normal' reason: 'LeaderElection' kube-controller-manager-operator-85c56dc8f6-fdkh5_02b8c469-7f8f-4154-ab33-632e1b0905ae became leader

I0210 12:01:07.837931       1 event.go:285] Event(v1.ObjectReference{Kind:"Lease", Namespace:"openshift-kube-controller-manager-operator", Name:"kube-controller-manager-operator-lock", UID:"58fcaeef-6414-4905-993a-4dc4dab79d87", APIVersion:"coordination.k8s.io/v1", ResourceVersion:"74785", FieldPath:""}): type: 'Normal' reason: 'LeaderElection' kube-controller-manager-operator-85c56dc8f6-fdkh5_02b8c469-7f8f-4154-ab33-632e1b0905ae became leader


4.9 kube-controller-manager operator pod logs:
==============================================
I0210 10:34:59.866035       1 genericapiserver.go:378] [graceful-termination] waiting for shutdown to be initiated
I0210 10:34:59.866048       1 tlsconfig.go:240] "Starting DynamicServingCertificateController"
I0210 10:34:59.869183       1 leaderelection.go:258] successfully acquired lease openshift-kube-scheduler-operator/openshift-cluster-kube-scheduler-operator-lock
I0210 10:34:59.869681       1 event.go:282] Event(v1.ObjectReference{Kind:"ConfigMap", Namespace:"openshift-kube-scheduler-operator", Name:"openshift-cluster-kube-scheduler-operator-lock", UID:"e4f2e90e-9497-4e24-aa85-a9050b3f0401", APIVersion:"v1", ResourceVersion:"348553", FieldPath:""}): type: 'Normal' reason: 'LeaderElection' openshift-kube-scheduler-operator-58797bc45d-ph65k_ac1b6de0-fe54-4f67-913d-c7100becff23 became leader

I0210 10:34:59.891055       1 event.go:282] Event(v1.ObjectReference{Kind:"Deployment", Namespace:"openshift-kube-scheduler-operator", Name:"openshift-kube-scheduler-operator", UID:"17b05e11-9702-4213-b520-a7b125c1502c", APIVersion:"apps/v1", ResourceVersion:"", FieldPath:""}): type: 'Warning' reason: 'FastControllerResync' Controller "RevisionController" resync interval is set to 0s which might lead to client request throttling

The 4.9 logs do not show the configmap+lease lock; only the configmap lock is in use.

Based on the above, moving the bug to the VERIFIED state.

Comment 8 errata-xmlrpc 2022-08-10 10:48:34 UTC
Since the problem described in this bug report should be
resolved in a recent advisory, it has been closed with a
resolution of ERRATA.

For information on the advisory (Important: OpenShift Container Platform 4.11.0 bug fix and security update), and where to find the updated
files, follow the link below.

If the solution does not work for you, open a new bug report.

https://access.redhat.com/errata/RHSA-2022:5069