Bug 2041554

Summary: use lease for leader election
Product: OpenShift Container Platform Reporter: Sergiusz Urbaniak <surbania>
Component: apiserver-auth Assignee: Emily Moss <emoss>
Status: CLOSED ERRATA QA Contact: Xingxing Xia <xxia>
Severity: high Docs Contact:
Priority: high    
Version: 4.10 CC: akashem, aos-bugs, jsafrane, kewang, maszulik, mfojtik, surbania, wlewis, xxia
Target Milestone: ---   
Target Release: 4.10.0   
Hardware: Unspecified   
OS: Unspecified   
Whiteboard:
Fixed In Version: Doc Type: If docs needed, set a value
Doc Text:
Story Points: ---
Clone Of: 2037856 Environment:
Last Closed: 2022-03-10 16:40:08 UTC Type: ---
Bug Depends On: 2037856, 2042501    
Bug Blocks:    

Comment 1 Sergiusz Urbaniak 2022-01-21 06:40:23 UTC
The etcd-operator PR was erroneously linked here. It should be https://github.com/openshift/cluster-authentication-operator/pull/537 instead.

Comment 3 Xingxing Xia 2022-01-30 04:58:17 UTC
Although there are too many links to read above, I went through them to understand the change. The main links are:
https://github.com/kubernetes/kubernetes/pull/106852
https://github.com/kubernetes/kubernetes/issues/107454
Checked library-go and found that the lib PR is: https://github.com/openshift/library-go/pull/1282 . Reading its code, the only difference is that leaderelection.go now returns ConfigMapsLeasesResourceLock instead of ConfigMapsResourceLock . Checked the latest 4.10.0-0.nightly-2022-01-29-015515 :
$ oc adm release info --commits registry.ci.openshift.org/ocp/release:4.10.0-0.nightly-2022-01-29-015515 | grep authentication-operator
  cluster-authentication-operator  https://github.com/openshift/cluster-authentication-operator 4770445...
Then checked the CAO repo of this bug's PR:
$ cd /path/to/github.com/openshift/cluster-authentication-operator
$ git pull
$ git checkout -b 4.10.0-0.nightly-2022-01-29-015515 477044
$ vi vendor/github.com/openshift/library-go/pkg/config/leaderelection/leaderelection.go
...
        rl, err := resourcelock.New(
                resourcelock.ConfigMapsLeasesResourceLock,
...

This means the PR indeed has landed into 4.10 payloads.

Then checked the definition and use of ConfigMapsLeasesResourceLock. It is in https://github.com/openshift/cluster-authentication-operator/blob/4770445/vendor/k8s.io/client-go/tools/leaderelection/resourcelock/interface.go#L137-L140 :
	case ConfigMapsLeasesResourceLock:
		return &MultiLock{
			Primary:   configmapLock,
			Secondary: leaseLock,
This means 4.10 indeed uses both the old configmap-based election and the new lease-based election, confirming Dev's plan in https://github.com/kubernetes/kubernetes/issues/107454 for 4.10, i.e. "version x+1". Further checks in the openshift-authentication-operator namespace:
$ oc get cm -n openshift-authentication-operator | grep lock
cluster-authentication-operator-lock   0      25h
$ oc get lease -n openshift-authentication-operator | grep lock
cluster-authentication-operator-lock   authentication-operator-84bd79899c-sh9lf_baf2761e-f0cd-4f1c-a4a5-c67e3788e45d   25h
$ oc get lease -n openshift-authentication-operator cluster-authentication-operator-lock -o yaml
apiVersion: coordination.k8s.io/v1
kind: Lease
...
spec:
  acquireTime: "2022-01-30T04:01:50.000000Z"
  holderIdentity: authentication-operator-84bd79899c-sh9lf_baf2761e-f0cd-4f1c-a4a5-c67e3788e45d
  leaseDurationSeconds: 137
  leaseTransitions: 2
  renewTime: "2022-01-30T04:42:46.623458Z"

There are both configmap and lease locks.
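
To make the MultiLock pairing quoted above easier to follow, here is a minimal stdlib-only sketch of why both lock objects end up carrying the same holder. It is a simplified model, not the client-go code: `record`, `memLock`, `multiLock` and their methods are hypothetical names, and the "unknown" result on disagreement is my simplified reading of how the real MultiLock behaves.

```go
package main

import "fmt"

// record is a simplified stand-in for a leader election record.
type record struct {
	Holder string
}

// memLock is a hypothetical in-memory lock. In the real operator the two
// locks are backed by a ConfigMap and a Lease object on the cluster.
type memLock struct {
	kind string
	rec  *record
}

// get returns the stored record and whether the lock object exists yet.
func (l *memLock) get() (record, bool) {
	if l.rec == nil {
		return record{}, false
	}
	return *l.rec, true
}

func (l *memLock) set(r record) { l.rec = &r }

// multiLock mirrors the Primary/Secondary pairing from the quoted snippet.
type multiLock struct {
	primary, secondary *memLock
}

// update writes the record to the primary first, then to the secondary.
func (m *multiLock) update(r record) {
	m.primary.set(r)
	m.secondary.set(r)
}

// holder reports the leader. If only the primary exists (for example, it was
// written by an older client), the primary wins; if both exist but disagree,
// the holder is reported as "unknown".
func (m *multiLock) holder() (string, bool) {
	p, ok := m.primary.get()
	if !ok {
		return "", false
	}
	s, ok := m.secondary.get()
	if !ok {
		return p.Holder, true
	}
	if p.Holder != s.Holder {
		return "unknown", true
	}
	return p.Holder, true
}

func main() {
	ml := &multiLock{
		primary:   &memLock{kind: "ConfigMap"},
		secondary: &memLock{kind: "Lease"},
	}
	ml.update(record{Holder: "authentication-operator-84bd79899c-sh9lf"})
	h, _ := ml.holder()
	fmt.Println(ml.primary.kind, ml.secondary.kind, h)
}
```

In this model a single update touches both objects, which matches what the cluster shows: one configmap and one lease with the same name.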

$ oc patch authentication.operator/cluster --type=merge -p="
spec:
  operatorLogLevel: TraceAll
"
Then checked the openshift-authentication-operator pod logs: deleted the openshift-authentication-operator pod, waited for the new pod to be created, and checked its logs, which contain:
2022-01-30T04:01:51.031563107Z I0130 04:01:51.031115       1 leaderelection.go:258] successfully acquired lease openshift-authentication-operator/cluster-authentication-operator-lock
2022-01-30T04:01:51.039809515Z I0130 04:01:51.033340       1 event.go:285] Event(v1.ObjectReference{Kind:"ConfigMap", Namespace:"openshift-authentication-operator", Name:"cluster-authentication-operator-lock", UID:"7d02b348-61f8-4410-b1b7-d846493e8526", APIVersion:"v1", ResourceVersion:"542456", FieldPath:""}): type: 'Normal' reason: 'LeaderElection' authentication-operator-84bd79899c-sh9lf_baf2761e-f0cd-4f1c-a4a5-c67e3788e45d became leader
2022-01-30T04:01:51.039809515Z I0130 04:01:51.033406       1 event.go:285] Event(v1.ObjectReference{Kind:"Lease", Namespace:"openshift-authentication-operator", Name:"cluster-authentication-operator-lock", UID:"f41ef5f7-354a-4b68-896a-2acfe531dd30", APIVersion:"coordination.k8s.io/v1", ResourceVersion:"542458", FieldPath:""}): type: 'Normal' reason: 'LeaderElection' authentication-operator-84bd79899c-sh9lf_baf2761e-f0cd-4f1c-a4a5-c67e3788e45d became leader

This means configmap-based and lease-based elections both work well in 4.10.
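
The same check can be scripted instead of reading the logs by eye. Below is a small sketch that scans log text for LeaderElection events and collects the object kinds they refer to. The function name `electionKinds` is hypothetical, the matching is based only on the event line shape shown above, and the sample lines in `main` are abbreviated copies of those lines.

```go
package main

import (
	"fmt"
	"regexp"
	"sort"
	"strings"
)

// kindRe extracts the Kind field from an Event(v1.ObjectReference{...}) line.
var kindRe = regexp.MustCompile(`Kind:"([^"]+)"`)

// electionKinds returns the sorted, de-duplicated object kinds that appear
// in LeaderElection events within the given log text.
func electionKinds(logs string) []string {
	seen := map[string]bool{}
	for _, line := range strings.Split(logs, "\n") {
		if !strings.Contains(line, "reason: 'LeaderElection'") {
			continue
		}
		if m := kindRe.FindStringSubmatch(line); m != nil {
			seen[m[1]] = true
		}
	}
	kinds := make([]string, 0, len(seen))
	for k := range seen {
		kinds = append(kinds, k)
	}
	sort.Strings(kinds)
	return kinds
}

func main() {
	// Abbreviated sample lines in the shape of the pod logs above.
	logs := `I0130 04:01:51.031115 1 leaderelection.go:258] successfully acquired lease
I0130 04:01:51.033340 1 event.go:285] Event(v1.ObjectReference{Kind:"ConfigMap", Name:"cluster-authentication-operator-lock"}): type: 'Normal' reason: 'LeaderElection' pod became leader
I0130 04:01:51.033406 1 event.go:285] Event(v1.ObjectReference{Kind:"Lease", Name:"cluster-authentication-operator-lock"}): type: 'Normal' reason: 'LeaderElection' pod became leader`
	fmt.Println(electionKinds(logs)) // prints [ConfigMap Lease]
}
```

On 4.10 logs this should report both kinds; on 4.9 logs, where only configmap-based election lines appear, it would report ConfigMap alone.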

In comparison, on 4.9 the openshift-authentication-operator pod logs show only configmap-based election lines and no lease-based election lines. This further verifies that 4.10 is working as intended by the bug's PR.

Given the above, I don't think any further testing can be done, so moving to VERIFIED. Per https://github.com/kubernetes/kubernetes/issues/107454 , we should watch QE upgrades from 4.9 (i.e. x) to 4.10 (i.e. x+1) to see whether any election issue occurs. If one does, we'll file a separate bug. Since 4.11 is not yet rebased to k8s 1.24, we cannot watch upgrades from 4.10 to 4.11 (i.e. x+2) right now.
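
The reason the upgrade steps matter can be sketched as follows. This is an illustration only: the `lockKinds` table and the `sharesLock` helper are hypothetical, and the x+2 row (leases only) is an assumption based on my reading of the linked issue, not something verified in this bug.

```go
package main

import "fmt"

// lockKinds maps each migration step to the lock objects it uses.
// x and x+1 match what was observed on 4.9 and 4.10; x+2 is an assumption.
var lockKinds = map[string][]string{
	"x":   {"ConfigMap"},
	"x+1": {"ConfigMap", "Lease"},
	"x+2": {"Lease"},
}

// sharesLock reports whether two steps have at least one lock kind in
// common, i.e. whether old and new pods would contend on the same object
// during a rolling upgrade.
func sharesLock(from, to string) bool {
	for _, a := range lockKinds[from] {
		for _, b := range lockKinds[to] {
			if a == b {
				return true
			}
		}
	}
	return false
}

func main() {
	fmt.Println(sharesLock("x", "x+1"))   // true
	fmt.Println(sharesLock("x+1", "x+2")) // true
	fmt.Println(sharesLock("x", "x+2"))   // false
}
```

Under these assumptions each single-step upgrade keeps a common lock object between old and new pods, while skipping the intermediate step would not, which is why the 4.9 to 4.10 and 4.10 to 4.11 upgrades are the ones to watch.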

Comment 6 errata-xmlrpc 2022-03-10 16:40:08 UTC
Since the problem described in this bug report should be
resolved in a recent advisory, it has been closed with a
resolution of ERRATA.

For information on the advisory (Moderate: OpenShift Container Platform 4.10.3 security update), and where to find the updated
files, follow the link below.

If the solution does not work for you, open a new bug report.

https://access.redhat.com/errata/RHSA-2022:0056