Our operators use a ConfigMap in their respective namespaces as the leader-election lock; we need to use Lease instead. This warrants a change in library-go: in https://github.com/openshift/library-go/blob/master/pkg/config/leaderelection/leaderelection.go#L53 we need to use 'configmapsleases' instead of 'configmaps' (to handle version skew). I hope that's the only change we need to make.

The context is that ConfigMap/Endpoints-based leader election is being removed in Kubernetes 1.24: https://github.com/kubernetes/kubernetes/pull/106852

We need to update the following operators:
- kas
- oas
- etcd
- auth

Are there other operators we need to update?

We also need to add a dedicated FlowSchema for each operator to make sure leader-election traffic falls into the 'leader-election' priority level. We only need to address ConfigMap-based traffic in the FlowSchema; Lease-based traffic will be addressed by the APF bootstrap configuration in 1.24.

We will be able to remove the ConfigMaps in 4.12. The lock migration will be as follows:
- 4.10: use 'configmapsleases' for leader election. This client uses both ConfigMap and Lease objects, so a new (4.10) client can work with old (4.9) clients.
- 4.11: use 'leases' for leader election, so a new (4.11) client uses Lease only, and an old (4.10) client can still work with the Lease object. We can't remove the ConfigMaps yet, though: the MultiLock (4.10 client) relies on the ConfigMap being present.
- 4.12: remove the ConfigMaps.

Related issue in k/k: https://github.com/kubernetes/kubernetes/issues/107454
Slack thread: https://coreos.slack.com/archives/CC3CZCQHM/p1641487878025300
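The reason this has to be a three-release migration can be sketched with a toy model. This is illustrative Go, not the real client-go API; only the lock-type names ('configmaps', 'leases', 'configmapsleases') come from the resourcelock constants in k8s.io/client-go. It shows that each adjacent release pair shares at least one lock object, while a 4.9 client and a 4.11 client would share none:

```go
package main

import "fmt"

// objectsUsed returns which API objects a client configured with the given
// lock type reads and writes during leader election. The mapping mirrors
// the resourcelock constants in client-go; the function itself is a sketch.
func objectsUsed(lockType string) []string {
	switch lockType {
	case "configmaps": // 4.9 operators
		return []string{"ConfigMap"}
	case "configmapsleases": // 4.10 operators (MultiLock)
		return []string{"ConfigMap", "Lease"}
	case "leases": // 4.11+ operators
		return []string{"Lease"}
	}
	return nil
}

// compatible reports whether two clients contend on at least one common
// object, i.e. whether mixed-version leader election stays mutually exclusive.
func compatible(a, b string) bool {
	for _, x := range objectsUsed(a) {
		for _, y := range objectsUsed(b) {
			if x == y {
				return true
			}
		}
	}
	return false
}

func main() {
	fmt.Println(compatible("configmaps", "configmapsleases")) // 4.9 <-> 4.10: true
	fmt.Println(compatible("configmapsleases", "leases"))     // 4.10 <-> 4.11: true
	fmt.Println(compatible("configmaps", "leases"))           // 4.9 <-> 4.11: false
}
```

The false case in the last line is exactly why 4.10 must run the hybrid 'configmapsleases' lock rather than jumping straight to 'leases'.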
> are there other operators we need to update?

There are a number of storage operators based on library-go, and all of them have leader election:

- https://github.com/openshift/cluster-storage-operator
- https://github.com/openshift/cluster-csi-snapshot-controller-operator
- https://github.com/openshift/vsphere-problem-detector
- https://github.com/openshift/aws-ebs-csi-driver-operator
- https://github.com/openshift/aws-efs-csi-driver-operator
- https://github.com/openshift/azure-disk-csi-driver-operator
- https://github.com/openshift/azure-file-csi-driver-operator
- https://github.com/openshift/openstack-cinder-csi-driver-operator
- https://github.com/openshift/csi-driver-manila-operator
- https://github.com/openshift/gcp-pd-csi-driver-operator
- https://github.com/openshift/vmware-vsphere-csi-driver-operator
- https://github.com/openshift/alibaba-disk-csi-driver-operator
- https://github.com/openshift/ibm-vpc-block-csi-driver-operator
(In reply to Jan Safranek from comment #1)

> There is number of storage operators based on library-go and all of them
> have leader election:
>
> https://github.com/openshift/cluster-storage-operator
> https://github.com/openshift/cluster-csi-snapshot-controller-operator
> https://github.com/openshift/vsphere-problem-detector

I created PRs in these repos ^.

> https://github.com/openshift/aws-ebs-csi-driver-operator
> https://github.com/openshift/aws-efs-csi-driver-operator
> https://github.com/openshift/azure-disk-csi-driver-operator
> https://github.com/openshift/azure-file-csi-driver-operator
> https://github.com/openshift/openstack-cinder-csi-driver-operator
> https://github.com/openshift/csi-driver-manila-operator
> https://github.com/openshift/gcp-pd-csi-driver-operator
> https://github.com/openshift/vmware-vsphere-csi-driver-operator
> https://github.com/openshift/alibaba-disk-csi-driver-operator
> https://github.com/openshift/ibm-vpc-block-csi-driver-operator

These repos are/will be updated as part of the fix for https://bugzilla.redhat.com/show_bug.cgi?id=2038934. We will not track them here.
kewang, we also need to verify it for the oas operator: https://github.com/openshift/cluster-openshift-apiserver-operator/pull/490

Basically all operators in the control-plane group. BZs for other operators:
- https://bugzilla.redhat.com/show_bug.cgi?id=2041554
- https://bugzilla.redhat.com/show_bug.cgi?id=2042501
Auth verification FYI: https://bugzilla.redhat.com/show_bug.cgi?id=2041554#c3
Verification steps for the apiserver-related operators:

Checked library-go; the library PR is https://github.com/openshift/library-go/pull/1282. Reading its code, the only difference is that leaderelection.go now returns ConfigMapsLeasesResourceLock instead of ConfigMapsResourceLock. The definition and use of ConfigMapsLeasesResourceLock is at https://github.com/openshift/cluster-kube-apiserver-operator/blob/master/vendor/k8s.io/client-go/tools/leaderelection/resourcelock/interface.go#L137-L140:

    case ConfigMapsLeasesResourceLock:
        return &MultiLock{
            Primary:   configmapLock,
            Secondary: leaseLock,
        }, nil

This means 4.10 indeed uses both the old ConfigMap-based election and the new Lease-based election. Further checked the kube-apiserver-operator and openshift-apiserver-operator namespaces for both locks.

For OCP 4.10:

$ oc get clusterversion
NAME      VERSION       AVAILABLE   PROGRESSING   SINCE   STATUS
version   4.10.0-fc.4   True        False         3h50m   Cluster version is 4.10.0-fc.4

$ oc get cm -n openshift-kube-apiserver-operator | grep lock
kube-apiserver-operator-lock   0   4h12m

$ oc get lease -n openshift-kube-apiserver-operator
NAME                           HOLDER                                                                         AGE
kube-apiserver-operator-lock   kube-apiserver-operator-55f775964-5jtwp_905d5e54-b695-492c-a6f8-01b3997b22e0   4h18m

$ oc get lease kube-apiserver-operator-lock -n openshift-kube-apiserver-operator -o yaml
apiVersion: coordination.k8s.io/v1
kind: Lease
metadata:
  creationTimestamp: "2022-01-30T01:45:41Z"
  name: kube-apiserver-operator-lock
  namespace: openshift-kube-apiserver-operator
  resourceVersion: "101626"
  uid: 9513411e-a2b0-4309-97c9-f5a2325ecc4b
spec:
  acquireTime: "2022-01-30T01:45:41.000000Z"
  holderIdentity: kube-apiserver-operator-55f775964-5jtwp_905d5e54-b695-492c-a6f8-01b3997b22e0
  leaseDurationSeconds: 137
  leaseTransitions: 0
  renewTime: "2022-01-30T06:03:28.351945Z"

$ oc get cm -n openshift-apiserver-operator | grep lock
openshift-apiserver-operator-lock   0   4h13m

$ oc get lease -n openshift-apiserver-operator
NAME                                HOLDER                                                                                AGE
openshift-apiserver-operator-lock   openshift-apiserver-operator-54cb9c8457-pngzh_e2ad5ac7-4690-49ce-aec8-4a87bd1c838d    4h20m

$ oc get lease openshift-apiserver-operator-lock -n openshift-apiserver-operator -o yaml
apiVersion: coordination.k8s.io/v1
kind: Lease
metadata:
  creationTimestamp: "2022-01-30T01:45:43Z"
  name: openshift-apiserver-operator-lock
  namespace: openshift-apiserver-operator
  resourceVersion: "102601"
  uid: ea4ee887-33b2-4df2-9892-2d4e9121c2c7
spec:
  acquireTime: "2022-01-30T01:45:43.000000Z"
  holderIdentity: openshift-apiserver-operator-54cb9c8457-pngzh_e2ad5ac7-4690-49ce-aec8-4a87bd1c838d
  leaseDurationSeconds: 137
  leaseTransitions: 0
  renewTime: "2022-01-30T06:06:27.629485Z"

Both the ConfigMap and the Lease locks exist.

To confirm this feature is new in 4.10, compare to OCP 4.9:

$ oc get clusterversion
NAME      VERSION   AVAILABLE   PROGRESSING   SINCE   STATUS
version   4.9.18    True        False         3h51m   Cluster version is 4.9.18

$ oc get cm -n openshift-kube-apiserver-operator | grep lock
kube-apiserver-operator-lock   0   4h12m

$ oc get lease -n openshift-kube-apiserver-operator
No resources found in openshift-kube-apiserver-operator namespace.

$ oc get cm -n openshift-apiserver-operator | grep lock
openshift-apiserver-operator-lock   0   4h16m

$ oc get lease -n openshift-apiserver-operator
No resources found in openshift-apiserver-operator namespace.

Only the ConfigMap locks exist.
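The MultiLock excerpt quoted above is also the reason (noted in the migration plan) that the ConfigMaps cannot be deleted while 4.10 clients are still running. A minimal sketch of that behavior, using illustrative types rather than the real client-go implementation (in client-go, operations go through the Primary lock first and must succeed before the Secondary is touched):

```go
package main

import "fmt"

// objectLock models a lock backed by a single API object; acquire fails if
// the underlying object is gone (a stand-in for a NotFound error).
type objectLock struct {
	kind   string
	exists bool
}

func (l *objectLock) acquire() bool { return l.exists }

// multiLock mirrors the shape of client-go's MultiLock: a primary
// (ConfigMap) and a secondary (Lease). The primary must succeed first.
type multiLock struct {
	primary, secondary *objectLock
}

func (m *multiLock) acquire() bool {
	if !m.primary.acquire() { // primary ConfigMap must be usable
		return false
	}
	return m.secondary.acquire()
}

func main() {
	m := &multiLock{
		primary:   &objectLock{kind: "ConfigMap", exists: true},
		secondary: &objectLock{kind: "Lease", exists: true},
	}
	fmt.Println(m.acquire()) // both objects present: true

	m.primary.exists = false // delete the ConfigMap too early...
	fmt.Println(m.acquire()) // ...and the 4.10 client can no longer renew: false
}
```

Hence the ConfigMaps stay until 4.12, once no 'configmapsleases' (4.10) clients can remain in a supported upgrade path.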
To check kube-apiserver-operator and openshift-apiserver-operator for both locks in the logs, on the 4.9 and 4.10 clusters, apply the following patches:

oc patch kubeapiservers.operator.openshift.io/cluster --type=merge -p="
spec:
  operatorLogLevel: TraceAll
"

oc patch openshiftapiservers.operator.openshift.io/cluster --type=merge -p="
spec:
  operatorLogLevel: TraceAll
"

On 4.10, deleted the kube-apiserver-operator pod, waited for the new pod to be created, then checked the kube-apiserver-operator pod logs. They contain:

I0130 06:15:24.012794 1 leaderelection.go:258] successfully acquired lease openshift-kube-apiserver-operator/kube-apiserver-operator-lock
I0130 06:15:24.012918 1 event.go:285] Event(v1.ObjectReference{Kind:"ConfigMap", Namespace:"openshift-kube-apiserver-operator", Name:"kube-apiserver-operator-lock", UID:"37a44dd4-46e3-4681-931f-5be67996f7fc", APIVersion:"v1", ResourceVersion:"105463", FieldPath:""}): type: 'Normal' reason: 'LeaderElection' kube-apiserver-operator-55f775964-7jvr5_6edb975a-4d8f-414b-9a2a-6d508cf8f682 became leader
I0130 06:15:24.012934 1 event.go:285] Event(v1.ObjectReference{Kind:"Lease", Namespace:"openshift-kube-apiserver-operator", Name:"kube-apiserver-operator-lock", UID:"9513411e-a2b0-4309-97c9-f5a2325ecc4b", APIVersion:"coordination.k8s.io/v1", ResourceVersion:"105464", FieldPath:""}): type: 'Normal' reason: 'LeaderElection' kube-apiserver-operator-55f775964-7jvr5_6edb975a-4d8f-414b-9a2a-6d508cf8f682 became leader

This means the ConfigMap-based and Lease-based elections both work well in 4.10.
Deleted the openshift-apiserver-operator pod, waited for the new pod to be created, then checked the openshift-apiserver-operator pod logs. They contain:

I0130 06:15:42.533693 1 leaderelection.go:258] successfully acquired lease openshift-apiserver-operator/openshift-apiserver-operator-lock
I0130 06:15:42.533921 1 event.go:285] Event(v1.ObjectReference{Kind:"ConfigMap", Namespace:"openshift-apiserver-operator", Name:"openshift-apiserver-operator-lock", UID:"f0fae852-6140-4a1f-8f06-10cb646e6877", APIVersion:"v1", ResourceVersion:"105633", FieldPath:""}): type: 'Normal' reason: 'LeaderElection' openshift-apiserver-operator-54cb9c8457-66vwf_394f63fa-c25b-428c-8b34-2f7d18e8d00a became leader
I0130 06:15:42.533945 1 event.go:285] Event(v1.ObjectReference{Kind:"Lease", Namespace:"openshift-apiserver-operator", Name:"openshift-apiserver-operator-lock", UID:"ea4ee887-33b2-4df2-9892-2d4e9121c2c7", APIVersion:"coordination.k8s.io/v1", ResourceVersion:"105634", FieldPath:""}): type: 'Normal' reason: 'LeaderElection' openshift-apiserver-operator-54cb9c8457-66vwf_394f63fa-c25b-428c-8b34-2f7d18e8d00a became leader

The same goes for the openshift-apiserver-operator.

Compared to 4.9: there, the kube-apiserver-operator and openshift-apiserver-operator pod logs only show lines for the ConfigMap-based election, and no lines for the Lease-based election. This further verifies that 4.10 is working as expected per this bug's PR.

Based on the above, no further test can be done IMO; moving to VERIFIED.

Per https://github.com/kubernetes/kubernetes/issues/107454, we should watch QE upgrades from 4.9 (i.e. x) to 4.10 (i.e. x+1) to see whether any election issue appears. If one does, we'll file a separate bug. Since 4.11 is not yet rebased to k8s 1.24, we cannot watch an upgrade from 4.10 to 4.11 (i.e. x+2) right now.
Since the problem described in this bug report should be resolved in a recent advisory, it has been closed with a resolution of ERRATA. For information on the advisory (Moderate: OpenShift Container Platform 4.10.3 security update), and where to find the updated files, follow the link below. If the solution does not work for you, open a new bug report. https://access.redhat.com/errata/RHSA-2022:0056