Bug 2052598

Summary: kube-scheduler should use configmap lease
Product: OpenShift Container Platform
Reporter: ravig <rgudimet>
Component: kube-scheduler
Assignee: ravig <rgudimet>
Status: CLOSED ERRATA
QA Contact: RamaKasturi <knarra>
Severity: high
Priority: high
Version: 4.10
CC: aos-bugs, calfonso, maszulik, mfojtik
Target Release: 4.10.0
Hardware: Unspecified
OS: Unspecified
Whiteboard: EmergencyRequest
Doc Type: No Doc Update
Clones: 2052599, 2052701
Last Closed: 2022-03-10 16:43:51 UTC
Type: Bug
Bug Depends On: 2052701

Description ravig 2022-02-09 16:14:54 UTC
Description of problem:

https://bugzilla.redhat.com/show_bug.cgi?id=2037856 wanted to make sure that the operators have configmap leases, whereas we currently have leases only.
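
For reference, here is a minimal client-go-level sketch of what a "configmapsleases" lock looks like, i.e. leader election that maintains both the old ConfigMap lock and the new Lease during the migration. This is illustrative only and is not the actual operator change; the namespace and lock name are taken from the verification in comment 10, while the identity and the renew/retry timings are placeholder assumptions (137s matches the leaseDurationSeconds observed on the lease, and the lock type requires a client-go version that still supports ConfigMap-based locks).

Illustrative sketch (Go, assumptions noted in comments):
=========================================================
package main

import (
	"context"
	"time"

	"k8s.io/client-go/kubernetes"
	"k8s.io/client-go/rest"
	"k8s.io/client-go/tools/leaderelection"
	"k8s.io/client-go/tools/leaderelection/resourcelock"
)

func main() {
	cfg, err := rest.InClusterConfig()
	if err != nil {
		panic(err)
	}
	client := kubernetes.NewForConfigOrDie(cfg)

	// "configmapsleases" records the holder on both the ConfigMap and the
	// Lease, so anything still watching the ConfigMap keeps working while
	// the Lease becomes the primary lock.
	lock, err := resourcelock.New(
		resourcelock.ConfigMapsLeasesResourceLock,
		"openshift-kube-scheduler-operator",
		"openshift-cluster-kube-scheduler-operator-lock",
		client.CoreV1(),
		client.CoordinationV1(),
		resourcelock.ResourceLockConfig{Identity: "example-pod-identity"}, // placeholder; normally pod name + UID
	)
	if err != nil {
		panic(err)
	}

	leaderelection.RunOrDie(context.Background(), leaderelection.LeaderElectionConfig{
		Lock:          lock,
		LeaseDuration: 137 * time.Second, // matches leaseDurationSeconds in the verified lease
		RenewDeadline: 107 * time.Second, // assumed; must be shorter than LeaseDuration
		RetryPeriod:   26 * time.Second,  // assumed; must be shorter than RenewDeadline
		Callbacks: leaderelection.LeaderCallbacks{
			OnStartedLeading: func(ctx context.Context) {
				// start the operator's controllers here
			},
			OnStoppedLeading: func() {
				// lost the lock; exit so another replica can take over
			},
		},
	})
}

With this lock type the lock exists as both a ConfigMap and a Lease, which is what comment 10 verifies: both objects are present, the Lease names the holder, and a LeaderElection event is emitted for each of the two kinds.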

Version-Release number of selected component (if applicable):


How reproducible:


Steps to Reproduce:
1.
2.
3.

Actual results:


Expected results:


Additional info:

Comment 1 Michal Fojtik 2022-02-09 16:41:38 UTC
** A NOTE ABOUT USING URGENT **

This BZ has been set to urgent severity and priority. When a BZ is marked urgent priority, engineers are asked to stop whatever they are doing and put everything else on hold.
Please be prepared to have reasonable justification ready to discuss, and ensure your own and engineering management are aware and agree this BZ is urgent. Keep in mind, urgent bugs are very expensive and have maximal management visibility.

NOTE: This bug was automatically assigned to an engineering manager with the severity reset to *unspecified* until the emergency is vetted and confirmed. Please do not manually override the severity.

** INFORMATION REQUIRED **

Please answer these questions before escalation to engineering:

1. Has a link to must-gather output been provided in this BZ? We cannot work without it. If must-gather fails to run, attach all relevant logs and provide the error message of must-gather.
2. Give the output of "oc get clusteroperators -o yaml".
3. In case of degraded/unavailable operators, have all their logs and the logs of the operands been analyzed [yes/no]?
4. List the top 5 relevant errors from the logs of the operators and operands in (3).
5. Order the list of degraded/unavailable operators according to which is likely the cause of the failure of the other, root-cause at the top.
6. Explain why (5) is likely the right order and list the information used for that assessment.
7. Explain why Engineering is necessary to make progress.

Comment 10 RamaKasturi 2022-02-14 10:18:25 UTC
Verified with the build below and I see that 4.10 uses both the old configmap-based election and the new lease-based election.

[knarra@knarra ~]$ oc get clusterversion
NAME      VERSION       AVAILABLE   PROGRESSING   SINCE   STATUS
version   4.10.0-rc.2   True        False         162m    Cluster version is 4.10.0-rc.2

[knarra@knarra ~]$ oc get cm -n openshift-kube-scheduler-operator | grep lock
openshift-cluster-kube-scheduler-operator-lock   0      3h

[knarra@knarra ~]$ oc get lease -n openshift-kube-scheduler-operator
NAME                                             HOLDER                                                                                    AGE
openshift-cluster-kube-scheduler-operator-lock   openshift-kube-scheduler-operator-74c54bf5d4-gzr6x_52ed5d5a-f4d4-4d0c-8943-829c423c9205   3h1m

[knarra@knarra ~]$ oc get lease openshift-cluster-kube-scheduler-operator-lock -n openshift-kube-scheduler-operator -o yaml
apiVersion: coordination.k8s.io/v1
kind: Lease
metadata:
  creationTimestamp: "2022-02-14T07:05:37Z"
  name: openshift-cluster-kube-scheduler-operator-lock
  namespace: openshift-kube-scheduler-operator
  resourceVersion: "84598"
  uid: 951f7918-447f-458d-b334-bb7adda4f69c
spec:
  acquireTime: "2022-02-14T07:07:07.000000Z"
  holderIdentity: openshift-kube-scheduler-operator-74c54bf5d4-gzr6x_52ed5d5a-f4d4-4d0c-8943-829c423c9205
  leaseDurationSeconds: 137
  leaseTransitions: 1
  renewTime: "2022-02-14T10:06:42.533002Z"

kube-scheduler-operator pod logs:
======================================
I0214 07:07:07.055620       1 dynamic_serving_content.go:131] "Starting controller" name="serving-cert::/var/run/secrets/serving-cert/tls.crt::/var/run/secrets/serving-cert/tls.key"
I0214 07:07:07.055675       1 tlsconfig.go:240] "Starting DynamicServingCertificateController"
I0214 07:07:07.065914       1 leaderelection.go:258] successfully acquired lease openshift-kube-scheduler-operator/openshift-cluster-kube-scheduler-operator-lock
I0214 07:07:07.066048       1 event.go:285] Event(v1.ObjectReference{Kind:"ConfigMap", Namespace:"openshift-kube-scheduler-operator", Name:"openshift-cluster-kube-scheduler-operator-lock", UID:"5fcf8693-47b7-40cc-b504-3c8934133c04", APIVersion:"v1", ResourceVersion:"9253", FieldPath:""}): type: 'Normal' reason: 'LeaderElection' openshift-kube-scheduler-operator-74c54bf5d4-gzr6x_52ed5d5a-f4d4-4d0c-8943-829c423c9205 became leader
I0214 07:07:07.066114       1 event.go:285] Event(v1.ObjectReference{Kind:"Lease", Namespace:"openshift-kube-scheduler-operator", Name:"openshift-cluster-kube-scheduler-operator-lock", UID:"951f7918-447f-458d-b334-bb7adda4f69c", APIVersion:"coordination.k8s.io/v1", ResourceVersion:"9254", FieldPath:""}): type: 'Normal' reason: 'LeaderElection' openshift-kube-scheduler-operator-74c54bf5d4-gzr6x_52ed5d5a-f4d4-4d0c-8943-829c423c9205 became leader


Based on the above, moving the bug to verified state.

Comment 12 errata-xmlrpc 2022-03-10 16:43:51 UTC
Since the problem described in this bug report should be
resolved in a recent advisory, it has been closed with a
resolution of ERRATA.

For information on the advisory (Moderate: OpenShift Container Platform 4.10.3 security update), and where to find the updated
files, follow the link below.

If the solution does not work for you, open a new bug report.

https://access.redhat.com/errata/RHSA-2022:0056