Bug 1881322 - kube-scheduler not scheduling pods for certificates not renewed automatically after nodes restoration
Summary: kube-scheduler not scheduling pods for certificates not renewed automatically after nodes restoration
Keywords:
Status: CLOSED ERRATA
Alias: None
Product: OpenShift Container Platform
Classification: Red Hat
Component: kube-scheduler
Version: 4.4
Hardware: Unspecified
OS: Unspecified
Priority: high
Severity: high
Target Milestone: ---
Target Release: 4.7.0
Assignee: Maciej Szulik
QA Contact: RamaKasturi
URL:
Whiteboard:
Duplicates: 1876868 1882083 1882665
Depends On:
Blocks: 1903523 1903586
 
Reported: 2020-09-22 07:55 UTC by Christian Passarelli
Modified: 2023-12-15 19:27 UTC
9 users

Fixed In Version:
Doc Type: No Doc Update
Doc Text:
Clone Of:
Clones: 1903586
Environment:
Last Closed: 2021-02-24 15:18:56 UTC
Target Upstream Version:
Embargoed:


Attachments: None


Links
Github openshift cluster-kube-scheduler-operator pull 288 (closed): Bug 1881322: Sync new kube-scheduler-client-cert-key on recovery (last updated 2021-02-15 17:37:49 UTC)
Red Hat Bugzilla 1865981 (low, CLOSED): NodeStatusUnknown Kubelet stopped posting node status 4.4.9 on Azure, can't reach apiserver (last updated 2023-12-15 18:42:31 UTC)
Red Hat Knowledge Base (Solution) 5442201 (last updated 2020-10-02 16:49:48 UTC)
Red Hat Product Errata RHSA-2020:5633 (last updated 2021-02-24 15:20:57 UTC)

Comment 10 Tomáš Nožička 2020-10-02 11:39:30 UTC
*** Bug 1882665 has been marked as a duplicate of this bug. ***

Comment 11 Tomáš Nožička 2020-10-02 11:39:41 UTC
*** Bug 1876868 has been marked as a duplicate of this bug. ***

Comment 12 Tomáš Nožička 2020-10-02 11:40:24 UTC
*** Bug 1882083 has been marked as a duplicate of this bug. ***

Comment 19 RamaKasturi 2020-12-24 15:14:46 UTC
Hi Tomas,

   I have tried reproducing the bug by moving the time ahead by 61 days, but I could not reproduce the issue described above.

Steps followed to reproduce on a 4.4 cluster:
=============================================
kube-scheduler-client-cert-key was due to expire on Jan 23, 2021, per the output below:

2020-12-24T12:20:56Z  2021-01-23T12:20:57Z  openshift-kube-scheduler	kube-scheduler-client-cert-key
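
A minimal way to read those dates directly from the secret (a hedged sketch; not necessarily how the listing above was generated):

# decode the client certificate from the secret and print its validity window
$ oc get secret kube-scheduler-client-cert-key -n openshift-kube-scheduler \
    -o jsonpath='{.data.tls\.crt}' | base64 -d | openssl x509 -noout -dates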

1) Move the time ahead by 61 days.
Current time:
[knarra@knarra openshift-client-linux-4.7.0-0.nightly-2020-12-21-131655]$ date -u
Thu Dec 24 14:48:42 UTC 2020

Time after the change:
[core@knarra461224-4n4cq-compute-0 ~]$ date
Tue Feb 23 14:33:40 UTC 2021
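
(For reference, a minimal sketch of how the node clock can be moved ahead; this assumes chrony is the node's time source and is not the exact script used for this verification.)

# on each node: stop NTP sync so the clock stays moved, then jump 61 days ahead
$ sudo systemctl stop chronyd
$ sudo date -s "$(date -d '+61 days')"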

2) Approved the pending CSRs (a one-liner for approving them is sketched after the output below); all cluster operators came back, then scheduled a pod and it ran fine.
[core@knarra449-v8bmk-compute-0 ~]$ oc get pods -o wide
NAME        READY   STATUS    RESTARTS   AGE     IP            NODE                        NOMINATED NODE   READINESS GATES
pod-bcp2m   1/1     Running   0          6m19s   10.131.0.22   knarra449-v8bmk-compute-2   <none>           <none>

openshift-kube-scheduler-knarra449-v8bmk-control-plane-0   2/2     Running     2          61d
openshift-kube-scheduler-knarra449-v8bmk-control-plane-1   2/2     Running     0          61d
openshift-kube-scheduler-knarra449-v8bmk-control-plane-2   2/2     Running     2          61d
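
For step 2, the commonly documented way to approve all pending CSRs in one pass (shown here as a hedged convenience; the exact commands used during this run may have differed):

# approve every CSR that is still pending (i.e. has no status yet)
$ oc get csr -o go-template='{{range .items}}{{if not .status}}{{.metadata.name}}{{"\n"}}{{end}}{{end}}' | xargs --no-run-if-empty oc adm certificate approve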

But I do see an error message like the one below when running "oc get pod <kube-scheduler-pod> -o yaml":

    lastState:
      terminated:
        containerID: cri-o://e450c3519f0c774cdca56a1a067365d7f513532c09c9f2a6ca64c4d5872dd3f7
        exitCode: 255
        finishedAt: "2021-02-23T13:19:56Z"
        message: |
          Get https://localhost:6443/apis/apps/v1/statefulsets?allowWatchBookmarks=true&resourceVersion=35408&timeout=6m21s&timeoutSeconds=381&watch=true: dial tcp [::1]:6443: connect: connection refused
          E0223 13:19:56.150096       1 reflector.go:307] k8s.io/kubernetes/cmd/kube-scheduler/app/server.go:246: Failed to watch *v1.Pod: Get https://localhost:6443/api/v1/pods?allowWatchBookmarks=true&fieldSelector=status.phase%21%3DFailed%2Cstatus.phase%21%3DSucceeded&resourceVersion=37548&timeoutSeconds=307&watch=true: dial tcp [::1]:6443: connect: connection refused
          E0223 13:19:56.151158       1 reflector.go:307] k8s.io/client-go/informers/factory.go:135: Failed to watch *v1.PersistentVolume: Get https://localhost:6443/api/v1/persistentvolumes?allowWatchBookmarks=true&resourceVersion=18128&timeout=9m35s&timeoutSeconds=575&watch=true: dial tcp [::1]:6443: connect: connection refused
          E0223 13:19:56.152186       1 reflector.go:307] k8s.io/client-go/informers/factory.go:135: Failed to watch *v1.StorageClass: Get https://localhost:6443/apis/storage.k8s.io/v1/storageclasses?allowWatchBookmarks=true&resourceVersion=18133&timeout=5m29s&timeoutSeconds=329&watch=true: dial tcp [::1]:6443: connect: connection refused
          E0223 13:19:56.153451       1 reflector.go:307] k8s.io/client-go/informers/factory.go:135: Failed to watch *v1.ReplicaSet: Get https://localhost:6443/apis/apps/v1/replicasets?allowWatchBookmarks=true&resourceVersion=36851&timeout=7m46s&timeoutSeconds=466&watch=true: dial tcp [::1]:6443: connect: connection refused
          E0223 13:19:56.154451       1 reflector.go:307] k8s.io/client-go/informers/factory.go:135: Failed to watch *v1.Service: Get https://localhost:6443/api/v1/services?allowWatchBookmarks=true&resourceVersion=36030&timeout=6m34s&timeoutSeconds=394&watch=true: dial tcp [::1]:6443: connect: connection refused
          I0223 13:19:56.844338       1 leaderelection.go:288] failed to renew lease openshift-kube-scheduler/kube-scheduler: timed out waiting for the condition
          F0223 13:19:56.844475       1 server.go:257] leaderelection lost
        reason: Error
        startedAt: "2021-02-23T13:11:53Z"

Also seen in the logs:
====================================
E0223 13:17:42.849631       1 reflector.go:307] k8s.io/client-go/informers/factory.go:135: Failed to watch *v1.Node: unknown (get nodes)
E0223 13:17:42.849740       1 leaderelection.go:331] error retrieving resource lock openshift-kube-scheduler/kube-scheduler: configmaps "kube-scheduler" is forbidden: User "system:kube-scheduler" cannot get resource "configmaps" in API group "" in the namespace "openshift-kube-scheduler"
E0223 13:17:42.849786       1 reflector.go:307] k8s.io/client-go/informers/factory.go:135: Failed to watch *v1.ReplicaSet: unknown (get replicasets.apps)
E0223 13:17:42.849812       1 reflector.go:307] k8s.io/client-go/informers/factory.go:135: Failed to watch *v1.PersistentVolume: unknown (get persistentvolumes)
E0223 13:17:42.849839       1 reflector.go:307] k8s.io/client-go/informers/factory.go:135: Failed to watch *v1.PersistentVolumeClaim: unknown (get persistentvolumeclaims)
E0223 13:17:42.849869       1 reflector.go:307] k8s.io/client-go/informers/factory.go:135: Failed to watch *v1.Service: unknown (get services)
E0223 13:17:42.849897       1 reflector.go:307] k8s.io/client-go/informers/factory.go:135: Failed to watch *v1.CSINode: unknown (get csinodes.storage.k8s.io)
E0223 13:17:42.849924       1 reflector.go:307] k8s.io/kubernetes/cmd/kube-scheduler/app/server.go:246: Failed to watch *v1.Pod: unknown (get pods)
E0223 13:17:42.849949       1 reflector.go:307] k8s.io/client-go/informers/factory.go:135: Failed to watch *v1.StorageClass: unknown (get storageclasses.storage.k8s.io)
E0223 13:17:42.849973       1 reflector.go:307] k8s.io/client-go/informers/factory.go:135: Failed to watch *v1.ReplicationController: unknown (get replicationcontrollers)
I0223 13:19:50.009153       1 leaderelection.go:252] successfully acquired lease openshift-kube-scheduler/kube-scheduler
I0223 14:33:13.829148       1 scheduler.go:751] pod default/pod-bcp2m is bound successfully on node "knarra449-v8bmk-compute-2", 6 nodes evaluated, 3 nodes were found feasible.
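
(For completeness, the log excerpts above can be gathered with something like the following; the container name is an assumption based on the standard openshift-kube-scheduler static pod layout.)

# current and previous scheduler container logs for a given control-plane node
$ oc logs -n openshift-kube-scheduler openshift-kube-scheduler-<control-plane-node> -c kube-scheduler
$ oc logs -n openshift-kube-scheduler openshift-kube-scheduler-<control-plane-node> -c kube-scheduler --previous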



Steps followed to reproduce on a 4.6 cluster:
==============================================
kube-scheduler-client-cert-key was due to expire on Jan 23, 2021, per the output below:

2020-12-24T06:59:52Z  2021-01-23T06:59:53Z  openshift-kube-scheduler	kube-scheduler-client-cert-key

1) Move the time ahead by 61 days.
Current time:
[knarra@knarra openshift-client-linux-4.7.0-0.nightly-2020-12-21-131655]$ date -u
Thu Dec 24 14:53:54 UTC 2020

Time moved ahead:
[core@knarra461224-4n4cq-compute-0 ~]$ date 
Tue Feb 23 14:39:06 UTC 2021

2) Approved the pending CSRs; all cluster operators came back, then scheduled a pod and it ran fine.
[core@knarra461224-4n4cq-compute-0 ~]$ oc get pods -o wide
NAME        READY   STATUS    RESTARTS   AGE   IP             NODE                           NOMINATED NODE   READINESS GATES
pod-qz8x8   1/1     Running   0          78m   10.129.2.132   knarra461224-4n4cq-compute-2   <none>           <none>

[core@knarra461224-4n4cq-compute-0 ~]$ oc get pods -n openshift-kube-scheduler
NAME                                                          READY   STATUS      RESTARTS   AGE
openshift-kube-scheduler-knarra461224-4n4cq-control-plane-0   2/2     Running     1          61d
openshift-kube-scheduler-knarra461224-4n4cq-control-plane-1   2/2     Running     0          61d
openshift-kube-scheduler-knarra461224-4n4cq-control-plane-2   2/2     Running     0          61d

But I do see an error message like the one below when running "oc get pod <kube-scheduler-pod> -o yaml":

        exitCode: 1
        finishedAt: "2021-02-23T11:41:09Z"
        message: |
          rra461224-4n4cq-compute-2" evaluatedNodes=6 feasibleNodes=1
          I1224 11:40:36.411828       1 scheduler.go:597] "Successfully bound pod to node" pod="openshift-marketplace/community-operators-2k755" node="knarra461224-4n4cq-compute-2" evaluatedNodes=6 feasibleNodes=1
          E0223 11:40:59.662510       1 leaderelection.go:325] error retrieving resource lock openshift-kube-scheduler/kube-scheduler: Unauthorized
          E0223 11:41:01.664787       1 leaderelection.go:325] error retrieving resource lock openshift-kube-scheduler/kube-scheduler: Unauthorized
          E0223 11:41:03.664855       1 leaderelection.go:325] error retrieving resource lock openshift-kube-scheduler/kube-scheduler: Unauthorized
          E0223 11:41:05.665244       1 leaderelection.go:325] error retrieving resource lock openshift-kube-scheduler/kube-scheduler: Unauthorized
          E0223 11:41:07.664552       1 leaderelection.go:325] error retrieving resource lock openshift-kube-scheduler/kube-scheduler: Unauthorized
          E0223 11:41:08.508757       1 webhook.go:111] Failed to make webhook authenticator request: Unauthorized
          E0223 11:41:08.508810       1 authentication.go:53] Unable to authenticate the request due to an error: [invalid bearer token, Unauthorized]
          E0223 11:41:08.508956       1 writers.go:107] apiserver was unable to write a JSON response: no kind is registered for the type v1.Status in scheme "k8s.io/kubernetes/pkg/api/legacyscheme/scheme.go:30"
          E0223 11:41:08.508993       1 status.go:71] apiserver received an error that is not an metav1.Status: &runtime.notRegisteredErr{schemeName:"k8s.io/kubernetes/pkg/api/legacyscheme/scheme.go:30", gvk:schema.GroupVersionKind{Group:"", Version:"", Kind:""}, target:runtime.GroupVersioner(nil), t:(*reflect.rtype)(0x1bb25a0)}
          I0223 11:41:09.659870       1 leaderelection.go:278] failed to renew lease openshift-kube-scheduler/kube-scheduler: timed out waiting for the condition
          E0223 11:41:09.659962       1 leaderelection.go:301] Failed to release lock: resource name may not be empty
          F0223 11:41:09.659986       1 server.go:211] leaderelection lost
        reason: Error
        startedAt: "2020-12-24T07:21:45Z"

Seen in the logs:
===========================
E0223 11:52:44.444938       1 reflector.go:127] k8s.io/client-go/informers/factory.go:134: Failed to watch *v1.Service: failed to list *v1.Service: Unauthorized
E0223 11:52:47.255029       1 reflector.go:127] k8s.io/client-go/informers/factory.go:134: Failed to watch *v1beta1.PodDisruptionBudget: failed to list *v1beta1.PodDisruptionBudget: Unauthorized
E0223 11:52:49.530676       1 reflector.go:127] k8s.io/client-go/informers/factory.go:134: Failed to watch *v1.ReplicationController: failed to list *v1.ReplicationController: Unauthorized
E0223 11:52:51.708276       1 reflector.go:127] k8s.io/client-go/informers/factory.go:134: Failed to watch *v1.PersistentVolumeClaim: failed to list *v1.PersistentVolumeClaim: Unauthorized
E0223 11:52:52.464690       1 webhook.go:111] Failed to make webhook authenticator request: Unauthorized
E0223 11:52:52.464868       1 authentication.go:53] Unable to authenticate the request due to an error: [invalid bearer token, Unauthorized]
E0223 11:52:52.464983       1 writers.go:107] apiserver was unable to write a JSON response: no kind is registered for the type v1.Status in scheme "k8s.io/kubernetes/pkg/api/legacyscheme/scheme.go:30"
E0223 11:52:52.465087       1 status.go:71] apiserver received an error that is not an metav1.Status: &runtime.notRegisteredErr{schemeName:"k8s.io/kubernetes/pkg/api/legacyscheme/scheme.go:30", gvk:schema.GroupVersionKind{Group:"", Version:"", Kind:""}, target:runtime.GroupVersioner(nil), t:(*reflect.rtype)(0x1bb25a0)}
E0223 11:53:05.446270       1 webhook.go:111] Failed to make webhook authenticator request: Unauthorized
E0223 11:53:05.446357       1 authentication.go:53] Unable to authenticate the request due to an error: [invalid bearer token, Unauthorized]
E0223 11:53:05.446520       1 writers.go:107] apiserver was unable to write a JSON response: no kind is registered for the type v1.Status in scheme "k8s.io/kubernetes/pkg/api/legacyscheme/scheme.go:30"
E0223 11:53:05.446552       1 status.go:71] apiserver received an error that is not an metav1.Status: &runtime.notRegisteredErr{schemeName:"k8s.io/kubernetes/pkg/api/legacyscheme/scheme.go:30", gvk:schema.GroupVersionKind{Group:"", Version:"", Kind:""}, target:runtime.GroupVersioner(nil), t:(*reflect.rtype)(0x1bb25a0)}
E0223 11:53:07.776379       1 reflector.go:127] k8s.io/kubernetes/cmd/kube-scheduler/app/server.go:193: Failed to watch *v1.Pod: failed to list *v1.Pod: Unauthorized
E0223 11:53:08.482780       1 reflector.go:127] k8s.io/client-go/informers/factory.go:134: Failed to watch *v1.Pod: failed to list *v1.Pod: Unauthorized
E0223 11:53:09.409374       1 reflector.go:127] k8s.io/client-go/informers/factory.go:134: Failed to watch *v1.StatefulSet: failed to list *v1.StatefulSet: Unauthorized
E0223 11:53:09.674412       1 reflector.go:127] k8s.io/client-go/informers/factory.go:134: Failed to watch *v1.PersistentVolume: failed to list *v1.PersistentVolume: Unauthorized
I0223 11:53:11.409175       1 cert_rotation.go:88] certificate rotation detected, shutting down client connections to start using new credentials
I0223 11:53:11.437030       1 cert_rotation.go:88] certificate rotation detected, shutting down client connections to start using new credentials


Could you please let me know how else I can reproduce the issue here? Also, if I cannot reproduce the issue, is there a way for us to validate it on a 4.7 cluster?

I have also tried moving the time ahead by 365 days, but when I checked with xxia he recommended using 61 days, since the CA certificate is refreshed at around day 292 when the certificate has 365 days of validity: https://source.redhat.com/personal_blogs/ocp4_why_ca_cert_will_be_refreshed_at_292_day_when_the_cert_have_365days_validity

Comment 20 Tomáš Nožička 2021-01-06 07:56:34 UTC
Discussed on Slack; Rama is trying again with an updated script.

Comment 26 errata-xmlrpc 2021-02-24 15:18:56 UTC
Since the problem described in this bug report should be
resolved in a recent advisory, it has been closed with a
resolution of ERRATA.

For information on the advisory (Moderate: OpenShift Container Platform 4.7.0 security, bug fix, and enhancement update), and where to find the updated
files, follow the link below.

If the solution does not work for you, open a new bug report.

https://access.redhat.com/errata/RHSA-2020:5633

