Bug 1834908

Summary: Frequent restarts of kube-scheduler pods on baremetal deployments
Product: OpenShift Container Platform
Reporter: Sai Sindhur Malleni <smalleni>
Component: Networking
Assignee: Jacob Tanenbaum <jtanenba>
Networking sub component: ovn-kubernetes
QA Contact: zhaozhanqi <zzhao>
Status: CLOSED DUPLICATE
Docs Contact:
Severity: unspecified
Priority: unspecified
CC: aconstan, aos-bugs, dblack, jtaleric, mfojtik, mkarg
Version: 4.5
Keywords: Performance
Target Milestone: ---
Target Release: 4.6.0
Hardware: Unspecified
OS: Unspecified
Whiteboard:
Fixed In Version:
Doc Type: If docs needed, set a value
Doc Text:
Story Points: ---
Clone Of:
Environment:
Last Closed: 2020-05-20 16:32:54 UTC
Type: Bug
Regression: ---
Mount Type: ---
Documentation: ---
CRM:
Verified Versions:
Category: ---
oVirt Team: ---
RHEL 7.3 requirements from Atomic Host:
Cloudforms Team: ---
Target Upstream Version:
Embargoed:

Description Sai Sindhur Malleni 2020-05-12 16:08:37 UTC
Description of problem:

This is an OCP 4.5 deployment on baremetal with OVNKubernetes as the SDN.

We are seeing frequent restarts of the kube-scheduler pods.

[kni@e19-h24-b01-fc640 etcd-perf]$ oc get pods                       
NAME                                READY   STATUS      RESTARTS   AGE
installer-2-master-0                0/1     Completed   0          3d18h
installer-3-master-0                0/1     Completed   0          3d18h
installer-3-master-1                0/1     Completed   0          3d18h
installer-3-master-2                0/1     Completed   0          3d18h
installer-4-master-0                0/1     Completed   0          3d18h
installer-4-master-1                0/1     Completed   0          3d18h                                                                                                                                                                                                       
installer-4-master-2                0/1     Completed   0          3d18h                                                                                                                                                                                                       
installer-5-master-0                0/1     Completed   0          3d18h
installer-5-master-1                0/1     Completed   0          3d18h
installer-5-master-2                0/1     Completed   0          3d18h
openshift-kube-scheduler-master-0   2/2     Running     12         3d18h
openshift-kube-scheduler-master-1   2/2     Running     14         3d18h
openshift-kube-scheduler-master-2   2/2     Running     9          3d18h
revision-pruner-2-master-0          0/1     Completed   0          3d18h
revision-pruner-3-master-0          0/1     Completed   0          3d18h                   
revision-pruner-3-master-1          0/1     Completed   0          3d18h                                                                                                                                                                                                       
revision-pruner-3-master-2          0/1     Completed   0          3d18h                                                                                                                                                                                                       
revision-pruner-4-master-0          0/1     Completed   0          3d18h
revision-pruner-4-master-1          0/1     Completed   0          3d18h
revision-pruner-4-master-2          0/1     Completed   0          3d18h
revision-pruner-5-master-0          0/1     Completed   0          3d18h
revision-pruner-5-master-1          0/1     Completed   0          3d18h
revision-pruner-5-master-2          0/1     Completed   0          3d18h
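
The per-pod termination details shown below can be pulled with oc describe; a minimal sketch, assuming the pods live in the openshift-kube-scheduler namespace shown above (the exact command is not part of the original report):

$ oc describe pod openshift-kube-scheduler-master-0 -n openshift-kube-scheduler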


Looking at one of the pods we see
    State:       Running
      Started:   Tue, 12 May 2020 15:40:33 +0000
    Last State:  Terminated
      Reason:    Error
      Message:   e: Get https://localhost:6443/api/v1/nodes?allowWatchBookmarks=true&resourceVersion=2912586&timeout=9m59s&timeoutSeconds=599&watch=true: dial tcp [::1]:6443: connect: connection refused                                                                     
E0512 15:40:31.811267       1 reflector.go:380] k8s.io/client-go/informers/factory.go:135: Failed to watch *v1beta1.PodDisruptionBudget: Get https://localhost:6443/apis/policy/v1beta1/poddisruptionbudgets?allowWatchBookmarks=true&resourceVersion=2721641&timeout=9m43s&timeoutSeconds=583&watch=true: dial tcp [::1]:6443: connect: connection refused
E0512 15:40:31.812242       1 reflector.go:380] k8s.io/client-go/informers/factory.go:135: Failed to watch *v1.PersistentVolume: Get https://localhost:6443/api/v1/persistentvolumes?allowWatchBookmarks=true&resourceVersion=2721639&timeout=8m15s&timeoutSeconds=495&watch=true: dial tcp [::1]:6443: connect: connection refused
E0512 15:40:31.813289       1 reflector.go:380] k8s.io/client-go/informers/factory.go:135: Failed to watch *v1.PersistentVolumeClaim: Get https://localhost:6443/api/v1/persistentvolumeclaims?allowWatchBookmarks=true&resourceVersion=2721639&timeout=8m42s&timeoutSeconds=522&watch=true: dial tcp [::1]:6443: connect: connection refused
E0512 15:40:31.814445       1 reflector.go:380] k8s.io/client-go/informers/factory.go:135: Failed to watch *v1.Service: Get https://localhost:6443/api/v1/services?allowWatchBookmarks=true&resourceVersion=2869975&timeout=9m34s&timeoutSeconds=574&watch=true: dial tcp [::1]:6443: connect: connection refused
E0512 15:40:31.815408       1 reflector.go:380] k8s.io/client-go/informers/factory.go:135: Failed to watch *v1.StorageClass: Get https://localhost:6443/apis/storage.k8s.io/v1/storageclasses?allowWatchBookmarks=true&resourceVersion=2721642&timeout=5m41s&timeoutSeconds=341&watch=true: dial tcp [::1]:6443: connect: connection refused
I0512 15:40:32.638778       1 leaderelection.go:277] failed to renew lease openshift-kube-scheduler/kube-scheduler: timed out waiting for the condition                                                                                                                        
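
The full log of the previously crashed container can be retrieved with oc logs --previous to confirm the connection-refused and lease-renewal failures; a minimal sketch, assuming the container is named kube-scheduler and the pods are in the openshift-kube-scheduler namespace (both assumptions, not confirmed in the report):

$ oc logs openshift-kube-scheduler-master-0 -n openshift-kube-scheduler -c kube-scheduler --previous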



Version-Release number of selected component (if applicable):
4.5.0-0.nightly-2020-05-04-113741

How reproducible:
100%

Steps to Reproduce:
1. Deploy cluster on baremetal with OVN Kubernetes
2. Run some workloads, e.g. creating a batch of pods (a minimal sketch follows this list)
3.
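
A minimal sketch of step 2, assuming a throwaway project and any small test image (the project name, deployment name, image, and replica count below are illustrative, not from the report):

$ oc new-project scale-test
$ oc create deployment pause-workload --image=k8s.gcr.io/pause:3.2
$ oc scale deployment/pause-workload --replicas=100
$ oc get pods -w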

Actual results:
Kube-scheduler restarts frequently

Expected results:
The kube-scheduler should not restart, and connections to the apiserver should not time out or be refused.

Additional info:

Comment 1 Sai Sindhur Malleni 2020-05-12 16:09:18 UTC
Also, we checked etcd fsync latency and it was about 2 ms, which should be acceptable.
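
For reference, fsync latency on the etcd disk is commonly measured with fio's fdatasync test; a minimal sketch, assuming fio is available on the master (it is not part of RHCOS by default) and using /var/lib/etcd itself as the target directory so the same disk is exercised (node name and paths are illustrative, and the ~2 ms figure above corresponds to fio's fdatasync percentiles):

$ oc debug node/master-0
sh-4.4# chroot /host
sh-4.4# fio --rw=write --ioengine=sync --fdatasync=1 --directory=/var/lib/etcd --size=22m --bs=2300 --name=fsync-test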

Comment 2 Maciej Szulik 2020-05-13 10:30:00 UTC
This looks like a networking issue where the kube-scheduler can't reach the kube-apiserver, which should be available at https://localhost:6443/.
I'm sending this to networking since that looks like a misconfiguration.
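
A quick way to check whether that local apiserver endpoint is reachable from a master is to curl it from the host; a minimal sketch (node name is illustrative, not from the report):

$ oc debug node/master-0 -- chroot /host curl -sk https://localhost:6443/healthz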

Comment 4 Jacob Tanenbaum 2020-05-20 16:32:54 UTC

*** This bug has been marked as a duplicate of bug 1837992 ***