Bug 1834908

Summary: Frequent restarts of kube-scheduler pods on baremetal deployments
Product: OpenShift Container Platform
Reporter: Sai Sindhur Malleni <smalleni>
Component: Networking
Assignee: Jacob Tanenbaum <jtanenba>
Networking sub component: ovn-kubernetes
QA Contact: zhaozhanqi <zzhao>
Status: CLOSED DUPLICATE
Docs Contact:
Severity: unspecified
Priority: unspecified
CC: aconstan, aos-bugs, dblack, jtaleric, mfojtik, mkarg
Version: 4.5
Keywords: Performance
Target Milestone: ---
Target Release: 4.6.0
Hardware: Unspecified
OS: Unspecified
Whiteboard:
Fixed In Version:
Doc Type: If docs needed, set a value
Doc Text:
Story Points: ---
Clone Of:
Environment:
Last Closed: 2020-05-20 16:32:54 UTC
Type: Bug
Regression: ---
Mount Type: ---
Documentation: ---
CRM:
Verified Versions:
Category: ---
oVirt Team: ---
RHEL 7.3 requirements from Atomic Host:
Cloudforms Team: ---
Target Upstream Version:
Embargoed:

Description Sai Sindhur Malleni 2020-05-12 16:08:37 UTC
Description of problem:

This is an OCP 4.5 deployment on baremetal with OVNKubernetes as the SDN.

We are seeing frequent restarts of the kube-scheduler pods.

[kni@e19-h24-b01-fc640 etcd-perf]$ oc get pods                       
NAME                                READY   STATUS      RESTARTS   AGE
installer-2-master-0                0/1     Completed   0          3d18h
installer-3-master-0                0/1     Completed   0          3d18h
installer-3-master-1                0/1     Completed   0          3d18h
installer-3-master-2                0/1     Completed   0          3d18h
installer-4-master-0                0/1     Completed   0          3d18h
installer-4-master-1                0/1     Completed   0          3d18h                                                                                                                                                                                                       
installer-4-master-2                0/1     Completed   0          3d18h                                                                                                                                                                                                       
installer-5-master-0                0/1     Completed   0          3d18h
installer-5-master-1                0/1     Completed   0          3d18h
installer-5-master-2                0/1     Completed   0          3d18h
openshift-kube-scheduler-master-0   2/2     Running     12         3d18h
openshift-kube-scheduler-master-1   2/2     Running     14         3d18h
openshift-kube-scheduler-master-2   2/2     Running     9          3d18h
revision-pruner-2-master-0          0/1     Completed   0          3d18h
revision-pruner-3-master-0          0/1     Completed   0          3d18h                   
revision-pruner-3-master-1          0/1     Completed   0          3d18h                                                                                                                                                                                                       
revision-pruner-3-master-2          0/1     Completed   0          3d18h                                                                                                                                                                                                       
revision-pruner-4-master-0          0/1     Completed   0          3d18h
revision-pruner-4-master-1          0/1     Completed   0          3d18h
revision-pruner-4-master-2          0/1     Completed   0          3d18h
revision-pruner-5-master-0          0/1     Completed   0          3d18h
revision-pruner-5-master-1          0/1     Completed   0          3d18h
revision-pruner-5-master-2          0/1     Completed   0          3d18h
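
The per-pod termination details shown below can be pulled with oc describe; a minimal sketch, assuming the pods live in the openshift-kube-scheduler namespace shown above (the exact command is not part of the original report):

$ oc describe pod openshift-kube-scheduler-master-0 -n openshift-kube-scheduler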


Looking at one of the pods we see
    State:       Running
      Started:   Tue, 12 May 2020 15:40:33 +0000
    Last State:  Terminated
      Reason:    Error
      Message:   e: Get https://localhost:6443/api/v1/nodes?allowWatchBookmarks=true&resourceVersion=2912586&timeout=9m59s&timeoutSeconds=599&watch=true: dial tcp [::1]:6443: connect: connection refused                                                                     
E0512 15:40:31.811267       1 reflector.go:380] k8s.io/client-go/informers/factory.go:135: Failed to watch *v1beta1.PodDisruptionBudget: Get https://localhost:6443/apis/policy/v1beta1/poddisruptionbudgets?allowWatchBookmarks=true&resourceVersion=2721641&timeout=9m43s&timeoutSeconds=583&watch=true: dial tcp [::1]:6443: connect: connection refused
E0512 15:40:31.812242       1 reflector.go:380] k8s.io/client-go/informers/factory.go:135: Failed to watch *v1.PersistentVolume: Get https://localhost:6443/api/v1/persistentvolumes?allowWatchBookmarks=true&resourceVersion=2721639&timeout=8m15s&timeoutSeconds=495&watch=true: dial tcp [::1]:6443: connect: connection refused
E0512 15:40:31.813289       1 reflector.go:380] k8s.io/client-go/informers/factory.go:135: Failed to watch *v1.PersistentVolumeClaim: Get https://localhost:6443/api/v1/persistentvolumeclaims?allowWatchBookmarks=true&resourceVersion=2721639&timeout=8m42s&timeoutSeconds=522&watch=true: dial tcp [::1]:6443: connect: connection refused
E0512 15:40:31.814445       1 reflector.go:380] k8s.io/client-go/informers/factory.go:135: Failed to watch *v1.Service: Get https://localhost:6443/api/v1/services?allowWatchBookmarks=true&resourceVersion=2869975&timeout=9m34s&timeoutSeconds=574&watch=true: dial tcp [::1]:6443: connect: connection refused
E0512 15:40:31.815408       1 reflector.go:380] k8s.io/client-go/informers/factory.go:135: Failed to watch *v1.StorageClass: Get https://localhost:6443/apis/storage.k8s.io/v1/storageclasses?allowWatchBookmarks=true&resourceVersion=2721642&timeout=5m41s&timeoutSeconds=341&watch=true: dial tcp [::1]:6443: connect: connection refused
I0512 15:40:32.638778       1 leaderelection.go:277] failed to renew lease openshift-kube-scheduler/kube-scheduler: timed out waiting for the condition                                                                                                                        
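
The full log of the previously crashed container can be retrieved with oc logs --previous to confirm the connection-refused and lease-renewal failures; a minimal sketch, assuming the container is named kube-scheduler and the pods are in the openshift-kube-scheduler namespace (both assumptions, not confirmed in the report):

$ oc logs openshift-kube-scheduler-master-0 -n openshift-kube-scheduler -c kube-scheduler --previous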



Version-Release number of selected component (if applicable):
4.5.0-0.nightly-2020-05-04-113741

How reproducible:
100%

Steps to Reproduce:
1. Deploy cluster on baremetal with OVN Kubernetes
2. Run some workloads, e.g. creating a batch of pods (a minimal sketch follows this list)
3.
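
A minimal sketch of step 2, assuming a throwaway project and any small test image (the project name, deployment name, image, and replica count below are illustrative, not from the report):

$ oc new-project scale-test
$ oc create deployment pause-workload --image=k8s.gcr.io/pause:3.2
$ oc scale deployment/pause-workload --replicas=100
$ oc get pods -w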

Actual results:
Kube-scheduler restarts frequently

Expected results:
The kube-scheduler should not restart, and connections to the apiserver should not time out or be refused.

Additional info:

Comment 1 Sai Sindhur Malleni 2020-05-12 16:09:18 UTC
Also, we checked etcd fsync latency and it was about 2 ms, which should be acceptable.
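
For reference, fsync latency on the etcd disk is commonly measured with fio's fdatasync test; a minimal sketch, assuming fio is available on the master (it is not part of RHCOS by default) and using /var/lib/etcd itself as the target directory so the same disk is exercised (node name and paths are illustrative, and the ~2 ms figure above corresponds to fio's fdatasync percentiles):

$ oc debug node/master-0
sh-4.4# chroot /host
sh-4.4# fio --rw=write --ioengine=sync --fdatasync=1 --directory=/var/lib/etcd --size=22m --bs=2300 --name=fsync-test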

Comment 2 Maciej Szulik 2020-05-13 10:30:00 UTC
This looks like a networking issue where the kube-scheduler can't reach the kube-apiserver, which should be available at https://localhost:6443/.
I'm sending this to networking since that looks like a misconfiguration.
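
A quick way to check whether that local apiserver endpoint is reachable from a master is to curl it from the host; a minimal sketch (node name is illustrative, not from the report):

$ oc debug node/master-0 -- chroot /host curl -sk https://localhost:6443/healthz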

Comment 4 Jacob Tanenbaum 2020-05-20 16:32:54 UTC

*** This bug has been marked as a duplicate of bug 1837992 ***