Bug 1747871

Summary: [ci] openshift-kube-scheduler operator fails
Product: OpenShift Container Platform Reporter: Yadan Pei <yapei>
Component: Networking    Assignee: Casey Callendrello <cdc>
Networking sub component: openshift-sdn QA Contact: zhaozhanqi <zzhao>
Status: CLOSED DUPLICATE Docs Contact:
Severity: low    
Priority: low CC: agarcial, aos-bugs, calfonso, hongkliu, kgarriso, mfojtik, sttts, yapei
Version: 4.2.0    Keywords: Reopened
Target Milestone: ---   
Target Release: 4.3.0   
Hardware: Unspecified   
OS: Unspecified   
Whiteboard: buildcop
Fixed In Version:    Doc Type: If docs needed, set a value
Doc Text:
Story Points: ---
Clone Of: Environment:
Last Closed: 2019-12-03 10:50:02 UTC Type: Bug
Regression: --- Mount Type: ---
Documentation: --- CRM:
Verified Versions: Category: ---
oVirt Team: --- RHEL 7.3 requirements from Atomic Host:
Cloudforms Team: --- Target Upstream Version:
Embargoed:

Description Yadan Pei 2019-09-02 06:55:24 UTC
Description of problem:
https://prow.k8s.io/view/gcs/origin-ci-test/logs/canary-openshift-ocp-installer-e2e-openstack-4.2/69


Sep 01 12:00:39.296 I ns/openshift-kube-scheduler-operator deployment/openshift-kube-scheduler-operator Status for clusteroperator/kube-scheduler changed: Degraded message changed from "NodeControllerDegraded: All master node(s) are ready" to "StaticPodsDegraded: nodes/ci-op-wn1h3kbf-qkj68-master-0 pods/openshift-kube-scheduler-ci-op-wn1h3kbf-qkj68-master-0 container=\"scheduler\" is not ready\nStaticPodsDegraded: nodes/ci-op-wn1h3kbf-qkj68-master-0 pods/openshift-kube-scheduler-ci-op-wn1h3kbf-qkj68-master-0 container=\"scheduler\" is terminated: \"Error\" - \"configmaps\\\" in API group \\\"\\\" in the namespace \\\"openshift-kube-scheduler\\\"\\nE0901 12:00:10.828247       1 reflector.go:126] k8s.io/client-go/informers/factory.go:133: Failed to list *v1.ReplicationController: replicationcontrollers is forbidden: User \\\"system:kube-scheduler\\\" cannot list resource \\\"replicationcontrollers\\\" in API group \\\"\\\" at the cluster scope\\nE0901 12:00:10.942694       1 webhook.go:107] Failed to make webhook authenticator request: tokenreviews.authentication.k8s.io is forbidden: User \\\"system:kube-scheduler\\\" cannot create resource \\\"tokenreviews\\\" in API group \\\"authentication.k8s.io\\\" at the cluster scope\\nE0901 12:00:10.942755       1 authentication.go:65] Unable to authenticate the request due to an error: [invalid bearer token, tokenreviews.authentication.k8s.io is forbidden: User \\\"system:kube-scheduler\\\" cannot create resource \\\"tokenreviews\\\" in API group \\\"authentication.k8s.io\\\" at the cluster scope]\\nE0901 12:00:11.289880       1 event.go:247] Could not construct reference to: '&v1.ConfigMap{TypeMeta:v1.TypeMeta{Kind:\\\"\\\", APIVersion:\\\"\\\"}, ObjectMeta:v1.ObjectMeta{Name:\\\"\\\", GenerateName:\\\"\\\", Namespace:\\\"\\\", SelfLink:\\\"\\\", UID:\\\"\\\", ResourceVersion:\\\"\\\", Generation:0, CreationTimestamp:v1.Time{Time:time.Time{wall:0x0, ext:0, loc:(*time.Location)(nil)}}, DeletionTimestamp:(*v1.Time)(nil), 
DeletionGracePeriodSeconds:(*int64)(nil), Labels:map[string]string(nil), Annotations:map[string]string(nil), OwnerReferences:[]v1.OwnerReference(nil), Initializers:(*v1.Initializers)(nil), Finalizers:[]string(nil), ClusterName:\\\"\\\", ManagedFields:[]v1.ManagedFieldsEntry(nil)}, Data:map[string]string(nil), BinaryData:map[string][]uint8(nil)}' due to: 'selfLink was empty, can't make reference'. Will not report event: 'Normal' 'LeaderElection' 'ci-op-wn1h3kbf-qkj68-master-0_28ebfa3f-ccab-11e9-ba3b-fa163ecbf40b stopped leading'\\nI0901 12:00:11.290047       1 leaderelection.go:263] failed to renew lease openshift-kube-scheduler/kube-scheduler: timed out waiting for the condition\\nF0901 12:00:11.290075       1 server.go:247] leaderelection lost\\n\"\nNodeControllerDegraded: All master node(s) are ready"

Version-Release number of selected component (if applicable):


How reproducible:
sometimes

Steps to Reproduce:
1.
2.
3.

Actual results:


Expected results:


Additional info:

Comment 3 Maciej Szulik 2019-09-02 10:58:27 UTC
I went through the logs and I don't see any problem with the scheduler: the operator is working as expected
and the scheduler is working properly. If there's a problem, it looks like the nodes weren't available,
which might in turn be a problem with the OpenStack infrastructure. I'm closing this; if you think
the problem still exists, please direct the bug at the specific component that is failing, not at a
component whose logs happen to contain a matching message.

Comment 5 Maciej Szulik 2019-09-20 20:29:23 UTC
The root cause is that the MCO has not finished the upgrade, so the kube-apiserver is not ready (degraded), which in turn causes the kube-scheduler to fail as well.
I'll pass this over to the MCO team for an investigation.

Comment 6 Erica von Buelow 2019-11-25 15:58:55 UTC
The SDN container seems to be crash looping. I'm moving this over to the networking team, although since this bug is somewhat old it would be good to check whether this is still an issue.

Comment 7 Casey Callendrello 2019-12-03 10:50:02 UTC
I see the issue; it seems to be slow SDN startup time in concert with a poorly written liveness check on one of the nodes. Maybe that node is just slow or had other connectivity issues.
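As an illustration of the failure mode described above (a liveness check that fires before a slow-starting SDN pod is ready), a probe can be made more tolerant of slow startup by giving it a longer initial delay and a higher failure threshold. The fragment below is hypothetical; the values and the check command are illustrative, not the actual openshift-sdn daemonset manifest:

```yaml
# Hypothetical container spec fragment; values are illustrative only.
# A more forgiving probe gives the SDN process time to come up before
# the kubelet starts restarting it.
livenessProbe:
  exec:
    command: ["test", "-f", "/etc/cni/net.d/80-openshift-network.conf"]
  initialDelaySeconds: 60   # wait before the first check
  periodSeconds: 10
  failureThreshold: 6       # tolerate ~1 minute of failures before a restart
```

With too small a delay or threshold, a node that is merely slow gets its SDN container killed mid-startup, which then cascades into the apiserver and scheduler symptoms seen in this bug.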

We fixed that in 1761609.

I see that CI has been reasonably green (though the release jobs are a train wreck, which is a separate problem), so I think this is fixed.

*** This bug has been marked as a duplicate of bug 1761609 ***