Bug 1747871 - [ci] openshift-kube-scheduler operator fails
Summary: [ci] openshift-kube-scheduler operator fails
Keywords:
Status: CLOSED DUPLICATE of bug 1761609
Alias: None
Product: OpenShift Container Platform
Classification: Red Hat
Component: Networking
Version: 4.2.0
Hardware: Unspecified
OS: Unspecified
Priority: low
Severity: low
Target Milestone: ---
Target Release: 4.3.0
Assignee: Casey Callendrello
QA Contact: zhaozhanqi
URL:
Whiteboard: buildcop
Depends On:
Blocks:
 
Reported: 2019-09-02 06:55 UTC by Yadan Pei
Modified: 2019-12-03 10:50 UTC
CC List: 8 users

Fixed In Version:
Doc Type: If docs needed, set a value
Doc Text:
Clone Of:
Environment:
Last Closed: 2019-12-03 10:50:02 UTC
Target Upstream Version:
Embargoed:



Description Yadan Pei 2019-09-02 06:55:24 UTC
Description of problem:
https://prow.k8s.io/view/gcs/origin-ci-test/logs/canary-openshift-ocp-installer-e2e-openstack-4.2/69


Sep 01 12:00:39.296 I ns/openshift-kube-scheduler-operator deployment/openshift-kube-scheduler-operator Status for clusteroperator/kube-scheduler changed: Degraded message changed from "NodeControllerDegraded: All master node(s) are ready" to "StaticPodsDegraded: nodes/ci-op-wn1h3kbf-qkj68-master-0 pods/openshift-kube-scheduler-ci-op-wn1h3kbf-qkj68-master-0 container=\"scheduler\" is not ready\nStaticPodsDegraded: nodes/ci-op-wn1h3kbf-qkj68-master-0 pods/openshift-kube-scheduler-ci-op-wn1h3kbf-qkj68-master-0 container=\"scheduler\" is terminated: \"Error\" - \"configmaps\\\" in API group \\\"\\\" in the namespace \\\"openshift-kube-scheduler\\\"\\nE0901 12:00:10.828247       1 reflector.go:126] k8s.io/client-go/informers/factory.go:133: Failed to list *v1.ReplicationController: replicationcontrollers is forbidden: User \\\"system:kube-scheduler\\\" cannot list resource \\\"replicationcontrollers\\\" in API group \\\"\\\" at the cluster scope\\nE0901 12:00:10.942694       1 webhook.go:107] Failed to make webhook authenticator request: tokenreviews.authentication.k8s.io is forbidden: User \\\"system:kube-scheduler\\\" cannot create resource \\\"tokenreviews\\\" in API group \\\"authentication.k8s.io\\\" at the cluster scope\\nE0901 12:00:10.942755       1 authentication.go:65] Unable to authenticate the request due to an error: [invalid bearer token, tokenreviews.authentication.k8s.io is forbidden: User \\\"system:kube-scheduler\\\" cannot create resource \\\"tokenreviews\\\" in API group \\\"authentication.k8s.io\\\" at the cluster scope]\\nE0901 12:00:11.289880       1 event.go:247] Could not construct reference to: '&v1.ConfigMap{TypeMeta:v1.TypeMeta{Kind:\\\"\\\", APIVersion:\\\"\\\"}, ObjectMeta:v1.ObjectMeta{Name:\\\"\\\", GenerateName:\\\"\\\", Namespace:\\\"\\\", SelfLink:\\\"\\\", UID:\\\"\\\", ResourceVersion:\\\"\\\", Generation:0, CreationTimestamp:v1.Time{Time:time.Time{wall:0x0, ext:0, loc:(*time.Location)(nil)}}, DeletionTimestamp:(*v1.Time)(nil), DeletionGracePeriodSeconds:(*int64)(nil), Labels:map[string]string(nil), Annotations:map[string]string(nil), OwnerReferences:[]v1.OwnerReference(nil), Initializers:(*v1.Initializers)(nil), Finalizers:[]string(nil), ClusterName:\\\"\\\", ManagedFields:[]v1.ManagedFieldsEntry(nil)}, Data:map[string]string(nil), BinaryData:map[string][]uint8(nil)}' due to: 'selfLink was empty, can't make reference'. Will not report event: 'Normal' 'LeaderElection' 'ci-op-wn1h3kbf-qkj68-master-0_28ebfa3f-ccab-11e9-ba3b-fa163ecbf40b stopped leading'\\nI0901 12:00:11.290047       1 leaderelection.go:263] failed to renew lease openshift-kube-scheduler/kube-scheduler: timed out waiting for the condition\\nF0901 12:00:11.290075       1 server.go:247] leaderelection lost\\n\"\nNodeControllerDegraded: All master node(s) are ready"
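
The tail of that status message ("failed to renew lease openshift-kube-scheduler/kube-scheduler" followed by "leaderelection lost") is what actually terminates the container: client-go's leader election invokes its OnStoppedLeading callback once the lease can no longer be renewed, and the scheduler treats that as fatal. A minimal sketch of that pattern with k8s.io/client-go/tools/leaderelection (the lock type, names, and timings here are illustrative assumptions, not the scheduler's real configuration):

package main

import (
	"context"
	"os"
	"time"

	metav1 "k8s.io/apimachinery/pkg/apis/meta/v1"
	"k8s.io/client-go/kubernetes"
	"k8s.io/client-go/rest"
	"k8s.io/client-go/tools/leaderelection"
	"k8s.io/client-go/tools/leaderelection/resourcelock"
	"k8s.io/klog"
)

func main() {
	cfg, err := rest.InClusterConfig()
	if err != nil {
		klog.Fatal(err)
	}
	client := kubernetes.NewForConfigOrDie(cfg)
	id, _ := os.Hostname()

	// Lease-based lock for illustration; the scheduler in this release used a
	// ConfigMap-based lock, but the renew/lose semantics are the same.
	lock := &resourcelock.LeaseLock{
		LeaseMeta:  metav1.ObjectMeta{Name: "kube-scheduler", Namespace: "openshift-kube-scheduler"},
		Client:     client.CoordinationV1(),
		LockConfig: resourcelock.ResourceLockConfig{Identity: id},
	}

	leaderelection.RunOrDie(context.Background(), leaderelection.LeaderElectionConfig{
		Lock:          lock,
		LeaseDuration: 15 * time.Second,
		RenewDeadline: 10 * time.Second, // if renewal fails for this long, leadership is lost
		RetryPeriod:   2 * time.Second,
		Callbacks: leaderelection.LeaderCallbacks{
			OnStartedLeading: func(ctx context.Context) {
				// the scheduling loop would run here
				<-ctx.Done()
			},
			OnStoppedLeading: func() {
				// This is the code path behind "F0901 ... leaderelection lost":
				// the process exits and the static pod restarts.
				klog.Fatal("leaderelection lost")
			},
		},
	})
}

If the API server is unreachable for longer than RenewDeadline (for example while the node's networking is still coming up), OnStoppedLeading fires and the container exits with an error, which matches the terminated "Error" state reported above.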

Version-Release number of selected component (if applicable):


How reproducible:
sometimes

Steps to Reproduce:
1.
2.
3.

Actual results:


Expected results:


Additional info:

Comment 3 Maciej Szulik 2019-09-02 10:58:27 UTC
I went through the logs and I don't see any problem with the scheduler; the operator is working as expected
and the scheduler is working properly. If there is a problem, it looks like an issue with node availability,
which might in turn be a problem with the OpenStack infrastructure. I'm closing this; if you think the
problem still exists, please direct the bug at the specific component that is failing, not at a component
for which you happen to find a matching log.

Comment 5 Maciej Szulik 2019-09-20 20:29:23 UTC
The root cause is that the MCO has not finished the upgrade, so kube-apiserver is not ready (degraded), which in turn causes kube-scheduler to fail as well.
I'll pass this over to the MCO team for investigation.
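
For this kind of triage, the clusteroperator conditions usually tell the story more directly than the scheduler log. A rough sketch, assuming cluster-admin credentials in the default kubeconfig, of reading the kube-apiserver ClusterOperator's Degraded condition through the dynamic client (error handling is stripped down, and client-go releases before 0.18 omit the context argument):

package main

import (
	"context"
	"fmt"

	metav1 "k8s.io/apimachinery/pkg/apis/meta/v1"
	"k8s.io/apimachinery/pkg/apis/meta/v1/unstructured"
	"k8s.io/apimachinery/pkg/runtime/schema"
	"k8s.io/client-go/dynamic"
	"k8s.io/client-go/tools/clientcmd"
)

func main() {
	cfg, err := clientcmd.BuildConfigFromFlags("", clientcmd.RecommendedHomeFile)
	if err != nil {
		panic(err)
	}
	dyn, err := dynamic.NewForConfig(cfg)
	if err != nil {
		panic(err)
	}

	// clusteroperators.config.openshift.io is cluster-scoped.
	gvr := schema.GroupVersionResource{
		Group:    "config.openshift.io",
		Version:  "v1",
		Resource: "clusteroperators",
	}
	co, err := dyn.Resource(gvr).Get(context.TODO(), "kube-apiserver", metav1.GetOptions{})
	if err != nil {
		panic(err)
	}

	// status.conditions carries Available/Progressing/Degraded, the same
	// conditions quoted in the event at the top of this bug.
	conditions, _, _ := unstructured.NestedSlice(co.Object, "status", "conditions")
	for _, c := range conditions {
		cond := c.(map[string]interface{})
		if cond["type"] == "Degraded" {
			fmt.Printf("kube-apiserver Degraded=%v: %v\n", cond["status"], cond["message"])
		}
	}
}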

Comment 6 Erica von Buelow 2019-11-25 15:58:55 UTC
The SDN container seems to be crash-looping. I'm moving this over to the networking team, although since this bug is somewhat old it would be good to check whether this is still an issue.

Comment 7 Casey Callendrello 2019-12-03 10:50:02 UTC
I see the issue; it seems to be slow SDN startup time in concert with a poorly written liveness check on one of the nodes. Maybe that node is just slow or had other connectivity issues.

We fixed that in 1761609.
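
The liveness-check interaction is easy to picture: if the probe starts failing before the SDN pod has finished its initial sync, the kubelet restarts the container in a loop and the node's networking never settles. A hypothetical sketch of a more tolerant probe against k8s.io/api/core/v1 as it looked in this release (the endpoint, port, and timings are assumptions for illustration, not the actual openshift-sdn probe or the change that landed for bug 1761609):

package main

import (
	"fmt"

	corev1 "k8s.io/api/core/v1"
	"k8s.io/apimachinery/pkg/util/intstr"
)

func main() {
	// Hypothetical liveness probe that gives a slow SDN startup room to come up.
	// In the core/v1 API of this era the handler field is named Handler
	// (it was renamed to ProbeHandler in later releases).
	probe := corev1.Probe{
		Handler: corev1.Handler{
			HTTPGet: &corev1.HTTPGetAction{
				Path: "/healthz",            // assumed health endpoint
				Port: intstr.FromInt(10256), // assumed port
			},
		},
		InitialDelaySeconds: 60, // let the SDN finish its initial sync before the first check
		PeriodSeconds:       10,
		TimeoutSeconds:      5,
		FailureThreshold:    6, // about a minute of consecutive failures before a restart
	}
	fmt.Printf("%+v\n", probe)
}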

I see that CI has been reasonably green (though the release jobs are a train wreck, not because of this problem), so I think this is fixed.

*** This bug has been marked as a duplicate of bug 1761609 ***

