Bug 1691085

Summary: clusteroperator/kube-scheduler not ready due to static pods failing with missing RBAC
Product: OpenShift Container Platform
Reporter: ewolinet
Component: Master
Assignee: Michal Fojtik <mfojtik>
Status: CLOSED WORKSFORME
QA Contact: Xingxing Xia <xxia>
Severity: low
Priority: low
Version: 4.1.0
CC: aos-bugs, bparees, ccoleman, gblomqui, jokerman, mfojtik, mmccomas
Target Milestone: ---
Target Release: 4.3.0
Hardware: Unspecified
OS: Unspecified
Last Closed: 2020-03-10 19:32:11 UTC
Type: Bug
Attachments:
  Occurrences of this error in CI from 2019-03-19T12:28 to 2019-03-21T20:06 UTC (flags: none)

Description ewolinet 2019-03-20 20:07:26 UTC
Discovered in an aws-serial test run by the monitor. Tests appear to continue running afterwards.

ref: https://openshift-gce-devel.appspot.com/build/origin-ci-test/logs/release-openshift-origin-installer-e2e-aws-serial-4.0/3103

Snippet from monitor:

Mar 20 12:44:54.840 E clusteroperator/kube-scheduler changed Failing to True: StaticPodsFailing: StaticPodsFailing: nodes/ip-10-0-167-200.ec2.internal pods/openshift-kube-scheduler-ip-10-0-167-200.ec2.internal container="scheduler" is not ready\nStaticPodsFailing: nodes/ip-10-0-167-200.ec2.internal pods/openshift-kube-scheduler-ip-10-0-167-200.ec2.internal container="scheduler" is terminated: "Error" - "orization.k8s.io \"basic-user\" not found, clusterrole.rbac.authorization.k8s.io \"system:scope-impersonation\" not found, clusterrole.rbac.authorization.k8s.io \"system:kube-scheduler\" not found, clusterrole.rbac.authorization.k8s.io \"system:webhook\" not found, clusterrole.rbac.authorization.k8s.io \"system:discovery\" not found, clusterrole.rbac.authorization.k8s.io \"system:volume-scheduler\" not found, clusterrole.rbac.authorization.k8s.io \"system:oauth-token-deleter\" not found, clusterrole.rbac.authorization.k8s.io \"system:build-strategy-docker\" not found, clusterrole.rbac.authorization.k8s.io \"system:discovery\" not found, clusterrole.rbac.authorization.k8s.io \"system:build-strategy-source\" not found, clusterrole.rbac.authorization.k8s.io \"system:build-strategy-jenkinspipeline\" not found, clusterrole.rbac.authorization.k8s.io \"system:basic-user\" not found, clusterrole.rbac.authorization.k8s.io \"cluster-status\" not found, clusterrole.rbac.authorization.k8s.io \"self-access-reviewer\" not found]\nE0320 12:44:51.479586       1 event.go:259] Could not construct reference to: '&v1.Endpoints{TypeMeta:v1.TypeMeta{Kind:\"\", APIVersion:\"\"}, ObjectMeta:v1.ObjectMeta{Name:\"\", GenerateName:\"\", Namespace:\"\", SelfLink:\"\", UID:\"\", ResourceVersion:\"\", Generation:0, CreationTimestamp:v1.Time{Time:time.Time{wall:0x0, ext:0, loc:(*time.Location)(nil)}}, DeletionTimestamp:(*v1.Time)(nil), DeletionGracePeriodSeconds:(*int64)(nil), Labels:map[string]string(nil), Annotations:map[string]string(nil), OwnerReferences:[]v1.OwnerReference(nil), Initializers:(*v1.Initializers)(nil), Finalizers:[]string(nil), ClusterName:\"\"}, Subsets:[]v1.EndpointSubset(nil)}' due to: 'selfLink was empty, can't make reference'. Will not report event: 'Normal' 'LeaderElection' 'ip-10-0-167-200_a29ea6be-4b0d-11e9-8ccb-0e09ebc855f0 stopped leading'\nI0320 12:44:51.479691       1 leaderelection.go:249] failed to renew lease kube-system/kube-scheduler: failed to tryAcquireOrRenew context deadline exceeded\nE0320 12:44:51.479713       1 server.go:207] lost master\nlost lease\n"
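
The last errors in the snippet show the scheduler losing its leader-election lock ("failed to renew lease kube-system/kube-scheduler ... lost master"). One way to inspect the current leader record on a live cluster (untested sketch; assumes this release still stores the scheduler lock as an annotation on the kube-system Endpoints object, as upstream Kubernetes did at the time):

  $ # print the leader-election record for the scheduler lock
  $ oc -n kube-system get endpoints kube-scheduler \
      -o jsonpath='{.metadata.annotations.control-plane\.alpha\.kubernetes\.io/leader}'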

Comment 1 Ben Parees 2019-03-20 20:19:36 UTC
kube-scheduler (the operator reporting the error event) belongs to the Pod team, so I'm sending this there first, but I expect they may pass it to the Master team, since it looks like there were kube-apiserver issues.

Comment 2 W. Trevor King 2019-03-22 05:43:45 UTC
Created attachment 1546773 [details]
Occurrences of this error in CI from 2019-03-19T12:28 to 2019-03-21T20:06 UTC

This has caused 18 of our 861 failures in *-e2e-aws* jobs across the whole CI system over the past 55 hours.  Generated with [1]:

  $ deck-build-log-plot 'clusteroperator/kube-scheduler changed Failing to True.*clusterrole.rbac.authorization.*not found'

[1]: https://github.com/wking/openshift-release/tree/debug-scripts/deck-build-log
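
The script is essentially a regex search over CI build logs; a rough equivalent of the underlying query (sketch only; 'ci-logs/' is a hypothetical local mirror of the downloaded build-log.txt files, not part of the tool above):

  $ # count jobs whose build log matches the failure signature
  $ grep -rlE 'clusteroperator/kube-scheduler changed Failing to True.*clusterrole.rbac.authorization.*not found' \
      ci-logs/ | wc -l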

Comment 4 Seth Jennings 2019-03-27 20:39:49 UTC
Yes, sending to Master to figure out why these ClusterRoles do not exist.
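
A quick way to confirm whether the roles named in the error actually exist on a live cluster (assumes oc with sufficient privileges to read ClusterRoles):

  $ # all of these names appear in the "not found" list above
  $ oc get clusterrole system:kube-scheduler system:volume-scheduler basic-user system:discovery

If they exist on a healthy cluster, the errors point more at API availability during startup than at missing RBAC.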

Comment 5 Seth Jennings 2019-04-01 16:57:12 UTC
*** Bug 1694186 has been marked as a duplicate of this bug. ***

Comment 6 Erica von Buelow 2019-04-01 18:02:10 UTC
I'm not sure this belongs with auth; I don't see anything to indicate an auth issue. The scheduler pod eventually starts up; the error log gets swept up as the last error that occurred, but the failure seems to be that the test timed out. The "clusterrole not found" errors are more likely an API server availability issue than a problem of those ClusterRoles not existing. AWS resource limits, cert rotation timing, and possibly other factors could be contributing to the delays in the test run. I'm not sure who the right owner is for those types of issues.
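
One way to check the API-server-availability theory against a live cluster (sketch; assumes oc access to the affected cluster while the failure is reproducing):

  $ # operator-level view of kube-apiserver health
  $ oc get clusteroperator kube-apiserver
  $ # recent events in the apiserver namespace, newest last
  $ oc -n openshift-kube-apiserver get events --sort-by=.lastTimestamp | tail -n 20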

Comment 7 Michal Fojtik 2019-04-04 17:08:54 UTC
This has caused 18 of our 861 failures; lowering priority, as this may only be a transient failure.