Bug 1691085 - clusteroperator/kube-scheduler not ready due to static pods failing with missing RBAC
Keywords:
Status: CLOSED WORKSFORME
Alias: None
Product: OpenShift Container Platform
Classification: Red Hat
Component: Master
Version: 4.1.0
Hardware: Unspecified
OS: Unspecified
Priority: low
Severity: low
Target Milestone: ---
Target Release: 4.3.0
Assignee: Michal Fojtik
QA Contact: Xingxing Xia
URL:
Whiteboard:
Depends On:
Blocks:
 
Reported: 2019-03-20 20:07 UTC by ewolinet
Modified: 2020-03-10 19:32 UTC
CC: 7 users

Fixed In Version:
Doc Type: If docs needed, set a value
Doc Text:
Clone Of:
Environment:
Last Closed: 2020-03-10 19:32:11 UTC
Target Upstream Version:
Embargoed:


Attachments
Occurrences of this error in CI from 2019-03-19T12:28 to 2019-03-21T20:06 UTC (341.40 KB, image/svg+xml)
2019-03-22 05:43 UTC, W. Trevor King

Description ewolinet 2019-03-20 20:07:26 UTC
Discovered in the aws-serial test via the monitor output. Tests appear to continue running afterwards.

ref: https://openshift-gce-devel.appspot.com/build/origin-ci-test/logs/release-openshift-origin-installer-e2e-aws-serial-4.0/3103

Snippet from monitor:

Mar 20 12:44:54.840 E clusteroperator/kube-scheduler changed Failing to True: StaticPodsFailing: StaticPodsFailing: nodes/ip-10-0-167-200.ec2.internal pods/openshift-kube-scheduler-ip-10-0-167-200.ec2.internal container="scheduler" is not ready\nStaticPodsFailing: nodes/ip-10-0-167-200.ec2.internal pods/openshift-kube-scheduler-ip-10-0-167-200.ec2.internal container="scheduler" is terminated: "Error" - "orization.k8s.io \"basic-user\" not found, clusterrole.rbac.authorization.k8s.io \"system:scope-impersonation\" not found, clusterrole.rbac.authorization.k8s.io \"system:kube-scheduler\" not found, clusterrole.rbac.authorization.k8s.io \"system:webhook\" not found, clusterrole.rbac.authorization.k8s.io \"system:discovery\" not found, clusterrole.rbac.authorization.k8s.io \"system:volume-scheduler\" not found, clusterrole.rbac.authorization.k8s.io \"system:oauth-token-deleter\" not found, clusterrole.rbac.authorization.k8s.io \"system:build-strategy-docker\" not found, clusterrole.rbac.authorization.k8s.io \"system:discovery\" not found, clusterrole.rbac.authorization.k8s.io \"system:build-strategy-source\" not found, clusterrole.rbac.authorization.k8s.io \"system:build-strategy-jenkinspipeline\" not found, clusterrole.rbac.authorization.k8s.io \"system:basic-user\" not found, clusterrole.rbac.authorization.k8s.io \"cluster-status\" not found, clusterrole.rbac.authorization.k8s.io \"self-access-reviewer\" not found]\nE0320 12:44:51.479586       1 event.go:259] Could not construct reference to: '&v1.Endpoints{TypeMeta:v1.TypeMeta{Kind:\"\", APIVersion:\"\"}, ObjectMeta:v1.ObjectMeta{Name:\"\", GenerateName:\"\", Namespace:\"\", SelfLink:\"\", UID:\"\", ResourceVersion:\"\", Generation:0, CreationTimestamp:v1.Time{Time:time.Time{wall:0x0, ext:0, loc:(*time.Location)(nil)}}, DeletionTimestamp:(*v1.Time)(nil), DeletionGracePeriodSeconds:(*int64)(nil), Labels:map[string]string(nil), Annotations:map[string]string(nil), OwnerReferences:[]v1.OwnerReference(nil), 
Initializers:(*v1.Initializers)(nil), Finalizers:[]string(nil), ClusterName:\"\"}, Subsets:[]v1.EndpointSubset(nil)}' due to: 'selfLink was empty, can't make reference'. Will not report event: 'Normal' 'LeaderElection' 'ip-10-0-167-200_a29ea6be-4b0d-11e9-8ccb-0e09ebc855f0 stopped leading'\nI0320 12:44:51.479691       1 leaderelection.go:249] failed to renew lease kube-system/kube-scheduler: failed to tryAcquireOrRenew context deadline exceeded\nE0320 12:44:51.479713       1 server.go:207] lost master\nlost lease\n"
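As a sketch only (the shortened log text below is illustrative, not copied verbatim from the pod logs), the ClusterRole names reported as missing can be pulled out of the terminated-container message like so:

```python
import re

# Illustrative, shortened copy of the "not found" message quoted above.
log = (
    'clusterrole.rbac.authorization.k8s.io "system:kube-scheduler" not found, '
    'clusterrole.rbac.authorization.k8s.io "system:volume-scheduler" not found, '
    'clusterrole.rbac.authorization.k8s.io "basic-user" not found'
)

# Extract each ClusterRole name the scheduler failed to find.
missing = re.findall(
    r'clusterrole\.rbac\.authorization\.k8s\.io "([^"]+)" not found', log
)
print(missing)  # ['system:kube-scheduler', 'system:volume-scheduler', 'basic-user']
```

This is just a triage aid for deduplicating the role names out of a long message; it does not imply the roles were actually absent from etcd (see comment 6).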

Comment 1 Ben Parees 2019-03-20 20:19:36 UTC
kube-scheduler (the operator reporting the error event) belongs to the Pod team, so sending there first, but I expect they may route this to the Master team, since it looks like there were kube-apiserver issues.

Comment 2 W. Trevor King 2019-03-22 05:43:45 UTC
Created attachment 1546773 [details]
Occurrences of this error in CI from 2019-03-19T12:28 to 2019-03-21T20:06 UTC

This has caused 18 of our 861 failures in *-e2e-aws* jobs across the whole CI system over the past 55 hours.  Generated with [1]:

  $ deck-build-log-plot 'clusteroperator/kube-scheduler changed Failing to True.*clusterrole.rbac.authorization.*not found'

[1]: https://github.com/wking/openshift-release/tree/debug-scripts/deck-build-log
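For reference, the argument to deck-build-log-plot is a regular expression matched against build logs. A minimal sketch of the same match (the sample line below is illustrative, in the shape of the monitor output quoted in the description):

```python
import re

# The search expression passed to deck-build-log-plot above.
pattern = re.compile(
    r'clusteroperator/kube-scheduler changed Failing to True'
    r'.*clusterrole\.rbac\.authorization\..*not found'
)

# Illustrative line shaped like the monitor output in the description.
line = (
    'clusteroperator/kube-scheduler changed Failing to True: StaticPodsFailing: '
    'clusterrole.rbac.authorization.k8s.io "system:webhook" not found'
)

print(bool(pattern.search(line)))  # True
```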

Comment 4 Seth Jennings 2019-03-27 20:39:49 UTC
Yes, sending to Master to figure out why these ClusterRoles do not exist.

Comment 5 Seth Jennings 2019-04-01 16:57:12 UTC
*** Bug 1694186 has been marked as a duplicate of this bug. ***

Comment 6 Erica von Buelow 2019-04-01 18:02:10 UTC
I'm not sure this belongs with auth. I don't see anything indicating an auth issue. The scheduler pod eventually starts up; the error log gets swept up as the last error that occurred, but the failure seems to be that the test timed out. The "clusterrole not found" errors are more likely an API server uptime issue than a problem of those ClusterRoles not existing. AWS resource limits, cert rotation timing, and possibly other factors could be contributing to the delays in the test run. I'm not sure who is the right owner for those types of issues.

Comment 7 Michal Fojtik 2019-04-04 17:08:54 UTC
This has caused 18 of our 861 failures; lowering priority, as this might only be a transient failure.

