Bug 1691085

Summary: clusteroperator/kube-scheduler not ready due to static pods failing with missing RBAC
Product: OpenShift Container Platform
Reporter: ewolinet
Component: Master
Assignee: Michal Fojtik <mfojtik>
Status: CLOSED WORKSFORME
QA Contact: Xingxing Xia <xxia>
Severity: low
Priority: low
Version: 4.1.0
CC: aos-bugs, bparees, ccoleman, gblomqui, jokerman, mfojtik, mmccomas
Target Milestone: ---
Target Release: 4.3.0
Hardware: Unspecified
OS: Unspecified
Last Closed: 2020-03-10 19:32:11 UTC
Type: Bug
Attachments:
  Occurrences of this error in CI from 2019-03-19T12:28 to 2019-03-21T20:06 UTC (flags: none)

Description ewolinet 2019-03-20 20:07:26 UTC
Discovered in an aws-serial test run by the monitor. Tests appear to continue running afterwards.

ref: https://openshift-gce-devel.appspot.com/build/origin-ci-test/logs/release-openshift-origin-installer-e2e-aws-serial-4.0/3103

Snippet from monitor:

Mar 20 12:44:54.840 E clusteroperator/kube-scheduler changed Failing to True: StaticPodsFailing: StaticPodsFailing: nodes/ip-10-0-167-200.ec2.internal pods/openshift-kube-scheduler-ip-10-0-167-200.ec2.internal container="scheduler" is not ready\nStaticPodsFailing: nodes/ip-10-0-167-200.ec2.internal pods/openshift-kube-scheduler-ip-10-0-167-200.ec2.internal container="scheduler" is terminated: "Error" - "orization.k8s.io \"basic-user\" not found, clusterrole.rbac.authorization.k8s.io \"system:scope-impersonation\" not found, clusterrole.rbac.authorization.k8s.io \"system:kube-scheduler\" not found, clusterrole.rbac.authorization.k8s.io \"system:webhook\" not found, clusterrole.rbac.authorization.k8s.io \"system:discovery\" not found, clusterrole.rbac.authorization.k8s.io \"system:volume-scheduler\" not found, clusterrole.rbac.authorization.k8s.io \"system:oauth-token-deleter\" not found, clusterrole.rbac.authorization.k8s.io \"system:build-strategy-docker\" not found, clusterrole.rbac.authorization.k8s.io \"system:discovery\" not found, clusterrole.rbac.authorization.k8s.io \"system:build-strategy-source\" not found, clusterrole.rbac.authorization.k8s.io \"system:build-strategy-jenkinspipeline\" not found, clusterrole.rbac.authorization.k8s.io \"system:basic-user\" not found, clusterrole.rbac.authorization.k8s.io \"cluster-status\" not found, clusterrole.rbac.authorization.k8s.io \"self-access-reviewer\" not found]\nE0320 12:44:51.479586       1 event.go:259] Could not construct reference to: '&v1.Endpoints{TypeMeta:v1.TypeMeta{Kind:\"\", APIVersion:\"\"}, ObjectMeta:v1.ObjectMeta{Name:\"\", GenerateName:\"\", Namespace:\"\", SelfLink:\"\", UID:\"\", ResourceVersion:\"\", Generation:0, CreationTimestamp:v1.Time{Time:time.Time{wall:0x0, ext:0, loc:(*time.Location)(nil)}}, DeletionTimestamp:(*v1.Time)(nil), DeletionGracePeriodSeconds:(*int64)(nil), Labels:map[string]string(nil), Annotations:map[string]string(nil), OwnerReferences:[]v1.OwnerReference(nil), Initializers:(*v1.Initializers)(nil), Finalizers:[]string(nil), ClusterName:\"\"}, Subsets:[]v1.EndpointSubset(nil)}' due to: 'selfLink was empty, can't make reference'. Will not report event: 'Normal' 'LeaderElection' 'ip-10-0-167-200_a29ea6be-4b0d-11e9-8ccb-0e09ebc855f0 stopped leading'\nI0320 12:44:51.479691       1 leaderelection.go:249] failed to renew lease kube-system/kube-scheduler: failed to tryAcquireOrRenew context deadline exceeded\nE0320 12:44:51.479713       1 server.go:207] lost master\nlost lease\n"
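
The last errors in the snippet show the scheduler losing its leader-election lock ("failed to renew lease kube-system/kube-scheduler ... lost master"). One way to inspect the current leader record on a live cluster (untested sketch; assumes this release still stores the scheduler lock as an annotation on the kube-system Endpoints object, as upstream Kubernetes did at the time):

  $ # print the leader-election record for the scheduler lock
  $ oc -n kube-system get endpoints kube-scheduler \
      -o jsonpath='{.metadata.annotations.control-plane\.alpha\.kubernetes\.io/leader}'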

Comment 1 Ben Parees 2019-03-20 20:19:36 UTC
kube-scheduler (the operator reporting the error event) belongs to the Pod team, so I'm sending this there first, but I expect they may pass it to the Master team, since it looks like there were kube-apiserver issues.

Comment 2 W. Trevor King 2019-03-22 05:43:45 UTC
Created attachment 1546773 [details]
Occurrences of this error in CI from 2019-03-19T12:28 to 2019-03-21T20:06 UTC

This has caused 18 of our 861 failures in *-e2e-aws* jobs across the whole CI system over the past 55 hours.  Generated with [1]:

  $ deck-build-log-plot 'clusteroperator/kube-scheduler changed Failing to True.*clusterrole.rbac.authorization.*not found'

[1]: https://github.com/wking/openshift-release/tree/debug-scripts/deck-build-log
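
The script is essentially a regex search over CI build logs; a rough equivalent of the underlying query (sketch only; 'ci-logs/' is a hypothetical local mirror of the downloaded build-log.txt files, not part of the tool above):

  $ # count jobs whose build log matches the failure signature
  $ grep -rlE 'clusteroperator/kube-scheduler changed Failing to True.*clusterrole.rbac.authorization.*not found' \
      ci-logs/ | wc -l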

Comment 4 Seth Jennings 2019-03-27 20:39:49 UTC
Yes, sending to Master to figure out why these ClusterRoles do not exist.
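
A quick way to confirm whether the roles named in the error actually exist on a live cluster (assumes oc with sufficient privileges to read ClusterRoles):

  $ # all of these names appear in the "not found" list above
  $ oc get clusterrole system:kube-scheduler system:volume-scheduler basic-user system:discovery

If they exist on a healthy cluster, the errors point more at API availability during startup than at missing RBAC.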

Comment 5 Seth Jennings 2019-04-01 16:57:12 UTC
*** Bug 1694186 has been marked as a duplicate of this bug. ***

Comment 6 Erica von Buelow 2019-04-01 18:02:10 UTC
I'm not sure this belongs with auth; I don't see anything to indicate an auth issue. The scheduler pod eventually starts up; the error log gets swept up as the last error that occurred, but the failure seems to be that the test timed out. The "clusterrole not found" errors are more likely an API server availability issue than a problem of those ClusterRoles not existing. AWS resource limits, cert rotation timing, and possibly other factors could be contributing to the delays in the test run. I'm not sure who the right owner is for those types of issues.
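
One way to check the API-server-availability theory against a live cluster (sketch; assumes oc access to the affected cluster while the failure is reproducing):

  $ # operator-level view of kube-apiserver health
  $ oc get clusteroperator kube-apiserver
  $ # recent events in the apiserver namespace, newest last
  $ oc -n openshift-kube-apiserver get events --sort-by=.lastTimestamp | tail -n 20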

Comment 7 Michal Fojtik 2019-04-04 17:08:54 UTC
This has caused 18 of our 861 failures; lowering priority, as this may only be a transient failure.