Bug 1722835

Summary: Kube-scheduler broken on upgrade to 4.1.2
Product: OpenShift Container Platform
Reporter: Naveen Malik <nmalik>
Component: Master
Assignee: ravig <rgudimet>
Status: CLOSED DUPLICATE
QA Contact: Xingxing Xia <xxia>
Severity: urgent
Priority: unspecified
Version: 4.1.z
CC: aos-bugs, deads, eparis, jeder, jokerman, mfojtik, mmccomas, rgudimet
Target Milestone: ---   
Target Release: ---   
Hardware: Unspecified   
OS: Unspecified   
Last Closed: 2019-06-24 17:28:04 UTC
Type: Bug
Attachments:
clusterversion
rolebinding
role
who-can
pod logs

Description Naveen Malik 2019-06-21 12:41:32 UTC
Description of problem:
The cluster was upgraded from 4.1.0-rc.7 through to 4.1.2.
The cluster reports the upgrade is progressing with status "Unable to apply 4.1.2: the cluster operator kube-scheduler is degraded".
Review of the kube-scheduler pods indicates RBAC issues, though the user in question does appear to have the permissions that are reported as missing.

This is on a long-running cluster that was provisioned on 4.1.0-rc.7 on 2019-05-30. We did not observe any issues with another cluster managed in the same way and upgraded from 4.1.0-rc.7 through to 4.1.2 on the same schedule.
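
For reference, the reported upgrade status can be read from the ClusterVersion object (a sketch; the exact condition type carrying the message may differ slightly by release):

  # the "Unable to apply 4.1.2: ..." message shows up under status.conditions
  oc get clusterversion version -o yaml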

Version-Release number of selected component (if applicable):
OCP 4.1.2

How reproducible:
One cluster upgrade

Steps to Reproduce:
1. Provision OCP 4.1.0-rc.7
2. Upgrade to 4.1.0-rc.9
3. Upgrade to 4.1.0
4. Upgrade to 4.1.2
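
For reference, a minimal sketch of this upgrade sequence using the standard `oc` client (assumes the cluster's update channel serves these versions):

  # check the current version and the available updates
  oc get clusterversion
  oc adm upgrade
  # step through the releases in order
  oc adm upgrade --to 4.1.0-rc.9
  oc adm upgrade --to 4.1.0
  oc adm upgrade --to 4.1.2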

Actual results:
Unable to complete upgrade to 4.1.2


Expected results:
Kube-scheduler in a good state after the upgrade.


Additional info:
See attachments for logs and CRs. Happy to provide more as needed.

Comment 1 Naveen Malik 2019-06-21 12:42:07 UTC
Created attachment 1583178 [details]
clusterversion

Comment 2 Naveen Malik 2019-06-21 12:42:26 UTC
Created attachment 1583179 [details]
rolebinding

Comment 3 Naveen Malik 2019-06-21 12:42:41 UTC
Created attachment 1583180 [details]
role

Comment 4 Naveen Malik 2019-06-21 12:43:10 UTC
Created attachment 1583181 [details]
who-can

Comment 5 Naveen Malik 2019-06-21 12:44:00 UTC
Created attachment 1583182 [details]
pod logs

I picked the configmap access in the openshift-kube-scheduler namespace to dig into, hence the other RBAC-related attachments.
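
A check along these lines can be reproduced with `oc adm policy who-can` (a sketch; the verbs and resources to test depend on what the scheduler logs report as denied):

  # who is allowed the access the scheduler logs report as missing?
  oc adm policy who-can get configmaps -n openshift-kube-scheduler
  oc adm policy who-can watch configmaps -n openshift-kube-scheduler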

Comment 6 David Eads 2019-06-24 13:04:13 UTC
Can you provide the output archive from `oc adm must-gather`?  It will include additional operator-related information for us to debug.
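
A minimal sketch of collecting and packaging that output (the --dest-dir value below is arbitrary):

  # gather operator state, pod logs, and related resources
  oc adm must-gather --dest-dir=./must-gather
  # archive the directory for attaching to the bug
  tar -czf must-gather.tar.gz ./must-gather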

Comment 7 ravig 2019-06-24 14:24:31 UTC
I think the underlying issue here is the kube-scheduler not being able to communicate with the api-server.

`Failed to watch *v1.PersistentVolumeClaim: Get https://localhost:6443/api/v1/persistentvolumeclaims?resourceVersion=9463143&timeout=8m1s&timeoutSeconds=481&watch=true: dial tcp [::1]:6443: connect: connection refused`

Do you have logs from the other scheduler pods on the remaining 2 master nodes? And yes, you can get that information from `oc adm must-gather`, as David mentioned.
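
A sketch of pulling those logs directly, if a full must-gather is not practical (pod names follow the master node names, so they vary per cluster; <pod-name> is a placeholder):

  # list the scheduler pods, one per master node
  oc get pods -n openshift-kube-scheduler -o wide
  # fetch logs from each pod listed above
  oc logs -n openshift-kube-scheduler <pod-name>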

Comment 9 David Eads 2019-06-24 17:28:04 UTC
Thanks for the update.  Based on this, and on the clusteroperator/kube-scheduler and kubescheduler.operator.openshift.io/cluster objects, we can see it's a dupe of https://bugzilla.redhat.com/show_bug.cgi?id=1721566.
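
Those two objects can be inspected directly, e.g.:

  # overall operator status and degraded conditions
  oc get clusteroperator kube-scheduler -o yaml
  # the operator's own config/status resource
  oc get kubescheduler.operator.openshift.io cluster -o yaml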

*** This bug has been marked as a duplicate of bug 1721566 ***