Bug 1722835

Summary: Kube-scheduler broken on upgrade to 4.1.2
Product: OpenShift Container Platform
Reporter: Naveen Malik <nmalik>
Component: Master
Assignee: ravig <rgudimet>
Status: CLOSED DUPLICATE
QA Contact: Xingxing Xia <xxia>
Severity: urgent
Priority: unspecified
Version: 4.1.z
CC: aos-bugs, deads, eparis, jeder, jokerman, mfojtik, mmccomas, rgudimet
Target Milestone: ---   
Target Release: ---   
Hardware: Unspecified   
OS: Unspecified   
Last Closed: 2019-06-24 17:28:04 UTC
Type: Bug
Attachments:
clusterversion
rolebinding
role
who-can
pod logs

Description Naveen Malik 2019-06-21 12:41:32 UTC
Description of problem:
The cluster was upgraded from 4.1.0-rc.7 through to 4.1.2.
The cluster reports the upgrade is progressing with status "Unable to apply 4.1.2: the cluster operator kube-scheduler is degraded".
Review of the kube-scheduler pods indicates RBAC issues, though the user in question does appear to have the permissions that are reported as missing.

This is on a long-running cluster that was provisioned on 4.1.0-rc.7 on 2019-05-30. We did not observe any issues with another cluster managed in the same way and upgraded from 4.1.0-rc.7 through to 4.1.2 on the same schedule.
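
For reference, the reported upgrade status can be read from the ClusterVersion object (a sketch; the exact condition type carrying the message may differ slightly by release):

  # the "Unable to apply 4.1.2: ..." message shows up under status.conditions
  oc get clusterversion version -o yaml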

Version-Release number of selected component (if applicable):
OCP 4.1.2

How reproducible:
One cluster upgrade

Steps to Reproduce:
1. Provision OCP 4.1.0-rc.7
2. Upgrade to 4.1.0-rc.9
3. Upgrade to 4.1.0
4. Upgrade to 4.1.2
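
For reference, a minimal sketch of this upgrade sequence using the standard `oc` client (assumes the cluster's update channel serves these versions):

  # check the current version and the available updates
  oc get clusterversion
  oc adm upgrade
  # step through the releases in order
  oc adm upgrade --to 4.1.0-rc.9
  oc adm upgrade --to 4.1.0
  oc adm upgrade --to 4.1.2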

Actual results:
Unable to complete upgrade to 4.1.2


Expected results:
Kube-scheduler in a good state after the upgrade.


Additional info:
See attachments for logs and CRs. Happy to provide more as needed.

Comment 1 Naveen Malik 2019-06-21 12:42:07 UTC
Created attachment 1583178 [details]
clusterversion

Comment 2 Naveen Malik 2019-06-21 12:42:26 UTC
Created attachment 1583179 [details]
rolebinding

Comment 3 Naveen Malik 2019-06-21 12:42:41 UTC
Created attachment 1583180 [details]
role

Comment 4 Naveen Malik 2019-06-21 12:43:10 UTC
Created attachment 1583181 [details]
who-can

Comment 5 Naveen Malik 2019-06-21 12:44:00 UTC
Created attachment 1583182 [details]
pod logs

I picked the configmap access in the openshift-kube-scheduler namespace to dig into, hence the other RBAC-related attachments.
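
A check along these lines can be reproduced with `oc adm policy who-can` (a sketch; the verbs and resources to test depend on what the scheduler logs report as denied):

  # who is allowed the access the scheduler logs report as missing?
  oc adm policy who-can get configmaps -n openshift-kube-scheduler
  oc adm policy who-can watch configmaps -n openshift-kube-scheduler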

Comment 6 David Eads 2019-06-24 13:04:13 UTC
Can you provide the output archive from `oc adm must-gather`?  It will include additional operator-related information for us to debug.
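
A minimal sketch of collecting and packaging that output (the --dest-dir value below is arbitrary):

  # gather operator state, pod logs, and related resources
  oc adm must-gather --dest-dir=./must-gather
  # archive the directory for attaching to the bug
  tar -czf must-gather.tar.gz ./must-gather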

Comment 7 ravig 2019-06-24 14:24:31 UTC
I think the underlying issue here is the kube-scheduler not being able to communicate with the api-server.

`Failed to watch *v1.PersistentVolumeClaim: Get https://localhost:6443/api/v1/persistentvolumeclaims?resourceVersion=9463143&timeout=8m1s&timeoutSeconds=481&watch=true: dial tcp [::1]:6443: connect: connection refused`

Do you have logs from the other scheduler pods on the remaining 2 master nodes? And yes, you can get that information from `oc adm must-gather`, as David mentioned.
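
A sketch of pulling those logs directly, if a full must-gather is not practical (pod names follow the master node names, so they vary per cluster; <pod-name> is a placeholder):

  # list the scheduler pods, one per master node
  oc get pods -n openshift-kube-scheduler -o wide
  # fetch logs from each pod listed above
  oc logs -n openshift-kube-scheduler <pod-name>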

Comment 9 David Eads 2019-06-24 17:28:04 UTC
Thanks for the update.  Based on this, and on the clusteroperator/kube-scheduler and kubescheduler.operator.openshift.io/cluster objects, we can see it's a dupe of https://bugzilla.redhat.com/show_bug.cgi?id=1721566.
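
Those two objects can be inspected directly, e.g.:

  # overall operator status and degraded conditions
  oc get clusteroperator kube-scheduler -o yaml
  # the operator's own config/status resource
  oc get kubescheduler.operator.openshift.io cluster -o yaml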

*** This bug has been marked as a duplicate of bug 1721566 ***