Bug 1889746 - Cluster Upgrade is in progress 15 hours. Unable to apply 4.4.27: the cluster operator openshift-apiserver has not yet successfully rolled out [NEEDINFO]
Keywords:
Status: NEW
Alias: None
Product: OpenShift Container Platform
Classification: Red Hat
Component: openshift-apiserver
Version: 4.4
Hardware: Unspecified
OS: Unspecified
Priority: low
Severity: high
Target Milestone: ---
Target Release: 4.7.0
Assignee: Luis Sanchez
QA Contact: Xingxing Xia
URL:
Whiteboard: LifecycleStale
Depends On:
Blocks: 1869362
Reported: 2020-10-20 13:45 UTC by baiesi
Modified: 2020-11-20 14:12 UTC (History)

Fixed In Version: milei@redhat.com , annair@redhat.com
Doc Type: If docs needed, set a value
Doc Text:
Clone Of:
Environment:
Last Closed:
Target Upstream Version:
mfojtik: needinfo?


Attachments
added pod apiserver log errors, maybe of interest... (21.86 KB, text/plain)
2020-10-20 14:08 UTC, baiesi

Description baiesi 2020-10-20 13:45:27 UTC
Title
Cluster Upgrade is in progress 15 hours. Unable to apply 4.4.27: the cluster operator openshift-apiserver has not yet successfully rolled out

Description of problem:
Upgrading a working cluster from v4.3.38 to v4.4.27 via the stable-4.x channel left our test environment cluster in an unsuccessful rollout state.

cluster state:
* Update v4.3 -> v4.3.38 worked fine with no issues on the stable channel via the console; cluster and console running as expected
* Update v4.3.38 -> v4.4.27 on the stable channel via the UI has still not finished 15+ hours later; console access was lost, and the CLI reports "Unable to apply 4.4.27: the cluster operator openshift-apiserver has not yet successfully rolled out"

Environment:
* Initial UPI install on bare metal at v4.3.3 was successful, all systems operational
* Updated from v4.3.3 to v4.3.38 successfully via the console with the stable-4.x channel, all systems operational
* Continued the update from v4.3.38 to v4.4.27 via the console with the stable-4.x channel; now stuck in limbo
* load balancers: LB-(master0, master1, master2), LB-(worker0, worker1, worker2)
* 3 workers and 3 master nodes
* 1 infra node has dual NICs to access both the public and private networks
* Cluster working as expected as of v4.3.38

Version-Release number of selected component (if applicable):
Channel: stable-4.4
Last Completed Version:  4.3.38 - Upgrading to: v4.4.27
Update Status: Unable to apply 4.4.27: the cluster operator openshift-apiserver has not yet successfully rolled out

How to reproduce:
Steps to Reproduce:
1. Have a healthy v4.3.38 cluster
2. Update from v4.3.38 to v4.4.27 on the stable channel via the UI.

Actual results:
* Cluster stuck: Unable to apply 4.4.27: the cluster operator openshift-apiserver has not yet successfully rolled out
* No console access

[baiesi@laptop1 keys]$ oc get clusterversion
NAME      VERSION   AVAILABLE   PROGRESSING   SINCE   STATUS
version   4.3.38    True        True          157m    Unable to apply 4.4.27: the cluster operator openshift-apiserver has not yet successfully rolled out

Expected results:
A successful upgrade to 4.4.27, so that we can continue on to install v4.5 and apply system test load

Additional info:
oc get clusterversion
NAME      VERSION   AVAILABLE   PROGRESSING   SINCE   STATUS
version   4.3.38    True        True          157m    Unable to apply 4.4.27: the cluster operator openshift-apiserver has not yet successfully rolled out

oc get pods -n openshift-apiserver
NAME                        READY   STATUS    RESTARTS   AGE
apiserver-7dbff755d-9w4zh   1/1     Running   0          3h44m
apiserver-7dbff755d-gnxnc   1/1     Running   0          3h44m
apiserver-7dbff755d-r2zhn   1/1     Running   0          3h43m

oc get pods -n openshift-console
NAME                         READY   STATUS    RESTARTS   AGE
console-76b985bc7c-hqcd7     1/1     Running   0          4h
console-76b985bc7c-j2dm4     1/1     Running   1          4h6m
downloads-74f6b6dcb6-wn5p7   1/1     Running   0          4h26m
downloads-74f6b6dcb6-xzxrl   1/1     Running   0          4h20m

oc adm upgrade
info: An upgrade is in progress. Unable to apply 4.4.27: the cluster operator openshift-apiserver has not yet successfully rolled out

oc get nodes
NAME      STATUS   ROLES           AGE     VERSION
master0   Ready    master,worker   7h34m   v1.17.1+45f8ddb
master1   Ready    master,worker   7h34m   v1.17.1+45f8ddb
master2   Ready    master,worker   7h33m   v1.17.1+45f8ddb
worker0   Ready    worker          7h34m   v1.17.1+45f8ddb
worker1   Ready    worker          7h34m   v1.17.1+45f8ddb
worker2   Ready    worker          7h34m   v1.17.1+45f8ddb

oc get co
NAME                  VERSION   AVAILABLE   PROGRESSING   DEGRADED   SINCE
authentication        4.4.27    True        True          True       7h16m
console               4.4.27    True        False         True       3h50m
insights              4.4.27    True        False         True       7h32m
monitoring            4.4.27    False       False         True       11m
openshift-apiserver   4.4.27    False       False         False      3h35m
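
The operator table above can be reduced to just the problem rows. The following is a minimal sketch, not something run as part of this report: the function name unhealthy_operators is hypothetical, and it assumes the default column order shown above (NAME, VERSION, AVAILABLE, PROGRESSING, DEGRADED, SINCE) with every column populated.

```shell
# Reads `oc get co` table output on stdin and prints one line per operator
# that is not Available, or that is Available but Degraded.
# Assumption: columns are NAME VERSION AVAILABLE PROGRESSING DEGRADED SINCE
# and none of them is empty (an empty VERSION would shift the fields).
unhealthy_operators() {
  awk 'NR > 1 && NF >= 5 {
    if ($3 != "True")      print $1, "unavailable"
    else if ($5 == "True") print $1, "degraded"
  }'
}

# Hypothetical usage against a live cluster:
#   oc get co | unhealthy_operators
```

Run against the table above, this would flag monitoring and openshift-apiserver as unavailable and the other three listed operators as degraded.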

oc adm must-gather
[root@dell-per730-09 aiesi]# oc adm must-gather
[must-gather      ] OUT Using must-gather plugin-in image: quay.io/openshift-release-dev/ocp-v4.0-art-dev@sha256:d7c882054a4528eda72e69a7988c5931b5a1643913b11bfd2575a78a8620808f
[must-gather      ] OUT namespace/openshift-must-gather-nm7mh created
[must-gather      ] OUT clusterrolebinding.rbac.authorization.k8s.io/must-gather-b9j6h created
[must-gather      ] OUT pod for plug-in image quay.io/openshift-release-dev/ocp-v4.0-art-dev@sha256:d7c882054a4528eda72e69a7988c5931b5a1643913b11bfd2575a78a8620808f created
[must-gather-jjsrn] POD Error from server (ServiceUnavailable): the server is currently unable to handle the request (get projects.project.openshift.io openshift-must-gather-nm7mh)
[must-gather-jjsrn] POD Error from server (ServiceUnavailable): the server is currently unable to handle the request (get buildconfigs.build.openshift.io)
[must-gather-jjsrn] POD Error from server (ServiceUnavailable): the server is currently unable to handle the request (get imagestreams.image.openshift.io)
[must-gather-jjsrn] POD Error from server (ServiceUnavailable): the server is currently unable to handle the request (get builds.build.openshift.io)
[must-gather-jjsrn] POD Wrote inspect data to must-gather.
[must-gather-jjsrn] POD Gathering data for ns/openshift-cluster-version...
[must-gather-jjsrn] POD Wrote inspect data to must-gather.
[must-gather-jjsrn] POD error: errors ocurred while gathering data:
[must-gather-jjsrn] POD     the server doesn't have a resource type "buildconfigs"
[must-gather-jjsrn] POD Gathering data for ns/openshift-config...
[must-gather-jjsrn] POD Gathering data for ns/openshift-config-managed...
[must-gather-jjsrn] POD Gathering data for ns/openshift-authentication...
[must-gather-jjsrn] POD Gathering data for ns/openshift-authentication-operator...
[must-gather-jjsrn] POD Gathering data for ns/openshift-ingress...
[must-gather-jjsrn] POD Gathering data for ns/openshift-cloud-credential-operator...
[must-gather-jjsrn] OUT gather logs unavailable: unexpected EOF
[must-gather-jjsrn] OUT waiting for gather to complete
[must-gather-jjsrn] OUT gather never finished: timed out waiting for the condition
[must-gather      ] OUT clusterrolebinding.rbac.authorization.k8s.io/must-gather-b9j6h deleted
[must-gather      ] OUT namespace/openshift-must-gather-nm7mh deleted
error: gather never finished for pod must-gather-jjsrn: timed out waiting for the condition
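
The ServiceUnavailable errors above are all for resources in OpenShift API groups (project.openshift.io, build.openshift.io, image.openshift.io), which are served through API aggregation by openshift-apiserver. One way to confirm which aggregated APIs are down might be to list the APIService objects that are not available. This is a sketch only: the function name unavailable_apiservices is hypothetical, and it assumes the usual `oc get apiservices` column order of NAME, SERVICE, AVAILABLE, AGE.

```shell
# Reads `oc get apiservices` table output on stdin and prints the names of
# APIService objects whose AVAILABLE column is not "True".
# Assumption: columns are NAME SERVICE AVAILABLE AGE. An unavailable entry
# typically shows "False (Reason)", so only the first word is compared.
unavailable_apiservices() {
  awk 'NR > 1 && $3 != "True" { print $1 }'
}

# Hypothetical usage against a live cluster:
#   oc get apiservices | unavailable_apiservices
```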

Other:
The system is available for a developer to check out before we attempt to recover the cluster to a usable state. Just let us know ASAP. :)

Comment 1 baiesi 2020-10-20 14:08:59 UTC
Created attachment 1722909 [details]
added pod apiserver log errors, maybe of interest...

Comment 2 baiesi 2020-10-20 17:15:54 UTC
Able to collect must-gather:
oc adm must-gather -v=9 --keep

link:
http://10.8.32.38/str/ocpdebug/must-gather_v4.4_update_failure.tar.gz

Comment 3 baiesi 2020-10-20 17:47:38 UTC
Recovered Update:
After more than 15 hours in this stuck update state, we rebooted master node0. When master node0 came back up, the update continued.

before reboot:
[root@dell-per730-09 aiesi]# oc get clusterversion
NAME      VERSION   AVAILABLE   PROGRESSING   SINCE   STATUS
version   4.3.38    True        True          20h     Unable to apply 4.4.27: the cluster operator openshift-apiserver has not yet successfully rolled out

-------

After reboot:
[root@dell-per730-09 aiesi]# oc get clusterversion
NAME      VERSION   AVAILABLE   PROGRESSING   SINCE   STATUS
version   4.3.38    True        True          20h     Working towards 4.4.27: 24% complete

[root@dell-per730-09 aiesi]# oc get clusterversion
NAME      VERSION   AVAILABLE   PROGRESSING   SINCE   STATUS
version   4.3.38    True        True          20h     Working towards 4.4.27: 77% complete

[root@dell-per730-09 aiesi]# oc get clusterversion
NAME      VERSION   AVAILABLE   PROGRESSING   SINCE   STATUS
version   4.4.27    True        False         9m2s    Cluster version is 4.4.27
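
The repeated `oc get clusterversion` checks above could be scripted. The sketch below is an illustration rather than anything used during this recovery: the function name upgrade_state and the one-word states it prints are hypothetical, and it assumes the column order shown above (NAME, VERSION, AVAILABLE, PROGRESSING, SINCE, STATUS) with VERSION populated.

```shell
# Reads `oc get clusterversion` output on stdin and prints a short state:
#   done      PROGRESSING is False
#   blocked   STATUS contains "Unable to apply"
#   NN%       STATUS contains "NN% complete"
#   unknown   none of the above
# Assumption: the row is named "version" and PROGRESSING is the 4th field.
upgrade_state() {
  awk 'NR > 1 && $1 == "version" {
    if ($4 == "False")          { print "done";    exit }
    if ($0 ~ /Unable to apply/) { print "blocked"; exit }
    if (match($0, /[0-9]+% complete/)) {
      pct = substr($0, RSTART, RLENGTH)
      sub(/ complete/, "", pct)
      print pct
      exit
    }
    print "unknown"
    exit
  }'
}

# Hypothetical polling loop against a live cluster:
#   while :; do oc get clusterversion | upgrade_state; sleep 60; done
```

A "blocked" state that persists for hours, as in this report, would be the cue to start collecting operator status and must-gather data.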

System console is functional
System cluster looks healthy

We are planning to update to the latest release 4.5....

Comment 5 Vadim Rutkovsky 2020-10-21 13:43:16 UTC
openshift-apiserver not becoming ready, although network and machines are looking fine. Reassigning

Comment 6 Michal Fojtik 2020-11-20 14:12:10 UTC
This bug hasn't had any activity in the last 30 days. Maybe the problem got resolved, was a duplicate of something else, or became less pressing for some reason - or maybe it's still relevant but just hasn't been looked at yet. As such, we're marking this bug as "LifecycleStale" and decreasing the severity/priority. If you have further information on the current state of the bug, please update it, otherwise this bug can be closed in about 7 days. The information can be, for example, that the problem still occurs, that you still want the feature, that more information is needed, or that the bug is (for whatever reason) no longer relevant. Additionally, you can add LifecycleFrozen into Keywords if you think this bug should never be marked as stale. Please consult with bug assignee before you do that.

