Title: Cluster upgrade in progress for 15+ hours. Unable to apply 4.4.27: the cluster operator openshift-apiserver has not yet successfully rolled out

Description of problem:
A working cluster updated from v4.3.38 to 4.4.27 via the stable-4.x channel left our test-environment cluster in an unsuccessfully-rolled-out state.

Cluster state:
* Update v4.3 -> v4.3.38 worked fine, no issues, on the stable channel via the console; cluster/console running as expected.
* Update v4.3.38 -> 4.4.27 on the stable channel via the UI never finished; 15+ hours later we have lost the console, and the CLI reports "Unable to apply 4.4.27: the cluster operator openshift-apiserver has not yet successfully rolled out".

Environment:
* UPI initial install on bare metal at v4.3.3 was successful; all systems operational.
* Updated from v4.3.3 to v4.3.38 successfully via the console on the stable-4.x channel; all systems operational.
* Continued the update from v4.3.38 to v4.4.27 via the console on the stable-4.x channel; now stuck in limbo.
* Load balancers: LB-(master0, master1, master2), LB-(worker0, worker1, worker2)
* 3 worker and 3 master nodes
* 1 infra node with dual NICs to access both the public and private networks.
* Cluster was working as expected as of v4.3.38.

Version-Release number of selected component (if applicable):
Channel: stable-4.4
Last Completed Version: 4.3.38
Upgrading to: v4.4.27
Update Status: Unable to apply 4.4.27: the cluster operator openshift-apiserver has not yet successfully rolled out

Steps to Reproduce:
1. Have a healthy v4.3.38 cluster.
2. Update from v4.3.38 to v4.4.27 on the stable channel via the UI.
Actual results:
* Cluster stuck: Unable to apply 4.4.27: the cluster operator openshift-apiserver has not yet successfully rolled out
* No console access

[baiesi@laptop1 keys]$ oc get clusterversion
NAME      VERSION   AVAILABLE   PROGRESSING   SINCE   STATUS
version   4.3.38    True        True          157m    Unable to apply 4.4.27: the cluster operator openshift-apiserver has not yet successfully rolled out

Expected results:
A successful upgrade to 4.4.27, allowing us to continue on to install v4.5 and apply system test load.

Additional info:

oc get clusterversion
NAME      VERSION   AVAILABLE   PROGRESSING   SINCE   STATUS
version   4.3.38    True        True          157m    Unable to apply 4.4.27: the cluster operator openshift-apiserver has not yet successfully rolled out

oc get pods -n openshift-apiserver
NAME                        READY   STATUS    RESTARTS   AGE
apiserver-7dbff755d-9w4zh   1/1     Running   0          3h44m
apiserver-7dbff755d-gnxnc   1/1     Running   0          3h44m
apiserver-7dbff755d-r2zhn   1/1     Running   0          3h43m

oc get pods -n openshift-console
NAME                         READY   STATUS    RESTARTS   AGE
console-76b985bc7c-hqcd7     1/1     Running   0          4h
console-76b985bc7c-j2dm4     1/1     Running   1          4h6m
downloads-74f6b6dcb6-wn5p7   1/1     Running   0          4h26m
downloads-74f6b6dcb6-xzxrl   1/1     Running   0          4h20m

oc adm upgrade
info: An upgrade is in progress. Unable to apply 4.4.27: the cluster operator openshift-apiserver has not yet successfully rolled out

oc get nodes
NAME      STATUS   ROLES           AGE     VERSION
master0   Ready    master,worker   7h34m   v1.17.1+45f8ddb
master1   Ready    master,worker   7h34m   v1.17.1+45f8ddb
master2   Ready    master,worker   7h33m   v1.17.1+45f8ddb
worker0   Ready    worker          7h34m   v1.17.1+45f8ddb
worker1   Ready    worker          7h34m   v1.17.1+45f8ddb
worker2   Ready    worker          7h34m   v1.17.1+45f8ddb

oc get co
NAME                  VERSION   AVAILABLE   PROGRESSING   DEGRADED   SINCE
authentication        4.4.27    True        True          True       7h16m
console               4.4.27    True        False         True       3h50m
insights              4.4.27    True        False         True       7h32m
monitoring            4.4.27    False       False         True       11m
openshift-apiserver   4.4.27    False       False         False      3h35m

oc adm must-gather
[root@dell-per730-09 aiesi]# oc adm must-gather
[must-gather ] OUT Using must-gather plugin-in image: quay.io/openshift-release-dev/ocp-v4.0-art-dev@sha256:d7c882054a4528eda72e69a7988c5931b5a1643913b11bfd2575a78a8620808f
[must-gather ] OUT namespace/openshift-must-gather-nm7mh created
[must-gather ] OUT clusterrolebinding.rbac.authorization.k8s.io/must-gather-b9j6h created
[must-gather ] OUT pod for plug-in image quay.io/openshift-release-dev/ocp-v4.0-art-dev@sha256:d7c882054a4528eda72e69a7988c5931b5a1643913b11bfd2575a78a8620808f created
[must-gather-jjsrn] POD Error from server (ServiceUnavailable): the server is currently unable to handle the request (get projects.project.openshift.io openshift-must-gather-nm7mh)
[must-gather-jjsrn] POD Error from server (ServiceUnavailable): the server is currently unable to handle the request (get buildconfigs.build.openshift.io)
[must-gather-jjsrn] POD Error from server (ServiceUnavailable): the server is currently unable to handle the request (get imagestreams.image.openshift.io)
[must-gather-jjsrn] POD Error from server (ServiceUnavailable): the server is currently unable to handle the request (get builds.build.openshift.io)
[must-gather-jjsrn] POD Wrote inspect data to must-gather.
[must-gather-jjsrn] POD Gathering data for ns/openshift-cluster-version...
[must-gather-jjsrn] POD Wrote inspect data to must-gather.
[must-gather-jjsrn] POD error: errors ocurred while gathering data:
[must-gather-jjsrn] POD the server doesn't have a resource type "buildconfigs"
[must-gather-jjsrn] POD Gathering data for ns/openshift-config...
[must-gather-jjsrn] POD Gathering data for ns/openshift-config-managed...
[must-gather-jjsrn] POD Gathering data for ns/openshift-authentication...
[must-gather-jjsrn] POD Gathering data for ns/openshift-authentication-operator...
[must-gather-jjsrn] POD Gathering data for ns/openshift-ingress...
[must-gather-jjsrn] POD Gathering data for ns/openshift-cloud-credential-operator...
[must-gather-jjsrn] OUT gather logs unavailable: unexpected EOF
[must-gather-jjsrn] OUT waiting for gather to complete
[must-gather-jjsrn] OUT gather never finished: timed out waiting for the condition
[must-gather ] OUT clusterrolebinding.rbac.authorization.k8s.io/must-gather-b9j6h deleted
[must-gather ] OUT namespace/openshift-must-gather-nm7mh deleted
error: gather never finished for pod must-gather-jjsrn: timed out waiting for the condition

Other:
The system is available for a developer to check out before we attempt to recover the cluster to a usable state. Just let us know ASAP. :)
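Since must-gather itself kept timing out, the quickest triage is to read the ClusterOperator conditions directly. Below is a minimal sketch (our own helper, not part of any OpenShift tooling) that flags unhealthy operators from `oc get clusteroperators -o json` output; the embedded sample merely mirrors two of the operators shown above.

```python
import json

# Sample mimicking `oc get clusteroperators -o json` for two of the
# operators in this report; on a live cluster, load the real output instead.
SAMPLE = json.loads("""
{"items": [
  {"metadata": {"name": "authentication"},
   "status": {"conditions": [
     {"type": "Available",   "status": "True"},
     {"type": "Progressing", "status": "True"},
     {"type": "Degraded",    "status": "True"}]}},
  {"metadata": {"name": "openshift-apiserver"},
   "status": {"conditions": [
     {"type": "Available",   "status": "False"},
     {"type": "Progressing", "status": "False"},
     {"type": "Degraded",    "status": "False"}]}}
]}
""")

def unhealthy_operators(co_list):
    """Return names of operators that are not Available or are Degraded."""
    bad = []
    for op in co_list["items"]:
        conds = {c["type"]: c["status"] for c in op["status"]["conditions"]}
        if conds.get("Available") != "True" or conds.get("Degraded") == "True":
            bad.append(op["metadata"]["name"])
    return bad

print(unhealthy_operators(SAMPLE))  # ['authentication', 'openshift-apiserver']
```

Note that openshift-apiserver here is Available=False without being Degraded, which is exactly the state that leaves the ClusterVersion operator waiting indefinitely.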
Created attachment 1722909 [details] Added apiserver pod log errors; may be of interest.
Able to collect a must-gather with: oc adm must-gather -v=9 --keep
Link: http://10.8.32.38/str/ocpdebug/must-gather_v4.4_update_failure.tar.gz
Update: recovered.
After over 15 hours in this stuck update state, we rebooted master node0. When master node0 came back up, the update continued.

Before reboot:
[root@dell-per730-09 aiesi]# oc get clusterversion
NAME      VERSION   AVAILABLE   PROGRESSING   SINCE   STATUS
version   4.3.38    True        True          20h     Unable to apply 4.4.27: the cluster operator openshift-apiserver has not yet successfully rolled out

After reboot:
[root@dell-per730-09 aiesi]# oc get clusterversion
NAME      VERSION   AVAILABLE   PROGRESSING   SINCE   STATUS
version   4.3.38    True        True          20h     Working towards 4.4.27: 24% complete

[root@dell-per730-09 aiesi]# oc get clusterversion
NAME      VERSION   AVAILABLE   PROGRESSING   SINCE   STATUS
version   4.3.38    True        True          20h     Working towards 4.4.27: 77% complete

[root@dell-per730-09 aiesi]# oc get clusterversion
NAME      VERSION   AVAILABLE   PROGRESSING   SINCE    STATUS
version   4.4.27    True        False         9m2s     Cluster version is 4.4.27

The system console is functional and the cluster looks healthy. We are planning to update to the latest release, 4.5.
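When scripting a watch on an upgrade like this, the ClusterVersion status message can be parsed for progress. A small sketch, assuming only the message formats seen in this report (the helper name is ours, not an `oc` feature):

```python
import re

def upgrade_progress(message):
    """Extract (target_version, percent) from a ClusterVersion status
    message such as the STATUS column of `oc get clusterversion`;
    return None when no percentage is being reported."""
    m = re.search(r"Working towards (\S+?):\s+(\d+)% complete", message)
    if m:
        return m.group(1), int(m.group(2))
    return None

print(upgrade_progress("Working towards 4.4.27: 24% complete"))  # ('4.4.27', 24)
print(upgrade_progress("Unable to apply 4.4.27: the cluster operator "
                       "openshift-apiserver has not yet successfully rolled out"))  # None
```

A message that parses to None for many hours, as happened here, is the signal that the rollout is wedged rather than merely slow.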
openshift-apiserver is not becoming ready, although the network and machines look fine. Reassigning.
This bug hasn't had any activity in the last 30 days. Maybe the problem got resolved, was a duplicate of something else, or became less pressing for some reason - or maybe it's still relevant but just hasn't been looked at yet. As such, we're marking this bug as "LifecycleStale" and decreasing the severity/priority. If you have further information on the current state of the bug, please update it, otherwise this bug can be closed in about 7 days. The information can be, for example, that the problem still occurs, that you still want the feature, that more information is needed, or that the bug is (for whatever reason) no longer relevant. Additionally, you can add LifecycleFrozen into Keywords if you think this bug should never be marked as stale. Please consult with bug assignee before you do that.
Setting blocker- as priority = low.
The LifecycleStale keyword was removed because the bug got commented on recently. The bug assignee was notified.