Description of problem:
Upgrading a cluster with 2K projects from 4.2.9 to 4.3.0-0.nightly-2019-12-05-073829 failed with monitoring hung for an hour:

'Failed to rollout the stack. Error: running task Updating Thanos Querier failed: waiting for Thanos Querier Route to become ready failed: waiting for RouteReady of thanos-querier: no status available for thanos-querier'
reason: UpdatingThanosQuerierFailed

[root@ip-10-0-12-153 must-gather]# oc get clusterversion
NAME      VERSION   AVAILABLE   PROGRESSING   SINCE   STATUS
version   4.2.9     True        True          69m     Unable to apply 4.3.0-0.nightly-2019-12-05-073829: the cluster operator monitoring has not yet successfully rolled out

[root@ip-10-0-12-153 must-gather]# oc get clusteroperator monitoring
NAME         VERSION   AVAILABLE   PROGRESSING   DEGRADED   SINCE
monitoring   4.2.9     False       True          True       57m

oc get clusteroperator monitoring -o yaml:

apiVersion: config.openshift.io/v1
kind: ClusterOperator
metadata:
  creationTimestamp: "2019-12-05T16:17:46Z"
  generation: 1
  name: monitoring
  resourceVersion: "472659"
  selfLink: /apis/config.openshift.io/v1/clusteroperators/monitoring
  uid: c5e018c9-177a-11ea-b08d-06d8f7691e62
spec: {}
status:
  conditions:
  - lastTransitionTime: "2019-12-05T20:59:25Z"
    message: Rolling out the stack.
    reason: RollOutInProgress
    status: "True"
    type: Progressing
  - lastTransitionTime: "2019-12-05T20:19:21Z"
    message: 'Failed to rollout the stack. Error: running task Updating Thanos Querier failed: waiting for Thanos Querier Route to become ready failed: waiting for RouteReady of thanos-querier: no status available for thanos-querier'
    reason: UpdatingThanosQuerierFailed
    status: "True"
    type: Degraded
  - lastTransitionTime: "2019-12-05T20:59:25Z"
    message: Rollout of the monitoring stack is in progress. Please wait until it finishes.
    reason: RollOutInProgress
    status: "True"
    type: Upgradeable
  - lastTransitionTime: "2019-12-05T20:14:21Z"
    status: "False"
    type: Available
  extension: null
  relatedObjects:
  - group: ""
    name: openshift-monitoring
    resource: namespaces
  - group: ""
    name: openshift-monitoring
    resource: all
  - group: monitoring.coreos.com
    name: ""
    resource: servicemonitors
  - group: monitoring.coreos.com
    name: ""
    resource: prometheusrules
  - group: monitoring.coreos.com
    name: ""
    resource: alertmanagers
  - group: monitoring.coreos.com
    name: ""
    resource: prometheuses
  versions:
  - name: operator
    version: 4.2.9

Version-Release number of selected component (if applicable):

How reproducible:
4.2.9 to 4.3.0-0.nightly-2019-12-05-073829

Steps to Reproduce:
1. Installed a 4.2.9 cluster
2. Ran the QE cluster-loader tool to create the following projects:

projects:
  - num: 2000
    basename: svt-1-
    templates:
      - num: 6
        file: ./content/build-template.json
      - num: 10
        file: ./content/image-stream-template.json
      - num: 2
        file: ./content/deployment-config-0rep-pause-template.json
        parameters:
          - ENV_VALUE: "asodfn209e8j0eij0emc2oed2ed2ed2e2easodfn209e8j0eij0emc2oed2ed2ed2e2easodfn209e8j0eij0emc2oed2ed2ed2e2easodfn209e8j0eij0emc2oed2ed2ed2e2easodfn209e8j0eij0emc2oed2ed2ed2e2easodfn209e8j0eij0emc2oed2ed2ed2e2easodfn209e8j0eij0emc2oed2ed2ed2e2easodfn209e8j0eij12"
      - num: 40
        file: ./content/ssh-secret-template.json
      - num: 5
        file: ./content/route-template.json
      - num: 20
        file: ./content/configmap-template.json
      # rcs and services are implemented in deployments.

quotas:
  - name: default

3. Attempted upgrade to 4.3.0-0.nightly-2019-12-05-073829

Actual results:
Upgrade stalled on monitoring. I will include the location of the must-gather in a private comment.
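In case it helps with triage, here are a couple of commands that should show what the operator is stuck waiting on (a diagnostic sketch, assuming the default openshift-monitoring namespace; the jsonpath expressions are just one way to pull out the relevant fields):

# "no status available" suggests the router never wrote status.ingress for the
# route; this prints whatever status is there (empty output = no status yet):
oc -n openshift-monitoring get route thanos-querier -o jsonpath='{.status.ingress}'

# Watch the monitoring ClusterOperator conditions while the operator retries:
oc get clusteroperator monitoring -o jsonpath='{range .status.conditions[*]}{.type}{"\t"}{.status}{"\t"}{.message}{"\n"}{end}'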
I gave this some more time, but it remained wedged:

NAME                                      VERSION                             AVAILABLE   PROGRESSING   DEGRADED   SINCE
authentication                            4.3.0-0.nightly-2019-12-05-073829   True        False         False      8h
cloud-credential                          4.3.0-0.nightly-2019-12-05-073829   True        False         False      8h
cluster-autoscaler                        4.3.0-0.nightly-2019-12-05-073829   True        False         False      8h
console                                   4.3.0-0.nightly-2019-12-05-073829   True        False         False      4h27m
dns                                       4.2.9                               True        False         False      8h
image-registry                            4.3.0-0.nightly-2019-12-05-073829   True        False         False      8h
ingress                                   4.3.0-0.nightly-2019-12-05-073829   True        False         False      4h30m
insights                                  4.3.0-0.nightly-2019-12-05-073829   True        False         False      8h
kube-apiserver                            4.3.0-0.nightly-2019-12-05-073829   True        False         False      8h
kube-controller-manager                   4.3.0-0.nightly-2019-12-05-073829   True        False         False      8h
kube-scheduler                            4.3.0-0.nightly-2019-12-05-073829   True        False         False      8h
machine-api                               4.3.0-0.nightly-2019-12-05-073829   True        False         False      8h
machine-config                            4.2.9                               True        False         False      8h
marketplace                               4.3.0-0.nightly-2019-12-05-073829   True        False         False      4h31m
monitoring                                4.2.9                               False       True          True       4h27m
network                                   4.2.9                               True        False         False      8h
node-tuning                               4.3.0-0.nightly-2019-12-05-073829   True        False         False      4h32m
openshift-apiserver                       4.3.0-0.nightly-2019-12-05-073829   True        False         False      8h
openshift-controller-manager              4.3.0-0.nightly-2019-12-05-073829   True        False         False      8h
openshift-samples                         4.3.0-0.nightly-2019-12-05-073829   True        False         False      4h26m
operator-lifecycle-manager                4.3.0-0.nightly-2019-12-05-073829   True        False         False      8h
operator-lifecycle-manager-catalog        4.3.0-0.nightly-2019-12-05-073829   True        False         False      8h
operator-lifecycle-manager-packageserver  4.3.0-0.nightly-2019-12-05-073829   True        False         False      4h30m
service-ca                                4.3.0-0.nightly-2019-12-05-073829   True        False         False      8h
service-catalog-apiserver                 4.3.0-0.nightly-2019-12-05-073829   True        False         False      8h
service-catalog-controller-manager        4.3.0-0.nightly-2019-12-05-073829   True        False         False      8h
storage                                   4.3.0-0.nightly-2019-12-05-073829   True        False         False      4h32m
curl-ing http://thanos-querier-openshift-monitoring.apps.mffiedler-1205.perf-testing.devcluster.openshift.com gives an empty response, no error.
The route in question was created at 2019-12-05T20:09:20Z. From 2019-12-05T20:09:41Z to 2019-12-05T20:12:15Z, the API was returning "forbidden: not yet ready to handle request" errors (and not only to the router: I see the same error in the logs of the auth operator, the samples operator, the CVO, and kube-controller-manager). The router may have admitted the route but then failed to update its status because of the API outage. The router is designed to function with limited privileges. It is not clear to me whether the router should retry on forbidden errors, or whether the API is wrong to return a forbidden error for a request that should be retried.
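If the forbidden errors are what blocked the status update, the router logs should show it. A sketch (router-default is the default ingress controller deployment name; adjust if a custom ingress controller is in use):

# Look for the route-status update being rejected during the API outage window.
# (oc logs against a deployment picks one pod; check each router pod if needed.)
oc -n openshift-ingress logs deployment/router-default --since=24h | grep -i forbidden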
Didn't see the issue during recent 4.4 upgrade testing; moving to verified.
Since the problem described in this bug report should be resolved in a recent advisory, it has been closed with a resolution of ERRATA. For information on the advisory, and where to find the updated files, follow the link below. If the solution does not work for you, open a new bug report. https://access.redhat.com/errata/RHBA-2020:0581
I am now seeing the monitoring operator stuck in the failed state again, after installing version 4.5.0:

monitoring   False   True   True   98m

The failure reason is:

Conditions:
  Last Transition Time:  2020-07-30T03:29:03Z
  Message:               Failed to rollout the stack. Error: running task Updating Alertmanager failed: syncing Thanos Querier trusted CA bundle ConfigMap failed: waiting for config map key "ca-bundle.crt" in openshift-monitoring/alertmanager-trusted-ca-bundle ConfigMap object failed: timed out waiting for the condition: empty value
  Reason:                UpdatingAlertmanagerFailed
  Status:                True
  Type:                  Degraded
  Last Transition Time:  2020-07-30T04:34:29Z
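The ConfigMap named in that message is populated by trusted CA bundle injection (the ConfigMap carries a config.openshift.io/inject-trusted-cabundle label, if I remember correctly), so the first thing I would check is whether the key was ever injected. A sketch:

# Is the ConfigMap labeled for injection, and does ca-bundle.crt have any content?
# ("empty value" in the operator message means the key exists but is empty.)
oc -n openshift-monitoring get configmap alertmanager-trusted-ca-bundle --show-labels
oc -n openshift-monitoring get configmap alertmanager-trusted-ca-bundle -o jsonpath='{.data.ca-bundle\.crt}' | wc -c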
I'm seeing a similar issue, but with a different error message, when upgrading to the latest 4.7.xx release:

Failed to rollout the stack. Error: running task Updating configuration sharing failed: failed to retrieve Grafana host: getting Route object failed: the server is currently unable to handle the request (get routes.route.openshift.io grafana)
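That "server is currently unable to handle the request" error is a 503 from the API aggregation layer rather than from the monitoring operator itself; Route objects are served by an aggregated APIService, so I would check its availability first. A sketch (the jsonpath is just one way to surface the Available condition):

# If v1.route.openshift.io is unavailable, every Route get fails with exactly this message:
oc get apiservice v1.route.openshift.io -o jsonpath='{range .status.conditions[?(@.type=="Available")]}{.status}{"\t"}{.message}{"\n"}{end}'
oc -n openshift-apiserver get pods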