Bug 1780398 - 4.2 to 4.3 upgrade stuck on monitoring: waiting for Thanos Querier Route to become ready failed
Keywords:
Status: CLOSED ERRATA
Alias: None
Product: OpenShift Container Platform
Classification: Red Hat
Component: Routing
Version: 4.3.0
Hardware: Unspecified
OS: Unspecified
Priority: medium
Severity: high
Target Milestone: ---
Target Release: 4.4.0
Assignee: Miciah Dashiel Butler Masters
QA Contact: Hongan Li
URL:
Whiteboard:
Depends On:
Blocks: 1781313
 
Reported: 2019-12-05 21:14 UTC by Mike Fiedler
Modified: 2020-07-30 04:50 UTC
CC: 11 users

Fixed In Version:
Doc Type: Bug Fix
Doc Text:
Cause: When the ingress controller tried to update a Route object's status and received an HTTP 403 error response from the API, it did not retry the update; for other errors, it retried up to 3 times.
Consequence: During API outages (for example, during upgrades), the ingress controller sometimes failed to update a Route object's status. In particular, it sometimes failed to record that a newly admitted Route had been admitted, which impeded the rollout of other components, such as the monitoring stack.
Fix: The ingress controller now retries all failed API calls until they succeed.
Result: The ingress controller is more resilient to API outages.
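The retry behavior in the fix can be sketched as follows. This is a minimal, illustrative model of the described behavior, not the router's actual code: the real router uses client-go status-update helpers, and all names here are assumptions. The fixed router retries until success; the sketch bounds the attempts so it always terminates.

```go
package main

import (
	"errors"
	"fmt"
)

// forbiddenErr mimics the "forbidden" response the API returned during
// the outage window described in this bug.
var forbiddenErr = errors.New("forbidden: not yet ready to handle request")

// retryUntilSuccess keeps invoking op until it succeeds or the attempt
// budget is exhausted, treating every error (including 403 Forbidden)
// as retriable.
func retryUntilSuccess(op func() error, maxAttempts int) error {
	var err error
	for attempt := 1; attempt <= maxAttempts; attempt++ {
		if err = op(); err == nil {
			return nil
		}
	}
	return err
}

func main() {
	attempts := 0
	// Simulate an API that rejects the first two status updates with a
	// forbidden error, then recovers (as the API did after the outage).
	err := retryUntilSuccess(func() error {
		attempts++
		if attempts < 3 {
			return forbiddenErr
		}
		return nil
	}, 5)
	fmt.Println(err, attempts) // <nil> 3
}
```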
Clone Of:
: 1780794 1781313 (view as bug list)
Environment:
Last Closed: 2020-05-04 11:18:37 UTC
Target Upstream Version:




Links
GitHub openshift/router pull 68 (closed): Bug 1780398: status: performIngressConditionUpdate: Retry on 403 (last updated 2021-02-15 03:36:29 UTC)
Red Hat Product Errata RHBA-2020:0581 (last updated 2020-05-04 11:19:10 UTC)

Description Mike Fiedler 2019-12-05 21:14:33 UTC
Description of problem:

Upgrading a cluster with 2,000 projects from 4.2.9 to 4.3.0-0.nightly-2019-12-05-073829 failed, with the monitoring operator hung for an hour.

'Failed to rollout the stack. Error: running task Updating Thanos Querier
      failed: waiting for Thanos Querier Route to become ready failed: waiting for
      RouteReady of thanos-querier: no status available for thanos-querier'
    reason: UpdatingThanosQuerierFailed

[root@ip-10-0-12-153 must-gather]# oc get clusterversion
NAME      VERSION   AVAILABLE   PROGRESSING   SINCE   STATUS
version   4.2.9     True        True          69m     Unable to apply 4.3.0-0.nightly-2019-12-05-073829: the cluster operator monitoring has not yet successfully rolled out
[root@ip-10-0-12-153 must-gather]# oc get clusteroperator monitoring
NAME         VERSION   AVAILABLE   PROGRESSING   DEGRADED   SINCE
monitoring   4.2.9     False       True          True       57m


oc get clusteroperator monitoring -o yaml:

apiVersion: config.openshift.io/v1
kind: ClusterOperator
metadata:
  creationTimestamp: "2019-12-05T16:17:46Z"
  generation: 1
  name: monitoring
  resourceVersion: "472659"
  selfLink: /apis/config.openshift.io/v1/clusteroperators/monitoring
  uid: c5e018c9-177a-11ea-b08d-06d8f7691e62
spec: {}
status:
  conditions:
  - lastTransitionTime: "2019-12-05T20:59:25Z"
    message: Rolling out the stack.
    reason: RollOutInProgress
    status: "True"
    type: Progressing
  - lastTransitionTime: "2019-12-05T20:19:21Z"
    message: 'Failed to rollout the stack. Error: running task Updating Thanos Querier
      failed: waiting for Thanos Querier Route to become ready failed: waiting for
      RouteReady of thanos-querier: no status available for thanos-querier'
    reason: UpdatingThanosQuerierFailed
    status: "True"
    type: Degraded
  - lastTransitionTime: "2019-12-05T20:59:25Z"
    message: Rollout of the monitoring stack is in progress. Please wait until it
      finishes.
    reason: RollOutInProgress
    status: "True"
    type: Upgradeable
  - lastTransitionTime: "2019-12-05T20:14:21Z"
    status: "False"
    type: Available
  extension: null
  relatedObjects:
  - group: ""
    name: openshift-monitoring
    resource: namespaces
  - group: ""
    name: openshift-monitoring
    resource: all
  - group: monitoring.coreos.com
    name: ""
    resource: servicemonitors
  - group: monitoring.coreos.com
    name: ""
    resource: prometheusrules
  - group: monitoring.coreos.com
    name: ""
    resource: alertmanagers
  - group: monitoring.coreos.com
    name: ""
    resource: prometheuses
  versions:
  - name: operator
    version: 4.2.9



Version-Release number of selected component (if applicable):


How reproducible:  Reproduced upgrading 4.2.9 to 4.3.0-0.nightly-2019-12-05-073829


Steps to Reproduce:
1. Installed a 4.2.9 cluster
2. Ran the QE cluster-loader tool to create the following projects:

projects:
  - num: 2000
    basename: svt-1-
    templates:
      -
        num: 6
        file: ./content/build-template.json
      -
        num: 10
        file: ./content/image-stream-template.json
      -
        num: 2
        file: ./content/deployment-config-0rep-pause-template.json
        parameters:
          -
            ENV_VALUE: "asodfn209e8j0eij0emc2oed2ed2ed2e2easodfn209e8j0eij0emc2oed2ed2ed2e2easodfn209e8j0eij0emc2oed2ed2ed2e2easodfn209e8j0eij0emc2oed2ed2ed2e2easodfn209e8j0eij0emc2oed2ed2ed2e2easodfn209e8j0eij0emc2oed2ed2ed2e2easodfn209e8j0eij0emc2oed2ed2ed2e2easodfn209e8j0eij12"
      -
        num: 40
        file: ./content/ssh-secret-template.json
      -
        num: 5
        file: ./content/route-template.json
      -
        num: 20
        file: ./content/configmap-template.json
      # rcs and services are implemented in deployments.
quotas:
  - name: default



3. Attempted upgrade to 4.3.0-0.nightly-2019-12-05-073829

Actual results:

Upgrade stalled on monitoring.  I will include location of must-gather in a private comment.

Comment 2 Mike Fiedler 2019-12-06 00:42:58 UTC
I gave this some more time, but it remained wedged:

NAME                                       VERSION                             AVAILABLE   PROGRESSING   DEGRADED   SINCE
authentication                             4.3.0-0.nightly-2019-12-05-073829   True        False         False      8h
cloud-credential                           4.3.0-0.nightly-2019-12-05-073829   True        False         False      8h
cluster-autoscaler                         4.3.0-0.nightly-2019-12-05-073829   True        False         False      8h
console                                    4.3.0-0.nightly-2019-12-05-073829   True        False         False      4h27m
dns                                        4.2.9                               True        False         False      8h
image-registry                             4.3.0-0.nightly-2019-12-05-073829   True        False         False      8h
ingress                                    4.3.0-0.nightly-2019-12-05-073829   True        False         False      4h30m
insights                                   4.3.0-0.nightly-2019-12-05-073829   True        False         False      8h
kube-apiserver                             4.3.0-0.nightly-2019-12-05-073829   True        False         False      8h
kube-controller-manager                    4.3.0-0.nightly-2019-12-05-073829   True        False         False      8h
kube-scheduler                             4.3.0-0.nightly-2019-12-05-073829   True        False         False      8h
machine-api                                4.3.0-0.nightly-2019-12-05-073829   True        False         False      8h
machine-config                             4.2.9                               True        False         False      8h
marketplace                                4.3.0-0.nightly-2019-12-05-073829   True        False         False      4h31m
monitoring                                 4.2.9                               False       True          True       4h27m
network                                    4.2.9                               True        False         False      8h
node-tuning                                4.3.0-0.nightly-2019-12-05-073829   True        False         False      4h32m
openshift-apiserver                        4.3.0-0.nightly-2019-12-05-073829   True        False         False      8h
openshift-controller-manager               4.3.0-0.nightly-2019-12-05-073829   True        False         False      8h
openshift-samples                          4.3.0-0.nightly-2019-12-05-073829   True        False         False      4h26m
operator-lifecycle-manager                 4.3.0-0.nightly-2019-12-05-073829   True        False         False      8h
operator-lifecycle-manager-catalog         4.3.0-0.nightly-2019-12-05-073829   True        False         False      8h
operator-lifecycle-manager-packageserver   4.3.0-0.nightly-2019-12-05-073829   True        False         False      4h30m
service-ca                                 4.3.0-0.nightly-2019-12-05-073829   True        False         False      8h
service-catalog-apiserver                  4.3.0-0.nightly-2019-12-05-073829   True        False         False      8h
service-catalog-controller-manager         4.3.0-0.nightly-2019-12-05-073829   True        False         False      8h
storage                                    4.3.0-0.nightly-2019-12-05-073829   True        False         False      4h32m

Comment 3 Mike Fiedler 2019-12-06 00:46:47 UTC
curl-ing http://thanos-querier-openshift-monitoring.apps.mffiedler-1205.perf-testing.devcluster.openshift.com   gives an empty response, no error.

Comment 5 Miciah Dashiel Butler Masters 2019-12-06 18:58:26 UTC
The route in question was created at time 2019-12-05T20:09:20Z.  From 2019-12-05T20:09:41Z to 2019-12-05T20:12:15Z, the API was returning "forbidden: not yet ready to handle request" errors (not only to the router: I see the same error in logs for the auth operator, samples operator, CVO, and kube-controller-manager).  The router may have admitted the route, but the router failed to update the route's status due to the API outage.  The router is designed to be able to function with limited privileges.  It is not clear to me whether the router should retry on forbidden errors, or whether the API is incorrect in returning a forbidden error for a request that should be retried.
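The question raised above, whether a 403 should be in the set of retriable errors, can be illustrated with a predicate-based retry sketch. This is a hypothetical model, not the router's actual implementation: `isForbidden` stands in for an HTTP-status check such as client-go's `apierrors.IsForbidden`, and the predicates only approximate the pre-fix and post-fix behavior (the linked PR is titled "Retry on 403").

```go
package main

import (
	"errors"
	"fmt"
	"strings"
)

// isForbidden is an illustrative stand-in for a real HTTP 403 check.
func isForbidden(err error) bool {
	return err != nil && strings.Contains(err.Error(), "forbidden")
}

// retryOn invokes op up to maxAttempts times, retrying only when the
// predicate reports the error as retriable.
func retryOn(retriable func(error) bool, maxAttempts int, op func() error) error {
	var err error
	for i := 0; i < maxAttempts; i++ {
		if err = op(); err == nil || !retriable(err) {
			return err
		}
	}
	return err
}

// makeFlakyOp returns an op that fails with a forbidden error on the
// first call and succeeds afterward, mimicking a brief API outage.
func makeFlakyOp() func() error {
	calls := 0
	return func() error {
		calls++
		if calls < 2 {
			return errors.New("forbidden: not yet ready to handle request")
		}
		return nil
	}
}

func main() {
	// Pre-fix behavior: Forbidden is not retriable, so the status update
	// fails once and the route's admitted condition is never recorded.
	oldErr := retryOn(func(err error) bool { return !isForbidden(err) }, 3, makeFlakyOp())

	// Post-fix behavior: every error is retriable, so the update succeeds
	// once the API recovers.
	newErr := retryOn(func(err error) bool { return true }, 3, makeFlakyOp())

	fmt.Println(oldErr != nil, newErr == nil) // true true
}
```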

Comment 7 Hongan Li 2020-02-10 07:47:06 UTC
Didn't see the issue during recent 4.4 upgrade testing; moving to VERIFIED.

Comment 9 errata-xmlrpc 2020-05-04 11:18:37 UTC
Since the problem described in this bug report should be
resolved in a recent advisory, it has been closed with a
resolution of ERRATA.

For information on the advisory, and where to find the updated
files, follow the link below.

If the solution does not work for you, open a new bug report.

https://access.redhat.com/errata/RHBA-2020:0581

Comment 10 Sam Rostampour 2020-07-30 04:50:41 UTC
I am now seeing the monitoring operator stuck in a failed state again after installing version 4.5.0:

monitoring                                           False       True          True       98m

The reason for the failure is:

  Conditions:
    Last Transition Time:  2020-07-30T03:29:03Z
    Message:               Failed to rollout the stack. Error: running task Updating Alertmanager failed: syncing Thanos Querier trusted CA bundle ConfigMap failed: waiting for config map key "ca-bundle.crt" in openshift-monitoring/alertmanager-trusted-ca-bundle ConfigMap object failed: timed out waiting for the condition: empty value
    Reason:                UpdatingAlertmanagerFailed
    Status:                True
    Type:                  Degraded
    Last Transition Time:  2020-07-30T04:34:29Z

