Bug 1780398
| Field | Value |
|---|---|
| Summary | 4.2 to 4.3 upgrade stuck on monitoring: waiting for Thanos Querier Route to become ready failed |
| Product | OpenShift Container Platform |
| Component | Networking |
| Networking sub component | router |
| Reporter | Mike Fiedler <mifiedle> |
| Assignee | Miciah Dashiel Butler Masters <mmasters> |
| QA Contact | Hongan Li <hongli> |
| Status | CLOSED ERRATA |
| Severity | high |
| Priority | medium |
| CC | alegrand, anpicker, aos-bugs, ematysek, erooth, kakkoyun, lcosic, mloibl, pkrupa, sayam.masood, srostamp, surbania |
| Version | 4.3.0 |
| Target Milestone | --- |
| Target Release | 4.4.0 |
| Hardware | Unspecified |
| OS | Unspecified |
| Doc Type | Bug Fix |
Doc Text:

Cause: When the ingress controller tried to update a Route object's status and received an HTTP 403 (Forbidden) response from the API, it did not retry the update; other errors were retried up to 3 times.

Consequence: During API outages (for example, during upgrades), the ingress controller sometimes failed to update a Route object's status. In particular, it sometimes failed to record that a newly admitted Route had been admitted, which blocked the rollout of other components, such as the monitoring stack.

Fix: The ingress controller now retries all failed API calls until they succeed.

Result: The ingress controller is now more resilient to API outages.
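To make the described behavior concrete, here is a minimal retry sketch, not the actual ingress-controller code: a status update is retried with backoff regardless of the error class, including 403, until it succeeds or a deadline passes. The `updateRouteStatus` stub and all other names are hypothetical.

```go
package main

import (
	"context"
	"errors"
	"fmt"
	"time"
)

// errForbidden stands in for an HTTP 403 ("forbidden: not yet ready to handle
// request") response from the API server. Before the fix, this class of error
// was not retried; other errors were retried up to 3 times.
var errForbidden = errors.New("forbidden: not yet ready to handle request")

// updateRouteStatus is a hypothetical stand-in for the API call that records
// that a Route has been admitted. It fails a few times to simulate a
// transient API outage during an upgrade.
func updateRouteStatus(attempt int) error {
	if attempt < 3 {
		return errForbidden
	}
	return nil
}

// retryUntilSuccess keeps calling fn with exponential backoff until it
// succeeds or the context expires, regardless of the kind of error returned.
func retryUntilSuccess(ctx context.Context, fn func(attempt int) error) error {
	backoff := 250 * time.Millisecond
	for attempt := 0; ; attempt++ {
		err := fn(attempt)
		if err == nil {
			return nil
		}
		fmt.Printf("attempt %d failed: %v (retrying in %v)\n", attempt, err, backoff)
		select {
		case <-ctx.Done():
			return fmt.Errorf("%w (last error: %v)", ctx.Err(), err)
		case <-time.After(backoff):
		}
		if backoff < 5*time.Second {
			backoff *= 2
		}
	}
}

func main() {
	ctx, cancel := context.WithTimeout(context.Background(), 30*time.Second)
	defer cancel()
	if err := retryUntilSuccess(ctx, updateRouteStatus); err != nil {
		fmt.Println("status update failed:", err)
		return
	}
	fmt.Println("route status updated")
}
```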
| Field | Value |
|---|---|
| Story Points | --- |
| Clone Of | |
| | 1780794 1781313 (view as bug list) |
| Last Closed | 2020-05-04 11:18:37 UTC |
| Type | Bug |
| Regression | --- |
| Mount Type | --- |
| Documentation | --- |
| Category | --- |
| oVirt Team | --- |
| Cloudforms Team | --- |
| Bug Blocks | 1781313 |
Description
Mike Fiedler
2019-12-05 21:14:33 UTC
I gave this some more time, but it remained wedged:

```
NAME                                       VERSION                             AVAILABLE   PROGRESSING   DEGRADED   SINCE
authentication                             4.3.0-0.nightly-2019-12-05-073829   True        False         False      8h
cloud-credential                           4.3.0-0.nightly-2019-12-05-073829   True        False         False      8h
cluster-autoscaler                         4.3.0-0.nightly-2019-12-05-073829   True        False         False      8h
console                                    4.3.0-0.nightly-2019-12-05-073829   True        False         False      4h27m
dns                                        4.2.9                               True        False         False      8h
image-registry                             4.3.0-0.nightly-2019-12-05-073829   True        False         False      8h
ingress                                    4.3.0-0.nightly-2019-12-05-073829   True        False         False      4h30m
insights                                   4.3.0-0.nightly-2019-12-05-073829   True        False         False      8h
kube-apiserver                             4.3.0-0.nightly-2019-12-05-073829   True        False         False      8h
kube-controller-manager                    4.3.0-0.nightly-2019-12-05-073829   True        False         False      8h
kube-scheduler                             4.3.0-0.nightly-2019-12-05-073829   True        False         False      8h
machine-api                                4.3.0-0.nightly-2019-12-05-073829   True        False         False      8h
machine-config                             4.2.9                               True        False         False      8h
marketplace                                4.3.0-0.nightly-2019-12-05-073829   True        False         False      4h31m
monitoring                                 4.2.9                               False       True          True       4h27m
network                                    4.2.9                               True        False         False      8h
node-tuning                                4.3.0-0.nightly-2019-12-05-073829   True        False         False      4h32m
openshift-apiserver                        4.3.0-0.nightly-2019-12-05-073829   True        False         False      8h
openshift-controller-manager               4.3.0-0.nightly-2019-12-05-073829   True        False         False      8h
openshift-samples                          4.3.0-0.nightly-2019-12-05-073829   True        False         False      4h26m
operator-lifecycle-manager                 4.3.0-0.nightly-2019-12-05-073829   True        False         False      8h
operator-lifecycle-manager-catalog         4.3.0-0.nightly-2019-12-05-073829   True        False         False      8h
operator-lifecycle-manager-packageserver   4.3.0-0.nightly-2019-12-05-073829   True        False         False      4h30m
service-ca                                 4.3.0-0.nightly-2019-12-05-073829   True        False         False      8h
service-catalog-apiserver                  4.3.0-0.nightly-2019-12-05-073829   True        False         False      8h
service-catalog-controller-manager         4.3.0-0.nightly-2019-12-05-073829   True        False         False      8h
storage                                    4.3.0-0.nightly-2019-12-05-073829   True        False         False      4h32m
```

curl-ing http://thanos-querier-openshift-monitoring.apps.mffiedler-1205.perf-testing.devcluster.openshift.com gives an empty response, no error.

The route in question was created at 2019-12-05T20:09:20Z. From 2019-12-05T20:09:41Z to 2019-12-05T20:12:15Z, the API was returning "forbidden: not yet ready to handle request" errors (not only to the router: I see the same error in the logs of the auth operator, samples operator, CVO, and kube-controller-manager). The router may have admitted the route, but it failed to update the route's status because of the API outage. The router is designed to be able to function with limited privileges. It is not clear to me whether the router should retry on forbidden errors, or whether the API is wrong to return a forbidden error for a request that should be retried.
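For orientation only: "waiting for Thanos Querier Route to become ready" amounts to polling the Route until its status carries an Admitted=True condition written by the router. The sketch below is a hypothetical, minimal version of such a check using the openshift route/v1 types; the `getRoute` callback, the polling interval, and the stub in `main` are assumptions, not code from either operator.

```go
package main

import (
	"fmt"
	"time"

	routev1 "github.com/openshift/api/route/v1"
	corev1 "k8s.io/api/core/v1"
)

// routeAdmitted reports whether any router has recorded an Admitted=True
// condition in the Route's status. If the router admitted the route but never
// managed to write this status (for example because the API returned 403
// during the upgrade), this stays false and callers keep waiting.
func routeAdmitted(route *routev1.Route) bool {
	for _, ingress := range route.Status.Ingress {
		for _, cond := range ingress.Conditions {
			if cond.Type == routev1.RouteAdmitted && cond.Status == corev1.ConditionTrue {
				return true
			}
		}
	}
	return false
}

// waitForRoute polls the supplied getRoute callback until the Route is
// admitted or the timeout expires.
func waitForRoute(getRoute func() (*routev1.Route, error), timeout time.Duration) error {
	deadline := time.Now().Add(timeout)
	for {
		route, err := getRoute()
		if err == nil && routeAdmitted(route) {
			return nil
		}
		if time.Now().After(deadline) {
			return fmt.Errorf("timed out waiting for route to be admitted (last error: %v)", err)
		}
		time.Sleep(2 * time.Second)
	}
}

func main() {
	// Stub that returns an already-admitted Route, purely to exercise the check.
	getRoute := func() (*routev1.Route, error) {
		return &routev1.Route{
			Status: routev1.RouteStatus{
				Ingress: []routev1.RouteIngress{{
					Conditions: []routev1.RouteIngressCondition{{
						Type:   routev1.RouteAdmitted,
						Status: corev1.ConditionTrue,
					}},
				}},
			},
		}, nil
	}
	if err := waitForRoute(getRoute, 10*time.Second); err != nil {
		fmt.Println("route not ready:", err)
		return
	}
	fmt.Println("route admitted")
}
```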
Didn't see the issue during recent 4.4 upgrade testing; moving to verified.

Since the problem described in this bug report should be resolved in a recent advisory, it has been closed with a resolution of ERRATA. For information on the advisory, and where to find the updated files, follow the link below. If the solution does not work for you, open a new bug report.

https://access.redhat.com/errata/RHBA-2020:0581

I am now seeing the monitoring operator stuck in the failed state again after installing version 4.5.0:

```
monitoring   False   True   True   98m
```

The reason for the failure is:

```
Conditions:
  Last Transition Time:  2020-07-30T03:29:03Z
  Message:               Failed to rollout the stack. Error: running task Updating Alertmanager failed: syncing Thanos Querier trusted CA bundle ConfigMap failed: waiting for config map key "ca-bundle.crt" in openshift-monitoring/alertmanager-trusted-ca-bundle ConfigMap object failed: timed out waiting for the condition: empty value
  Reason:                UpdatingAlertmanagerFailed
  Status:                True
  Type:                  Degraded
  Last Transition Time:  2020-07-30T04:34:29Z
```

I'm seeing a similar issue, but with a different error message, when upgrading to the latest 4.7.xx release:

```
Failed to rollout the stack. Error: running task Updating configuration sharing failed: failed to retrieve Grafana host: getting Route object failed: the server is currently unable to handle the request (get routes.route.openshift.io grafana)
```
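Both of these later reports come down to individual API calls failing transiently while the cluster is in flux. As a hedged illustration only (not code from the monitoring operator), the sketch below shows how such errors can be classified as retryable using the k8s.io/apimachinery error helpers; the `isTransientAPIError` name and the choice of which errors to treat as transient are assumptions.

```go
package main

import (
	"fmt"
	"net/http"

	apierrors "k8s.io/apimachinery/pkg/api/errors"
	"k8s.io/apimachinery/pkg/runtime/schema"
)

// isTransientAPIError is a hypothetical classifier: it reports whether an
// error returned by the API server looks like a temporary outage of the kind
// seen during upgrades, so the caller can keep retrying instead of
// immediately reporting Degraded.
func isTransientAPIError(err error) bool {
	return apierrors.IsServiceUnavailable(err) || // "the server is currently unable to handle the request"
		apierrors.IsServerTimeout(err) ||
		apierrors.IsTimeout(err) ||
		apierrors.IsTooManyRequests(err) ||
		apierrors.IsForbidden(err) // "forbidden: not yet ready to handle request" while the API starts up
}

func main() {
	// Construct a 503 the way a client would surface it, to exercise the classifier.
	gr := schema.GroupResource{Group: "route.openshift.io", Resource: "routes"}
	err := apierrors.NewGenericServerResponse(
		http.StatusServiceUnavailable, "get", gr, "grafana", "", 0, true)
	fmt.Println(isTransientAPIError(err)) // prints: true
}
```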