Bug 1780398
| Field | Value |
|---|---|
| Summary | 4.2 to 4.3 upgrade stuck on monitoring: waiting for Thanos Querier Route to become ready failed |
| Product | OpenShift Container Platform |
| Component | Networking |
| Networking sub component | router |
| Reporter | Mike Fiedler <mifiedle> |
| Assignee | Miciah Dashiel Butler Masters <mmasters> |
| QA Contact | Hongan Li <hongli> |
| Status | CLOSED ERRATA |
| Severity | high |
| Priority | medium |
| CC | alegrand, anpicker, aos-bugs, ematysek, erooth, kakkoyun, lcosic, mloibl, pkrupa, sayam.masood, srostamp, surbania |
| Version | 4.3.0 |
| Target Milestone | --- |
| Target Release | 4.4.0 |
| Hardware | Unspecified |
| OS | Unspecified |
| Doc Type | Bug Fix |
Doc Text:

Cause: When the ingress controller tried to update a Route object's status and received an HTTP 403 (Forbidden) response from the API, it did not retry the update; other errors were retried up to 3 times.

Consequence: During API outages (for example, during upgrades), the ingress controller sometimes failed to update a Route object's status. In particular, it sometimes failed to record that a newly admitted Route had been admitted, which blocked the rollout of other components, such as the monitoring stack.

Fix: The ingress controller now retries all failed API calls until they succeed.

Result: The ingress controller is now more resilient to API outages.
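To make the described behavior concrete, here is a minimal retry sketch, not the actual ingress-controller code: a status update is retried with backoff regardless of the error class, including 403, until it succeeds or a deadline passes. The `updateRouteStatus` stub and all other names are hypothetical.

```go
package main

import (
	"context"
	"errors"
	"fmt"
	"time"
)

// errForbidden stands in for an HTTP 403 ("forbidden: not yet ready to handle
// request") response from the API server. Before the fix, this class of error
// was not retried; other errors were retried up to 3 times.
var errForbidden = errors.New("forbidden: not yet ready to handle request")

// updateRouteStatus is a hypothetical stand-in for the API call that records
// that a Route has been admitted. It fails a few times to simulate a
// transient API outage during an upgrade.
func updateRouteStatus(attempt int) error {
	if attempt < 3 {
		return errForbidden
	}
	return nil
}

// retryUntilSuccess keeps calling fn with exponential backoff until it
// succeeds or the context expires, regardless of the kind of error returned.
func retryUntilSuccess(ctx context.Context, fn func(attempt int) error) error {
	backoff := 250 * time.Millisecond
	for attempt := 0; ; attempt++ {
		err := fn(attempt)
		if err == nil {
			return nil
		}
		fmt.Printf("attempt %d failed: %v (retrying in %v)\n", attempt, err, backoff)
		select {
		case <-ctx.Done():
			return fmt.Errorf("%w (last error: %v)", ctx.Err(), err)
		case <-time.After(backoff):
		}
		if backoff < 5*time.Second {
			backoff *= 2
		}
	}
}

func main() {
	ctx, cancel := context.WithTimeout(context.Background(), 30*time.Second)
	defer cancel()
	if err := retryUntilSuccess(ctx, updateRouteStatus); err != nil {
		fmt.Println("status update failed:", err)
		return
	}
	fmt.Println("route status updated")
}
```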
| Field | Value |
|---|---|
| Story Points | --- |
| Clone Of | |
| | 1780794 1781313 (view as bug list) |
| Last Closed | 2020-05-04 11:18:37 UTC |
| Type | Bug |
| Regression | --- |
| Mount Type | --- |
| Documentation | --- |
| Category | --- |
| oVirt Team | --- |
| Cloudforms Team | --- |
| Bug Blocks | 1781313 |
Description
Mike Fiedler
2019-12-05 21:14:33 UTC
I gave this some more time, but it remained wedged:

```
NAME                                       VERSION                             AVAILABLE   PROGRESSING   DEGRADED   SINCE
authentication                             4.3.0-0.nightly-2019-12-05-073829   True        False         False      8h
cloud-credential                           4.3.0-0.nightly-2019-12-05-073829   True        False         False      8h
cluster-autoscaler                         4.3.0-0.nightly-2019-12-05-073829   True        False         False      8h
console                                    4.3.0-0.nightly-2019-12-05-073829   True        False         False      4h27m
dns                                        4.2.9                               True        False         False      8h
image-registry                             4.3.0-0.nightly-2019-12-05-073829   True        False         False      8h
ingress                                    4.3.0-0.nightly-2019-12-05-073829   True        False         False      4h30m
insights                                   4.3.0-0.nightly-2019-12-05-073829   True        False         False      8h
kube-apiserver                             4.3.0-0.nightly-2019-12-05-073829   True        False         False      8h
kube-controller-manager                    4.3.0-0.nightly-2019-12-05-073829   True        False         False      8h
kube-scheduler                             4.3.0-0.nightly-2019-12-05-073829   True        False         False      8h
machine-api                                4.3.0-0.nightly-2019-12-05-073829   True        False         False      8h
machine-config                             4.2.9                               True        False         False      8h
marketplace                                4.3.0-0.nightly-2019-12-05-073829   True        False         False      4h31m
monitoring                                 4.2.9                               False       True          True       4h27m
network                                    4.2.9                               True        False         False      8h
node-tuning                                4.3.0-0.nightly-2019-12-05-073829   True        False         False      4h32m
openshift-apiserver                        4.3.0-0.nightly-2019-12-05-073829   True        False         False      8h
openshift-controller-manager               4.3.0-0.nightly-2019-12-05-073829   True        False         False      8h
openshift-samples                          4.3.0-0.nightly-2019-12-05-073829   True        False         False      4h26m
operator-lifecycle-manager                 4.3.0-0.nightly-2019-12-05-073829   True        False         False      8h
operator-lifecycle-manager-catalog         4.3.0-0.nightly-2019-12-05-073829   True        False         False      8h
operator-lifecycle-manager-packageserver   4.3.0-0.nightly-2019-12-05-073829   True        False         False      4h30m
service-ca                                 4.3.0-0.nightly-2019-12-05-073829   True        False         False      8h
service-catalog-apiserver                  4.3.0-0.nightly-2019-12-05-073829   True        False         False      8h
service-catalog-controller-manager         4.3.0-0.nightly-2019-12-05-073829   True        False         False      8h
storage                                    4.3.0-0.nightly-2019-12-05-073829   True        False         False      4h32m
```

curl-ing http://thanos-querier-openshift-monitoring.apps.mffiedler-1205.perf-testing.devcluster.openshift.com gives an empty response, no error.

The route in question was created at 2019-12-05T20:09:20Z. From 2019-12-05T20:09:41Z to 2019-12-05T20:12:15Z, the API was returning "forbidden: not yet ready to handle request" errors (not only to the router: I see the same error in the logs of the auth operator, samples operator, CVO, and kube-controller-manager). The router may have admitted the route, but it failed to update the route's status because of the API outage. The router is designed to be able to function with limited privileges. It is not clear to me whether the router should retry on forbidden errors, or whether the API is wrong to return a forbidden error for a request that should be retried.
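For orientation only: "waiting for Thanos Querier Route to become ready" amounts to polling the Route until its status carries an Admitted=True condition written by the router. The sketch below is a hypothetical, minimal version of such a check using the openshift route/v1 types; the `getRoute` callback, the polling interval, and the stub in `main` are assumptions, not code from either operator.

```go
package main

import (
	"fmt"
	"time"

	routev1 "github.com/openshift/api/route/v1"
	corev1 "k8s.io/api/core/v1"
)

// routeAdmitted reports whether any router has recorded an Admitted=True
// condition in the Route's status. If the router admitted the route but never
// managed to write this status (for example because the API returned 403
// during the upgrade), this stays false and callers keep waiting.
func routeAdmitted(route *routev1.Route) bool {
	for _, ingress := range route.Status.Ingress {
		for _, cond := range ingress.Conditions {
			if cond.Type == routev1.RouteAdmitted && cond.Status == corev1.ConditionTrue {
				return true
			}
		}
	}
	return false
}

// waitForRoute polls the supplied getRoute callback until the Route is
// admitted or the timeout expires.
func waitForRoute(getRoute func() (*routev1.Route, error), timeout time.Duration) error {
	deadline := time.Now().Add(timeout)
	for {
		route, err := getRoute()
		if err == nil && routeAdmitted(route) {
			return nil
		}
		if time.Now().After(deadline) {
			return fmt.Errorf("timed out waiting for route to be admitted (last error: %v)", err)
		}
		time.Sleep(2 * time.Second)
	}
}

func main() {
	// Stub that returns an already-admitted Route, purely to exercise the check.
	getRoute := func() (*routev1.Route, error) {
		return &routev1.Route{
			Status: routev1.RouteStatus{
				Ingress: []routev1.RouteIngress{{
					Conditions: []routev1.RouteIngressCondition{{
						Type:   routev1.RouteAdmitted,
						Status: corev1.ConditionTrue,
					}},
				}},
			},
		}, nil
	}
	if err := waitForRoute(getRoute, 10*time.Second); err != nil {
		fmt.Println("route not ready:", err)
		return
	}
	fmt.Println("route admitted")
}
```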
Didn't see the issue during recent 4.4 upgrade testing; moving to verified.

Since the problem described in this bug report should be resolved in a recent advisory, it has been closed with a resolution of ERRATA. For information on the advisory, and where to find the updated files, follow the link below. If the solution does not work for you, open a new bug report.

https://access.redhat.com/errata/RHBA-2020:0581

I am now seeing the monitoring operator stuck in the failed state again after installing version 4.5.0:

```
monitoring   False   True   True   98m
```

The reason for the failure is:

```
Conditions:
  Last Transition Time:  2020-07-30T03:29:03Z
  Message:               Failed to rollout the stack. Error: running task Updating Alertmanager failed: syncing Thanos Querier trusted CA bundle ConfigMap failed: waiting for config map key "ca-bundle.crt" in openshift-monitoring/alertmanager-trusted-ca-bundle ConfigMap object failed: timed out waiting for the condition: empty value
  Reason:                UpdatingAlertmanagerFailed
  Status:                True
  Type:                  Degraded
  Last Transition Time:  2020-07-30T04:34:29Z
```

I'm seeing a similar issue, but with a different error message, when upgrading to the latest 4.7.xx release:

```
Failed to rollout the stack. Error: running task Updating configuration sharing failed: failed to retrieve Grafana host: getting Route object failed: the server is currently unable to handle the request (get routes.route.openshift.io grafana)
```
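Both of these later reports come down to individual API calls failing transiently while the cluster is in flux. As a hedged illustration only (not code from the monitoring operator), the sketch below shows how such errors can be classified as retryable using the k8s.io/apimachinery error helpers; the `isTransientAPIError` name and the choice of which errors to treat as transient are assumptions.

```go
package main

import (
	"fmt"
	"net/http"

	apierrors "k8s.io/apimachinery/pkg/api/errors"
	"k8s.io/apimachinery/pkg/runtime/schema"
)

// isTransientAPIError is a hypothetical classifier: it reports whether an
// error returned by the API server looks like a temporary outage of the kind
// seen during upgrades, so the caller can keep retrying instead of
// immediately reporting Degraded.
func isTransientAPIError(err error) bool {
	return apierrors.IsServiceUnavailable(err) || // "the server is currently unable to handle the request"
		apierrors.IsServerTimeout(err) ||
		apierrors.IsTimeout(err) ||
		apierrors.IsTooManyRequests(err) ||
		apierrors.IsForbidden(err) // "forbidden: not yet ready to handle request" while the API starts up
}

func main() {
	// Construct a 503 the way a client would surface it, to exercise the classifier.
	gr := schema.GroupResource{Group: "route.openshift.io", Resource: "routes"}
	err := apierrors.NewGenericServerResponse(
		http.StatusServiceUnavailable, "get", gr, "grafana", "", 0, true)
	fmt.Println(isTransientAPIError(err)) // prints: true
}
```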