Bug 1896168
Summary: | HAProxyReloadFail alert only briefly fires in the event of a broken HAProxy config | ||
---|---|---|---|
Product: | OpenShift Container Platform | Reporter: | OpenShift BugZilla Robot <openshift-bugzilla-robot> |
Component: | Networking | Assignee: | Stephen Greene <sgreene> |
Networking sub component: | router | QA Contact: | Arvind iyengar <aiyengar> |
Status: | CLOSED ERRATA | Docs Contact: | |
Severity: | medium | ||
Priority: | high | CC: | aiyengar, aos-bugs, bperkins, wking |
Version: | 4.7 | ||
Target Milestone: | --- | ||
Target Release: | 4.5.z | ||
Hardware: | Unspecified | ||
OS: | Unspecified | ||
Whiteboard: | |||
Fixed In Version: | Doc Type: | Bug Fix | |
Doc Text: |
Cause:
OpenShift router creates an invalid HAProxy config that causes router reloads to fail.
Consequence:
HAProxyReloadFail prometheus alert only fires for a span of ~5 minutes, regardless of the actual duration of the reload outage.
Fix:
Replace the router template_router_reload_fails counter metric with the new template_router_reload_failure gauge metric. Change the HAProxyReloadFail alert to fire based on the boolean status of the template_router_reload_failure metric.
Result:
The HAProxyReloadFail metric fires for the entire time that HAProxy reloads are failing.
|
Story Points: | --- |
Clone Of: | Environment: | ||
Last Closed: | 2021-04-13 23:43:27 UTC | Type: | --- |
Regression: | --- | Mount Type: | --- |
Documentation: | --- | CRM: | |
Verified Versions: | Category: | --- | |
oVirt Team: | --- | RHEL 7.3 requirements from Atomic Host: | |
Cloudforms Team: | --- | Target Upstream Version: | |
Embargoed: | |||
Bug Depends On: | 1896167 | ||
Bug Blocks: |
Description
OpenShift BugZilla Robot
2020-11-09 21:16:20 UTC
Adding upcoming sprint as this backport is in the queue. Depends on the 4.6.z BZ which has not merged yet. Adding upcoming sprint. this is awaiting cherry-pick approval. The PRs for this BZ are still awaiting cherry-pick approval. Tested in "4.5.0-0.nightly-2021-03-28-125930" payload. It is noted that the new metric and the associated Prometheus rules are added as intended: ---- curl -sS -H "Authorization: Bearer eyJhbGciOiJSUzI1NiIsImtpZCI6ImRFT2ZlVFlHSzNWdFBoLVdscjllVWNuNmQ5b0FqS1RNOFhnZDRrOUpPSjAifQ.eyJpc3MiOiJrdWJlcm5ldGVzL3NlcnZpY2VhY2NvdW50Iiwia3ViZXJuZXRlcy5pby9zZXJ2aWNlYWNjb3VudC9uYW1lc3BhY2UiOiJvcGVuc2hpZnQtbW9uaXRvcmluZyIsImt1YmVybmV0ZXMuaW8vc2VydmljZWFjY291bnQvc2VjcmV0Lm5hbWUiOiJwcm9tZXRoZXVzLWs4cy10b2tlbi02NGpjdCIsImt1YmVybmV0ZXMuaW8vc2VydmljZWFjY291bnQvc2VydmljZS1hY2NvdW50Lm5hbWUiOiJwcm9tZXRoZXVzLWs4cyIsImt1YmVybmV0ZXMuaW8vc2VydmljZWFjY291bnQvc2VydmljZS1hY2NvdW50LnVpZCI6IjAzZDEwNzc5LTgyYzktNDc5MC1hN2EyLTNkZmZkMjk5ZDE5ZCIsInN1YiI6InN5c3RlbTpzZXJ2aWNlYWNjb3VudDpvcGVuc2hpZnQtbW9uaXRvcmluZzpwcm9tZXRoZXVzLWs4cyJ9.RtdIZaHCJLbJ4tiFC5_I8S4_O3hDIUSOtaKQXy65lxvfW6QokUyR9UI3WILy31Jjv_5LDljxOK241kHvmyLOwBvAHxJ1l2lOuC5ckQ60rTJHZf9Yrc3tf9QOhDPLxagdQkn3vZjqWJC-Gm_iRalJgMqKp6NAR0qiJOTWpbSOfLQokZsIafaYYo81jWigAWjXrU8YgRP2_vSBwbMrtIrgXHuyGfyCUsa9VoH6jl7tW5m6xbij4N8hUWAXrTTqBmecuvT0ScxVpg51JF8QW1nV0HAauBHPLGifgBoby38pqsWbXWH9lKcCfHDY8KxOh4VZrMK58-spqiujKrmn9D3-wQ" -k https://10.131.0.14:1936/metrics | grep -i template_router_reload_failure # HELP template_router_reload_failure Metric to track the status of the most recent HAProxy reload # TYPE template_router_reload_failure gauge template_router_reload_failure 0 ---- Since the problem described in this bug report should be resolved in a recent advisory, it has been closed with a resolution of ERRATA. For information on the advisory (OpenShift Container Platform 4.5.37 bug fix update), and where to find the updated files, follow the link below. If the solution does not work for you, open a new bug report. https://access.redhat.com/errata/RHBA-2021:1015 |