Bug 1892338
| Summary: | HAProxyReloadFail alert only briefly fires in the event of a broken HAProxy config | ||
|---|---|---|---|
| Product: | OpenShift Container Platform | Reporter: | Stephen Greene <sgreene> |
| Component: | Networking | Assignee: | Stephen Greene <sgreene> |
| Networking sub component: | router | QA Contact: | Arvind iyengar <aiyengar> |
| Status: | CLOSED ERRATA | Docs Contact: | |
| Severity: | medium | ||
| Priority: | high | CC: | aos-bugs, bperkins, hongli, jeder, wking |
| Version: | 4.7 | Keywords: | ServiceDeliveryImpact |
| Target Milestone: | --- | ||
| Target Release: | 4.7.0 | ||
| Hardware: | Unspecified | ||
| OS: | Unspecified | ||
| Whiteboard: | |||
| Fixed In Version: | Doc Type: | Bug Fix | |
| Doc Text: |
Cause:
OpenShift router creates an invalid HAProxy config that causes router reloads to fail.
Consequence:
The HAProxyReloadFail Prometheus alert only fires for a span of ~5 minutes, regardless of the actual duration of the reload outage.
Fix:
Replace the router template_router_reload_fails counter metric with the new template_router_reload_failure gauge metric. Change the HAProxyReloadFail alert to fire based on the boolean status of the template_router_reload_failure metric.
Result:
The HAProxyReloadFail alert fires for the entire time that HAProxy reloads are failing.
|
Story Points: | --- |
| Clone Of: | Environment: | ||
| Last Closed: | 2021-02-24 15:28:37 UTC | Type: | Bug |
| Regression: | --- | Mount Type: | --- |
| Documentation: | --- | CRM: | |
| Verified Versions: | Category: | --- | |
| oVirt Team: | --- | RHEL 7.3 requirements from Atomic Host: | |
| Cloudforms Team: | --- | Target Upstream Version: | |
| Embargoed: | |||
| Bug Depends On: | |||
| Bug Blocks: | 1896167 | ||
|
Description
Stephen Greene
2020-10-28 14:04:36 UTC
I'd mentioned a positive name, but a negative name like template_router_reload_failure might be more convenient if you wanted a label with a reason slug, or some such. You could always add a reason to a positive label too, but template_router_reload_success{failure_reason="whatever"} feels more awkward than template_router_reload_failure{reason="whatever"}.
(In reply to W. Trevor King from comment #1)
> I'd mentioned a positive name, but a negative name like
> template_router_reload_failure might be more convenient if you wanted a
> label with a reason slug, or some such. You could always add a reason to a
> positive label too, but
> template_router_reload_success{failure_reason="whatever"} feels more awkward
> than template_router_reload_failure{reason="whatever"}.

Noted, I will make sure to use a negative name instead. :)

Not a regression, so it's hard to imagine holding 4.7.0 on a fix for this.

Tested in "4.7.0-0.nightly-2020-11-25-114114" payload. It is noted that the new metric and the associated Prometheus rules are added as intended:
----
$ curl -sS -H "Authorization: Bearer eyJhbGciOiJSUzI1NiIsImtpZCI6IjhjRDJ6ZnJsdnhKczlVQ2R6TndrOW1RS29BdS1LSEhzbGtQUEVPZFNsVUkifQ.eyJpc3MiOiJrdWJlcm5ldGVzL3NlcnZpY2VhY2NvdW50Iiwia3ViZXJuZXRlcy5pby9zZXJ2aWNlYWNjb3VudC9uYW1lc3BhY2UiOiJvcGVuc2hpZnQtbW9uaXRvcmluZyIsImt1YmVybmV0ZXMuaW8vc2VydmljZWFjY291bnQvc2VjcmV0Lm5hbWUiOiJwcm9tZXRoZXVzLWs4cy10b2tlbi1yZmhtdiIsImt1YmVybmV0ZXMuaW8vc2VydmljZWFjY291bnQvc2VydmljZS1hY2NvdW50Lm5hbWUiOiJwcm9tZXRoZXVzLWs4cyIsImt1YmVybmV0ZXMuaW8vc2VydmljZWFjY291bnQvc2VydmljZS1hY2NvdW50LnVpZCI6IjFiYmIyOGQwLThkMDgtNGI2MC1hNDk4LTBjYzc3NzE4OWY1YSIsInN1YiI6InN5c3RlbTpzZXJ2aWNlYWNjb3VudDpvcGVuc2hpZnQtbW9uaXRvcmluZzpwcm9tZXRoZXVzLWs4cyJ9.b2nqVlbW4xHGuh3EYuXTsCV9fcjc6G5Yq9TABxqAYaUuRSmp79lH5dfhC9k9TRKDITlVxsXDvMuH_CN392RlwXIMyytEidnNP_zTH-rqpl12NrDTxGfurf2WtfZefPGDM1tSadcAm_jGmebDLzmjWGPGm5mWIYIdRiBaILku0HDrhDLgfhG-BpsZ5WXTikJhdskmCs38Ru9oQcuyXIJEXSOnGqKJZYQFqdzPsA8zh-aCotI49R42qo903tXvoh5kvBU4kx0gAgwdaXYBPOrKe8kWWDvE6gL_NHcb3pLIoI05vXvMW0BwBnMX8h9X7KCXymsU7aH-IiBVnNaRQ8BxYQ" -k https://10.129.2.22:1936/metrics | grep -i template_router_reload_failure
# HELP template_router_reload_failure Metric to track the status of the most recent HAProxy reload
# TYPE template_router_reload_failure gauge
template_router_reload_failure 0
----

Since the problem described in this bug report should be resolved in a recent advisory, it has been closed with a resolution of ERRATA. For information on the advisory (Moderate: OpenShift Container Platform 4.7.0 security, bug fix, and enhancement update), and where to find the updated files, follow the link below. If the solution does not work for you, open a new bug report.
https://access.redhat.com/errata/RHSA-2020:5633
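
For readers following the Doc Text, here is a rough sketch of why an alert keyed on a counter can stop firing while the outage continues, whereas one keyed on a gauge does not. It is a toy simulation, not the router's or Prometheus's code. It assumes the old rule behaved roughly like "the template_router_reload_fails counter increased within the last 5 minutes", and that no further reload attempts happen after the first failure. The function names and the one-sample-per-minute timeline are hypothetical.

```python
# Toy simulation (hypothetical names; not the router's or Prometheus's code).

WINDOW = 5  # minutes; assumed look-back window of the old counter-based rule


def counter_alert(samples, t, window=WINDOW):
    """Fire if the counter grew within the trailing window ending at minute t."""
    start = max(0, t - window)
    return samples[t] - samples[start] > 0


def gauge_alert(samples, t):
    """Fire while the gauge says the most recent reload failed (value 1)."""
    return samples[t] == 1


# One sample per minute for 20 minutes. A reload fails at minute 3 and,
# by assumption, no reload is attempted (or succeeds) afterwards.
counter = [0, 0, 0] + [1] * 17  # counter increments once, then stays flat
gauge = [0, 0, 0] + [1] * 17    # gauge stays at 1 until a reload succeeds

counter_firing = [t for t in range(20) if counter_alert(counter, t)]
gauge_firing = [t for t in range(20) if gauge_alert(gauge, t)]

print("counter-based alert fires at minutes:", counter_firing)
print("gauge-based alert fires at minutes:", gauge_firing)
```

Under these assumptions the counter-based check fires only for minutes 3 through 7, while the gauge-based check fires from minute 3 through the end of the timeline, which matches the ~5 minute span described in the Consequence.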