1892338 – HAProxyReloadFail alert only briefly fires in the event of a broken HAProxy config

Bug 1892338 - HAProxyReloadFail alert only briefly fires in the event of a broken HAProxy config

Summary: HAProxyReloadFail alert only briefly fires in the event of a broken HAProxy c...

Keywords:
Status:	CLOSED ERRATA
Alias:	None
Product:	OpenShift Container Platform
Classification:	Red Hat
Component:	Networking
Sub Component:
Version:	4.7
Hardware:	Unspecified
OS:	Unspecified
Priority:	high
Severity:	medium
Target Milestone:	---
Target Release:	4.7.0
Assignee:	Stephen Greene
QA Contact:	Arvind iyengar
Docs Contact:
URL:
Whiteboard:
Depends On:
Blocks:	1896167
TreeView+	depends on / blocked

Reported:	2020-10-28 14:04 UTC by Stephen Greene
Modified:	2022-08-04 22:30 UTC (History)
CC List:	5 users (show)
Fixed In Version:
Doc Type:	Bug Fix
Doc Text:	Cause: OpenShift router creates an invalid HAProxy config that causes router reloads to fail. Consequence: HAProxyReloadFail prometheus alert only fires for a span of ~5 minutes, regardless of the actual duration of the reload outage. Fix: Replace the router template_router_reload_fails counter metric with the new template_router_reload_failure gauge metric. Change the HAProxyReloadFail alert to fire based on the boolean status of the template_router_reload_failure metric. Result: The HAProxyReloadFail metric fires for the entire time that HAProxy reloads are failing.
Clone Of:
Environment:
Last Closed:	2021-02-24 15:28:37 UTC
Target Upstream Version:
Embargoed:

Attachments	(Terms of Use)

Links
System	ID	Priority	Status	Summary	Last Updated
Github	openshift cluster-ingress-operator pull 481	None	closed	Bug 1892338: Fix HAProxyReloadFail alert	2021-02-15 03:14:38 UTC
Github	openshift router pull 209	None	closed	Bug 1892338: metrics: Rework template_router_reload_failure metric	2021-02-15 03:14:39 UTC
Red Hat Product Errata	RHSA-2020:5633	None	None	None	2021-02-24 15:29:05 UTC

Description Stephen Greene 2020-10-28 14:04:36 UTC

If a route were to break the HAProxy config files, and thus break HAProxy reloads, the HAProxyReloadFail alert will only fire for ~5 minutes. 

Instead, the HAProxyReloadFail alert (& base metric) should be reworked.

The template_router_reload_fails metric should be dropped in exchange for a metric that tracks the status of the most recent reload. ie template_router_reload_success which is pinned to 1 on successful reloads, and 0 on failed reloads. The current template_router_reload_fails reports an increasing value of failed reloads, which is difficult to alert on properly. A flag/boolean metric is trivial to alert on for the actual duration of the problem.

This affects 4.7, 4.6, and 4.5, so backports will be required.

Comment 1 W. Trevor King 2020-10-28 14:09:57 UTC

I'd mentioned a positive name, but a negative name like template_router_reload_failure might be more convenient if you wanted a label with a reason slug, or some such.  You could always add a reason to a positive label too, but template_router_reload_success{failure_reason="whatever"} feels more awkward than template_router_reload_failure{reason="whatever"}.

Comment 2 Stephen Greene 2020-10-28 14:17:23 UTC

(In reply to W. Trevor King from comment #1)
> I'd mentioned a positive name, but a negative name like
> template_router_reload_failure might be more convenient if you wanted a
> label with a reason slug, or some such.  You could always add a reason to a
> positive label too, but
> template_router_reload_success{failure_reason="whatever"} feels more awkward
> than template_router_reload_failure{reason="whatever"}.

Noted, I will make sure to use a negative name instead. :)

Comment 3 W. Trevor King 2020-10-28 15:32:44 UTC

Not a regression, so it's hard to imagine holding 4.7.0 on a fix for this.

Comment 5 Arvind iyengar 2020-11-26 06:44:45 UTC

Tested in "4.7.0-0.nightly-2020-11-25-114114" payload. It is noted that the new metric and the associated Prometheus rules are added as intended:
----
$ curl -sS -H "Authorization: Bearer eyJhbGciOiJSUzI1NiIsImtpZCI6IjhjRDJ6ZnJsdnhKczlVQ2R6TndrOW1RS29BdS1LSEhzbGtQUEVPZFNsVUkifQ.eyJpc3MiOiJrdWJlcm5ldGVzL3NlcnZpY2VhY2NvdW50Iiwia3ViZXJuZXRlcy5pby9zZXJ2aWNlYWNjb3VudC9uYW1lc3BhY2UiOiJvcGVuc2hpZnQtbW9uaXRvcmluZyIsImt1YmVybmV0ZXMuaW8vc2VydmljZWFjY291bnQvc2VjcmV0Lm5hbWUiOiJwcm9tZXRoZXVzLWs4cy10b2tlbi1yZmhtdiIsImt1YmVybmV0ZXMuaW8vc2VydmljZWFjY291bnQvc2VydmljZS1hY2NvdW50Lm5hbWUiOiJwcm9tZXRoZXVzLWs4cyIsImt1YmVybmV0ZXMuaW8vc2VydmljZWFjY291bnQvc2VydmljZS1hY2NvdW50LnVpZCI6IjFiYmIyOGQwLThkMDgtNGI2MC1hNDk4LTBjYzc3NzE4OWY1YSIsInN1YiI6InN5c3RlbTpzZXJ2aWNlYWNjb3VudDpvcGVuc2hpZnQtbW9uaXRvcmluZzpwcm9tZXRoZXVzLWs4cyJ9.b2nqVlbW4xHGuh3EYuXTsCV9fcjc6G5Yq9TABxqAYaUuRSmp79lH5dfhC9k9TRKDITlVxsXDvMuH_CN392RlwXIMyytEidnNP_zTH-rqpl12NrDTxGfurf2WtfZefPGDM1tSadcAm_jGmebDLzmjWGPGm5mWIYIdRiBaILku0HDrhDLgfhG-BpsZ5WXTikJhdskmCs38Ru9oQcuyXIJEXSOnGqKJZYQFqdzPsA8zh-aCotI49R42qo903tXvoh5kvBU4kx0gAgwdaXYBPOrKe8kWWDvE6gL_NHcb3pLIoI05vXvMW0BwBnMX8h9X7KCXymsU7aH-IiBVnNaRQ8BxYQ" -k https://10.129.2.22:1936/metrics | grep -i template_router_reload_failure
# HELP template_router_reload_failure Metric to track the status of the most recent HAProxy reload
# TYPE template_router_reload_failure gauge
template_router_reload_failure 0
----

Comment 8 errata-xmlrpc 2021-02-24 15:28:37 UTC

Since the problem described in this bug report should be
resolved in a recent advisory, it has been closed with a
resolution of ERRATA.

For information on the advisory (Moderate: OpenShift Container Platform 4.7.0 security, bug fix, and enhancement update), and where to find the updated
files, follow the link below.

If the solution does not work for you, open a new bug report.

https://access.redhat.com/errata/RHSA-2020:5633

Note You need to log in before you can comment on or make changes to this bug.