Description of problem:

After upgrading our cluster to OCP 4.4, we encountered an issue that resulted in the ingress not working: https://bugzilla.redhat.com/show_bug.cgi?id=1861383#c1

This didn't trigger any alerts, and the ingress operator considered the state of affairs to be fine:

```
$ k get clusteroperator
NAME      VERSION   AVAILABLE   PROGRESSING   DEGRADED   SINCE
[...]
ingress   4.4.11    True        False         False      20d
[...]
```

Version-Release number of selected component (if applicable):

How reproducible:

Steps to Reproduce:
1. Follow the steps in https://bugzilla.redhat.com/show_bug.cgi?id=1861383 to break the router on 4.4
2. Observe that no alerts fire

Actual results:
No alert fires, and the ingress clusteroperator reports Available=True, Degraded=False even though ingress is broken.

Expected results:
An alert with level critical goes off.

Additional info:
Created attachment 1702783 [details] router logs. Adding router logs with loglevel set to 10 to make it easier to follow what's happening.
In this case, the operand was entirely nonfunctional. The fact that this did not cause an alert to fire, the operator to be marked degraded, or the upgrade to fail is a serious bug.
Targeting a fix for 4.6; we will backport. Added UpcomingSprint.
I am working on a fix for this bug. There are two interesting parts to this issue. (The HAProxy "up" metric itself is actually working.)

First, we identified an issue with how the router syncs on startup. Currently, the router performs an initial reload before syncing the state of all route resources. This is a defect: the initial "routeless" reload starts HAProxy, but in a defunct state with respect to cluster ingress. Removing this premature reload means new router pods will never become ready when improper routes exist, since HAProxy will not start until all route resources are available. The reason the HAProxy "up" metric reported "1" during the mentioned upgrade to 4.4 is that the previous "routeless" HAProxy process was still running after the first "routes in" reload failed; the router intentionally does not kill the running "old" HAProxy process when a reload fails.

Second, there is no metric that tracks HAProxy router reload failures. A reload failure that happens after the router pod successfully starts will not crash the router pod, and this is intentional. By adding a router reload failure count metric, we can alert cluster admins when the router is not respecting newly created or modified routes. This new metric, together with the existing HAProxy "up" metric, will serve as the basis for HAProxy up/down alerts.

I will have the fix PR for this BZ up shortly, after I perform some more tests. I am providing this comment to scope this BZ and to help QA during the verification process.
The PR was merged and made it into the "4.6.0-0.nightly-2020-08-11-235456" release. With this payload, the "template_router_reload_fail" metric is now functional and displays reload failures for the routers, and the "haproxy_up" metric triggers a warning in Alertmanager during router-down events.
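For reference, an alerting rule on these metrics could look roughly like the fragment below. This is an illustrative PrometheusRule sketch only; the rule names, expressions, durations, and severities shipped with the ingress operator may differ:

```
# Illustrative sketch, not the shipped rules.
apiVersion: monitoring.coreos.com/v1
kind: PrometheusRule
metadata:
  name: example-router-alerts   # hypothetical name
spec:
  groups:
  - name: router.rules
    rules:
    - alert: RouterReloadFail            # fires when a reload has failed
      expr: template_router_reload_fail == 1
      for: 5m
      labels:
        severity: warning
    - alert: HAProxyDown                 # fires when HAProxy is not running
      expr: haproxy_up == 0
      for: 5m
      labels:
        severity: critical
```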
Created attachment 1711166 [details] Prometheus graph data from patched cluster version
Created attachment 1711167 [details] Prometheus graph data from unpatched cluster version
Created attachment 1711168 [details] Alertmanager data from patched cluster version
Created attachment 1711301 [details] alertmanager dashboard example - 2
Since the problem described in this bug report should be resolved in a recent advisory, it has been closed with a resolution of ERRATA. For information on the advisory (OpenShift Container Platform 4.6 GA Images), and where to find the updated files, follow the link below. If the solution does not work for you, open a new bug report. https://access.redhat.com/errata/RHBA-2020:4196