Bug 1861455 - Defunct router doesn't trigger alerts
Summary: Defunct router doesn't trigger alerts
Keywords:
Status: CLOSED ERRATA
Alias: None
Product: OpenShift Container Platform
Classification: Red Hat
Component: Routing
Version: 4.4
Hardware: Unspecified
OS: Unspecified
urgent
urgent
Target Milestone: ---
: 4.6.0
Assignee: Stephen Greene
QA Contact: Arvind iyengar
URL:
Whiteboard:
Depends On:
Blocks: 1866454 1871175
TreeView+ depends on / blocked
 
Reported: 2020-07-28 16:36 UTC by aaleman
Modified: 2020-10-27 16:21 UTC (History)
8 users (show)

Fixed In Version:
Doc Type: Bug Fix
Doc Text:
Cause: Updating the cluster when routes that break HAProxy exist in the cluster. Consequence: HAProxy does not initialize properly and does not respect cluster routes on upgrade. Upgrade succeeds into a cluster with not functioning routes. Fix: Correct HAProxy initial sync logic so that upgrades with broken routes will fail. Add router template reload failure metric and alert. Result: Upgrading with broken routes will not be successful, and HAProxy failed reloads at any time within a cluster will be reported via alerts.
Clone Of:
: 1871175 (view as bug list)
Environment:
Last Closed: 2020-10-27 16:21:16 UTC
Target Upstream Version:


Attachments (Terms of Use)
router logs (1.84 MB, text/plain)
2020-07-29 10:32 UTC, Rick Rackow
no flags Details
Prometheus graph data from patched cluster version (207.89 KB, image/png)
2020-08-12 10:10 UTC, Arvind iyengar
no flags Details
Prometheus graph data from unpatched cluster version (140.71 KB, image/png)
2020-08-12 10:11 UTC, Arvind iyengar
no flags Details
Alermanager data from patched cluster version (102.21 KB, image/png)
2020-08-12 10:11 UTC, Arvind iyengar
no flags Details
alertmanager dashboard example - 2 (144.50 KB, image/png)
2020-08-13 09:41 UTC, Arvind iyengar
no flags Details


Links
System ID Private Priority Status Summary Last Updated
Github openshift cluster-ingress-operator pull 397 0 None closed Bug 1861455: Add basic HAProxy alert rules for HAProxy status and Reload failures 2021-01-18 00:44:57 UTC
Github openshift router pull 165 0 None closed Bug 1861455: Remove initial haproxy template commitAndReload 2021-01-18 00:44:17 UTC
Red Hat Product Errata RHBA-2020:4196 0 None None None 2020-10-27 16:21:40 UTC

Description aaleman 2020-07-28 16:36:57 UTC
Description of problem:

After upgrading our cluster to OCP 4.4 we encountered an issue that resulted in the ingress not working: https://bugzilla.redhat.com/show_bug.cgi?id=1861383#c1

This didn't trigger any alerts and the ingress operator considered the state of affairs to be fine:

```
$ k get clusteroperator
NAME                                       VERSION   AVAILABLE   PROGRESSING   DEGRADED   SINCE
[...]   
ingress                                    4.4.11    True        False         False      20d
[...]
```


Version-Release number of selected component (if applicable):


How reproducible:


Steps to Reproduce:
1. Follow the steps in https://bugzilla.redhat.com/show_bug.cgi?id=1861383 to break the router on 4.4
2. Observe no alerts
3.

Actual results:


Expected results:

An alert with level critical goes off 


Additional info:

Comment 1 Rick Rackow 2020-07-29 10:32:10 UTC
Created attachment 1702783 [details]
router logs

Adding router logs with loglevel set to 10 to easier follow up what's happening

Comment 2 Steve Kuznetsov 2020-07-29 15:43:29 UTC
In this case, the operand was entirely nonfunctional. The fact that this did not cause an alert to be fired, the operator to be considered degraded, or the upgrade to fail is a serious bug.

Comment 4 Andrew McDermott 2020-07-30 09:54:03 UTC
Target a fix for 4.6 and will backport.

Added UpcomingSprint.

Comment 7 Stephen Greene 2020-08-06 15:30:02 UTC
I am working on a fix for this bug. There are 2 interesting parts to this issue. The HAProxy "up" metric is actually working.

First, we identified an issue with how the router syncs on startup. Currently, the router performs an initial reload before syncing the state of all route resources. This is a defect. This initial "routeless" reload starts HAProxy, just in a defunct state with respect to cluster ingress. Removing this premature reload should cause new router pods to never become ready in the event that improper routes exist, since HAProxy will not start without having all the route resources available. The reason that the HAProxy "up" metric was reporting "1" during the mentioned upgrade to 4.4 was because the previous "routeless" HAProxy process was still running after the first "route in" reload failed. The router intentionally does not kill the running "old" HAProxy process if a reload fails. 

Second, there is no metric that tracks HAProxy router reload failures. A router reload failure that happens after the router pod successfully starts will not crash the router pod, and this is intentional. By adding a router reload failure count metric, we can alert cluster admins when the router is not respecting newly created/modified routes. This new metric, in addition to the existing HAProxy "up" metric, will serve as great metrics for HAProxy up/down alerts. 

I will have the fix PR for this BZ up shortly after I perform some more tests. I am providing this comment to scope this BZ and to help QA during the verification process.

Comment 12 Arvind iyengar 2020-08-12 10:09:06 UTC
The PR was merged and made into "4.6.0-0.nightly-2020-08-11-235456" release. With this payload, it is noted that "template_router_reload_fail" metric is now functional and display the reload failures for the routers whereas, the "haproxy_up" metric triggers warning in the alert manager during router down events.

Comment 13 Arvind iyengar 2020-08-12 10:10:25 UTC
Created attachment 1711166 [details]
Prometheus graph data from patched cluster version

Comment 14 Arvind iyengar 2020-08-12 10:11:09 UTC
Created attachment 1711167 [details]
Prometheus graph data from unpatched cluster version

Comment 15 Arvind iyengar 2020-08-12 10:11:39 UTC
Created attachment 1711168 [details]
Alermanager data from patched cluster version

Comment 16 Arvind iyengar 2020-08-13 09:41:44 UTC
Created attachment 1711301 [details]
alertmanager dashboard  example - 2

Comment 18 errata-xmlrpc 2020-10-27 16:21:16 UTC
Since the problem described in this bug report should be
resolved in a recent advisory, it has been closed with a
resolution of ERRATA.

For information on the advisory (OpenShift Container Platform 4.6 GA Images), and where to find the updated
files, follow the link below.

If the solution does not work for you, open a new bug report.

https://access.redhat.com/errata/RHBA-2020:4196


Note You need to log in before you can comment on or make changes to this bug.