Bug 1861455

Summary: Defunct router doesn't trigger alerts
Product: OpenShift Container Platform
Reporter: aaleman
Component: Networking
Networking sub component: router
Assignee: Stephen Greene <sgreene>
QA Contact: Arvind iyengar <aiyengar>
Status: CLOSED ERRATA
Docs Contact:
Severity: urgent
Priority: urgent
CC: amcdermo, aos-bugs, hongli, jeder, rrackow, sgreene, skuznets, wking
Version: 4.4
Keywords: ServiceDeliveryImpact
Target Milestone: ---
Target Release: 4.6.0
Hardware: Unspecified
OS: Unspecified
Whiteboard:
Fixed In Version:
Doc Type: Bug Fix
Doc Text:
Cause: The cluster is updated while routes that break HAProxy exist in the cluster. Consequence: HAProxy does not initialize properly and does not respect cluster routes on upgrade; the upgrade succeeds but leaves the cluster with non-functioning routes. Fix: Correct the HAProxy initial sync logic so that upgrades with broken routes fail, and add a router template reload failure metric and alert. Result: Upgrading with broken routes no longer succeeds, and failed HAProxy reloads at any time are reported via alerts.
Story Points: ---
Clone Of:
Cloned As: 1871175 (view as bug list)
Environment:
Last Closed: 2020-10-27 16:21:16 UTC
Type: Bug
Bug Depends On:    
Bug Blocks: 1866454, 1871175    
Attachments (flags: none):
- router logs
- Prometheus graph data from patched cluster version
- Prometheus graph data from unpatched cluster version
- Alertmanager data from patched cluster version
- alertmanager dashboard example - 2

Description aaleman 2020-07-28 16:36:57 UTC
Description of problem:

After upgrading our cluster to OCP 4.4, we encountered an issue that resulted in the ingress not working: https://bugzilla.redhat.com/show_bug.cgi?id=1861383#c1

This didn't trigger any alerts, and the ingress operator considered everything to be fine:

```
$ k get clusteroperator
NAME                                       VERSION   AVAILABLE   PROGRESSING   DEGRADED   SINCE
[...]   
ingress                                    4.4.11    True        False         False      20d
[...]
```
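
For completeness, the kind of manual spot checks that expose the actual operand state (illustrative commands; they assume the default ingress controller's router deployment, router-default in openshift-ingress):

```
$ oc -n openshift-ingress get pods                    # are the router pods running and Ready?
$ oc -n openshift-ingress logs deploy/router-default | grep -iE 'reload|error'   # any failed HAProxy reloads?
```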


Version-Release number of selected component (if applicable):


How reproducible:


Steps to Reproduce:
1. Follow the steps in https://bugzilla.redhat.com/show_bug.cgi?id=1861383 to break the router on 4.4
2. Observe that no alerts fire

Actual results:


Expected results:

A critical-severity alert fires.


Additional info:

Comment 1 Rick Rackow 2020-07-29 10:32:10 UTC
Created attachment 1702783 [details]
router logs

Adding router logs with loglevel set to 10 to make it easier to follow what's happening.

Comment 2 Steve Kuznetsov 2020-07-29 15:43:29 UTC
In this case, the operand was entirely nonfunctional. The fact that this did not cause an alert to be fired, the operator to be considered degraded, or the upgrade to fail is a serious bug.

Comment 4 Andrew McDermott 2020-07-30 09:54:03 UTC
Targeting a fix for 4.6 and will backport.

Added UpcomingSprint.

Comment 7 Stephen Greene 2020-08-06 15:30:02 UTC
I am working on a fix for this bug. There are 2 interesting parts to this issue. The HAProxy "up" metric is actually working.

First, we identified an issue with how the router syncs on startup. Currently, the router performs an initial reload before syncing the state of all route resources. This is a defect. This initial "routeless" reload starts HAProxy, just in a defunct state with respect to cluster ingress. Removing this premature reload should cause new router pods to never become ready if improper routes exist, since HAProxy will not start without all the route resources being available. The reason the HAProxy "up" metric was reporting "1" during the mentioned upgrade to 4.4 was that the previous "routeless" HAProxy process was still running after the first "route in" reload failed. The router intentionally does not kill the running "old" HAProxy process if a reload fails.

Second, there is no metric that tracks HAProxy router reload failures. A router reload failure that happens after the router pod successfully starts will not crash the router pod, and this is intentional. By adding a router reload failure count metric, we can alert cluster admins when the router is not respecting newly created or modified routes. This new metric, together with the existing HAProxy "up" metric, will serve as the basis for HAProxy up/down alerts.

I will have the fix PR for this BZ up shortly after I perform some more tests. I am providing this comment to scope this BZ and to help QA during the verification process.
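
To make the alerting side of this concrete, here is a rough sketch of the kind of rule described above, expressed as a PrometheusRule applied via oc. The metric names (template_router_reload_fail, haproxy_up) are the ones quoted elsewhere in this BZ; the alert names, thresholds, severities, and namespace here are illustrative, not the rules that actually ship with the fix.

```
$ cat <<'EOF' | oc apply -f -
apiVersion: monitoring.coreos.com/v1
kind: PrometheusRule
metadata:
  name: example-router-reload-alerts   # hypothetical name, for illustration only
  namespace: openshift-ingress
spec:
  groups:
  - name: example-router.rules
    rules:
    - alert: RouterReloadFailing
      # assumes the reload-failure metric is a 0/1 gauge per router pod
      expr: template_router_reload_fail == 1
      for: 5m
      labels:
        severity: warning
      annotations:
        message: HAProxy reloads are failing; new or changed routes are not being applied.
    - alert: HAProxyDown
      # fires if the router metrics endpoint reports HAProxy itself as down
      expr: haproxy_up == 0
      for: 5m
      labels:
        severity: critical
      annotations:
        message: HAProxy is down on a router pod.
EOF
```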

Comment 12 Arvind iyengar 2020-08-12 10:09:06 UTC
The PR was merged and made it into the "4.6.0-0.nightly-2020-08-11-235456" release. With this payload, the "template_router_reload_fail" metric is now functional and displays reload failures for the routers, while the "haproxy_up" metric triggers a warning in Alertmanager during router-down events.
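
For reference, a sketch of how both metrics can be queried directly through the cluster monitoring stack (assumes the thanos-querier route in openshift-monitoring and a logged-in user allowed to query it; endpoint details may vary by version):

```
$ TOKEN=$(oc whoami -t)
$ HOST=$(oc -n openshift-monitoring get route thanos-querier -o jsonpath='{.spec.host}')
$ curl -sk -H "Authorization: Bearer $TOKEN" "https://$HOST/api/v1/query" \
    --data-urlencode 'query=template_router_reload_fail'   # new reload-failure metric
$ curl -sk -H "Authorization: Bearer $TOKEN" "https://$HOST/api/v1/query" \
    --data-urlencode 'query=haproxy_up'                    # existing HAProxy "up" metric
```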

Comment 13 Arvind iyengar 2020-08-12 10:10:25 UTC
Created attachment 1711166 [details]
Prometheus graph data from patched cluster version

Comment 14 Arvind iyengar 2020-08-12 10:11:09 UTC
Created attachment 1711167 [details]
Prometheus graph data from unpatched cluster version

Comment 15 Arvind iyengar 2020-08-12 10:11:39 UTC
Created attachment 1711168 [details]
Alertmanager data from patched cluster version

Comment 16 Arvind iyengar 2020-08-13 09:41:44 UTC
Created attachment 1711301 [details]
alertmanager dashboard example - 2

Comment 18 errata-xmlrpc 2020-10-27 16:21:16 UTC
Since the problem described in this bug report should be
resolved in a recent advisory, it has been closed with a
resolution of ERRATA.

For information on the advisory (OpenShift Container Platform 4.6 GA Images) and where to find the updated files, follow the link below.

If the solution does not work for you, open a new bug report.

https://access.redhat.com/errata/RHBA-2020:4196