Bug 1871175 - Defunct router doesn't trigger alerts (alerts 4.5 backport)
Summary: Defunct router doesn't trigger alerts (alerts 4.5 backport)
Keywords:
Status: CLOSED ERRATA
Alias: None
Product: OpenShift Container Platform
Classification: Red Hat
Component: Routing
Version: 4.4
Hardware: Unspecified
OS: Unspecified
Priority: high
Severity: high
Target Milestone: ---
Target Release: 4.5.z
Assignee: Stephen Greene
QA Contact: Arvind iyengar
URL:
Whiteboard:
Depends On: 1861455
Blocks:
 
Reported: 2020-08-21 13:47 UTC by Stephen Greene
Modified: 2020-09-08 10:55 UTC (History)
CC List: 11 users

Fixed In Version:
Doc Type: Bug Fix
Doc Text:
This is the backport of the alert portion of https://bugzilla.redhat.com/show_bug.cgi?id=1861455: add a router template reload failure alert and a basic HAProxy up alert.
Clone Of: 1861455
Environment:
Last Closed: 2020-09-08 10:54:47 UTC
Target Upstream Version:


Attachments (Terms of Use)
Alertmanager dashboard view of alert rule in patched cluster version (133.87 KB, image/png)
2020-08-31 06:39 UTC, Arvind iyengar
Alertmanager dashboard view for "haproxy_up" rule from patched cluster version (111.94 KB, image/png)
2020-08-31 06:39 UTC, Arvind iyengar


Links
System ID Priority Status Summary Last Updated
Github openshift cluster-ingress-operator pull 440 None closed Bug 1871175: Add basic HAProxy alert rules for HAProxy status and Reload failures 2020-09-02 13:22:26 UTC
Red Hat Product Errata RHBA-2020:3510 None None None 2020-09-08 10:55:15 UTC

Comment 3 Arvind iyengar 2020-08-31 06:37:37 UTC
The PR was merged into the "4.5.0-0.nightly-2020-08-27-040633" payload. With the patch in place, the ingress cluster operator (CO) goes into a degraded state and the alerts fire, specifically indicating the router failure as defined by the alert rules:
-----
$ oc get clusterversion
NAME      VERSION                             AVAILABLE   PROGRESSING   SINCE   STATUS
version   4.5.0-0.nightly-2020-08-27-040633   True        False         140m    Error while reconciling 4.5.0-0.nightly-2020-08-27-040633: the cluster operator ingress is degraded

$ oc get co ingress
NAME      VERSION                             AVAILABLE   PROGRESSING   DEGRADED   SINCE
ingress   4.5.0-0.nightly-2020-08-27-040633   False       True          True       16m


$ oc -n openshift-ingress logs deployments/router-default --tail 5                                            
Found 2 pods, using pod/router-default-945d7559f-fmccq
[ALERT] 243/061753 (633) : Fatal errors found in configuration.
E0831 06:17:58.879782       1 limiter.go:165] error reloading router: exit status 1
[ALERT] 243/061758 (636) : parsing [/var/lib/haproxy/conf/haproxy.config:319] : timer overflow in argument '999d' to 'timeout server' (maximum value is 2147483647 ms or ~24.8 days)
[ALERT] 243/061758 (636) : Error(s) found in configuration file : /var/lib/haproxy/conf/haproxy.config
[ALERT] 243/061758 (636) : Fatal errors found in configuration.
-----
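
The overflow reported in the log above can be checked with simple arithmetic. The following is a minimal sketch, not taken from the router or HAProxy source: the helper names `days_to_ms` and `fits_haproxy_timeout` are hypothetical, and the only HAProxy fact used is the maximum of 2147483647 ms quoted in the log line (which equals 2**31 - 1).

```python
# Limit quoted in the HAProxy log message: 2147483647 ms (2**31 - 1).
MAX_TIMEOUT_MS = 2_147_483_647

MS_PER_DAY = 24 * 60 * 60 * 1000  # 86,400,000 ms in one day


def days_to_ms(days: int) -> int:
    """Convert a timeout given in days to milliseconds (hypothetical helper)."""
    return days * MS_PER_DAY


def fits_haproxy_timeout(days: int) -> bool:
    """Return True if the timeout stays within the limit quoted in the log."""
    return days_to_ms(days) <= MAX_TIMEOUT_MS


print(days_to_ms(999))               # 86313600000, far above the limit
print(fits_haproxy_timeout(999))     # False: matches the "timer overflow" error
print(MAX_TIMEOUT_MS / MS_PER_DAY)   # roughly 24.8 days, as the log states
print(fits_haproxy_timeout(24))      # True: 24 days is still accepted
```

A value of '999d' is therefore about 40 times larger than the maximum, which is why the configuration is rejected and the router reload fails.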

Comment 4 Arvind iyengar 2020-08-31 06:39:02 UTC
Created attachment 1713110
Alertmanager dashboard view of alert rule in patched cluster version

Comment 5 Arvind iyengar 2020-08-31 06:39:54 UTC
Created attachment 1713111
Alertmanager dashboard view for "haproxy_up" rule from patched cluster version

Comment 7 errata-xmlrpc 2020-09-08 10:54:47 UTC
Since the problem described in this bug report should be
resolved in a recent advisory, it has been closed with a
resolution of ERRATA.

For information on the advisory (OpenShift Container Platform 4.5.8 bug fix update), and where to find the updated
files, follow the link below.

If the solution does not work for you, open a new bug report.

https://access.redhat.com/errata/RHBA-2020:3510

