Bug 1871175

Summary: Defunct router doesn't trigger alerts (alerts 4.5 backport)
Product: OpenShift Container Platform Reporter: Stephen Greene <sgreene>
Component: Networking Assignee: Stephen Greene <sgreene>
Networking sub component: router QA Contact: Arvind iyengar <aiyengar>
Status: CLOSED ERRATA Docs Contact:
Severity: high    
Priority: high CC: aaleman, aiyengar, amcdermo, aos-bugs, bbennett, hongli, jeder, rrackow, sgreene, skuznets, wking
Version: 4.4 Keywords: ServiceDeliveryImpact
Target Milestone: ---   
Target Release: 4.5.z   
Hardware: Unspecified   
OS: Unspecified   
Whiteboard:
Fixed In Version: Doc Type: Bug Fix
Doc Text:
This is the backport of the alert portion of https://bugzilla.redhat.com/show_bug.cgi?id=1861455. It adds an alert for router template reload failures and a basic HAProxy up alert.
Story Points: ---
Clone Of: 1861455 Environment:
Last Closed: 2020-09-08 10:54:47 UTC Type: ---
Regression: --- Mount Type: ---
Documentation: --- CRM:
Verified Versions: Category: ---
oVirt Team: --- RHEL 7.3 requirements from Atomic Host:
Cloudforms Team: --- Target Upstream Version:
Embargoed:
Bug Depends On: 1861455    
Bug Blocks:    
Attachments:
Description Flags
Alertmanager dashboard view of alert rule in patched cluster version none
Alertmanager dashboard view for "haproxy_up" rule from patched cluster version none

Comment 3 Arvind iyengar 2020-08-31 06:37:37 UTC
The PR merged into the "4.5.0-0.nightly-2020-08-27-040633" payload. With the patch in place, the ingress cluster operator goes into a degraded state and the alerts fire, specifically indicating the router failure as defined by the configured alert rules:
-----
$ oc get clusterversion
NAME      VERSION                             AVAILABLE   PROGRESSING   SINCE   STATUS
version   4.5.0-0.nightly-2020-08-27-040633   True        False         140m    Error while reconciling 4.5.0-0.nightly-2020-08-27-040633: the cluster operator ingress is degraded

$ oc get co ingress
NAME      VERSION                             AVAILABLE   PROGRESSING   DEGRADED   SINCE
ingress   4.5.0-0.nightly-2020-08-27-040633   False       True          True       16m


$ oc -n openshift-ingress logs deployments/router-default --tail 5
Found 2 pods, using pod/router-default-945d7559f-fmccq
[ALERT] 243/061753 (633) : Fatal errors found in configuration.
E0831 06:17:58.879782       1 limiter.go:165] error reloading router: exit status 1
[ALERT] 243/061758 (636) : parsing [/var/lib/haproxy/conf/haproxy.config:319] : timer overflow in argument '999d' to 'timeout server' (maximum value is 2147483647 ms or ~24.8 days)
[ALERT] 243/061758 (636) : Error(s) found in configuration file : /var/lib/haproxy/conf/haproxy.config
[ALERT] 243/061758 (636) : Fatal errors found in configuration.
-----
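
For reference, the overflow HAProxy reports in the log above can be checked by hand: 999 days is 86,313,600,000 ms, far above the 2,147,483,647 ms limit quoted in the error. The following is a minimal Python sketch of that arithmetic, not anything the router itself runs. The helper names (`to_milliseconds`, `overflows_haproxy_timer`) are hypothetical, only a few unit suffixes are handled, and a bare number is assumed to mean milliseconds.

```python
# Sketch only: these helpers are hypothetical and not part of HAProxy or the router.

# Limit quoted in the HAProxy error message (2147483647 ms, roughly 24.8 days).
HAPROXY_MAX_TIMER_MS = 2_147_483_647

# Milliseconds per unit suffix; only a subset of time units is covered here.
UNIT_MS = {"ms": 1, "s": 1_000, "m": 60_000, "h": 3_600_000, "d": 86_400_000}


def to_milliseconds(value: str) -> int:
    """Convert a duration such as '999d' or '30s' to milliseconds."""
    # "ms" is checked first so that '5ms' is not mistaken for seconds.
    for suffix in ("ms", "s", "m", "h", "d"):
        if value.endswith(suffix):
            return int(value[: -len(suffix)]) * UNIT_MS[suffix]
    # Assumption: a bare number is already in milliseconds.
    return int(value)


def overflows_haproxy_timer(value: str) -> bool:
    """Return True if the duration exceeds the limit quoted in the log."""
    return to_milliseconds(value) > HAPROXY_MAX_TIMER_MS


if __name__ == "__main__":
    for candidate in ("999d", "25d", "24d", "30s"):
        ms = to_milliseconds(candidate)
        verdict = "overflow" if overflows_haproxy_timer(candidate) else "ok"
        print(f"{candidate}: {ms} ms ({verdict})")
```

Under these assumptions, '24d' is the largest whole-day value that stays below the limit, which matches the "~24.8 days" figure in the error message.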

Comment 4 Arvind iyengar 2020-08-31 06:39:02 UTC
Created attachment 1713110 [details]
Alertmanager dashboard view of alert rule in patched cluster version

Comment 5 Arvind iyengar 2020-08-31 06:39:54 UTC
Created attachment 1713111 [details]
Alertmanager dashboard view for "haproxy_up" rule from patched cluster version

Comment 7 errata-xmlrpc 2020-09-08 10:54:47 UTC
Since the problem described in this bug report should be
resolved in a recent advisory, it has been closed with a
resolution of ERRATA.

For information on the advisory (OpenShift Container Platform 4.5.8 bug fix update), and where to find the updated
files, follow the link below.

If the solution does not work for you, open a new bug report.

https://access.redhat.com/errata/RHBA-2020:3510