Bug 1871175

Summary:

Defunct router doesn't trigger alerts (alerts 4.5 backport)

Product:

OpenShift Container Platform

Reporter:

Stephen Greene <sgreene>

Component:

Networking

Assignee:

Stephen Greene <sgreene>

Networking sub component:

router

QA Contact:

Arvind iyengar <aiyengar>

Status:

CLOSED ERRATA

Docs Contact:

Severity:

high

Priority:

high

CC:

aaleman, aiyengar, amcdermo, aos-bugs, bbennett, hongli, jeder, rrackow, sgreene, skuznets, wking

Version:

4.4

Keywords:

ServiceDeliveryImpact

Target Milestone:

---

Target Release:

4.5.z

Hardware:

Unspecified

OS:

Unspecified

Whiteboard:

Fixed In Version:

Doc Type:

Bug Fix

Doc Text:

This is the alert portion backport of https://bugzilla.redhat.com/show_bug.cgi?id=1861455 Add router template reload failure alert. Also add basic HAProxy up alert.

Story Points:

---

Clone Of:

1861455

Environment:

Last Closed:

2020-09-08 10:54:47 UTC

Type:

---

Regression:

---

Mount Type:

---

Documentation:

---

CRM:

Verified Versions:

Category:

---

oVirt Team:

---

RHEL 7.3 requirements from Atomic Host:

Cloudforms Team:

---

Target Upstream Version:

Embargoed:

Bug Depends On:

1861455

Bug Blocks:

Attachments:

Description	Flags
Alertmanager dashboard view of alert rule in patched cluster version	none
Alermanager dashboard view for "haproxy_up" rule from patched cluster version	none

Comment 3 Arvind iyengar 2020-08-31 06:37:37 UTC

The PR merge made into "4.5.0-0.nightly-2020-08-27-040633" payload. With the patch in place, it is noted that, the ingress CO goes into degraded state along with the alerts firing specifically indicating the router failure as per the set alert rules:
-----
$ oc get clusterversion
NAME      VERSION                             AVAILABLE   PROGRESSING   SINCE   STATUS
version   4.5.0-0.nightly-2020-08-27-040633   True        False         140m    Error while reconciling 4.5.0-0.nightly-2020-08-27-040633: the cluster operator ingress is degraded

 oc get co ingress
NAME      VERSION                             AVAILABLE   PROGRESSING   DEGRADED   SINCE
ingress   4.5.0-0.nightly-2020-08-27-040633   False       True          True       16m


$ oc -n openshift-ingress logs deployments/router-default --tail 5                                            
Found 2 pods, using pod/router-default-945d7559f-fmccq
[ALERT] 243/061753 (633) : Fatal errors found in configuration.
E0831 06:17:58.879782       1 limiter.go:165] error reloading router: exit status 1
[ALERT] 243/061758 (636) : parsing [/var/lib/haproxy/conf/haproxy.config:319] : timer overflow in argument '999d' to 'timeout server' (maximum value is 2147483647 ms or ~24.8 days)
[ALERT] 243/061758 (636) : Error(s) found in configuration file : /var/lib/haproxy/conf/haproxy.config
[ALERT] 243/061758 (636) : Fatal errors found in configuration.
-----

Comment 4 Arvind iyengar 2020-08-31 06:39:02 UTC

Created attachment 1713110 [details]
Alertmanager dashboard view of alert rule in patched cluster version

Comment 5 Arvind iyengar 2020-08-31 06:39:54 UTC

Created attachment 1713111 [details]
Alermanager dashboard view for "haproxy_up" rule from patched cluster version

Comment 7 errata-xmlrpc 2020-09-08 10:54:47 UTC

Since the problem described in this bug report should be
resolved in a recent advisory, it has been closed with a
resolution of ERRATA.

For information on the advisory (OpenShift Container Platform 4.5.8 bug fix update), and where to find the updated
files, follow the link below.

If the solution does not work for you, open a new bug report.

https://access.redhat.com/errata/RHBA-2020:3510