1866454 – Defunct router doesn't trigger alerts

Keywords:
Status:	CLOSED ERRATA
Alias:	None
Product:	OpenShift Container Platform
Classification:	Red Hat
Component:	Networking
Sub Component:
Version:	4.4
Hardware:	Unspecified
OS:	Unspecified
Priority:	urgent
Severity:	urgent
Target Milestone:	---
Target Release:	4.5.z
Assignee:	Stephen Greene
QA Contact:	Arvind iyengar
Docs Contact:
URL:
Whiteboard:
Depends On:	1861455
Blocks:	1868521

Reported:	2020-08-05 15:25 UTC by OpenShift BugZilla Robot
Modified:	2022-08-04 22:30 UTC
CC List:	10 users
Fixed In Version:
Doc Type:	If docs needed, set a value
Doc Text:
Clone Of:
Environment:
Last Closed:	2020-08-24 15:13:44 UTC
Target Upstream Version:
Embargoed:

Attachments
Prometheus graph data from patched cluster version (239.73 KB, image/png) 2020-08-13 10:17 UTC, Arvind iyengar

Links
System	ID	Priority	Status	Summary	Last Updated
Github	openshift cluster-ingress-operator pull 440	None	closed	Bug 1871175: Add basic HAProxy alert rules for HAProxy status and Reload failures	2020-10-16 21:50:23 UTC
Github	openshift router pull 167	None	closed	[release-4.5] Bug 1866454: Back port Remove initial haproxy template commitAndReload	2020-10-16 21:50:23 UTC
Red Hat Product Errata	RHBA-2020:3436	None	None	None	2020-08-24 15:14:08 UTC

Comment 3 Arvind iyengar 2020-08-13 10:17:12 UTC

The PR merged into the "4.5.0-0.nightly-2020-08-13-011355" release. With the fix in place, the ingress operator now goes into a degraded state and triggers alerts when the router goes down because of conditions such as misconfigured routes:
-----
$ oc -n openshift-ingress get pods -o wide
NAME                                   READY   STATUS             RESTARTS   AGE     IP            NODE                                         NOMINATED NODE   READINESS GATES
router-default-5d7dffdbf-4f5p6         1/1     Running            0          5h31m   10.131.0.14   ip-10-0-155-247.us-east-2.compute.internal   <none>           <none>
router-default-5d7dffdbf-dpcxv         1/1     Running            0          5h31m   10.128.2.3    ip-10-0-164-143.us-east-2.compute.internal   <none>           <none>
router-internalapps-796c7b8bd8-brgnj   1/2     CrashLoopBackOff   33         114m    10.129.2.13   ip-10-0-223-101.us-east-2.compute.internal   <none>           <none>

$ oc -n openshift-ingress logs router-internalapps-796c7b8bd8-brgnj -c router --tail 5
E0813 10:12:10.322714       1 haproxy.go:416] can't scrape HAProxy: dial unix /var/lib/haproxy/run/haproxy.sock: connect: no such file or directory
E0813 10:12:10.333434       1 limiter.go:165] error reloading router: exit status 1
[ALERT] 225/101210 (18) : parsing [/var/lib/haproxy/conf/haproxy.config:343] : timer overflow in argument '999d' to 'timeout server' (maximum value is 2147483647 ms or ~24.8 days)
[ALERT] 225/101210 (18) : Error(s) found in configuration file : /var/lib/haproxy/conf/haproxy.config
[ALERT] 225/101210 (18) : Fatal errors found in configuration.
-----

The Prometheus UI now shows the "template_router_reload_fails" metric, which reports the reload failures.
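
For reference, the overflow reported in the [ALERT] line above can be checked with a short calculation. This is only a minimal sketch and not HAProxy's or the router's actual parsing code: the helper names `timeout_to_ms` and `overflows` and the unit table are assumptions made for illustration. The limit value is the one quoted in the log.

```python
# Hypothetical helpers for illustration; not HAProxy's or the router's parser.

# Maximum timeout in milliseconds, as quoted in the [ALERT] log line.
HAPROXY_MAX_TIMEOUT_MS = 2147483647

# Multipliers to milliseconds for the unit suffixes handled in this sketch.
UNIT_TO_MS = {"ms": 1, "s": 1_000, "m": 60_000, "h": 3_600_000, "d": 86_400_000}


def timeout_to_ms(value: str) -> int:
    """Convert a string such as '999d' or '30s' to milliseconds."""
    # "ms" is checked before "s" and "m" so that '5ms' is not misread.
    for suffix in ("ms", "s", "m", "h", "d"):
        if value.endswith(suffix):
            return int(value[: -len(suffix)]) * UNIT_TO_MS[suffix]
    # Assumption: a bare number is already in milliseconds.
    return int(value)


def overflows(value: str) -> bool:
    """Return True if the timeout exceeds the limit quoted in the log."""
    return timeout_to_ms(value) > HAPROXY_MAX_TIMEOUT_MS


if __name__ == "__main__":
    for candidate in ("999d", "25d", "24d", "30s"):
        print(candidate, timeout_to_ms(candidate), overflows(candidate))
```

With these assumptions, '999d' is far above the limit, while '24d' still fits, which agrees with the "~24.8 days" figure in the log.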

Comment 4 Arvind iyengar 2020-08-13 10:17:53 UTC

Created attachment 1711306
Prometheus graph data from patched cluster version

Comment 8 Arvind iyengar 2020-08-21 13:38:34 UTC

Marking this bug as "verified" with reference to https://github.com/openshift/router/pull/167.

Comment 10 errata-xmlrpc 2020-08-24 15:13:44 UTC

Since the problem described in this bug report should be
resolved in a recent advisory, it has been closed with a
resolution of ERRATA.

For information on the advisory (OpenShift Container Platform 4.5.7 bug fix update), and where to find the updated
files, follow the link below.

If the solution does not work for you, open a new bug report.

https://access.redhat.com/errata/RHBA-2020:3436
