Bug 1866454 - Defunct router doesn't trigger alerts
Summary: Defunct router doesn't trigger alerts
Alias: None
Product: OpenShift Container Platform
Classification: Red Hat
Component: Routing
Version: 4.4
Hardware: Unspecified
OS: Unspecified
Target Milestone: ---
Target Release: 4.5.z
Assignee: Stephen Greene
QA Contact: Arvind iyengar
Depends On: 1861455
Blocks: 1868521
Reported: 2020-08-05 15:25 UTC by OpenShift BugZilla Robot
Modified: 2020-08-24 15:14 UTC (History)

Fixed In Version:
Doc Type: If docs needed, set a value
Doc Text:
Clone Of:
Last Closed: 2020-08-24 15:13:44 UTC
Target Upstream Version:

Attachments
Prometheus graph data from patched cluster version (239.73 KB, image/png)
2020-08-13 10:17 UTC, Arvind iyengar

System | ID | Priority | Status | Summary | Last Updated
Github | openshift cluster-ingress-operator pull 440 | None | closed | Bug 1871175: Add basic HAProxy alert rules for HAProxy status and Reload failures | 2020-09-08 09:00:42 UTC
Github | openshift router pull 167 | None | closed | [release-4.5] Bug 1866454: Back port Remove initial haproxy template commitAndReload | 2020-09-08 09:00:42 UTC
Red Hat Product Errata | RHBA-2020:3436 | None | None | None | 2020-08-24 15:14:08 UTC

Comment 3 Arvind iyengar 2020-08-13 10:17:12 UTC
The PR was merged into the "4.5.0-0.nightly-2020-08-13-011355" release. With the fix in place, the ingress operator now goes into a degraded state and triggers alerts when a router goes down because of conditions such as misconfigured routes:
$ oc -n openshift-ingress get pods -o wide
NAME                                   READY   STATUS             RESTARTS   AGE     IP            NODE                                         NOMINATED NODE   READINESS GATES
router-default-5d7dffdbf-4f5p6         1/1     Running            0          5h31m   ip-10-0-155-247.us-east-2.compute.internal   <none>           <none>
router-default-5d7dffdbf-dpcxv         1/1     Running            0          5h31m    ip-10-0-164-143.us-east-2.compute.internal   <none>           <none>
router-internalapps-796c7b8bd8-brgnj   1/2     CrashLoopBackOff   33         114m   ip-10-0-223-101.us-east-2.compute.internal   <none>           <none>

$ oc -n openshift-ingress logs router-internalapps-796c7b8bd8-brgnj -c router --tail 5
E0813 10:12:10.322714       1 haproxy.go:416] can't scrape HAProxy: dial unix /var/lib/haproxy/run/haproxy.sock: connect: no such file or directory
E0813 10:12:10.333434       1 limiter.go:165] error reloading router: exit status 1
[ALERT] 225/101210 (18) : parsing [/var/lib/haproxy/conf/haproxy.config:343] : timer overflow in argument '999d' to 'timeout server' (maximum value is 2147483647 ms or ~24.8 days)
[ALERT] 225/101210 (18) : Error(s) found in configuration file : /var/lib/haproxy/conf/haproxy.config
[ALERT] 225/101210 (18) : Fatal errors found in configuration.
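
For reference, the overflow reported in the log above can be reproduced with simple arithmetic. The following is a minimal Python sketch, not part of the fix or of the router code: the helper names are hypothetical, and the unit table is an assumed subset of HAProxy's time suffixes with milliseconds as the default unit. Only the 2147483647 ms limit is taken from the log itself.

```python
import re

# Upper bound quoted in the HAProxy alert above (about 24.8 days).
HAPROXY_MAX_TIMEOUT_MS = 2147483647

# Milliseconds per unit suffix. Assumed subset of HAProxy's time units.
_UNIT_MS = {"ms": 1, "s": 1_000, "m": 60_000, "h": 3_600_000, "d": 86_400_000}


def timeout_to_ms(value: str) -> int:
    """Convert a timeout such as '999d' or '30s' to milliseconds (hypothetical helper)."""
    match = re.fullmatch(r"(\d+)(ms|s|m|h|d)?", value.strip())
    if match is None:
        raise ValueError(f"unrecognised timeout value: {value!r}")
    number, unit = match.groups()
    # A bare number is treated as milliseconds (assumption).
    return int(number) * _UNIT_MS[unit or "ms"]


def overflows_haproxy_timer(value: str) -> bool:
    """Return True if the timeout exceeds the limit quoted in the alert."""
    return timeout_to_ms(value) > HAPROXY_MAX_TIMEOUT_MS


if __name__ == "__main__":
    for candidate in ("999d", "25d", "24d", "30s"):
        print(candidate, timeout_to_ms(candidate), overflows_haproxy_timer(candidate))
```

A value of '999d' works out to 86313600000 ms, far above the limit, which is why HAProxy rejects the configuration and the reload fails.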

The Prometheus UI now shows the "template_router_reload_fails" metric, which reports the reload failures.
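
The same check can be scripted instead of read off the UI. The following is a rough Python sketch under several assumptions: the base URL is a placeholder, the metric name is used exactly as quoted in this comment, authentication is omitted, and treating any value above 0 as a failing router is a guess about the metric's semantics. It only builds an instant-query URL for the Prometheus HTTP API and filters a response that has already been decoded from JSON.

```python
from urllib.parse import urlencode

# Metric name exactly as quoted in this comment.
RELOAD_METRIC = "template_router_reload_fails"


def build_query_url(base_url: str, expr: str) -> str:
    """Build a Prometheus instant-query URL (/api/v1/query) for the expression."""
    return f"{base_url.rstrip('/')}/api/v1/query?{urlencode({'query': expr})}"


def failing_series(response: dict) -> list:
    """Return the label sets of series whose sample value is above 0.

    Expects the decoded JSON body of an instant query with a vector result.
    Treating > 0 as "reload failing" is an assumption.
    """
    if response.get("status") != "success":
        raise ValueError("query did not succeed")
    results = response["data"]["result"]
    # Each sample is [timestamp, "value-as-string"].
    return [item["metric"] for item in results if float(item["value"][1]) > 0]


if __name__ == "__main__":
    # Placeholder host; substitute the cluster's Prometheus route.
    print(build_query_url("https://prometheus.example.com", RELOAD_METRIC))
```

Fetching the URL requires a bearer token on a real cluster, which is left out here.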

Comment 4 Arvind iyengar 2020-08-13 10:17:53 UTC
Created attachment 1711306
Prometheus graph data from patched cluster version

Comment 8 Arvind iyengar 2020-08-21 13:38:34 UTC
Marking this bug as "verified" with reference to https://github.com/openshift/router/pull/167.

Comment 10 errata-xmlrpc 2020-08-24 15:13:44 UTC
Since the problem described in this bug report should be
resolved in a recent advisory, it has been closed with a
resolution of ERRATA.

For information on the advisory (OpenShift Container Platform 4.5.7 bug fix update), and where to find the updated
files, follow the link below.

If the solution does not work for you, open a new bug report.

