Description of problem:

After upgrading our cluster to OCP 4.4, we encountered an issue that resulted in the ingress not working: https://bugzilla.redhat.com/show_bug.cgi?id=1861383#c1

This didn't trigger any alerts, and the ingress operator considered the state of affairs to be fine:

```
$ k get clusteroperator
NAME      VERSION   AVAILABLE   PROGRESSING   DEGRADED   SINCE
[...]
ingress   4.4.11    True        False         False      20d
[...]
```

Version-Release number of selected component (if applicable):

How reproducible:

Steps to Reproduce:
1. Follow the steps in https://bugzilla.redhat.com/show_bug.cgi?id=1861383 to break the router on 4.4
2. Observe that no alerts fire

Actual results:
No alert fires, and the ingress clusteroperator reports Available=True, Degraded=False even though ingress is broken.

Expected results:
An alert with level critical goes off.

Additional info:
Created attachment 1702783 [details] router logs. Adding router logs with loglevel set to 10 to make it easier to follow what's happening.
In this case, the operand was entirely nonfunctional. The fact that this did not cause an alert to fire, the operator to be marked degraded, or the upgrade to fail is a serious bug.
Targeting a fix for 4.6; we will backport. Added UpcomingSprint.
I am working on a fix for this bug. There are two interesting parts to this issue. (The HAProxy "up" metric itself is actually working.)

First, we identified an issue with how the router syncs on startup. Currently, the router performs an initial reload before syncing the state of all route resources. This is a defect: the initial "routeless" reload starts HAProxy, but in a defunct state with respect to cluster ingress. Removing this premature reload means new router pods will never become ready when improper routes exist, since HAProxy will not start until all route resources are available. The reason the HAProxy "up" metric reported "1" during the mentioned upgrade to 4.4 is that the previous "routeless" HAProxy process was still running after the first "routes in" reload failed; the router intentionally does not kill the running "old" HAProxy process when a reload fails.

Second, there is no metric that tracks HAProxy router reload failures. A reload failure that happens after the router pod successfully starts will not crash the router pod, and this is intentional. By adding a router reload failure count metric, we can alert cluster admins when the router is not respecting newly created or modified routes. This new metric, together with the existing HAProxy "up" metric, will serve as the basis for HAProxy up/down alerts.

I will have the fix PR for this BZ up shortly, after I perform some more tests. I am providing this comment to scope this BZ and to help QA during the verification process.
The PR was merged and made it into the "4.6.0-0.nightly-2020-08-11-235456" release. With this payload, the "template_router_reload_fail" metric is now functional and displays reload failures for the routers, and the "haproxy_up" metric triggers a warning in Alertmanager during router-down events.
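For reference, an alerting rule on these metrics could look roughly like the fragment below. This is an illustrative PrometheusRule sketch only; the rule names, expressions, durations, and severities shipped with the ingress operator may differ:

```
# Illustrative sketch, not the shipped rules.
apiVersion: monitoring.coreos.com/v1
kind: PrometheusRule
metadata:
  name: example-router-alerts   # hypothetical name
spec:
  groups:
  - name: router.rules
    rules:
    - alert: RouterReloadFail            # fires when a reload has failed
      expr: template_router_reload_fail == 1
      for: 5m
      labels:
        severity: warning
    - alert: HAProxyDown                 # fires when HAProxy is not running
      expr: haproxy_up == 0
      for: 5m
      labels:
        severity: critical
```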
Created attachment 1711166 [details] Prometheus graph data from patched cluster version
Created attachment 1711167 [details] Prometheus graph data from unpatched cluster version
Created attachment 1711168 [details] Alertmanager data from patched cluster version
Created attachment 1711301 [details] alertmanager dashboard example - 2
Since the problem described in this bug report should be resolved in a recent advisory, it has been closed with a resolution of ERRATA. For information on the advisory (OpenShift Container Platform 4.6 GA Images), and where to find the updated files, follow the link below. If the solution does not work for you, open a new bug report. https://access.redhat.com/errata/RHBA-2020:4196