Description of problem:
If the backing pods of a service exposed via a route are unavailable (crashlooping, deleted, etc.), the router responds with a 503, but haproxy_server_http_responses_total disappears for that route.

Version-Release number of selected component (if applicable):
Tested in OSD 4.3.18

How reproducible:
Reproducible

Steps to Reproduce:
1. Create an HTTP deployment and a service, and expose it through a route
2. Create a client that queries the route (a shell curl loop should be enough)
3. Delete the deployment

Actual results:
The haproxy_server_http_responses_total metric for that route is no longer available, which means that monitoring on that route is no longer possible (e.g. to monitor for errors).

Expected results:
haproxy_server_http_responses_total registers the 503s the client is getting.

Additional info:
* I've taken a look at a few of the haproxy_server_* metrics and they are also not available.
* The haproxy_backend_up metric returns 1, which looks wrong.
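For step 2 of the reproduction, a client along the following lines may be more convenient than a shell curl loop, because it tallies the status codes it receives. This is only a sketch: the function names `fetch_status` and `poll_route` are made up for illustration, and the route URL is whatever the exposed route resolves to in the test cluster.

```python
import collections
import sys
import time
import urllib.error
import urllib.request


def fetch_status(url, timeout=5.0):
    """Return the HTTP status code for a GET on url, or None if no HTTP response arrives."""
    try:
        with urllib.request.urlopen(url, timeout=timeout) as resp:
            return resp.status
    except urllib.error.HTTPError as err:
        # 4xx/5xx responses land here, including the router's 503.
        return err.code
    except (urllib.error.URLError, OSError):
        # DNS failure, connection refused, timeout, and similar.
        return None


def poll_route(url, count, interval=1.0, fetch=fetch_status, sleep=time.sleep):
    """Query url `count` times and return a Counter of the status codes seen."""
    tally = collections.Counter()
    for i in range(count):
        tally[fetch(url)] += 1
        if i < count - 1:
            sleep(interval)
    return tally


if __name__ == "__main__":
    # Usage (hypothetical file name): python poll_route.py http://<route-host>/
    print(poll_route(sys.argv[1], count=60))
```

Running it before and after deleting the deployment should show the tally moving from 200s to 503s, which can then be compared with what the router metrics report for the same period.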
The PR was merged into the "4.5.0-0.nightly-2020-05-26-224432" build. In the fixed version, the "haproxy_backend_*" metrics are populated properly, and "haproxy_backend_http_responses_total" now shows the 5xx errors (503s) when the backend pods become unavailable, as expected.
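One way to repeat this check without a graph is to sum the 5xx samples for a route from the router's metrics output in Prometheus text format. The sketch below assumes the samples carry `code` and `route` labels; those label names, the sample values, and the helper name `sum_metric` are illustrative assumptions, not taken from this report.

```python
import re

# One sample line: metric name, optional {labels}, then the value.
SAMPLE_RE = re.compile(
    r'^(?P<name>[A-Za-z_:][A-Za-z0-9_:]*)(?:\{(?P<labels>.*)\})?\s+(?P<value>\S+)'
)
LABEL_RE = re.compile(r'(\w+)="((?:[^"\\]|\\.)*)"')


def sum_metric(text, metric, **wanted):
    """Sum the values of `metric` samples whose labels match all of `wanted`."""
    total = 0.0
    for line in text.splitlines():
        line = line.strip()
        if not line or line.startswith("#"):
            continue  # skip blanks and HELP/TYPE comment lines
        match = SAMPLE_RE.match(line)
        if not match or match.group("name") != metric:
            continue
        labels = dict(LABEL_RE.findall(match.group("labels") or ""))
        if all(labels.get(key) == value for key, value in wanted.items()):
            total += float(match.group("value"))
    return total


# Illustrative input only; label names and values are assumptions.
SAMPLE = """
haproxy_backend_http_responses_total{code="2xx",route="hello"} 40
haproxy_backend_http_responses_total{code="5xx",route="hello"} 12
haproxy_backend_http_responses_total{code="5xx",route="other"} 3
"""

if __name__ == "__main__":
    print(sum_metric(SAMPLE, "haproxy_backend_http_responses_total",
                     code="5xx", route="hello"))
```

If the total for the route keeps rising while the client receives 503s, the metric is behaving as described above.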
Created attachment 1693034: Prometheus graph data from patched cluster version
For future reference, this was backported to 4.4 in https://github.com/openshift/router/pull/141/commits
Since the problem described in this bug report should be resolved in a recent advisory, it has been closed with a resolution of ERRATA. For information on the advisory, and where to find the updated files, follow the link below. If the solution does not work for you, open a new bug report. https://access.redhat.com/errata/RHBA-2020:2409