Bug 1835845 - haproxy_server_http_responses_total metric disappears when backing pods are not available
Keywords:
Status: CLOSED ERRATA
Alias: None
Product: OpenShift Container Platform
Classification: Red Hat
Component: Routing
Version: 4.3.z
Hardware: Unspecified
OS: Unspecified
Priority: unspecified
Severity: high
Target Milestone: ---
Target Release: 4.5.0
Assignee: Andrew McDermott
QA Contact: Arvind iyengar
URL:
Whiteboard:
Depends On:
Blocks: 1855852
 
Reported: 2020-05-14 15:20 UTC by Rafa Porres Molina
Modified: 2020-07-13 17:39 UTC (History)

Fixed In Version:
Doc Type: Release Note
Doc Text:
If the backing pods of a service exposed via a route are unavailable (crashlooping, deleted, etc.), the router responds with a 503 but the haproxy_server_http_responses_total metric disappears for that route. We now always report all backend metrics so users can track when no pods are up (e.g., crashlooping).
Clone Of:
Clones: 1855852
Environment:
Last Closed: 2020-07-13 17:38:54 UTC
Target Upstream Version:


Attachments
Prometheus graph data from patched cluster version (903.35 KB, image/png)
2020-05-28 12:18 UTC, Arvind iyengar


Links
System ID Priority Status Summary Last Updated
Github openshift origin pull 24994 None closed Bug 1835845: Metrics test for router should allow backend metrics 2020-09-21 12:29:52 UTC
Github openshift router pull 132 None closed Bug 1835845: Report all backend metrics for when there are no endpoints 2020-09-21 12:29:48 UTC
Red Hat Product Errata RHBA-2020:2409 None None None 2020-07-13 17:39:10 UTC

Description Rafa Porres Molina 2020-05-14 15:20:50 UTC
Description of problem: If the backing pods of a service exposed via a route are unavailable (crashlooping, deleted, etc.), the router responds with a 503 but haproxy_server_http_responses_total disappears for that route.


Version-Release number of selected component (if applicable): Tested in OSD 4.3.18


How reproducible: Reproducible


Steps to Reproduce:
1. Create an HTTP deployment and a service, and expose the service through a route
2. Create a client that queries the route (a shell curl loop should be enough)
3. Delete the deployment
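
The client in step 2 could also be sketched in Python, which makes it easy to tally the status codes seen before and after the deployment is deleted. This is only an illustration: the helper names (`fetch_status`, `poll_route`) and the route URL in the usage comment are made up and are not part of any OpenShift tooling.

```python
import time
import urllib.error
import urllib.request
from collections import Counter


def fetch_status(url, timeout=5.0):
    """Return the HTTP status code for a GET on url, or None if no response arrived."""
    try:
        with urllib.request.urlopen(url, timeout=timeout) as resp:
            return resp.status
    except urllib.error.HTTPError as err:
        # urllib raises on 4xx/5xx; the status code (e.g. 503) is what we want to count.
        return err.code
    except urllib.error.URLError:
        # DNS failure, connection refused, timeout during connect: no HTTP status at all.
        return None


def poll_route(url, attempts, delay=1.0, fetch=fetch_status, sleep=time.sleep):
    """Query url `attempts` times, pausing `delay` seconds between requests,
    and return a Counter mapping status code to number of occurrences."""
    seen = Counter()
    for i in range(attempts):
        seen[fetch(url)] += 1
        if i < attempts - 1:
            sleep(delay)
    return seen


# Usage (hypothetical route host, replace with your own):
#   print(poll_route("http://hello.apps.example.com/", attempts=30))
```

After step 3, the tally should shift toward 503, which is the count the reporter expects to see reflected in the router metrics.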


Actual results: The haproxy_server_http_responses_total metric for that route is no longer available, which means that monitoring that route is no longer possible (e.g., to watch for errors)


Expected results: haproxy_server_http_responses_total registering the 503s the client is getting


Additional info: 

* I've looked at a few of the other haproxy_server_* metrics, and they are also unavailable.
* haproxy_backend_up metric returns 1, which looks wrong
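
One way to check for the symptom is to scrape the router metrics and look for series that belong to the route. The following is a minimal sketch, assuming the metrics are available in the Prometheus text exposition format and that the series carry `route` and `code` labels. The function name, the label names, and the sample text are assumptions made for illustration, not output captured from a cluster.

```python
import re
from collections import defaultdict

# Matches sample lines of the form: name{label="value",...} value
# Simplified: assumes label values do not contain a closing brace.
SAMPLE_RE = re.compile(
    r'^(?P<name>[A-Za-z_:][A-Za-z0-9_:]*)\{(?P<labels>[^}]*)\}\s+(?P<value>\S+)'
)
LABEL_RE = re.compile(r'(\w+)="((?:[^"\\]|\\.)*)"')


def responses_by_code(text, metric, route):
    """Sum the samples of `metric` for `route`, grouped by the `code` label.

    Returns an empty dict when no series exists for the route, which is
    the situation described in this report.
    """
    totals = defaultdict(float)
    for line in text.splitlines():
        if line.startswith("#"):
            continue  # HELP/TYPE lines and comments
        m = SAMPLE_RE.match(line)
        if not m or m.group("name") != metric:
            continue
        labels = dict(LABEL_RE.findall(m.group("labels")))
        if labels.get("route") == route:
            totals[labels.get("code", "")] += float(m.group("value"))
    return dict(totals)


# Illustrative input only; label names and values are assumed.
SAMPLE = """\
# illustrative sample, not real router output
haproxy_server_http_responses_total{code="2xx",route="hello",server="10.0.0.1:8080"} 40
haproxy_server_http_responses_total{code="2xx",route="hello",server="10.0.0.2:8080"} 2
haproxy_server_http_responses_total{code="5xx",route="hello",server="10.0.0.1:8080"} 3
haproxy_server_http_responses_total{code="2xx",route="other",server="10.0.0.3:8080"} 7
"""
```

With this sample, the route `hello` yields totals for `2xx` and `5xx`, while a route with no series yields an empty result. An empty result for a route that is still receiving requests corresponds to the missing-metric behaviour reported here.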

Comment 6 Arvind iyengar 2020-05-28 12:17:06 UTC
The PR was merged into the "4.5.0-0.nightly-2020-05-26-224432" build. In the fixed version, the "haproxy_backend_*" metrics are populated properly, and "haproxy_backend_http_responses_total" now shows the 5xx errors (503s) when the backend pods become unavailable, as expected.

Comment 7 Arvind iyengar 2020-05-28 12:18:12 UTC
Created attachment 1693034
Prometheus graph data from patched cluster version

Comment 8 Stephen Greene 2020-07-10 20:13:42 UTC
For future reference, this was backported to 4.4 in https://github.com/openshift/router/pull/141/commits

Comment 9 errata-xmlrpc 2020-07-13 17:38:54 UTC
Since the problem described in this bug report should be
resolved in a recent advisory, it has been closed with a
resolution of ERRATA.

For information on the advisory, and where to find the updated
files, follow the link below.

If the solution does not work for you, open a new bug report.

https://access.redhat.com/errata/RHBA-2020:2409

