Bug 1835845

Summary: haproxy_server_http_responses_total metric disappears when backing pods are not available
Product: OpenShift Container Platform Reporter: Rafa Porres Molina <rporresm>
Component: Networking Assignee: Andrew McDermott <amcdermo>
Networking sub component: router QA Contact: Arvind iyengar <aiyengar>
Status: CLOSED ERRATA Docs Contact:
Severity: high    
Priority: unspecified CC: aiyengar, aos-bugs, bperkins, jbeakley, sgreene
Version: 4.3.z   
Target Milestone: ---   
Target Release: 4.5.0   
Hardware: Unspecified   
OS: Unspecified   
Whiteboard:
Fixed In Version: Doc Type: Release Note
Doc Text:
If the backing pods of a service exposed via a route are unavailable (crashlooping, deleted, etc.), the router responds with a 503, but the haproxy_server_http_responses_total metric disappears for that route. We now always report all backend metrics so users can track when no pods are up (e.g., crashlooping).
Story Points: ---
Clone Of:
: 1855852 Environment:
Last Closed: 2020-07-13 17:38:54 UTC Type: Bug
Regression: --- Mount Type: ---
Documentation: --- CRM:
Verified Versions: Category: ---
oVirt Team: --- RHEL 7.3 requirements from Atomic Host:
Cloudforms Team: --- Target Upstream Version:
Embargoed:
Bug Depends On:    
Bug Blocks: 1855852    
Attachments:
Description Flags
Prometheus graph data from patched cluster version none

Description Rafa Porres Molina 2020-05-14 15:20:50 UTC
Description of problem: If the backing pods of a service exposed via a route are unavailable (crashlooping, deleted, etc.), the router responds with a 503, but haproxy_server_http_responses_total disappears for that route.


Version-Release number of selected component (if applicable): Tested in OSD 4.3.18


How reproducible: Reproducible


Steps to Reproduce:
1. Create an HTTP deployment and a service, and expose the service through a route
2. Create a client that queries the route (a shell curl loop should be enough)
3. Delete the deployment
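
For step 2, a client along the following lines should also do in place of a curl loop. This is only a sketch: the function names `fetch_status` and `poll_route` are made up for illustration, and the URL you pass in must be the host of the route you created. It tallies the status codes it sees, so the switch from 200s to 503s after step 3 shows up in the counts.

```python
import time
import urllib.error
import urllib.request
from collections import Counter


def fetch_status(url, timeout=5.0):
    """Return the HTTP status code for a GET on url, or None if the connection fails."""
    try:
        with urllib.request.urlopen(url, timeout=timeout) as resp:
            return resp.status
    except urllib.error.HTTPError as err:
        # 4xx/5xx responses (such as the router's 503) are raised as HTTPError.
        return err.code
    except (urllib.error.URLError, OSError):
        # DNS failure, refused connection, timeout, and similar.
        return None


def poll_route(url, attempts=10, delay=1.0, fetch=fetch_status):
    """Query url `attempts` times and return a Counter of the status codes seen."""
    seen = Counter()
    for i in range(attempts):
        seen[fetch(url)] += 1
        if i < attempts - 1:
            time.sleep(delay)
    return seen
```

Calling `poll_route("http://<route-host>/", attempts=30)` before and after deleting the deployment should show the 503s the client receives, independent of what the router metrics report.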


Actual results: The haproxy_server_http_responses_total metric for that route is no longer available, which means that monitoring on that route is no longer possible (e.g., to monitor for errors)


Expected results: haproxy_server_http_responses_total remains available for the route and registers the 503s the client is getting
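
One way to tell "metric absent" apart from "metric present with zero errors" is to look at the raw scrape output of the router metrics endpoint. The sketch below parses Prometheus text exposition and sums the metric per response class for one route. It assumes the series carry `route` and `code` labels; those label names, and the helper names `parse_samples` and `responses_by_code`, are assumptions for illustration and are not taken from this report.

```python
import re

# One sample line: metric name, optional {labels}, then the value.
SAMPLE_RE = re.compile(
    r'^(?P<name>[a-zA-Z_:][a-zA-Z0-9_:]*)(?:\{(?P<labels>.*)\})?\s+(?P<value>\S+)'
)
LABEL_RE = re.compile(r'(\w+)="((?:[^"\\]|\\.)*)"')


def parse_samples(text):
    """Yield (name, labels, value) for each sample line, skipping comments and blanks."""
    for line in text.splitlines():
        line = line.strip()
        if not line or line.startswith("#"):
            continue
        m = SAMPLE_RE.match(line)
        if m is None:
            continue
        labels = dict(LABEL_RE.findall(m.group("labels") or ""))
        yield m.group("name"), labels, float(m.group("value"))


def responses_by_code(text, route, metric="haproxy_server_http_responses_total"):
    """Sum `metric` per `code` label for one route.

    An empty dict means no series for that route exists in the scrape,
    which is the symptom described in this report.
    """
    totals = {}
    for name, labels, value in parse_samples(text):
        if name == metric and labels.get("route") == route:
            code = labels.get("code", "")
            totals[code] = totals.get(code, 0.0) + value
    return totals
```

With the expected behavior, the result for the affected route would keep its entries after the deployment is deleted, with the 5xx count growing; with the reported behavior it comes back empty.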


Additional info: 

* I've looked at a few other haproxy_server_* metrics and they are also not available.
* The haproxy_backend_up metric returns 1, which looks wrong given that no backing pods are available

Comment 6 Arvind iyengar 2020-05-28 12:17:06 UTC
The PR merged into the "4.5.0-0.nightly-2020-05-26-224432" version. In the fixed version, we see that the "haproxy_backend_*_metrics" are populated properly and that "haproxy_backend_http_responses_total" now shows the "error 5xx" (error 503s) when the backend pods become unavailable, as expected.
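
A check like the one described above could be scripted against the Prometheus HTTP API roughly as follows. This is a sketch under assumptions: the `route` label on haproxy_backend_http_responses_total, the use of `rate()` (which presumes the metric is a counter), the 5m window, and the helper names are all illustrative choices, and the Prometheus base URL is whatever your cluster exposes. Only the `/api/v1/query` instant-query endpoint and the "5xx" code value mentioned in this comment are relied on as given.

```python
from urllib.parse import urlencode


def backend_5xx_query(route, window="5m"):
    """Build a PromQL expression for the 5xx response rate of one route's backend."""
    return (
        'sum(rate(haproxy_backend_http_responses_total'
        f'{{route="{route}",code="5xx"}}[{window}]))'
    )


def instant_query_url(prometheus_base, promql):
    """Build an instant-query URL for the Prometheus HTTP API."""
    return f"{prometheus_base.rstrip('/')}/api/v1/query?{urlencode({'query': promql})}"
```

On a fixed cluster, a query built this way should return a non-empty result that rises while the backing pods are unavailable.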

Comment 7 Arvind iyengar 2020-05-28 12:18:12 UTC
Created attachment 1693034
Prometheus graph data from patched cluster version

Comment 8 Stephen Greene 2020-07-10 20:13:42 UTC
For future reference, this was backported to 4.4 in https://github.com/openshift/router/pull/141/commits

Comment 9 errata-xmlrpc 2020-07-13 17:38:54 UTC
Since the problem described in this bug report should be
resolved in a recent advisory, it has been closed with a
resolution of ERRATA.

For information on the advisory, and where to find the updated
files, follow the link below.

If the solution does not work for you, open a new bug report.

https://access.redhat.com/errata/RHBA-2020:2409