Bug 1835845

Summary:

haproxy_server_http_responses_total metric disappears when backing pods are not available

Product:

OpenShift Container Platform

Reporter:

Rafa Porres Molina <rporresm>

Component:

Networking

Assignee:

Andrew McDermott <amcdermo>

Networking sub component:

router

QA Contact:

Arvind iyengar <aiyengar>

Status:

CLOSED ERRATA

Docs Contact:

Severity:

high

Priority:

unspecified

CC:

aiyengar, aos-bugs, bperkins, jbeakley, sgreene

Version:

4.3.z

Target Milestone:

---

Target Release:

4.5.0

Hardware:

Unspecified

OS:

Unspecified

Whiteboard:

Fixed In Version:

Doc Type:

Release Note

Doc Text:

If the backing pods of a service exposed via a route are unavailable (crashlooping, deleted, etc.), the router responds with a 503 but the haproxy_server_http_responses_total metric disappears for that route. The router now always reports all backend metrics, so users can track when no pods are up (e.g., crashlooping).

Story Points:

---

Clone Of:

Clones:

1855852 (view as bug list)

Environment:

Last Closed:

2020-07-13 17:38:54 UTC

Type:

Bug

Regression:

---

Mount Type:

---

Documentation:

---

CRM:

Verified Versions:

Category:

---

oVirt Team:

---

RHEL 7.3 requirements from Atomic Host:

Cloudforms Team:

---

Target Upstream Version:

Embargoed:

Bug Depends On:

Bug Blocks:

1855852

Attachments:

Description	Flags
Prometheus graph data from patched cluster version	none

Description Rafa Porres Molina 2020-05-14 15:20:50 UTC

Description of problem: If the backing pods of a service exposed via a route are unavailable (crashlooping, deleted, etc.), the router responds with a 503 but haproxy_server_http_responses_total disappears for that route.


Version-Release number of selected component (if applicable): Tested in OSD 4.3.18


How reproducible: Reproducible


Steps to Reproduce:
1. Create an HTTP deployment and a service, and expose the service through a route
2. Create a client that queries the route (a shell curl loop should be enough)
3. Delete the deployment
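
For step 2, a small Python client can stand in for the curl loop and makes it easy to see when the 503s start. This is only a sketch: the function names are illustrative, the route URL in the usage note is a placeholder, and connection-level failures (DNS, refused connections) are not handled and will raise.

```python
import urllib.error
import urllib.request
from collections import Counter


def fetch_status(url, timeout=5.0):
    """Return the HTTP status code for a GET on url, including error codes such as 503."""
    try:
        with urllib.request.urlopen(url, timeout=timeout) as resp:
            return resp.status
    except urllib.error.HTTPError as err:
        # urllib raises on 4xx/5xx responses; the status code is still what we want to record.
        return err.code


def tally_status_codes(url, requests, fetch=fetch_status):
    """Issue `requests` GETs against url and count the responses by status code."""
    counts = Counter()
    for _ in range(requests):
        counts[fetch(url)] += 1
    return counts
```

Calling tally_status_codes("http://<route-host>/", 100) before and after deleting the deployment should show the counts shift from 200 to 503, which is the traffic the metric is expected to reflect. The fetch parameter exists only so the loop can be exercised without a live route.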


Actual results: The haproxy_server_http_responses_total metric for that route is no longer available, which means that monitoring on that route is no longer possible (e.g., to monitor for errors)


Expected results: haproxy_server_http_responses_total registers the 503s the client is getting


Additional info: 

* I've taken a look at a few of the haproxy_server_* metrics and they are also not available.
* haproxy_backend_up metric returns 1, which looks wrong
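
One way to check whether a metric is still present for a route is to parse the text a metrics endpoint returns and group the samples by response code. The sketch below assumes the Prometheus text exposition format and assumes the samples carry "route" and "code" labels; those label names, and the helper names, are assumptions and not taken from this report. Escaped characters inside label values are left as-is.

```python
import re

# One sample line of the Prometheus text exposition format:
#   metric_name{label="value",...} 123
SAMPLE_RE = re.compile(r'^([A-Za-z_:][A-Za-z0-9_:]*)(?:\{(.*)\})?\s+(\S+)')
LABEL_RE = re.compile(r'([A-Za-z_][A-Za-z0-9_]*)="((?:[^"\\]|\\.)*)"')


def parse_samples(text):
    """Yield (name, labels, value) for each sample line; comments and blanks are skipped."""
    for line in text.splitlines():
        line = line.strip()
        if not line or line.startswith("#"):
            continue
        match = SAMPLE_RE.match(line)
        if match is None:
            continue
        name, raw_labels, value = match.groups()
        labels = dict(LABEL_RE.findall(raw_labels or ""))
        yield name, labels, float(value)


def responses_by_code(text, metric, route):
    """Sum the samples of `metric` for `route`, grouped by the assumed `code` label."""
    totals = {}
    for name, labels, value in parse_samples(text):
        if name == metric and labels.get("route") == route:
            code = labels.get("code", "")
            totals[code] = totals.get(code, 0.0) + value
    return totals
```

An empty result from responses_by_code for a route that is still receiving traffic corresponds to the behavior described here: the series is gone rather than counting 5xx responses.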

Comment 6 Arvind iyengar 2020-05-28 12:17:06 UTC

The PR was merged into the "4.5.0-0.nightly-2020-05-26-224432" version. In the fixed version, the "haproxy_backend_*_metrics" are populated properly and "haproxy_backend_http_responses_total" now shows the "error 5xx" responses (503s) when the backend pods become unavailable, as expected.

Comment 7 Arvind iyengar 2020-05-28 12:18:12 UTC

Created attachment 1693034 [details]
Prometheus graph data from patched cluster version

Comment 8 Stephen Greene 2020-07-10 20:13:42 UTC

For future reference, this was backported to 4.4 in https://github.com/openshift/router/pull/141/commits

Comment 9 errata-xmlrpc 2020-07-13 17:38:54 UTC

Since the problem described in this bug report should be
resolved in a recent advisory, it has been closed with a
resolution of ERRATA.

For information on the advisory, and where to find the updated
files, follow the link below.

If the solution does not work for you, open a new bug report.

https://access.redhat.com/errata/RHBA-2020:2409