Bug 1401470 - [RFE] Enhance HARouters metrics and aliveness probe
Summary: [RFE] Enhance HARouters metrics and aliveness probe
Keywords:
Status: CLOSED NEXTRELEASE
Alias: None
Product: OpenShift Container Platform
Classification: Red Hat
Component: RFE
Version: 3.3.0
Hardware: Unspecified
OS: Unspecified
Priority: unspecified
Severity: medium
Target Milestone: ---
Target Release: ---
Assignee: Ben Bennett
QA Contact: Xiaoli Tian
URL:
Whiteboard:
Depends On:
Blocks:
Reported: 2016-12-05 11:38 UTC by Carsten Lichy-Bittendorf
Modified: 2021-12-10 14:49 UTC (History)

Fixed In Version:
Doc Type: If docs needed, set a value
Doc Text:
Clone Of:
Environment:
Last Closed: 2017-05-18 19:16:24 UTC
Target Upstream Version:



Description Carsten Lichy-Bittendorf 2016-12-05 11:38:21 UTC
1. Proposed title of this feature request  
Enhance HARouters metrics and aliveness probe

2. Who is the customer behind the request?  
Account: Produban Servicios Informaticos Generales, S.L. acct # 1596976
TAM customer: yes  
SRM customer: yes  
Strategic: yes  

3. What is the nature and description of the request?  

4. Why does the customer need this? (List the business requirements here)  
We need more information about the routers than the haproxy stats alone; we also need information about the router binary itself. At times the router fails while reloading the routes, which leaves it unsynchronized, and we have no way of knowing that this has happened.
We have several examples:
a)
    1. The router is working well.
    2. When the router tries to reload the routes, it can't connect to the API and the reload fails.
    3. The router isn't synchronized.
b)
    1. The router is working well.
    2. A user creates a new route with its own certificate. The certificate is not compatible with haproxy.
    3. The router tries to reload the routes, but when it evaluates the new certificate it fails and stops the reload operation.
    4. All the routes in the router are corrupted or inaccessible.
We need the router to have smarter readiness and liveness probes, not only a check of an haproxy endpoint. We also need a way to know when the router fails to reload the routes: an endpoint that exposes metrics and the status of the router, not only the haproxy stats.
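
To make the request concrete, the following is a minimal sketch of the kind of reload status a router could track and expose to a readiness check. It is illustrative only and is not the router's actual implementation: `ReloadStatus`, `readiness`, and all field names are hypothetical.

```python
import time
from dataclasses import dataclass
from typing import Optional, Tuple


@dataclass
class ReloadStatus:
    """Hypothetical record of the router's most recent reload attempts."""

    last_attempt: Optional[float] = None  # time of the latest reload attempt
    last_success: Optional[float] = None  # time of the latest successful reload
    last_error: Optional[str] = None      # error text from the latest failed reload
    failures_total: int = 0               # count of failed reloads, usable as a metric

    def record(self, error: Optional[str] = None, now: Optional[float] = None) -> None:
        """Record one reload attempt; pass `error` when the reload failed."""
        now = time.time() if now is None else now
        self.last_attempt = now
        if error is None:
            self.last_success = now
            self.last_error = None
        else:
            self.last_error = error
            self.failures_total += 1

    def is_synchronized(self) -> bool:
        """True only if a reload has been attempted and the latest one succeeded."""
        return self.last_attempt is not None and self.last_error is None


def readiness(status: ReloadStatus) -> Tuple[int, str]:
    """Map the reload status to an HTTP-style (code, message) probe result."""
    if status.last_attempt is None:
        return 503, "no reload attempted yet"
    if not status.is_synchronized():
        return 503, "last reload failed: " + str(status.last_error)
    return 200, "ok"
```

With a check like this, both example a) (API unreachable during reload) and example b) (incompatible certificate aborts the reload) would surface as a failed probe and an increasing failure counter, rather than going unnoticed.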

5. How would the customer like to achieve this? (List the functional requirements here)  
With a more white-box style of monitoring for haproxy.
  
6. For each functional requirement listed, specify how Red Hat and the customer can test to confirm the requirement is successfully implemented.
Several haproxy failures should be deliberately induced, and the enhanced monitoring should be able to detect all of them.
  
7. Is there already an existing RFE upstream or in Red Hat Bugzilla?  
Not aware of any.

8. Does the customer have any specific timeline dependencies and which release would they like to target (i.e. RHEL5, RHEL6)?  
As soon as possible, as it’s already 

9. Is the sales team involved in this request and do they have any additional input?  
Sales is not involved but is aware of this request.

10. List any affected packages or components.
-  haproxy
(- monitoring)

11. Would the customer be able to assist in testing this functionality if implemented?
Based on their experience operating the HAProxy routers, the customer will be able to share feedback.

Comment 2 Ben Bennett 2017-02-09 14:35:04 UTC
We screen certificates now (available in 3.3 and soon 3.2)

Stats is a known problem, but it needs a later version of haproxy to address.  We are working on that, but it's not the top priority.

As to smarter monitoring, that is not likely to get addressed directly.  But you can add it yourself.  You could add a stable endpoint and make it always run, then have your liveness checks hit a route to those pods (by editing the router DC).
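
As a rough illustration of the "stable endpoint" idea, the sketch below is a tiny always-on HTTP service that could run in its own pod behind a route. The `/healthz` path, the default port 8080, and the names `StableHandler` and `make_server` are assumptions made for this example; nothing here is part of the router itself.

```python
from http.server import BaseHTTPRequestHandler, HTTPServer


class StableHandler(BaseHTTPRequestHandler):
    """Answers GET /healthz with 200 and any other path with 404."""

    def do_GET(self):
        if self.path == "/healthz":
            body = b"ok\n"
            self.send_response(200)
        else:
            body = b"not found\n"
            self.send_response(404)
        self.send_header("Content-Type", "text/plain")
        self.send_header("Content-Length", str(len(body)))
        self.end_headers()
        self.wfile.write(body)

    def log_message(self, format, *args):
        # Suppress per-request logging so frequent probes do not flood the pod log.
        pass


def make_server(host: str = "0.0.0.0", port: int = 8080) -> HTTPServer:
    """Build the server; call serve_forever() on the result to run it."""
    return HTTPServer((host, port), StableHandler)
```

Running `make_server().serve_forever()` in a pod and exposing it through a route gives the liveness check a target that only answers when the router is actually forwarding traffic for that route. The probe itself would still need to be pointed at that route by editing the router DC, as described above.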

Comment 4 Ben Bennett 2017-05-18 19:16:24 UTC
We have added the stats to the router in 3.6: https://github.com/openshift/origin/pull/13337

