Bug 1401470

Summary: [RFE] Enhance HARouters metrics and aliveness probe
Product: OpenShift Container Platform Reporter: Carsten Lichy-Bittendorf <clichybi>
Component: RFE Assignee: Ben Bennett <bbennett>
Status: CLOSED NEXTRELEASE QA Contact: Xiaoli Tian <xtian>
Severity: medium Docs Contact:
Priority: unspecified    
Version: 3.3.0 CC: aos-bugs, bbennett, eparis, jokerman, mmccomas, pdwyer, sjr
Target Milestone: ---   
Target Release: ---   
Hardware: Unspecified   
OS: Unspecified   
Whiteboard:
Fixed In Version: Doc Type: If docs needed, set a value
Doc Text:
Story Points: ---
Clone Of: Environment:
Last Closed: 2017-05-18 19:16:24 UTC Type: Bug
Regression: --- Mount Type: ---
Documentation: --- CRM:
Verified Versions: Category: ---
oVirt Team: --- RHEL 7.3 requirements from Atomic Host:
Cloudforms Team: --- Target Upstream Version:

Description Carsten Lichy-Bittendorf 2016-12-05 11:38:21 UTC
1. Proposed title of this feature request  
Enhance HARouters metrics and aliveness probe

2. Who is the customer behind the request?  
Account: Produban Servicios Informaticos Generales, S.L. acct # 1596976
TAM customer: yes  
SRM customer: yes  
Strategic: yes  

3. What is the nature and description of the request?  

4. Why does the customer need this? (List the business requirements here)  
We need more information about the routers than the haproxy stats provide; in particular, we need information about the router binary itself. At times the router fails while reloading the routes, which leaves it out of sync, and we have no way of detecting this.
We have several examples:
a)
    1. The router is working well.
    2. When the router tries to reload the routes, it cannot connect to the API and the reload fails.
    3. The router isn't synchronized.
b)
    1. The router is working well.
    2. A user creates a new route with its own certificate. The certificate is not compatible with haproxy.
    3. The router tries to reload the routes, but when it evaluates the new certificate it fails and aborts the reload.
    4. All the routes in the router are corrupted or inaccessible.
We need the router to have smarter readiness and liveness probes, not just a check of a haproxy endpoint. We also need a way to know when the router fails to reload the routes: an endpoint that exposes metrics and the status of the router itself, not only the haproxy stats.
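As a rough illustration of the kind of status being asked for, the sketch below tracks the outcome of each route reload and derives a probe result from it. It is not how the OpenShift router works; the `ReloadTracker` class, its counters, and the 200/503 mapping are all hypothetical names and choices made for this example.

```python
from dataclasses import dataclass
from typing import Optional


@dataclass
class ReloadTracker:
    """Hypothetical helper that remembers how route reloads have gone."""

    reloads_total: int = 0            # every reload attempt, successful or not
    reload_failures_total: int = 0    # attempts that ended in an error
    last_error: Optional[str] = None  # reason for the latest failure, if any

    def record(self, error: Optional[str] = None) -> None:
        """Record one reload attempt; pass an error string if it failed."""
        self.reloads_total += 1
        if error is not None:
            self.reload_failures_total += 1
        self.last_error = error

    def in_sync(self) -> bool:
        """True only if at least one reload ran and the latest one succeeded."""
        return self.reloads_total > 0 and self.last_error is None

    def probe_status(self) -> int:
        """HTTP status a readiness-style probe could return (assumed mapping)."""
        return 200 if self.in_sync() else 503


tracker = ReloadTracker()
tracker.record()                               # initial reload succeeds
tracker.record("cannot connect to the API")    # scenario a) above
print(tracker.probe_status(), tracker.reload_failures_total, tracker.last_error)
```

With something along these lines, both scenario a) and scenario b) would show up as a failing probe and an incremented failure counter instead of going unnoticed.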

5. How would the customer like to achieve this? (List the functional requirements here)  
Something closer to white-box monitoring of haproxy.
  
6. For each functional requirement listed, specify how Red Hat and the customer can test to confirm the requirement is successfully implemented.
Several different failures should be induced in haproxy, and the enhanced monitoring should be able to detect all of them.
  
7. Is there already an existing RFE upstream or in Red Hat Bugzilla?  
Not aware of any.

8. Does the customer have any specific timeline dependencies and which release would they like to target (i.e. RHEL5, RHEL6)?  
As soon as possible, as it’s already 

9. Is the sales team involved in this request and do they have any additional input?  
Sales is not involved but is aware of this request.

10. List any affected packages or components.
-  haproxy
(- monitoring)

11. Would the customer be able to assist in testing this functionality if implemented?
Based on their experience operating the HAProxy routers, the customer will be able to share feedback.

Comment 2 Ben Bennett 2017-02-09 14:35:04 UTC
We screen certificates now (available in 3.3 and soon 3.2)

Stats is a known problem, but it needs a later version of haproxy to address.  We are working on that, but it's not the top priority.

As to smarter monitoring, that is not likely to get addressed directly.  But you can add it yourself.  You could add a stable endpoint and make it always run, then have your liveness checks hit a route to those pods (by editing the router DC).
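A minimal sketch of the "stable endpoint" idea, using only the Python standard library. The `/healthz` path, the handler and function names, and the local `probe` helper are assumptions for illustration; in a real cluster the endpoint would run in its own pods behind a route, and the liveness check would be configured in the router DC instead of being called from Python.

```python
import threading
import urllib.request
from http.server import BaseHTTPRequestHandler, HTTPServer


class StableHandler(BaseHTTPRequestHandler):
    """Always answers 200 on /healthz (hypothetical path) and 404 elsewhere."""

    def do_GET(self):
        if self.path == "/healthz":
            status, body = 200, b"ok\n"
        else:
            status, body = 404, b"not found\n"
        self.send_response(status)
        self.send_header("Content-Type", "text/plain")
        self.send_header("Content-Length", str(len(body)))
        self.end_headers()
        self.wfile.write(body)

    def log_message(self, format, *args):
        pass  # keep the example quiet


def start_stable_endpoint(host="127.0.0.1", port=0):
    """Start the endpoint in a background thread; port 0 picks a free port."""
    server = HTTPServer((host, port), StableHandler)
    threading.Thread(target=server.serve_forever, daemon=True).start()
    return server


def probe(url, timeout=2.0):
    """Stand-in for a liveness check: True only on an HTTP 200 response."""
    # Ignore any proxy settings so the check goes straight to the given URL.
    opener = urllib.request.build_opener(urllib.request.ProxyHandler({}))
    try:
        with opener.open(url, timeout=timeout) as resp:
            return resp.status == 200
    except OSError:  # covers URLError, HTTPError, refused connections, timeouts
        return False


if __name__ == "__main__":
    server = start_stable_endpoint()
    port = server.server_address[1]
    print(probe(f"http://127.0.0.1:{port}/healthz"))
    server.shutdown()
    server.server_close()
```

The point of routing the check through the router is that a router which has stopped serving routes correctly would also fail to reach this always-available endpoint, so the probe fails even though haproxy itself is still running.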

Comment 4 Ben Bennett 2017-05-18 19:16:24 UTC
We have added the stats to the router in 3.6: https://github.com/openshift/origin/pull/13337