Bug 1410899

Summary: Metrics - Could not acquire a Kubernetes client connection
Product: OpenShift Container Platform Reporter: Eric Jones <erjones>
Component: Hawkular    Assignee: Matt Wringe <mwringe>
Status: CLOSED ERRATA QA Contact: Mike Fiedler <mifiedle>
Severity: high Docs Contact:
Priority: medium    
Version: 3.3.1    CC: aos-bugs, juzhao, mifiedle, mwringe, pruan, tdawson, tkimura
Target Milestone: ---   
Target Release: 3.4.z   
Hardware: Unspecified   
OS: Unspecified   
Whiteboard:
Fixed In Version:    Doc Type: Bug Fix
Doc Text:
Cause: When authenticating users, Hawkular Metrics did not properly handle an error response from the OpenShift master for a subjectaccessreview.
Consequence: If the authentication token passed was not valid, the connection to Hawkular Metrics stayed open until a timeout.
Fix: Hawkular Metrics now properly handles an error response from the OpenShift server and closes the connection.
Result: If a user passes an invalid token, their connection is closed properly and no longer remains open until a timeout.
Story Points: ---
Clone Of:
: 1448999 (view as bug list) Environment:
Last Closed: 2017-01-31 20:19:37 UTC Type: Bug
Regression: --- Mount Type: ---
Documentation: --- CRM:
Verified Versions: Category: ---
oVirt Team: --- RHEL 7.3 requirements from Atomic Host:
Cloudforms Team: --- Target Upstream Version:
Embargoed:
Bug Depends On:    
Bug Blocks: 1448999    
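
For context on the Doc Text above, here is a minimal sketch of the kind of subjectaccessreview check Hawkular Metrics performs against the master when authenticating a token. The master URL, project name, API group/version, and payload are assumptions based on the standard Kubernetes authorization API, not the exact call Hawkular Metrics makes.

# Hypothetical check: "can the bearer of this token list pods in the project?"
MASTER=https://openshift.example.com:8443   # placeholder master URL
TOKEN=$(oc whoami -t)                       # token being checked
curl -k -s -X POST \
  -H "Authorization: Bearer $TOKEN" \
  -H "Content-Type: application/json" \
  -d '{"kind":"SelfSubjectAccessReview","apiVersion":"authorization.k8s.io/v1","spec":{"resourceAttributes":{"namespace":"myproject","verb":"list","resource":"pods"}}}' \
  "$MASTER/apis/authorization.k8s.io/v1/selfsubjectaccessreviews"
# Before the fix, an error response at this step (e.g. for an invalid token)
# was not handled, and the client's connection to Hawkular Metrics stayed open
# until it timed out; with the fix the connection is closed right away.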

Description Eric Jones 2017-01-06 19:04:28 UTC
Description of problem:
After scaling the cassandra and hawkular-metrics pods from 1 to 3 replicas, hawkular-metrics periodically becomes unreachable.

The web console shows a gateway timeout, and curling the service directly returns "Could not acquire a Kubernetes client connection".
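
As a sketch, "curling it directly" would look roughly like the following; the route hostname and project name are placeholders:

TOKEN=$(oc whoami -t)
curl -k \
  -H "Authorization: Bearer $TOKEN" \
  -H "Hawkular-Tenant: myproject" \
  "https://hawkular-metrics.example.com/hawkular/metrics/metrics"
# When the issue occurs, the body contains
# "Could not acquire a Kubernetes client connection" instead of the metric list.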

The customer can temporarily restore normal operation by restarting the pods.

After discussion on an internal mailing list [0], it was recommended that a Bugzilla be opened.

[0] http://post-office.corp.redhat.com/archives/openshift-sme/2017-January/msg00178.html

Version-Release number of selected component (if applicable):

Steps to Reproduce:
Deploy metrics on OCP 3.3 via Ansible, then after installation scale Cassandra to 3 pods (via the cassandra-node template) and hawkular-metrics to 3 replicas.

Once the pods have been deployed, curl the 3 pods in a loop until a request fails.
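
A sketch of the scale-up and polling loop, assuming the replication controllers created by the metrics deployer; the project name, pod IPs, and port are placeholders (take the IPs from "oc get pod -o wide"):

# Scale hawkular-metrics to 3 replicas (Cassandra nodes are added by
# instantiating the cassandra-node template; exact template name omitted here).
oc scale rc hawkular-metrics --replicas=3 -n openshift-infra

# Poll each hawkular-metrics pod until a request fails or hangs.
TOKEN=$(oc whoami -t)
while true; do
  for ip in 10.1.2.3 10.1.2.4 10.1.2.5; do
    code=$(curl -k -s -o /dev/null -w '%{http_code}' --max-time 30 \
             -H "Authorization: Bearer $TOKEN" \
             -H "Hawkular-Tenant: myproject" \
             "https://$ip:8443/hawkular/metrics/metrics")
    echo "$(date +%T) $ip -> $code"
    [ "$code" = "200" ] || { echo "pod $ip failed"; exit 1; }
  done
  sleep 5
done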

The customer has provided tcpdumps, thread dumps, and some additional information, which will be added in a private update shortly.

Comment 11 Peng Li 2017-01-25 13:32:26 UTC
Set up a cluster with 1 master and 3 nodes (VM type: m3.large), installed 3.4.1 metrics, and observed the UI for 2 hours; the issue was not reproduced.

# openshift version
openshift v3.4.1.2
kubernetes v1.4.0+776c994
etcd 3.1.0-rc.0

#image
metrics-hawkular-metrics   3.4.1               ea4c68d376ca        18 hours ago        1.5 GB


# oc get pod
NAME                         READY     STATUS    RESTARTS   AGE
hawkular-cassandra-1-48ofe   1/1       Running   0          3h
hawkular-cassandra-2-rlzbk   1/1       Running   1          3h
hawkular-cassandra-3-3peho   1/1       Running   0          3h
hawkular-metrics-dka4f       1/1       Running   0          2h
hawkular-metrics-n5mgv       1/1       Running   0          3h
hawkular-metrics-tvy9w       1/1       Running   0          2h

Comment 19 Peter Ruan 2017-01-30 23:01:25 UTC
Marking bug as verified per comment #18.

Comment 20 Mike Fiedler 2017-01-31 00:48:59 UTC
Verified on 3.4.1.2. Requests with invalid tokens no longer hang indefinitely. Missing tokens and invalid endpoints were also tested; both were handled correctly.
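
One way to spot-check the invalid-token behavior, as a sketch (not necessarily the exact check run during verification); the route hostname and project name are placeholders:

curl -k -s -o /dev/null -w 'HTTP %{http_code} in %{time_total}s\n' \
  --max-time 60 \
  -H "Authorization: Bearer not-a-real-token" \
  -H "Hawkular-Tenant: myproject" \
  "https://hawkular-metrics.example.com/hawkular/metrics/metrics"
# Before the fix this request hung until the gateway timeout; with the fix it
# returns promptly with an authentication error.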

Comment 22 errata-xmlrpc 2017-01-31 20:19:37 UTC
Since the problem described in this bug report should be
resolved in a recent advisory, it has been closed with a
resolution of ERRATA.

For information on the advisory, and where to find the updated
files, follow the link below.

If the solution does not work for you, open a new bug report.

https://access.redhat.com/errata/RHBA-2017:0218