Bug 1410899

Summary:	Metrics - Could not acquire a Kubernetes client connection
Product:	OpenShift Container Platform	Reporter:	Eric Jones <erjones>
Component:	Hawkular	Assignee:	Matt Wringe <mwringe>
Status:	CLOSED ERRATA	QA Contact:	Mike Fiedler <mifiedle>
Severity:	high	Docs Contact:
Priority:	medium
Version:	3.3.1	CC:	aos-bugs, juzhao, mifiedle, mwringe, pruan, tdawson, tkimura
Target Milestone:	---
Target Release:	3.4.z
Hardware:	Unspecified
OS:	Unspecified
Whiteboard:
Fixed In Version:		Doc Type:	Bug Fix
Doc Text:	Cause: When authenticating users, Hawkular Metrics did not properly handle an error response from the OpenShift master for a subjectaccessreview. Consequence: If the authentication token passed was not valid, the connection to Hawkular Metrics stayed open until a timeout. Fix: Hawkular Metrics now properly handles an error response from the OpenShift server and closes the connection. Result: If a user passes an invalid token, their connection closes properly and does not remain open until a timeout.	Story Points:	---
Clone Of:
Clones:	1448999		Environment:
Last Closed:	2017-01-31 20:19:37 UTC	Type:	Bug
Regression:	---	Mount Type:	---
Documentation:	---	CRM:
Verified Versions:		Category:	---
oVirt Team:	---	RHEL 7.3 requirements from Atomic Host:
Cloudforms Team:	---	Target Upstream Version:
Embargoed:
Bug Depends On:
Bug Blocks:	1448999

Description Eric Jones 2017-01-06 19:04:28 UTC

Description of problem:
After scaling the cassandra and hawkular-metrics pods from 1 to 3 replicas, hawkular-metrics periodically becomes unreachable.

The web UI shows a gateway timeout, and curling hawkular-metrics directly returns "Could not acquire a Kubernetes client connection".

The customer appears to be able to restore normal usage temporarily by restarting the pods.

After discussing things in an internal mailing list [0], it was recommended that they get a bugzilla opened.

[0] http://post-office.corp.redhat.com/archives/openshift-sme/2017-January/msg00178.html

Version-Release number of selected component (if applicable):

Steps to Reproduce:
Deploy metrics on OCP 3.3 via Ansible. After installation, scale cassandra (via the cassandra-node template) to 3 pods and the hawkular-metrics pod to 3 replicas.

Once the pods have been deployed, curl the 3 pods in a loop until a request fails.
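
The polling loop described in the steps above could be scripted roughly as follows. This is only a sketch and not the script the customer used: the function names, the pod URLs, the tenant name, and the choice of the `Authorization: Bearer` and `Hawkular-Tenant` headers are assumptions made for illustration, and a cluster with self-signed certificates would need extra TLS handling that is not shown here.

```python
import urllib.error
import urllib.request


def fetch_status(url, token, tenant, timeout=10.0):
    """Send one GET and return the HTTP status code.

    Returns None if the request times out or the connection fails.
    The header names are an assumption about how the endpoint is called.
    """
    request = urllib.request.Request(
        url,
        headers={"Authorization": f"Bearer {token}", "Hawkular-Tenant": tenant},
    )
    try:
        with urllib.request.urlopen(request, timeout=timeout) as response:
            return response.status
    except urllib.error.HTTPError as err:
        # The server answered, but with an error status (for example 504).
        return err.code
    except (urllib.error.URLError, OSError):
        # No usable answer: connection refused, DNS failure, or timeout.
        return None


def poll_until_failure(urls, fetch, max_rounds=1000):
    """Cycle through urls and stop at the first result that is not 200.

    Returns (round_number, url, status) for the first failure,
    or None if every request in max_rounds rounds returned 200.
    """
    for round_number in range(1, max_rounds + 1):
        for url in urls:
            status = fetch(url)
            if status != 200:
                return round_number, url, status
    return None
```

With a list of hypothetical pod URLs in `pod_urls`, a call such as `poll_until_failure(pod_urls, lambda u: fetch_status(u, token, "my-project"))` would report which pod failed first and in which round.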

The customer has provided tcpdumps, thread dumps, and some additional information that will be provided in a private update shortly.

Comment 11 Peng Li 2017-01-25 13:32:26 UTC

Set up a cluster with 1 master and 3 nodes (VM type: m3.large), installed 3.4.1 metrics, and observed the UI for 2 hours; the issue was not reproduced.

# openshift version
openshift v3.4.1.2
kubernetes v1.4.0+776c994
etcd 3.1.0-rc.0

#image
metrics-hawkular-metrics   3.4.1               ea4c68d376ca        18 hours ago        1.5 GB


# oc get pod
NAME                         READY     STATUS    RESTARTS   AGE
hawkular-cassandra-1-48ofe   1/1       Running   0          3h
hawkular-cassandra-2-rlzbk   1/1       Running   1          3h
hawkular-cassandra-3-3peho   1/1       Running   0          3h
hawkular-metrics-dka4f       1/1       Running   0          2h
hawkular-metrics-n5mgv       1/1       Running   0          3h
hawkular-metrics-tvy9w       1/1       Running   0          2h

Comment 19 Peter Ruan 2017-01-30 23:01:25 UTC

Marking bug as verified per comment #18.

Comment 20 Mike Fiedler 2017-01-31 00:48:59 UTC

Verified on 3.4.1.2. Requests with invalid tokens no longer hang indefinitely. Missing tokens and invalid endpoints were also tested, and both worked well.
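
A check along these lines can be sketched as follows. This is not the procedure QA actually ran: the function names, the 2-second limit, and the use of a bearer token header are assumptions for illustration. The idea is to time each request and treat any prompt HTTP answer, whatever its status code, as evidence that the connection was closed instead of being held open until a timeout.

```python
import socket
import time
import urllib.error
import urllib.request


def probe(url, token=None, timeout=5.0):
    """Send one GET and return (outcome, elapsed_seconds).

    outcome is the HTTP status code if the server answered,
    "timeout" if no answer arrived in time, or "error" otherwise.
    Passing token=None sends the request without an Authorization header.
    """
    headers = {"Authorization": f"Bearer {token}"} if token is not None else {}
    request = urllib.request.Request(url, headers=headers)
    start = time.monotonic()
    try:
        with urllib.request.urlopen(request, timeout=timeout) as response:
            outcome = response.status
    except urllib.error.HTTPError as err:
        outcome = err.code
    except urllib.error.URLError as err:
        outcome = "timeout" if isinstance(err.reason, socket.timeout) else "error"
    except socket.timeout:
        outcome = "timeout"
    except OSError:
        outcome = "error"
    return outcome, time.monotonic() - start


def closed_promptly(outcome, elapsed, limit=2.0):
    """True if the server answered with some HTTP status within limit seconds."""
    return isinstance(outcome, int) and elapsed < limit
```

For example, `closed_promptly(*probe(url, token="not-a-real-token"))` would be expected to return True on a fixed version, while a hanging connection would show up as a `"timeout"` outcome.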

Comment 22 errata-xmlrpc 2017-01-31 20:19:37 UTC

Since the problem described in this bug report should be
resolved in a recent advisory, it has been closed with a
resolution of ERRATA.

For information on the advisory, and where to find the updated
files, follow the link below.

If the solution does not work for you, open a new bug report.

https://access.redhat.com/errata/RHBA-2017:0218