Bug 1410899 - Metrics - Could not acquire a Kubernetes client connection
Summary: Metrics - Could not acquire a Kubernetes client connection
Keywords:
Status: CLOSED ERRATA
Alias: None
Product: OpenShift Container Platform
Classification: Red Hat
Component: Hawkular
Version: 3.3.1
Hardware: Unspecified
OS: Unspecified
Priority: medium
Severity: high
Target Milestone: ---
Target Release: 3.4.z
Assignee: Matt Wringe
QA Contact: Mike Fiedler
URL:
Whiteboard:
Depends On:
Blocks: 1448999
Reported: 2017-01-06 19:04 UTC by Eric Jones
Modified: 2020-12-14 07:59 UTC (History)

Fixed In Version:
Doc Type: Bug Fix
Doc Text:
Cause: When authenticating users, Hawkular Metrics did not properly handle an error response from the OpenShift master for a subjectaccessreview.
Consequence: If the authentication token passed was not valid, the connection to Hawkular Metrics stayed open until it timed out.
Fix: Hawkular Metrics now properly handles an error response from the OpenShift master and closes the connection.
Result: If a user passes an invalid token, the connection is closed promptly instead of remaining open until a timeout.
Clone Of:
Clones: 1448999
Environment:
Last Closed: 2017-01-31 20:19:37 UTC
Target Upstream Version:


Links
System: Red Hat Product Errata
ID: RHBA-2017:0218
Private: 0
Priority: normal
Status: SHIPPED_LIVE
Summary: Red Hat OpenShift Container Platform 3.4.1.2 bug fix update
Last Updated: 2017-02-01 01:18:20 UTC

Description Eric Jones 2017-01-06 19:04:28 UTC
Description of problem:
After scaling the cassandra and hawkular-metrics pods from 1 to 3 replicas, hawkular-metrics periodically becomes unreachable.

The web UI shows a gateway timeout, and curling hawkular-metrics directly returns "Could not acquire a Kubernetes client connection".

The customer appears to be able to restore normal operation temporarily by restarting the pods.

After a discussion on an internal mailing list [0], it was recommended that a Bugzilla be opened.

[0] http://post-office.corp.redhat.com/archives/openshift-sme/2017-January/msg00178.html

Version-Release number of selected component (if applicable):

Steps to Reproduce:
1. Deploy metrics on OCP 3.3 via Ansible.
2. After installation, scale Cassandra (via the cassandra-node template) to 3 pods and scale the hawkular-metrics pod to 3 replicas.
3. Once the pods have been deployed, curl the 3 hawkular-metrics pods in a loop until a request fails.
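
For readers who want to script the curl loop described above, here is a minimal Python sketch. It is not the command the customer ran, which is not recorded in this report. The function names, the token and tenant arguments, and the retry delay are all assumptions made for illustration, and the pod URLs must be supplied by the caller. The sketch also assumes the endpoint certificate is trusted by the local Python installation.

```python
import time
import urllib.error
import urllib.request


def probe(url, token, tenant, timeout=10.0):
    """Send one GET to a single pod URL and return (ok, detail).

    The Authorization and Hawkular-Tenant headers are set from the
    caller-supplied values; nothing here is taken from the bug report.
    """
    request = urllib.request.Request(
        url,
        headers={
            "Authorization": "Bearer " + token,
            "Hawkular-Tenant": tenant,
        },
    )
    try:
        with urllib.request.urlopen(request, timeout=timeout) as response:
            return True, "HTTP %d" % response.status
    except urllib.error.HTTPError as err:
        # The server answered, but with an error status.
        return False, "HTTP %d" % err.code
    except OSError as err:
        # Covers connection errors and timeouts (URLError is an OSError).
        return False, str(err)


def poll_until_failure(urls, check, max_rounds=None, delay=1.0, sleep=time.sleep):
    """Call check(url) for every URL, round after round, until one fails.

    Returns (round_number, url, detail) for the first failure, or None if
    max_rounds rounds complete without any failure.
    """
    rounds = 0
    while max_rounds is None or rounds < max_rounds:
        rounds += 1
        for url in urls:
            ok, detail = check(url)
            if not ok:
                return rounds, url, detail
        sleep(delay)
    return None
```

A caller would pass the three pod URLs and a check such as `lambda u: probe(u, token, tenant)`; the returned tuple identifies which pod failed first and in which round.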

The customer has provided tcpdumps, thread dumps, and some additional information, which will be provided in a private update shortly.

Comment 11 Peng Li 2017-01-25 13:32:26 UTC
Set up a cluster with 1 master and 3 nodes (VM type: m3.large), installed 3.4.1 metrics, and observed the UI for 2 hours; the issue was not reproduced.

# openshift version
openshift v3.4.1.2
kubernetes v1.4.0+776c994
etcd 3.1.0-rc.0

# image
metrics-hawkular-metrics   3.4.1               ea4c68d376ca        18 hours ago        1.5 GB


# oc get pod
NAME                         READY     STATUS    RESTARTS   AGE
hawkular-cassandra-1-48ofe   1/1       Running   0          3h
hawkular-cassandra-2-rlzbk   1/1       Running   1          3h
hawkular-cassandra-3-3peho   1/1       Running   0          3h
hawkular-metrics-dka4f       1/1       Running   0          2h
hawkular-metrics-n5mgv       1/1       Running   0          3h
hawkular-metrics-tvy9w       1/1       Running   0          2h

Comment 19 Peter Ruan 2017-01-30 23:01:25 UTC
Marking bug as verified per comment #18.

Comment 20 Mike Fiedler 2017-01-31 00:48:59 UTC
Verified on 3.4.1.2. Requests with invalid tokens no longer hang indefinitely. Missing tokens and invalid endpoints were also tested, and both were handled correctly.
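
A check of this kind could be scripted roughly as in the Python sketch below. It illustrates the idea only and is not the procedure QA actually used, which is not recorded here. The function name, the default tenant value, and the injectable `opener` parameter are assumptions. The sketch times a single request and reports whether the server answered with a status code or the client gave up after the timeout; the latter is the hang this bug describes.

```python
import socket
import time
import urllib.error
import urllib.request


def classify_request(url, token=None, tenant="example-project",
                     timeout=30.0, opener=urllib.request.urlopen):
    """Send one GET and return (outcome, elapsed_seconds).

    outcome is "status NNN" when the server answered, "timeout" when no
    answer arrived within `timeout` seconds, or "error: ..." otherwise.
    Pass token=None to simulate a request with a missing token.
    """
    headers = {"Hawkular-Tenant": tenant}  # tenant value is a placeholder
    if token is not None:
        headers["Authorization"] = "Bearer " + token
    request = urllib.request.Request(url, headers=headers)

    started = time.monotonic()
    try:
        with opener(request, timeout=timeout) as response:
            outcome = "status %d" % response.status
    except urllib.error.HTTPError as err:
        # An error status still means the connection was answered and closed.
        outcome = "status %d" % err.code
    except socket.timeout:
        # Read timeouts surface directly rather than wrapped in URLError.
        outcome = "timeout"
    except urllib.error.URLError as err:
        if isinstance(err.reason, socket.timeout):
            outcome = "timeout"
        else:
            outcome = "error: %s" % err.reason
    return outcome, time.monotonic() - started
```

With the fix in place, a request carrying a deliberately invalid token would be expected to come back as a "status ..." outcome well before the timeout, whereas on an affected version it would be expected to report "timeout".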

Comment 22 errata-xmlrpc 2017-01-31 20:19:37 UTC
Since the problem described in this bug report should be
resolved in a recent advisory, it has been closed with a
resolution of ERRATA.

For information on the advisory, and where to find the updated
files, follow the link below.

If the solution does not work for you, open a new bug report.

https://access.redhat.com/errata/RHBA-2017:0218

