Description of problem:
After scaling the cassandra and hawkular-metrics pods from 1 to 3 replicas, hawkular-metrics periodically becomes unreachable. The WebUI shows a gateway timeout, and curling the service directly returns "Could not acquire a Kubernetes client connection". The customer appears to be able to restore normal usage temporarily by restarting the pods.

After discussion on an internal mailing list [0], it was recommended that a bugzilla be opened.

[0] http://post-office.corp.redhat.com/archives/openshift-sme/2017-January/msg00178.html

Version-Release number of selected component (if applicable):

Steps to Reproduce:
1. Deploy metrics on OCP 3.3 via Ansible.
2. Post-installation, scale cassandra (via the cassandra-node template) to 3 pods and the hawkular-metrics pod to 3.
3. Once the pods have been deployed, cURL the 3 pods in a loop until failure.

The customer has provided tcpdumps, thread dumps, and some additional information, which will be provided in a private update shortly.
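For anyone trying to reproduce step 3 without a shell loop, the following is a minimal sketch of the same idea in Python. It is not what the customer ran. The pod URLs, the status path, and the bearer token in the commented usage are placeholders (assumptions), and TLS trust for the pod certificates has to be set up separately.

```python
import time
import urllib.error
import urllib.request


def http_status(url, token, timeout=10.0):
    """Return the HTTP status for a GET on url, or None on timeout/connection failure."""
    req = urllib.request.Request(url, headers={"Authorization": "Bearer " + token})
    try:
        with urllib.request.urlopen(req, timeout=timeout) as resp:
            return resp.status
    except urllib.error.HTTPError as err:
        # The server answered, but with an error status (e.g. 504).
        return err.code
    except OSError:
        # Covers URLError, socket timeouts, and connection resets.
        return None


def poll_until_failure(urls, check, max_rounds=1000, delay=1.0):
    """Cycle through urls until a check is not 200.

    Returns (round_number, url, status) for the first failure,
    or None if every check passed for max_rounds rounds.
    """
    for round_no in range(1, max_rounds + 1):
        for url in urls:
            status = check(url)
            if status != 200:
                return round_no, url, status
        time.sleep(delay)
    return None


# Hypothetical usage; the addresses and token below are placeholders:
# pods = ["https://10.1.0.11:8443/hawkular/metrics/status",
#         "https://10.1.0.12:8443/hawkular/metrics/status",
#         "https://10.1.0.13:8443/hawkular/metrics/status"]
# print(poll_until_failure(pods, lambda u: http_status(u, "TOKEN")))
```

The check is passed in as a function so the loop can be exercised without a cluster; a status of None means the pod did not answer within the timeout.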
Set up a cluster with 1 master and 3 nodes (VM type: m3.large), installed 3.4.1 metrics, and observed the UI for 2 hours. The issue was not reproduced.

# openshift version
openshift v3.4.1.2
kubernetes v1.4.0+776c994
etcd 3.1.0-rc.0

# image
metrics-hawkular-metrics   3.4.1   ea4c68d376ca   18 hours ago   1.5 GB

# oc get pod
NAME                         READY     STATUS    RESTARTS   AGE
hawkular-cassandra-1-48ofe   1/1       Running   0          3h
hawkular-cassandra-2-rlzbk   1/1       Running   1          3h
hawkular-cassandra-3-3peho   1/1       Running   0          3h
hawkular-metrics-dka4f       1/1       Running   0          2h
hawkular-metrics-n5mgv       1/1       Running   0          3h
hawkular-metrics-tvy9w       1/1       Running   0          2h
Marking bug as verified per comment #18.
Verified on 3.4.1.2. Requests with invalid tokens no longer hang indefinitely. Also tested were missing tokens and invalid endpoints, both of which worked well.
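A rough sketch of how the three cases above (invalid token, missing token, invalid endpoint) could be rechecked for hangs. This is not the procedure used for the verification. The URLs and tokens in the commented usage are placeholders (assumptions), and a request counts as hanging if no response arrives within the chosen timeout.

```python
import socket
import urllib.error
import urllib.request


def probe(url, token=None, timeout=15.0):
    """GET url; return the HTTP status, "timeout", or "unreachable"."""
    headers = {"Authorization": "Bearer " + token} if token else {}
    req = urllib.request.Request(url, headers=headers)
    try:
        with urllib.request.urlopen(req, timeout=timeout) as resp:
            return resp.status
    except urllib.error.HTTPError as err:
        # A prompt error response (e.g. 401 or 404) is not a hang.
        return err.code
    except urllib.error.URLError as err:
        # Timeouts while connecting arrive wrapped in URLError.
        if isinstance(err.reason, (socket.timeout, TimeoutError)):
            return "timeout"
        return "unreachable"
    except (socket.timeout, TimeoutError):
        # Timeouts while waiting for the response are raised directly.
        return "timeout"
    except OSError:
        return "unreachable"


def hanging_cases(cases, probe_fn=probe):
    """Return the names of cases whose request timed out.

    cases maps a case name to a (url, token) pair.
    """
    return [name for name, (url, token) in cases.items()
            if probe_fn(url, token) == "timeout"]


# Hypothetical usage; host, paths, and tokens are placeholders:
# base = "https://metrics.example.com/hawkular/metrics"
# print(hanging_cases({
#     "invalid token": (base + "/metrics", "not-a-real-token"),
#     "missing token": (base + "/metrics", None),
#     "invalid endpoint": (base + "/no-such-endpoint", "VALID_TOKEN"),
# }))
```

An empty list means every case was answered within the timeout, whether with a success or an error status.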
Since the problem described in this bug report should be resolved in a recent advisory, it has been closed with a resolution of ERRATA. For information on the advisory, and where to find the updated files, follow the link below. If the solution does not work for you, open a new bug report. https://access.redhat.com/errata/RHBA-2017:0218