Description of problem:
After scaling the Cassandra and hawkular-metrics pods from 1 to 3 replicas each, hawkular-metrics periodically becomes unreachable.
The web UI reports a gateway timeout, and curling the endpoint directly returns "Could not acquire a Kubernetes client connection".
The customer can temporarily restore normal operation by restarting the pods (a sketch of that workaround follows below).
After discussing the issue on an internal mailing list, it was recommended that they open a Bugzilla.
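A hedged illustration of the restart workaround, not taken from the customer's report: assuming the standard metrics-infra labels applied by the metrics deployer, deleting the pods lets their replication controllers recreate them.

# oc delete pod -l metrics-infra=hawkular-metrics
# oc delete pod -l metrics-infra=hawkular-cassandra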
Version-Release number of selected component (if applicable):
Steps to Reproduce:
1. Deploy metrics on OCP 3.3 via Ansible; after installation, scale Cassandra to 3 pods (via the cassandra-node template) and hawkular-metrics to 3 replicas.
2. Once the pods have been deployed, cURL the 3 pods in a loop until failure (see the sketch below).
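A minimal sketch of the scale-up and polling loop. The template name, parameters, and route hostname are illustrative assumptions, not taken from the original report; /hawkular/metrics/status is the unauthenticated health endpoint, so it is suitable for polling.

# oc scale rc hawkular-metrics --replicas=3
# oc process hawkular-cassandra-node-emptydir \
    -v "IMAGE_PREFIX=registry.access.redhat.com/openshift3/,IMAGE_VERSION=3.3.0,NODE=2" \
    | oc create -f -        # repeat with NODE=3 for the third Cassandra node
# while :; do
    curl -k -s -o /dev/null -w "%{http_code}\n" \
      https://hawkular-metrics.example.com/hawkular/metrics/status
    sleep 5
  done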
The customer has provided tcpdumps, thread dumps, and some additional information, which will be attached in a private update shortly.
Set up a cluster with 1 master and 3 nodes (VM type: m3.large), installed 3.4.1 metrics, and observed the UI for 2 hours; the issue was not reproduced.
# docker images | grep metrics-hawkular-metrics
metrics-hawkular-metrics   3.4.1     ea4c68d376ca   18 hours ago   1.5 GB
# oc get pod
NAME                         READY     STATUS    RESTARTS   AGE
hawkular-cassandra-1-48ofe   1/1       Running   0          3h
hawkular-cassandra-2-rlzbk   1/1       Running   1          3h
hawkular-cassandra-3-3peho   1/1       Running   0          3h
hawkular-metrics-dka4f       1/1       Running   0          2h
hawkular-metrics-n5mgv       1/1       Running   0          3h
hawkular-metrics-tvy9w       1/1       Running   0          2h
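To confirm the three Cassandra pods actually joined a single cluster, nodetool can be run inside one of the pods (pod name taken from the listing above; this check is a suggestion, not part of the original verification):

# oc exec hawkular-cassandra-1-48ofe -- nodetool status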
Marking bug as VERIFIED per comment #18.
Verified on 188.8.131.52. Requests with invalid tokens no longer hang indefinitely. Missing tokens and invalid endpoints were also tested, and both were handled correctly.
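For reference, a sketch of the kind of request exercised during verification; the hostname is a placeholder, and the expected behavior is a prompt authentication error rather than an indefinite hang.

# curl -k -H "Authorization: Bearer invalid-token" \
    -H "Hawkular-Tenant: _system" \
    https://hawkular-metrics.example.com/hawkular/metrics/metrics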
Since the problem described in this bug report should be resolved in a recent advisory, it has been closed with a resolution of ERRATA.

For information on the advisory, and where to find the updated files, follow the link below.

If the solution does not work for you, open a new bug report.