Red Hat Bugzilla – Bug 1469423
[3.6] hawkular-metrics pod takes a long time to become ready when openshift_metrics_hawkular_replicas is set to a non-default value
Last modified: 2018-01-30 20:58:16 EST
Description of problem:
Tested on GCE (vm_type: n1-standard-2) and OpenStack (vm_type: m1.large).
Set openshift_metrics_hawkular_replicas to a non-default value, such as 2, in the inventory file, and deploy metrics via ansible.
The hawkular-metrics pod failed its readiness probe and exceeded its timeout (500s), then restarted once before becoming ready. Describing the hawkular-metrics pod showed normal info.
This did not happen when scaling the hawkular-metrics rc up to 2; in that case the hawkular-metrics pod became running in only 2-3 minutes.
Version-Release number of selected component (if applicable):
# openshift version
metrics images from brew registry
metrics-hawkular-metrics v3.6.140-1 3a5bebd0476a 24 hours ago 1.293 GB
metrics-cassandra v3.6.140-1 9644ec21e399 24 hours ago 573.2 MB
metrics-heapster v3.6.140-1 5549c67d8607 24 hours ago 274.4 MB
# rpm -qa | grep openshift-ansible
Steps to Reproduce:
1. Set openshift_metrics_hawkular_replicas=2 in inventory file, and deploy metrics via ansible
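For reference, the relevant inventory fragment would look roughly like this (a sketch only: the `[OSEv3:vars]` group name and the install flag are the usual openshift-ansible conventions; only the replicas line comes from this report):

```ini
# Sketch of the inventory change; only the replicas line is from this report.
[OSEv3:vars]
# deploy the metrics stack via the openshift-ansible metrics role
openshift_metrics_install_metrics=true
# non-default replica count that triggers the slow startup
openshift_metrics_hawkular_replicas=2
```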
Actual results:
hawkular-metrics pod failed its readiness probe, exceeded its timeout, and then restarted once before becoming ready.
Expected results:
hawkular-metrics pod should not take so long to become ready.
Are you sure this is related to setting the initial number of Hawkular Metric instances to 2?
When Hawkular Metrics starts up, it waits for Cassandra to become ready. If it takes a while to fully start Cassandra (including downloading the Cassandra image), it will appear that it's taking Hawkular Metrics a long time to start up.
If Cassandra is already running, then deploying another Hawkular Metrics pod is expected to take much less time to get running.
Does this long delay also occur if only 1 pod is specified for Hawkular Metrics? For the pods that took a long time to get deployed, do you have the logs for those?
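The startup ordering described above amounts to a poll-with-timeout loop. A minimal sketch of that pattern (illustrative only, not the actual container scripts; in the pod the check would be a probe against the Cassandra service rather than the trivial `true` used here):

```shell
#!/bin/sh
# Illustrative sketch: poll a readiness check until it passes or a
# timeout (in seconds) elapses, the same shape as waiting on Cassandra.
wait_for_ready() {
  timeout=$1; shift
  elapsed=0
  until "$@"; do
    if [ "$elapsed" -ge "$timeout" ]; then
      echo "timed out after ${timeout}s" >&2
      return 1
    fi
    sleep 1
    elapsed=$((elapsed + 1))
  done
  echo "ready after ${elapsed}s"
}

# Trivially-true check for demonstration; prints "ready after 0s".
wait_for_ready 500 true
```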
(In reply to Matt Wringe from comment #1)
> Are you sure this is related to setting the initial number of Hawkular
> Metric instances to 2?
Yes. The default value of openshift_metrics_hawkular_replicas is 1, and that case does not have this issue; it only happens when we set openshift_metrics_hawkular_replicas to a non-default value and deploy metrics on OCP for the first time. Checking the hawkular-metrics pod logs, a TimeoutException was found.
If we undeploy metrics and re-deploy with the same configuration, the Hawkular Metrics instances start up in around 3-4 minutes and there is no TimeoutException in the hawkular-metrics pod logs; see the attached file.
> Does this long delay also occur if only 1 pod is specified for Hawkular
> Metrics? For the pods that took a long time to get deployed, do you have the
> logs for those?
There is no long delay if only 1 pod is specified for Hawkular Metrics.
Created attachment 1310439 [details]
events and hawkular_metrics pod log, openshift_metrics_hawkular_replicas set to a non-default value
Created attachment 1310440 [details]
events and hawkular_metrics pod log, undeploy metrics and re-deploy metrics
Created attachment 1310441 [details]
ansible inventory file
This can be reproduced. This looks like a bug with the JGroups clustering for OpenShift/Kubernetes that needs to be resolved. We may also be able to get around this by delaying deployment of the second Hawkular Metrics pod.
Hawkular Metrics will eventually kill the pod as expected and the restarted pod will be able to connect to the cluster properly.
(In reply to Matt Wringe from comment #6)
> This can be reproduced. This looks like its a bug with the jgroups
> clustering for OpenShift/Kubernetes that need to be resolved. We may also be
> able to get around this by delaying deploying a second Hawkular Metrics pod.
> Hawkular Metrics will eventually kill the pod as expected and the restarted
> pod will be able to connect to the cluster properly.
I was reviewing the configuration in standalone.xml in origin-metrics. I am not familiar with setting up JGroups clustering. I just read http://blog.infinispan.org/2016/08/running-infinispan-cluster-on-openshift.html and am wondering, do we need to declare a kubernetes stack in the jgroups subsystem?
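For context, the kind of kubernetes stack the blog post describes would look roughly like this in the jgroups subsystem (a sketch only: the subsystem namespace version, the protocol list, and the KUBE_PING property names vary across JGroups/EAP versions, and the namespace/labels values here are illustrative):

```xml
<!-- Sketch only: namespace version and attributes differ by EAP/WildFly release. -->
<subsystem xmlns="urn:jboss:domain:jgroups:4.0" default-stack="kubernetes">
    <stack name="kubernetes">
        <transport type="TCP" socket-binding="jgroups-tcp"/>
        <!-- kubernetes.KUBE_PING discovers peers by querying the API server
             for pods in the given namespace instead of using multicast. -->
        <protocol type="kubernetes.KUBE_PING">
            <property name="namespace">openshift-infra</property>
            <property name="labels">metrics-infra=hawkular-metrics</property>
        </protocol>
        <protocol type="MERGE3"/>
        <protocol type="FD_SOCK" socket-binding="jgroups-tcp-fd"/>
        <protocol type="pbcast.NAKACK2"/>
        <protocol type="pbcast.GMS"/>
    </stack>
</subsystem>
```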
We do things a bit differently in origin-metrics than in OCP (for origin we bring in the jars ourselves; in OCP they are provided by the EAP container).
The setup between the two in standalone.xml is about the same, although since they are using different versions, there is a slight difference in naming.
It is the same error if openshift_metrics_cassandra_replicas is set to a non-default value.
# openshift version
openshift v3.7.0-0.190.0
kubernetes v1.7.6+a08f5eeb62
etcd 3.2.8
images
metrics-heapster/images/v3.7.0-0.190.0.0
metrics-cassandra/images/v3.7.0-0.190.0.0
metrics-hawkular-metrics/images/v3.7.0-0.190.0.0
Created attachment 1347212 [details]
metrics pods log, openshift_metrics_cassandra_replicas set to a non-default value
(In reply to Junqi Zhao from comment #9)
> It is the same error if set openshift_metrics_cassandra_replicas as non
> default value.
> # openshift version
> openshift v3.7.0-0.190.0
> kubernetes v1.7.6+a08f5eeb62
> etcd 3.2.8
When you say it is the same error, are you referring to the long start up time or something else?
We can reproduce the JGroups exception in hawkular-metrics, although not entirely consistently. Cassandra, however, does not use JGroups.
I did see Cassandra initialization run for ~11 minutes, and it was not finished when the logs cut off.
(In reply to John Sanda from comment #11)
> (In reply to Junqi Zhao from comment #9)
> > It is the same error if set openshift_metrics_cassandra_replicas as non
> > default value.
> > # openshift version
> > openshift v3.7.0-0.190.0
> > kubernetes v1.7.6+a08f5eeb62
> > etcd 3.2.8
> > images
> > metrics-heapster/images/v3.7.0-0.190.0.0
> > metrics-cassandra/images/v3.7.0-0.190.0.0
> > metrics-hawkular-metrics/images/v3.7.0-0.190.0.0
> When you say it is the same error, are you referring to the long start up
> time or something else?
Yes, I mean the start-up time is long.