Bug 1469423 - hawkular-metrics pod took a long time to become running if set openshift_metrics_hawkular_replicas as non-default value
Status: ASSIGNED
Product: OpenShift Container Platform
Classification: Red Hat
Component: Metrics
Version: 3.6.0
Hardware/OS: Unspecified
Priority: low
Severity: low
Target Release: 3.7.z
Assigned To: John Sanda
Reporter: Junqi Zhao
Reported: 2017-07-11 04:57 EDT by Junqi Zhao
Modified: 2018-01-05 11:39 EST (History)
4 users

Type: Bug

Attachments
events and hawkular_metrics pod log, openshift_metrics_hawkular_replicas as non-default value (382.57 KB, text/plain)
2017-08-08 01:33 EDT, Junqi Zhao
events and hawkular_metrics pod log,undeploy metrics and re-deploy metrics (195.46 KB, text/plain)
2017-08-08 01:34 EDT, Junqi Zhao
ansible inventory file (479 bytes, text/plain)
2017-08-08 01:36 EDT, Junqi Zhao
metrics pods log, set openshift_metrics_cassandra_replicas as non default value (566.28 KB, text/plain)
2017-11-03 04:45 EDT, Junqi Zhao

Description Junqi Zhao 2017-07-11 04:57:03 EDT
Description of problem:
Tested on GCE (vm_type: n1-standard-2) and OpenStack (vm_type: m1.large).
Set openshift_metrics_hawkular_replicas to a non-default value, such as 2, in the inventory file, and deploy metrics via Ansible.

The hawkular-metrics pod failed its readiness probe and exceeded its timeout (500s), then restarted once before becoming ready. Describing the hawkular-metrics pod showed normal info.

This did not happen when scaling the hawkular-metrics rc up to 2; in that case the hawkular-metrics pod became running within 2-3 minutes.
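For context, readiness for this pod is gated by an exec-style probe in the replication controller. A hypothetical sketch of that portion of the pod spec (the script path and timing values are assumptions, not taken from this report):

```yaml
# Hypothetical sketch of the hawkular-metrics readiness probe;
# script path and timings are assumptions, not the shipped template.
readinessProbe:
  exec:
    command:
    - /opt/hawkular/scripts/hawkular-metrics-readiness.py
  initialDelaySeconds: 30
  timeoutSeconds: 10
```

The 500s window mentioned above is the overall wait for the pod to become ready, not an individual probe timeout.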

Version-Release number of selected component (if applicable):
# openshift version
openshift v3.6.140
kubernetes v1.6.1+5115d708d7
etcd 3.2.1

metrics images from brew registry
metrics-hawkular-metrics   v3.6.140-1          3a5bebd0476a        24 hours ago        1.293 GB
metrics-cassandra          v3.6.140-1          9644ec21e399        24 hours ago        573.2 MB
metrics-heapster           v3.6.140-1          5549c67d8607        24 hours ago        274.4 MB


# rpm -qa | grep openshift-ansible
openshift-ansible-callback-plugins-3.6.140-1.git.0.4a02427.el7.noarch
openshift-ansible-playbooks-3.6.140-1.git.0.4a02427.el7.noarch
openshift-ansible-3.6.140-1.git.0.4a02427.el7.noarch
openshift-ansible-lookup-plugins-3.6.140-1.git.0.4a02427.el7.noarch
openshift-ansible-roles-3.6.140-1.git.0.4a02427.el7.noarch
openshift-ansible-docs-3.6.140-1.git.0.4a02427.el7.noarch
openshift-ansible-filter-plugins-3.6.140-1.git.0.4a02427.el7.noarch

How reproducible:
Always

Steps to Reproduce:
1. Set openshift_metrics_hawkular_replicas=2 in the inventory file and deploy metrics via Ansible
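A minimal inventory fragment for this scenario might look like the following (hostnames and all variables other than the metrics replica setting from this report are placeholders):

```ini
; Hypothetical inventory sketch; hostnames are placeholders.
[OSEv3:children]
masters
nodes

[OSEv3:vars]
openshift_metrics_install_metrics=true
; Non-default replica count that triggers the long startup in this report
openshift_metrics_hawkular_replicas=2

[masters]
master.example.com

[nodes]
node1.example.com
```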

Actual results:
The hawkular-metrics pod failed its readiness probe, exceeded its timeout, and then restarted once before becoming ready.

Expected results:
The hawkular-metrics pod should not take so long to become ready.

Additional info:
Comment 1 Matt Wringe 2017-07-24 12:12:33 EDT
Are you sure this is related to setting the initial number of Hawkular Metric instances to 2?

When Hawkular Metrics starts up, it will wait for Cassandra to become ready. If it takes a while to fully start up Cassandra (including downloading the Cassandra image), it will appear that it's taking Hawkular Metrics a long time to start up.

If Cassandra is already running, then deploying another Hawkular Metrics pod is expected to take much less time to get running.

Does this long delay also occur if only 1 pod is specified for Hawkular Metrics? For the pods that took a long time to get deployed, do you have the logs for those?
Comment 2 Junqi Zhao 2017-08-08 01:31:30 EDT
(In reply to Matt Wringe from comment #1)
> Are you sure this is related to setting the initial number of Hawkular
> Metric instances to 2?

Yes. The default value of openshift_metrics_hawkular_replicas is 1, and with the default there is no issue. The issue only happens when we set openshift_metrics_hawkular_replicas to a non-default value and deploy metrics on OCP for the first time; checking the hawkular-metrics pod logs, a TimeoutException was found.

If we undeploy metrics and re-deploy them with the same configuration, the Hawkular Metrics instances start up in around 3-4 minutes, and there is no TimeoutException in the hawkular-metrics pod logs; see the attached file.

 
> Does this long delay also occur if only 1 pod is specified for Hawkular
> Metrics? For the pods that took a long time to get deployed, do you have the
> logs for those?

There is no long delay if only 1 pod is specified for Hawkular Metrics.
Comment 3 Junqi Zhao 2017-08-08 01:33 EDT
Created attachment 1310439 [details]
events and hawkular_metrics pod log, openshift_metrics_hawkular_replicas as non-default value
Comment 4 Junqi Zhao 2017-08-08 01:34 EDT
Created attachment 1310440 [details]
events and hawkular_metrics pod log,undeploy metrics and re-deploy metrics
Comment 5 Junqi Zhao 2017-08-08 01:36 EDT
Created attachment 1310441 [details]
ansible inventory file
Comment 6 Matt Wringe 2017-08-08 11:39:34 EDT
This can be reproduced. It looks like a bug with the JGroups clustering for OpenShift/Kubernetes that needs to be resolved. We may also be able to work around this by delaying the deployment of the second Hawkular Metrics pod.

Hawkular Metrics will eventually kill the pod as expected, and the restarted pod will be able to connect to the cluster properly.
Comment 7 John Sanda 2017-10-09 15:25:24 EDT
(In reply to Matt Wringe from comment #6)
> This can be reproduced. This looks like its a bug with the jgroups
> clustering for OpenShift/Kubernetes that need to be resolved. We may also be
> able to get around this by delaying deploying a second Hawkular Metrics pod.
> 
> Hawkular Metrics will eventually kill the pod as expected and the restarted
> pod will be able to connect to the cluster properly.

I was reviewing the configuration in standalone.xml in origin-metrics. I am not familiar with setting up JGroups clustering. I just read http://blog.infinispan.org/2016/08/running-infinispan-cluster-on-openshift.html and am wondering: do we need to declare a kubernetes stack in the jgroups subsystem?
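For illustration, a Kubernetes-aware discovery stack in the jgroups subsystem might look roughly like this, following the approach in the Infinispan blog post linked above. This is a sketch, not the actual origin-metrics configuration; the namespace/label property values and the environment variable are assumptions:

```xml
<!-- Hypothetical sketch of a KUBE_PING-based jgroups stack;
     property values and the POD_NAMESPACE variable are assumptions. -->
<subsystem xmlns="urn:jboss:domain:jgroups:4.0" default-stack="kubernetes">
    <stack name="kubernetes">
        <transport type="TCP" socket-binding="jgroups-tcp"/>
        <!-- KUBE_PING discovers cluster members by querying the
             Kubernetes API for pods matching a namespace and labels -->
        <protocol type="kubernetes.KUBE_PING">
            <property name="namespace">${env.POD_NAMESPACE}</property>
            <property name="labels">metrics-infra=hawkular-metrics</property>
        </protocol>
        <protocol type="MERGE3"/>
        <protocol type="FD_ALL"/>
        <protocol type="VERIFY_SUSPECT"/>
        <protocol type="pbcast.NAKACK2"/>
        <protocol type="UNICAST3"/>
        <protocol type="pbcast.STABLE"/>
        <protocol type="pbcast.GMS"/>
        <protocol type="MFC"/>
        <protocol type="FRAG2"/>
    </stack>
</subsystem>
```

With such a stack, two hawkular-metrics pods would discover each other through the API server rather than multicast, which is typically unavailable in the SDN.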
Comment 8 Matt Wringe 2017-10-10 09:44:59 EDT
We do things a bit differently in origin-metrics than in OCP (we bring in the jars for origin; in OCP they are provided by the EAP container).

Eg: https://github.com/openshift/origin-metrics/blob/master/hawkular-metrics/Dockerfile#L48

The setup between the two in standalone.xml is about the same, although since they use different versions, there is a slight difference in naming.

Eg: https://github.com/openshift/origin-metrics/blob/master/hawkular-metrics/standalone.xml#L322
Comment 9 Junqi Zhao 2017-11-03 04:44:18 EDT
The same error occurs if openshift_metrics_cassandra_replicas is set to a non-default value.

# openshift version
openshift v3.7.0-0.190.0
kubernetes v1.7.6+a08f5eeb62
etcd 3.2.8

images
metrics-heapster/images/v3.7.0-0.190.0.0
metrics-cassandra/images/v3.7.0-0.190.0.0
metrics-hawkular-metrics/images/v3.7.0-0.190.0.0
Comment 10 Junqi Zhao 2017-11-03 04:45 EDT
Created attachment 1347212 [details]
metrics pods log, set openshift_metrics_cassandra_replicas as non default value
Comment 11 John Sanda 2017-11-03 10:02:12 EDT
(In reply to Junqi Zhao from comment #9)
> It is the same error if set openshift_metrics_cassandra_replicas as non
> default value.
> 
> # openshift version
> openshift v3.7.0-0.190.0
> kubernetes v1.7.6+a08f5eeb62
> etcd 3.2.8
> 
> images
> metrics-heapster/images/v3.7.0-0.190.0.0
> metrics-cassandra/images/v3.7.0-0.190.0.0
> metrics-hawkular-metrics/images/v3.7.0-0.190.0.0

When you say it is the same error, are you referring to the long start up time or something else? 

We can reproduce the JGroups exception in hawkular-metrics, although not entirely consistently. Cassandra however does not use JGroups.

I did see that Cassandra initialization ran for ~11 minutes and had not finished when the logs cut off.
Comment 12 Junqi Zhao 2017-11-13 19:15:33 EST
(In reply to John Sanda from comment #11)
> (In reply to Junqi Zhao from comment #9)
> > It is the same error if set openshift_metrics_cassandra_replicas as non
> > default value.
> > 
> > # openshift version
> > openshift v3.7.0-0.190.0
> > kubernetes v1.7.6+a08f5eeb62
> > etcd 3.2.8
> > 
> > images
> > metrics-heapster/images/v3.7.0-0.190.0.0
> > metrics-cassandra/images/v3.7.0-0.190.0.0
> > metrics-hawkular-metrics/images/v3.7.0-0.190.0.0
> 
> When you say it is the same error, are you referring to the long start up
> time or something else? 
Yes, I mean the startup time is long.
