Bug 1469423

Summary: [3.6]hawkular-metrics pod took a long time to become running if set openshift_metrics_hawkular_replicas as non-default value
Product: OpenShift Container Platform
Reporter: Junqi Zhao <juzhao>
Component: Hawkular
Assignee: Ruben Vargas Palma <rvargasp>
Status: CLOSED DEFERRED
QA Contact: Junqi Zhao <juzhao>
Severity: low
Docs Contact:
Priority: low
Version: 3.6.0
CC: aos-bugs, juzhao
Target Milestone: ---
Target Release: 3.10.z
Hardware: Unspecified
OS: Unspecified
Whiteboard:
Fixed In Version:
Doc Type: If docs needed, set a value
Doc Text:
Story Points: ---
Clone Of:
: 1540413
Environment:
Last Closed: 2019-11-20 18:48:20 UTC
Type: Bug
Regression: ---
Mount Type: ---
Documentation: ---
CRM:
Verified Versions:
Category: ---
oVirt Team: ---
RHEL 7.3 requirements from Atomic Host:
Cloudforms Team: ---
Target Upstream Version:
Embargoed:
Bug Depends On: 1560695, 1590449, 1590451, 1592966
Bug Blocks: 1540413
Attachments:
Description | Flags
events and hawkular_metrics pod log, openshift_metrics_hawkular_replicas as non-default value | none
events and hawkular_metrics pod log,undeploy metrics and re-deploy metrics | none
ansible inventory file | none
metrics pods log, set openshift_metrics_cassandra_replicas as non default value | none
hawkular-metrics restarted pod log | none

Description Junqi Zhao 2017-07-11 08:57:03 UTC
Description of problem:
Tested on GCE (vm_type: n1-standard-2) and OpenStack (vm_type: m1.large).
Set openshift_metrics_hawkular_replicas to a non-default value, such as 2, in the inventory file, and deploy metrics via Ansible.

The hawkular-metrics pod failed its readiness probe and exceeded its timeout (500s); it then restarted once and became ready. The output of describing the hawkular-metrics pod looked normal.

This did not happen when scaling up the hawkular-metrics rc to 2; in that case the hawkular-metrics pod became running within only 2-3 minutes.

Version-Release number of selected component (if applicable):
# openshift version
openshift v3.6.140
kubernetes v1.6.1+5115d708d7
etcd 3.2.1

metrics images from brew registry
metrics-hawkular-metrics   v3.6.140-1          3a5bebd0476a        24 hours ago        1.293 GB
metrics-cassandra          v3.6.140-1          9644ec21e399        24 hours ago        573.2 MB
metrics-heapster           v3.6.140-1          5549c67d8607        24 hours ago        274.4 MB


# rpm -qa | grep openshift-ansible
openshift-ansible-callback-plugins-3.6.140-1.git.0.4a02427.el7.noarch
openshift-ansible-playbooks-3.6.140-1.git.0.4a02427.el7.noarch
openshift-ansible-3.6.140-1.git.0.4a02427.el7.noarch
openshift-ansible-lookup-plugins-3.6.140-1.git.0.4a02427.el7.noarch
openshift-ansible-roles-3.6.140-1.git.0.4a02427.el7.noarch
openshift-ansible-docs-3.6.140-1.git.0.4a02427.el7.noarch
openshift-ansible-filter-plugins-3.6.140-1.git.0.4a02427.el7.noarch

How reproducible:
Always

Steps to Reproduce:
1. Set openshift_metrics_hawkular_replicas=2 in the inventory file, and deploy metrics via Ansible.

Actual results:
The hawkular-metrics pod failed its readiness probe, exceeded its timeout, and then restarted once before becoming ready.

Expected results:
The hawkular-metrics pod should not take so long to become ready.

Additional info:

Comment 1 Matt Wringe 2017-07-24 16:12:33 UTC
Are you sure this is related to setting the initial number of Hawkular Metric instances to 2?

When Hawkular Metrics starts up, it will wait for Cassandra to become ready. If it takes a while to fully start up Cassandra (including downloading the Cassandra image), it will appear that it's taking Hawkular Metrics a long time to start up.

If Cassandra is already running, then deploying another Hawkular Metrics pod is expected to take much less time to get running.

Does this long delay also occur if only 1 pod is specified for Hawkular Metrics? For the pods that took a long time to get deployed, do you have the logs for those?

Comment 2 Junqi Zhao 2017-08-08 05:31:30 UTC
(In reply to Matt Wringe from comment #1)
> Are you sure this is related to setting the initial number of Hawkular
> Metric instances to 2?

Yes. The default value of openshift_metrics_hawkular_replicas is 1, and with the default there is no such issue. The issue only happens when we set openshift_metrics_hawkular_replicas to a non-default value and deploy metrics on OCP for the first time. I checked the hawkular-metrics pod logs and found a TimeoutException.

If we undeploy metrics and re-deploy metrics with the same configuration, the Hawkular Metrics instances start up in around 3-4 minutes, and there is no TimeoutException in the hawkular-metrics pod logs; see the attached file.

 
> Does this long delay also occur if only 1 pod is specified for Hawkular
> Metrics? For the pods that took a long time to get deployed, do you have the
> logs for those?

There is no long delay if only 1 pod is specified for Hawkular Metrics.
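
The log check described in this comment (TimeoutException present on the first deploy, absent after a re-deploy) can be repeated on saved pod logs with a small script. This is a minimal sketch and not part of the original report: the function names and the path passed to scan_log_file are hypothetical, and the match is a plain substring test on each line.

```python
from typing import Iterable, List, Tuple


def find_exception_lines(
    lines: Iterable[str], needle: str = "TimeoutException"
) -> List[Tuple[int, str]]:
    """Return (line_number, text) for every log line that mentions `needle`."""
    hits = []
    for number, line in enumerate(lines, start=1):
        if needle in line:
            hits.append((number, line.rstrip("\n")))
    return hits


def scan_log_file(path: str, needle: str = "TimeoutException") -> List[Tuple[int, str]]:
    """Scan a saved pod log; `path` is wherever the log was saved locally."""
    # errors="replace" keeps the scan going if the log contains odd bytes.
    with open(path, encoding="utf-8", errors="replace") as handle:
        return find_exception_lines(handle, needle)
```

If the observation above holds, scanning the log saved from the first deploy should return hits, while the log saved after the re-deploy should return an empty list.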

Comment 3 Junqi Zhao 2017-08-08 05:33:47 UTC
Created attachment 1310439 [details]
events and hawkular_metrics pod log, openshift_metrics_hawkular_replicas as non-default value

Comment 4 Junqi Zhao 2017-08-08 05:34:47 UTC
Created attachment 1310440 [details]
events and hawkular_metrics pod log,undeploy metrics and re-deploy metrics

Comment 5 Junqi Zhao 2017-08-08 05:36:02 UTC
Created attachment 1310441 [details]
ansible inventory file

Comment 6 Matt Wringe 2017-08-08 15:39:34 UTC
This can be reproduced. This looks like it's a bug with the jgroups clustering for OpenShift/Kubernetes that needs to be resolved. We may also be able to get around this by delaying deploying a second Hawkular Metrics pod.

Hawkular Metrics will eventually kill the pod as expected and the restarted pod will be able to connect to the cluster properly.
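
The workaround mentioned in this comment (delay the second Hawkular Metrics pod) matches the reporter's observation that scaling the hawkular-metrics rc up to 2 after deployment did not show the problem. The following is a minimal sketch of that idea, not the fix that was eventually made: it assumes metrics were deployed with the default single replica, that the rc is named hawkular-metrics as in this report, and that it lives in the openshift-infra project (an assumption; adjust to your install). The function names and the first_pod_ready callback are hypothetical.

```python
import subprocess
from typing import Callable, List, Sequence

RC_NAME = "hawkular-metrics"   # rc name as used in this report
NAMESPACE = "openshift-infra"  # assumption: project where metrics are deployed


def build_scale_command(
    replicas: int, rc_name: str = RC_NAME, namespace: str = NAMESPACE
) -> List[str]:
    """Build the `oc scale` command line for the Hawkular Metrics rc."""
    if replicas < 1:
        raise ValueError("replicas must be at least 1")
    return ["oc", "scale", "rc", rc_name, f"--replicas={replicas}", "-n", namespace]


def scale_after_first_pod_ready(
    target_replicas: int,
    first_pod_ready: Callable[[], bool],
    run: Callable[[Sequence[str]], object] = subprocess.check_call,
) -> bool:
    """Scale up only once the first pod reports ready.

    Returns True if the scale command was issued, False if it was skipped
    because the first pod is not ready yet.
    """
    if not first_pod_ready():
        return False
    run(build_scale_command(target_replicas))
    return True
```

The readiness check is left as a callback on purpose, since how readiness is determined (for example, polling the pod status) depends on the environment.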

Comment 7 John Sanda 2017-10-09 19:25:24 UTC
(In reply to Matt Wringe from comment #6)
> This can be reproduced. This looks like it's a bug with the jgroups
> clustering for OpenShift/Kubernetes that needs to be resolved. We may also be
> able to get around this by delaying deploying a second Hawkular Metrics pod.
> 
> Hawkular Metrics will eventually kill the pod as expected and the restarted
> pod will be able to connect to the cluster properly.

I was reviewing the configuration in standalone.xml in origin-metrics. I am not familiar with setting up JGroups clustering. I just read http://blog.infinispan.org/2016/08/running-infinispan-cluster-on-openshift.html and am wondering, do we need to declare a kubernetes stack in the jgroups subsystem?

Comment 8 Matt Wringe 2017-10-10 13:44:59 UTC
We do things a bit differently in origin metrics than what we do in OCP (we bring in the jars for origin, in OCP they are provided by the EAP container).

Eg: https://github.com/openshift/origin-metrics/blob/master/hawkular-metrics/Dockerfile#L48

The setup between the two in standalone.xml is about the same, although since they are using different versions, there is a slight difference in naming.

Eg: https://github.com/openshift/origin-metrics/blob/master/hawkular-metrics/standalone.xml#L322

Comment 9 Junqi Zhao 2017-11-03 08:44:18 UTC
The same error occurs if openshift_metrics_cassandra_replicas is set to a non-default value.

# openshift version
openshift v3.7.0-0.190.0
kubernetes v1.7.6+a08f5eeb62
etcd 3.2.8

images
metrics-heapster/images/v3.7.0-0.190.0.0
metrics-cassandra/images/v3.7.0-0.190.0.0
metrics-hawkular-metrics/images/v3.7.0-0.190.0.0

Comment 10 Junqi Zhao 2017-11-03 08:45:00 UTC
Created attachment 1347212 [details]
metrics pods log, set openshift_metrics_cassandra_replicas as non default value

Comment 11 John Sanda 2017-11-03 14:02:12 UTC
(In reply to Junqi Zhao from comment #9)
> The same error occurs if openshift_metrics_cassandra_replicas is set to a
> non-default value.
> 
> # openshift version
> openshift v3.7.0-0.190.0
> kubernetes v1.7.6+a08f5eeb62
> etcd 3.2.8
> 
> images
> metrics-heapster/images/v3.7.0-0.190.0.0
> metrics-cassandra/images/v3.7.0-0.190.0.0
> metrics-hawkular-metrics/images/v3.7.0-0.190.0.0

When you say it is the same error, are you referring to the long start up time or something else? 

We can reproduce the JGroups exception in hawkular-metrics, although not entirely consistently. Cassandra, however, does not use JGroups.

I did see that Cassandra initialization ran for ~11 minutes and had not finished when the logs cut off.

Comment 12 Junqi Zhao 2017-11-14 00:15:33 UTC
(In reply to John Sanda from comment #11)
> (In reply to Junqi Zhao from comment #9)
> > The same error occurs if openshift_metrics_cassandra_replicas is set to a
> > non-default value.
> > 
> > # openshift version
> > openshift v3.7.0-0.190.0
> > kubernetes v1.7.6+a08f5eeb62
> > etcd 3.2.8
> > 
> > images
> > metrics-heapster/images/v3.7.0-0.190.0.0
> > metrics-cassandra/images/v3.7.0-0.190.0.0
> > metrics-hawkular-metrics/images/v3.7.0-0.190.0.0
> 
> When you say it is the same error, are you referring to the long start up
> time or something else? 
Yes, I mean the startup time is long.

Comment 13 John Sanda 2018-05-07 19:43:44 UTC
This is getting fixed in bug 1560695.

Comment 14 John Sanda 2018-08-06 20:20:17 UTC
I am moving to ON_QA since the fix was done in bug 1560695.

Comment 15 Junqi Zhao 2018-08-09 11:38:34 UTC
It takes about 12 minutes for all the pods to reach Running status; one hawkular-metrics pod was restarted 2 times.

NAME                            READY     STATUS      RESTARTS   AGE       IP            NODE
hawkular-cassandra-1-sxwzf      1/1       Running     0          11m       10.129.0.19   *******-qeos-nrr-1
hawkular-metrics-pw9pn          1/1       Running     2          11m       10.130.0.75   *******-qeos-node-1
hawkular-metrics-schema-62gqj   0/1       Completed   0          12m       10.129.0.16   *******-qeos-nrr-1
hawkular-metrics-szgwm          1/1       Running     0          11m       10.128.0.12   *******-qeos-master-etcd-1
heapster-trzrd                  1/1       Running     0          11m       10.129.0.18   *******-qeos-nrr-1

For the restarted metrics pod's logs, please see the attached file.
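
The pod listing in this comment can also be checked mechanically for restarts. This is a minimal sketch, assuming whitespace-separated columns exactly as pasted here; the function name restarted_pods is hypothetical, and the table text is copied from this comment.

```python
from typing import Dict

# Pod listing copied verbatim from this comment (node names are masked in the report).
POD_TABLE = """\
NAME                            READY     STATUS      RESTARTS   AGE       IP            NODE
hawkular-cassandra-1-sxwzf      1/1       Running     0          11m       10.129.0.19   *******-qeos-nrr-1
hawkular-metrics-pw9pn          1/1       Running     2          11m       10.130.0.75   *******-qeos-node-1
hawkular-metrics-schema-62gqj   0/1       Completed   0          12m       10.129.0.16   *******-qeos-nrr-1
hawkular-metrics-szgwm          1/1       Running     0          11m       10.128.0.12   *******-qeos-master-etcd-1
heapster-trzrd                  1/1       Running     0          11m       10.129.0.18   *******-qeos-nrr-1
"""


def restarted_pods(table: str) -> Dict[str, int]:
    """Map pod name to restart count for every pod that restarted at least once."""
    lines = [line for line in table.splitlines() if line.strip()]
    header = lines[0].split()
    name_col = header.index("NAME")
    restarts_col = header.index("RESTARTS")
    result = {}
    for line in lines[1:]:
        fields = line.split()
        restarts = int(fields[restarts_col])
        if restarts > 0:
            result[fields[name_col]] = restarts
    return result


print(restarted_pods(POD_TABLE))  # {'hawkular-metrics-pw9pn': 2}
```

Splitting on whitespace only works because no column in this listing contains spaces; output from other client versions may need a different parser.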

parameters:
openshift_metrics_install_metrics=true
openshift_metrics_cassandra_storage_type=dynamic
openshift_metrics_hawkular_replicas=2

metrics version: v3.10.27-1

# openshift version
openshift v3.10.27

Comment 16 Junqi Zhao 2018-08-09 11:41:32 UTC
Created attachment 1474655 [details]
hawkular-metrics restarted pod log

Comment 18 Stephen Cuppett 2019-11-20 18:48:20 UTC
OCP 3.6-3.10 is no longer on full support [1]. Marking CLOSED DEFERRED. If you have a customer case with a support exception or have reproduced on 3.11+, please reopen and include those details. When reopening, please set the Target Release to the appropriate version where needed.

[1]: https://access.redhat.com/support/policy/updates/openshift