Bug 1469423
Summary: | [3.6]hawkular-metrics pod took a long time to become running if set openshift_metrics_hawkular_replicas as non-default value | |||
---|---|---|---|---|
Product: | OpenShift Container Platform | Reporter: | Junqi Zhao <juzhao> | |
Component: | Hawkular | Assignee: | Ruben Vargas Palma <rvargasp> | |
Status: | CLOSED DEFERRED | QA Contact: | Junqi Zhao <juzhao> | |
Severity: | low | Docs Contact: | ||
Priority: | low | |||
Version: | 3.6.0 | CC: | aos-bugs, juzhao | |
Target Milestone: | --- | |||
Target Release: | 3.10.z | |||
Hardware: | Unspecified | |||
OS: | Unspecified | |||
Whiteboard: | ||||
Fixed In Version: | Doc Type: | If docs needed, set a value | ||
Doc Text: | Story Points: | --- | ||
Clone Of: | ||||
: | 1540413 (view as bug list) | Environment: | ||
Last Closed: | 2019-11-20 18:48:20 UTC | Type: | Bug | |
Regression: | --- | Mount Type: | --- | |
Documentation: | --- | CRM: | ||
Verified Versions: | Category: | --- | ||
oVirt Team: | --- | RHEL 7.3 requirements from Atomic Host: | ||
Cloudforms Team: | --- | Target Upstream Version: | ||
Embargoed: | ||||
Bug Depends On: | 1560695, 1590449, 1590451, 1592966 | |||
Bug Blocks: | 1540413 | |||
Attachments: |
Description
Junqi Zhao
2017-07-11 08:57:03 UTC
Are you sure this is related to setting the initial number of Hawkular Metrics instances to 2?

When Hawkular Metrics starts up, it waits for Cassandra to become ready. If Cassandra takes a while to start fully (including downloading the Cassandra image), it will appear that Hawkular Metrics is taking a long time to start up. If Cassandra is already running, then deploying another Hawkular Metrics pod is expected to take much less time.

Does this long delay also occur if only 1 pod is specified for Hawkular Metrics? For the pods that took a long time to get deployed, do you have the logs for those?

(In reply to Matt Wringe from comment #1)
> Are you sure this is related to setting the initial number of Hawkular
> Metric instances to 2?

Yes. The default value of openshift_metrics_hawkular_replicas is 1, and the issue does not occur with it. The issue only happens when we set openshift_metrics_hawkular_replicas to a non-default value and deploy metrics on OCP for the first time. I checked the hawkular-metrics pod logs and found a TimeoutException.

If we undeploy metrics and re-deploy with the same configuration, the Hawkular Metrics instances start up in around 3-4 minutes, and there is no TimeoutException in the hawkular-metrics pod logs; see the attached file.

> Does this long delay also occur if only 1 pod is specified for Hawkular
> Metrics? For the pods that took a long time to get deployed, do you have the
> logs for those?

There is no long delay if only 1 pod is specified for Hawkular Metrics.

Created attachment 1310439 [details]
events and hawkular_metrics pod log, openshift_metrics_hawkular_replicas as non-default value
Created attachment 1310440 [details]
events and hawkular_metrics pod log,undeploy metrics and re-deploy metrics
Created attachment 1310441 [details]
ansible inventory file
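
The first comment describes Hawkular Metrics waiting for Cassandra to become ready before it finishes starting. The following is a minimal sketch of that kind of wait-with-timeout loop, not the actual Hawkular Metrics code. The function name `wait_until_ready` and all of its parameters are hypothetical; the readiness check is passed in as a callable so the sketch makes no claim about how Cassandra readiness is really probed.

```python
import time
from typing import Callable


def wait_until_ready(
    check: Callable[[], bool],
    timeout_s: float,
    interval_s: float = 1.0,
    sleep: Callable[[float], None] = time.sleep,
    clock: Callable[[], float] = time.monotonic,
) -> bool:
    """Poll check() until it returns True or timeout_s has elapsed.

    Returns True if the dependency became ready, False on timeout.
    `sleep` and `clock` are injectable so the loop can be tested
    without real waiting (hypothetical design choice for this sketch).
    """
    deadline = clock() + timeout_s
    while True:
        if check():
            return True
        if clock() >= deadline:
            return False
        sleep(interval_s)
```

With a slow dependency (for example, an image that still has to be downloaded), the caller simply sees a longer wait before `True` is returned, which matches the observation that a slow Cassandra start makes Hawkular Metrics look slow.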
This can be reproduced. It looks like a bug in the JGroups clustering for OpenShift/Kubernetes that needs to be resolved. We may also be able to work around it by delaying the deployment of a second Hawkular Metrics pod.

Hawkular Metrics will eventually kill the pod as expected, and the restarted pod will be able to connect to the cluster properly.

(In reply to Matt Wringe from comment #6)
> This can be reproduced. This looks like its a bug with the jgroups
> clustering for OpenShift/Kubernetes that need to be resolved. We may also be
> able to get around this by delaying deploying a second Hawkular Metrics pod.
>
> Hawkular Metrics will eventually kill the pod as expected and the restarted
> pod will be able to connect to the cluster properly.

I was reviewing the configuration in standalone.xml in origin-metrics. I am not familiar with setting up JGroups clustering. I just read http://blog.infinispan.org/2016/08/running-infinispan-cluster-on-openshift.html and am wondering: do we need to declare a kubernetes stack in the jgroups subsystem?

We do things a bit differently in origin metrics than in OCP (we bring in the jars for origin; in OCP they are provided by the EAP container). E.g.: https://github.com/openshift/origin-metrics/blob/master/hawkular-metrics/Dockerfile#L48

The setup in standalone.xml is about the same between the two, although since they use different versions, there is a slight difference in naming. E.g.: https://github.com/openshift/origin-metrics/blob/master/hawkular-metrics/standalone.xml#L322

It is the same error if openshift_metrics_cassandra_replicas is set to a non-default value.

# openshift version
openshift v3.7.0-0.190.0
kubernetes v1.7.6+a08f5eeb62
etcd 3.2.8

images
metrics-heapster/images/v3.7.0-0.190.0.0
metrics-cassandra/images/v3.7.0-0.190.0.0
metrics-hawkular-metrics/images/v3.7.0-0.190.0.0

Created attachment 1347212 [details]
metrics pods log, set openshift_metrics_cassandra_replicas as non default value
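
Comment #6 suggests a possible workaround: delay deploying the second Hawkular Metrics pod until the first one is up. A rough sketch of that idea is a staged scale-up that adds one replica at a time and waits for readiness between steps. This is an illustration only, not what the installer does; `staged_scale_up`, `scale`, and `all_ready` are hypothetical names, and the two callables stand in for whatever mechanism actually sets the replica count and checks pod readiness.

```python
from typing import Callable, List


def staged_scale_up(
    target: int,
    scale: Callable[[int], None],
    all_ready: Callable[[int], bool],
    max_checks: int = 60,
) -> List[int]:
    """Scale from 1 up to `target` replicas, one replica at a time.

    After each scale step, all_ready(replicas) is polled up to
    max_checks times; it is expected to do its own short wait between
    polls. Raises TimeoutError if a step never becomes ready.
    Returns the list of replica counts that were reached.
    """
    reached: List[int] = []
    for replicas in range(1, target + 1):
        scale(replicas)
        for _ in range(max_checks):
            if all_ready(replicas):
                break
        else:
            # The inner loop finished without a successful readiness check.
            raise TimeoutError(
                f"{replicas} replica(s) not ready after {max_checks} checks"
            )
        reached.append(replicas)
    return reached
```

The point of the design is that the second pod only tries to join the cluster after the first pod is already running, which is the condition under which the thread reports that a restarted pod connects properly.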
(In reply to Junqi Zhao from comment #9)
> It is the same error if set openshift_metrics_cassandra_replicas as non
> default value.
>
> # openshift version
> openshift v3.7.0-0.190.0
> kubernetes v1.7.6+a08f5eeb62
> etcd 3.2.8
>
> images
> metrics-heapster/images/v3.7.0-0.190.0.0
> metrics-cassandra/images/v3.7.0-0.190.0.0
> metrics-hawkular-metrics/images/v3.7.0-0.190.0.0

When you say it is the same error, are you referring to the long start-up time or something else? We can reproduce the JGroups exception in hawkular-metrics, although not entirely consistently. Cassandra, however, does not use JGroups. I did see that Cassandra initialization ran for ~11 minutes and had not finished when the logs cut off.

(In reply to John Sanda from comment #11)
> (In reply to Junqi Zhao from comment #9)
> > It is the same error if set openshift_metrics_cassandra_replicas as non
> > default value.
> >
> > # openshift version
> > openshift v3.7.0-0.190.0
> > kubernetes v1.7.6+a08f5eeb62
> > etcd 3.2.8
> >
> > images
> > metrics-heapster/images/v3.7.0-0.190.0.0
> > metrics-cassandra/images/v3.7.0-0.190.0.0
> > metrics-hawkular-metrics/images/v3.7.0-0.190.0.0
>
> When you say it is the same error, are you referring to the long start up
> time or something else?

Yes, I mean the start-up time is long.

This is getting fixed in bug 1560695.

I am moving to ON_QA since the fix was done in bug 1560695.

It takes about 12 minutes for all the pods to reach Running status, and one hawkular-metrics pod was restarted 2 times.

NAME                           READY  STATUS     RESTARTS  AGE  IP           NODE
hawkular-cassandra-1-sxwzf     1/1    Running    0         11m  10.129.0.19  *******-qeos-nrr-1
hawkular-metrics-pw9pn         1/1    Running    2         11m  10.130.0.75  *******-qeos-node-1
hawkular-metrics-schema-62gqj  0/1    Completed  0         12m  10.129.0.16  *******-qeos-nrr-1
hawkular-metrics-szgwm         1/1    Running    0         11m  10.128.0.12  *******-qeos-master-etcd-1
heapster-trzrd                 1/1    Running    0         11m  10.129.0.18  *******-qeos-nrr-1

For the restarted metrics pod logs, please see the attached file.

parameters:
openshift_metrics_install_metrics=true
openshift_metrics_cassandra_storage_type=dynamic
openshift_metrics_hawkular_replicas=2

metrics version: v3.10.27-1

# openshift version
openshift v3.10.27

Created attachment 1474655 [details]
hawkular-metrics restarted pod log
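
When verifying a report like this, it can help to pick out the restarted pods from a pod listing automatically. Below is a small sketch that parses a whitespace-separated listing of the kind quoted in the verification comment and returns the pods whose RESTARTS count is above zero. It assumes the listing has a header row with NAME and RESTARTS columns and that no column value contains spaces; `restarted_pods` and `SAMPLE` are hypothetical names, and the sample rows are copied from the comment.

```python
from typing import Dict


def restarted_pods(listing: str) -> Dict[str, int]:
    """Return {pod name: restart count} for pods with restarts > 0.

    Expects whitespace-separated columns with a header row that
    contains NAME and RESTARTS (an assumption of this sketch).
    """
    lines = [ln for ln in listing.splitlines() if ln.strip()]
    if not lines:
        return {}
    header = lines[0].split()
    name_col = header.index("NAME")
    restarts_col = header.index("RESTARTS")
    result: Dict[str, int] = {}
    for ln in lines[1:]:
        fields = ln.split()
        count = int(fields[restarts_col])
        if count > 0:
            result[fields[name_col]] = count
    return result


# Rows taken from the pod listing quoted in this bug.
SAMPLE = """\
NAME READY STATUS RESTARTS AGE IP NODE
hawkular-cassandra-1-sxwzf 1/1 Running 0 11m 10.129.0.19 *******-qeos-nrr-1
hawkular-metrics-pw9pn 1/1 Running 2 11m 10.130.0.75 *******-qeos-node-1
hawkular-metrics-schema-62gqj 0/1 Completed 0 12m 10.129.0.16 *******-qeos-nrr-1
hawkular-metrics-szgwm 1/1 Running 0 11m 10.128.0.12 *******-qeos-master-etcd-1
heapster-trzrd 1/1 Running 0 11m 10.129.0.18 *******-qeos-nrr-1
"""
```

Calling `restarted_pods(SAMPLE)` flags only `hawkular-metrics-pw9pn`, the pod reported as restarted 2 times.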
OCP 3.6-3.10 is no longer on full support [1]. Marking CLOSED DEFERRED. If you have a customer case with a support exception or have reproduced on 3.11+, please reopen and include those details. When reopening, please set the Target Release to the appropriate version where needed.

[1]: https://access.redhat.com/support/policy/updates/openshift