Bug 1613656 - [3.9]sometimes hawkular-metrics would restart a few times to reach Running status
Summary: [3.9]sometimes hawkular-metrics would restart a few times to reach Running st...
Keywords:
Status: CLOSED WORKSFORME
Alias: None
Product: OpenShift Container Platform
Classification: Red Hat
Component: Hawkular
Version: 3.9.0
Hardware: Unspecified
OS: Unspecified
unspecified
medium
Target Milestone: ---
: 3.9.z
Assignee: John Sanda
QA Contact: Junqi Zhao
URL:
Whiteboard:
Depends On:
Blocks:
TreeView+ depends on / blocked
 
Reported: 2018-08-08 05:16 UTC by Junqi Zhao
Modified: 2018-10-08 11:04 UTC (History)
3 users (show)

Fixed In Version:
Doc Type: If docs needed, set a value
Doc Text:
Clone Of:
Environment:
Last Closed: 2018-10-08 11:04:35 UTC
Target Upstream Version:
Embargoed:


Attachments (Terms of Use)
metrics logs (34.15 KB, application/x-gzip)
2018-08-08 05:16 UTC, Junqi Zhao
no flags Details

Description Junqi Zhao 2018-08-08 05:16:27 UTC
Created attachment 1474147 [details]
metrics logs

Description of problem:
Deployed metrics 3.9, hawkular-metrics would restart one or two times to reach Running status
# oc get pod
NAME                         READY     STATUS    RESTARTS   AGE
hawkular-cassandra-1-29klh   1/1       Running   0          23m
hawkular-metrics-l7d59       1/1       Running   2          23m
heapster-96f8z               1/1       Running   0          23m

and there are exceptions in the hawkular-metrics pod before it is restarted, this defect does not block the metrics function, after restarting, hawkular-metrics pod becomes running, and metrics works well.


**************************************************************************
2018-08-08 04:51:24,038 FATAL [org.hawkular.metrics.api.jaxrs.MetricsServiceLifecycle] (metricsservice-lifecycle-thread) HAWKMETRICS200006: An error occurred trying to connect to the Cassandra cluster: java.lang.RuntimeException: com.datastax.driver.core.exceptions.NoHostAvailableException: All host(s) tried for query failed (tried: hawkular-cassandra/172.30.254.162:9042 (com.datastax.driver.core.exceptions.OperationTimedOutException: [hawkular-cassandra/172.30.254.162:9042] Timed out waiting for server response))
	at org.hawkular.metrics.api.jaxrs.DistributedLock.lockAndThen(DistributedLock.java:111)
	at org.hawkular.metrics.api.jaxrs.MetricsServiceLifecycle.initSchema(MetricsServiceLifecycle.java:649)
	at org.hawkular.metrics.api.jaxrs.MetricsServiceLifecycle.startMetricsService(MetricsServiceLifecycle.java:406)
	at java.util.concurrent.Executors$RunnableAdapter.call(Executors.java:511)
	at java.util.concurrent.FutureTask.run(FutureTask.java:266)
	at java.util.concurrent.ScheduledThreadPoolExecutor$ScheduledFutureTask.access$201(ScheduledThreadPoolExecutor.java:180)
	at java.util.concurrent.ScheduledThreadPoolExecutor$ScheduledFutureTask.run(ScheduledThreadPoolExecutor.java:293)
	at java.util.concurrent.ThreadPoolExecutor.runWorker(ThreadPoolExecutor.java:1149)
	at java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:624)
	at java.lang.Thread.run(Thread.java:748)
Caused by: com.datastax.driver.core.exceptions.NoHostAvailableException: All host(s) tried for query failed (tried: hawkular-cassandra/172.30.254.162:9042 (com.datastax.driver.core.exceptions.OperationTimedOutException: [hawkular-cassandra/172.30.254.162:9042] Timed out waiting for server response))
	at com.datastax.driver.core.exceptions.NoHostAvailableException.copy(NoHostAvailableException.java:84)
	at com.datastax.driver.core.exceptions.NoHostAvailableException.copy(NoHostAvailableException.java:37)
	at com.datastax.driver.core.DriverThrowables.propagateCause(DriverThrowables.java:37)
	at com.datastax.driver.core.DefaultResultSetFuture.getUninterruptibly(DefaultResultSetFuture.java:245)
	at com.datastax.driver.core.AbstractSession.execute(AbstractSession.java:68)
	at com.datastax.driver.core.Session$execute$0.call(Unknown Source)
	at org.cassalog.core.CassalogImpl.executeCQL(CassalogImpl.groovy:351)
	at sun.reflect.NativeMethodAccessorImpl.invoke0(Native Method)
	at sun.reflect.NativeMethodAccessorImpl.invoke(NativeMethodAccessorImpl.java:62)
	at sun.reflect.DelegatingMethodAccessorImpl.invoke(DelegatingMethodAccessorImpl.java:43)
	at java.lang.reflect.Method.invoke(Method.java:498)
	at org.codehaus.groovy.reflection.CachedMethod.invoke(CachedMethod.java:93)
	at groovy.lang.MetaMethod.doMethodInvoke(MetaMethod.java:325)
	at org.codehaus.groovy.runtime.metaclass.ClosureMetaClass.invokeMethod(ClosureMetaClass.java:384)
	at groovy.lang.MetaClassImpl.invokeMethod(MetaClassImpl.java:1021)
	at org.codehaus.groovy.runtime.callsite.PogoMetaClassSite.callCurrent(PogoMetaClassSite.java:69)
	at org.codehaus.groovy.runtime.callsite.AbstractCallSite.callCurrent(AbstractCallSite.java:166)
	at org.cassalog.core.CassalogImpl$_applyChangeSet_closure16.doCall(CassalogImpl.groovy:323)
	at sun.reflect.NativeMethodAccessorImpl.invoke0(Native Method)
	at sun.reflect.NativeMethodAccessorImpl.invoke(NativeMethodAccessorImpl.java:62)
	at sun.reflect.DelegatingMethodAccessorImpl.invoke(DelegatingMethodAccessorImpl.java:43)
	at java.lang.reflect.Method.invoke(Method.java:498)
	at org.codehaus.groovy.reflection.CachedMethod.invoke(CachedMethod.java:93)
	at groovy.lang.MetaMethod.doMethodInvoke(MetaMethod.java:325)
	at org.codehaus.groovy.runtime.metaclass.ClosureMetaClass.invokeMethod(ClosureMetaClass.java:294)
	at groovy.lang.MetaClassImpl.invokeMethod(MetaClassImpl.java:1021)
	at groovy.lang.Closure.call(Closure.java:426)
	at groovy.lang.Closure.call(Closure.java:442)
	at org.codehaus.groovy.runtime.DefaultGroovyMethods.each(DefaultGroovyMethods.java:2030)
	at org.codehaus.groovy.runtime.DefaultGroovyMethods.each(DefaultGroovyMethods.java:2015)
	at org.codehaus.groovy.runtime.DefaultGroovyMethods.each(DefaultGroovyMethods.java:2056)
	at org.codehaus.groovy.runtime.dgm$162.invoke(Unknown Source)
	at org.codehaus.groovy.runtime.callsite.PojoMetaMethodSite$PojoMetaMethodSiteNoUnwrapNoCoerce.invoke(PojoMetaMethodSite.java:274)
	at org.codehaus.groovy.runtime.callsite.PojoMetaMethodSite.call(PojoMetaMethodSite.java:56)
	at org.codehaus.groovy.runtime.callsite.AbstractCallSite.call(AbstractCallSite.java:125)
	at org.cassalog.core.CassalogImpl.applyChangeSet(CassalogImpl.groovy:323)
	at sun.reflect.NativeMethodAccessorImpl.invoke0(Native Method)
	at sun.reflect.NativeMethodAccessorImpl.invoke(NativeMethodAccessorImpl.java:62)
	at sun.reflect.DelegatingMethodAccessorImpl.invoke(DelegatingMethodAccessorImpl.java:43)
	at java.lang.reflect.Method.invoke(Method.java:498)
	at org.codehaus.groovy.reflection.CachedMethod.invoke(CachedMethod.java:93)
	at groovy.lang.MetaMethod.doMethodInvoke(MetaMethod.java:325)
	at org.codehaus.groovy.runtime.metaclass.ClosureMetaClass.invokeMethod(ClosureMetaClass.java:384)
	at groovy.lang.MetaClassImpl.invokeMethod(MetaClassImpl.java:1021)
	at org.codehaus.groovy.runtime.callsite.PogoMetaClassSite.callCurrent(PogoMetaClassSite.java:69)
	at org.codehaus.groovy.runtime.callsite.AbstractCallSite.callCurrent(AbstractCallSite.java:182)
	at org.cassalog.core.CassalogImpl$_execute_closure3.doCall(CassalogImpl.groovy:135)
	at sun.reflect.NativeMethodAccessorImpl.invoke0(Native Method)
	at sun.reflect.NativeMethodAccessorImpl.invoke(NativeMethodAccessorImpl.java:62)
	at sun.reflect.DelegatingMethodAccessorImpl.invoke(DelegatingMethodAccessorImpl.java:43)
	at java.lang.reflect.Method.invoke(Method.java:498)
	at org.codehaus.groovy.reflection.CachedMethod.invoke(CachedMethod.java:93)
	at groovy.lang.MetaMethod.doMethodInvoke(MetaMethod.java:325)
	at org.codehaus.groovy.runtime.metaclass.ClosureMetaClass.invokeMethod(ClosureMetaClass.java:294)
	at groovy.lang.MetaClassImpl.invokeMethod(MetaClassImpl.java:1021)
	at groovy.lang.Closure.call(Closure.java:426)
	at org.codehaus.groovy.runtime.DefaultGroovyMethods.eachWithIndex(DefaultGroovyMethods.java:1946)
	at org.codehaus.groovy.runtime.DefaultGroovyMethods.eachWithIndex(DefaultGroovyMethods.java:1926)
	at org.codehaus.groovy.runtime.DefaultGroovyMethods.eachWithIndex(DefaultGroovyMethods.java:1976)
	at org.codehaus.groovy.runtime.dgm$174.invoke(Unknown Source)
	at org.codehaus.groovy.runtime.callsite.PojoMetaMethodSite$PojoMetaMethodSiteNoUnwrapNoCoerce.invoke(PojoMetaMethodSite.java:274)
	at org.codehaus.groovy.runtime.callsite.PojoMetaMethodSite.call(PojoMetaMethodSite.java:56)
	at org.codehaus.groovy.runtime.callsite.CallSiteArray.defaultCall(CallSiteArray.java:48)
	at org.codehaus.groovy.runtime.callsite.AbstractCallSite.call(AbstractCallSite.java:113)
	at org.codehaus.groovy.runtime.callsite.AbstractCallSite.call(AbstractCallSite.java:125)
	at org.cassalog.core.CassalogImpl.execute(CassalogImpl.groovy:109)
	at org.hawkular.metrics.schema.SchemaService.run(SchemaService.java:68)
	at org.hawkular.metrics.api.jaxrs.MetricsServiceLifecycle.lambda$initSchema$2(MetricsServiceLifecycle.java:651)
	at org.hawkular.metrics.api.jaxrs.DistributedLock.lockAndThen(DistributedLock.java:109)
	... 9 more
Caused by: com.datastax.driver.core.exceptions.NoHostAvailableException: All host(s) tried for query failed (tried: hawkular-cassandra/172.30.254.162:9042 (com.datastax.driver.core.exceptions.OperationTimedOutException: [hawkular-cassandra/172.30.254.162:9042] Timed out waiting for server response))
	at com.datastax.driver.core.RequestHandler.reportNoMoreHosts(RequestHandler.java:220)
	at com.datastax.driver.core.RequestHandler.access$1200(RequestHandler.java:50)
	at com.datastax.driver.core.RequestHandler$SpeculativeExecution.findNextHostAndQuery(RequestHandler.java:291)
	at com.datastax.driver.core.RequestHandler$SpeculativeExecution.retry(RequestHandler.java:459)
	at com.datastax.driver.core.RequestHandler$SpeculativeExecution.processRetryDecision(RequestHandler.java:437)
	at com.datastax.driver.core.RequestHandler$SpeculativeExecution.onTimeout(RequestHandler.java:796)
	at com.datastax.driver.core.Connection$ResponseHandler$1.run(Connection.java:1382)
	at io.netty.util.HashedWheelTimer$HashedWheelTimeout.expire(HashedWheelTimer.java:663)
	at io.netty.util.HashedWheelTimer$HashedWheelBucket.expireTimeouts(HashedWheelTimer.java:738)
	at io.netty.util.HashedWheelTimer$Worker.run(HashedWheelTimer.java:466)
	at io.netty.util.concurrent.DefaultThreadFactory$DefaultRunnableDecorator.run(DefaultThreadFactory.java:138)
	... 1 more

2018-08-08 04:51:48,326 ERROR [org.hawkular.metrics.api.jaxrs.MetricsServiceLifecycle] (MSC service thread 1-8) HAWKMETRICS200009: Unexcepted exception while shutting down: com.google.common.util.concurrent.UncheckedExecutionException: java.lang.NullPointerException
	at com.google.common.util.concurrent.Futures.wrapAndThrowExceptionOrError(Futures.java:1464)
	at com.google.common.util.concurrent.Futures.get(Futures.java:1453)
	at org.hawkular.metrics.api.jaxrs.MetricsServiceLifecycle.destroy(MetricsServiceLifecycle.java:806)
	at sun.reflect.NativeMethodAccessorImpl.invoke0(Native Method)
	at sun.reflect.NativeMethodAccessorImpl.invoke(NativeMethodAccessorImpl.java:62)
	at sun.reflect.DelegatingMethodAccessorImpl.invoke(DelegatingMethodAccessorImpl.java:43)
	at java.lang.reflect.Method.invoke(Method.java:498)
	at org.jboss.weld.injection.producer.DefaultLifecycleCallbackInvoker.invokeMethods(DefaultLifecycleCallbackInvoker.java:97)
	at org.jboss.weld.injection.producer.DefaultLifecycleCallbackInvoker.preDestroy(DefaultLifecycleCallbackInvoker.java:90)
	at org.jboss.weld.injection.producer.BasicInjectionTarget.preDestroy(BasicInjectionTarget.java:127)
	at org.jboss.weld.bean.ManagedBean.destroy(ManagedBean.java:191)
	at org.jboss.weld.util.bean.IsolatedForwardingBean.destroy(IsolatedForwardingBean.java:50)
	at org.jboss.weld.context.AbstractContext.destroyContextualInstance(AbstractContext.java:139)
	at org.jboss.weld.context.AbstractContext.destroy(AbstractContext.java:153)
	at org.jboss.weld.context.AbstractSharedContext.destroy(AbstractSharedContext.java:61)
	at org.jboss.weld.context.AbstractSharedContext.invalidate(AbstractSharedContext.java:56)
	at org.jboss.weld.bootstrap.WeldRuntime.shutdown(WeldRuntime.java:56)
	at org.jboss.weld.bootstrap.WeldBootstrap.shutdown(WeldBootstrap.java:113)
	at org.jboss.as.weld.WeldStartService.stop(WeldStartService.java:128)
	at org.jboss.msc.service.ServiceControllerImpl$StopTask.stopService(ServiceControllerImpl.java:2150)
	at org.jboss.msc.service.ServiceControllerImpl$StopTask.run(ServiceControllerImpl.java:2101)
	at java.util.concurrent.ThreadPoolExecutor.runWorker(ThreadPoolExecutor.java:1149)
	at java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:624)
	at java.lang.Thread.run(Thread.java:748)
Caused by: java.lang.NullPointerException
	at org.hawkular.metrics.api.jaxrs.MetricsServiceLifecycle.stopServices(MetricsServiceLifecycle.java:818)
	at java.util.concurrent.Executors$RunnableAdapter.call(Executors.java:511)
	at java.util.concurrent.FutureTask.run(FutureTask.java:266)
	at java.util.concurrent.ScheduledThreadPoolExecutor$ScheduledFutureTask.access$201(ScheduledThreadPoolExecutor.java:180)
	at java.util.concurrent.ScheduledThreadPoolExecutor$ScheduledFutureTask.run(ScheduledThreadPoolExecutor.java:293)
	... 3 more
Version-Release number of selected component (if applicable):
metrics-cassandra-v3.9.40-11
metrics-hawkular-metrics-v3.9.40-11
metrics-heapster-v3.9.40-11

How reproducible:
Always

Steps to Reproduce:
1. Deploy metrics 3.9, parameters see [Additional info] part
2.
3.

Actual results:
hawkular-metrics would restart a few times to reach Running status

Expected results:


Additional info:
openshift_metrics_install_metrics=true
openshift_metrics_image_prefix=brew-pulp-docker01.web.prod.ext.phx2.redhat.com:8888/openshift3/
openshift_metrics_image_version=v3.9.40
openshift_metrics_cassandra_storage_type=dynamic

Comment 1 John Sanda 2018-08-08 12:23:35 UTC
Things are working as expected. The hawkular-metrics pod will not reach a Running status until cassandra is ready. I can see in the logs that hawkular-metrics was trying to connect to cassandra before it was ready to receive client connections. Here is the key statement from the cassandra log:


INFO  [main] 2018-08-08 04:49:58,694 Server.java:160 - Starting listening for CQL clients on hawkular-cassandra-1-29klh/10.129.0.13:9042 (encrypted)...


The above statement means that Cassandra is ready to accept client connections. 

From the hawkular-metrics log:


2018-08-08 04:47:37,322 WARN  [org.hawkular.metrics.api.jaxrs.MetricsServiceLifecycle] (metricsservice-lifecycle-thread) HAWKMETRICS200003: Could not connect to Cassandra cluster - assuming its not up yet: All host(s) tried for query failed (tried: hawkular-cassandra/172.30.254.162:9042 (com.datastax.driver.core.exceptions.TransportException: [hawkular-cassandra/172.30.254.162:9042] Cannot connect))
2018-08-08 04:47:37,322 WARN  [org.hawkular.metrics.api.jaxrs.MetricsServiceLifecycle] (metricsservice-lifecycle-thread) HAWKMETRICS200004: [1] Retrying connecting to Cassandra cluster in [1]s...


Note the timestamps. The hawkular-metrics pod was trying to connect more than two minutes before cassandra was initialized.

Unless there is something else going on I think we can close this.

Comment 2 Junqi Zhao 2018-08-09 02:42:28 UTC
Tested with the same images as Comment 0 from stage repo today, it does not have this issue, so this issue maybe happen occassionally or related to env.

# oc get pod -n openshift-infra
NAME                         READY     STATUS    RESTARTS   AGE
hawkular-cassandra-1-4jzlv   1/1       Running   0          33m
hawkular-metrics-jqnmj       1/1       Running   0          33m
heapster-phbtn               1/1       Running   0          33m

Comment 3 John Sanda 2018-09-24 17:57:42 UTC
Can we close this ticket?

Comment 4 Junqi Zhao 2018-10-08 11:04:35 UTC
(In reply to John Sanda from comment #3)
> Can we close this ticket?

OK, Will close it


Note You need to log in before you can comment on or make changes to this bug.