Bug 1585891
| Summary: | [3.10] Sometimes, hawkular-metrics pod would throw out error: WriteTimeoutException: Cassandra timeout during write query at consistency LOCAL_ONE | | |
|---|---|---|---|
| Product: | OpenShift Container Platform | Reporter: | Junqi Zhao <juzhao> |
| Component: | Hawkular | Assignee: | Ruben Vargas Palma <rvargasp> |
| Status: | CLOSED WONTFIX | QA Contact: | Junqi Zhao <juzhao> |
| Severity: | high | Docs Contact: | |
| Priority: | medium | | |
| Version: | 3.11.0 | CC: | agawand, aos-bugs, apurty, cruhm, erich, haowang, jkaur, jmalde, juzhao, ktadimar, pescorza, pkanthal, rkshirsa, rvargasp, sburke, sople |
| Target Milestone: | --- | Keywords: | Reopened |
| Target Release: | 3.11.z | | |
| Hardware: | Unspecified | | |
| OS: | Unspecified | | |
| Whiteboard: | | | |
| Fixed In Version: | | Doc Type: | If docs needed, set a value |
| Doc Text: | | Story Points: | --- |
| Clone Of: | | | |
| : | 1606771 1606783 1623025 (view as bug list) | Environment: | |
| Last Closed: | 2021-11-09 14:14:22 UTC | Type: | Bug |
| Regression: | --- | Mount Type: | --- |
| Documentation: | --- | CRM: | |
| Verified Versions: | | Category: | --- |
| oVirt Team: | --- | RHEL 7.3 requirements from Atomic Host: | |
| Cloudforms Team: | --- | Target Upstream Version: | |
| Embargoed: | | | |
| Bug Depends On: | | | |
| Bug Blocks: | 1606771, 1606783, 1623025 | | |
| Attachments: | metrics pods log | | |

The same issue is observed on a 3.6 cluster as well.
```
Caused by: com.datastax.driver.core.exceptions.WriteTimeoutException: Cassandra timeout during write query at consistency LOCAL_ONE (1 replica were required but only 0 acknowledged the write)
	at com.datastax.driver.core.Responses$Error$1.decode(Responses.java:59)
	at com.datastax.driver.core.Responses$Error$1.decode(Responses.java:37)
	at com.datastax.driver.core.Message$ProtocolDecoder.decode(Message.java:277)
	at com.datastax.driver.core.Message$ProtocolDecoder.decode(Message.java:257)
	at io.netty.handler.codec.MessageToMessageDecoder.channelRead(MessageToMessageDecoder.java:88)
	... 29 more
```
----------------
The customer was also hitting the issue in https://bugzilla.redhat.com/show_bug.cgi?id=1561823 but the error does not appear now after performing the workaround mentioned in the bug.
Please advise.
----------------

(In reply to Venkata Tadimarri from comment #1)
> The same issue is observed on a 3.6 cluster as well.
> [...]
> Please advise.

Update: After adding a new PV to Cassandra and starting the containers (i.e. with no previous data), the pods start up OK without any of the errors.

----------------

(In reply to Venkata Tadimarri from comment #1)
> The same issue is observed on a 3.6 cluster as well.
> [...]
> Please advise.

Please open a separate ticket for this. There are quite a few changes between metrics in 3.6 and metrics in 3.10.

----------------

I see similar behavior in free-int, and I doubt it is simply coincidence. Can you double the memory for the cassandra pods? How much cpu is allocated to the cassandra pods? Can you also provide the output of `oc get pods --all-namespaces | wc -l`?

----------------

(In reply to John Sanda from comment #5)
> I see similar behavior in free-int, and I doubt it is simply coincidence.
> Can you double the memory for the cassandra pods?

Yes, changed to openshift_metrics_cassandra_limits_memory=4G

> How much cpu is allocated to cassandra pods?

Used the default value; no cpu limit is set for the cassandra pods.

> Can you also provide the output of `oc get pods --all-namespaces | wc -l`?

```
# oc get pods --all-namespaces | wc -l
28
```

----------------

There is another job that also runs every two hours like the compression job, but is staggered to run one hour before it, and it creates the "temp" tables. When a new table is created, hawkular-metrics receives an event notification from the Cassandra driver about the schema change and then proceeds to create the prepared statements for the new tables. There is a race condition which is causing that to fail. Here it is in the provided logs:

```
2018-06-04 15:08:48,753 ERROR [stderr] (RxIoScheduler-9) Caused by: com.datastax.driver.core.exceptions.InvalidQueryException: unconfigured table data_temp_2018060512
2018-06-04 15:08:48,753 ERROR [stderr] (RxIoScheduler-9) 	at com.datastax.driver.core.exceptions.InvalidQueryException.copy(InvalidQueryException.java:50)
2018-06-04 15:08:48,753 ERROR [stderr] (RxIoScheduler-9) 	at com.datastax.driver.core.DriverThrowables.propagateCause(DriverThrowables.java:37)
2018-06-04 15:08:48,754 ERROR [stderr] (RxIoScheduler-9) 	at com.datastax.driver.core.AbstractSession.prepare(AbstractSession.java:104)
2018-06-04 15:08:48,754 ERROR [stderr] (RxIoScheduler-9) 	at org.hawkular.metrics.core.service.DataAccessImpl.prepareTempStatements(DataAccessImpl.java:392)
2018-06-04 15:08:48,754 ERROR [stderr] (RxIoScheduler-9) 	at org.hawkular.metrics.core.service.DataAccessImpl$TemporaryTableStatementCreator.lambda$onTableAdded$0(DataAccessImpl.java:1373)
2018-06-04 15:08:48,754 ERROR [stderr] (RxIoScheduler-9) 	at rx.internal.operators.OnSubscribeFromCallable.call(OnSubscribeFromCallable.java:48)
```

I think that this may be causing the write timeouts. The timeouts are occurring when inserting data points. Let's say hawkular-metrics sends the query to cassandra-1 but cassandra-2 is the owner. cassandra-1 basically forwards the request to cassandra-2 and waits for a response. cassandra-2 has to parse the queries, an expensive step which is normally done upfront only once, when creating a prepared statement. cassandra-2 occasionally has to redo this parsing because of the above error, and the extra processing time causes cassandra-1 to time out. I have not had a chance to try to reproduce this theory in a more isolated environment, but I do think that there is a good chance this is the cause.

Can you test again with only a single cassandra pod? I am curious to see if we will get the same behavior.
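The race described above can be sketched with a toy, single-process simulation. This is illustrative only: `FakeCluster`, `FakeNode`, and `on_table_added` are invented stand-ins, not Hawkular or DataStax driver code. The point it shows is that the table-added event fires as soon as one node applies the schema change, so a `prepare()` routed to a node that has not yet seen the table fails with "unconfigured table", matching the log above.

```python
import time

class FakeNode:
    """Toy stand-in for one Cassandra node's view of the schema."""
    def __init__(self):
        self.tables = set()

class FakeCluster:
    """Toy cluster: a schema change lands on one node first, then propagates."""
    def __init__(self, listeners):
        self.nodes = [FakeNode(), FakeNode(), FakeNode()]
        self.listeners = listeners

    def create_table(self, name):
        # The first node applies the change and the "driver" fires the
        # schema-change event right away, before the other nodes have
        # seen the new table.
        self.nodes[0].tables.add(name)
        for listener in self.listeners:
            listener(name)
        time.sleep(0.01)  # simulated schema propagation delay
        for node in self.nodes[1:]:
            node.tables.add(name)

    def prepare(self, table, node_index):
        # prepare() is parsed on whichever node the request is routed to;
        # a node that has not seen the table yet rejects the statement.
        if table not in self.nodes[node_index].tables:
            raise RuntimeError("unconfigured table " + table)
        return "INSERT INTO " + table + " (time, value) VALUES (?, ?)"

errors = []

def on_table_added(table):
    # Reacts immediately to the event, analogous to
    # TemporaryTableStatementCreator.onTableAdded -> prepareTempStatements;
    # here the prepare happens to be routed to a lagging node.
    try:
        cluster.prepare(table, node_index=2)
    except RuntimeError as exc:
        errors.append(str(exc))

cluster = FakeCluster([on_table_added])
cluster.create_table("data_temp_2018060512")
print(errors)  # -> ['unconfigured table data_temp_2018060512']
```

Retrying the prepare after schema agreement is reached (or simply delaying it) makes the toy failure disappear, which is consistent with the error being timing-dependent rather than a bad query.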
----------------

I was able to easily reproduce the exception in comment 8 with an automated test running locally on my dev machine. I reproduced the error with a three-node Cassandra cluster. I did not get any errors when running just a single node.

----------------

(In reply to John Sanda from comment #8)
> Can you test again with only a single cassandra pod? I am curious to see if
> we will get the same behavior.

This error does not happen every time. Tested with a single cassandra pod and did not find this error, but that does not mean the error won't happen with a single cassandra pod.

----------------

Tested on a public env with a single cassandra pod today (many people use the env to test); the issue is reproduced, image version is v3.10.28-1. Maybe this issue happens when there are a lot of reads/writes.

```
# oc get pod -n openshift-infra
NAME                            READY     STATUS      RESTARTS   AGE
hawkular-cassandra-1-czw5r      1/1       Running     0          1h
hawkular-metrics-fh7pm          1/1       Running     0          1h
hawkular-metrics-schema-jhmzv   0/1       Completed   0          1h
heapster-8fnfl                  1/1       Running     0          1h
```

```
2018-08-15 07:40:14,645 ERROR [org.hawkular.metrics.api.jaxrs.util.ApiUtils] (RxComputationScheduler-1) HAWKMETRICS200010: Failed to process request: java.lang.RuntimeException: java.util.concurrent.ExecutionException: com.datastax.driver.core.exceptions.WriteTimeoutException: Cassandra timeout during write query at consistency LOCAL_ONE (1 replica were required but only 0 acknowledged the write)
	at rx.exceptions.Exceptions.propagate(Exceptions.java:57)
	at org.hawkular.metrics.core.service.DataAccessImpl.lambda$null$21(DataAccessImpl.java:920)
	at rx.functions.Functions$3.call(Functions.java:88)
	at rx.internal.operators.OperatorZip$Zip.tick(OperatorZip.java:252)
	at rx.internal.operators.OperatorZip$Zip$InnerSubscriber.onNext(OperatorZip.java:323)
	at rx.internal.operators.OnSubscribeMap$MapSubscriber.onNext(OnSubscribeMap.java:77)
	at rx.internal.operators.OnSubscribeRedo$3$1.onNext(OnSubscribeRedo.java:302)
	at rx.internal.operators.OnSubscribeRedo$3$1.onNext(OnSubscribeRedo.java:284)
	at rx.internal.operators.NotificationLite.accept(NotificationLite.java:135)
	at rx.subjects.SubjectSubscriptionManager$SubjectObserver.emitNext(SubjectSubscriptionManager.java:253)
	at rx.subjects.BehaviorSubject.onNext(BehaviorSubject.java:160)
	at rx.observers.SerializedObserver.onNext(SerializedObserver.java:91)
	at rx.subjects.SerializedSubject.onNext(SerializedSubject.java:67)
	at rx.internal.operators.OnSubscribeRedo$2$1.onError(OnSubscribeRedo.java:237)
	at rx.observable.ListenableFutureObservable$2$1.run(ListenableFutureObservable.java:81)
	at rx.observable.ListenableFutureObservable$1$1.call(ListenableFutureObservable.java:50)
	at rx.internal.schedulers.EventLoopsScheduler$EventLoopWorker$1.call(EventLoopsScheduler.java:172)
	at rx.internal.schedulers.ScheduledAction.run(ScheduledAction.java:55)
	at java.util.concurrent.Executors$RunnableAdapter.call(Executors.java:511)
	at java.util.concurrent.FutureTask.run(FutureTask.java:266)
	at java.util.concurrent.ScheduledThreadPoolExecutor$ScheduledFutureTask.access$201(ScheduledThreadPoolExecutor.java:180)
	at java.util.concurrent.ScheduledThreadPoolExecutor$ScheduledFutureTask.run(ScheduledThreadPoolExecutor.java:293)
	at java.util.concurrent.ThreadPoolExecutor.runWorker(ThreadPoolExecutor.java:1149)
	at java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:624)
	at java.lang.Thread.run(Thread.java:748)
Caused by: java.util.concurrent.ExecutionException: com.datastax.driver.core.exceptions.WriteTimeoutException: Cassandra timeout during write query at consistency LOCAL_ONE (1 replica were required but only 0 acknowledged the write)
	at com.google.common.util.concurrent.AbstractFuture$Sync.getValue(AbstractFuture.java:299)
	at com.google.common.util.concurrent.AbstractFuture$Sync.get(AbstractFuture.java:286)
	at com.google.common.util.concurrent.AbstractFuture.get(AbstractFuture.java:116)
	at rx.observable.ListenableFutureObservable$2$1.run(ListenableFutureObservable.java:78)
	... 10 more
Caused by: com.datastax.driver.core.exceptions.WriteTimeoutException: Cassandra timeout during write query at consistency LOCAL_ONE (1 replica were required but only 0 acknowledged the write)
	at com.datastax.driver.core.exceptions.WriteTimeoutException.copy(WriteTimeoutException.java:100)
	at com.datastax.driver.core.Responses$Error.asException(Responses.java:134)
	at com.datastax.driver.core.RequestHandler$SpeculativeExecution.onSet(RequestHandler.java:525)
	at com.datastax.driver.core.Connection$Dispatcher.channelRead0(Connection.java:1077)
	at com.datastax.driver.core.Connection$Dispatcher.channelRead0(Connection.java:1000)
	at io.netty.channel.SimpleChannelInboundHandler.channelRead(SimpleChannelInboundHandler.java:105)
	at io.netty.channel.AbstractChannelHandlerContext.invokeChannelRead(AbstractChannelHandlerContext.java:356)
	at io.netty.channel.AbstractChannelHandlerContext.invokeChannelRead(AbstractChannelHandlerContext.java:342)
	at io.netty.channel.AbstractChannelHandlerContext.fireChannelRead(AbstractChannelHandlerContext.java:335)
	at io.netty.handler.timeout.IdleStateHandler.channelRead(IdleStateHandler.java:287)
	at io.netty.channel.AbstractChannelHandlerContext.invokeChannelRead(AbstractChannelHandlerContext.java:356)
	at io.netty.channel.AbstractChannelHandlerContext.invokeChannelRead(AbstractChannelHandlerContext.java:342)
	at io.netty.channel.AbstractChannelHandlerContext.fireChannelRead(AbstractChannelHandlerContext.java:335)
	at io.netty.handler.codec.MessageToMessageDecoder.channelRead(MessageToMessageDecoder.java:102)
	at io.netty.channel.AbstractChannelHandlerContext.invokeChannelRead(AbstractChannelHandlerContext.java:356)
	at io.netty.channel.AbstractChannelHandlerContext.invokeChannelRead(AbstractChannelHandlerContext.java:342)
	at io.netty.channel.AbstractChannelHandlerContext.fireChannelRead(AbstractChannelHandlerContext.java:335)
	at io.netty.handler.codec.ByteToMessageDecoder.fireChannelRead(ByteToMessageDecoder.java:312)
	at io.netty.handler.codec.ByteToMessageDecoder.channelRead(ByteToMessageDecoder.java:286)
	at io.netty.channel.AbstractChannelHandlerContext.invokeChannelRead(AbstractChannelHandlerContext.java:356)
	at io.netty.channel.AbstractChannelHandlerContext.invokeChannelRead(AbstractChannelHandlerContext.java:342)
	at io.netty.channel.AbstractChannelHandlerContext.fireChannelRead(AbstractChannelHandlerContext.java:335)
	at io.netty.handler.ssl.SslHandler.unwrap(SslHandler.java:1267)
	at io.netty.handler.ssl.SslHandler.decode(SslHandler.java:1078)
	at io.netty.handler.codec.ByteToMessageDecoder.decodeRemovalReentryProtection(ByteToMessageDecoder.java:491)
	at io.netty.handler.codec.ByteToMessageDecoder.callDecode(ByteToMessageDecoder.java:430)
	at io.netty.handler.codec.ByteToMessageDecoder.channelRead(ByteToMessageDecoder.java:267)
	at io.netty.channel.AbstractChannelHandlerContext.invokeChannelRead(AbstractChannelHandlerContext.java:356)
	at io.netty.channel.AbstractChannelHandlerContext.invokeChannelRead(AbstractChannelHandlerContext.java:342)
	at io.netty.channel.AbstractChannelHandlerContext.fireChannelRead(AbstractChannelHandlerContext.java:335)
	at io.netty.channel.DefaultChannelPipeline$HeadContext.channelRead(DefaultChannelPipeline.java:1294)
	at io.netty.channel.AbstractChannelHandlerContext.invokeChannelRead(AbstractChannelHandlerContext.java:356)
	at io.netty.channel.AbstractChannelHandlerContext.invokeChannelRead(AbstractChannelHandlerContext.java:342)
	at io.netty.channel.DefaultChannelPipeline.fireChannelRead(DefaultChannelPipeline.java:911)
	at io.netty.channel.nio.AbstractNioByteChannel$NioByteUnsafe.read(AbstractNioByteChannel.java:131)
	at io.netty.channel.nio.NioEventLoop.processSelectedKey(NioEventLoop.java:645)
	at io.netty.channel.nio.NioEventLoop.processSelectedKeysOptimized(NioEventLoop.java:580)
	at io.netty.channel.nio.NioEventLoop.processSelectedKeys(NioEventLoop.java:497)
```

----------------

I still think https://issues.jboss.org/browse/HWKMETRICS-797 is the root cause.
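For reference, the memory bump mentioned in the comments is set through the openshift-ansible inventory. A minimal fragment might look like the following; only `openshift_metrics_cassandra_limits_memory` appears in this bug, so treat the CPU limit variable name as an assumption to verify against the OCP 3.10 metrics documentation.

```ini
; openshift-ansible inventory fragment (sketch, not a verified config).
; Only the memory variable is taken from this bug report.
[OSEv3:vars]
openshift_metrics_cassandra_limits_memory=4G
; openshift_metrics_cassandra_limits_cpu=2000m  ; assumed variable name, verify
```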
----------------

OCP 3.6-3.10 is no longer on full support [1]. Marking CLOSED DEFERRED.

If you have a customer case with a support exception or have reproduced on 3.11+, please reopen and include those details. When reopening, please set the Target Release to the appropriate version where needed.

[1]: https://access.redhat.com/support/policy/updates/openshift
----------------

Created attachment 1447678 [details]
metrics pods log

Description of problem:
Sometimes, hawkular-metrics pod would throw out error: WriteTimeoutException: Cassandra timeout during write query at consistency LOCAL_ONE

```
java.lang.RuntimeException: java.util.concurrent.ExecutionException: com.datastax.driver.core.exceptions.WriteTimeoutException: Cassandra timeout during write query at consistency LOCAL_ONE (1 replica were required but only 0 acknowledged the write)
	at rx.exceptions.Exceptions.propagate(Exceptions.java:57)
	at org.hawkular.metrics.core.service.DataAccessImpl.lambda$null$21(DataAccessImpl.java:920)
	at rx.functions.Functions$3.call(Functions.java:88)
	at rx.internal.operators.OperatorZip$Zip.tick(OperatorZip.java:252)
	at rx.internal.operators.OperatorZip$Zip$InnerSubscriber.onNext(OperatorZip.java:323)
	at rx.internal.operators.OnSubscribeMap$MapSubscriber.onNext(OnSubscribeMap.java:77)
	at rx.internal.operators.OnSubscribeRedo$3$1.onNext(OnSubscribeRedo.java:302)
	at rx.internal.operators.OnSubscribeRedo$3$1.onNext(OnSubscribeRedo.java:284)
	at rx.internal.operators.NotificationLite.accept(NotificationLite.java:135)
	at rx.subjects.SubjectSubscriptionManager$SubjectObserver.emitNext(SubjectSubscriptionManager.java:253)
	at rx.subjects.BehaviorSubject.onNext(BehaviorSubject.java:160)
	at rx.observers.SerializedObserver.onNext(SerializedObserver.java:91)
	at rx.subjects.SerializedSubject.onNext(SerializedSubject.java:67)
	at rx.internal.operators.OnSubscribeRedo$2$1.onError(OnSubscribeRedo.java:237)
	at rx.observable.ListenableFutureObservable$2$1.run(ListenableFutureObservable.java:81)
	at rx.observable.ListenableFutureObservable$1$1.call(ListenableFutureObservable.java:50)
	at rx.internal.schedulers.EventLoopsScheduler$EventLoopWorker$1.call(EventLoopsScheduler.java:172)
	at rx.internal.schedulers.ScheduledAction.run(ScheduledAction.java:55)
	at java.util.concurrent.Executors$RunnableAdapter.call(Executors.java:511)
	at java.util.concurrent.FutureTask.run(FutureTask.java:266)
	at java.util.concurrent.ScheduledThreadPoolExecutor$ScheduledFutureTask.access$201(ScheduledThreadPoolExecutor.java:180)
	at java.util.concurrent.ScheduledThreadPoolExecutor$ScheduledFutureTask.run(ScheduledThreadPoolExecutor.java:293)
	at java.util.concurrent.ThreadPoolExecutor.runWorker(ThreadPoolExecutor.java:1149)
	at java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:624)
	at java.lang.Thread.run(Thread.java:748)
```

Version-Release number of selected component (if applicable):
v3.10.0-0.58.0

How reproducible:
Sometimes

Steps to Reproduce:
1. Deploy metrics 3.10 and let it run for a few hours
2.
3.

Actual results:
WriteTimeoutException throws out: Cassandra timeout during write query at consistency LOCAL_ONE

Expected results:

Additional info: