Note: This bug is displayed in read-only format because the product is no longer active in Red Hat Bugzilla.

Bug 1585891

Summary: [3.10] Sometimes the hawkular-metrics pod throws the error: WriteTimeoutException: Cassandra timeout during write query at consistency LOCAL_ONE
Product: OpenShift Container Platform    Reporter: Junqi Zhao <juzhao>
Component: Hawkular    Assignee: Ruben Vargas Palma <rvargasp>
Status: CLOSED WONTFIX    QA Contact: Junqi Zhao <juzhao>
Severity: high    Docs Contact:
Priority: medium
Version: 3.11.0    CC: agawand, aos-bugs, apurty, cruhm, erich, haowang, jkaur, jmalde, juzhao, ktadimar, pescorza, pkanthal, rkshirsa, rvargasp, sburke, sople
Target Milestone: ---    Keywords: Reopened
Target Release: 3.11.z
Hardware: Unspecified
OS: Unspecified
Whiteboard:
Fixed In Version:    Doc Type: If docs needed, set a value
Doc Text:
Story Points: ---
Clone Of:
: 1606771 1606783 1623025 (view as bug list)    Environment:
Last Closed: 2021-11-09 14:14:22 UTC    Type: Bug
Regression: ---    Mount Type: ---
Documentation: ---    CRM:
Verified Versions:    Category: ---
oVirt Team: ---    RHEL 7.3 requirements from Atomic Host:
Cloudforms Team: ---    Target Upstream Version:
Embargoed:
Bug Depends On:    
Bug Blocks: 1606771, 1606783, 1623025    
Attachments:
Description Flags
metrics pods log none

Description Junqi Zhao 2018-06-05 02:27:55 UTC
Created attachment 1447678 [details]
metrics pods log

Description of problem:
Sometimes the hawkular-metrics pod throws the error: WriteTimeoutException: Cassandra timeout during write query at consistency LOCAL_ONE
java.lang.RuntimeException: java.util.concurrent.ExecutionException: com.datastax.driver.core.exceptions.WriteTimeoutException: Cassandra timeout during write query at consistency LOCAL_ONE (1 replica were required but only 0 acknowledged the write)
    at rx.exceptions.Exceptions.propagate(Exceptions.java:57)
    at org.hawkular.metrics.core.service.DataAccessImpl.lambda$null$21(DataAccessImpl.java:920)
    at rx.functions.Functions$3.call(Functions.java:88)
    at rx.internal.operators.OperatorZip$Zip.tick(OperatorZip.java:252)
    at rx.internal.operators.OperatorZip$Zip$InnerSubscriber.onNext(OperatorZip.java:323)
    at rx.internal.operators.OnSubscribeMap$MapSubscriber.onNext(OnSubscribeMap.java:77)
    at rx.internal.operators.OnSubscribeRedo$3$1.onNext(OnSubscribeRedo.java:302)
    at rx.internal.operators.OnSubscribeRedo$3$1.onNext(OnSubscribeRedo.java:284)
    at rx.internal.operators.NotificationLite.accept(NotificationLite.java:135)
    at rx.subjects.SubjectSubscriptionManager$SubjectObserver.emitNext(SubjectSubscriptionManager.java:253)
    at rx.subjects.BehaviorSubject.onNext(BehaviorSubject.java:160)
    at rx.observers.SerializedObserver.onNext(SerializedObserver.java:91)
    at rx.subjects.SerializedSubject.onNext(SerializedSubject.java:67)
    at rx.internal.operators.OnSubscribeRedo$2$1.onError(OnSubscribeRedo.java:237)
    at rx.observable.ListenableFutureObservable$2$1.run(ListenableFutureObservable.java:81)
    at rx.observable.ListenableFutureObservable$1$1.call(ListenableFutureObservable.java:50)
    at rx.internal.schedulers.EventLoopsScheduler$EventLoopWorker$1.call(EventLoopsScheduler.java:172)
    at rx.internal.schedulers.ScheduledAction.run(ScheduledAction.java:55)
    at java.util.concurrent.Executors$RunnableAdapter.call(Executors.java:511)
    at java.util.concurrent.FutureTask.run(FutureTask.java:266)
    at java.util.concurrent.ScheduledThreadPoolExecutor$ScheduledFutureTask.access$201(ScheduledThreadPoolExecutor.java:180)
    at java.util.concurrent.ScheduledThreadPoolExecutor$ScheduledFutureTask.run(ScheduledThreadPoolExecutor.java:293)
    at java.util.concurrent.ThreadPoolExecutor.runWorker(ThreadPoolExecutor.java:1149)
    at java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:624)
    at java.lang.Thread.run(Thread.java:748)

Version-Release number of selected component (if applicable):
v3.10.0-0.58.0

How reproducible:
Sometimes

Steps to Reproduce:
1. Deploy metrics 3.10 and let it run for a few hours
2.
3.

Actual results:
A WriteTimeoutException is thrown: Cassandra timeout during write query at consistency LOCAL_ONE

Expected results:


Additional info:

Comment 1 Venkata Tadimarri 2018-07-18 03:55:15 UTC
The same issue is observed on a 3.6 cluster as well.


Caused by: com.datastax.driver.core.exceptions.WriteTimeoutException: Cassandra timeout during write query at consistency LOCAL_ONE (1 replica were required but only 0 acknowledged the write)
        at com.datastax.driver.core.Responses$Error$1.decode(Responses.java:59)
        at com.datastax.driver.core.Responses$Error$1.decode(Responses.java:37)
        at com.datastax.driver.core.Message$ProtocolDecoder.decode(Message.java:277)
        at com.datastax.driver.core.Message$ProtocolDecoder.decode(Message.java:257)
        at io.netty.handler.codec.MessageToMessageDecoder.channelRead(MessageToMessageDecoder.java:88)
        ... 29 more

----------------

The customer was also hitting the issue in https://bugzilla.redhat.com/show_bug.cgi?id=1561823  but the error does not appear now after performing the workaround mentioned in the bug. 

Please advise.

Comment 2 Venkata Tadimarri 2018-07-18 04:00:21 UTC
(In reply to Venkata Tadimarri from comment #1)
> The same issue is observed on a 3.6 cluster as well.
> 
> 
> aused by: com.datastax.driver.core.exceptions.WriteTimeoutException:
> Cassandra timeout during write query at consistency LOCAL_ONE (1 replica
> were required but only 0 acknowledged the write)
>         at
> com.datastax.driver.core.Responses$Error$1.decode(Responses.java:59)
>         at
> com.datastax.driver.core.Responses$Error$1.decode(Responses.java:37)
>         at
> com.datastax.driver.core.Message$ProtocolDecoder.decode(Message.java:277)
>         at
> com.datastax.driver.core.Message$ProtocolDecoder.decode(Message.java:257)
>         at
> io.netty.handler.codec.MessageToMessageDecoder.
> channelRead(MessageToMessageDecoder.java:88)
>         ... 29 more
> 
> ----------------
> 
> The customer was also hitting the issue in
> https://bugzilla.redhat.com/show_bug.cgi?id=1561823  but the error does not
> appear now after performing the workaround mentioned in the bug. 
> 
> Please advise.



Update:

After adding a new PV to Cassandra and starting the containers (i.e. with no previous data), the pods start up without any of the errors.

Comment 3 John Sanda 2018-07-18 13:34:06 UTC
(In reply to Venkata Tadimarri from comment #1)
> The same issue is observed on a 3.6 cluster as well.
> 
> 
> aused by: com.datastax.driver.core.exceptions.WriteTimeoutException:
> Cassandra timeout during write query at consistency LOCAL_ONE (1 replica
> were required but only 0 acknowledged the write)
>         at
> com.datastax.driver.core.Responses$Error$1.decode(Responses.java:59)
>         at
> com.datastax.driver.core.Responses$Error$1.decode(Responses.java:37)
>         at
> com.datastax.driver.core.Message$ProtocolDecoder.decode(Message.java:277)
>         at
> com.datastax.driver.core.Message$ProtocolDecoder.decode(Message.java:257)
>         at
> io.netty.handler.codec.MessageToMessageDecoder.
> channelRead(MessageToMessageDecoder.java:88)
>         ... 29 more
> 
> ----------------
> 
> The customer was also hitting the issue in
> https://bugzilla.redhat.com/show_bug.cgi?id=1561823  but the error does not
> appear now after performing the workaround mentioned in the bug. 
> 
> Please advise.

Please open a separate ticket for this. There are quite a few changes between metrics in 3.6 and metrics in 3.10.

Comment 5 John Sanda 2018-07-18 21:42:52 UTC
I see similar behavior in free-int, and I doubt it is simply coincidence.

Can you double the memory for the cassandra pods?

How much cpu is allocated to cassandra pods?

Can you also provide the output of `oc get pods --all-namespaces | wc -l`?

Comment 7 Junqi Zhao 2018-07-19 05:40:58 UTC
(In reply to John Sanda from comment #5)
> I see similar behavior in free-int, and I doubt it is simply coincidence.

> Can you double the memory for the cassandra pods?
Yes, changed to openshift_metrics_cassandra_limits_memory=4G 
> How much cpu is allocated to cassandra pods?
Used the default value; no CPU limit is set for the cassandra pods.

> Can you also provide the output of `oc get pods --all-namespaces | wc -l`?

# oc get pods --all-namespaces | wc -l
28

Comment 8 John Sanda 2018-07-20 02:36:56 UTC
There is another job that, like the compression job, runs every two hours, but it is staggered to run one hour earlier, and it creates the "temp" tables. When a new table is created, hawkular-metrics receives an event notification from the Cassandra driver about the schema change and then proceeds to create the prepared statements for the new tables. There is a race condition that causes this to fail. Here it is in the provided logs:


2018-06-04 15:08:48,753 ERROR [stderr] (RxIoScheduler-9) Caused by: com.datastax.driver.core.exceptions.InvalidQueryException: unconfigured table data_temp_2018060512
2018-06-04 15:08:48,753 ERROR [stderr] (RxIoScheduler-9)     at com.datastax.driver.core.exceptions.InvalidQueryException.copy(InvalidQueryException.java:50)
2018-06-04 15:08:48,753 ERROR [stderr] (RxIoScheduler-9)     at com.datastax.driver.core.DriverThrowables.propagateCause(DriverThrowables.java:37)
2018-06-04 15:08:48,754 ERROR [stderr] (RxIoScheduler-9)     at com.datastax.driver.core.AbstractSession.prepare(AbstractSession.java:104)
2018-06-04 15:08:48,754 ERROR [stderr] (RxIoScheduler-9)     at org.hawkular.metrics.core.service.DataAccessImpl.prepareTempStatements(DataAccessImpl.java:392)
2018-06-04 15:08:48,754 ERROR [stderr] (RxIoScheduler-9)     at org.hawkular.metrics.core.service.DataAccessImpl$TemporaryTableStatementCreator.lambda$onTableAdded$0(DataAccessImpl.java:1373)
2018-06-04 15:08:48,754 ERROR [stderr] (RxIoScheduler-9)     at rx.internal.operators.OnSubscribeFromCallable.call(OnSubscribeFromCallable.java:48)



I think this may be causing the write timeouts. The timeouts occur when inserting data points. Let's say hawkular-metrics sends the query to cassandra-1, but cassandra-2 is the owner. cassandra-1 forwards the request to cassandra-2 and waits for a response. cassandra-2 must parse the queries, an expensive step that is normally done only once, up front, when a prepared statement is created. Because of the above error, cassandra-2 occasionally has to repeat this parsing. The extra processing time causes cassandra-1 to time out.

I have not had a chance to try and reproduce this theory in a more isolated environment, but I do think that there is a good chance this is the cause.

Can you test again with only a single cassandra pod? I am curious to see if we will get the same behavior.
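The race described in comment 8 can be illustrated with a minimal, self-contained sketch. This is not Hawkular's actual code (the class and method names here are hypothetical); it only models the core idea: a cache of prepared statements keyed by temp-table name, populated by a schema-change callback, where a write that arrives before the callback has run misses the cache and falls back to the slow, parse-on-every-request path.

```java
import java.util.Map;
import java.util.concurrent.ConcurrentHashMap;

// Hypothetical model of the prepared-statement race, not Hawkular's real code.
// The schema-change listener prepares statements for each new temp table; a
// write racing ahead of that listener finds no prepared statement, mirroring
// the "unconfigured table data_temp_..." errors in the logs above.
public class PreparedStatementRace {
    private final Map<String, String> preparedByTable = new ConcurrentHashMap<>();

    // Invoked when the driver reports a new temp table. Preparing is the
    // expensive one-time step that normally amortizes query parsing.
    void onTableAdded(String table) {
        preparedByTable.put(table, "INSERT INTO " + table + " (...) VALUES (...)");
    }

    // A write that arrives before onTableAdded has run misses the cache and
    // must be parsed per-request on the server, adding latency that can push
    // the coordinator past its write timeout.
    boolean writeUsesPrepared(String table) {
        return preparedByTable.containsKey(table);
    }

    public static void main(String[] args) {
        PreparedStatementRace cache = new PreparedStatementRace();
        // Write races ahead of the schema-change notification: slow path.
        System.out.println(cache.writeUsesPrepared("data_temp_2018060512"));
        cache.onTableAdded("data_temp_2018060512");
        // After the listener runs, subsequent writes take the fast path.
        System.out.println(cache.writeUsesPrepared("data_temp_2018060512"));
    }
}
```

Under this model, shrinking to one Cassandra pod would not remove the race itself, but it would remove the coordinator-to-replica forwarding hop, which is consistent with the single-pod test requested above.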

Comment 9 John Sanda 2018-07-20 21:16:03 UTC
I was able to easily reproduce the exception in comment 8 with an automated test running locally on my dev machine. I reproduced the error with a three node Cassandra cluster. I did not get any errors when running just a single node.

Comment 10 Junqi Zhao 2018-08-14 07:17:59 UTC
(In reply to John Sanda from comment #8)
> Can you test again with only a single cassandra pod? I am curious to see if
> we will get the same behavior.

This error does not happen every time. I tested with a single cassandra pod and did not see the error, but that does not mean it cannot happen with a single cassandra pod.

Comment 11 Junqi Zhao 2018-08-15 07:58:54 UTC
Tested today on a public environment with a single cassandra pod (many people use this environment for testing); the issue is reproduced. The image version is v3.10.28-1.

Perhaps the issue occurs when there is a lot of read/write activity.
# oc get pod -n openshift-infra
NAME                            READY     STATUS      RESTARTS   AGE
hawkular-cassandra-1-czw5r      1/1       Running     0          1h
hawkular-metrics-fh7pm          1/1       Running     0          1h
hawkular-metrics-schema-jhmzv   0/1       Completed   0          1h
heapster-8fnfl                  1/1       Running     0          1h


2018-08-15 07:40:14,645 ERROR [org.hawkular.metrics.api.jaxrs.util.ApiUtils] (RxComputationScheduler-1) HAWKMETRICS200010: Failed to process request: java.lang.RuntimeException: java.util.concurrent.ExecutionException: com.datastax.driver.core.exceptions.WriteTimeoutException: Cassandra timeout during write query at consistency LOCAL_ONE (1 replica were required but only 0 acknowledged the write)
	at rx.exceptions.Exceptions.propagate(Exceptions.java:57)
	at org.hawkular.metrics.core.service.DataAccessImpl.lambda$null$21(DataAccessImpl.java:920)
	at rx.functions.Functions$3.call(Functions.java:88)
	at rx.internal.operators.OperatorZip$Zip.tick(OperatorZip.java:252)
	at rx.internal.operators.OperatorZip$Zip$InnerSubscriber.onNext(OperatorZip.java:323)
	at rx.internal.operators.OnSubscribeMap$MapSubscriber.onNext(OnSubscribeMap.java:77)
	at rx.internal.operators.OnSubscribeRedo$3$1.onNext(OnSubscribeRedo.java:302)
	at rx.internal.operators.OnSubscribeRedo$3$1.onNext(OnSubscribeRedo.java:284)
	at rx.internal.operators.NotificationLite.accept(NotificationLite.java:135)
	at rx.subjects.SubjectSubscriptionManager$SubjectObserver.emitNext(SubjectSubscriptionManager.java:253)
	at rx.subjects.BehaviorSubject.onNext(BehaviorSubject.java:160)
	at rx.observers.SerializedObserver.onNext(SerializedObserver.java:91)
	at rx.subjects.SerializedSubject.onNext(SerializedSubject.java:67)
	at rx.internal.operators.OnSubscribeRedo$2$1.onError(OnSubscribeRedo.java:237)
	at rx.observable.ListenableFutureObservable$2$1.run(ListenableFutureObservable.java:81)
	at rx.observable.ListenableFutureObservable$1$1.call(ListenableFutureObservable.java:50)
	at rx.internal.schedulers.EventLoopsScheduler$EventLoopWorker$1.call(EventLoopsScheduler.java:172)
	at rx.internal.schedulers.ScheduledAction.run(ScheduledAction.java:55)
	at java.util.concurrent.Executors$RunnableAdapter.call(Executors.java:511)
	at java.util.concurrent.FutureTask.run(FutureTask.java:266)
	at java.util.concurrent.ScheduledThreadPoolExecutor$ScheduledFutureTask.access$201(ScheduledThreadPoolExecutor.java:180)
	at java.util.concurrent.ScheduledThreadPoolExecutor$ScheduledFutureTask.run(ScheduledThreadPoolExecutor.java:293)
	at java.util.concurrent.ThreadPoolExecutor.runWorker(ThreadPoolExecutor.java:1149)
	at java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:624)
	at java.lang.Thread.run(Thread.java:748)
Caused by: java.util.concurrent.ExecutionException: com.datastax.driver.core.exceptions.WriteTimeoutException: Cassandra timeout during write query at consistency LOCAL_ONE (1 replica were required but only 0 acknowledged the write)
	at com.google.common.util.concurrent.AbstractFuture$Sync.getValue(AbstractFuture.java:299)
	at com.google.common.util.concurrent.AbstractFuture$Sync.get(AbstractFuture.java:286)
	at com.google.common.util.concurrent.AbstractFuture.get(AbstractFuture.java:116)
	at rx.observable.ListenableFutureObservable$2$1.run(ListenableFutureObservable.java:78)
	... 10 more
Caused by: com.datastax.driver.core.exceptions.WriteTimeoutException: Cassandra timeout during write query at consistency LOCAL_ONE (1 replica were required but only 0 acknowledged the write)
	at com.datastax.driver.core.exceptions.WriteTimeoutException.copy(WriteTimeoutException.java:100)
	at com.datastax.driver.core.Responses$Error.asException(Responses.java:134)
	at com.datastax.driver.core.RequestHandler$SpeculativeExecution.onSet(RequestHandler.java:525)
	at com.datastax.driver.core.Connection$Dispatcher.channelRead0(Connection.java:1077)
	at com.datastax.driver.core.Connection$Dispatcher.channelRead0(Connection.java:1000)
	at io.netty.channel.SimpleChannelInboundHandler.channelRead(SimpleChannelInboundHandler.java:105)
	at io.netty.channel.AbstractChannelHandlerContext.invokeChannelRead(AbstractChannelHandlerContext.java:356)
	at io.netty.channel.AbstractChannelHandlerContext.invokeChannelRead(AbstractChannelHandlerContext.java:342)
	at io.netty.channel.AbstractChannelHandlerContext.fireChannelRead(AbstractChannelHandlerContext.java:335)
	at io.netty.handler.timeout.IdleStateHandler.channelRead(IdleStateHandler.java:287)
	at io.netty.channel.AbstractChannelHandlerContext.invokeChannelRead(AbstractChannelHandlerContext.java:356)
	at io.netty.channel.AbstractChannelHandlerContext.invokeChannelRead(AbstractChannelHandlerContext.java:342)
	at io.netty.channel.AbstractChannelHandlerContext.fireChannelRead(AbstractChannelHandlerContext.java:335)
	at io.netty.handler.codec.MessageToMessageDecoder.channelRead(MessageToMessageDecoder.java:102)
	at io.netty.channel.AbstractChannelHandlerContext.invokeChannelRead(AbstractChannelHandlerContext.java:356)
	at io.netty.channel.AbstractChannelHandlerContext.invokeChannelRead(AbstractChannelHandlerContext.java:342)
	at io.netty.channel.AbstractChannelHandlerContext.fireChannelRead(AbstractChannelHandlerContext.java:335)
	at io.netty.handler.codec.ByteToMessageDecoder.fireChannelRead(ByteToMessageDecoder.java:312)
	at io.netty.handler.codec.ByteToMessageDecoder.channelRead(ByteToMessageDecoder.java:286)
	at io.netty.channel.AbstractChannelHandlerContext.invokeChannelRead(AbstractChannelHandlerContext.java:356)
	at io.netty.channel.AbstractChannelHandlerContext.invokeChannelRead(AbstractChannelHandlerContext.java:342)
	at io.netty.channel.AbstractChannelHandlerContext.fireChannelRead(AbstractChannelHandlerContext.java:335)
	at io.netty.handler.ssl.SslHandler.unwrap(SslHandler.java:1267)
	at io.netty.handler.ssl.SslHandler.decode(SslHandler.java:1078)
	at io.netty.handler.codec.ByteToMessageDecoder.decodeRemovalReentryProtection(ByteToMessageDecoder.java:491)
	at io.netty.handler.codec.ByteToMessageDecoder.callDecode(ByteToMessageDecoder.java:430)
	at io.netty.handler.codec.ByteToMessageDecoder.channelRead(ByteToMessageDecoder.java:267)
	at io.netty.channel.AbstractChannelHandlerContext.invokeChannelRead(AbstractChannelHandlerContext.java:356)
	at io.netty.channel.AbstractChannelHandlerContext.invokeChannelRead(AbstractChannelHandlerContext.java:342)
	at io.netty.channel.AbstractChannelHandlerContext.fireChannelRead(AbstractChannelHandlerContext.java:335)
	at io.netty.channel.DefaultChannelPipeline$HeadContext.channelRead(DefaultChannelPipeline.java:1294)
	at io.netty.channel.AbstractChannelHandlerContext.invokeChannelRead(AbstractChannelHandlerContext.java:356)
	at io.netty.channel.AbstractChannelHandlerContext.invokeChannelRead(AbstractChannelHandlerContext.java:342)
	at io.netty.channel.DefaultChannelPipeline.fireChannelRead(DefaultChannelPipeline.java:911)
	at io.netty.channel.nio.AbstractNioByteChannel$NioByteUnsafe.read(AbstractNioByteChannel.java:131)
	at io.netty.channel.nio.NioEventLoop.processSelectedKey(NioEventLoop.java:645)
	at io.netty.channel.nio.NioEventLoop.processSelectedKeysOptimized(NioEventLoop.java:580)
	at io.netty.channel.nio.NioEventLoop.processSelectedKeys(NioEventLoop.java:497)

Comment 12 John Sanda 2018-08-22 13:39:25 UTC
I still think https://issues.jboss.org/browse/HWKMETRICS-797 is the root cause.

Comment 32 Stephen Cuppett 2019-11-20 18:58:10 UTC
OCP 3.6-3.10 is no longer on full support [1]. Marking CLOSED DEFERRED. If you have a customer case with a support exception or have reproduced on 3.11+, please reopen and include those details. When reopening, please set the Target Release to the appropriate version where needed.

[1]: https://access.redhat.com/support/policy/updates/openshift