1120775 – Handle inconsistent reads during data aggregation

Bug 1120775 - Handle inconsistent reads during data aggregation

Keywords:
Status:	NEW
Alias:	None
Product:	RHQ Project
Classification:	Other
Component:	Core Server, Storage Node
Sub Component:
Version:	4.12
Hardware:	Unspecified
OS:	Unspecified
Priority:	unspecified
Severity:	unspecified
Target Milestone:	---
Target Release:	RHQ 4.13
Assignee:	Nobody
QA Contact:
Docs Contact:
URL:
Whiteboard:
Depends On:
Blocks:	1133605 1120441 1120442

Reported:	2014-07-17 16:00 UTC by John Sanda
Modified:	2022-03-31 04:28 UTC
CC List:	2 users
Fixed In Version:
Clone Of:
Environment:
Last Closed:
Embargoed:

Description John Sanda 2014-07-17 16:00:23 UTC

Description of problem:
We use consistency level (CL) 1 for all reads and writes of metric data, which means it is entirely possible to have inconsistent reads. In general this is not a problem. There are places in the aggregation code, though, where we assume that queries will return data when they might not.

Both bug 1120442 and bug 1120441 involve inconsistent reads. Each involves queries that we expect to return data but that return none. This can happen when the request is served by a replica that has not yet received or applied the mutation that contains the requested data.

For debugging purposes we should log these situations when they occur. We do quorum reads to ensure consistency, but I think we can avoid that overhead. Alternatively we can reschedule the aggregation for the data in question to allow time for the mutations to be applied across replicas.
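
The log-and-reschedule idea above could look roughly like the following. This is only a minimal sketch, not the RHQ aggregation code: EmptyReadHandler, aggregate, and the query function are hypothetical names, and the query is modeled as a plain function so that the example runs without a storage cluster.

```java
import java.util.ArrayList;
import java.util.List;
import java.util.function.Function;
import java.util.logging.Logger;

// Hypothetical helper, not part of RHQ: illustrates logging an empty read
// and collecting the affected schedule ids for a later aggregation run.
final class EmptyReadHandler {
    private static final Logger LOG = Logger.getLogger(EmptyReadHandler.class.getName());

    /**
     * Averages the values of every schedule whose query returns data and appends
     * the result to averagesOut. Returns the ids whose query came back empty so
     * that the caller can reschedule them once replicas have caught up.
     */
    static List<Integer> aggregate(List<Integer> scheduleIds,
                                   Function<Integer, List<Double>> query,
                                   List<Double> averagesOut) {
        List<Integer> retryLater = new ArrayList<>();
        for (Integer id : scheduleIds) {
            List<Double> values = query.apply(id);
            if (values == null || values.isEmpty()) {
                // Do not assume data is present: record the event and move on.
                LOG.warning("No data returned for schedule " + id
                    + "; possible inconsistent read, deferring aggregation");
                retryLater.add(id);
                continue;
            }
            double sum = 0;
            for (double v : values) {
                sum += v;
            }
            averagesOut.add(sum / values.size());
        }
        return retryLater;
    }
}
```

The returned ids are what a caller would hand to whatever rescheduling mechanism is chosen; how long to wait before retrying is left open here.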

Comment 1 John Sanda 2014-07-17 18:14:14 UTC

A slight correction from the description. I meant to say that we *can* do quorum reads to ensure consistency; however, that is not even always the case. With our current replication strategy, described at https://docs.jboss.org/author/display/RHQ/Data+Replication+and+Consistency, quorum reads would ensure consistency only for a two or three node cluster. For four or more nodes, we would need to use a CL of 3 or all.
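
The arithmetic behind this can be sketched with the usual overlap rule: a read is guaranteed to see the latest write only when the number of replicas read plus the number of replicas written exceeds the replication factor (RF). The class and method names below are hypothetical, and the RF values in the comments (2 for a two or three node cluster, 3 for four or more nodes) are an assumption inferred from the statement above, not taken from the linked page.

```java
// Hypothetical helper, not part of RHQ: consistency-level arithmetic only.
final class ReadConsistency {
    /** Quorum size for a replication factor: floor(rf / 2) + 1. */
    static int quorum(int replicationFactor) {
        return replicationFactor / 2 + 1;
    }

    /** True when every read replica set must overlap every write replica set. */
    static boolean readSeesLatestWrite(int readReplicas, int writeReplicas, int replicationFactor) {
        return readReplicas + writeReplicas > replicationFactor;
    }

    /** Smallest number of replicas a read must contact to overlap the given writes. */
    static int minReadReplicas(int writeReplicas, int replicationFactor) {
        return replicationFactor - writeReplicas + 1;
    }
}

// With writes at CL 1:
//   assumed RF = 2: quorum(2) = 2, and 2 + 1 > 2, so quorum reads are consistent.
//   assumed RF = 3: quorum(3) = 2, and 2 + 1 > 3 is false, so quorum reads are not
//                   enough; minReadReplicas(1, 3) = 3, i.e. a CL of 3 or all.
```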

Comment 2 Elias Ross 2014-07-17 23:15:38 UTC

So yes, I have a three node cluster and the behavior can be inconsistent across hosts.

I've also seen this:

23:00:36,561 WARN  [org.rhq.server.metrics.StorageSession] (RHQScheduler_Worker-1) Encountered NoHostAvailableException due to following error(s): {/17.176.208.117=Timeout during read, /17.176.208.118=Timeout during read, /17.176.208.119=Timeout during read}
23:00:36,562 INFO  [org.rhq.server.metrics.StorageSession] (RHQScheduler_Worker-1) Changing request throughput from 90000.0 request/sec to 90000.0 requests/sec
23:00:36,562 WARN  [org.rhq.server.metrics.aggregation.PastDataAggregator] (RHQScheduler_Worker-1) There was an error querying the cache index: org.rhq.server.metrics.aggregation.CacheIndexQueryException: Failed to load cache index entries prior to current time slice 2014-07-17T22:00:00.000Z
        at org.rhq.server.metrics.aggregation.IndexEntriesLoader.loadPastIndexEntries(IndexEntriesLoader.java:72) [rhq-server-metrics-4.12.0.jar:4.12.0]
        at org.rhq.server.metrics.aggregation.PastDataAggregator.getIndexEntries(PastDataAggregator.java:74) [rhq-server-metrics-4.12.0.jar:4.12.0]
        at org.rhq.server.metrics.aggregation.BaseAggregator.execute(BaseAggregator.java:167) [rhq-server-metrics-4.12.0.jar:4.12.0]
        at org.rhq.server.metrics.aggregation.AggregationManager.run(AggregationManager.java:101) [rhq-server-metrics-4.12.0.jar:4.12.0]
        at org.rhq.server.metrics.MetricsServer.calculateAggregates(MetricsServer.java:619) [rhq-server-metrics-4.12.0.jar:4.12.0]
        at org.rhq.enterprise.server.scheduler.jobs.DataPurgeJob.compressMeasurementData(DataPurgeJob.java:114) [rhq-server.jar:4.12.0]
        at org.rhq.enterprise.server.scheduler.jobs.DataPurgeJob.executeJobCode(DataPurgeJob.java:92) [rhq-server.jar:4.12.0]
        at org.rhq.enterprise.server.scheduler.jobs.AbstractStatefulJob.execute(AbstractStatefulJob.java:48) [rhq-server.jar:4.12.0]
        at org.quartz.core.JobRunShell.run(JobRunShell.java:202) [quartz-1.6.5.jar:1.6.5]
        at org.quartz.simpl.SimpleThreadPool$WorkerThread.run(SimpleThreadPool.java:525) [quartz-1.6.5.jar:1.6.5]
Caused by: com.datastax.driver.core.exceptions.NoHostAvailableException: All host(s) tried for query failed (tried: /17.176.208.117 (Timeout during read), /17.176.208.118 (Timeout during read), /17.176.208.119 (Timeout during read))
        at com.datastax.driver.core.exceptions.NoHostAvailableException.copy(NoHostAvailableException.java:64) [cassandra-driver-core-1.0.5.jar:]
        at com.datastax.driver.core.ResultSetFuture.extractCauseFromExecutionException(ResultSetFuture.java:269) [cassandra-driver-core-1.0.5.jar:]
        at com.datastax.driver.core.ResultSetFuture.getUninterruptibly(ResultSetFuture.java:183) [cassandra-driver-core-1.0.5.jar:]
        at org.rhq.server.metrics.StorageResultSetFuture.get(StorageResultSetFuture.java:57) [rhq-server-metrics-4.12.0.jar:4.12.0]
        at org.rhq.server.metrics.aggregation.IndexEntriesLoader.addResultSet(IndexEntriesLoader.java:116) [rhq-server-metrics-4.12.0.jar:4.12.0]
        at org.rhq.server.metrics.aggregation.IndexEntriesLoader.loadPastIndexEntries(IndexEntriesLoader.java:68) [rhq-server-metrics-4.12.0.jar:4.12.0]
        ... 9 more
Caused by: com.datastax.driver.core.exceptions.NoHostAvailableException: All host(s) tried for query failed (tried: /17.176.208.117 (Timeout during read), /17.176.208.118 (Timeout during read), /17.176.208.119 (Timeout during read))
        at com.datastax.driver.core.RequestHandler.sendRequest(RequestHandler.java:106) [cassandra-driver-core-1.0.5.jar:]
        at com.datastax.driver.core.RequestHandler$1.run(RequestHandler.java:177) [cassandra-driver-core-1.0.5.jar:]
        at java.util.concurrent.ThreadPoolExecutor.runWorker(ThreadPoolExecutor.java:1145) [rt.jar:1.7.0_40]
        at java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:615) [rt.jar:1.7.0_40]
        at java.lang.Thread.run(Thread.java:724) [rt.jar:1.7.0_40]