Bug 1120775

Summary:	Handle inconsistent reads during data aggregation
Product:	[Other] RHQ Project	Reporter:	John Sanda <jsanda>
Component:	Core Server, Storage Node	Assignee:	Nobody <nobody>
Status:	NEW ---	QA Contact:
Severity:	unspecified	Docs Contact:
Priority:	unspecified
Version:	4.12	CC:	genman, hrupp
Target Milestone:	---
Target Release:	RHQ 4.13
Hardware:	Unspecified
OS:	Unspecified
Whiteboard:
Fixed In Version:		Doc Type:	Bug Fix
Doc Text:		Story Points:	---
Clone Of:		Environment:
Last Closed:		Type:	Bug
Regression:	---	Mount Type:	---
Documentation:	---	CRM:
Verified Versions:		Category:	---
oVirt Team:	---	RHEL 7.3 requirements from Atomic Host:
Cloudforms Team:	---	Target Upstream Version:
Embargoed:
Bug Depends On:
Bug Blocks:	1133605, 1120441, 1120442

Description John Sanda 2014-07-17 16:00:23 UTC

Description of problem:
We use consistency level (CL) 1 for all reads and writes of metric data, which means it is entirely possible to have inconsistent reads. In general this is not a problem. There are places though in the aggregation code where we just assume data will be returned from queries when it might not be. 

Both bug 1120442 and bug 1120441 inconsistent reads. They each involves queries that we except to return data that does not get returned. This can happen when the request is performed by a replica that has not yet received or applied the mutation that contains the requested data.

For debugging purposes we should log these situations when they occur. We do quorum reads to ensure consistency, but I think we can avoid that overhead. Alternatively we can reschedule the aggregation for the data in question to allow time for the mutations to be applied across replicas.


Version-Release number of selected component (if applicable):


How reproducible:


Steps to Reproduce:
1.
2.
3.

Actual results:


Expected results:


Additional info:

Comment 1 John Sanda 2014-07-17 18:14:14 UTC

A slight correction from the description. I meant to say that we *can* do quorum reads to ensure consistency; however, that is not even always the case. With our current replication strategy, described at https://docs.jboss.org/author/display/RHQ/Data+Replication+and+Consistency, quorum reads would ensure consistency only for a two or three node cluster. For four or more nodes, we would need to use a CL of 3 or all.

Comment 2 Elias Ross 2014-07-17 23:15:38 UTC

So yes, I have a three node cluster and the behavior can be inconsistent across hosts.

I've also seen this:

23:00:36,561 WARN  [org.rhq.server.metrics.StorageSession] (RHQScheduler_Worker-1) Encountered NoHostAvailableException due to following error(s): {/17.176.20
8.117=Timeout during read, /17.176.208.118=Timeout during read, /17.176.208.119=Timeout during read}
23:00:36,562 INFO  [org.rhq.server.metrics.StorageSession] (RHQScheduler_Worker-1) Changing request throughput from 90000.0 request/sec to 90000.0 requests/se
c
23:00:36,562 WARN  [org.rhq.server.metrics.aggregation.PastDataAggregator] (RHQScheduler_Worker-1) There was an error querying the cache index: org.rhq.server
.metrics.aggregation.CacheIndexQueryException: Failed to load cache index entries prior to current time slice 2014-07-17T22:00:00.000Z
        at org.rhq.server.metrics.aggregation.IndexEntriesLoader.loadPastIndexEntries(IndexEntriesLoader.java:72) [rhq-server-metrics-4.12.0.jar:4.12.0]
        at org.rhq.server.metrics.aggregation.PastDataAggregator.getIndexEntries(PastDataAggregator.java:74) [rhq-server-metrics-4.12.0.jar:4.12.0]
        at org.rhq.server.metrics.aggregation.BaseAggregator.execute(BaseAggregator.java:167) [rhq-server-metrics-4.12.0.jar:4.12.0]
        at org.rhq.server.metrics.aggregation.AggregationManager.run(AggregationManager.java:101) [rhq-server-metrics-4.12.0.jar:4.12.0]
        at org.rhq.server.metrics.MetricsServer.calculateAggregates(MetricsServer.java:619) [rhq-server-metrics-4.12.0.jar:4.12.0]
        at org.rhq.enterprise.server.scheduler.jobs.DataPurgeJob.compressMeasurementData(DataPurgeJob.java:114) [rhq-server.jar:4.12.0]
        at org.rhq.enterprise.server.scheduler.jobs.DataPurgeJob.executeJobCode(DataPurgeJob.java:92) [rhq-server.jar:4.12.0]
        at org.rhq.enterprise.server.scheduler.jobs.AbstractStatefulJob.execute(AbstractStatefulJob.java:48) [rhq-server.jar:4.12.0]
        at org.quartz.core.JobRunShell.run(JobRunShell.java:202) [quartz-1.6.5.jar:1.6.5]
        at org.quartz.simpl.SimpleThreadPool$WorkerThread.run(SimpleThreadPool.java:525) [quartz-1.6.5.jar:1.6.5]
Caused by: com.datastax.driver.core.exceptions.NoHostAvailableException: All host(s) tried for query failed (tried: /17.176.208.117 (Timeout during read), /17
.176.208.118 (Timeout during read), /17.176.208.119 (Timeout during read))
        at com.datastax.driver.core.exceptions.NoHostAvailableException.copy(NoHostAvailableException.java:64) [cassandra-driver-core-1.0.5.jar:]
        at com.datastax.driver.core.ResultSetFuture.extractCauseFromExecutionException(ResultSetFuture.java:269) [cassandra-driver-core-1.0.5.jar:]
        at com.datastax.driver.core.ResultSetFuture.getUninterruptibly(ResultSetFuture.java:183) [cassandra-driver-core-1.0.5.jar:]
        at org.rhq.server.metrics.StorageResultSetFuture.get(StorageResultSetFuture.java:57) [rhq-server-metrics-4.12.0.jar:4.12.0]
        at org.rhq.server.metrics.aggregation.IndexEntriesLoader.addResultSet(IndexEntriesLoader.java:116) [rhq-server-metrics-4.12.0.jar:4.12.0]
        at org.rhq.server.metrics.aggregation.IndexEntriesLoader.loadPastIndexEntries(IndexEntriesLoader.java:68) [rhq-server-metrics-4.12.0.jar:4.12.0]
        ... 9 more
Caused by: com.datastax.driver.core.exceptions.NoHostAvailableException: All host(s) tried for query failed (tried: /17.176.208.117 (Timeout during read), /17
.176.208.118 (Timeout during read), /17.176.208.119 (Timeout during read))
        at com.datastax.driver.core.RequestHandler.sendRequest(RequestHandler.java:106) [cassandra-driver-core-1.0.5.jar:]
        at com.datastax.driver.core.RequestHandler$1.run(RequestHandler.java:177) [cassandra-driver-core-1.0.5.jar:]
        at java.util.concurrent.ThreadPoolExecutor.runWorker(ThreadPoolExecutor.java:1145) [rt.jar:1.7.0_40]
        at java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:615) [rt.jar:1.7.0_40]
        at java.lang.Thread.run(Thread.java:724) [rt.jar:1.7.0_40]