Bug 1127921

Summary: Metrics aggregation does not retry after timeout
Product: [Other] RHQ Project
Reporter: Elias Ross <genman>
Component: Storage Node
Assignee: Nobody <nobody>
Status: NEW
Severity: unspecified
Priority: unspecified    
Version: 4.12   
Target Milestone: ---   
Target Release: ---   
Hardware: Unspecified   
OS: Unspecified   
Doc Type: Bug Fix
Type: Bug
Bug Blocks: 1133605    

Description Elias Ross 2014-08-07 20:32:54 UTC
Description of problem:

If the Cassandra query for past data times out, the aggregation run finishes prematurely, although it does appear to complete at least 500 or so schedules.

20:01:00,152 WARN  [org.rhq.server.metrics.aggregation.PastDataAggregator] (RHQScheduler_Worker-5) There was an error querying the cache index: org.rhq.server.metrics.aggregation.CacheIndexQueryException: Failed to load cache index entries prior to current time slice 2014-08-07T19:00:00.000Z
        at org.rhq.server.metrics.aggregation.IndexEntriesLoader.loadPastIndexEntries(IndexEntriesLoader.java:72) [rhq-server-metrics-4.12.0.jar:4.12.0]
        at org.rhq.server.metrics.aggregation.PastDataAggregator.getIndexEntries(PastDataAggregator.java:75) [rhq-server-metrics-4.12.0.jar:4.12.0]
        at org.rhq.server.metrics.aggregation.BaseAggregator.execute(BaseAggregator.java:168) [rhq-server-metrics-4.12.0.jar:4.12.0]
        at org.rhq.server.metrics.aggregation.AggregationManager.run(AggregationManager.java:107) [rhq-server-metrics-4.12.0.jar:4.12.0]
        at org.rhq.server.metrics.MetricsServer.calculateAggregates(MetricsServer.java:641) [rhq-server-metrics-4.12.0.jar:4.12.0]
Caused by: com.datastax.driver.core.exceptions.ReadTimeoutException: Cassandra timeout during read query at consistency ONE (1 responses were required but only 0 replica responded)
        at com.datastax.driver.core.exceptions.ReadTimeoutException.copy(ReadTimeoutException.java:69) [cassandra-driver-core-1.0.5.jar:]
        at com.datastax.driver.core.ResultSetFuture.extractCauseFromExecutionException(ResultSetFuture.java:269) [cassandra-driver-core-1.0.5.jar:]
        at com.datastax.driver.core.ResultSetFuture.getUninterruptibly(ResultSetFuture.java:183) [cassandra-driver-core-1.0.5.jar:]
        at org.rhq.server.metrics.StorageResultSetFuture.get(StorageResultSetFuture.java:57) [rhq-server-metrics-4.12.0.jar:4.12.0]
        at org.rhq.server.metrics.aggregation.IndexEntriesLoader.addResultSet(IndexEntriesLoader.java:116) [rhq-server-metrics-4.12.0.jar:4.12.0]
        at org.rhq.server.metrics.aggregation.IndexEntriesLoader.loadPastIndexEntries(IndexEntriesLoader.java:68) [rhq-server-metrics-4.12.0.jar:4.12.0]
        ... 9 more
Caused by: com.datastax.driver.core.exceptions.ReadTimeoutException: Cassandra timeout during read query at consistency ONE (1 responses were required but only 0 replica responded)
...
20:01:06,571 INFO  [org.rhq.server.metrics.aggregation.AggregationManager] (RHQScheduler_Worker-5) Finished aggregation of {"raw schedules": 500, "1 hour schedules": 0, "6 hour schedules": 0} in 66542 ms
20:01:06,571 INFO  [org.rhq.server.metrics.MetricsServer] (RHQScheduler_Worker-5) Finished metrics aggregation in 66543 ms

Expected behavior: At a minimum, the rest of the aggregation should probably continue after the failed query. It might also make sense to retry the query a few times, although the underlying problem may be that the result set is simply too large.
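
To illustrate the retry idea, here is a rough sketch of a bounded retry wrapper. It is not taken from the RHQ code base: the class RetrySketch, the method withRetry, and the exception QueryTimeoutException are all hypothetical names, with QueryTimeoutException standing in for the driver's ReadTimeoutException so the example compiles on its own. The attempt count and backoff are arbitrary.

```java
import java.util.function.Supplier;

public class RetrySketch {

    /** Hypothetical stand-in for the driver's ReadTimeoutException. */
    static class QueryTimeoutException extends RuntimeException {
        QueryTimeoutException(String message) {
            super(message);
        }
    }

    /**
     * Runs the query up to maxAttempts times. Only timeouts are retried;
     * any other exception propagates immediately. The wait between
     * attempts grows linearly (backoffMillis * attempt number).
     */
    static <T> T withRetry(Supplier<T> query, int maxAttempts, long backoffMillis) {
        if (maxAttempts < 1) {
            throw new IllegalArgumentException("maxAttempts must be at least 1");
        }
        QueryTimeoutException last = null;
        for (int attempt = 1; attempt <= maxAttempts; attempt++) {
            try {
                return query.get();
            } catch (QueryTimeoutException e) {
                last = e;
                if (attempt == maxAttempts) {
                    break; // out of attempts, rethrow below
                }
                try {
                    Thread.sleep(backoffMillis * attempt);
                } catch (InterruptedException ie) {
                    // Preserve the interrupt flag and stop retrying.
                    Thread.currentThread().interrupt();
                    break;
                }
            }
        }
        throw last;
    }
}
```

A caller could wrap the index query in withRetry and, if the final attempt still times out, catch the exception and move on to the remaining aggregation steps rather than ending the run. Retrying would not help if the result set is too large to read within the timeout; that case would presumably need a smaller or paged query.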

Version-Release number of selected component (if applicable): 4.12