Bug 1288469

Summary: The data aggregation job in JBoss ON stopped due to unreachable storage node
Product: [Other] RHQ Project
Reporter: bkramer <bkramer>
Component: Core Server
Assignee: Nobody <nobody>
Status: NEW
Severity: unspecified
Priority: unspecified
Version: 4.12
CC: hrupp
Hardware: Unspecified
OS: Unspecified
Doc Type: Bug Fix
Type: Bug
Bug Blocks: 1288455

Description bkramer 2015-12-04 10:48:48 UTC
Description of problem:

In this environment, graphs showed data only for the most recent 5-6 days. When the time interval was set to a longer period (for example, one month or three months), the graphs contained no data except for those last few days. The server.log files showed that DataCalcJob was not running and that measurement data compression was not being executed.

Looking at earlier server.log files, we noticed that at some point the RHQ Server started to execute measurement data compression, but then the storage node became unreachable:

**********************************************************
10:00:00,010 INFO  [org.rhq.enterprise.server.scheduler.jobs.DataCalcJob] (RHQScheduler_Worker-1) Data Calc Job STARTING
10:00:00,010 INFO  [org.rhq.enterprise.server.scheduler.jobs.DataCalcJob] (RHQScheduler_Worker-1) Measurement data compression starting at Thu Jul 02 10:00:00 CEST 2015
10:00:00,010 INFO  [org.rhq.server.metrics.aggregation.AggregationManager] (RHQScheduler_Worker-1) Starting metrics data aggregation
10:00:00,011 INFO  [org.rhq.server.metrics.aggregation.DataAggregator] (RHQScheduler_Worker-1) Starting raw data aggregation
10:00:15,828 INFO  [org.rhq.enterprise.server.util.concurrent.InventoryReportSerializer] (example.server.com/127.0.0.1:7080-9) tid=294; agent=agent2: releasing write lock after being locked for millis=10513
10:00:50,395 WARN  [org.rhq.server.metrics.StorageSession] (Cassandra Java Driver worker-7) Encountered NoHostAvailableException due to following error(s): {example.server.com/127.0.0.1=Timeout during read}
10:00:50,676 WARN  [org.rhq.server.metrics.StorageSession] (Cassandra Java Driver worker-7) Reset warmup period to 12 minutes after a timeout
10:00:50,744 WARN  [org.rhq.server.metrics.StorageSession] (Cassandra Java Driver worker-8) Encountered NoHostAvailableException due to following error(s): {example.server.com/127.0.0.1=Timeout during read}
10:00:50,468 INFO  [org.rhq.enterprise.server.util.concurrent.AvailabilityReportSerializer] (example.server.com/127.0.0.1:7080-28) tid=313; agent=lxaplint3: releasing write lock after being locked for millis=10746
10:00:50,676 WARN  [org.rhq.server.metrics.aggregation.DataAggregator] (AggregationTasks-1) There was an error aggregating data: com.datastax.driver.core.exceptions.NoHostAvailableException: All host(s) tried for query failed (tried: example.server.com/127.0.0.1 (Timeout during read))
...
**********************************************************

After that, both the storage node and the RHQ server recovered (without a server restart), but DataCalcJob never ran again and the "Measurement data compression starting at..." messages were no longer logged.
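
One way to check when compression last started is to scan server.log for the marker message from the excerpt above. The following is a minimal sketch, assuming the log line format shown in that excerpt; the class name `LastCompressionRun` and its methods are hypothetical helpers, not part of RHQ.

```java
import java.io.IOException;
import java.nio.file.Files;
import java.nio.file.Path;
import java.util.List;
import java.util.Optional;

class LastCompressionRun {
    // Marker text logged by DataCalcJob, copied from the log excerpt.
    static final String MARKER = "Measurement data compression starting at";

    // Returns the last line that contains the marker, or empty if none does.
    static Optional<String> lastRun(List<String> logLines) {
        String last = null;
        for (String line : logLines) {
            if (line.contains(MARKER)) {
                last = line;
            }
        }
        return Optional.ofNullable(last);
    }

    // Convenience overload that reads the whole log file (assumed UTF-8) first.
    static Optional<String> lastRun(Path serverLog) throws IOException {
        return lastRun(Files.readAllLines(serverLog));
    }
}
```

If the returned line carries a date that is several days old while the server has been running since, the job has most likely stopped being scheduled.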

How reproducible:
Sometimes.

Steps to Reproduce:
Not sure. I made a few attempts to reproduce the error but without success.

Actual results:
DataCalcJob stopped executing and measurement data compression was no longer performed. As a result, graphs show data only for the last few days, and any attempt to view data for more than 6 days shows an empty graph.

Expected results:
If DataCalcJob fails to execute due to a timeout or load, it should recover and resume running once the server and storage node have recovered and are running properly again.

Additional info:
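
The expected behavior can be illustrated with a generic sketch that uses only `java.util.concurrent` and no RHQ or Quartz code. With `ScheduledExecutorService.scheduleAtFixedRate`, an uncaught exception in one run suppresses all later runs, which resembles the symptom described here; whether that is the actual cause in RHQ has not been established. Wrapping the job body so a failed run is logged rather than propagated keeps later runs on schedule. The names `ResilientJob`, `guard` and `runDemo` are hypothetical.

```java
import java.util.concurrent.Executors;
import java.util.concurrent.ScheduledExecutorService;
import java.util.concurrent.TimeUnit;
import java.util.concurrent.atomic.AtomicInteger;

class ResilientJob {
    // Wraps a job so that a RuntimeException from one run is logged
    // instead of propagating to the scheduler.
    static Runnable guard(Runnable job) {
        return () -> {
            try {
                job.run();
            } catch (RuntimeException e) {
                System.err.println("Job run failed, next run stays scheduled: " + e);
            }
        };
    }

    // Schedules a job whose first run fails (simulated outage) and returns
    // how many later runs completed within totalMillis.
    static int runDemo(int periodMillis, int totalMillis) {
        AtomicInteger attempts = new AtomicInteger();
        AtomicInteger completed = new AtomicInteger();
        Runnable flaky = () -> {
            if (attempts.incrementAndGet() == 1) {
                throw new IllegalStateException("storage node unreachable (simulated)");
            }
            completed.incrementAndGet();
        };
        ScheduledExecutorService scheduler = Executors.newSingleThreadScheduledExecutor();
        scheduler.scheduleAtFixedRate(guard(flaky), 0, periodMillis, TimeUnit.MILLISECONDS);
        try {
            Thread.sleep(totalMillis);
        } catch (InterruptedException e) {
            Thread.currentThread().interrupt();
        }
        scheduler.shutdownNow();
        return completed.get();
    }
}
```

Calling `ResilientJob.runDemo(20, 500)` should return a positive count, because the runs after the simulated failure still execute. Passing `flaky` to the scheduler without `guard` would leave the count at zero.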