Bug 1288469 - The data aggregation job in JBoss ON stopped due to unreachable storage node
Status: NEW
Product: RHQ Project
Classification: Other
Component: Core Server
Version: 4.12
Hardware: Unspecified
OS: Unspecified
Priority: unspecified
Severity: unspecified
Assigned To: RHQ Project Maintainer
QA Contact: Mike Foley
Depends On:
Blocks: 1288455
Reported: 2015-12-04 05:48 EST by bkramer
Modified: 2015-12-04 05:48 EST

Doc Type: Bug Fix
Type: Bug


Attachments: None
Description bkramer 2015-12-04 05:48:48 EST
Description of problem:

In this environment, graphs showed data only for the most recent 5-6 days. As soon as the time interval was set to a longer period (for example, one month or three months), the graphs contained no data except, again, for the last few days. The server.log files showed that DataCalcJob was not running and that measurement data compression was not being executed.

Looking at the earlier server.log files, we noticed that at some point the RHQ Server started to execute measurement data compression, but the storage node then became unreachable:

**********************************************************
10:00:00,010 INFO  [org.rhq.enterprise.server.scheduler.jobs.DataCalcJob] (RHQScheduler_Worker-1) Data Calc Job STARTING
10:00:00,010 INFO  [org.rhq.enterprise.server.scheduler.jobs.DataCalcJob] (RHQScheduler_Worker-1) Measurement data compression starting at Thu Jul 02 10:00:00 CEST 2015
10:00:00,010 INFO  [org.rhq.server.metrics.aggregation.AggregationManager] (RHQScheduler_Worker-1) Starting metrics data aggregation
10:00:00,011 INFO  [org.rhq.server.metrics.aggregation.DataAggregator] (RHQScheduler_Worker-1) Starting raw data aggregation
10:00:15,828 INFO  [org.rhq.enterprise.server.util.concurrent.InventoryReportSerializer] (example.server.com/127.0.0.1:7080-9) tid=294; agent=agent2: releasing write lock after being locked for millis=10513
10:00:50,395 WARN  [org.rhq.server.metrics.StorageSession] (Cassandra Java Driver worker-7) Encountered NoHostAvailableException due to following error(s): {example.server.com/127.0.0.1=Timeout during read}
10:00:50,676 WARN  [org.rhq.server.metrics.StorageSession] (Cassandra Java Driver worker-7) Reset warmup period to 12 minutes after a timeout
10:00:50,744 WARN  [org.rhq.server.metrics.StorageSession] (Cassandra Java Driver worker-8) Encountered NoHostAvailableException due to following error(s): {example.server.com/127.0.0.1=Timeout during read}
10:00:50,468 INFO  [org.rhq.enterprise.server.util.concurrent.AvailabilityReportSerializer] (example.server.com/127.0.0.1:7080-28) tid=313; agent=lxaplint3: releasing write lock after being locked for millis=10746
10:00:50,676 WARN  [org.rhq.server.metrics.aggregation.DataAggregator] (AggregationTasks-1) There was an error aggregating data: com.datastax.driver.core.exceptions.NoHostAvailableException: All host(s) tried for query failed (tried: example.server.com/127.0.0.1 (Timeout during read))
...
**********************************************************

After that, both the storage node and the RHQ server recovered (without a server restart), but DataCalcJob never ran again and the "Measurement data compression starting at..." messages were no longer logged.
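For anyone trying to confirm the same symptom, the check described above (looking for the DataCalcJob start messages in server.log) can be sketched as a small Java program. This is only an illustration: the class and method names are hypothetical, the default path "server.log" is an assumption, and the only thing taken from this report is the "Data Calc Job STARTING" message text.

```java
import java.io.IOException;
import java.nio.file.Files;
import java.nio.file.Path;
import java.nio.file.Paths;
import java.util.List;
import java.util.Optional;

/** Hypothetical helper for inspecting a server.log file; not part of RHQ. */
public class DataCalcJobLogCheck {

    // Message text as it appears in the log excerpt in this report.
    static final String START_MARKER = "Data Calc Job STARTING";

    /** Counts how many times the job reported that it was starting. */
    static long countStarts(List<String> lines) {
        return lines.stream().filter(l -> l.contains(START_MARKER)).count();
    }

    /** Returns the last log line in which the job reported that it was starting, if any. */
    static Optional<String> lastStart(List<String> lines) {
        return lines.stream().filter(l -> l.contains(START_MARKER)).reduce((a, b) -> b);
    }

    public static void main(String[] args) throws IOException {
        // Assumption: the log path is passed as the first argument.
        Path log = Paths.get(args.length > 0 ? args[0] : "server.log");
        List<String> lines = Files.readAllLines(log);

        System.out.println("DataCalcJob starts found: " + countStarts(lines));
        System.out.println("Last start line: " + lastStart(lines).orElse("(none)"));
    }
}
```

If the last start line in the most recent log files is far older than the end of the log, that matches the behavior described here.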

How reproducible:
Sometimes.

Steps to Reproduce:
Not sure. I made a few attempts to reproduce the error but without success.

Actual results:
DataCalcJob stopped executing and the measurement data compression job was no longer run. As a result, graphs showed data only for the last few days, and any attempt to view data for more than 6 days showed an empty graph.

Expected results:
DataCalcJob may fail to execute because of a timeout or load, but once the server and storage node have recovered and are running properly again, DataCalcJob should recover too.
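To make the expected behavior concrete, here is a minimal Java sketch of a recurring job that records a failed run and still runs on the next schedule. It is not RHQ code and does not claim to show the root cause, which this report does not identify. It uses the JDK's ScheduledExecutorService purely as a stand-in scheduler; the class name, the guarded() helper and the simulated timeout are all hypothetical.

```java
import java.util.concurrent.Executors;
import java.util.concurrent.ScheduledExecutorService;
import java.util.concurrent.TimeUnit;
import java.util.concurrent.atomic.AtomicInteger;

/** Hypothetical illustration only; not RHQ code. */
public class ResilientJobSketch {

    /**
     * Wraps a task so that a failure in one run is counted and logged
     * but never propagates to the scheduler.
     */
    static Runnable guarded(Runnable task, AtomicInteger failures) {
        return () -> {
            try {
                task.run();
            } catch (RuntimeException e) {
                failures.incrementAndGet();
                System.err.println("Run failed, will retry on next schedule: " + e.getMessage());
            }
        };
    }

    public static void main(String[] args) throws InterruptedException {
        AtomicInteger runs = new AtomicInteger();
        AtomicInteger failures = new AtomicInteger();

        // Stand-in for the aggregation work: the second run simulates a storage timeout.
        Runnable aggregation = () -> {
            if (runs.incrementAndGet() == 2) {
                throw new IllegalStateException("simulated storage timeout");
            }
        };

        ScheduledExecutorService scheduler = Executors.newSingleThreadScheduledExecutor();
        // Without guarded(), an exception thrown by a run would suppress all later runs.
        scheduler.scheduleAtFixedRate(guarded(aggregation, failures), 0, 50, TimeUnit.MILLISECONDS);

        Thread.sleep(400);
        scheduler.shutdownNow();
        System.out.println("runs=" + runs.get() + ", failures=" + failures.get());
    }
}
```

With the wrapper in place, the simulated timeout on the second run is counted as one failure and later runs continue, which is the recovery behavior this report asks for.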

Additional info:
