Bug 1288469 - The data aggregation job in JBoss ON stopped due to unreachable storage node
Summary: The data aggregation job in JBoss ON stopped due to unreachable storage node
Keywords:
Status: NEW
Alias: None
Product: RHQ Project
Classification: Other
Component: Core Server
Version: 4.12
Hardware: Unspecified
OS: Unspecified
Priority: unspecified
Severity: unspecified
Target Milestone: ---
Target Release: ---
Assignee: Nobody
QA Contact:
URL:
Whiteboard:
Depends On:
Blocks: 1288455
Reported: 2015-12-04 10:48 UTC by bkramer
Modified: 2022-03-31 04:28 UTC

Fixed In Version:
Doc Type: Bug Fix
Doc Text:
Clone Of:
Environment:
Last Closed:
Embargoed:



Description bkramer 2015-12-04 10:48:48 UTC
Description of problem:

In this environment, graphs showed data only for the most recent 5-6 days. As soon as the time interval was set to a longer period (for example one month or three months), the graphs contained no data except for the last few days. The server.log files showed that DataCalcJob was not running and that measurement data compression was not being executed.

Looking at the earlier server.log files, we noticed that at some point, RHQ Server started to execute measurement data compression but then the storage node became unreachable: 

**********************************************************
10:00:00,010 INFO  [org.rhq.enterprise.server.scheduler.jobs.DataCalcJob] (RHQScheduler_Worker-1) Data Calc Job STARTING
10:00:00,010 INFO  [org.rhq.enterprise.server.scheduler.jobs.DataCalcJob] (RHQScheduler_Worker-1) Measurement data compression starting at Thu Jul 02 10:00:00 CEST 2015
10:00:00,010 INFO  [org.rhq.server.metrics.aggregation.AggregationManager] (RHQScheduler_Worker-1) Starting metrics data aggregation
10:00:00,011 INFO  [org.rhq.server.metrics.aggregation.DataAggregator] (RHQScheduler_Worker-1) Starting raw data aggregation
10:00:15,828 INFO  [org.rhq.enterprise.server.util.concurrent.InventoryReportSerializer] (example.server.com/127.0.0.1:7080-9) tid=294; agent=agent2: releasing write lock after being locked for millis=10513
10:00:50,395 WARN  [org.rhq.server.metrics.StorageSession] (Cassandra Java Driver worker-7) Encountered NoHostAvailableException due to following error(s): {example.server.com/127.0.0.1=Timeout during read}
10:00:50,676 WARN  [org.rhq.server.metrics.StorageSession] (Cassandra Java Driver worker-7) Reset warmup period to 12 minutes after a timeout
10:00:50,744 WARN  [org.rhq.server.metrics.StorageSession] (Cassandra Java Driver worker-8) Encountered NoHostAvailableException due to following error(s): {example.server.com/127.0.0.1=Timeout during read}
10:00:50,468 INFO  [org.rhq.enterprise.server.util.concurrent.AvailabilityReportSerializer] (example.server.com/127.0.0.1:7080-28) tid=313; agent=lxaplint3: releasing write lock after being locked for millis=10746
10:00:50,676 WARN  [org.rhq.server.metrics.aggregation.DataAggregator] (AggregationTasks-1) There was an error aggregating data: com.datastax.driver.core.exceptions.NoHostAvailableException: All host(s) tried for query failed (tried: example.server.com/127.0.0.1 (Timeout during read))
...
**********************************************************

After that, both the storage node and the RHQ Server recovered (without a server restart), but DataCalcJob never ran again and the "Measurement data compression starting at..." messages were no longer logged.
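
To check whether an installation is affected, one option is to look for the most recent "Measurement data compression starting at" line in server.log. The following is only a minimal sketch: the class name CompressionLogCheck and its method are hypothetical and not part of RHQ, and the marker text is taken from the log excerpt above.

```java
import java.nio.file.Files;
import java.nio.file.Paths;
import java.util.List;
import java.util.Optional;

// Hypothetical helper, not part of RHQ: finds the last time the
// measurement data compression was reported as started in a log.
final class CompressionLogCheck {
    // Marker text as it appears in the server.log excerpt of this report.
    static final String MARKER = "Measurement data compression starting at ";

    // Returns the text following the marker on the last matching line,
    // or an empty Optional if the marker never appears.
    static Optional<String> lastCompressionStart(List<String> logLines) {
        String last = null;
        for (String line : logLines) {
            int idx = line.indexOf(MARKER);
            if (idx >= 0) {
                last = line.substring(idx + MARKER.length()).trim();
            }
        }
        return Optional.ofNullable(last);
    }

    // Usage: java CompressionLogCheck /path/to/server.log
    public static void main(String[] args) throws Exception {
        List<String> lines = Files.readAllLines(Paths.get(args[0]));
        System.out.println(lastCompressionStart(lines)
                .orElse("no compression start found in this file"));
    }
}
```

If the reported start time is many hours old while the server is otherwise logging normally, the job has most likely stopped as described here.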

How reproducible:
Sometimes.

Steps to Reproduce:
Not sure. I made a few attempts to reproduce the error but without success.

Actual results:
DataCalcJob stopped executing and measurement data compression was no longer performed. As a result, graphs showed data only for the last few days, and any attempt to view more than 6 days of data showed an empty graph.

Expected results:
DataCalcJob may fail to execute because of a timeout or load, but once the server and storage node have recovered and are running properly again, DataCalcJob should recover too.
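
The expected behavior can be illustrated with a small sketch in which each job run is isolated, so that an exception or a hang in one run cannot prevent the next one. This is an assumption-laden illustration only: ResilientJobRunner is a hypothetical class, it does not reflect how RHQ actually schedules DataCalcJob, and it is not a diagnosis of the root cause.

```java
import java.util.concurrent.Callable;
import java.util.concurrent.ExecutionException;
import java.util.concurrent.ExecutorService;
import java.util.concurrent.Executors;
import java.util.concurrent.Future;
import java.util.concurrent.TimeUnit;
import java.util.concurrent.TimeoutException;

// Hypothetical helper, not part of RHQ: runs one job execution in isolation.
final class ResilientJobRunner {
    // Daemon threads so an abandoned run cannot keep the JVM alive.
    private final ExecutorService worker = Executors.newCachedThreadPool(r -> {
        Thread t = new Thread(r, "resilient-job");
        t.setDaemon(true);
        return t;
    });
    private final long timeoutMillis;
    private int completed;
    private int failed;

    ResilientJobRunner(long timeoutMillis) {
        this.timeoutMillis = timeoutMillis;
    }

    // Runs the job once. Returns true on success, false if the job threw
    // or did not finish within the timeout. Never propagates the failure,
    // so the caller can simply invoke it again on the next trigger.
    boolean runOnce(Callable<Void> job) {
        Future<Void> result = worker.submit(job);
        try {
            result.get(timeoutMillis, TimeUnit.MILLISECONDS);
            completed++;
            return true;
        } catch (TimeoutException e) {
            result.cancel(true); // interrupt the hung run
            failed++;
            return false;
        } catch (ExecutionException e) {
            failed++;            // e.g. the storage node was unreachable
            return false;
        } catch (InterruptedException e) {
            result.cancel(true);
            Thread.currentThread().interrupt();
            failed++;
            return false;
        }
    }

    int completed() { return completed; }
    int failed() { return failed; }
}
```

A scheduler would call runOnce on every trigger; a failed or timed-out run is counted and the following trigger starts a fresh run.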

Additional info:

