Description of problem:

In this environment, graphs would show data only for the first 5-6 days. As soon as the time interval was set to a longer period - i.e. one month, three months... - graphs would not contain any data (except, again, for the last few days).

The server.log files showed that DataCalcJob was not running and measurement data compression was not executed. Looking at the earlier server.log files, we noticed that at some point the JBoss ON server started to execute measurement data compression but then the storage node became unreachable:

**********************************************************
10:00:00,010 INFO  [org.rhq.enterprise.server.scheduler.jobs.DataCalcJob] (RHQScheduler_Worker-1) Data Calc Job STARTING
10:00:00,010 INFO  [org.rhq.enterprise.server.scheduler.jobs.DataCalcJob] (RHQScheduler_Worker-1) Measurement data compression starting at Thu Jul 02 10:00:00 CEST 2015
10:00:00,010 INFO  [org.rhq.server.metrics.aggregation.AggregationManager] (RHQScheduler_Worker-1) Starting metrics data aggregation
10:00:00,011 INFO  [org.rhq.server.metrics.aggregation.DataAggregator] (RHQScheduler_Worker-1) Starting raw data aggregation
10:00:15,828 INFO  [org.rhq.enterprise.server.util.concurrent.InventoryReportSerializer] (example.server.com/127.0.0.1:7080-9) tid=294; agent=agent2: releasing write lock after being locked for millis=10513
10:00:50,395 WARN  [org.rhq.server.metrics.StorageSession] (Cassandra Java Driver worker-7) Encountered NoHostAvailableException due to following error(s): {example.server.com/127.0.0.1=Timeout during read}
10:00:50,676 WARN  [org.rhq.server.metrics.StorageSession] (Cassandra Java Driver worker-7) Reset warmup period to 12 minutes after a timeout
10:00:50,744 WARN  [org.rhq.server.metrics.StorageSession] (Cassandra Java Driver worker-8) Encountered NoHostAvailableException due to following error(s): {example.server.com/127.0.0.1=Timeout during read}
10:00:50,468 INFO  [org.rhq.enterprise.server.util.concurrent.AvailabilityReportSerializer] (example.server.com/127.0.0.1:7080-28) tid=313; agent=lxaplint3: releasing write lock after being locked for millis=10746
10:00:50,676 WARN  [org.rhq.server.metrics.aggregation.DataAggregator] (AggregationTasks-1) There was an error aggregating data: com.datastax.driver.core.exceptions.NoHostAvailableException: All host(s) tried for query failed (tried: example.server.com/127.0.0.1 (Timeout during read))
...
**********************************************************

After that, the storage node recovered, as did the JBoss ON server (without a server restart), but DataCalcJob simply died and "Measurement data compression starting at..." messages were never logged again.

How reproducible:
Sometimes.

Steps to Reproduce:
Not sure. I made a few attempts to reproduce the error, but without success.

Actual results:
DataCalcJob stopped executing and the measurement data compression job was no longer run. So graphs would only show data for the last few days, and any attempt to see data for more than 6 days would show an empty graph.

Expected results:
Even if DataCalcJob fails to execute due to a timeout or load, once the server and storage node recover and start running properly again, DataCalcJob recovers too.

Additional info:
This can happen because DataCalcJob is guaranteed to run as a single instance only, so if the previous DataCalcJob somehow got stuck, no new instance of it would be started. But without a good reproducer it's difficult to say whether the unreachable storage node had anything to do with the error or whether it was just bad luck. A thread dump showing where the thread was stuck would have been helpful.
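To illustrate the single-instance constraint: the RHQScheduler_Worker thread names in the log suggest the job runs under Quartz, and Quartz can be told never to run two instances of a job concurrently. The following is a minimal sketch of that pattern, not the actual RHQ code; the class name SingleInstanceCalcJob and the method runMeasurementCompression are hypothetical stand-ins.

import org.quartz.DisallowConcurrentExecution;
import org.quartz.Job;
import org.quartz.JobExecutionContext;
import org.quartz.JobExecutionException;

// Quartz never starts a second instance of a job class annotated like this
// (Quartz 1.x expresses the same constraint via the StatefulJob interface).
// If one execution blocks forever -- e.g. waiting on a storage-node call
// that never returns -- every subsequent trigger is skipped, which matches
// the observed symptom: no more "Measurement data compression starting"
// lines in server.log after the incident.
@DisallowConcurrentExecution
public class SingleInstanceCalcJob implements Job {

    @Override
    public void execute(JobExecutionContext context) throws JobExecutionException {
        // Hypothetical stand-in for the real aggregation work; if this call
        // hangs, this instance never finishes and no new one is started.
        runMeasurementCompression();
    }

    private void runMeasurementCompression() {
        // ... aggregation work against the storage node ...
    }
}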
Perhaps, to mitigate such an issue, a timeout would need to be imposed? In this case, the data calc job was not executed for several weeks/months; if a previous job was still executing, it would have to have been in a bad state. The description of this bug may be a bit misleading or incomplete. To summarize the issue: the storage node became unavailable during a data calc job, and the data calc job never ran again after that.
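One way a timeout could be imposed, sketched below with plain java.util.concurrent rather than RHQ's own classes (BoundedCalcJobRunner, doAggregation, and the 30-minute limit are all illustrative assumptions): run the work on a separate thread and bound the wait, cancelling the run if it exceeds the limit so the next scheduled instance is free to start.

import java.util.concurrent.ExecutorService;
import java.util.concurrent.Executors;
import java.util.concurrent.Future;
import java.util.concurrent.TimeUnit;
import java.util.concurrent.TimeoutException;

public class BoundedCalcJobRunner {

    public static void main(String[] args) throws Exception {
        ExecutorService executor = Executors.newSingleThreadExecutor();
        Future<?> work = executor.submit(() -> doAggregation());
        try {
            // Bound the whole run; the limit should sit comfortably above
            // the duration of a normal compression pass.
            work.get(30, TimeUnit.MINUTES);
        } catch (TimeoutException e) {
            // Cancel the stuck run so a fresh instance can be scheduled
            // instead of the job silently never running again.
            work.cancel(true);
        } finally {
            executor.shutdown();
        }
    }

    private static void doAggregation() {
        // Hypothetical stand-in for the aggregation work that got stuck.
    }
}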
commit e81051e97a160aaf1513f4d53aaa3ae083586362
Merge: 6c1b693 d3f46c1
Author: Michael Burman <yak>
Date:   Tue May 17 23:49:47 2016 +0300

    Merge pull request #246 from josejulio/bugs/1288455

    Bug 1288455 - The data aggregation job in JBoss ON stopped due to unr…

commit d3f46c11f591e19d6b98ff972b7b562cef22f213
Author: Josejulio Martínez <jmartine>
Date:   Thu Apr 28 11:44:37 2016 -0500

    Bug 1288455 - Check if the FutureCallback is not scheduled to release the permit on exception.

commit 992e403d62848c5a8a3a8590709ad9e423c89bf5
Author: Josejulio Martínez <jmartine>
Date:   Tue Apr 26 11:38:46 2016 -0500

    Bug 1288455 - The data aggregation job in JBoss ON stopped due to unreachable storage node

    - Moved permits acquisition inside AggregationTask.
    - Release permits upon exception on AggregationTask.
    - Added a timeout to DataAggregation.
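For readers unfamiliar with the pattern these commits describe: if aggregation throttles its concurrent storage-node queries with semaphore permits and a permit is released only in a FutureCallback, then a query that fails before the callback is ever registered leaks the permit, and the aggregator eventually stalls. The sketch below shows the shape of the fix under those assumptions; PermitSafeTask and submitQuery are hypothetical names, not the actual AggregationTask code.

import java.util.concurrent.Semaphore;

import com.google.common.util.concurrent.FutureCallback;
import com.google.common.util.concurrent.Futures;
import com.google.common.util.concurrent.ListenableFuture;
import com.google.common.util.concurrent.MoreExecutors;

public class PermitSafeTask implements Runnable {

    private final Semaphore permits;

    public PermitSafeTask(Semaphore permits) {
        this.permits = permits;
    }

    @Override
    public void run() {
        // Acquire inside the task (one of the changes the commits list),
        // so a failure before this point cannot leak a permit.
        permits.acquireUninterruptibly();
        boolean callbackScheduled = false;
        try {
            ListenableFuture<Object> future = submitQuery();
            Futures.addCallback(future, new FutureCallback<Object>() {
                public void onSuccess(Object result) {
                    permits.release();
                }

                public void onFailure(Throwable t) {
                    // A failed query (e.g. NoHostAvailableException) must
                    // still hand its permit back, or aggregation stalls.
                    permits.release();
                }
            }, MoreExecutors.directExecutor());
            callbackScheduled = true;
        } finally {
            // If the callback never got registered (an exception was thrown
            // before or during addCallback), release the permit here instead.
            if (!callbackScheduled) {
                permits.release();
            }
        }
    }

    private ListenableFuture<Object> submitQuery() {
        // Hypothetical stand-in for an async query against the storage node.
        return Futures.immediateFuture(new Object());
    }
}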
Moving to ON_QA as available to test with the following build:
https://brewweb.engineering.redhat.com/brew/buildinfo?buildID=502442

Note: jon-server-patch-3.3.0.GA.zip maps to JON 3.3.6 (jon-server-3.3.0.GA-update-06.zip)
From email (Larry O'Leary):

Yes. This one is a hard one to reproduce and test. The storage node implementation used in JBoss ON really prevents us from testing time-based metric data. This means that you will not be able to speed up the test. In theory, altering the system date/time would work, but you would need to essentially accelerate the clock. This is because you must have metric data collected at the expected "sped up" interval while slowing the clock down enough to allow the normal maintenance jobs to run.

In addition to the complexity and the lack of adequate testing infrastructure for all things related to storage nodes, the other issue with this bug is that we do not have a reproducer. To my knowledge, we do not understand what could have caused the job to be incorrectly terminated. The fix is to ensure that the job, if terminated, gets rescheduled to prevent the issue/symptom raised by this bug.

It is therefore my opinion that we only confirm that this fix does not break anything -- which it shouldn't, but development can perhaps take another look and comment -- and document this as untestable but verified. Thoughts?
That's right, we couldn't find out why it happened. I only added a means for it to time out in case it gets stuck. There isn't any known reproducer for this issue. Testing that this is still working as expected would be the best approach.
Created attachment 1181762 [details]
Last server.log before hanging up while ON_QA
To: Michael Burman

The last server log was from 14 Jul 2016. I did nothing manually on this VM that day. ulimit was at its default after installation. Relation to this BZ: difficulties reproducing the mentioned steps (specifically, using the GUI) due to these issues.
Since the problem described in this bug report should be resolved in a recent advisory, it has been closed with a resolution of ERRATA. For information on the advisory, and where to find the updated files, follow the link below. If the solution does not work for you, open a new bug report. https://rhn.redhat.com/errata/RHSA-2016-1519.html