Bug 1288455 - The data aggregation job in JBoss ON stopped due to unreachable storage node
Summary: The data aggregation job in JBoss ON stopped due to unreachable storage node
Keywords:
Status: CLOSED ERRATA
Alias: None
Product: JBoss Operations Network
Classification: JBoss
Component: Core Server
Version: JON 3.3.0
Hardware: Unspecified
OS: Unspecified
urgent
high
Target Milestone: ER01
: JON 3.3.6
Assignee: Josejulio Martínez
QA Contact: Filip Brychta
URL:
Whiteboard:
Depends On: 1288469
Blocks:
TreeView+ depends on / blocked
 
Reported: 2015-12-04 09:54 UTC by bkramer
Modified: 2020-05-14 15:04 UTC (History)
7 users (show)

Fixed In Version:
Clone Of:
Environment:
Last Closed: 2016-07-27 15:31:38 UTC
Type: Bug
Embargoed:


Attachments (Terms of Use)
Last server.log before hanging up while ON_QA (14.21 MB, text/plain)
2016-07-19 18:31 UTC, vsorokin
no flags Details


Links
System ID Private Priority Status Summary Last Updated
Red Hat Knowledge Base (Solution) 2068303 0 None None None Never
Red Hat Product Errata RHSA-2016:1519 0 normal SHIPPED_LIVE Critical: Red Hat JBoss Operations Network 3.3.6 update 2016-08-26 00:44:36 UTC

Description bkramer 2015-12-04 09:54:54 UTC
Description of problem:

In this environment, graphs would show data only for first 5-6 days. As soon as the time interval was set to longer time period - i.e. one month, three months... graphs would not contain any data (except again for last few days). The server.log files showed that DataCalcJob was not running and measurement data compression was not executed. 

Looking at the earlier server.log files, we noticed that at some point, JBoss ON Server started to execute measurement data compression but then the storage node became unreachable: 

**********************************************************
0:00:00,010 INFO  [org.rhq.enterprise.server.scheduler.jobs.DataCalcJob] (RHQScheduler_Worker-1) Data Calc Job STARTING
10:00:00,010 INFO  [org.rhq.enterprise.server.scheduler.jobs.DataCalcJob] (RHQScheduler_Worker-1) Measurement data compression starting at Thu Jul 02 10:00:00 CEST 2015
10:00:00,010 INFO  [org.rhq.server.metrics.aggregation.AggregationManager] (RHQScheduler_Worker-1) Starting metrics data aggregation
10:00:00,011 INFO  [org.rhq.server.metrics.aggregation.DataAggregator] (RHQScheduler_Worker-1) Starting raw data aggregation
10:00:15,828 INFO  [org.rhq.enterprise.server.util.concurrent.InventoryReportSerializer] (example.server.com/127.0.0.1:7080-9) tid=294; agent=agent2: releasing write lock after being locked for millis=10513
10:00:50,395 WARN  [org.rhq.server.metrics.StorageSession] (Cassandra Java Driver worker-7) Encountered NoHostAvailableException due to following error(s): {example.server.com/127.0.0.1=Timeout during read}
10:00:50,676 WARN  [org.rhq.server.metrics.StorageSession] (Cassandra Java Driver worker-7) Reset warmup period to 12 minutes after a timeout
10:00:50,744 WARN  [org.rhq.server.metrics.StorageSession] (Cassandra Java Driver worker-8) Encountered NoHostAvailableException due to following error(s): {example.server.com/127.0.0.1=Timeout during read}
10:00:50,468 INFO  [org.rhq.enterprise.server.util.concurrent.AvailabilityReportSerializer] (example.server.com/127.0.0.1:7080-28) tid=313; agent=lxaplint3: releasing write lock after being locked for millis=10746
10:00:50,676 WARN  [org.rhq.server.metrics.aggregation.DataAggregator] (AggregationTasks-1) There was an error aggregating data: com.datastax.driver.core.exceptions.NoHostAvailableException: All host(s) tried for query failed (tried: example.server.com/127.0.0.1 (Timeout during read))
...
**********************************************************

After that, the storage node recovered as well as the JBoss ON server (without server restart) but DataCalcJob just died and "Measurement data compression starting at... " messages were not logged any more. 

How reproducible:
Sometimes.

Steps to Reproduce:
Not sure. I made a few attempts to reproduce the error but without success.

Actual results:
DataCalcJob stopped executing and measurement data compression job was not done any more. So, graphs would only show data for last few days and any attempt to see data for more then 6 days would show empty graph.

Expected results:
DataCalcJob failed to execute due to a timeout or load but once the server and storage node recovered and started to run properly again, the DataCalcJob is recovered too.

Additional info:

Comment 1 Michael Burman 2016-01-26 12:59:49 UTC
This can happen because DataCalcJob is guaranteed to run as a single instance only, so if the previous DataCalcJob somehow got stuck then no new instance of it would be run. But without good reproduce it's difficult to say if unreachable storage node had anything to do with the error or if it was just bad luck. A thread dump showing where the thread was stuck would have been helpful.

Comment 2 Larry O'Leary 2016-02-04 20:11:34 UTC
Perhaps to mitigate such an issue a timeout would need to be imposed?

In this case, the data calc job did not get executed for several weeks/months. If a previous job was still executing it would have to have been in a bad state.

The description of this bug may be a bit misleading or incomplete. To summarize the issue:

The storage node became unavailable during a data calc job. The data calc job never ran again after that.

Comment 3 Josejulio Martínez 2016-05-17 22:22:33 UTC
commit e81051e97a160aaf1513f4d53aaa3ae083586362
Merge: 6c1b693 d3f46c1
Author: Michael Burman <yak>
Date:   Tue May 17 23:49:47 2016 +0300

    Merge pull request #246 from josejulio/bugs/1288455

    Bug 1288455 - The data aggregation job in JBoss ON stopped due to unr…

commit d3f46c11f591e19d6b98ff972b7b562cef22f213
Author: Josejulio Martínez <jmartine>
Date:   Thu Apr 28 11:44:37 2016 -0500

    Bug 1288455 - Check if the FutureCallback is not scheduled to release the permit on exception.

commit 992e403d62848c5a8a3a8590709ad9e423c89bf5
Author: Josejulio Martínez <jmartine>
Date:   Tue Apr 26 11:38:46 2016 -0500

    Bug 1288455 - The data aggregation job in JBoss ON stopped due to unreachable storage node

     - Moved permits acquisition inside AggregationTask.
     - Release permits upon exception on AggregationTask.
     - Added a timeout to DataAggregation.

Comment 5 Simeon Pinder 2016-07-07 08:22:41 UTC
Moving to ON_QA as available to test with the following build:
https://brewweb.engineering.redhat.com/brew/buildinfo?buildID=502442

Note: 	jon-server-patch-3.3.0.GA.zip maps to JON 3.3.6(jon-server-3.3.0.GA-update-06.zip)

Comment 6 vsorokin 2016-07-13 14:48:20 UTC
From email (Larry O'Leary)

Yes. This one is a hard one to reproduce and test.

The storage node implementation used in JBoss ON really prevents us from testing time based metric data. This means that you will not be able to speed up the test.

In theory, altering the system date/time would work but you would need essentially accelerate the clock. This is because you must have metric data be collected at the expected "sped up" interval while slowing it down to allow the normal maintenance jobs to run. 

In addition to the complexity and lack of adequate testing infrastructure for all things related to storage nodes, the other issue with this bug is that we do not have a reproducer. To my knowledge, we do not understand what could have cause the job to be incorrectly terminated. The fix is to ensure that the job, if terminated, gets rescheduled to prevent the issue/symptom raised by the bug.

It is therefore my opinion that we only confirm that this fix does not break anything -- which it shouldn't but development can perhaps take another look and comment -- and document this is untestable but verified.

Thoughts?

Comment 7 Josejulio Martínez 2016-07-13 23:40:48 UTC
That's right, we couldn't find why it happened. I only added means for it to timeout in case it got stuck. 

There isn't any know reproducer to this issue. Testing that this is still working as expected would be the best.

Comment 8 vsorokin 2016-07-19 18:31:00 UTC
Created attachment 1181762 [details]
Last server.log before hanging up while ON_QA

Comment 11 vsorokin 2016-07-20 17:47:46 UTC
To: Michael Burman

Last server's log was of 14.Jul.2016
I did nothing manually on this VM on this day.
ulimit was default after installation.

Relation to this BZ: difficulties to reproduce mentioned steps (indeed, to use GUI) due to that issues.

Comment 14 errata-xmlrpc 2016-07-27 15:31:38 UTC
Since the problem described in this bug report should be
resolved in a recent advisory, it has been closed with a
resolution of ERRATA.

For information on the advisory, and where to find the updated
files, follow the link below.

If the solution does not work for you, open a new bug report.

https://rhn.redhat.com/errata/RHSA-2016-1519.html


Note You need to log in before you can comment on or make changes to this bug.