Bug 1054165 - JON (3.2.0) -- lib plug-in AVAILABILITY CHECK for cache manager failed -- app deployed on EAP 6.2
Summary: JON (3.2.0) -- lib plug-in AVAILABILITY CHECK for cache manager failed -- app...
Keywords:
Status: CLOSED CURRENTRELEASE
Alias: None
Product: JBoss Data Grid 6
Classification: JBoss
Component: JON Plugin
Version: 6.2.0
Hardware: Unspecified
OS: Unspecified
unspecified
urgent
Target Milestone: ER6
: 6.3.0
Assignee: William Burns
QA Contact: Martin Gencur
URL:
Whiteboard:
Depends On:
Blocks: 1104639 1105680
TreeView+ depends on / blocked
 
Reported: 2014-01-16 11:41 UTC by Tomas Sykora
Modified: 2018-12-06 15:41 UTC (History)
10 users (show)

Fixed In Version:
Doc Type: Bug Fix
Doc Text:
In the JBoss Data Grid Library and JBoss Operations Network integration, the default <literal>5000</literal> millisecond timeout period for an availability check operation is not sufficient to monitor an application that uses JBoss Data Grid libraries deployed on a JBoss EAP server. As a result, the availability of the embedded Infinispan cache manager is inaccurately reported as alternating between <literal>UP</literal> and <literal>DOWN</literal>. Additionally, the monitoring of cache statistics is affected by this issue. In order to avoid this issue, use JBoss Data Grid library plugin 6.3.0 and also a patch (Update 02) for JBoss Operations Network 3.2.0. The patch can be found on customer portal in respective section (JBoss Operations Network product, Version 3.2.0, tab Patches) and is named: Red Hat JBoss Operations Network 3.2 Update 02.
Clone Of:
Environment:
Last Closed: 2015-01-26 14:05:13 UTC
Type: Bug
Embargoed:


Attachments (Terms of Use)


Links
System ID Private Priority Status Summary Last Updated
Red Hat Issue Tracker ISPN-4235 0 Major Resolved RHQ JMX does not support multiple cache managers properly 2018-03-13 07:28:55 UTC
Red Hat Issue Tracker ISPN-4236 0 Major Resolved RHQ JMX plugin causes Cache/CacheManager to report as not available 2018-03-13 07:28:55 UTC

Description Tomas Sykora 2014-01-16 11:41:01 UTC
Used the latest productized brew build + EAP 6.2. (i.e. https://brewweb.devel.redhat.com/buildinfo?buildID=327624 + http://download.devel.redhat.com/devel/candidates/JBEAP/JBEAP-6.2.0/)

A support application using JDG libraries is deployed on EAP 6.2 server and normally should be detected as a JMX Server resource in JON by JDG library plugin. Then, cache manager resource is detected together with all underlying caches. 

NOTE: this ^ scenario works for EAP 6.0, 6.1 and 6.1.1.

However, when using EAP 6.2 for application deployment, availability of Infinispan cache manager can't be examined properly and it behaves as unavailable resource (red exclamation mark sign).

NOTE 2: It is possible to run operation on cache manager resource, for example create new cache with given name and this operation result is successful. However, child cache resources are not shown properly.


Extracted error message:

org.rhq.core.pc.inventory.TimeoutException: Call to [org.infinispan.rhq.CacheManagerComponent.getAvailability()] with args [] timed out after 5000 milliseconds - invocation thread will be interrupted.
	at org.rhq.core.pc.inventory.ResourceContainer$ResourceComponentInvocationHandler.invokeInNewThreadWithLock(ResourceContainer.java:583)
	at org.rhq.core.pc.inventory.ResourceContainer$ResourceComponentInvocationHandler.invoke(ResourceContainer.java:549)
	at $Proxy52.getAvailability(Unknown Source)
	at org.rhq.core.pc.inventory.AvailabilityExecutor.safeGetAvailability(AvailabilityExecutor.java:436)
	at org.rhq.core.pc.inventory.AvailabilityExecutor.checkInventory(AvailabilityExecutor.java:359)
	at org.rhq.core.pc.inventory.AvailabilityExecutor.startScan(AvailabilityExecutor.java:195)
	at org.rhq.core.pc.inventory.CustomScanRootAvailabilityExecutor.startScan(CustomScanRootAvailabilityExecutor.java:71)
	at org.rhq.core.pc.inventory.AvailabilityExecutor.call(AvailabilityExecutor.java:152)
	at org.rhq.core.pc.inventory.AvailabilityExecutor.call(AvailabilityExecutor.java:62)

Comment 2 Tomas Sykora 2014-01-16 12:26:36 UTC
New investigation:

passing -Drhq.agent.plugins.availability-scan.timeout=30000 during an agent start is a possible workaround for this issue. All respective child resources under JMX Server are properly UP. (i.e. Cache Manager and Caches)

The question is why that availability check takes so long and 5s is not enough for app deployed in EAP 6.2. What changed?

This still has an impact on customer which is expecting immediate resource operational possibilities without passing any other special parameters during agent start, therefore, severity still remains HIGH.

Comment 3 Tomas Sykora 2014-02-27 12:31:18 UTC
Current info:

It seems this is not a problem when deploying on released EAP 6.2.1.
Experiencing the same problem with EAP 6.3 Alpha1.

Comment 4 John Mazzitelli 2014-02-27 13:50:26 UTC
FWIW, side note that the availability checking has been drastically refactored in the latest code (I think it made it into JON 3.2.0 but not sure).

rhq.agent.plugins.availability-scan.timeout (IIRC) is no longer used.

Jay S. should comment more on this.

Comment 5 Jay Shaughnessy 2014-02-27 14:42:53 UTC
This does not sound like an issue with RHQ infrastructure or availability checking, but rather with the implementation of the component's getAvailability() method.  It seems that the mechanism used to determine availability is sensitive to changes occurring in the EAP release.

For the JON 3.2.0 release you can set rhq.agent.plugins.availability-scan.timeout to accommodate slow implementations of getAvailability(). But recognize that it can have a downside for availability timeliness, because the avail checker will now wait up to 30 seconds (given the example setting above) for *each* check that is slow to return.  So, a couple of slow resources can cause a very slow availability scan.

Another alternative/workaround is to disable the 'Availability' metric for the problem resource (or resource type, via a template).  This will cause the avail to just defer to the parent resource's availability.  No call to the bad resource's getAvailability() method will be made.  Of course this also means that you won't get independent availability values for the given resources.

Or, you could change the type to support async avail checking.  But that is deprecated in 3.2.0 because in 3.3.0 all avail checking will be inherently async in order to better accommodate slow availability checks without slowing down the avail scan.

Overall, it seems the root cause of this issue is the mechanism used for the getAvailability()for  Infinispan cache manager resources.

Comment 6 Martin Gencur 2014-04-04 09:45:14 UTC
I'm increasing severity of this BZ. The long waiting interval for statistics is a real problem for QE as this completely degrades the automated test suite that we use for testing. Currently, we have to test everything manually which takes a lot of time, and it will be difficult to test everything in time for JDG 6.3.

Comment 7 William Burns 2014-04-24 18:22:32 UTC
Looking into this a bit closer it appears there are 2 places of time being spent:

{code}
   public AvailabilityType getAvailability() {
      boolean trace = log.isTraceEnabled();
      EmsConnection conn = getEmsConnection();        <------ here
      try {
         conn.refresh();                              <------ here
         EmsBean bean = queryCacheManagerBean(conn);
         if (bean != null) {
            bean.refreshAttributes();
            if (trace) log.trace("Cache manager "+bean+" could be found and attributes where refreshed, so it's up.");
            return AvailabilityType.UP;
         }
         if (trace) log.trace("Cache manager could not be found, so cache manager is down");
         return AvailabilityType.DOWN;
      } catch (Exception e) {
         if (trace) log.trace("There was an exception checking availability, so cache manager is down");
         return AvailabilityType.DOWN;
      }
   }
{code}

Looking into those more closely it appears that getEmsConnection of JMXServerComponent.java has changed in newer version of the rhq jmx plugin to have additional reconnection checks.

{code}
    public EmsConnection getEmsConnection() {
        this.reconnectIfDisconnected();   <----- This is new

        return this.connection;
    }

<---- This whole method is new ---->
    private synchronized void reconnectIfDisconnected() {
        if ((connectionProvider) == null || !connectionProvider.isConnected()) {
            if (connection != null) {
                try {
                    connection.close();
                } catch (Exception e) {
                    log.debug("Unable to close existing EmsConnection for " + context.getResourceType() + "["
                        + context.getResourceKey() + "] due to error: " + e);
                }
            }

            try {
                internalStart();
            } catch (Exception e) {
                log.debug("Unable to reconnect to " + context.getResourceType() + "[" + context.getResourceKey()
                    + "] due to error: " + e);
            }
        }
    }
{code}

The new method calls to connectionProvider.isConnected() which calls into an implmentation from mc4j which has the source here https://github.com/rhq-project/ems/blob/master/src/ems-impl/org/mc4j/ems/impl/jmx/connection/support/providers/JMXRemotingConnectionProvider.java#L169  If you notice the isConnected method queries every single mbean in the server, which will get slower as more mbeans are added.

Also the refresh method above calls into mc4j which does the exact same thing, querying all mbeans.  https://github.com/rhq-project/ems/blob/master/src/ems-impl/org/mc4j/ems/impl/jmx/connection/DConnection.java#L133

So what happened is the additional querying of all mbeans is just enough to take the operation to take longer then the default timeout.

Unfortunately with the way the code is, it is not possible to remove the additional mbean querying that is now occurring, and I am not sure if we can remove the refresh or not.  I need to look into it further.

Comment 8 JBoss JIRA Server 2014-04-25 11:57:58 UTC
William Burns <wburns> updated the status of jira ISPN-4236 to Coding In Progress

Comment 9 Tomas Sykora 2014-05-19 12:42:02 UTC
I can still see those WARNings in agent LOG:

2014-05-19 08:38:04,046 WARN  [MeasurementManager.collector-1] (rhq.core.pc.measurement.MeasurementCollectorRunner)- Failure to collect measurement data for Resource[id=11995, uuid=07982a3d-3546-45ef-822d-c07b34033b92, type={Infinispan}Infinispan Cache Manager, key=DefaultCacheManager, name=DefaultCacheManager, parent=/home/hudson/jboss-eap6-standalone-2000/jboss-modules.jar MyEAP_2000, version=6.1.0.ER3-redhat-1] - cause: java.lang.NullPointerException:null
2014-05-19 08:38:05,843 WARN  [InventoryManager.availability-1] (rhq.core.pc.inventory.ResourceContainer$ComponentInvocation)- Invocation has been marked interrupted for method [AvailabilityFacet.getAvailability] on resource [Resource[id=12041, uuid=5c3f1183-210f-4573-a713-ba5a3fe0b71b, type={JMX}JMX Server, key=/home/hudson/jboss-eap6-standalone-2000/jboss-modules.jar{MyEAP_2000}, name=/home/hudson/jboss-eap6-standalone-2000/jboss-modules.jar MyEAP_2000, parent=auto-jdg-agent1.bc.jonqe.lab.eng.bos.redhat.com, version=1.6.0_24]]
2014-05-19 08:38:05,843 WARN  [InventoryManager.availability-1] (rhq.core.pc.inventory.AvailabilityExecutor)- Availability collection timed out on Resource[id=12041, uuid=5c3f1183-210f-4573-a713-ba5a3fe0b71b, type={JMX}JMX Server, key=/home/hudson/jboss-eap6-standalone-2000/jboss-modules.jar{MyEAP_2000}, name=/home/hudson/jboss-eap6-standalone-2000/jboss-modules.jar MyEAP_2000, parent=auto-jdg-agent1.bc.jonqe.lab.eng.bos.redhat.com, version=1.6.0_24], availability will be reported as DOWN


Tried with JON 3.2.0, re-spin number 29. 
Will, do we need some changes implemented in JON itself?

I am looking for the information whether I should obtain very latest JON build. But still, we want this to work with JON 3.2.0.GA (x-respin). Maybe next re-spin would need to contain particular fix from JON side?

Comment 10 Tomas Sykora 2014-05-19 12:51:24 UTC
Version of JDG product plug-in is 6.3.0.ER3...

Strange, now I can see main resources (cache manager, caches) UP and available, however, JON is not able to monitor statistics properly due to:

2014-05-19 08:46:43,575 ERROR [WorkerThread#0[10.16.23.101:47298]] (rhq.core.pc.measurement.MeasurementManager)- Could not get measurement values
java.lang.NullPointerException
	at org.rhq.plugins.jmx.MBeanResourceComponent.getEmsConnection(MBeanResourceComponent.java:574)
	at org.infinispan.rhq.CacheComponent.getValues(CacheComponent.java:97)
	at sun.reflect.GeneratedMethodAccessor68.invoke(Unknown Source)
	at sun.reflect.DelegatingMethodAccessorImpl.invoke(DelegatingMethodAccessorImpl.java:43)
	at java.lang.reflect.Method.invoke(Method.java:616)
	at org.rhq.core.pc.inventory.ResourceContainer$ComponentInvocation.call(ResourceContainer.java:654)
	at java.util.concurrent.FutureTask$Sync.innerRun(FutureTask.java:334)
	at java.util.concurrent.FutureTask.run(FutureTask.java:166)
	at java.util.concurrent.ThreadPoolExecutor.runWorker(ThreadPoolExecutor.java:1110)
	at java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:603)
	at java.lang.Thread.run(Thread.java:679)


The result is that availability scan and gathering of statistics is not stable. This is regression as for later releases the process of discovering was quite smooth.

Another relevant information:

2014-05-19 08:35:41,467 INFO  [ResourceContainer.invoker.daemon-1672] (ems.impl.jmx.connection.DConnection)- Querying MBeanServer for all MBeans...
2014-05-19 08:35:42,834 INFO  [ResourceContainer.invoker.daemon-1672] (ems.impl.jmx.connection.DConnection)- Found 1079 MBeans - starting load...

this is for one EAP 6.2.1.GA server with deployed Infinispan library application on it. 1079 seems to be so much for smooth availability check, but not sure about exact number in previous releases.

Comment 11 Tomas Sykora 2014-05-20 09:16:18 UTC
Quick report for 6.3.0.ER4 -- availability scan is still unstable.

Main info: The situation is a little bit better as scan is UNSTABLE. Before a fix from Will, the scan was not working almost completely. 

Now we can see some intervals when resource is reported as UP, DOWN, UP... etc. So fix leads to improvement, but unfortunately didn't fix it completely.

Comment 12 William Burns 2014-05-20 15:02:08 UTC
Which version of JON are you using btw?  As I mentioned there are changes in a third party library within JON that also fixes an issue.  Without that as more mbeans are added it becomes more likely to not get a response.

Need to have 1.3.5 version of mc4j https://github.com/rhq-project/ems/commit/7fcd48d80dee7fed102813afb8e365163901f3e7

Comment 13 Tomas Sykora 2014-05-21 11:25:42 UTC
Hi Will,
now, I know more.

As I indicated a few comments before:

"Tried with JON 3.2.0, re-spin number 29. 
Will, do we need some changes implemented in JON itself?
I am looking for the information whether I should obtain very latest JON build. But still, we want this to work with JON 3.2.0.GA (x-respin). Maybe next re-spin would need to contain particular fix from JON side?"

So in most cases I test with RELEASED JON 3.2.0-GA latest re-spin.

Now the important thing is that JON guys are doing something they call CUMULATIVE PATCH (internal name is JON 3.2.1 -- but, there is NO actual release named JON 3.2.1).

A CUMULATIVE PATCH which is applied on JON 3.2.0 can be recognized as:

3.2.0.GA Update 01 -- Update X string is added to the JON version label, under Help -> About.

This was a comment about how I found out that releases work now.
Next comment is technical related.

Comment 14 Tomas Sykora 2014-05-21 11:30:22 UTC
Now I have tried our 6.3.0-ER4 library plug-in with JON 3.2.0.GA Update 01.

I.e. there should be the fix you have mentioned but maybe we should check with JON DEV for sure. Anyway, this is the latest possible productized thing I can use for testing.

Results:

1) For small amount of a time, JON reports JMX Server as a DOWN
2) Then, it discovers it, availability goes to UP 
3) It is stable for 2 hours now = (no UP, DOWN, UP, DOWN... problems like with pure 3.2.0)

However, the problem with monitoring statistics persists. Opening new bugzilla for it.

We have null pointer here, don't know why, yet:

   @Override
   public void getValues(MeasurementReport report, Set<MeasurementScheduleRequest> metrics) {
      boolean trace = log.isTraceEnabled();
      if (trace) log.trace("Get values metrics");
>>>>>>      EmsConnection conn = getEmsConnection(); <<<<<<<<<

Comment 15 Tomas Sykora 2014-05-21 11:36:49 UTC
Mentioned issue is tracked here: https://bugzilla.redhat.com/show_bug.cgi?id=1099864 (JON (3.2.0) -- lib plug-in since 6.3.0.ER4 -- Could not get measurement values)

Comment 16 Tomas Sykora 2014-05-23 09:46:57 UTC
Added DOC text for this issue.

Comment 17 Martin Gencur 2014-05-23 10:01:30 UTC
This hasn't been fixed but I don't consider it a blocker for Beta anymore. This should be fixed for GA, though.

Comment 18 Tomas Sykora 2014-06-06 17:47:44 UTC
Thanks William!

Things are in a far more better state now + it fixed also the problem with statistics we had. We can monitor them now again! Great work.

Opening a new issue which covers problems with extensive deployments: https://bugzilla.redhat.com/show_bug.cgi?id=1105680

Comment 19 Martin Gencur 2014-07-04 08:17:24 UTC
This was fixed. We don't need to mention it in release notes unless we want to mention issues that were in Beta release notes and fixed for GA.

Comment 20 dsteigne 2014-07-07 13:20:40 UTC
Where can I get the latest JDG plugin that fixes this issue?  I have a customer wanting to test this before GA

Comment 21 Tomas Sykora 2014-07-07 13:32:14 UTC
Hello, let me point you to the right direction:

JDG Library plugin for JON is here: download.lab.bos.redhat.com/devel/candidates/JDG/JDG-6.3.0-CR1/jboss-datagrid-6.3.0.CR1-jon-library-plugin.zip

But, this is very important, the plugin is working properly with JON 3.2.2 (which was not released yet and is planned for July 16th, same as JDG)

This is the process of application cumulative patch for JON. The cumulative patch is here: http://jon01.mw.lab.eng.bos.redhat.com:8042/dist/release/jon/3.2.2.GA/6-28-2014/

This is "something" which makes from JON 3.2.0 --> JON 3.2.2.

That cumulative patch contains hot fixes from RHQ 4.11... those fixes were backported to JON product and will go out with JON 3.2.2


So, to summarize that, a customer needs ALSO JON 3.2.2 to make our JDG library plugin work properly.

I hope my comment helped. If anything ask, I will respond asap + we maybe need some guys from JON prod mgmt involved in this kind of situation.

Comment 25 Tomas Sykora 2014-07-10 13:21:05 UTC
I updated  DOC text for release notes to appropriately reflect that we will be providing a FIX for this issue and where can be found. 

Misha, please, take a look, whether this follows DOC conventions from your side! ;) Thanks!! :)

Comment 26 Misha H. Ali 2014-07-10 18:14:20 UTC
Looks good, thank you Tomas! :)

Comment 27 Tomas Sykora 2014-07-14 06:42:06 UTC
Hi Misha,
sorry for bothering you again but we needed to add an important piece of information here:

"... use JBoss Data Grid library plugin 6.3.0 and also a patch..."

To get you in context, using only JON update (patch) is not enough and we have to let customers know in this release note that they ALSO!! need library plugin from JDG 6.3.0 as it contains performance fixes.

Thank you very much for incorporating this.

Comment 28 Misha H. Ali 2014-07-14 07:05:43 UTC
Hi Tomas,

Could you have a look at the text in the Doc Text field to the top right of this bug please? In the second para, it looks to me like we have mentioned both the lib plugin and the patch, so this should satisfy your requirement in comment #27. Please let me know if this is not the case. :)

Comment 29 Misha H. Ali 2014-07-14 07:11:15 UTC
Had a chat with Tomas and cleared this up. He has already made the required change. Thanks, Tomas!

Comment 30 Tomas Sykora 2014-07-14 11:29:16 UTC
Added "s" in JBoss Operation+s ==> Operations product... to be correct.


Note You need to log in before you can comment on or make changes to this bug.