Used the latest productized brew build + EAP 6.2. (i.e. https://brewweb.devel.redhat.com/buildinfo?buildID=327624 + http://download.devel.redhat.com/devel/candidates/JBEAP/JBEAP-6.2.0/) A support application using JDG libraries is deployed on EAP 6.2 server and normally should be detected as a JMX Server resource in JON by JDG library plugin. Then, cache manager resource is detected together with all underlying caches. NOTE: this ^ scenario works for EAP 6.0, 6.1 and 6.1.1. However, when using EAP 6.2 for application deployment, availability of Infinispan cache manager can't be examined properly and it behaves as unavailable resource (red exclamation mark sign). NOTE 2: It is possible to run operation on cache manager resource, for example create new cache with given name and this operation result is successful. However, child cache resources are not shown properly. Extracted error message: org.rhq.core.pc.inventory.TimeoutException: Call to [org.infinispan.rhq.CacheManagerComponent.getAvailability()] with args [] timed out after 5000 milliseconds - invocation thread will be interrupted. at org.rhq.core.pc.inventory.ResourceContainer$ResourceComponentInvocationHandler.invokeInNewThreadWithLock(ResourceContainer.java:583) at org.rhq.core.pc.inventory.ResourceContainer$ResourceComponentInvocationHandler.invoke(ResourceContainer.java:549) at $Proxy52.getAvailability(Unknown Source) at org.rhq.core.pc.inventory.AvailabilityExecutor.safeGetAvailability(AvailabilityExecutor.java:436) at org.rhq.core.pc.inventory.AvailabilityExecutor.checkInventory(AvailabilityExecutor.java:359) at org.rhq.core.pc.inventory.AvailabilityExecutor.startScan(AvailabilityExecutor.java:195) at org.rhq.core.pc.inventory.CustomScanRootAvailabilityExecutor.startScan(CustomScanRootAvailabilityExecutor.java:71) at org.rhq.core.pc.inventory.AvailabilityExecutor.call(AvailabilityExecutor.java:152) at org.rhq.core.pc.inventory.AvailabilityExecutor.call(AvailabilityExecutor.java:62)
New investigation: passing -Drhq.agent.plugins.availability-scan.timeout=30000 during an agent start is a possible workaround for this issue. All respective child resources under JMX Server are properly UP. (i.e. Cache Manager and Caches) The question is why that availability check takes so long and 5s is not enough for app deployed in EAP 6.2. What changed? This still has an impact on customer which is expecting immediate resource operational possibilities without passing any other special parameters during agent start, therefore, severity still remains HIGH.
Current info: It seems this is not a problem when deploying on released EAP 6.2.1. Experiencing the same problem with EAP 6.3 Alpha1.
FWIW, side note that the availability checking has been drastically refactored in the latest code (I think it made it into JON 3.2.0 but not sure). rhq.agent.plugins.availability-scan.timeout (IIRC) is no longer used. Jay S. should comment more on this.
This does not sound like an issue with RHQ infrastructure or availability checking, but rather with the implementation of the component's getAvailability() method. It seems that the mechanism used to determine availability is sensitive to changes occurring in the EAP release. For the JON 3.2.0 release you can set rhq.agent.plugins.availability-scan.timeout to accommodate slow implementations of getAvailability(). But recognize that it can have a downside for availability timeliness, because the avail checker will now wait up to 30 seconds (given the example setting above) for *each* check that is slow to return. So, a couple of slow resources can cause a very slow availability scan. Another alternative/workaround is to disable the 'Availability' metric for the problem resource (or resource type, via a template). This will cause the avail to just defer to the parent resource's availability. No call to the bad resource's getAvailability() method will be made. Of course this also means that you won't get independent availability values for the given resources. Or, you could change the type to support async avail checking. But that is deprecated in 3.2.0 because in 3.3.0 all avail checking will be inherently async in order to better accommodate slow availability checks without slowing down the avail scan. Overall, it seems the root cause of this issue is the mechanism used for the getAvailability()for Infinispan cache manager resources.
I'm increasing severity of this BZ. The long waiting interval for statistics is a real problem for QE as this completely degrades the automated test suite that we use for testing. Currently, we have to test everything manually which takes a lot of time, and it will be difficult to test everything in time for JDG 6.3.
Looking into this a bit closer it appears there are 2 places of time being spent: {code} public AvailabilityType getAvailability() { boolean trace = log.isTraceEnabled(); EmsConnection conn = getEmsConnection(); <------ here try { conn.refresh(); <------ here EmsBean bean = queryCacheManagerBean(conn); if (bean != null) { bean.refreshAttributes(); if (trace) log.trace("Cache manager "+bean+" could be found and attributes where refreshed, so it's up."); return AvailabilityType.UP; } if (trace) log.trace("Cache manager could not be found, so cache manager is down"); return AvailabilityType.DOWN; } catch (Exception e) { if (trace) log.trace("There was an exception checking availability, so cache manager is down"); return AvailabilityType.DOWN; } } {code} Looking into those more closely it appears that getEmsConnection of JMXServerComponent.java has changed in newer version of the rhq jmx plugin to have additional reconnection checks. {code} public EmsConnection getEmsConnection() { this.reconnectIfDisconnected(); <----- This is new return this.connection; } <---- This whole method is new ----> private synchronized void reconnectIfDisconnected() { if ((connectionProvider) == null || !connectionProvider.isConnected()) { if (connection != null) { try { connection.close(); } catch (Exception e) { log.debug("Unable to close existing EmsConnection for " + context.getResourceType() + "[" + context.getResourceKey() + "] due to error: " + e); } } try { internalStart(); } catch (Exception e) { log.debug("Unable to reconnect to " + context.getResourceType() + "[" + context.getResourceKey() + "] due to error: " + e); } } } {code} The new method calls to connectionProvider.isConnected() which calls into an implmentation from mc4j which has the source here https://github.com/rhq-project/ems/blob/master/src/ems-impl/org/mc4j/ems/impl/jmx/connection/support/providers/JMXRemotingConnectionProvider.java#L169 If you notice the isConnected method queries every single mbean in the server, which will get slower as more mbeans are added. Also the refresh method above calls into mc4j which does the exact same thing, querying all mbeans. https://github.com/rhq-project/ems/blob/master/src/ems-impl/org/mc4j/ems/impl/jmx/connection/DConnection.java#L133 So what happened is the additional querying of all mbeans is just enough to take the operation to take longer then the default timeout. Unfortunately with the way the code is, it is not possible to remove the additional mbean querying that is now occurring, and I am not sure if we can remove the refresh or not. I need to look into it further.
William Burns <wburns> updated the status of jira ISPN-4236 to Coding In Progress
I can still see those WARNings in agent LOG: 2014-05-19 08:38:04,046 WARN [MeasurementManager.collector-1] (rhq.core.pc.measurement.MeasurementCollectorRunner)- Failure to collect measurement data for Resource[id=11995, uuid=07982a3d-3546-45ef-822d-c07b34033b92, type={Infinispan}Infinispan Cache Manager, key=DefaultCacheManager, name=DefaultCacheManager, parent=/home/hudson/jboss-eap6-standalone-2000/jboss-modules.jar MyEAP_2000, version=6.1.0.ER3-redhat-1] - cause: java.lang.NullPointerException:null 2014-05-19 08:38:05,843 WARN [InventoryManager.availability-1] (rhq.core.pc.inventory.ResourceContainer$ComponentInvocation)- Invocation has been marked interrupted for method [AvailabilityFacet.getAvailability] on resource [Resource[id=12041, uuid=5c3f1183-210f-4573-a713-ba5a3fe0b71b, type={JMX}JMX Server, key=/home/hudson/jboss-eap6-standalone-2000/jboss-modules.jar{MyEAP_2000}, name=/home/hudson/jboss-eap6-standalone-2000/jboss-modules.jar MyEAP_2000, parent=auto-jdg-agent1.bc.jonqe.lab.eng.bos.redhat.com, version=1.6.0_24]] 2014-05-19 08:38:05,843 WARN [InventoryManager.availability-1] (rhq.core.pc.inventory.AvailabilityExecutor)- Availability collection timed out on Resource[id=12041, uuid=5c3f1183-210f-4573-a713-ba5a3fe0b71b, type={JMX}JMX Server, key=/home/hudson/jboss-eap6-standalone-2000/jboss-modules.jar{MyEAP_2000}, name=/home/hudson/jboss-eap6-standalone-2000/jboss-modules.jar MyEAP_2000, parent=auto-jdg-agent1.bc.jonqe.lab.eng.bos.redhat.com, version=1.6.0_24], availability will be reported as DOWN Tried with JON 3.2.0, re-spin number 29. Will, do we need some changes implemented in JON itself? I am looking for the information whether I should obtain very latest JON build. But still, we want this to work with JON 3.2.0.GA (x-respin). Maybe next re-spin would need to contain particular fix from JON side?
Version of JDG product plug-in is 6.3.0.ER3... Strange, now I can see main resources (cache manager, caches) UP and available, however, JON is not able to monitor statistics properly due to: 2014-05-19 08:46:43,575 ERROR [WorkerThread#0[10.16.23.101:47298]] (rhq.core.pc.measurement.MeasurementManager)- Could not get measurement values java.lang.NullPointerException at org.rhq.plugins.jmx.MBeanResourceComponent.getEmsConnection(MBeanResourceComponent.java:574) at org.infinispan.rhq.CacheComponent.getValues(CacheComponent.java:97) at sun.reflect.GeneratedMethodAccessor68.invoke(Unknown Source) at sun.reflect.DelegatingMethodAccessorImpl.invoke(DelegatingMethodAccessorImpl.java:43) at java.lang.reflect.Method.invoke(Method.java:616) at org.rhq.core.pc.inventory.ResourceContainer$ComponentInvocation.call(ResourceContainer.java:654) at java.util.concurrent.FutureTask$Sync.innerRun(FutureTask.java:334) at java.util.concurrent.FutureTask.run(FutureTask.java:166) at java.util.concurrent.ThreadPoolExecutor.runWorker(ThreadPoolExecutor.java:1110) at java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:603) at java.lang.Thread.run(Thread.java:679) The result is that availability scan and gathering of statistics is not stable. This is regression as for later releases the process of discovering was quite smooth. Another relevant information: 2014-05-19 08:35:41,467 INFO [ResourceContainer.invoker.daemon-1672] (ems.impl.jmx.connection.DConnection)- Querying MBeanServer for all MBeans... 2014-05-19 08:35:42,834 INFO [ResourceContainer.invoker.daemon-1672] (ems.impl.jmx.connection.DConnection)- Found 1079 MBeans - starting load... this is for one EAP 6.2.1.GA server with deployed Infinispan library application on it. 1079 seems to be so much for smooth availability check, but not sure about exact number in previous releases.
Quick report for 6.3.0.ER4 -- availability scan is still unstable. Main info: The situation is a little bit better as scan is UNSTABLE. Before a fix from Will, the scan was not working almost completely. Now we can see some intervals when resource is reported as UP, DOWN, UP... etc. So fix leads to improvement, but unfortunately didn't fix it completely.
Which version of JON are you using btw? As I mentioned there are changes in a third party library within JON that also fixes an issue. Without that as more mbeans are added it becomes more likely to not get a response. Need to have 1.3.5 version of mc4j https://github.com/rhq-project/ems/commit/7fcd48d80dee7fed102813afb8e365163901f3e7
Hi Will, now, I know more. As I indicated a few comments before: "Tried with JON 3.2.0, re-spin number 29. Will, do we need some changes implemented in JON itself? I am looking for the information whether I should obtain very latest JON build. But still, we want this to work with JON 3.2.0.GA (x-respin). Maybe next re-spin would need to contain particular fix from JON side?" So in most cases I test with RELEASED JON 3.2.0-GA latest re-spin. Now the important thing is that JON guys are doing something they call CUMULATIVE PATCH (internal name is JON 3.2.1 -- but, there is NO actual release named JON 3.2.1). A CUMULATIVE PATCH which is applied on JON 3.2.0 can be recognized as: 3.2.0.GA Update 01 -- Update X string is added to the JON version label, under Help -> About. This was a comment about how I found out that releases work now. Next comment is technical related.
Now I have tried our 6.3.0-ER4 library plug-in with JON 3.2.0.GA Update 01. I.e. there should be the fix you have mentioned but maybe we should check with JON DEV for sure. Anyway, this is the latest possible productized thing I can use for testing. Results: 1) For small amount of a time, JON reports JMX Server as a DOWN 2) Then, it discovers it, availability goes to UP 3) It is stable for 2 hours now = (no UP, DOWN, UP, DOWN... problems like with pure 3.2.0) However, the problem with monitoring statistics persists. Opening new bugzilla for it. We have null pointer here, don't know why, yet: @Override public void getValues(MeasurementReport report, Set<MeasurementScheduleRequest> metrics) { boolean trace = log.isTraceEnabled(); if (trace) log.trace("Get values metrics"); >>>>>> EmsConnection conn = getEmsConnection(); <<<<<<<<<
Mentioned issue is tracked here: https://bugzilla.redhat.com/show_bug.cgi?id=1099864 (JON (3.2.0) -- lib plug-in since 6.3.0.ER4 -- Could not get measurement values)
Added DOC text for this issue.
This hasn't been fixed but I don't consider it a blocker for Beta anymore. This should be fixed for GA, though.
Thanks William! Things are in a far more better state now + it fixed also the problem with statistics we had. We can monitor them now again! Great work. Opening a new issue which covers problems with extensive deployments: https://bugzilla.redhat.com/show_bug.cgi?id=1105680
This was fixed. We don't need to mention it in release notes unless we want to mention issues that were in Beta release notes and fixed for GA.
Where can I get the latest JDG plugin that fixes this issue? I have a customer wanting to test this before GA
Hello, let me point you to the right direction: JDG Library plugin for JON is here: download.lab.bos.redhat.com/devel/candidates/JDG/JDG-6.3.0-CR1/jboss-datagrid-6.3.0.CR1-jon-library-plugin.zip But, this is very important, the plugin is working properly with JON 3.2.2 (which was not released yet and is planned for July 16th, same as JDG) This is the process of application cumulative patch for JON. The cumulative patch is here: http://jon01.mw.lab.eng.bos.redhat.com:8042/dist/release/jon/3.2.2.GA/6-28-2014/ This is "something" which makes from JON 3.2.0 --> JON 3.2.2. That cumulative patch contains hot fixes from RHQ 4.11... those fixes were backported to JON product and will go out with JON 3.2.2 So, to summarize that, a customer needs ALSO JON 3.2.2 to make our JDG library plugin work properly. I hope my comment helped. If anything ask, I will respond asap + we maybe need some guys from JON prod mgmt involved in this kind of situation.
I updated DOC text for release notes to appropriately reflect that we will be providing a FIX for this issue and where can be found. Misha, please, take a look, whether this follows DOC conventions from your side! ;) Thanks!! :)
Looks good, thank you Tomas! :)
Hi Misha, sorry for bothering you again but we needed to add an important piece of information here: "... use JBoss Data Grid library plugin 6.3.0 and also a patch..." To get you in context, using only JON update (patch) is not enough and we have to let customers know in this release note that they ALSO!! need library plugin from JDG 6.3.0 as it contains performance fixes. Thank you very much for incorporating this.
Hi Tomas, Could you have a look at the text in the Doc Text field to the top right of this bug please? In the second para, it looks to me like we have mentioned both the lib plugin and the patch, so this should satisfy your requirement in comment #27. Please let me know if this is not the case. :)
Had a chat with Tomas and cleared this up. He has already made the required change. Thanks, Tomas!
Added "s" in JBoss Operation+s ==> Operations product... to be correct.