(04:37:45 PM) ips: i was working on a bug yesterday and noticed that no queues had been discovered under a newly imported EAP 5.0.1 resource (04:37:53 PM) ips: all other child resources got discovered fine (04:38:03 PM) ips: i then tried adding another queue via RHQ (04:38:21 PM) ips: the create child request succeeded but the queue never appeared in inventory (04:38:43 PM) ips: so i spent a couple hours debugging why the discovery was failing (04:38:58 PM) ips: and finally realizes the queue discovery component had been blacklisted (04:39:05 PM) ips: realized (04:39:57 PM) mazz: blacklisting only happens when (IIRC) it takes too long to finish (04:40:10 PM) ips: i'm not sure why it got blacklisted, but i am thinking we should make it more in the user's face which resources are blacklisted (04:40:30 PM) ips: and maybe clear out the blacklist more often (04:40:49 PM) ips: and/or increase the time limit for blacklisting (04:41:06 PM) mazz: I would first find out why the blacklisting happened (04:41:16 PM) mazz: we blacklist for a purpose (so it doesn't block everything else) (04:41:17 PM) ips: as soon as i manually removed it from the blacklist and ran discovery -f, all 3 queues were discovered fine (04:41:33 PM) mazz: I forget how often that blacklist is cleared (04:42:03 PM) ips: what do you think of the agent printing a message like the following when discovery -f is run: (04:42:20 PM) mazz: BTW: you can configure the timeout via a system prop: (04:42:20 PM) mazz: long timeout = Long.parseLong(System.getProperty("rhq.test.discovery-timeout", "300000")); (04:42:34 PM) mazz: that's in org.rhq.core.pc.inventory.InventoryManager.getDiscoveryComponentTimeout(ResourceType) (04:42:42 PM) mazz: with a TODO that I had in there to actually make it more useful :} (04:42:51 PM) ips: Discovery components for the following resource typ[es are blacklisted: blah, blah, blah - discovery will not be run for these types (04:43:06 PM) mazz: that would be nice (04:43:24 PM) ips: that would have clued me in much sooner and saved me a lot of time debugging (04:44:04 PM) ips: wow, 5 minutes.. (04:45:01 PM) ips: i no longer have the logs from when it got blacklisted. if it happens again, i'll try to figure out why it hung for 5 mins (04:45:12 PM) ips: i'll add that message to the agent (04:45:57 PM) ips: do we already log a big wanring or error when the resourcetype gets blacklisted? (04:46:40 PM) mazz: only in debug mode (04:46:41 PM) mazz: } catch (BlacklistedException be) { (04:46:41 PM) mazz: log.debug(ThrowableUtil.getAllMessages(be)); (04:46:41 PM) mazz: return null; (04:46:45 PM) mazz: feel free to change that to warn (04:46:52 PM) ips: k (04:46:58 PM) mazz: its in there twice, search for usages of BlacklistedException (04:47:06 PM) ips: k (04:47:31 PM) ips: you know what would be really helpful (04:47:57 PM) ips: if we got the hung discovery thread to dump its stack and logged that (04:48:08 PM) ips: so we could see where it was hung (04:53:52 PM) mazz: you could log that at the debug level in those catch clauses (04:54:08 PM) mazz: similar to what I do in that tx interceptor handler I wrote (04:55:36 PM) ips: yeah, i guess i could either use http://java.sun.com/j2se/1.5.0/docs/api/java/lang/Thread.html#getAllStackTraces%28%29 (04:55:53 PM) ips: and then look for the thread by id or name, assuming i know one of those (04:58:18 PM) mazz: here's where we need to log the stack trace: (04:58:19 PM) mazz: org.rhq.core.pc.util.DiscoveryComponentProxyFactory.ResourceDiscoveryComponentInvocationHandler.invokeInNewThread(Method, Object[]) (04:58:34 PM) mazz: line 219 (04:58:37 PM) mazz: or thereabouts (04:58:43 PM) mazz: just log the stack for "e" (04:58:56 PM) mazz: in fact, just modify line 219 (04:59:47 PM) mazz: change that from debug to warn and log the stack and you have everything you'll need to help track it down I think (04:59:57 PM) mazz: nothng else would need to change (at least from a logging standpoint) (05:00:55 PM) ips: i don't think the stack trace of the TimeoutException is the one we'd want (05:02:27 PM) mazz: oh right - there's a thread pool here (05:03:33 PM) mazz: this is the guy that is making the call in the thread (05:03:33 PM) mazz: org.rhq.core.pc.util.DiscoveryComponentProxyFactory.ComponentInvocationThread.call() (05:03:43 PM) mazz: // This is the actual call into the discovery component. (05:03:44 PM) mazz: Object results = this.method.invoke(this.component, this.args); (05:03:48 PM) ips: i think we could update ComponentInvocationThread to expose its Thread object via a getter (05:04:27 PM) mazz: yeah (05:04:30 PM) ips: and then when the timeout occurs, grab that thread and do thread.getStackTrace() (05:04:43 PM) mazz: it gets the currentThread anyway to set the context class loader (05:04:50 PM) mazz: so have it squirrel it away (05:04:55 PM) mazz: and expose in getter (05:05:06 PM) mazz: or, rather than expose the thread itself (05:05:09 PM) mazz: maybe just the stack (05:05:16 PM) mazz: "String getStackTrace" (05:05:21 PM) mazz: since that's what we'll really want (05:05:37 PM) ips: yeah, true, a bit more encapsulated (05:14:29 PM) ips: are you ok with logging the stack trace at warning rather than debug? (05:14:41 PM) mazz: yeah that's fine (05:14:44 PM) ips: with the blacklist mehcnaism, it won't happen that often (05:14:46 PM) mazz: this will rarely happen (05:14:53 PM) mazz: when it does, I think we'll want to know (05:15:01 PM) mazz: +1 (05:15:06 PM) ips: that's what i'm thinking (05:15:23 PM) ips: like this queue issue is probably a bug that can be fixed in the as-5 plugin (05:15:32 PM) ips: setting the jnp timeout lower or something
Implemented as discussed above and committed (git sha 6f0bf04) and pushed to release-3.0.0 branch. Here is a summary of the changes: add better logging when a resource discovery component is timed out by the PC (log the current stack trace from the discovery thread to see where it's hung); when the 'discovery' Agent prompt command is executed and one or more resource types are blacklisted, print a message informing the user that discovery will be skipped for those resource types
QA Closing - basically a code change
Mass-closure of verified bugs against JON.