Bug 606999

Summary:	resource component discovery invocations that timeout (and get blacklisted) are very difficult to debug due to insufficient logging
Product:	[Other] RHQ Project	Reporter:	Ian Springer <ian.springer>
Component:	Plugin Container	Assignee:	Ian Springer <ian.springer>
Status:	CLOSED CURRENTRELEASE	QA Contact:	Corey Welton <cwelton>
Severity:	high	Docs Contact:
Priority:	low
Version:	1.3.1	CC:	ccrouch
Target Milestone:	---
Target Release:	---
Hardware:	All
OS:	All
Whiteboard:
Fixed In Version:	2.4	Doc Type:	Bug Fix
Doc Text:		Story Points:	---
Clone Of:		Environment:
Last Closed:	2010-08-12 16:51:17 UTC	Type:	---
Regression:	---	Mount Type:	---
Documentation:	---	CRM:
Verified Versions:		Category:	---
oVirt Team:	---	RHEL 7.3 requirements from Atomic Host:
Cloudforms Team:	---	Target Upstream Version:
Embargoed:
Bug Depends On:
Bug Blocks:	725852

Description Ian Springer 2010-06-22 22:10:16 UTC

(04:37:45 PM) ips: i was working on a bug yesterday and noticed that no queues had been discovered under a newly imported EAP 5.0.1 resource
(04:37:53 PM) ips: all other child resources got discovered fine
(04:38:03 PM) ips: i then tried adding another queue via RHQ
(04:38:21 PM) ips: the create child request succeeded but the queue never appeared in inventory
(04:38:43 PM) ips: so i spent a couple hours debugging why the discovery was failing 
(04:38:58 PM) ips: and finally realizes the queue discovery component had been blacklisted
(04:39:05 PM) ips: realized
(04:39:57 PM) mazz: blacklisting only happens when (IIRC) it takes too long to finish
(04:40:10 PM) ips: i'm not sure why it got blacklisted, but i am thinking we should make it more in the user's face which resources are blacklisted
(04:40:30 PM) ips: and maybe clear out the blacklist more often
(04:40:49 PM) ips: and/or increase the time limit for blacklisting
(04:41:06 PM) mazz: I would first find out why the blacklisting happened
(04:41:16 PM) mazz: we blacklist for a purpose (so it doesn't block everything else)
(04:41:17 PM) ips: as soon as i manually removed it from the blacklist and ran discovery -f, all 3 queues were discovered fine
(04:41:33 PM) mazz: I forget how often that blacklist is cleared
(04:42:03 PM) ips: what do you think of the agent printing a message like the following when discovery -f is run:
(04:42:20 PM) mazz: BTW: you can configure the timeout via a system prop:
(04:42:20 PM) mazz:         long timeout = Long.parseLong(System.getProperty("rhq.test.discovery-timeout", "300000"));
(04:42:34 PM) mazz: that's in org.rhq.core.pc.inventory.InventoryManager.getDiscoveryComponentTimeout(ResourceType)
(04:42:42 PM) mazz: with a TODO that I had in there to actually make it more useful :}
(04:42:51 PM) ips: Discovery components for the following resource typ[es are blacklisted: blah, blah, blah - discovery will not be run for these types
(04:43:06 PM) mazz: that would be nice
(04:43:24 PM) ips: that would have clued me in much sooner and saved me a lot of time debugging
(04:44:04 PM) ips: wow, 5 minutes..
(04:45:01 PM) ips: i no longer have the logs from when it got blacklisted. if it happens again, i'll try to figure out why it hung for 5 mins
(04:45:12 PM) ips: i'll add that message to the agent
(04:45:57 PM) ips: do we already log a big wanring or error when the resourcetype gets blacklisted?
(04:46:40 PM) mazz: only in debug mode
(04:46:41 PM) mazz:         } catch (BlacklistedException be) {
(04:46:41 PM) mazz:             log.debug(ThrowableUtil.getAllMessages(be));
(04:46:41 PM) mazz:             return null;
(04:46:45 PM) mazz: feel free to change that to warn
(04:46:52 PM) ips: k
(04:46:58 PM) mazz: its in there twice, search for usages of BlacklistedException
(04:47:06 PM) ips: k
(04:47:31 PM) ips: you know what would be really helpful
(04:47:57 PM) ips: if we got the hung discovery thread to dump its stack and logged that
(04:48:08 PM) ips: so we could see where it was hung
(04:53:52 PM) mazz: you could log that at the debug level in those catch clauses
(04:54:08 PM) mazz: similar to what I do in that tx interceptor handler I wrote 
(04:55:36 PM) ips: yeah, i guess i could either use http://java.sun.com/j2se/1.5.0/docs/api/java/lang/Thread.html#getAllStackTraces%28%29
(04:55:53 PM) ips: and then look for the thread by id or name, assuming i know one of those
(04:58:18 PM) mazz: here's where we need to log the stack trace:
(04:58:19 PM) mazz: org.rhq.core.pc.util.DiscoveryComponentProxyFactory.ResourceDiscoveryComponentInvocationHandler.invokeInNewThread(Method, Object[])
(04:58:34 PM) mazz: line 219
(04:58:37 PM) mazz: or thereabouts
(04:58:43 PM) mazz: just log the stack for "e"
(04:58:56 PM) mazz: in fact, just modify line 219
(04:59:47 PM) mazz: change that from debug to warn and log the stack and you have everything you'll need to help track it down I think
(04:59:57 PM) mazz: nothng else would need to change (at least from a logging standpoint)
(05:00:55 PM) ips: i don't think the stack trace of the TimeoutException is the one we'd want
(05:02:27 PM) mazz: oh right - there's a thread pool here
(05:03:33 PM) mazz: this is the guy that is making the call in the thread
(05:03:33 PM) mazz: org.rhq.core.pc.util.DiscoveryComponentProxyFactory.ComponentInvocationThread.call()
(05:03:43 PM) mazz:                 // This is the actual call into the discovery component.
(05:03:44 PM) mazz:                 Object results = this.method.invoke(this.component, this.args);
(05:03:48 PM) ips: i think we could update ComponentInvocationThread to expose its Thread object via a getter
(05:04:27 PM) mazz: yeah
(05:04:30 PM) ips: and then when the timeout occurs, grab that thread and do thread.getStackTrace()
(05:04:43 PM) mazz: it gets the currentThread anyway to set the context class loader
(05:04:50 PM) mazz: so have it squirrel it away
(05:04:55 PM) mazz: and expose in getter
(05:05:06 PM) mazz: or, rather than expose the thread itself
(05:05:09 PM) mazz: maybe just the stack
(05:05:16 PM) mazz: "String getStackTrace"
(05:05:21 PM) mazz: since that's what we'll really want
(05:05:37 PM) ips: yeah, true, a bit more encapsulated
(05:14:29 PM) ips: are you ok with logging the stack trace at warning rather than debug?
(05:14:41 PM) mazz: yeah that's fine
(05:14:44 PM) ips: with the blacklist mehcnaism, it won't happen that often
(05:14:46 PM) mazz: this will rarely happen
(05:14:53 PM) mazz: when it does, I think we'll want to know
(05:15:01 PM) mazz: +1
(05:15:06 PM) ips: that's what i'm thinking
(05:15:23 PM) ips: like this queue issue is probably a bug that can be fixed in the as-5 plugin
(05:15:32 PM) ips: setting the jnp timeout lower or something

Comment 1 Ian Springer 2010-06-22 22:15:39 UTC

Implemented as discussed above and committed (git sha 6f0bf04) and pushed to release-3.0.0 branch. Here is a summary of the changes:

add better logging when a resource discovery component is timed out by the PC (log the current stack trace from the discovery thread to see where it's hung); when the 'discovery' Agent prompt command is executed and one or more resource types are blacklisted, print a message informing the user that discovery will be skipped for those resource types

Comment 2 Corey Welton 2010-06-24 15:02:19 UTC

QA Closing - basically a code change

Comment 3 Corey Welton 2010-08-12 16:51:17 UTC

Mass-closure of verified bugs against JON.