Description of problem:
After about 12 hours of running my test setup, I found one of my 2 agents (AUTO) in a broken state.

My setup:
JON 3.1.ER4 server
Agent AUTO
Agent MANUAL
On both agents there is EAP6 in domain and standalone mode (so 2 instances for each agent). I run automation on agent AUTO.

Version-Release number of selected component (if applicable):
Version: 3.1.0.ER4
Build Number: 1783b86:2b8d25d

How reproducible:
Very hard

Steps to Reproduce:
I am not aware of any steps that trigger the issue initially. But once it appears, it is easy to reproduce on the running agent:
1. Remove any platform child from inventory (I removed RHQ Server AS)
2. Import it again - it takes more time than usual for the resource to appear in the discovery queue

Actual results:
During or after the import you get a bunch of exceptions in the agent log saying:

org.rhq.core.clientapi.agent.PluginContainerException: Failed to obtain classloader for resource: Resource[id=0, uuid=d89d5c8b-03bd-42c9-8ad6-a0fd1087eefc, type={JBossAS7}SocketBindingGroup, key=socket-binding-group=standard-sockets, name=standard-sockets, parent=EAP Domain Controller (0.0.0.0:8990)]
Caused by: org.rhq.core.clientapi.agent.PluginContainerException: [Warning] Missing parent resource container for parent resource=Resource[id=11071, uuid=277964d5-e328-432a-b5e1-95df0b439d05, type={JBossAS7}JBossAS7 Host Controller, key=/home/hudson/jbas-instances/jboss-eap6-domain/domain, name=EAP Domain Controller (0.0.0.0:8990), parent=dhcp-31-185.brq.redhat.com, version=EAP 6.0.0.GA]

Additional info:
1. EAP Domain Controller was NOT the resource being inventoried
2. If you repeat the steps to reproduce, it is not always 'key=socket-binding-group=standard-sockets, name=standard-sockets' that has issues
3. There is a possible relation to the AS7 plugin
Created attachment 586320 [details] agent.log
Setting to urgent for further investigation.
(9:40:47 AM) jshaughn: The BZ from Libor looks like the same thing I saw in Ian's log
(9:40:54 AM) jshaughn: I think I can explain this
(9:41:46 AM) jshaughn: It has to do, I think, with an agent sync happening, that includes uninventoried resources, at the same time the hierarchy is being traversed
(9:43:11 AM) jshaughn: to get a resource classloader you need the parent container, which, I think may have gone away due to the sync
(9:43:31 AM) jshaughn: that's the theory at least
(10:03:11 AM) lzoubek: jshaughn, so this might happen when I uninventory a resource during service scan on one of its children? If you do not need to see my broken setup, I'll clean it up and try reproducing
(10:07:01 AM) jshaughn: lzoubek: yes
(10:07:16 AM) jshaughn: that's basically the theory I have so far
(10:07:45 AM) jshaughn: The agent can run a sync at the same time as executing an avail scan, or a discovery scan, for example.
(10:08:01 AM) lzoubek: ok, I'll play with it
(10:08:02 AM) jshaughn: those scans typically recurse through the inventory
(10:08:29 AM) jshaughn: but if we alter the inventory during that time it is possible that a parent could disappear
(10:08:40 AM) jshaughn: if the sync removes resources
(10:09:06 AM) jshaughn: so, we may need to do something to protect against this, or to gracefully accept it
(10:09:14 AM) jshaughn: probably the latter
(10:09:29 AM) jshaughn: as to protect would probably mean adding more locking
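For illustration only, here is a minimal sketch of what the "protect with more locking" option from the chat could look like. It assumes a toy inventory model (a map from child resource id to parent resource id); the class `LockedInventorySketch` and its methods are hypothetical and are not RHQ's inventory classes. Scans hold a read lock for their whole traversal, and a sync that removes resources needs the write lock, so a parent cannot disappear in the middle of a scan.

```java
import java.util.HashMap;
import java.util.Map;
import java.util.concurrent.locks.ReentrantReadWriteLock;

// Hypothetical toy model, NOT the RHQ agent's inventory API.
final class LockedInventorySketch {
    private final ReentrantReadWriteLock lock = new ReentrantReadWriteLock();
    // child resource id -> parent resource id
    private final Map<Integer, Integer> parentOf = new HashMap<>();

    void add(int childId, int parentId) {
        lock.writeLock().lock();
        try {
            parentOf.put(childId, parentId);
        } finally {
            lock.writeLock().unlock();
        }
    }

    // Models a sync that removes an uninventoried resource and its direct children.
    // It must wait until no scan holds the read lock.
    void syncRemove(int resourceId) {
        lock.writeLock().lock();
        try {
            parentOf.remove(resourceId);
            parentOf.values().removeIf(parentId -> parentId == resourceId);
        } finally {
            lock.writeLock().unlock();
        }
    }

    // Models an availability or discovery scan: the read lock is held for the
    // whole traversal, so the set of resources cannot change underneath it.
    int scan() {
        lock.readLock().lock();
        try {
            int visited = 0;
            for (Map.Entry<Integer, Integer> entry : parentOf.entrySet()) {
                visited++; // a real scan would do per-resource work here
            }
            return visited;
        } finally {
            lock.readLock().unlock();
        }
    }
}
```

The cost of this approach is that a long-running scan blocks every sync for its full duration, which is consistent with the preference in the chat for accepting the missing parent gracefully instead.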
org.rhq.core.clientapi.agent.PluginContainerException: Failed to obtain classloader for resource: Resource[id=0, uuid=d89d5c8b-03bd-42c9-8ad6-a0fd1087eefc, type={JBossAS7}SocketBindingGroup, key=socket-binding-group=standard-sockets, name=standard-sockets, parent=EAP Domain Controller (0.0.0.0:8990)]

The id on this resource is 0, which tells me it is in the NEW inventory state (it hasn't been committed). I wonder whether we simply don't assign classloaders to resources that are not yet committed. In any case, this is a new resource: id=0 means it hasn't been synced with the server, so it can't have reached the COMMITTED state yet (the agent doesn't COMMIT by itself). And we shouldn't be doing anything with new resources.
Jay, it looks like your theory is correct. I did the following:
* ./rhq-agent.sh --purgedata
* imported both EAPs
* right after that, removed both EAP resources from inventory

At this point I got a bunch of classloader exceptions in agent.log. I can now reproduce it 100%.

I also tried this:
* ./rhq-agent.sh --purgedata
* imported both EAPs
* waited 20 minutes
* removed both EAP resources from inventory

In that case there are no classloader errors.
handle condition of a missing parent resourceContainer more gracefully in a few places, since it's normal in situations where the corresponding resource was just uninventoried - we now log a DEBUG message, rather than an ERROR message + stack trace; add a PC integration test that verifies Resource uninventory works: [master http://git.fedorahosted.org/git?p=rhq/rhq.git;a=commitdiff;h=5c4322c]
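The idea behind this change can be sketched roughly as follows. This is not the code from the commit: the class `ClassLoaderLookupSketch`, its methods, and the map-based model of resource containers are all hypothetical, and it uses `java.util.logging` (where `FINE` plays the role of DEBUG) only to stay self-contained. It shows a classloader lookup that treats a missing parent container as an expected outcome and logs it quietly instead of throwing.

```java
import java.util.Map;
import java.util.Optional;
import java.util.concurrent.ConcurrentHashMap;
import java.util.logging.Level;
import java.util.logging.Logger;

// Hypothetical stand-in for the agent's inventory, NOT the RHQ API.
final class ClassLoaderLookupSketch {
    private static final Logger LOG =
        Logger.getLogger(ClassLoaderLookupSketch.class.getName());

    // parent resource id -> classloader held by that parent's container
    private final Map<Integer, ClassLoader> parentContainers = new ConcurrentHashMap<>();

    void addContainer(int resourceId, ClassLoader classLoader) {
        parentContainers.put(resourceId, classLoader);
    }

    // Models the resource being uninventoried by a concurrent sync.
    void uninventory(int resourceId) {
        parentContainers.remove(resourceId);
    }

    // Returns the classloader for a child of the given parent, or empty if the
    // parent's container is gone. A missing container is treated as normal
    // (the parent was probably just uninventoried), so it is logged at a
    // debug-like level rather than raised as an error with a stack trace.
    Optional<ClassLoader> classLoaderForChildOf(int parentId) {
        ClassLoader classLoader = parentContainers.get(parentId);
        if (classLoader == null) {
            LOG.log(Level.FINE,
                "Missing container for parent resource {0}; skipping child", parentId);
            return Optional.empty();
        }
        return Optional.of(classLoader);
    }
}
```

A scan using such a lookup would simply skip a child whose lookup comes back empty and continue with the rest of the inventory.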
Bulk closing of items that are on_qa and in old RHQ releases, which are out for a long time and where the issue has not been re-opened since.