Bug 824401

Summary:

Missing parent resource container for parent resource

Product:

[Other] RHQ Project

Reporter:

Libor Zoubek <lzoubek>

Component:

Plugin Container, Plugins

Assignee:

Charles Crouch <ccrouch>

Status:

CLOSED CURRENTRELEASE

QA Contact:

Mike Foley <mfoley>

Severity:

urgent

Docs Contact:

Priority:

urgent

Version:

4.4

CC:

ccrouch, hbrock, hrupp, jshaughn, loleary, mazz, theute

Target Milestone:

---

Target Release:

RHQ 4.5.0

Hardware:

Unspecified

OS:

Unspecified

Whiteboard:

Fixed In Version:

4.5

Doc Type:

Bug Fix

Doc Text:

Story Points:

---

Clone Of:

Clones:

825019

Environment:

Last Closed:

2013-09-01 10:03:48 UTC

Type:

Bug

Bug Blocks:

782579, 825019

Attachments:

Description	Flags
agent.log	none

Description Libor Zoubek 2012-05-23 12:05:33 UTC

Description of problem: After some time (12 hours) of running my test setup, I found one of my 2 agents (AUTO) in a kind of broken state.

My setup:
JON 3.1.ER4 server
Agent AUTO
Agent MANUAL

Both agents have EAP6 running in domain and standalone mode (so 2 instances per agent). I run automation on agent AUTO.

Version-Release number of selected component (if applicable):
Version: 3.1.0.ER4 
Build Number: 1783b86:2b8d25d 

How reproducible: very hard


Steps to Reproduce:

There are no steps that I am aware of. But once this issue comes into play, it is easy to reproduce again on the running agent.

1. Remove any platform child from inventory (I removed RHQ Server AS).
2. Import it again. It takes more time for the resource to appear in the discovery queue.

Actual results: During or after import you get a bunch of exceptions in the agent log saying:

org.rhq.core.clientapi.agent.PluginContainerException: Failed to obtain classloader for resource: Resource[id=0, uuid=d89d5c8b-03bd-42c9-8ad6-a0fd1087eefc, type={JBossAS7}SocketBindingGroup, key=socket-binding-group=standard-sockets, name=standard-sockets, parent=EAP Domain Controller (0.0.0.0:8990)]

Caused by: org.rhq.core.clientapi.agent.PluginContainerException: [Warning] Missing parent resource container for parent resource=Resource[id=11071, uuid=277964d5-e328-432a-b5e1-95df0b439d05, type={JBossAS7}JBossAS7 Host Controller, key=/home/hudson/jbas-instances/jboss-eap6-domain/domain, name=EAP Domain Controller (0.0.0.0:8990), parent=dhcp-31-185.brq.redhat.com, version=EAP 6.0.0.GA]



Additional info:
1. EAP Domain Controller was NOT the resource being inventoried.
2. If you repeat the steps to reproduce, it is not always 'key=socket-binding-group=standard-sockets, name=standard-sockets' that has issues.
3. There is a possible relation to the AS7 plugin.

Comment 1 Libor Zoubek 2012-05-23 12:07:34 UTC

Created attachment 586320
agent.log

Comment 2 Charles Crouch 2012-05-23 14:04:05 UTC

Setting to urgent for further investigation.

Comment 3 Jay Shaughnessy 2012-05-23 14:16:13 UTC

(9:40:47 AM) jshaughn: The BZ from Libor looks like the same thing I saw in Ian's log
(9:40:54 AM) jshaughn: I think I can explain this
(9:41:46 AM) jshaughn: It has to do, I think, with an agent sync happening, that includes uninventoried resources, at the same time the hierarchy is being traversed
(9:43:11 AM) jshaughn: to get a resource classloader you need the parent container, which, I think may have gone away due to the sync
(9:43:31 AM) jshaughn: that's the theory at least
(10:03:11 AM) lzoubek: jshaughn, so this might happen when I uninventory a resource during service scan on one of its children? If you do not need to see my broken setup, I'll clean it up and try reproducing
(10:07:01 AM) jshaughn: lzoubek: yes
(10:07:16 AM) jshaughn: that's basically the theory I have so far
(10:07:45 AM) jshaughn: The agent can run a sync at the same time as executing an avail scan, or a discovery scan, for example.
(10:08:01 AM) lzoubek: ok, I'll play with it
(10:08:02 AM) jshaughn: those scans typically recurse through the inventory
(10:08:29 AM) jshaughn: but if we alter the inventory during that time it is possible that a parent could disappear
(10:08:40 AM) jshaughn: if the sync removes resources
(10:09:06 AM) jshaughn: so, we may need to do something to protect against this, or to gracefully accept it
(10:09:14 AM) jshaughn: probably the latter
(10:09:29 AM) jshaughn: as to protect would probably mean adding more locking
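
The theory in the chat above can be sketched in a few lines of Java. This is a minimal model, not RHQ code: the class `InventorySketch`, its two maps, the method names, and the child id 42 are all hypothetical stand-ins for the plugin container's internals. It shows a child whose parent container has been removed by a sync, and a lookup that accepts the gap gracefully instead of adding more locking.

```java
import java.util.Map;
import java.util.Optional;
import java.util.concurrent.ConcurrentHashMap;

// Hypothetical model of the agent-side inventory; not the real RHQ classes.
final class InventorySketch {
    // resource id -> name of the classloader held by that resource's container
    private final Map<Integer, String> containers = new ConcurrentHashMap<>();
    // child resource id -> parent resource id
    private final Map<Integer, Integer> parents = new ConcurrentHashMap<>();

    void add(int id, Integer parentId, String classLoaderName) {
        containers.put(id, classLoaderName);
        if (parentId != null) {
            parents.put(id, parentId);
        }
    }

    // Models a sync that uninventories a resource: its container goes away,
    // while children found by a running scan may still point at it.
    void uninventory(int id) {
        containers.remove(id);
    }

    // A child's classloader is derived from its parent's container. If a
    // concurrent sync removed the parent, return empty instead of throwing.
    Optional<String> classLoaderFor(int childId) {
        Integer parentId = parents.get(childId);
        if (parentId == null) {
            return Optional.ofNullable(containers.get(childId));
        }
        String parentLoader = containers.get(parentId);
        if (parentLoader == null) {
            return Optional.empty(); // parent disappeared mid-scan
        }
        return Optional.of(parentLoader + "/child-" + childId);
    }

    public static void main(String[] args) {
        InventorySketch inv = new InventorySketch();
        inv.add(11071, null, "host-controller");
        inv.add(42, 11071, "socket-binding-group");
        System.out.println(inv.classLoaderFor(42).isPresent());
        inv.uninventory(11071); // sync removes the parent during a scan
        System.out.println(inv.classLoaderFor(42).isPresent());
    }
}
```

Returning an empty `Optional` leaves it to the caller (a scan, in this model) to skip the child, which matches the "gracefully accept it" option mentioned in the chat.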

Comment 4 John Mazzitelli 2012-05-23 14:54:10 UTC

org.rhq.core.clientapi.agent.PluginContainerException: Failed to obtain classloader for resource: Resource[id=0, uuid=d89d5c8b-03bd-42c9-8ad6-a0fd1087eefc, type={JBossAS7}SocketBindingGroup, key=socket-binding-group=standard-sockets, name=standard-sockets, parent=EAP Domain Controller (0.0.0.0:8990)]

The id on this resource is 0, which tells me it's in the NEW inventory state (hasn't been committed). I wonder if we don't assign classloaders to resources not yet committed?

In any case, this is a new resource (since id=0 means it hasn't been sync'ed with the server, so it can't have been in the COMMITTED state yet - the agent doesn't COMMIT by itself). And we shouldn't be doing anything with new resources.

Comment 5 Libor Zoubek 2012-05-23 15:18:57 UTC

Jay, it looks like your theory is correct:

I did the following:
 * ./rhq-agent.sh --purgedata
 * imported both EAPs
 * right after that, removed both EAP resources from inventory
At this point I got a bunch of classloader exceptions in agent.log.

I can now reproduce it 100%.

I've also tried this:
 * ./rhq-agent.sh --purgedata
 * imported both EAPs
 * waited 20 minutes
 * removed both EAP resources from inventory
In this case there were no classloader errors.

Comment 6 Ian Springer 2012-05-24 17:14:02 UTC

handle condition of a missing parent resourceContainer more gracefully in a few places, since it's normal in situations where the corresponding resource was just uninventoried - we now log a DEBUG message, rather than an ERROR message + stack trace; add a PC integration test that verifies Resource uninventory works:

[master http://git.fedorahosted.org/git?p=rhq/rhq.git;a=commitdiff;h=5c4322c]
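
The change in log level described in this comment might look roughly like the sketch below. It is not taken from the referenced commit: the class and method names are hypothetical, and `java.util.logging` is used here only to keep the example self-contained (with `FINE` standing in for DEBUG and `SEVERE` for ERROR), which is an assumption and not necessarily the logging API used in RHQ.

```java
import java.util.ArrayList;
import java.util.List;
import java.util.logging.Handler;
import java.util.logging.Level;
import java.util.logging.LogRecord;
import java.util.logging.Logger;

// Hypothetical sketch; class and method names are illustrative, not RHQ code.
final class MissingParentLoggingSketch {
    static final Logger LOG = Logger.getLogger("sketch.plugincontainer");

    // Collects log records in memory so the chosen levels can be inspected.
    static final class Recorder extends Handler {
        final List<LogRecord> records = new ArrayList<>();
        @Override public void publish(LogRecord record) { records.add(record); }
        @Override public void flush() { }
        @Override public void close() { }
    }

    // Behavior described as the old one: ERROR-level message plus stack trace.
    static void reportStrict(String parentName) {
        LOG.log(Level.SEVERE, "Missing parent resource container for " + parentName,
                new IllegalStateException("missing parent resource container"));
    }

    // Behavior described as the new one: a DEBUG-level (FINE) message only,
    // because a missing parent is normal right after an uninventory.
    static void reportGraceful(String parentName) {
        LOG.fine("No resource container for parent " + parentName
                + "; it was probably just uninventoried");
    }

    public static void main(String[] args) {
        Recorder recorder = new Recorder();
        LOG.setUseParentHandlers(false); // keep the demo output to our own prints
        LOG.setLevel(Level.ALL);         // let FINE records through
        LOG.addHandler(recorder);

        reportStrict("EAP Domain Controller");
        reportGraceful("EAP Domain Controller");

        for (LogRecord r : recorder.records) {
            System.out.println(r.getLevel() + " thrown=" + (r.getThrown() != null));
        }
    }
}
```

With default logging settings the `FINE` record would not be shown at all, which is the point of the change: the condition stays visible when debugging but no longer fills agent.log with stack traces.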

Comment 8 Heiko W. Rupp 2013-09-01 10:03:48 UTC

Bulk closing of items that are on_qa and in old RHQ releases, which have been out for a long time, and where the issue has not been re-opened since.