887411 – committing/ignoring/unignoring resource causes agent sync to purge all resources

Keywords:
Status:	CLOSED CURRENTRELEASE
Alias:	None
Product:	JBoss Operations Network
Classification:	JBoss
Component:	Inventory
Sub Component:
Version:	JON 3.1.1
Hardware:	All
OS:	All
Priority:	urgent
Severity:	high
Target Milestone:	ER01
Target Release:	JON 3.2.0
Assignee:	Larry O'Leary
QA Contact:	Mike Foley
Docs Contact:
URL:
Whiteboard:
Depends On:	891951
Blocks:	892780

Reported:	2012-12-14 23:53 UTC by Larry O'Leary
Modified:	2018-11-30 20:47 UTC
CC List:	2 users
Fixed In Version:
Clone Of:
Clones:	891951 892780
Environment:
Last Closed:	2014-01-02 20:34:40 UTC
Type:	Bug
Embargoed:

Attachments
Excerpt from agent log showing freshly imported resource with bad config (143.97 KB, application/x-gzip) 2012-12-15 00:04 UTC, Larry O'Leary
proposed fix for the problem (3.40 KB, patch) 2012-12-18 22:36 UTC, John Mazzitelli

Links
System	ID	Private	Priority	Status	Summary	Last Updated
Red Hat Bugzilla	884593	0	medium	CLOSED	Alert definitions are missing after JBoss is imported in JON	2021-02-22 00:41:40 UTC
Red Hat Knowledge Base (Solution)	280793	0	None	None	None	Never

Internal Links: 884593

Description Larry O'Leary 2012-12-14 23:53:57 UTC

Description of problem:
If a new resource is imported and contains an invalid configuration that prevents the resource from being started, all resources are marked as obsolete, resulting in a complete inventory purge and re-sync. Not only is this time-consuming and CPU-intensive on the agent, it also recurs for each resource that fails to start due to an invalid configuration.

Version-Release number of selected component (if applicable):
4.4.0.JON311GA

How reproducible:
Always

Steps to Reproduce:
1.  Disable admin account for EAP so it can not be discovered
2.  Start EAP server
3.  Start JBoss ON system (server and agent)
4.  Import RHQ Agent, RHQ Server, and Platform from discovery queue
5.  Wait for imported resources to become available in ON UI
6.  Import EAP server from discovery queue
  
Actual results:
Agent temporarily drops all inventory and the following messages appear in the agent log:

    2012-12-14 16:57:58,809 WARN  [WorkerThread#0[127.0.0.1:48685]] (rhq.core.pc.inventory.InventoryManager)- Cannot start component for Resource[id=10003, uuid=6b3182f2-2160-458f-ae69-36a1e1241efb, type={JBossAS5}JBossAS Server, key=/home/loleary/workspace/Cases/00742702/test-env/jboss-eap-5.1.2/jboss-as/server/all, name=EAP loleary:1099 all, parent=localhost.localdomain, version=EAP 5.1.2] from synchronized merge due to invalid plugin config: Failed to start component for resource Resource[id=10003, uuid=6b3182f2-2160-458f-ae69-36a1e1241efb, type={JBossAS5}JBossAS Server, key=/home/loleary/workspace/Cases/00742702/test-env/jboss-eap-5.1.2/jboss-as/server/all, name=EAP loleary:1099 all, parent=localhost.localdomain, version=EAP 5.1.2].
    2012-12-14 16:58:03,519 INFO  [WorkerThread#0[127.0.0.1:48685]] (rhq.core.pc.inventory.InventoryManager)- Detected new Platform [Resource[id=0, uuid=734bc6ba-c3b7-4af8-9e87-97acd5097e4c, type={Platforms}Linux, key=localhost.localdomain, name=localhost.localdomain, parent=<null>, version=Linux 2.6.35.14-106.fc14.x86_64]] - adding to local inventory...
    2012-12-14 16:58:03,519 INFO  [WorkerThread#0[127.0.0.1:48685]] (rhq.core.pc.inventory.InventoryManager)- Deleted resource #[10001] - this will trigger a server scan now
    2012-12-14 16:58:05,517 INFO  [InventoryManager.discovery-1] (rhq.core.pc.inventory.InventoryManager)- Got unknown resource: 10001



Expected results:
The newly imported resource should be synced in a STOPPED state and all other resources should not be impacted. Most specifically, the platform resource 10001 should not get deleted.

Additional info:
I am not sure what is causing this but it appears to have something to do with InventoryManager.purgeObsoleteResources. The reason for "Cannot start component for Resource[id=10003..." is because of the following exception being thrown when attempting to connect to profile service:

    2012-12-14 16:57:58,784 DEBUG [WorkerThread#0[127.0.0.1:48685]] (rhq.core.pc.inventory.ResourceContainer$ResourceComponentInvocationHandler)- Call to [org.rhq.plugins.jbossas5.ApplicationServerComponent.start()] with args [[org.rhq.core.pluginapi.inventory.ResourceContext@26c472b2]] failed.
    java.util.concurrent.ExecutionException: org.rhq.core.pluginapi.inventory.InvalidPluginConfigurationException: Values of 'principal' and/or 'credentials' connection properties are invalid.
        at java.util.concurrent.FutureTask$Sync.innerGet(FutureTask.java:262)
        at java.util.concurrent.FutureTask.get(FutureTask.java:119)
        at org.rhq.core.pc.inventory.ResourceContainer$ResourceComponentInvocationHandler.invokeInNewThreadWithLock(ResourceContainer.java:554)
        at org.rhq.core.pc.inventory.ResourceContainer$ResourceComponentInvocationHandler.invoke(ResourceContainer.java:542)
        at $Proxy42.start(Unknown Source)
        at org.rhq.core.pc.inventory.InventoryManager.activateResource(InventoryManager.java:1733)
        at org.rhq.core.pc.inventory.InventoryManager.refreshResourceComponentState(InventoryManager.java:3012)
        at org.rhq.core.pc.inventory.InventoryManager.mergeResource(InventoryManager.java:2912)
        at org.rhq.core.pc.inventory.InventoryManager.mergeModifiedResources(InventoryManager.java:2779)
        at org.rhq.core.pc.inventory.InventoryManager.synchInventory(InventoryManager.java:1115)
        at org.rhq.core.pc.inventory.InventoryManager.synchronizeInventory(InventoryManager.java:2164)
        ...

Comment 1 Larry O'Leary 2012-12-15 00:04:25 UTC

Created attachment 663870
Excerpt from agent log showing freshly imported resource with bad config

Comment 2 Charles Crouch 2012-12-17 16:36:44 UTC

Has this behaviour changed in any recent releases? Targeting at 312 for triage

Comment 3 John Mazzitelli 2012-12-17 20:59:09 UTC

I see this on JON 3.1.2 ER5. Note that I'm not sure how bad this is - it seems to happen only if the plugin config is bad on a newly imported server. So, yes, a full sync is requested, but I assume that once the resource is in inventory, it won't keep happening. I'll double-check that. However, even if it DOES keep happening, why keep a resource in inventory with a bad plugin config anyway? You'll want to fix that - and once you get a good connection and the resource can be managed successfully, things should be fine.

Comment 4 John Mazzitelli 2012-12-17 21:01:24 UTC

Note, I restarted the agent, put in breakpoints at appropriate locations and looked at the logs - I do not see this happening again. So it looks like this might only happen when you first import the resource. Even if I keep the plugin config invalid (the resource still shows red), I don't see this resync happen on restart.

Comment 5 Larry O'Leary 2012-12-17 21:13:33 UTC

The problem with this is that resources are imported into inventory on a continuous basis, and each import results in a complete re-sync of agent inventory. During the re-sync, other operations will fail. Additionally, the re-sync causes templates to be lost, as indicated in Bug 884593.

Don't think of this as the happy path of importing a single resource. Think of this in the sense of a production environment where things are automated and handled by remote API calls on a batch of resources or agents. Configuration gets applied to newly imported resources after they have been imported. Additionally, one cannot control whether the EAP instance is up or down when the resource is picked up from a discovery queue via such automation.

Comment 6 John Mazzitelli 2012-12-18 16:45:29 UTC

This looks like bad behavior that has been in the code for a long time. And it is not related to the resource not being able to be started - I see this happen if you just import the platform (to get the initial inventory) and then import any single server resource (I tried with the RHQ Agent resource itself and saw it happen). The entire agent-side inventory is cleared out no matter what is committed.

Starting from the UI code (ResourceGWTServiceImpl.importResources), we can trace the call chain pretty easily down into the remote agent call into InventoryManager.synchronizeInventory and finally into InventoryManager.purgeObsoleteResources:

DiscoveryBossBean.importResources
DBB.checkStatus
DBB.updateInventoryStatus
DBB.scheduleAgentInventoryOperationJob
...quartz job triggers...
DBB.updateAgentInventoryStatus(String,String)
DBB.updateAgentInventoryStatus(List,List)
...remote call into agent...
InventoryManager.synchronizeInventory
...
InventoryManager.purgeObsoleteResources

Notice that from the very top of that call chain (that is, from the UI on down), only a single resource is passed around: the resource being committed. But the last method listed above (IM.purgeObsoleteResources) apparently assumes its argument "Set<String> allUuids" contains all UUIDs from a full sync report (that is, every UUID the server actually has in inventory). When you manually commit a single resource, that is not the case: allUuids does contain all the UUIDs from the sync report, but that report has only a single resource in it! As a result, the purge method removes all resources from inventory, because it removes the platform resource too (it isn't in allUuids). The agent then corrects itself when it re-syncs with the server.
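To make the failure mode concrete, here is a minimal, self-contained sketch of the behavior described above. It is not the RHQ code and not the attached patch: the class PurgeSketch, the map-based inventory, the UUID strings, and the fullSyncReport flag are all hypothetical names made up for illustration. The first method treats every local UUID missing from the report as obsolete; the second shows one possible guard, assuming the caller can tell whether the report came from a full sync.

```java
import java.util.HashSet;
import java.util.LinkedHashMap;
import java.util.Map;
import java.util.Set;

// Illustrative only: hypothetical stand-ins, not the real InventoryManager.
class PurgeSketch {

    // Hypothetical agent-side inventory: uuid -> resource name.
    static Map<String, String> localInventory() {
        Map<String, String> inventory = new LinkedHashMap<>();
        inventory.put("uuid-platform", "Platform");
        inventory.put("uuid-agent", "RHQ Agent");
        inventory.put("uuid-eap", "EAP Server");
        return inventory;
    }

    // Behavior as described in this comment: anything not listed in the
    // report is considered obsolete and removed. Returns the removed UUIDs.
    static Set<String> purgeObsolete(Map<String, String> inventory, Set<String> reportedUuids) {
        Set<String> removed = new HashSet<>();
        inventory.keySet().removeIf(uuid -> {
            boolean obsolete = !reportedUuids.contains(uuid);
            if (obsolete) {
                removed.add(uuid);
            }
            return obsolete;
        });
        return removed;
    }

    // One possible guard (an assumption, not the actual fix): skip the purge
    // unless the report is known to describe the server's whole inventory.
    static Set<String> purgeObsoleteGuarded(Map<String, String> inventory,
                                            Set<String> reportedUuids,
                                            boolean fullSyncReport) {
        if (!fullSyncReport) {
            return new HashSet<>();
        }
        return purgeObsolete(inventory, reportedUuids);
    }

    public static void main(String[] args) {
        // A partial report that holds only the resource being committed.
        Set<String> partialReport = new HashSet<>();
        partialReport.add("uuid-eap");

        Map<String, String> unguarded = localInventory();
        System.out.println("unguarded purge removed: " + purgeObsolete(unguarded, partialReport));

        Map<String, String> guarded = localInventory();
        System.out.println("guarded purge removed: "
                + purgeObsoleteGuarded(guarded, partialReport, false));
    }
}
```

With the partial report, the unguarded version drops the platform entry along with everything else that is not the committed resource, which matches the "Deleted resource #[10001]" line in the agent log. The guarded version leaves the inventory alone.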

Comment 7 John Mazzitelli 2012-12-18 22:30:07 UTC

I think I have a patch for this. I'll check it in after some more brief testing, but an initial test shows it working correctly.

Comment 8 John Mazzitelli 2012-12-18 22:36:15 UTC

Created attachment 665806
proposed fix for the problem

Attaching proposed patch to fix the issue.

Comment 9 John Mazzitelli 2012-12-19 17:40:13 UTC

git commit to master: d5564f3562ee960115cc533f029521000c870f45

Comment 10 John Mazzitelli 2012-12-19 17:43:28 UTC

Note that the bug would also have occurred whenever you ignore or unignore a resource from the discovery queue (in addition to committing a resource). I adjusted the title of this bugzilla issue to reflect that.

Comment 12 John Mazzitelli 2013-01-04 15:35:45 UTC

This is only in master, not in any other branch. I'm not sure what status this issue should be in, but the issue is thought to be fixed with the commit to master and should be QA'ed.

Comment 13 Larry O'Leary 2013-01-07 20:15:33 UTC

This missed the JBoss ON 3.1.2 code-freeze/cut-off so is being moved to 3.2.

Comment 14 Larry O'Leary 2013-01-28 19:52:24 UTC

Committed to master: http://git.fedorahosted.org/cgit/rhq/rhq.git/diff/?id=d5564f3562ee960115cc533f029521000c870f45


commit d5564f3562ee960115cc533f029521000c870f45
Author: John Mazzitelli <mazz>
Date:   Wed Dec 19 12:39:24 2012 -0500

    [BZ 887411] don't uninventory everything just because we commited some top level server

Comment 15 Larry O'Leary 2013-09-06 14:31:19 UTC

As this is MODIFIED or ON_QA, setting milestone to ER1.

Comment 16 Filip Brychta 2013-10-30 15:48:38 UTC

Verified on
Version: 3.2.0.ER4
Build Number: e413566:057b211
