Description of problem:
If a new resource is imported and contains an invalid configuration that prevents the resource from being started, all resources are marked as obsolete resulting in a complete inventory purge and re-sync. Not only is this time consuming and CPU intensive on the agent, it continues to do this for each resource that fails to start due to an invalid configuration.

Version-Release number of selected component (if applicable):
4.4.0.JON311GA

How reproducible:
Always

Steps to Reproduce:
1. Disable admin account for EAP so it can not be discovered
2. Start EAP server
3. Start JBoss ON system (server and agent)
4. Import RHQ Agent, RHQ Server, and Platform from discovery queue
5. Wait for imported resources to become available in ON UI
6. Import EAP server from discovery queue

Actual results:
Agent temporarily drops all inventory and the following messages appear in the agent log:

2012-12-14 16:57:58,809 WARN [WorkerThread#0[127.0.0.1:48685]] (rhq.core.pc.inventory.InventoryManager)- Cannot start component for Resource[id=10003, uuid=6b3182f2-2160-458f-ae69-36a1e1241efb, type={JBossAS5}JBossAS Server, key=/home/loleary/workspace/Cases/00742702/test-env/jboss-eap-5.1.2/jboss-as/server/all, name=EAP loleary:1099 all, parent=localhost.localdomain, version=EAP 5.1.2] from synchronized merge due to invalid plugin config: Failed to start component for resource Resource[id=10003, uuid=6b3182f2-2160-458f-ae69-36a1e1241efb, type={JBossAS5}JBossAS Server, key=/home/loleary/workspace/Cases/00742702/test-env/jboss-eap-5.1.2/jboss-as/server/all, name=EAP loleary:1099 all, parent=localhost.localdomain, version=EAP 5.1.2].
2012-12-14 16:58:03,519 INFO [WorkerThread#0[127.0.0.1:48685]] (rhq.core.pc.inventory.InventoryManager)- Detected new Platform [Resource[id=0, uuid=734bc6ba-c3b7-4af8-9e87-97acd5097e4c, type={Platforms}Linux, key=localhost.localdomain, name=localhost.localdomain, parent=<null>, version=Linux 2.6.35.14-106.fc14.x86_64]] - adding to local inventory...
2012-12-14 16:58:03,519 INFO [WorkerThread#0[127.0.0.1:48685]] (rhq.core.pc.inventory.InventoryManager)- Deleted resource #[10001] - this will trigger a server scan now
2012-12-14 16:58:05,517 INFO [InventoryManager.discovery-1] (rhq.core.pc.inventory.InventoryManager)- Got unknown resource: 10001

Expected results:
The newly imported resource should be synced in a STOPPED state and all other resources should not be impacted. Most specifically, the platform resource 10001 should not get deleted.

Additional info:
I am not sure what is causing this but it appears to have something to do with InventoryManager.purgeObsoleteResources.

The reason for "Cannot start component for Resource[id=10003..." is because of the following exception being thrown when attempting to connect to profile service:

2012-12-14 16:57:58,784 DEBUG [WorkerThread#0[127.0.0.1:48685]] (rhq.core.pc.inventory.ResourceContainer$ResourceComponentInvocationHandler)- Call to [org.rhq.plugins.jbossas5.ApplicationServerComponent.start()] with args [[org.rhq.core.pluginapi.inventory.ResourceContext@26c472b2]] failed.
java.util.concurrent.ExecutionException: org.rhq.core.pluginapi.inventory.InvalidPluginConfigurationException: Values of 'principal' and/or 'credentials' connection properties are invalid.
    at java.util.concurrent.FutureTask$Sync.innerGet(FutureTask.java:262)
    at java.util.concurrent.FutureTask.get(FutureTask.java:119)
    at org.rhq.core.pc.inventory.ResourceContainer$ResourceComponentInvocationHandler.invokeInNewThreadWithLock(ResourceContainer.java:554)
    at org.rhq.core.pc.inventory.ResourceContainer$ResourceComponentInvocationHandler.invoke(ResourceContainer.java:542)
    at $Proxy42.start(Unknown Source)
    at org.rhq.core.pc.inventory.InventoryManager.activateResource(InventoryManager.java:1733)
    at org.rhq.core.pc.inventory.InventoryManager.refreshResourceComponentState(InventoryManager.java:3012)
    at org.rhq.core.pc.inventory.InventoryManager.mergeResource(InventoryManager.java:2912)
    at org.rhq.core.pc.inventory.InventoryManager.mergeModifiedResources(InventoryManager.java:2779)
    at org.rhq.core.pc.inventory.InventoryManager.synchInventory(InventoryManager.java:1115)
    at org.rhq.core.pc.inventory.InventoryManager.synchronizeInventory(InventoryManager.java:2164)
    ...
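For readers unfamiliar with why the trace shows an ExecutionException wrapping the plugin's exception: the component's start() runs on another thread and its result is fetched through Future.get(), which wraps any failure. The following is only a minimal sketch of that pattern and of the expected "leave it STOPPED" outcome. StartSketch, InvalidConfigException, State and tryStart are hypothetical names for illustration, not RHQ classes or methods.

```java
import java.util.concurrent.Callable;
import java.util.concurrent.ExecutionException;
import java.util.concurrent.ExecutorService;
import java.util.concurrent.Executors;
import java.util.concurrent.Future;

public class StartSketch {

    /** Hypothetical stand-in for the plugin API's InvalidPluginConfigurationException. */
    public static class InvalidConfigException extends RuntimeException {
        public InvalidConfigException(String message) {
            super(message);
        }
    }

    /** Hypothetical component states; only the two needed for the illustration. */
    public enum State { STARTED, STOPPED }

    /**
     * Runs the start call on a separate thread and waits for it.
     * A bad configuration surfaces as ExecutionException with the real
     * exception as its cause; in that case the resource is reported as
     * STOPPED and nothing else is touched.
     */
    public static State tryStart(Callable<Void> start) throws Exception {
        ExecutorService executor = Executors.newSingleThreadExecutor();
        try {
            Future<Void> result = executor.submit(start);
            result.get(); // failures from the worker thread arrive wrapped
            return State.STARTED;
        } catch (ExecutionException e) {
            if (e.getCause() instanceof InvalidConfigException) {
                return State.STOPPED; // keep the resource, just do not start it
            }
            throw e; // anything else is not handled by this sketch
        } finally {
            executor.shutdownNow();
        }
    }
}
```

Under these assumptions, a start call that throws InvalidConfigException yields State.STOPPED and has no effect on any other resource, which is the behavior requested under "Expected results".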
Created attachment 663870 [details] Excerpt from agent log showing freshly imported resource with bad config
Has this behaviour changed in any recent releases? Targeting at 312 for triage
I see this on JON 3.1.2 ER5. Note that I'm not sure how bad this is - it seems to happen only if the plugin config is bad on a newly imported server. So, yes, a full sync is requested, but I assume once the resource is in inventory, it won't keep happening. I'll double check that. However, even if it DOES keep happening, why keep a resource in inventory with bad plugin config anyway? You'll want to fix that - and once you get a good connection and the resource can be managed successfully, things should be fine.
Note, I restarted the agent, put in breakpoints at appropriate locations and looked at the logs - I do not see this happening again. So it looks like this might only happen when you first import the resource. Even if I keep the plugin config invalid (the resource still shows red), I don't see this resync happen on restart.
The problem with this is that resources are imported into inventory on a continuous basis, each time resulting in a complete re-sync of agent inventory. During the re-sync, other operations will fail. Additionally, the re-sync causes templates to be lost as indicated in Bug 884593. Don't think of this as the happy path of importing a single resource. Think of this in the sense of a production environment where things are automated and handled by remote API calls on a batch of resources or agents. Configuration gets applied to newly imported resources after they have been imported. Additionally, one cannot control whether the EAP instance is up or down when the resource is picked up from a discovery queue via such automation.
This looks like it's bad behavior and has been in the code for a long time. And it is not related to the resource not being able to be started - I see this happen if you just import the platform (to get the initial inventory) and then import any single server resource (I tried with the RHQ Agent resource itself and see it happen). The entire agent side inventory is cleared out no matter what is committed.

Starting from the UI code (ResourceGWTServiceImpl.importResources), we can trace the call chain pretty easily down into the remote agent call into InventoryManager.synchronizeInventory and finally into InventoryManager.purgeObsoleteResources:

    DiscoveryBossBean.importResources
    DBB.checkStatus
    DBB.updateInventoryStatus
    DBB.scheduleAgentInventoryOperationJob
    ...quartz job triggers...
    DBB.updateAgentInventoryStatus(String,String)
    DBB.updateAgentInventoryStatus(List,List)
    ...remote call into agent...
    InventoryManager.synchronizeInventory
    ...
    InventoryManager.purgeObsoleteResources

You will notice that from the very top of that call chain (that is, from the UI on down), only a single resource is passed around (the resource being committed). But when you get to the last method listed above (IM.purgeObsoleteResources), that method apparently assumes its argument "Set<String> allUuids" contains all UUIDs from a full sync report (that is, all the UUIDs that the server actually has in inventory). When you manually commit a single resource, that's not the case - allUuids contains all the UUIDs from the sync report alright, but that report only has a single resource in it! So in the end, the purge method removes all resources from inventory (because it ends up removing the platform resource, too, since it isn't in allUuids). After this, it corrects itself when the agent re-syncs with the server.
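To make the failure mode above concrete, here is a minimal, self-contained sketch. It is not the RHQ InventoryManager code: PurgeSketch, its methods and the sample UUID strings are all hypothetical, and the "guard" shown is just one possible way to avoid the problem, not necessarily what the eventual fix does. The only thing taken from the analysis is the idea that a purge which treats a partial UUID set as the complete server inventory will drop the platform, and with it everything below it.

```java
import java.util.Collections;
import java.util.HashSet;
import java.util.LinkedHashMap;
import java.util.Map;
import java.util.Set;

/** Hypothetical stand-in for the agent-side inventory; not an RHQ class. */
public class PurgeSketch {

    /** Maps each resource UUID to its parent's UUID (null for the platform). */
    private final Map<String, String> parentByUuid = new LinkedHashMap<>();

    public void add(String uuid, String parentUuid) {
        parentByUuid.put(uuid, parentUuid);
    }

    public Set<String> uuids() {
        return new HashSet<>(parentByUuid.keySet());
    }

    /** Removes a resource and, recursively, every resource below it. */
    private void removeWithDescendants(String uuid) {
        parentByUuid.remove(uuid);
        for (String candidate : new HashSet<>(parentByUuid.keySet())) {
            if (uuid.equals(parentByUuid.get(candidate))) {
                removeWithDescendants(candidate);
            }
        }
    }

    /** Problematic variant: assumes reportedUuids is the server's full inventory. */
    public void purgeAssumingFullReport(Set<String> reportedUuids) {
        for (String uuid : new HashSet<>(parentByUuid.keySet())) {
            if (parentByUuid.containsKey(uuid) && !reportedUuids.contains(uuid)) {
                removeWithDescendants(uuid);
            }
        }
    }

    /** One possible guard (an assumption): purge only for a complete report. */
    public void purge(Set<String> reportedUuids, boolean fullReport) {
        if (fullReport) {
            purgeAssumingFullReport(reportedUuids);
        }
    }

    /** Sample inventory: a platform with two servers below it. */
    public static PurgeSketch sampleInventory() {
        PurgeSketch inventory = new PurgeSketch();
        inventory.add("platform", null);
        inventory.add("rhq-agent", "platform");
        inventory.add("eap", "platform");
        return inventory;
    }

    public static void main(String[] args) {
        // The report for a single committed resource contains only that resource.
        Set<String> partialReport = Collections.singleton("eap");

        PurgeSketch unguarded = sampleInventory();
        unguarded.purgeAssumingFullReport(partialReport);
        System.out.println("without guard: " + unguarded.uuids());

        PurgeSketch guarded = sampleInventory();
        guarded.purge(partialReport, false);
        System.out.println("with guard: " + guarded.uuids());
    }
}
```

In the unguarded run, "platform" is missing from the partial report, so it is removed together with its children, including the "eap" resource that was just committed. With the guard, a partial report leaves the local inventory untouched.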
I think I have a patch for this. I'll check it in after some more brief testing, but an initial test shows it working correctly.
Created attachment 665806 [details] proposed fix for the problem

Attaching proposed patch to fix the issue.
git commit to master: d5564f3562ee960115cc533f029521000c870f45
Note that the bug would also have occurred whenever you ignore or unignore a resource from the discovery queue (in addition to committing a resource). I adjusted the title of this bugzilla issue to reflect that.
This is only in master, not in any other branch. Not sure what status this issue should be in, but the issue is thought to be fixed with the commit to master and should be QA'ed.
This missed the JBoss ON 3.1.2 code-freeze/cut-off, so it is being moved to 3.2.
Committed to master: http://git.fedorahosted.org/cgit/rhq/rhq.git/diff/?id=d5564f3562ee960115cc533f029521000c870f45

commit d5564f3562ee960115cc533f029521000c870f45
Author: John Mazzitelli <mazz>
Date: Wed Dec 19 12:39:24 2012 -0500

    [BZ 887411] don't uninventory everything just because we commited some top level server
As this is MODIFIED or ON_QA, setting milestone to ER1.
Verified on Version: 3.2.0.ER4 Build Number: e413566:057b211