Description of problem:
Although it is not clear how this state can be created, we have seen an issue where a plug-in's resource type is left in the `DELETED` state and is not purged because a resource is still using it.
In the identified case, the following resource type is in a `DELETED` state in the `RHQ_RESOURCE_TYPE` table while its plug-in is still installed and active:
10052,Tomcat,4.4.0,86fe2d1c8147150c1efafa3d56cbdaac,1,INSTALLED,16-Oct-2012 11:47:47,10580,Tomcat Server,1
In this case, `RESOURCE_TYPE_DELETED` is set to 1, indicating the type has been deleted. This is also the ONLY resource type in the `RHQ_RESOURCE_TYPE` table for this `Tomcat` plug-in, which tells us that all the other resource types from the `Tomcat` plug-in were properly removed. This one remains because a resource is still using it, as seen in the `RHQ_RESOURCE` table:
14733,Tomcat (8080),/test/prop/tomcat/apache-tomcat-5.5.26,COMMITTED,10001,my.test.server.com,10580,Tomcat Server,1
In this case, the resource is still in the COMMITTED inventory state and does not appear to have been marked for purge.
I see multiple issues/questions from this one issue:
1) How did this state occur in the first place? Speculation is that this happened due to a plug-in being deleted and re-added before the deletion finished. Although the UI warns against this, it would seem that we should either prevent it or queue the install request until the purge is complete.
2) The plug-in is still installed and was neither deleted/purged nor expected to be, which suggests that a plug-in upgrade or re-install went wrong. In either case, you would expect that when the plug-in is loaded on subsequent plug-in update checks, we would re-install any of its types that are missing. This has already been reported in a similar case identified in Bug 857189. The end result is that all agents are stuck in a reboot loop due to the difference in resource types between an agent and the server.
3) When the agent is in the reboot loop, nothing tells the user why. Instead, we simply log the following very unhelpful error message:
ERROR [InventoryManager.discovery-1] (rhq.core.pc.inventory.InventoryManager)- Failed to merge inventory report with server. The report contains one or more resource types that have been marked for deletion. Notifying the plugin container that a reboot is needed to purge stale types.
After which point the agent reboots, spits out the same ERROR message, and reboots again, and again...
At minimum, you would expect to see what these so-called types were, or perhaps which resource caused the issue. This appears to have been captured in upstream Bug 869015.
Version-Release number of selected component (if applicable):
After some investigation and insights from Larry, I managed to come up with steps to reproduce this issue.
1) Install RHQ server along with plugins.
2) Start the agent (can be running on the same machine). Just be sure to have postgres running on the same machine as the agent.
3) Wait for agent to send initial inventory report and then import resources into inventory.
4) Log in and go to the agent plugins page in the admin section.
5) Select and delete the PostgreSQL plugin.
6) Purge the PostgreSQL plugin. Note that you first need to click the 'Show Deleted' button in order to enable the purge button.
7) Upload the PostgreSQL plugin.
8) Click 'Scan For Updates' to re-install the plugin.
9) Go to the agent prompt and run the 'update plugins' command.
Because this is partly a timing issue, you need to execute steps 6 - 8 in quick succession. After step 9 the agent will reboot the plugin container and subsequently run a full discovery scan. When the inventory report is sent to the server, you should see a message like the following in your server log:
2012-11-08 16:28:16,485 INFO [org.rhq.enterprise.server.discovery.DiscoveryBossBean] The inventory report from Agent[id=0,name=192.168.1.12,address=null,port=0,remote-endpoint=null,last-availability-ping=null,last-availability-report=null] contains these deleted resource types [ResourceType[id=0, name=Postgres Server, plugin=Postgres, category=Server]]
Then in the agent log shortly thereafter you should see:
2012-11-08 16:28:40,262 INFO [Plugin Container Reboot Thread] (org.rhq.core.pc.PluginContainer)- Initializing Plugin Container v4.6.0-SNAPSHOT...
Now let me explain what is going on. When you delete a plugin, that initiates a workflow of cascading deletes using a mark-and-sweep strategy. First, all resources for all types provided by the plugin are marked for deletion. Then all resource types declared by the plugin are marked for deletion. Next the plugin itself is marked as deleted.
There is a quartz job that runs periodically and will delete from the database all resources that are marked for deletion. There is another similar job that does the same thing for resource types. The resource type purge job though will only purge a resource type when it no longer has any resources in the database.
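To make the ordering concrete, here is a minimal Python sketch of the mark-and-sweep flow described above. It is a toy model, not RHQ code: `ServerModel`, `Resource`, `ResourceType`, and the method names are hypothetical stand-ins for the database tables and the quartz jobs.

```python
from dataclasses import dataclass


@dataclass
class ResourceType:
    name: str
    plugin: str
    deleted: bool = False


@dataclass
class Resource:
    name: str
    type: ResourceType
    marked_for_deletion: bool = False


class ServerModel:
    """Toy stand-in for the server database; not RHQ code."""

    def __init__(self):
        self.deleted_plugins = set()
        self.types = []
        self.resources = []

    def delete_plugin(self, plugin):
        # Mark phase: resources first, then types, then the plugin itself.
        for res in self.resources:
            if res.type.plugin == plugin:
                res.marked_for_deletion = True
        for rtype in self.types:
            if rtype.plugin == plugin:
                rtype.deleted = True
        self.deleted_plugins.add(plugin)

    def purge_resources_job(self):
        # Sweep phase for resources.
        self.resources = [r for r in self.resources if not r.marked_for_deletion]

    def purge_resource_types_job(self):
        # Sweep phase for types: a deleted type is removed only once
        # no resource in the database refers to it any more.
        in_use = {id(r.type) for r in self.resources}
        self.types = [
            t for t in self.types if not (t.deleted and id(t) not in in_use)
        ]
```

In this model, running `purge_resource_types_job()` before `purge_resources_job()` leaves the deleted type in place, which mirrors the rule that a type is only purged when it has no resources left.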
The key to reproducing this issue is to purge and re-install the postgres plugin before those quartz jobs run. This should be fairly easy to accomplish as I believe they only run every 5 minutes.
When you purge and re-install the plugin (assuming those quartz jobs have not run since you executed those steps), the metadata for the plugin, specifically the rows in the resource type table, still exists. Re-installing the plugin results in a call to PluginManagerBean. PluginManagerBean calls ResourceMetadataManagerBean.updateTypes. Because the resource types still exist in the database, albeit with a status of deleted, ResourceMetadataManagerBean.mergeExistingType is called. This method updates all of the metadata like operation definitions, metrics, resource configuration, etc. It also updates properties of the ResourceType itself. It does not, however, update the deleted property.
Remember at this point the agent has downloaded the postgres plugin again but all of the resource types declared by the plugin are still marked for deletion. When those purge jobs run, the resource types will get removed from the database and the next time the agent sends an inventory report to the server, we get the server log message shown above. When DiscoveryBossBean processes an inventory report, it looks up in the database each resource type specified in the report. Since the type (or types) do not exist, the server assumes that the agent has a plugin that is not installed on the server and those resources in the report are not merged into the server inventory. We wind up with the plugin installed on the server but with none of its resource types. And the agent ends up with resources in its local inventory that will not get merged into the server's inventory.
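The server-side check described above can be sketched roughly as follows. This is a simplified Python model, not the DiscoveryBossBean implementation: the function names and data shapes are hypothetical, and rejecting the whole report when any type is stale is an assumption based on the agent's "Failed to merge inventory report" message.

```python
def find_stale_types(server_types, report):
    """Return type names in the report that the server cannot accept.

    server_types: dict mapping type name -> deleted flag (hypothetical shape).
    report: list of (resource_name, type_name) pairs sent by an agent.
    """
    stale = []
    for _resource_name, type_name in report:
        deleted = server_types.get(type_name)  # None means the row is gone
        if (deleted is None or deleted) and type_name not in stale:
            stale.append(type_name)
    return stale


def merge_inventory_report(server_types, server_inventory, report):
    """Merge the report only when every type is known and not deleted."""
    stale = find_stale_types(server_types, report)
    if stale:
        return stale  # nothing is merged; the caller reports the stale types
    server_inventory.extend(report)
    return []
```

Returning the list of stale type names, rather than a bare failure, is also the kind of detail that item 3 of the description asks the agent log to include.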
As for the referenced support case, when the customer installed a newer version of the tomcat plugin, it may have introduced a new version of the Tomcat Server type such that that resource type did not have its deleted flag set, but all of the other types in the tomcat plugin were still deleted and eventually purged.
Fortunately the solution is much less complex than the problem. When a plugin is deleted and purged, we cannot allow for that plugin (or a newer version of it) to be re-installed until all of the data associated with that plugin has been purged from the database.
After further investigation, it turns out that my lengthy explanation in comment 2 is not entirely correct. Here is the root cause. When a plugin is deleted, the plugin, all of its resource types, and all resources for those types are marked for deletion. There are separate quartz jobs that run periodically to purge resources, resource types, and plugins. It was possible for a plugin to get purged before all of its resource types.
As soon as the plugin is purged (i.e., deleted from the rhq_plugin table) a user can re-install the plugin. If the plugin is re-installed before its resource types are purged, some or all of those resource types that get re-installed/updated will still wind up getting purged. This is where things get problematic because agents will have those types; however, any resources of those types will never get merged into the server's inventory because the types do not exist on the server.
I have committed a fix that makes sure a plugin is not purged until all of its resource types are purged.
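The race and the guard can be illustrated with a small Python model. This is only a sketch of the idea, not the committed change: `PluginStore`, its methods, and the `wait_for_types` switch are hypothetical, and the assumption that a re-installed type keeps its deleted flag follows the description above.

```python
class PluginStore:
    """Toy model of the plugin and resource-type tables; not RHQ code."""

    def __init__(self):
        self.plugins = {}  # plugin name -> deleted flag
        self.types = {}    # (plugin, type name) -> deleted flag

    def install(self, plugin, type_names):
        # Re-install is only possible once the plugin row is gone.
        if plugin in self.plugins:
            raise RuntimeError(f"plugin {plugin} has not been purged yet")
        self.plugins[plugin] = False
        for name in type_names:
            # Existing rows are merged and keep their deleted flag.
            self.types.setdefault((plugin, name), False)

    def delete(self, plugin):
        self.plugins[plugin] = True
        for key in self.types:
            if key[0] == plugin:
                self.types[key] = True

    def purge_types_job(self):
        self.types = {k: d for k, d in self.types.items() if not d}

    def purge_plugins_job(self, wait_for_types=True):
        for plugin, deleted in list(self.plugins.items()):
            if not deleted:
                continue
            has_types = any(k[0] == plugin for k in self.types)
            if wait_for_types and has_types:
                continue  # the guard: try again on a later run
            del self.plugins[plugin]
```

With `wait_for_types=False`, deleting a plugin, purging it, re-installing it, and then running `purge_types_job()` leaves the plugin installed with no types. With the guard enabled, `install` is refused until the type purge has run, so the re-installed types start out clean.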
master commit hash: 9e0684aff
Changes have been pushed to the release/jon3.1.x branch
release/jon3.1.x commit hash: a3e8b5d80f0fb
Testing this will be a little difficult due to the timing issues involved. I have tested with the postgres plugin since it involves a lot of resources in inventory. The AS 7 plugin would be another good candidate. You will want to have debug logging turned on for the server. You want to watch for debug messages from PurgeResourceTypesJob and from PurgePluginsJob. PurgePluginsJob will log a message that looks like:
Preparing to purge plugin [Postgres]
After you see that, make sure that all resource types have been deleted from the database. Use this query:
select id, name, plugin, deleted
from rhq_resource_type
where deleted = true and plugin = 'Postgres'
That query should return an empty result set. Once the plugin is purged, re-install it. Run the plugins update command on the agent and verify that the postgres resources get merged back into inventory.
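For repeated test runs, the check can be scripted. The Python sketch below runs an equivalent query against an in-memory SQLite table purely for illustration: the four-column table is a simplified, assumed stand-in for the real schema, `deleted_types` is a hypothetical helper, and `deleted = 1` is used in place of `deleted = true` for SQLite compatibility.

```python
import sqlite3


def deleted_types(conn, plugin):
    """Return (id, name) rows for types of `plugin` still marked deleted."""
    cur = conn.execute(
        "select id, name from rhq_resource_type "
        "where deleted = 1 and plugin = ?",
        (plugin,),
    )
    return cur.fetchall()


# Simplified stand-in table; the real schema has more columns.
conn = sqlite3.connect(":memory:")
conn.execute(
    "create table rhq_resource_type "
    "(id integer, name text, plugin text, deleted integer)"
)
# Made-up row representing a type that is still waiting to be purged.
conn.execute(
    "insert into rhq_resource_type values (1, 'Postgres Server', 'Postgres', 1)"
)
```

An empty list from `deleted_types(conn, "Postgres")` corresponds to the empty result set that signals it is safe to re-install the plugin.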
*** Bug 848868 has been marked as a duplicate of this bug. ***
Moving to ON_QA as available for test with build: https://brewweb.devel.redhat.com//buildinfo?buildID=244662.
Verified on 3.1.2.ER2 build using the JBossAS7 plugin. Deleted and purged the plugin, re-installed it, and verified that the resources get merged back into inventory after running the plugins update command on the agent.