Red Hat Bugzilla – Bug 534255
be more fault tolerant when failing to download plugins
Last modified: 2013-04-30 19:32:03 EDT
When the agent starts up, it attempts to download its plugins. If it fails to download one or more plugins, it continues on. This probably shouldn't be the case. If, during startup, the agent fails to pull down a plugin, the agent should try to get it later.
The case where I have seen this happen is when the agent comes up but some servers in the cloud are down or, worse, go down during the download. The agent will attempt to switch over to another server, but when that happens the remote stream becomes invalid (a remote stream is only valid for the server where it originated). As soon as the switchover happens, the agent gets a remote stream error and the plugin fails to download. In this case, perhaps the agent should retry the download of that plugin - that is, be fault tolerant of the case where the agent switched over to another server under the covers.
If we don't fix this, an agent could have an incomplete set of plugins and may fail to start properly (if the plugin it failed to get was the platform plugin, the agent will certainly be dead in the water).
I would like to get this fixed in 1.2.
The agent will now attempt several times to download a plugin (sleeping for a bit between retries). Only if it fails multiple times will the agent give up.
This might be tough to test because of the timing, but here is what to try to verify that it works:
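For illustration only, the retry behavior described above could be sketched roughly as follows. This is not the agent's actual code (the agent is not written in Python): the names `download_with_retry` and `DownloadError`, the attempt count, and the pause length are all assumptions made up for this sketch.

```python
import time


class DownloadError(Exception):
    """Hypothetical error standing in for a failed plugin download."""


def download_with_retry(download, plugin_name, max_attempts=3,
                        pause_seconds=1.0, sleep=time.sleep):
    """Call download(plugin_name) until it succeeds or attempts run out.

    `download` is any callable that returns the plugin content or raises
    DownloadError (for example, after a remote stream became invalid
    because the agent switched over to another server).
    """
    last_error = None
    for attempt in range(1, max_attempts + 1):
        try:
            return download(plugin_name)
        except DownloadError as error:
            last_error = error
            if attempt < max_attempts:
                # Warn, sleep for a bit, then try again.
                print(f"WARN: download of {plugin_name} failed "
                      f"(attempt {attempt}), will retry: {error}")
                sleep(pause_seconds)
    # Only after failing multiple times does the caller see an error.
    raise DownloadError(
        f"giving up on {plugin_name} after {max_attempts} attempts"
    ) from last_error
```

The `sleep` parameter exists only so the pause can be replaced in a test; a real implementation would pick its own retry count and delay.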
1) have 1 or 2 servers in the cloud
2) start the agent
3) while the agent is downloading plugins, kill 1 or both servers
4) after a minute, restart the server(s)
5) in the agent log, you should see the agent successfully download all plugins after some warnings about needing to retry
It's tough because you have to kill the servers in step 3 at the exact moment the agent is downloading the plugins. Perhaps you could deploy a really fat plugin: create a temporary plugin with a very minimal, but valid, rhq-plugin.xml, but put very large files inside the plugin .jar so that the jar is 100MB or more. That way it takes the agent a while to download it, which gives the tester enough time to see the download start and to kill the server in the middle of the download.
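One possible way to build such an oversized plugin jar is sketched below. It makes several assumptions: the function name `build_fat_plugin` is made up for this example, the descriptor is assumed to live at `META-INF/rhq-plugin.xml` inside the jar, and the descriptor text itself must be supplied by the tester (a minimal valid one is not shown here).

```python
import os
import zipfile


def build_fat_plugin(jar_path, descriptor_xml, padding_mb=100, chunk_mb=10):
    """Write a jar with the given descriptor plus padding_mb of filler data.

    Returns the size of the resulting jar in bytes.
    """
    with zipfile.ZipFile(jar_path, "w", compression=zipfile.ZIP_STORED) as jar:
        # Assumption: the descriptor is expected at this path in the jar.
        jar.writestr("META-INF/rhq-plugin.xml", descriptor_xml)
        remaining = padding_mb
        index = 0
        while remaining > 0:
            size_mb = min(chunk_mb, remaining)
            # Random bytes, stored uncompressed, so the jar stays large.
            jar.writestr(f"padding/filler-{index}.bin",
                         os.urandom(size_mb * 1024 * 1024))
            remaining -= size_mb
            index += 1
    return os.path.getsize(jar_path)
```

Calling `build_fat_plugin("fat-plugin.jar", descriptor_text, padding_mb=120)` with a valid descriptor string would produce a jar of a little over 120MB. The filler is written in chunks so that only `chunk_mb` megabytes are held in memory at a time.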
I prepared one 120MB plugin to deploy to the agent. I stopped the JON server 3 times during the deployment and the agent recovered and downloaded all the plugins.
RHEL 5.3, x86_64, PostgreSQL 8.2.4, JON RHQ SVN rev# 2894
This bug was previously known as http://jira.rhq-project.org/browse/RHQ-1069
This bug is related to RHQ-974
This bug relates to RHQ-1090