Bug 534255 (RHQ-1069)

Summary: be more fault tolerant when failing to download plugins
Product: [Other] RHQ Project
Reporter: John Mazzitelli <mazz>
Component: Agent
Assignee: John Mazzitelli <mazz>
Status: CLOSED NEXTRELEASE
QA Contact: Pavel Kralik <pkralik>
Severity: medium
Docs Contact:
Priority: high
Version: unspecified
CC: mvecera
Target Milestone: ---
Keywords: Improvement
Target Release: ---
Hardware: All
OS: All
URL: http://jira.rhq-project.org/browse/RHQ-1069
Whiteboard:
Fixed In Version: 1.2
Doc Type: Enhancement

Description John Mazzitelli 2008-11-06 13:54:00 UTC
When the agent starts up, it attempts to download its plugins.  If it fails to download one or more plugins, it continues on.  This probably shouldn't be the case.  If, during startup, the agent fails to pull down a plugin, the agent should try to get it later.

The case where I have seen this happen is when the agent comes up but some servers in the cloud are down or, worse, go down during the download.  The agent will attempt to switch over to another server, but when that happens the remote stream becomes invalid (a remote stream is only valid for the server from which it originated).  As soon as the switchover happens, the agent gets a remote stream error and the plugin fails to download.  In this case, perhaps the agent should retry pulling down that plugin - be fault tolerant of the case where the agent switched over to another server under the covers.

If we don't fix this, an agent could have an incomplete set of plugins and may fail to start properly (if the plugin it failed to get was the platform plugin, the agent will certainly be dead in the water).

Comment 1 John Mazzitelli 2008-11-11 17:44:58 UTC
I would like to get this fixed in 1.2.

Comment 2 John Mazzitelli 2008-11-15 17:11:35 UTC
The agent will now attempt several times to download a plugin (sleeping for a bit between retries). Only if it fails multiple times will the agent give up.
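
For illustration only, here is a minimal sketch of the retry-with-sleep behavior described in this comment. It is not the actual agent code: the class name, method name, attempt count, and sleep interval are all hypothetical, and the download itself is stubbed out with a Callable.

```java
import java.io.IOException;
import java.util.concurrent.Callable;

/** Hypothetical helper illustrating retry-with-sleep; not RHQ's agent code. */
class PluginDownloadRetry {

    /**
     * Runs the download attempt up to maxAttempts times, sleeping for
     * sleepMillis between failed attempts. Rethrows the last failure
     * once every attempt has failed.
     */
    static <T> T downloadWithRetry(Callable<T> attempt, int maxAttempts, long sleepMillis)
            throws Exception {
        if (maxAttempts < 1) {
            throw new IllegalArgumentException("maxAttempts must be at least 1");
        }
        Exception lastFailure = null;
        for (int i = 1; i <= maxAttempts; i++) {
            try {
                return attempt.call();
            } catch (InterruptedException e) {
                // Do not retry if the thread is being shut down.
                throw e;
            } catch (Exception e) {
                lastFailure = e;
                System.err.println("WARN: download attempt " + i + " of " + maxAttempts
                        + " failed: " + e.getMessage());
                if (i < maxAttempts) {
                    Thread.sleep(sleepMillis);
                }
            }
        }
        throw lastFailure;
    }

    public static void main(String[] args) throws Exception {
        // Simulated download that fails twice (as if the server went away
        // mid-stream) and then succeeds.
        int[] calls = {0};
        String result = downloadWithRetry(() -> {
            if (++calls[0] < 3) {
                throw new IOException("remote stream no longer valid");
            }
            return "platform-plugin.jar";
        }, 5, 100L);
        System.out.println("downloaded " + result + " after " + calls[0] + " attempts");
    }
}
```

In a setup like this, each attempt would need to open a fresh stream, since a stream obtained from the old server is no longer usable after a switchover.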

Comment 3 John Mazzitelli 2008-11-15 17:16:44 UTC
This might be tough to test due to the timing, but here's what to try to verify that it works:

1) have 1 or 2 servers in the cloud
2) start the agent
3) while the agent is downloading plugins, kill 1 or both servers
4) after a minute, restart the server(s)
5) in the agent log, you should see the agent successfully download all plugins after some warnings about needing to retry

It's tough because you have to kill the servers in step 3 at the exact moment the agent is downloading its plugins. Perhaps you could deploy a really fat plugin: create a temporary plugin with a very minimal, but valid, rhq-plugin.xml, but put very large files inside the plugin .jar so that the jar is 100MB or more. That way the download takes long enough to give the tester a chance to see it start and to kill the server in the middle of it.
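
One possible way to build such a fat plugin jar is sketched below. This is a hedged example, not part of RHQ: the class name and the filler.bin entry name are made up, the caller supplies their own minimal but valid descriptor file, and the META-INF/rhq-plugin.xml location inside the jar is an assumption. The filler is random data so that jar compression does not shrink it.

```java
import java.io.IOException;
import java.io.OutputStream;
import java.nio.file.Files;
import java.nio.file.Path;
import java.nio.file.Paths;
import java.util.Random;
import java.util.jar.JarEntry;
import java.util.jar.JarOutputStream;

/** Hypothetical test helper that pads a plugin jar with incompressible filler. */
class FatPluginBuilder {

    static void build(Path descriptor, Path outputJar, long fillerBytes) throws IOException {
        try (OutputStream out = Files.newOutputStream(outputJar);
             JarOutputStream jar = new JarOutputStream(out)) {
            // Assumed descriptor location inside the plugin jar.
            jar.putNextEntry(new JarEntry("META-INF/rhq-plugin.xml"));
            Files.copy(descriptor, jar);
            jar.closeEntry();

            // Random bytes do not compress, so the jar stays roughly fillerBytes large.
            jar.putNextEntry(new JarEntry("filler.bin"));
            byte[] buffer = new byte[1 << 20];
            Random random = new Random(42);
            long remaining = fillerBytes;
            while (remaining > 0) {
                random.nextBytes(buffer);
                int n = (int) Math.min(buffer.length, remaining);
                jar.write(buffer, 0, n);
                remaining -= n;
            }
            jar.closeEntry();
        }
    }

    public static void main(String[] args) throws IOException {
        if (args.length != 3) {
            System.err.println("usage: FatPluginBuilder <rhq-plugin.xml> <output.jar> <size-in-MB>");
            System.exit(1);
        }
        long fillerBytes = Long.parseLong(args[2]) * 1024L * 1024L;
        build(Paths.get(args[0]), Paths.get(args[1]), fillerBytes);
    }
}
```

For example, passing 120 as the size argument would produce a jar of roughly 120MB.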

Comment 4 Pavel Kralik 2009-02-04 18:12:57 UTC
I prepared one 120MB plugin to deploy to the agent. I stopped the JON server 3 times during the deployment and the agent recovered and downloaded all the plugins.

RHEL 5.3, x86_64, PostgreSQL 8.2.4, JON RHQ SVN rev# 2894

Comment 5 Red Hat Bugzilla 2009-11-10 20:23:13 UTC
This bug was previously known as http://jira.rhq-project.org/browse/RHQ-1069
This bug is related to RHQ-974
This bug relates to RHQ-1090