Bug 534255 (RHQ-1069) - be more fault tolerant when failing to download plugins
Summary: be more fault tolerant when failing to download plugins
Keywords:
Status: CLOSED NEXTRELEASE
Alias: RHQ-1069
Product: RHQ Project
Classification: Other
Component: Agent
Version: unspecified
Hardware: All
OS: All
Priority: high
Severity: medium
Target Milestone: ---
Target Release: ---
Assignee: John Mazzitelli
QA Contact: Pavel Kralik
URL: http://jira.rhq-project.org/browse/RH...
Whiteboard:
Depends On:
Blocks:
 
Reported: 2008-11-06 13:54 UTC by John Mazzitelli
Modified: 2013-04-30 23:32 UTC (History)

Fixed In Version: 1.2
Clone Of:
Environment:
Last Closed:
Embargoed:



Description John Mazzitelli 2008-11-06 13:54:00 UTC
When the agent starts up, it attempts to download its plugins. If it fails to download one or more plugins, it simply continues on. This probably shouldn't be the case: if the agent fails to pull down a plugin during startup, it should try to get that plugin again later.

The case where I have seen this happen is when the agent comes up but some servers in the cloud are down or, worse, go down during the download. The agent will attempt to switch over to another server, but when that happens the remote stream becomes invalid (a remote stream is only valid for the server it originated from). As soon as the switchover happens, the agent gets a remote stream error and the plugin fails to download. In this case, perhaps the agent should retry pulling down that plugin, so that it is fault tolerant of the case where it switched over to another server under the covers.

If we don't fix this, an agent could have an incomplete set of plugins and may fail to start properly (if the plugin it failed to get was the platform plugin, the agent will certainly be dead in the water).

Comment 1 John Mazzitelli 2008-11-11 17:44:58 UTC
I would like to get this fixed in 1.2.

Comment 2 John Mazzitelli 2008-11-15 17:11:35 UTC
The agent will attempt several times to download a plugin (sleeping for a bit between retries). Only if it fails multiple times will the agent give up.
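
The retry behavior described in this comment could be sketched roughly as follows. This is only an illustration, not the agent's actual code: the class name PluginDownloadRetry, the method withRetries, the file name example-plugin.jar, and the attempt count and sleep interval are all hypothetical, since the comment does not state the real values.

```java
import java.io.IOException;
import java.util.concurrent.Callable;

// Hypothetical helper for illustration only; not the RHQ agent's real implementation.
final class PluginDownloadRetry {

    /**
     * Runs the download up to maxAttempts times, sleeping between failed attempts.
     * Rethrows the last failure only if every attempt fails.
     */
    static <T> T withRetries(Callable<T> download, int maxAttempts, long sleepMillis)
            throws Exception {
        if (maxAttempts < 1) {
            throw new IllegalArgumentException("maxAttempts must be at least 1");
        }
        Exception last = null;
        for (int attempt = 1; attempt <= maxAttempts; attempt++) {
            try {
                return download.call();
            } catch (Exception e) {
                last = e;
                System.err.println("WARN: plugin download attempt " + attempt + " of "
                        + maxAttempts + " failed, will retry: " + e.getMessage());
                if (attempt < maxAttempts) {
                    Thread.sleep(sleepMillis); // give a restarted or failover server time to respond
                }
            }
        }
        throw last; // every attempt failed, so give up
    }

    public static void main(String[] args) throws Exception {
        int[] calls = {0};
        // Simulated download: fails twice (as after a server switchover), then succeeds.
        String plugin = withRetries(() -> {
            if (++calls[0] < 3) {
                throw new IOException("remote stream is no longer valid");
            }
            return "example-plugin.jar";
        }, 5, 10L);
        System.out.println("downloaded " + plugin + " on attempt " + calls[0]);
    }
}
```

Each retry should open a fresh stream rather than reuse the old one, because a remote stream is only valid for the server it originated from.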

Comment 3 John Mazzitelli 2008-11-15 17:16:44 UTC
This might be tough to test because of the timing, but here's what to try in order to verify that it works:

1) have 1 or 2 servers in the cloud
2) start the agent
3) while the agent is downloading plugins, kill 1 or both servers
4) after a minute, restart the server(s)
5) in the agent log, you should see the agent successfully download all plugins after some warnings about needing to retry

It's tough because you have to kill the servers in step 3 at the exact time the agent is downloading the plugins. Perhaps you could deploy a really fat plugin: create a temporary plugin with a very minimal, but valid, rhq-plugin.xml, and put very large files inside the plugin .jar so that the jar is 100MB or more. This way it will take the agent a while to download it, which gives the tester enough time to see the download start and to kill the server in the middle of it.
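
One way to build such a fat plugin jar is sketched below. The class name FatPluginJar and the padding.bin entry are made up for this example, and the descriptor location META-INF/rhq-plugin.xml is an assumption that this comment does not state. The descriptor content itself is not invented here; the caller has to supply a minimal valid rhq-plugin.xml. The padding is random bytes so that jar compression cannot shrink it.

```java
import java.io.IOException;
import java.nio.file.Files;
import java.nio.file.Path;
import java.nio.file.Paths;
import java.util.Random;
import java.util.zip.ZipEntry;
import java.util.zip.ZipOutputStream;

// Hypothetical test utility; names and layout are assumptions, not part of RHQ.
final class FatPluginJar {

    /** Writes a jar holding the given descriptor plus paddingBytes of incompressible filler. */
    static void write(Path jar, byte[] descriptor, long paddingBytes) throws IOException {
        try (ZipOutputStream zip = new ZipOutputStream(Files.newOutputStream(jar))) {
            // Assumed descriptor location inside the plugin jar.
            zip.putNextEntry(new ZipEntry("META-INF/rhq-plugin.xml"));
            zip.write(descriptor);
            zip.closeEntry();

            // Random filler does not compress, so the jar stays close to the requested size.
            zip.putNextEntry(new ZipEntry("padding.bin"));
            byte[] chunk = new byte[64 * 1024];
            Random random = new Random(42);
            long remaining = paddingBytes;
            while (remaining > 0) {
                random.nextBytes(chunk);
                int n = (int) Math.min(chunk.length, remaining);
                zip.write(chunk, 0, n);
                remaining -= n;
            }
            zip.closeEntry();
        }
    }

    public static void main(String[] args) throws IOException {
        // args[0]: output jar path, args[1]: path to a minimal valid rhq-plugin.xml
        byte[] descriptor = Files.readAllBytes(Paths.get(args[1]));
        write(Paths.get(args[0]), descriptor, 100L * 1024 * 1024); // about 100MB of padding
    }
}
```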

Comment 4 Pavel Kralik 2009-02-04 18:12:57 UTC
I prepared one 120MB plugin to deploy to the agent. I stopped the JON server 3 times during the deployment and the agent recovered and downloaded all the plugins.

RHEL 5.3, x86_64, PostgreSQL 8.2.4, JON RHQ SVN rev# 2894

Comment 5 Red Hat Bugzilla 2009-11-10 20:23:13 UTC
This bug was previously known as http://jira.rhq-project.org/browse/RHQ-1069
This bug is related to RHQ-974
This bug relates to RHQ-1090


