Certain agent->server API calls must not be switched over to another server, mainly streaming API calls. Examples include:

1) When an agent pulls down plugins from the server, the agent is streaming file contents from the server. If the agent attempts to call the remote stream and the server goes down, having the agent switch over to another server is futile since the stream doesn't exist on any other server. In this case, the request should fail immediately without attempting to fail over.
2) When streaming content to/from the server, the same thing applies.

This is a difficult problem to solve, and in fact it may not be 100% solvable. The agent's comm layer will just have to bubble up exceptions and fail the message sending.

We may want to create a new comm-api annotation, @NotFailoverable. This would correspond to a new command configuration property attached to the command (e.g. "rhq.not-failoverable" = "true"). Our failover code could look at each command and, if it has this property set, abort the failover attempt and throw an exception up. If at all possible, failover should never even be attempted for such a command: the original cannot-connect exception should be considered a "not-failover-able exception".
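To make the proposal concrete, here is a minimal sketch of how the annotation, the command property and the failover check might fit together. Only the names @NotFailoverable and "rhq.not-failoverable" come from the description above. Everything else (PluginService, SketchCommand, FailoverSketch, the use of IllegalStateException) is a hypothetical stand-in and is not the actual RHQ comm-layer API.

```java
import java.lang.annotation.ElementType;
import java.lang.annotation.Retention;
import java.lang.annotation.RetentionPolicy;
import java.lang.annotation.Target;
import java.lang.reflect.Method;
import java.util.Properties;

/** Proposed marker: a call to this method must not be retried on another server. */
@Retention(RetentionPolicy.RUNTIME)
@Target(ElementType.METHOD)
@interface NotFailoverable {
}

/** Hypothetical remote API interface, for illustration only. */
interface PluginService {
    @NotFailoverable
    byte[] streamPluginContent(String pluginName);

    String getPluginVersion(String pluginName);
}

/** Hypothetical stand-in for a comm-layer command that carries configuration properties. */
final class SketchCommand {
    static final String NOT_FAILOVERABLE_PROP = "rhq.not-failoverable";

    final String name;
    final Properties config = new Properties();

    SketchCommand(Method method) {
        this.name = method.getName();
        // Copy the annotation into a command property so the failover code
        // only has to inspect the command itself.
        if (method.isAnnotationPresent(NotFailoverable.class)) {
            config.setProperty(NOT_FAILOVERABLE_PROP, "true");
        }
    }

    /** Convenience factory that hides the checked reflection exception. */
    static SketchCommand forMethod(Class<?> api, String methodName, Class<?>... paramTypes) {
        try {
            return new SketchCommand(api.getMethod(methodName, paramTypes));
        } catch (NoSuchMethodException e) {
            throw new IllegalArgumentException("No such API method: " + methodName, e);
        }
    }

    boolean isFailoverAllowed() {
        return !Boolean.parseBoolean(config.getProperty(NOT_FAILOVERABLE_PROP, "false"));
    }
}

/** Hypothetical failover hook, invoked after a cannot-connect failure. */
final class FailoverSketch {
    static void failoverOrAbort(SketchCommand command, Exception cause, Runnable switchServer) {
        if (!command.isFailoverAllowed()) {
            // Do not even attempt the switch; surface the original failure.
            throw new IllegalStateException(
                "Command [" + command.name + "] is not failoverable; aborting", cause);
        }
        switchServer.run();
    }
}
```

In this sketch the annotation is translated into the property once, when the command is built, so the failover path never needs reflection.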
Another thing that might help survive something like this: create a core-comm-api checked (non-runtime) exception like "AbortedException", and any method that is annotated with @NotFailoverable should consider declaring "throws AbortedException". All code that calls these methods would then be forced to handle it. To handle it, all you might have to do is call the API again (because hopefully by that time the agent has switched over to another server).
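A rough sketch of what that could look like from the caller's side. AbortedException is the name proposed above. ContentSource and AbortedRetrySketch are hypothetical names made up for this illustration, and the single retry is an assumption rather than an agreed policy.

```java
/** Proposed checked exception: the call was aborted and was not failed over. */
class AbortedException extends Exception {
    AbortedException(String message, Throwable cause) {
        super(message, cause);
    }
}

/** Hypothetical streaming API; a @NotFailoverable method would declare the exception. */
interface ContentSource {
    byte[] download(String name) throws AbortedException;
}

final class AbortedRetrySketch {
    /**
     * Because AbortedException is checked, the caller has to deal with it.
     * Here "dealing with it" just means calling the API one more time, in the
     * hope that the agent has switched over to another server in the meantime.
     */
    static byte[] downloadWithOneRetry(ContentSource source, String name) throws AbortedException {
        try {
            return source.download(name);
        } catch (AbortedException firstFailure) {
            // A second failure propagates to the caller unchanged.
            return source.download(name);
        }
    }
}
```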
Mazz, I'm not sure I fully understand the "having the agent switch over to another server is futile" line. If a server goes down then failover isn't futile, right? Just processing that command is futile. Wouldn't we still want the agent to fail over in general? @NoFailover (a better name, I think :) could throw the NotProcessedException, which we already have and which is indicative of the problem, I think. It's a RuntimeException, but could that be sufficient? This could then trigger failover logic at the agent level as opposed to the command processing layer. I think this is basically what you already propose.
I would like to try to get something into 1.2 to address this problem. At the very least we should make downloading the plugins more tolerant of this error... perhaps just perform a retry before saying the download failed?
The download-plugin problem has been addressed by making plugin downloads a bit more fault tolerant (we now retry if a plugin download fails). I am putting this down from critical to minor but leaving it open because I suspect we may have to come back and revisit this - the same kind of problem is going to happen when we need to stream package bits to and from the server, so we may still need a solution.
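For reference, the retry idea can be sketched roughly as below. This is not the actual fix that went into the agent: DownloadRetrySketch, the attempt count and the pause are all assumptions chosen for illustration.

```java
import java.util.concurrent.Callable;

final class DownloadRetrySketch {
    /**
     * Runs the download up to maxAttempts times, pausing between attempts.
     * Returns the first successful result, or rethrows the last failure.
     */
    static <T> T withRetries(Callable<T> download, int maxAttempts, long pauseMillis) throws Exception {
        if (maxAttempts < 1) {
            throw new IllegalArgumentException("maxAttempts must be >= 1");
        }
        Exception last = null;
        for (int attempt = 1; attempt <= maxAttempts; attempt++) {
            try {
                return download.call();
            } catch (Exception e) {
                last = e;
                if (attempt < maxAttempts) {
                    // Give the agent some time to switch over to another server.
                    Thread.sleep(pauseMillis);
                }
            }
        }
        throw last;
    }
}
```

The same wrapper could in principle be reused for streaming package bits, although a retry only helps when the whole transfer can be restarted from the beginning.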
This bug was previously known as http://jira.rhq-project.org/browse/RHQ-974
This bug relates to RHQ-1069
Mazz, what are your current thoughts on this in re: bundles/streaming stuff?
We still need to address this, but it's not high priority.
Still feel like we need to do this?
(In reply to Jay Shaughnessy from comment #8)
> Still feel like we need to do this?

This is still a problem for things like streaming, as the description says. However, even if we don't switch over, what's the point? The server we are talking to is dead, so unless it's an async message with guaranteed delivery, it will still fail. Not switching over just means we'll keep trying to talk to the downed server, and thus a failure will result. I think we can close this - in either situation a failure will occur, and our exception handling will just take care of dealing with the error condition.