Bug 536645 (RHQ-974)

Summary: do not attempt to failover to another server for calls that must talk to the same server
Product: [Other] RHQ Project
Reporter: John Mazzitelli <mazz>
Component: Communications Subsystem
Assignee: RHQ Project Maintainer <rhq-maint>
Status: CLOSED WONTFIX
Severity: medium
Priority: low
Version: 1.1
CC: cwelton, jshaughn, mazz
Target Milestone: ---
Keywords: Improvement
Target Release: ---
Hardware: All
OS: All
URL: http://jira.rhq-project.org/browse/RHQ-974
Doc Type: Enhancement
Last Closed: 2014-05-16 15:25:21 UTC

Description John Mazzitelli 2008-10-10 05:17:00 UTC
There are certain agent->server API calls that must not be switched over to another server, mainly streaming API calls. Examples include:

1) When an agent pulls down plugins from the server, the agent is streaming file contents from the server.  If the agent attempts to call the remote stream and the server goes down, having the agent switch over to another server is futile since the stream doesn't exist on any other server.  In this case, the request should immediately fail without attempting to fail over.

2) When streaming content to/from the server, the same thing applies.

This is a difficult problem to solve, and in fact may not be 100% solvable.  The agent's comm layer will just have to bubble up exceptions and fail the message sending.

We may want to create a new comm-api annotation, @NotFailoverable.  This would correspond to a new command configuration property attached to the command (e.g. "rhq.not-failoverable" = "true").  Our failover code could look at each command and, if it has this property set, abort the failover attempt and throw an exception up.  If at all possible, failover should never even be attempted for such a command: the original cannot-connect exception should be considered a "not-failover-able exception".
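
A minimal sketch of how the proposed annotation and property could fit together, under stated assumptions. Only the names @NotFailoverable and "rhq.not-failoverable" come from the proposal above; the PluginService interface and the helpers findMethod, buildCommandConfig, and mayFailOver are hypothetical and are not part of the existing comm layer.

```java
import java.lang.annotation.ElementType;
import java.lang.annotation.Retention;
import java.lang.annotation.RetentionPolicy;
import java.lang.annotation.Target;
import java.lang.reflect.Method;
import java.util.Properties;

final class NotFailoverableSketch {
    /** Proposed marker: calls to an annotated method must stay on the original server. */
    @Retention(RetentionPolicy.RUNTIME)
    @Target(ElementType.METHOD)
    public @interface NotFailoverable {
    }

    /** Property name taken from the proposal in this bug. */
    static final String NOT_FAILOVERABLE_PROP = "rhq.not-failoverable";

    /** Hypothetical remote API, for illustration only. */
    public interface PluginService {
        @NotFailoverable
        byte[] readPluginChunk(String streamId);

        String getPluginVersion(String pluginName);
    }

    /** Hypothetical helper: looks up a public API method by name. */
    static Method findMethod(Class<?> api, String name) {
        for (Method m : api.getMethods()) {
            if (m.getName().equals(name)) {
                return m;
            }
        }
        throw new IllegalArgumentException("No such method: " + name);
    }

    /** Copies the annotation onto the command's configuration when the command is built. */
    static Properties buildCommandConfig(Method apiMethod) {
        Properties config = new Properties();
        if (apiMethod.isAnnotationPresent(NotFailoverable.class)) {
            config.setProperty(NOT_FAILOVERABLE_PROP, "true");
        }
        return config;
    }

    /** The failover code would consult this before switching servers for a failed command. */
    static boolean mayFailOver(Properties commandConfig) {
        return !Boolean.parseBoolean(commandConfig.getProperty(NOT_FAILOVERABLE_PROP, "false"));
    }

    public static void main(String[] args) {
        Method streaming = findMethod(PluginService.class, "readPluginChunk");
        Method plain = findMethod(PluginService.class, "getPluginVersion");
        System.out.println(mayFailOver(buildCommandConfig(streaming))); // false
        System.out.println(mayFailOver(buildCommandConfig(plain)));     // true
    }
}
```

In this sketch the decision is made from the command's own configuration, so the failover code does not need to know which API the command came from.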

Comment 1 John Mazzitelli 2008-10-10 05:40:58 UTC
Another thing that might help survive something like this:

Create a core-comm-api non-runtime (checked) exception like "AbortedException"; any method that is annotated with @NotFailoverable should consider declaring "throws AbortedException".  All code that calls these methods would then be forced to handle it.  And to handle it, all you might have to do is call the API again (because hopefully by that time, the agent has switched over to another server).
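
A rough sketch of what this could look like from the caller's side, assuming a checked AbortedException as suggested above. The ContentStreamApi interface and the downloadCallingAgainOnce helper are hypothetical names used only for illustration; the second call simply assumes the agent has switched servers in the meantime.

```java
final class AbortedExceptionSketch {
    /** Proposed checked exception: the call was aborted rather than failed over. */
    public static class AbortedException extends Exception {
        private static final long serialVersionUID = 1L;

        public AbortedException(String message) {
            super(message);
        }
    }

    /** Hypothetical streaming API; in the proposal this method would also be @NotFailoverable. */
    public interface ContentStreamApi {
        byte[] downloadContent(String name) throws AbortedException;
    }

    /** The compiler forces the caller to handle AbortedException; here it calls the API once more. */
    static byte[] downloadCallingAgainOnce(ContentStreamApi api, String name) throws AbortedException {
        try {
            return api.downloadContent(name);
        } catch (AbortedException firstFailure) {
            // Hopefully the agent has switched to another server by now, so start over.
            return api.downloadContent(name);
        }
    }

    public static void main(String[] args) throws AbortedException {
        int[] calls = {0};
        // Fake API that is aborted on the first call and succeeds on the second.
        ContentStreamApi flaky = name -> {
            if (++calls[0] == 1) {
                throw new AbortedException("server went down mid-stream");
            }
            return new byte[] {1, 2, 3};
        };
        byte[] data = downloadCallingAgainOnce(flaky, "example-package");
        System.out.println(data.length + " bytes after " + calls[0] + " calls"); // 3 bytes after 2 calls
    }
}
```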

Comment 2 Jay Shaughnessy 2008-10-10 14:59:15 UTC
Mazz, I'm not sure I fully understand the "having the agent switch over to another server is futile" line.  If a server goes down then failover isn't futile, right? Just processing that command is futile. Wouldn't we still want the agent to fail over in general?

@NoFailover (a better name, I think :) could throw the NotProcessedException, which we already have and is indicative of the problem, I think. It's a RuntimeException but could that be sufficient?  This could then trigger failover logic at the agent level as opposed to the command processing layer.

I think this is basically what you already propose.

Comment 3 John Mazzitelli 2008-11-11 17:43:30 UTC
I would like to try to get something into 1.2 to address this problem.

At the very least we should make downloading the plugins more tolerant of this error... perhaps just perform a retry before saying the download failed?

Comment 4 John Mazzitelli 2009-01-10 05:29:40 UTC
The download-plugin problem has been addressed by making the download a bit more fault tolerant (we now retry if a plugin download fails).

I am putting this down from critical to minor but leaving it open because I suspect we may have to revisit this: the same kind of problem is going to happen when we need to stream package bits to and from the server, so we may still need a solution.
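
For reference, the kind of retry described for plugin downloads could be sketched as a bounded loop like the one below. This is an illustration only, not the agent's actual code: the PluginDownload interface, the fetchWithRetry helper, and the use of IOException are all assumptions.

```java
import java.io.IOException;

final class PluginDownloadRetrySketch {
    /** Hypothetical stand-in for the agent's plugin download call. */
    public interface PluginDownload {
        byte[] fetch(String pluginName) throws IOException;
    }

    /** Restarts the whole download on failure, up to maxAttempts times, then rethrows the last error. */
    static byte[] fetchWithRetry(PluginDownload download, String pluginName, int maxAttempts)
            throws IOException {
        if (maxAttempts < 1) {
            throw new IllegalArgumentException("maxAttempts must be at least 1");
        }
        IOException last = null;
        for (int attempt = 1; attempt <= maxAttempts; attempt++) {
            try {
                return download.fetch(pluginName);
            } catch (IOException e) {
                last = e; // remember the failure and start the download over
            }
        }
        throw last;
    }

    public static void main(String[] args) throws IOException {
        int[] calls = {0};
        // Fake download that fails twice before succeeding.
        PluginDownload flaky = name -> {
            if (++calls[0] < 3) {
                throw new IOException("stream lost");
            }
            return new byte[] {42};
        };
        byte[] plugin = fetchWithRetry(flaky, "example-plugin.jar", 3);
        System.out.println(plugin.length + " byte(s) after " + calls[0] + " attempts"); // 1 byte(s) after 3 attempts
    }
}
```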

Comment 5 Red Hat Bugzilla 2009-11-10 21:20:47 UTC
This bug was previously known as http://jira.rhq-project.org/browse/RHQ-974
This bug relates to RHQ-1069


Comment 6 Corey Welton 2010-08-12 20:00:38 UTC
Mazz, what are your current thoughts on this in re: bundles/streaming stuff?

Comment 7 John Mazzitelli 2010-08-24 17:35:42 UTC
We still need to address this, but it's not high priority.

Comment 8 Jay Shaughnessy 2014-05-15 21:49:34 UTC
Still feel like we need to do this?

Comment 9 John Mazzitelli 2014-05-16 15:25:21 UTC
(In reply to Jay Shaughnessy from comment #8)
> Still feel like we need to do this?

This is still a problem for things like streaming, as the description says. However, even if we don't switch over, what's the point? The server we are talking to is dead, so unless it's an async message with guaranteed delivery, it will still fail. Not switching over just means we'll keep trying to talk to the downed server, and thus a failure will result.

I think we can close this. In either situation a failure will occur, and our exception handling will take care of the error condition.