Bug 1033790 - Agent failover includes hosts that are unreachable or not in affinity group
Summary: Agent failover includes hosts that are unreachable or not in affinity group
Keywords:
Status: NEW
Alias: None
Product: RHQ Project
Classification: Other
Component: Communications Subsystem
Version: 4.9
Hardware: Unspecified
OS: Unspecified
Priority: unspecified
Severity: unspecified
Target Milestone: ---
Assignee: Nobody
QA Contact:
URL:
Whiteboard:
Depends On:
Blocks:
 
Reported: 2013-11-22 22:38 UTC by Elias Ross
Modified: 2022-03-31 04:28 UTC
CC: 3 users

Fixed In Version:
Doc Type: Bug Fix
Doc Text:
Clone Of:
Environment:
Last Closed:
Embargoed:



Description Elias Ross 2013-11-22 22:38:31 UTC
Description of problem:

The agent failover list looks like this, in this order:

st11p01ad-rhq004:7080/7443
st11p01ad-rhq003:7080/7443
mr11p01ad-rhq001:7080/7443
st11p01ad-rhq006:7080/7443
st11p01ad-rhq005:7080/7443
mr11p01ad-rhq002:7080/7443

The mr11 hosts are in a different datacenter and are not part of the agent's affinity group.

Hosts st11p01ad-rhq005 and rhq006 were created temporarily, put into maintenance mode, and also removed from the affinity group. However, they still appear in this list.

I would expect agents not to attempt to fail over to many of these hosts, especially those outside the affinity group, but I have seen this happen.


Version-Release number of selected component (if applicable): 4.9


How reproducible: Always


Steps to Reproduce:
1. Create a number of servers (6) and split into two affinity groups
2. Assign an agent to one affinity group

Actual results:

The failover list contains servers that are not applicable to the agent.

Expected results:

The list contains only hosts in the affinity group.

Comment 1 Jay Shaughnessy 2014-01-09 17:09:16 UTC
This is by design. Affinity is strong, and you should see those nodes at the top of the failover list. But if all affinity nodes are down, we then progress to non-affinity nodes in a best-effort attempt to serve the agent.
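The ordering described above can be sketched as a stable sort that places affinity-group members ahead of the remaining "best-effort" servers. This is a minimal illustration, not RHQ's actual failover-list code; the class and method names are hypothetical:

```java
import java.util.ArrayList;
import java.util.Comparator;
import java.util.List;
import java.util.Set;

// Hypothetical sketch: affinity servers first, non-affinity kept as a last resort.
public class FailoverOrdering {

    // Stable sort: servers in the agent's affinity group move to the front,
    // while the relative order within each group is preserved.
    public static List<String> order(List<String> servers, Set<String> affinity) {
        List<String> result = new ArrayList<>(servers);
        result.sort(Comparator.comparing((String s) -> affinity.contains(s) ? 0 : 1));
        return result;
    }
}
```

Under this scheme the mr11 hosts would still appear in the list, just after every st11 affinity member, which matches the "best-effort" behavior described above.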

If the non-affinity servers are unreachable the agent will spin past them and keep trying the others.

If you can present a strong reason why we should make a change please let us know, otherwise I think we should set this to "works as expected".  Asking for feedback...

Comment 2 Elias Ross 2014-01-09 20:35:45 UTC
If we want to keep bad servers in the list, then I think the problem is that the 'spin past them' step may in fact take quite a while. I can look for some logs, but it seems to take around 4-5 minutes to go from a working server to a non-working one and back again.

For example, one agent is moving from st11p01ad to mr11p01ad, which won't actually work at all.

2014-01-08 16:26:16,993 INFO  [RHQ Server Polling Thread] (JBossRemotingRemoteCommunicator)- {JBossRemotingRemoteCommunicator.changing-endpoint}Communicator is changing endpoint from [InvokerLocator [servlet://st11p01ad-rhq005:7080/jboss-remoting-servlet-invoker/ServerInvokerServlet]] to [InvokerLocator [servlet://mr11p01ad-rhq001:7080/jboss-remoting-servlet-invoker/ServerInvokerServlet]]
...
2014-01-08 16:30:28,997 INFO  [RHQ Agent Ping Thread-1] (AgentMain)- {AgentMain.ping-executor.start-polling-after-exception}Starting polling to determine sender status (server ping failed)
2014-01-08 16:30:28,996 ERROR [ClientCommandSenderTask Timer Thread #4129] (JBossRemotingRemoteCommunicator)- {JBossRemotingRemoteCommunicator.init-callback-failed}The initialize callback has failed. It will be tried again. Cause: Initialize callback lock could not be acquired
...
2014-01-08 16:30:29,002 INFO  [ClientCommandSenderTask Timer Thread #4120] (JBossRemotingRemoteCommunicator)- {JBossRemotingRemoteCommunicator.changing-endpoint}Communicator is changing endpoint from [InvokerLocator [servlet://mr11p01ad-rhq001:7080/jboss-remoting-servlet-invoker/ServerInvokerServlet]] to [InvokerLocator [servlet://st11p01ad-rhq003:7080/jboss-remoting-servlet-invoker/ServerInvokerServlet]]

Comment 3 Jay Shaughnessy 2014-02-04 22:10:33 UTC
I think this probably merits an RFE. Not everyone may see non-affinity servers as "bad" servers but rather "less desirable". But the use-case you describe is totally reasonable; non-affinity servers could be unreachable by design, and trying to connect to them is a waste of time.

What would you think about it being a global option like affinity-only-failover-lists?  If true then agents associated with affinity groups would see only affinity servers. Other ideas?
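The proposed option could amount to filtering the generated failover list. A rough sketch, assuming the affinity-only-failover-lists name suggested above (it is a proposal in this bug, not an existing RHQ setting):

```java
import java.util.List;
import java.util.Set;
import java.util.stream.Collectors;

// Sketch of the proposed behavior: when the (hypothetical) global
// affinity-only-failover-lists option is enabled and the agent belongs
// to an affinity group, drop all non-affinity servers from its list.
public class AffinityOnlyFilter {

    public static List<String> apply(List<String> failoverList,
                                     Set<String> affinity,
                                     boolean affinityOnly) {
        if (!affinityOnly || affinity.isEmpty()) {
            // Default behavior: keep non-affinity servers as best-effort entries.
            return failoverList;
        }
        return failoverList.stream()
                .filter(affinity::contains)
                .collect(Collectors.toList());
    }
}
```

Agents without an affinity group would be unaffected either way, since there is nothing to filter against.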

Comment 4 Elias Ross 2014-02-05 00:58:28 UTC
An RFE is a good idea.

But on second thought, the biggest problem was that failover was too slow, because server failure detection was slow (or not working) for unreachable servers. Detection may work when the server is down, resulting in an immediate 'ConnectionRefused', but if the host is behind a firewall, it may take too long to detect the failure.
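The distinction matters: a down host actively refuses the connection, while a firewall that silently drops packets leaves the connect hanging until an OS-level timeout, which can take minutes. An explicit connect timeout bounds both cases. A minimal sketch, not code from the RHQ agent:

```java
import java.io.IOException;
import java.net.InetSocketAddress;
import java.net.Socket;

// Sketch: probe a server endpoint with a short connect timeout so that
// firewalled (packet-dropping) hosts fail fast instead of hanging for
// the OS default connect timeout.
public class ReachabilityProbe {

    public static boolean isReachable(String host, int port, int timeoutMillis) {
        try (Socket socket = new Socket()) {
            // connect() with a timeout fails quickly on both a refused
            // connection and a silently dropped one.
            socket.connect(new InetSocketAddress(host, port), timeoutMillis);
            return true;
        } catch (IOException e) {
            return false;
        }
    }
}
```

A failover loop that called something like this before each endpoint switch could skip unreachable servers in under a second each, instead of the minutes observed in the logs above.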

If detection is working properly, then from my point of view there is no need for a change.

I haven't done enough analysis to understand all aspects of the detection problem.

Comment 5 Jay Shaughnessy 2014-05-09 19:44:59 UTC
Any work here should also consider Bug 535776

