| Summary: | Agent failover includes hosts that are unreachable or not in affinity group | ||
|---|---|---|---|
| Product: | [Other] RHQ Project | Reporter: | Elias Ross <genman> |
| Component: | Communications Subsystem | Assignee: | Nobody <nobody> |
| Status: | NEW --- | QA Contact: | |
| Severity: | unspecified | Docs Contact: | |
| Priority: | unspecified | ||
| Version: | 4.9 | CC: | genman, hrupp, jshaughn |
| Target Milestone: | --- | ||
| Target Release: | --- | ||
| Hardware: | Unspecified | ||
| OS: | Unspecified | ||
| Whiteboard: | |||
| Fixed In Version: | Doc Type: | Bug Fix | |
| Doc Text: | Story Points: | --- | |
| Clone Of: | Environment: | ||
| Last Closed: | Type: | Bug | |
Description
Elias Ross
2013-11-22 22:38:31 UTC
This is by design. Affinity is strong, and you should see those nodes at the top of the failover list. But if all affinity nodes are down, we then progress to non-affinity nodes in a "best-effort" attempt to serve the agent. If the non-affinity servers are unreachable, the agent will spin past them and keep trying the others. If you can present a strong reason why we should make a change, please let us know; otherwise I think we should set this to "works as expected". Asking for feedback...

If we want to keep bad servers in the list, then I think the problem is that the 'spin past them' step may in fact take quite a while. I can look for some logs, but it seems to take around 4-5 minutes to go from a working server to a non-working one and back again.

For example, one server is moving from st11p01ad to mr11p01ad, which won't actually work at all.
2014-01-08 16:26:16,993 INFO [RHQ Server Polling Thread] (JBossRemotingRemoteCommunicator)- {JBossRemotingRemoteCommunicator.changing-endpoint}Communicator is changing endpoint from [InvokerLocator [servlet://st11p01ad-rhq005:7080/jboss-remoting-servlet-invoker/ServerInvokerServlet]] to [InvokerLocator [servlet://mr11p01ad-rhq001:7080/jboss-remoting-servlet-invoker/ServerInvokerServlet]]
...
2014-01-08 16:30:28,997 INFO [RHQ Agent Ping Thread-1] (AgentMain)- {AgentMain.ping-executor.start-polling-after-exception}Starting polling to determine sender status (server ping failed)
2014-01-08 16:30:28,996 ERROR [ClientCommandSenderTask Timer Thread #4129] (JBossRemotingRemoteCommunicator)- {JBossRemotingRemoteCommunicator.init-callback-failed}The initialize callback has failed. It will be tried again. Cause: Initialize callback lock could not be acquired
...
2014-01-08 16:30:29,002 INFO [ClientCommandSenderTask Timer Thread #4120] (JBossRemotingRemoteCommunicator)- {JBossRemotingRemoteCommunicator.changing-endpoint}Communicator is changing endpoint from [InvokerLocator [servlet://mr11p01ad-rhq001:7080/jboss-remoting-servlet-invoker/ServerInvokerServlet]] to [InvokerLocator [servlet://st11p01ad-rhq003:7080/jboss-remoting-servlet-invoker/ServerInvokerServlet]]
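
To put a number on the gap in the log excerpt above, here is a small sketch that subtracts the first "changing endpoint" timestamp from the last one. The class and method names are made up for illustration; only the two timestamps come from the log.

```java
import java.time.Duration;
import java.time.LocalDateTime;
import java.time.format.DateTimeFormatter;

public class FailoverGap {
    // Timestamp layout used by the agent log lines quoted in this report.
    private static final DateTimeFormatter LOG_TS =
            DateTimeFormatter.ofPattern("yyyy-MM-dd HH:mm:ss,SSS");

    /** Milliseconds elapsed between two log timestamps. */
    static long elapsedMillis(String from, String to) {
        LocalDateTime start = LocalDateTime.parse(from, LOG_TS);
        LocalDateTime end = LocalDateTime.parse(to, LOG_TS);
        return Duration.between(start, end).toMillis();
    }

    public static void main(String[] args) {
        // Switch to the unreachable server, then switch back to a working one.
        long ms = elapsedMillis("2014-01-08 16:26:16,993", "2014-01-08 16:30:29,002");
        System.out.println(ms + " ms"); // prints "252009 ms"
    }
}
```

That is roughly 4 minutes 12 seconds spent on a server that could never be reached, which matches the 4-5 minute estimate.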
I think this probably merits an RFE. Not everyone may see non-affinity servers as "bad" servers, but rather as "less desirable". But the use case you describe is totally reasonable; non-affinity servers could be unreachable by design, and trying to connect to them is a waste of time. What would you think about a global option like affinity-only-failover-lists? If true, agents associated with affinity groups would see only affinity servers. Other ideas?

An RFE is a good idea. But on second thought, the biggest problem was that failover was too slow, because server failure detection was slow (or not working) for unreachable servers. Maybe the detection code works when the server is down, resulting in an immediate 'ConnectionRefused', but if the host is behind a firewall, it may take too long to detect the failure. If detection is working properly, then from my point of view there is no need for a change.

I haven't done enough analysis to understand all aspects of the detection problem. Any work here should also consider Bug 535776.
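
The two ideas discussed above can be sketched together: an optional affinity-only failover list, and a reachability check with a bounded connect timeout so that a firewalled host fails about as fast as a refused connection. This is a minimal illustration under assumptions, not RHQ code. The `Server` type, the method names, and the `affinityOnly` flag (standing in for the proposed affinity-only-failover-lists option) are all hypothetical; only `java.net.Socket` is a real API here.

```java
import java.io.IOException;
import java.net.InetSocketAddress;
import java.net.Socket;
import java.util.ArrayList;
import java.util.List;

public class FailoverSketch {
    /** Hypothetical server entry; not an RHQ class. */
    static final class Server {
        final String host;
        final int port;
        final boolean inAffinityGroup;

        Server(String host, int port, boolean inAffinityGroup) {
            this.host = host;
            this.port = port;
            this.inAffinityGroup = inAffinityGroup;
        }
    }

    /** Affinity servers first; non-affinity servers follow unless affinityOnly is set. */
    static List<Server> buildFailoverList(List<Server> all, boolean affinityOnly) {
        List<Server> result = new ArrayList<>();
        for (Server s : all) {
            if (s.inAffinityGroup) result.add(s);
        }
        if (!affinityOnly) {
            for (Server s : all) {
                if (!s.inAffinityGroup) result.add(s);
            }
        }
        return result;
    }

    /** True if a TCP connection can be opened within timeoutMillis. */
    static boolean isReachable(Server s, int timeoutMillis) {
        try (Socket socket = new Socket()) {
            // The explicit timeout bounds the wait when packets are silently dropped.
            socket.connect(new InetSocketAddress(s.host, s.port), timeoutMillis);
            return true;
        } catch (IOException e) {
            // Refused connection, connect timeout, or unresolvable host.
            return false;
        }
    }

    /** First server in list order that accepts a connection, or null if none does. */
    static Server firstReachable(List<Server> failoverList, int timeoutMillis) {
        for (Server s : failoverList) {
            if (isReachable(s, timeoutMillis)) return s;
        }
        return null;
    }
}
```

One trade-off is visible in the sketch: with `affinityOnly` set, an agent whose affinity servers are all down gets no candidate at all, whereas the current best-effort behavior would still try the remaining servers.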