Red Hat Bugzilla – Bug 567742
JON231 HA, force repartition does not redistribute agents
Last modified: 2010-07-12 14:38:57 EDT
Description of problem:
JON231 HA, force repartition does not distribute agents.
* No affinity groups were used here.
* Ensure that your 4 agents are in a 'healthy' state, attached to one server in the cloud
* Go to the HA admin console
* Click the "re-partition" button
* Check the output of the repartition event
* Run the "failover --list" command at the agent prompt.
Notice that all four agents are *still* on the same server. That is a bug.
The bug *may* also be that affinity groups are required for this to work; however, we have not found any documentation to support that.
Heiko are you seeing this in the perf environment?
I see the same. I migrated all agents to one server (by keeping the other server in maintenance mode for long enough), then put both servers back in normal mode and clicked force repartition.
All agents stay connected to that one server.
A repartition does not push new server lists to connected agents. This prevents large scale fail-over in large environments, potentially spiking a server with connection processing. Instead, agents will intermittently check for updated server lists, and reconnect to new primary assignments, if necessary. This disperses the connection load.
It looks like this "not forcing the repartition on connected agents" is a feature.
But then when I put the server that stayed up in "normal" mode above into maintenance for a few seconds, some agents migrate over and the partition events list quickly fills with agent connect events.
So this is behaving as expected? Please double check with jshaughn.
The "Force Repartition" option is actually very rarely, if ever, needed. It was put in place really, I think, just as a fail-safe.
A full repartitioning happens automatically if a server goes offline or a server comes online. So, in Heiko's test, moving the server to and from maintenance mode would have forced a repartition anyway. In the partition event list you should see a full partitioning take place for either of these events.
This means that the failover lists are regenerated for all agents. It does not mean that the agents will immediately switch to these new lists, nor does it mean that the new lists will be different from the old lists.
When a server goes down, agents will no longer be able to connect and will start searching for a new server on their current failover list in short order (although it's still not immediate; we do wait to see if the connection loss is short). When a server comes online, we don't want agents rushing over there in flood fashion. Instead, over the course of, I think, an hour, they should redistribute somewhat evenly, now using the new server.
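To illustrate why spreading the switch-over across a window avoids a connection spike, here is a minimal sketch. It is not the JON agent code: the function names are hypothetical, the uniform random delay is an assumed strategy, and the one-hour window is taken from the "I think an hour" estimate above.

```python
import random


def schedule_switch_times(agent_names, window_seconds=3600, seed=None):
    """Give each agent a random delay (in seconds) within the window.

    Assumption: a uniform random delay stands in for whatever staggering
    the real agents use; the source only says they do not switch at once.
    """
    rng = random.Random(seed)
    return {name: rng.uniform(0, window_seconds) for name in agent_names}


def peak_connections_per_bucket(switch_times, bucket_seconds=60):
    """Return the largest number of agents switching in any one time bucket."""
    buckets = {}
    for delay in switch_times.values():
        bucket = int(delay // bucket_seconds)
        buckets[bucket] = buckets.get(bucket, 0) + 1
    return max(buckets.values()) if buckets else 0


if __name__ == "__main__":
    agents = [f"agent{i}" for i in range(1000)]
    times = schedule_switch_times(agents, seed=1)
    # If every agent switched immediately, the peak would be len(agents).
    print("peak per minute:", peak_connections_per_bucket(times))
```

With the delays spread out, the busiest minute sees only a small fraction of the agents, which is the load-dispersal effect described above.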
Force repartition will most likely not generate any differences to the failover lists that already exist if no agents have been added since the last full partitioning.
In short, if your agents, an hour after a full partitioning event, are fairly evenly balanced amongst available servers, things are probably working as expected.
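One way to make "fairly evenly balanced" concrete is sketched below. The tolerance threshold and the function name are assumptions chosen for illustration; JON does not define such a check as far as this thread shows.

```python
from collections import Counter


def is_fairly_balanced(agent_to_server, servers, tolerance=0.5):
    """Return True if every server's agent count is close to the even share.

    agent_to_server: mapping of agent name -> server it is connected to.
    servers: all available servers, including any with zero agents.
    tolerance: allowed deviation as a fraction of the even share
               (hypothetical threshold, at least one agent of slack).
    """
    if not servers:
        raise ValueError("at least one server is required")
    counts = Counter(agent_to_server.values())
    ideal = len(agent_to_server) / len(servers)
    allowed = max(1, tolerance * ideal)
    return all(abs(counts[server] - ideal) <= allowed for server in servers)


if __name__ == "__main__":
    servers = ["server1", "server2"]
    # The situation from the original report: all four agents on one server.
    reported = {f"agent{i}": "server1" for i in range(4)}
    print(is_fairly_balanced(reported, servers))  # False
```

Running such a check right after a repartition would be misleading; per the comment above, it only makes sense once the agents have had time to reach a steady state.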
So, in the original problem report, it's important to know when the 4 servers were brought online with respect to the reported distribution and how long a period was given to reach a steady state.
Also, the GUI will show the current failover list, and I think the agent's 'failover --check' command will update the agent to the current list. 'failover --list' will not, I believe, update the agent's failover list; it just shows the local copy.
I agree with Jay's comments. Also, "failover --list" only lists the failover list as it is known to the agent at that time, i.e. the local failover list (failover-list.dat). It doesn't check the server, nor does it ask the server for a new list and return that new list. That's what --check is for.
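The difference between the two commands, as described in the comments above, can be modeled roughly like this. The class and its methods are hypothetical stand-ins, not the agent's real implementation, and the sketch assumes that --check replaces the local copy with whatever the server returns.

```python
class FailoverListSketch:
    """Toy model of an agent's failover list (hypothetical, for illustration)."""

    def __init__(self, local_list, fetch_from_server):
        # local_list stands in for the contents of failover-list.dat.
        self.local_list = list(local_list)
        # fetch_from_server is a callable returning the server's current list.
        self._fetch_from_server = fetch_from_server

    def list(self):
        """Like 'failover --list': show the local copy, never contact the server."""
        return list(self.local_list)

    def check(self):
        """Like 'failover --check': ask the server and update the local copy."""
        self.local_list = list(self._fetch_from_server())
        return list(self.local_list)


if __name__ == "__main__":
    server_side = ["server2", "server1"]  # list after a repartition
    agent = FailoverListSketch(["server1", "server2"], lambda: server_side)
    print(agent.list())   # still the old local list
    print(agent.check())  # now the server's list
```

This is why running "failover --list" immediately after a repartition can show an unchanged list even when the server has generated a new one.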
I'm closing this as NOTABUG because I think it's working as expected.