Bug 567742

Summary: JON231 HA, force repartition does not redistribute agents
Product: [Other] RHQ Project
Reporter: wes hayutin <whayutin>
Component: High Availability
Assignee: Heiko W. Rupp <hrupp>
Status: CLOSED NOTABUG
QA Contact: wes hayutin <whayutin>
Severity: low
Priority: low
Version: 3.0.0
CC: cwelton, jshaughn, mazz
Hardware: All
OS: Linux
URL: http://10.16.120.55:7080/rhq/ha/listServers.xhtml
Last Closed: 2010-07-12 18:38:57 UTC
Bug Blocks: 565628, 577041    

Description wes hayutin 2010-02-23 20:02:26 UTC
Description of problem:
JON231 HA: force repartition does not redistribute agents.


Test Case 
https://tcms.engineering.redhat.com/case/38182/?from_plan=1922
    * No affinity groups were used here.
    * Ensure that your 4 agents are in a 'healthy' state, attached to one server in the cloud.
    * Go to the HA admin console.
    * Click the "re-partition" button.
    * Check the output of the repartition event.

Agent Name                    Server Name
philcollins                   core-02.usersys.redhat.com
core-02.usersys.redhat.com    core-02.usersys.redhat.com
10.16.120.55                  10-16-120-55.guest.rhq.lab.eng.bos.redhat.com
lanse.usersys.redhat.com      10-16-120-55.guest.rhq.lab.eng.bos.redhat.com
Total: 4

    * Run the "failover --list" command at the agent prompt.

Notice that all four agents are *still* on the same server. That is a bug.
It also *may* be that affinity groups are required for this to work; however, we have not found any documentation to support that.

Comment 1 Charles Crouch 2010-05-18 03:26:35 UTC
Heiko, are you seeing this in the perf environment?

Comment 2 Heiko W. Rupp 2010-05-25 09:31:50 UTC
I see the same - after migrating all agents to one server (by putting the other server into maintenance mode for long enough), I put both servers back into normal mode and clicked force repartition.
All agents stay connected to that one server.

Reading http://rhq-project.org/display/RHQ/Design-High+Availability+-+Agent+Failover#Design-HighAvailability-AgentFailover-CloudRepartition

Cloud Repartition
[...]
A repartition does not push new server lists to connected agents. This prevents large scale fail-over in large environments, potentially spiking a server with connection processing. Instead, agents will intermittently check for updated server lists, and reconnect to new primary assignments, if necessary. This disperses the connection load.
---

It looks like this "not forcing the repartition on connected agents" is a feature.
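
For what it's worth, the quoted behavior boils down to something like the sketch below (class and method names are made up for illustration; this is not the actual RHQ agent code): the agent keeps using its cached list and only periodically asks the cloud whether a newer list exists, reconnecting to a new primary only if its list actually changed.

    import java.util.List;
    import java.util.Random;
    import java.util.concurrent.Executors;
    import java.util.concurrent.ScheduledExecutorService;
    import java.util.concurrent.TimeUnit;

    // Hypothetical sketch - not the real RHQ agent classes or names.
    public class FailoverListChecker {

        private final ScheduledExecutorService scheduler = Executors.newSingleThreadScheduledExecutor();
        private final Random random = new Random();

        public void start(final AgentConnection connection) {
            // Each agent checks on its own schedule (with some jitter), so a
            // repartition never makes every agent reconnect at the same moment.
            long initialDelaySecs = 60 + random.nextInt(300);
            scheduler.scheduleWithFixedDelay(new Runnable() {
                public void run() {
                    List<String> latest = connection.downloadServerList();   // ask the cloud
                    List<String> current = connection.getCachedServerList(); // local failover-list.dat
                    if (!latest.equals(current)) {
                        connection.saveServerList(latest);
                        String newPrimary = latest.get(0);
                        if (!newPrimary.equals(connection.getCurrentServer())) {
                            connection.switchTo(newPrimary); // reconnect only if the primary changed
                        }
                    }
                }
            }, initialDelaySecs, 3600, TimeUnit.SECONDS); // roughly hourly, per comment 4 below
        }
    }

    // Placeholder interface, only to make the sketch self-contained.
    interface AgentConnection {
        List<String> downloadServerList();
        List<String> getCachedServerList();
        void saveServerList(List<String> list);
        String getCurrentServer();
        void switchTo(String server);
    }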

But then when I put the server that had stayed in "normal" mode above into maintenance for a few seconds, some agents migrate over and the partition events list quickly fills with agent connect events.

Comment 3 Charles Crouch 2010-05-25 13:30:28 UTC
So this is behaving as expected? Please double-check with jshaughn.

Comment 4 Jay Shaughnessy 2010-05-25 14:40:53 UTC
The "Force Repartition" option is actually very rarely, if ever, needed.  It was put in place really, I think, just as a fail-safe.

A full repartitioning happens automatically if a server goes offline or a server comes online.  So, in Heiko's test, moving the server to and from maintenance mode would have forced a repartition anyway. In the partition event list you should see a full partitioning take place for either of these events.  

This means that the failover lists are regenerated for all agents. It does not mean that the agents will immediately switch to these new lists, nor does it mean that the new lists will be different from the old lists.
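
To make that distinction concrete, here is a toy sketch of what regenerating the failover lists could look like (names invented, not the actual RHQ partitioning code): every agent gets a fresh, ordered list of the live servers, rotated so the primary assignments are spread round-robin, and nothing in this step contacts any agent.

    import java.util.ArrayList;
    import java.util.HashMap;
    import java.util.List;
    import java.util.Map;

    // Toy illustration of a full repartition: each agent gets a fresh, ordered
    // failover list, but no list is pushed to any agent at this point.
    public class RepartitionSketch {

        public static Map<String, List<String>> repartition(List<String> agents, List<String> liveServers) {
            Map<String, List<String>> failoverLists = new HashMap<String, List<String>>();
            for (int i = 0; i < agents.size(); i++) {
                // Rotate the server list so primary assignments spread round-robin.
                List<String> list = new ArrayList<String>();
                for (int j = 0; j < liveServers.size(); j++) {
                    list.add(liveServers.get((i + j) % liveServers.size()));
                }
                failoverLists.put(agents.get(i), list);
            }
            return failoverLists; // agents pick these up later, on their own schedule
        }
    }

With the 4 agents and 2 live servers from the original report, that would leave 2 agents whose primary is each server - the roughly even spread described here, once the agents have had time to pick their new lists up.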

When a server goes down, agents will no longer be able to connect and will start searching for a new server on their current failover list in short order (although it's still not immediate; we do wait to see if the connection loss is short-lived).  When a server comes online we don't want agents rushing over there in a flood. Instead, over the course of, I think, an hour, they should redistribute somewhat evenly, now using the new server.

Force repartition will most likely not generate any differences to the failover lists that already exist if no agents have been added since the last full partitioning.

In short, if your agents, an hour after a full partitioning event, are fairly evenly balanced amongst available servers, things are probably working as expected.

So, in the original problem report, it's important to know when the servers were brought online with respect to the reported distribution and how long a period was given to reach a steady state.

Also, the GUI will show the current failover list, and I think the agent's 'failover --check' command will update to the current list. 'failover --list' will not, I believe, update the agent's failover list; it just shows the local copy.

Comment 5 John Mazzitelli 2010-07-12 18:38:57 UTC
I agree with jay's comments. Also "failover --list" only lists the failover list as it is known to the agent at that time - i.e. it is the local failover list (failover-list.dat). It doesn't check the server nor does it ask the server for a new list and return that new list. That's what --check is for.
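
As a rough illustration of that difference (the file name comes from the sentence above; everything else is invented for the sketch): --list only reads what the agent already has on disk, while --check goes back to the server and may rewrite that file.

    import java.io.File;
    import java.nio.file.Files;
    import java.util.List;

    // Illustrative only - not the actual agent prompt command implementation.
    public class FailoverPromptSketch {

        private final File localList = new File("failover-list.dat"); // the agent's local copy (location simplified)

        // "failover --list": show the list the agent already knows; no server call.
        public List<String> list() throws Exception {
            return Files.readAllLines(localList.toPath());
        }

        // "failover --check": ask the server for the latest list and update the local copy.
        public List<String> check(AgentServerApi server) throws Exception {
            List<String> latest = server.fetchFailoverList();
            Files.write(localList.toPath(), latest);
            return latest;
        }
    }

    // Placeholder for whatever remote API the agent uses; hypothetical.
    interface AgentServerApi {
        List<String> fetchFailoverList();
    }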

I'm closing this as NOTABUG because I think it's working as expected.