Bug 567742 - JON231 HA, force repartition does not redistribute agents
Summary: JON231 HA, force repartition does not redistribute agents
Keywords:
Status: CLOSED NOTABUG
Alias: None
Product: RHQ Project
Classification: Other
Component: High Availability
Version: 3.0.0
Hardware: All
OS: Linux
Priority: low
Severity: low
Target Milestone: ---
Assignee: Heiko W. Rupp
QA Contact: wes hayutin
URL: http://10.16.120.55:7080/rhq/ha/listS...
Whiteboard:
Depends On:
Blocks: rhq_triage jon24-perf
 
Reported: 2010-02-23 20:02 UTC by wes hayutin
Modified: 2010-07-12 18:38 UTC
CC List: 3 users

Fixed In Version:
Doc Type: Bug Fix
Doc Text:
Clone Of:
Environment:
Last Closed: 2010-07-12 18:38:57 UTC
Embargoed:



Description wes hayutin 2010-02-23 20:02:26 UTC
Description of problem:
JON231 HA, force repartition does not redistribute agents.


Test Case 
https://tcms.engineering.redhat.com/case/38182/?from_plan=1922
    * No affinity groups were used here.
    * Ensure that your 4 agents are in a 'healthy' state, attached to one server in the cloud.
    * Go to the HA admin console.
    * Click the "re-partition" button.
    * Check the output of the repartition event.

Agent Name                   Server Name
philcollins                  core-02.usersys.redhat.com
core-02.usersys.redhat.com   core-02.usersys.redhat.com
10.16.120.55                 10-16-120-55.guest.rhq.lab.eng.bos.redhat.com
lanse.usersys.redhat.com     10-16-120-55.guest.rhq.lab.eng.bos.redhat.com
Total: 4

    * Run the "failover --list" command at the agent prompt. 

Notice that all four agents are *still* on the same server. That is a bug.
Alternatively, it may be that affinity groups are required for this to work; however, we have not found any documentation to support that.

Comment 1 Charles Crouch 2010-05-18 03:26:35 UTC
Heiko, are you seeing this in the perf environment?

Comment 2 Heiko W. Rupp 2010-05-25 09:31:50 UTC
I see the same - after migrating all agents to one server (by keeping the other server in maintenance mode long enough), I put both servers back into normal mode and clicked force repartition.
All agents stay connected to that one server. 

Reading http://rhq-project.org/display/RHQ/Design-High+Availability+-+Agent+Failover#Design-HighAvailability-AgentFailover-CloudRepartition

Cloud Repartition
[...]
A repartition does not push new server lists to connected agents. This prevents large scale fail-over in large environments, potentially spiking a server with connection processing. Instead, agents will intermittently check for updated server lists, and reconnect to new primary assignments, if necessary. This disperses the connection load.
---

It looks like this "not forcing the repartition on connected agents" is a feature.
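
To make that design concrete, here is a minimal sketch of the intermittent check the quoted text describes (illustrative only, not the actual RHQ agent code; the class and interface names are made up):

    // Illustrative sketch only - not actual RHQ agent code. The server does NOT
    // push new failover lists on repartition; each agent polls on its own
    // schedule, which spreads the reconnection load over time.
    import java.util.List;
    import java.util.concurrent.Executors;
    import java.util.concurrent.ScheduledExecutorService;
    import java.util.concurrent.TimeUnit;

    public class FailoverListPoller {
        // Hypothetical collaborators standing in for the agent's comm layer.
        interface ServerEndpoint { List<String> fetchFailoverList(); }
        interface AgentConnection { String currentServer(); void switchTo(String server); }

        private final ServerEndpoint server;
        private final AgentConnection connection;

        FailoverListPoller(ServerEndpoint server, AgentConnection connection) {
            this.server = server;
            this.connection = connection;
        }

        void start() {
            ScheduledExecutorService scheduler = Executors.newSingleThreadScheduledExecutor();
            // The poll interval here is illustrative, not the real agent setting.
            scheduler.scheduleWithFixedDelay(this::checkForNewPrimary, 5, 5, TimeUnit.MINUTES);
        }

        private void checkForNewPrimary() {
            List<String> latest = server.fetchFailoverList();
            if (latest.isEmpty()) {
                return;
            }
            String newPrimary = latest.get(0);
            // Reconnect only if the primary assignment actually changed.
            if (!newPrimary.equals(connection.currentServer())) {
                connection.switchTo(newPrimary);
            }
        }
    }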

But when I then put the server that had stayed in "normal" mode into maintenance for a few seconds, some agents migrate over and the partition events list quickly fills with agent connect events.

Comment 3 Charles Crouch 2010-05-25 13:30:28 UTC
So this is behaving as expected? Please double check with jshaughn.

Comment 4 Jay Shaughnessy 2010-05-25 14:40:53 UTC
The "Force Repartition" option is actually very rarely, if ever, needed.  It was put in place really, I think, just as a fail-safe.

A full repartitioning happens automatically if a server goes offline or a server comes online.  So, in Heiko's test, moving the server to and from maintenance mode would have forced a repartition anyway. In the partition event list you should see a full partitioning take place for either of these events.  

This means that the failover lists are regenerated for all agents. It does not mean that the agents will immediately switch to these new lists, nor does it mean that the new lists will be different from the old lists.
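
To picture what "regenerated" means, a trivially simple round-robin sketch follows (this is NOT RHQ's actual distribution algorithm, just an illustration): with the same agents and servers as input it produces the same lists again, which is why a repartition does not necessarily change anything.

    // Illustrative sketch only - not RHQ's actual partitioning code.
    import java.util.ArrayList;
    import java.util.LinkedHashMap;
    import java.util.List;
    import java.util.Map;

    public class RepartitionSketch {
        // Regenerate an ordered failover list for every agent by rotating the
        // server list, so primary assignments are spread across the agents.
        static Map<String, List<String>> regenerate(List<String> agents, List<String> servers) {
            Map<String, List<String>> failoverLists = new LinkedHashMap<>();
            for (int i = 0; i < agents.size(); i++) {
                List<String> ordered = new ArrayList<>();
                for (int j = 0; j < servers.size(); j++) {
                    ordered.add(servers.get((i + j) % servers.size()));
                }
                failoverLists.put(agents.get(i), ordered);
            }
            return failoverLists;
        }
    }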

When a server goes down, agents will no longer be able to connect and will start searching for a new server on their current failover list in short order (although it's still not immediate; we wait to see if the connection loss is brief). When a server comes online we don't want agents rushing over there in flood fashion. Instead, over the course of (I think) an hour, they should redistribute somewhat evenly, now using the new server.
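
The failover behavior itself can be sketched like this (again illustrative only, not the agent's real code; the grace period value is made up):

    // Illustrative sketch only - not actual RHQ agent code. On connection loss
    // the agent waits briefly (the loss may be short), then walks its *current*
    // failover list in order until a server accepts the connection.
    import java.util.List;

    public class FailoverWalker {
        interface AgentConnection { boolean tryConnect(String server); }

        private final AgentConnection connection;
        private final long gracePeriodMillis; // illustrative, not the real agent default

        FailoverWalker(AgentConnection connection, long gracePeriodMillis) {
            this.connection = connection;
            this.gracePeriodMillis = gracePeriodMillis;
        }

        String reconnect(List<String> failoverList) throws InterruptedException {
            // Give a short connection loss a chance to recover before failing over.
            Thread.sleep(gracePeriodMillis);
            for (String server : failoverList) {
                if (connection.tryConnect(server)) {
                    return server; // first reachable entry in the list wins
                }
            }
            return null; // nothing reachable yet; the agent would keep retrying
        }
    }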

Force repartition will most likely not generate any differences to the failover lists that already exist if no agents have been added since the last full partitioning.

In short, if your agents, an hour after a full partitioning event, are fairly evenly balanced amongst available servers, things are probably working as expected.

So, in the original problem report, it's important to know when the 4 servers were brought online with respect to the reported distribution and how long a period was given to reach a steady state.

Also, the GUI, or I think the agent's 'failover --check' command, will show the current failover list or update the agent to the current list, respectively. 'failover --list' will not, I believe, update the agent's failover list; it just shows the local copy.

Comment 5 John Mazzitelli 2010-07-12 18:38:57 UTC
I agree with Jay's comments. Also, "failover --list" only lists the failover list as it is known to the agent at that time - i.e. it is the local failover list (failover-list.dat). It doesn't check the server, nor does it ask the server for a new list and return that new list. That's what --check is for.
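
For reference, at the agent prompt the two commands would be used like this (output omitted; the descriptions just paraphrase the behavior above):

    failover --list     (prints the locally stored list, i.e. failover-list.dat; no server call)
    failover --check    (asks the server for the latest list and updates the local copy)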

I'm closing this as NOTABUG because I think it's working as expected.

