Bug 839256

Summary:	FailoverListManagerBeanTest fails on OpenJDK 1.7
Product:	[Other] RHQ Project	Reporter:	Heiko W. Rupp <hrupp>
Component:	High Availability	Assignee:	Jay Shaughnessy <jshaughn>
Status:	CLOSED CURRENTRELEASE	QA Contact:	Mike Foley <mfoley>
Severity:	medium	Docs Contact:
Priority:	high
Version:	4.4	CC:	hrupp, jshaughn
Target Milestone:	---
Target Release:	RHQ 4.5.0
Hardware:	Unspecified
OS:	Unspecified
Whiteboard:
Fixed In Version:		Doc Type:	Bug Fix
Doc Text:		Story Points:	---
Clone Of:		Environment:
Last Closed:	2013-09-01 10:18:47 UTC	Type:	Bug
Bug Blocks:	682878

Description Heiko W. Rupp 2012-07-11 11:47:12 UTC

Failing tests in server/jar:
Failed tests:   testGetForAllAgents20_1000(org.rhq.enterprise.server.cloud.FailoverListManagerBeanTest)
 testGetForAllAgents5_10(org.rhq.enterprise.server.cloud.FailoverListManagerBeanTest)
 testGetForAllAgents5_25(org.rhq.enterprise.server.cloud.FailoverListManagerBeanTest)

These tests fail because the distribution of agents across servers is not even (in the 5/10 test, one server has 3 agents
and another has 1, instead of 2 each).
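To make the imbalance concrete, here is a minimal sketch (not the actual test code) that counts agents per primary server and reports the gap between the busiest and the idlest server. The class `AgentSpread`, its `spread` method, and the server names are hypothetical; only the 5-server/10-agent shape and the 3-versus-1 split come from the report above.

```java
import java.util.Arrays;
import java.util.Collections;
import java.util.HashMap;
import java.util.List;
import java.util.Map;

// Hypothetical helper, not part of RHQ: measures how uneven a primary-server assignment is.
final class AgentSpread {

    // Returns (highest agent count) - (lowest agent count) across the given servers.
    // primaries holds, for each agent, the name of its primary server.
    static int spread(List<String> servers, List<String> primaries) {
        Map<String, Integer> load = new HashMap<>();
        for (String server : servers) {
            load.put(server, 0); // a server with no agents must still be counted
        }
        for (String primary : primaries) {
            load.merge(primary, 1, Integer::sum);
        }
        return Collections.max(load.values()) - Collections.min(load.values());
    }

    public static void main(String[] args) {
        List<String> servers = Arrays.asList("s1", "s2", "s3", "s4", "s5");
        // The shape described in the report: one server with 3 agents, one with 1.
        List<String> reported = Arrays.asList(
                "s1", "s1", "s1", "s2", "s3", "s3", "s4", "s4", "s5", "s5");
        // The expected shape: 2 agents on every server.
        List<String> even = Arrays.asList(
                "s1", "s1", "s2", "s2", "s3", "s3", "s4", "s4", "s5", "s5");
        System.out.println(spread(servers, reported)); // prints 2
        System.out.println(spread(servers, even));     // prints 0
    }
}
```

A spread of 0 corresponds to the 2/2 distribution the test expects; the failing run corresponds to a spread of 2.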

Comment 1 Jay Shaughnessy 2012-07-12 15:37:28 UTC

Reproduced using Oracle JDK7.  Investigating...

Comment 2 Jay Shaughnessy 2012-07-16 20:29:48 UTC

The "algorithm" is sort of fragile. My guess is that some Java 7 implementation change altered a non-guaranteed ordering that we were unknowingly relying on.
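As an illustration of the kind of dependency guessed at here, the sketch below shows one common way to stop relying on collection iteration order: sort explicitly before iterating. The Java API documentation for `HashSet` and `HashMap` makes no guarantee about iteration order, so code whose result depends on it can behave differently between JVM versions. Which collection (if any) was actually involved is not stated in this report; `StableOrder` and the server names are hypothetical.

```java
import java.util.ArrayList;
import java.util.Arrays;
import java.util.Collections;
import java.util.HashSet;
import java.util.List;
import java.util.Set;

// Hypothetical illustration, not RHQ code.
final class StableOrder {

    // Copies the set into a list and sorts it, so callers see the same
    // order on every JVM regardless of how the hash set lays out its entries.
    static List<String> sortedNames(Set<String> names) {
        List<String> ordered = new ArrayList<>(names);
        Collections.sort(ordered);
        return ordered;
    }

    public static void main(String[] args) {
        Set<String> servers = new HashSet<>(Arrays.asList("server-c", "server-a", "server-b"));
        // Iterating 'servers' directly may yield any order; the sorted copy does not.
        System.out.println(sortedNames(servers)); // prints [server-a, server-b, server-c]
    }
}
```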

I spent over a day trying to come up with a better algorithm but ran out of talent. I still feel there is probably an elegant approach to this problem, but it escapes me. It's a tricky problem: balancing load while respecting affinity, trying to retain existing primary servers (to reduce churn when reassigning the agent population), and further trying to distribute load on failures. (Also, the "algorithm" currently handles varying server compute power, although we don't use that; today we treat all servers as equals.)

In the end I tweaked the existing "algorithm", and I think it's improved: it reduces the chance of duplicated fail-over lists and therefore does a better job of distributing load after failures.

Still, this change did not produce a clean test run, although it did reduce the failures to a single one.  I think in the end the test code was a little too strict in its expectations for balance.  So I've relaxed the test verification such that balance does not need to be perfect after the tertiary level of fail-over.
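A rough sketch of what such a relaxed check could look like follows. It is not the code from the commit: the class `RelaxedBalanceCheck`, the `tolerance` parameter, and the reading of "perfect" as "per-server counts differ by at most 1" are all assumptions made for illustration.

```java
// Hypothetical sketch of a level-aware balance check, not the RHQ test code.
final class RelaxedBalanceCheck {

    // loadsByLevel[level][server] is the number of agents that have that server
    // at that position of their fail-over list (0 = primary, 1 = secondary, ...).
    // Levels below strictLevels must be balanced to within 1 agent; deeper
    // levels may additionally be off by 'tolerance' agents.
    static boolean isAcceptable(int[][] loadsByLevel, int strictLevels, int tolerance) {
        for (int level = 0; level < loadsByLevel.length; level++) {
            int[] loads = loadsByLevel[level];
            if (loads.length == 0) {
                continue; // nothing to compare at this level
            }
            int min = loads[0];
            int max = loads[0];
            for (int load : loads) {
                min = Math.min(min, load);
                max = Math.max(max, load);
            }
            int allowed = (level < strictLevels) ? 1 : 1 + tolerance;
            if (max - min > allowed) {
                return false;
            }
        }
        return true;
    }

    public static void main(String[] args) {
        int[][] loads = {
            {2, 2, 2, 2, 2}, // primary
            {2, 2, 2, 2, 2}, // secondary
            {2, 2, 2, 2, 2}, // tertiary
            {3, 1, 2, 2, 2}, // fourth level: slightly uneven
        };
        System.out.println(isAcceptable(loads, 3, 1)); // prints true
        System.out.println(isAcceptable(loads, 3, 0)); // prints false
    }
}
```

With `strictLevels = 3`, the primary, secondary, and tertiary levels are held to the strict standard, and only deeper levels get the extra slack.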

It should be noted that I don't think there was a major problem with the existing "algorithm" with respect to Java 7.  Fairly decent balance was still being maintained, but the test verification was strict.

---------------------

master commit b98e5f305e20dfc04baa38036c6b4e1e377052f8

    Tweak algorithm for better distribution and also relax test verification to
    allow for minor imbalance at deeper levels of failover.


This is not really testable in any easy way other than running through any failover test scenarios that may exist.

Comment 3 Heiko W. Rupp 2013-09-01 10:18:47 UTC

Bulk closing of items that are on_qa and in old RHQ releases that have been out for a long time, where the issue has not been re-opened since.