839256 – FailoverListManagerBeanTest fail on OpenJDK 1.7

Summary: FailoverListManagerBeanTest fail on OpenJDK 1.7

Keywords:
Status:	CLOSED CURRENTRELEASE
Alias:	None
Product:	RHQ Project
Classification:	Other
Component:	High Availability
Sub Component:
Version:	4.4
Hardware:	Unspecified
OS:	Unspecified
Priority:	high
Severity:	medium
Target Milestone:	---
Target Release:	RHQ 4.5.0
Assignee:	Jay Shaughnessy
QA Contact:	Mike Foley
Docs Contact:
URL:
Whiteboard:
Depends On:
Blocks:	682878

Reported:	2012-07-11 11:47 UTC by Heiko W. Rupp
Modified:	2013-09-01 10:18 UTC
CC List:	2 users
Fixed In Version:
Clone Of:
Environment:
Last Closed:	2013-09-01 10:18:47 UTC
Embargoed:

Description Heiko W. Rupp 2012-07-11 11:47:12 UTC

Failing tests in server/jar:
Failed tests:   testGetForAllAgents20_1000(org.rhq.enterprise.server.cloud.FailoverListManagerBeanTest)
 testGetForAllAgents5_10(org.rhq.enterprise.server.cloud.FailoverListManagerBeanTest)
 testGetForAllAgents5_25(org.rhq.enterprise.server.cloud.FailoverListManagerBeanTest)

These fail because the distribution of agents across servers is not even (in the 5/10 test, one server has 3 agents
and another has 1, instead of 2 agents each).

Comment 1 Jay Shaughnessy 2012-07-12 15:37:28 UTC

Reproduced using Oracle JDK7.  Investigating...

Comment 2 Jay Shaughnessy 2012-07-16 20:29:48 UTC

The "algorithm" is sort of fragile. My guess is that some Java 7 implementation change altered a non-guaranteed ordering that we were unknowingly relying on.
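
To illustrate the kind of dependency suspected here (a generic sketch, not the actual FailoverListManagerBean code): HashMap makes no guarantee about iteration order, so logic that walks its keys can behave differently from one JDK to the next, whereas TreeMap always iterates in sorted key order. The class name, helper, and server names below are hypothetical.

```java
import java.util.ArrayList;
import java.util.HashMap;
import java.util.List;
import java.util.Map;
import java.util.TreeMap;

public class OrderingSketch {

    // Hypothetical helper: returns the server names in the order the map iterates them.
    static List<String> iterationOrder(Map<String, Integer> loadByServer) {
        return new ArrayList<String>(loadByServer.keySet());
    }

    public static void main(String[] args) {
        Map<String, Integer> unordered = new HashMap<String, Integer>();
        Map<String, Integer> sorted = new TreeMap<String, Integer>();
        for (String name : new String[] { "server-c", "server-a", "server-b" }) {
            unordered.put(name, 0);
            sorted.put(name, 0);
        }
        // HashMap: the order is unspecified and may differ between JDK versions.
        System.out.println("HashMap order: " + iterationOrder(unordered));
        // TreeMap: always sorted by key, so the order is stable on every JDK.
        System.out.println("TreeMap order: " + iterationOrder(sorted));
    }
}
```

Any code that picks "the first server" from an unordered collection is exposed to this; sorting first, or using an ordered collection, removes the dependency.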

I spent over a day trying to come up with a better algorithm but ran out of talent. I still feel there is probably some sort of elegant approach to this problem but it escapes me. It's a tricky problem: balancing load while respecting affinity, trying to retain existing primary servers (to reduce churn when reassigning the agent population), and further trying to distribute load on failures. (Also, the "algorithm" currently handles varying server compute power, although we don't use that and today treat all servers as equals.)
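
As a rough picture of two of those constraints (balancing load and retaining existing primaries), here is a deliberately simplified sketch. It is not the RHQ algorithm: it ignores affinity, deeper fail-over levels, and compute power, and every name in it is hypothetical. It caps each server at the ceiling of its fair share, keeps an agent on its current primary while that server is under the cap, and sends the rest to the least-loaded server.

```java
import java.util.LinkedHashMap;
import java.util.List;
import java.util.Map;

public class PrimaryAssignmentSketch {

    // Hypothetical helper: maps each agent to a primary server.
    static Map<String, String> assignPrimaries(List<String> servers, List<String> agents,
            Map<String, String> currentPrimary) {
        if (servers.isEmpty()) {
            throw new IllegalArgumentException("at least one server is required");
        }
        // Ceiling of the fair share: no server may take more than this many agents.
        int cap = (agents.size() + servers.size() - 1) / servers.size();

        Map<String, Integer> load = new LinkedHashMap<String, Integer>();
        for (String server : servers) {
            load.put(server, 0);
        }
        Map<String, String> result = new LinkedHashMap<String, String>();

        // Pass 1: keep the existing primary while that server is still under the cap.
        for (String agent : agents) {
            String current = currentPrimary.get(agent);
            if (current != null && load.containsKey(current) && load.get(current) < cap) {
                result.put(agent, current);
                load.put(current, load.get(current) + 1);
            }
        }

        // Pass 2: place the remaining agents on the least-loaded server.
        // Ties are broken by position in the servers list, so the outcome is deterministic.
        for (String agent : agents) {
            if (result.containsKey(agent)) {
                continue;
            }
            String best = null;
            for (String server : servers) {
                if (best == null || load.get(server) < load.get(best)) {
                    best = server;
                }
            }
            result.put(agent, best);
            load.put(best, load.get(best) + 1);
        }
        return result;
    }
}
```

When the agent count divides evenly by the server count (as in the 5/10 case), this ends with every server at exactly its fair share. It does not guarantee that in the uneven case, which hints at why the real problem, with the extra constraints, is harder.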

In the end I tweaked the existing "algorithm" and I think it's improved: it reduces the chances of duplicated fail-over lists and therefore does a better job of distributing load after failures.

Still, this change did not produce a clean test run, though it did reduce the failures to one.  I think in the end the test code was a little too strict in its expectations for balance.  So, I've relaxed the test verification such that balance does not need to be perfect after the tertiary level of fail-over.
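
A level-dependent check of that sort might look roughly like the following. This is only a sketch of the idea, not the code in the commit: the class name, the 0-based level numbering (0 = primary, 1 = secondary, 2 = tertiary), and the slack of one agent at deeper levels are all assumptions made for illustration.

```java
import java.util.Arrays;
import java.util.Collection;

public class BalanceCheckSketch {

    // Assumption: levels 0..2 (primary, secondary, tertiary) must be perfectly balanced.
    static final int LAST_STRICT_LEVEL = 2;
    // Assumption: deeper levels may be off by one agent per server.
    static final int RELAXED_SLACK = 1;

    // Returns true if every server's agent count at the given fail-over level is
    // within the allowed distance of its fair share.
    static boolean isAcceptable(int level, Collection<Integer> agentsPerServer) {
        int servers = agentsPerServer.size();
        if (servers == 0) {
            return true; // nothing to balance
        }
        int total = 0;
        for (int count : agentsPerServer) {
            total += count;
        }
        int floor = total / servers;                  // smallest perfectly balanced count
        int ceil = (total + servers - 1) / servers;   // largest perfectly balanced count
        int slack = level <= LAST_STRICT_LEVEL ? 0 : RELAXED_SLACK;
        for (int count : agentsPerServer) {
            if (count < floor - slack || count > ceil + slack) {
                return false;
            }
        }
        return true;
    }

    public static void main(String[] args) {
        // The 5/10 distribution from the description: one server has 3 agents, another 1.
        Collection<Integer> observed = Arrays.asList(2, 3, 1, 2, 2);
        System.out.println("primary level ok? " + isAcceptable(0, observed));
        System.out.println("fourth level ok?  " + isAcceptable(3, observed));
    }
}
```

Under these assumed tolerances the 3/1 distribution is still rejected at the primary level but tolerated at deeper levels.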

It should be noted that I don't think there was a major problem with the existing "algorithm" wrt Java7.  Fairly decent balance was still being maintained, but the test code verification was strict.

---------------------

master commit b98e5f305e20dfc04baa38036c6b4e1e377052f8

    Tweak algorithm for better distribution and also relax test verification to
    allow for minor imbalance at deeper levels of failover.


This is not really testable in any easy way other than running through any failover test scenarios that may exist.

Comment 3 Heiko W. Rupp 2013-09-01 10:18:47 UTC

Bulk closing of items that are on_qa and in old RHQ releases, which are out for a long time and where the issue has not been re-opened since.