Bug 1190029

Summary: [GSS] (6.4.z) Multiple Cluster Singletons after a merge
Product: [JBoss] JBoss Enterprise Application Platform 6
Reporter: dereed
Component: Clustering
Assignee: Tomas Hofman <thofman>
Status: CLOSED CURRENTRELEASE
QA Contact: Jitka Kozana <jkudrnac>
Severity: urgent
Priority: unspecified
Version: 6.3.2
CC: bbaranow, bmaxwell, cdewolf, jawilson, jforte, lthon, smatasar, thofman
Target Milestone: CR2
Target Release: EAP 6.4.1
Hardware: Unspecified
OS: Unspecified
Doc Type: Bug Fix
Type: Bug
Clones: 1213425
Bug Blocks: 1207953, 1213425    

Description dereed 2015-02-06 06:14:28 UTC
Cluster Singleton services can run on multiple instances after a cluster merge.

===================================================================

The service->node mappings used to determine which nodes are eligible to run a singleton are kept in Infinispan.
During a cluster split/merge this data will get corrupted: due to Infinispan limitations, any changes made during the split in all but one partition will be lost, and which partition's data survives can differ per key.

The mapping is rebuilt after a merge by asking new members for their data, and fixing the list stored in Infinispan.

But the new data is looked up in the same Infinispan store that has been corrupted!
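
To make the failure mode concrete, here is a minimal, hypothetical sketch of the pattern being described; all class, field, and method names below are illustrative (a plain map stands in for the replicated Infinispan cache), not the actual EAP sources.

import java.util.HashSet;
import java.util.List;
import java.util.Map;
import java.util.Set;
import java.util.concurrent.ConcurrentHashMap;
import java.util.concurrent.ConcurrentMap;

// Illustrative only: the service -> provider-nodes mapping lives in a shared, replicated store.
class ProviderRegistrySketch {
    final ConcurrentMap<String, Set<String>> sharedCache = new ConcurrentHashMap<String, Set<String>>();

    // Post-merge "repair" of the mapping, in the flawed shape described above.
    void onMerge(List<String> currentView) {
        for (Map.Entry<String, Set<String>> entry : sharedCache.entrySet()) {
            // BUG: the provider list is read back from the same shared cache, but after a
            // split/merge each entry may hold whichever sub-partition's writes happened to
            // survive, so the repair is computed from the very data that was corrupted.
            Set<String> providers = new HashSet<String>(entry.getValue());
            providers.retainAll(currentView);               // drop members that left the cluster
            sharedCache.put(entry.getKey(), providers);     // a stale view is written back
        }
    }
}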

Comment 1 dereed 2015-02-06 06:15:04 UTC
This was previously fixed upstream in 678cf687823ec189ad199b78645c99dc71e00272.
The fix is to instead look the data up in a local store on each node (which already exists).

org.jboss.as.clustering.service.ServiceProviderRegistryService#getServices needs to be replaced with the fixed version from
org.wildfly.clustering.server.service.ServiceProviderRegistryFactoryService#getServices from that commit.
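
For orientation, here is a minimal sketch of the shape of that fix, using hypothetical names (a node-local set stands in for the existing local store); the actual change is the one in the commit and classes referenced above.

import java.util.Collections;
import java.util.HashSet;
import java.util.Set;

// Illustrative only: each node answers "which services do you provide?" from its own
// local registrations instead of from the replicated Infinispan cache.
class LocalServiceRegistrySketch {
    private final Set<String> locallyRegistered =
            Collections.synchronizedSet(new HashSet<String>());

    void register(String service)   { locallyRegistered.add(service); }
    void unregister(String service) { locallyRegistered.remove(service); }

    // Used during the post-merge rebuild: a corrupted shared store can no longer
    // poison the answer, because the data comes from the node itself.
    Set<String> getServices() {
        synchronized (locallyRegistered) {
            return new HashSet<String>(locallyRegistered);
        }
    }
}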

Comment 2 dereed 2015-02-06 06:24:54 UTC
(Not just a copy/paste fix, since there are some minor differences in the method API between the versions, but it still looks like a pretty basic change.)

Comment 3 Tomas Hofman 2015-03-19 15:17:11 UTC
Couldn't reproduce this on the current EAP 6.4.x or 6.3.x branches.

The following method was used:

1) started two instances of EAP using standalone-ha.xml,
2) deployed cluster-ha-singleton from jboss-as-quickstarts, which contains a singleton timer bean (a rough sketch of such a bean follows this list), on both instances,
3) the timer was running on the first instance only,
4) froze the first instance,
5) after the split, the singleton bean was instantiated on the second instance,
6) woke the first instance again,
7) after the merge, the singleton bean on the second instance was stopped.
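
For reference, a rough sketch of the kind of programmatic timer bean the cluster-ha-singleton quickstart drives from its HA singleton service (illustrative, not the quickstart source); logging the node name makes it obvious when the timer fires on more than one instance.

import javax.annotation.Resource;
import javax.ejb.Stateless;
import javax.ejb.Timeout;
import javax.ejb.Timer;
import javax.ejb.TimerConfig;
import javax.ejb.TimerService;

// Illustrative only: started by the node currently elected as the cluster singleton.
@Stateless
public class SchedulerBean {

    @Resource
    private TimerService timerService;

    // Called by the singleton service when this node becomes the singleton provider.
    public void initialize(String info) {
        // non-persistent interval timer, one tick per second
        timerService.createIntervalTimer(1000, 1000, new TimerConfig(info, false));
    }

    @Timeout
    public void onTimeout(Timer timer) {
        // if two nodes both believe they are the singleton, this line appears on both
        System.out.println("Timer fired on " + System.getProperty("jboss.node.name")
                + ": " + timer.getInfo());
    }
}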

Comment 7 dereed 2015-03-27 21:41:27 UTC
I suspect there's another fix needed also (that's not upstream yet).

membershipChangedDuringMerge loops only through the new members, since their data has been removed from the cache; for the current partition it relies on the existing Infinispan data.

But after the merge the cache will contain data from one of the sub-partitions (I believe it can be different per cache entry?), but not necessarily from the current partition.

I think membershipChangedDuringMerge will need to loop through *all* members to fix the data instead of just the new members.
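
A hedged sketch of what that would look like, with hypothetical names (the ServiceQuery interface below stands in for the node-local lookup from comment 1): the rebuild iterates over every member of the merged view rather than only the newly joined ones, because any surviving cache entry may have come from a different sub-partition.

import java.util.HashSet;
import java.util.List;
import java.util.Map;
import java.util.Set;

// Illustrative only: rebuild the service -> providers mapping from every member's local answer.
class MergeHandlerSketch {

    interface ServiceQuery {
        // hypothetical per-node query; answers from the node-local store (see comment 1)
        Set<String> getServices(String member);
    }

    void membershipChangedDuringMerge(List<String> allMembers, ServiceQuery query,
                                      Map<String, Set<String>> cache) {
        for (String member : allMembers) {               // all members, not just the new ones
            for (String service : query.getServices(member)) {
                Set<String> providers = cache.get(service);
                if (providers == null) {
                    providers = new HashSet<String>();
                    cache.put(service, providers);
                }
                providers.add(member);
            }
        }
    }
}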

Comment 20 Tomas Hofman 2015-04-08 12:25:33 UTC
PR: https://github.com/jbossas/jboss-eap/pull/2380

@dereed, carlo: Could you please review?

Comment 21 dereed 2015-04-08 17:26:37 UTC
I think the "if (newMembers.isEmpty())" block also needs to be removed, as we may still need to rebuild the cache in an asymetrical cluster split.
(When {A,B} and {B} merge, the cache might contain only B's data after the merge).

And since we're completely rebuilding the cache, it might be best to clear it out first just to make sure no stale data is left over.
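
Continuing the hypothetical sketch from comment 7 (same illustrative ServiceQuery interface), the two suggestions above would look roughly like this: no early return when there are no new members, and the mapping is wiped before being rebuilt.

// Illustrative only (hypothetical names), extending the sketch after comment 7.
void membershipChangedDuringMerge(List<String> allMembers, MergeHandlerSketch.ServiceQuery query,
                                  Map<String, Set<String>> cache) {
    // No "if (newMembers.isEmpty()) return;" short-circuit: after an asymmetrical merge such
    // as {A,B} + {B} there may be no new members, yet the cache can still hold only B's data.
    cache.clear();                                       // start from a clean slate
    for (String member : allMembers) {
        for (String service : query.getServices(member)) {
            Set<String> providers = cache.get(service);
            if (providers == null) {
                providers = new HashSet<String>();
                cache.put(service, providers);
            }
            providers.add(member);
        }
    }
}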

Comment 23 Tomas Hofman 2015-04-13 13:55:59 UTC
Here is a test case for this issue: https://github.com/TomasHofman/jboss-eap/commit/b637847585cac5af122d4cc6a6139cd6ea9549e1

It is being discussed whether to make it part of the official test suite.

Comment 28 Tomas Hofman 2015-04-20 14:23:34 UTC
Created new PR against 6.4.x: https://github.com/jbossas/jboss-eap/pull/2389

Old PR was closed.

Comment 29 Tomas Hofman 2015-04-23 08:51:50 UTC
Note that in the PR the test case was added to the extendedTests profile of the EAP test suite.

Comment 31 Rostislav Svoboda 2015-04-27 08:50:35 UTC
qa_acking to avoid regressions; we have received one-off BZ 1213425 in the week beginning April 27th.

Comment 33 Ladislav Thon 2015-05-15 08:20:44 UTC
Verified with EAP 6.4.1.CP.CR2.

Comment 34 JBoss JIRA Server 2015-06-05 09:08:54 UTC
Tomas Hofman <thofman> updated the status of jira WFLY-4724 to Coding In Progress

Comment 35 Petr Penicka 2017-01-17 09:59:58 UTC
Retroactively bulk-closing issues from released EAP 6.4 cumulative patches.

Comment 36 Petr Penicka 2017-01-17 10:00:39 UTC
Retroactively bulk-closing issues from released EAP 6.4 cumulative patches.