Bug 1190029

Summary: [GSS] (6.4.z) Multiple Cluster Singletons after a merge
Product: [JBoss] JBoss Enterprise Application Platform 6
Reporter: dereed
Component: Clustering
Assignee: Tomas Hofman <thofman>
Status: CLOSED CURRENTRELEASE
QA Contact: Jitka Kozana <jkudrnac>
Severity: urgent
Priority: unspecified
Version: 6.3.2
CC: bbaranow, bmaxwell, cdewolf, jawilson, jforte, lthon, smatasar, thofman
Target Milestone: CR2
Target Release: EAP 6.4.1
Hardware: Unspecified
OS: Unspecified
Doc Type: Bug Fix
Type: Bug
Clones: 1213425
Bug Blocks: 1207953, 1213425    

Description dereed 2015-02-06 06:14:28 UTC
Cluster Singleton services can run on multiple instances after a cluster merge.

===================================================================

The service->node mappings used to determine which nodes are eligible to run a singleton are kept in Infinispan.
During a cluster split/merge this data will get corrupted: due to Infinispan limitations, any changes made during the split in all but one partition will be lost, and which partition's data survives can differ per key.

The mapping is rebuilt after a merge by asking new members for their data, and fixing the list stored in Infinispan.

But the new data is looked up in the same Infinispan store that has been corrupted!
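
To make the failure mode concrete, here is a minimal, hypothetical sketch of the pattern being described; all class, field, and method names below are illustrative (a plain map stands in for the replicated Infinispan cache), not the actual EAP sources.

import java.util.HashSet;
import java.util.List;
import java.util.Map;
import java.util.Set;
import java.util.concurrent.ConcurrentHashMap;
import java.util.concurrent.ConcurrentMap;

// Illustrative only: the service -> provider-nodes mapping lives in a shared, replicated store.
class ProviderRegistrySketch {
    final ConcurrentMap<String, Set<String>> sharedCache = new ConcurrentHashMap<String, Set<String>>();

    // Post-merge "repair" of the mapping, in the flawed shape described above.
    void onMerge(List<String> currentView) {
        for (Map.Entry<String, Set<String>> entry : sharedCache.entrySet()) {
            // BUG: the provider list is read back from the same shared cache, but after a
            // split/merge each entry may hold whichever sub-partition's writes happened to
            // survive, so the repair is computed from the very data that was corrupted.
            Set<String> providers = new HashSet<String>(entry.getValue());
            providers.retainAll(currentView);               // drop members that left the cluster
            sharedCache.put(entry.getKey(), providers);     // a stale view is written back
        }
    }
}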

Comment 1 dereed 2015-02-06 06:15:04 UTC
This was previously fixed upstream in 678cf687823ec189ad199b78645c99dc71e00272.
The fix is to instead look the data up in a local store on each node (which already exists).

org.jboss.as.clustering.service.ServiceProviderRegistryService#getServices needs to be replaced with the fixed version from
org.wildfly.clustering.server.service.ServiceProviderRegistryFactoryService#getServices from that commit.
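
For orientation, here is a minimal sketch of the shape of that fix, using hypothetical names (a node-local set stands in for the existing local store); the actual change is the one in the commit and classes referenced above.

import java.util.Collections;
import java.util.HashSet;
import java.util.Set;

// Illustrative only: each node answers "which services do you provide?" from its own
// local registrations instead of from the replicated Infinispan cache.
class LocalServiceRegistrySketch {
    private final Set<String> locallyRegistered =
            Collections.synchronizedSet(new HashSet<String>());

    void register(String service)   { locallyRegistered.add(service); }
    void unregister(String service) { locallyRegistered.remove(service); }

    // Used during the post-merge rebuild: a corrupted shared store can no longer
    // poison the answer, because the data comes from the node itself.
    Set<String> getServices() {
        synchronized (locallyRegistered) {
            return new HashSet<String>(locallyRegistered);
        }
    }
}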

Comment 2 dereed 2015-02-06 06:24:54 UTC
(Not just a copy/paste fix, since there are some minor differences in the method API between the versions, but it still looks like a pretty basic change.)

Comment 3 Tomas Hofman 2015-03-19 15:17:11 UTC
Couldn't reproduce this on the current EAP 6.4.x or 6.3.x branches.

The following method was used:

1) started two instances of EAP using standalone-ha.xml,
2) deployed cluster-ha-singleton from jboss-as-quickstarts, which contains a singleton timer bean (a rough sketch of such a bean follows this list), on both instances,
3) the timer was running on the first instance only,
4) froze the first instance,
5) after the split, the singleton bean was instantiated on the second instance,
6) woke the first instance again,
7) after the merge, the singleton bean on the second instance was stopped.
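
For reference, a rough sketch of the kind of programmatic timer bean the cluster-ha-singleton quickstart drives from its HA singleton service (illustrative, not the quickstart source); logging the node name makes it obvious when the timer fires on more than one instance.

import javax.annotation.Resource;
import javax.ejb.Stateless;
import javax.ejb.Timeout;
import javax.ejb.Timer;
import javax.ejb.TimerConfig;
import javax.ejb.TimerService;

// Illustrative only: started by the node currently elected as the cluster singleton.
@Stateless
public class SchedulerBean {

    @Resource
    private TimerService timerService;

    // Called by the singleton service when this node becomes the singleton provider.
    public void initialize(String info) {
        // non-persistent interval timer, one tick per second
        timerService.createIntervalTimer(1000, 1000, new TimerConfig(info, false));
    }

    @Timeout
    public void onTimeout(Timer timer) {
        // if two nodes both believe they are the singleton, this line appears on both
        System.out.println("Timer fired on " + System.getProperty("jboss.node.name")
                + ": " + timer.getInfo());
    }
}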

Comment 7 dereed 2015-03-27 21:41:27 UTC
I suspect there's another fix needed also (that's not upstream yet).

membershipChangedDuringMerge loops only through the new members, since their data has been removed from the cache; for the current partition it relies on the existing Infinispan data.

But after the merge the cache will contain data from one of the sub-partitions (I believe it can be different per cache entry?), but not necessarily from the current partition.

I think membershipChangedDuringMerge will need to loop through *all* members to fix the data instead of just the new members.
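
A hedged sketch of what that would look like, with hypothetical names (the ServiceQuery interface below stands in for the node-local lookup from comment 1): the rebuild iterates over every member of the merged view rather than only the newly joined ones, because any surviving cache entry may have come from a different sub-partition.

import java.util.HashSet;
import java.util.List;
import java.util.Map;
import java.util.Set;

// Illustrative only: rebuild the service -> providers mapping from every member's local answer.
class MergeHandlerSketch {

    interface ServiceQuery {
        // hypothetical per-node query; answers from the node-local store (see comment 1)
        Set<String> getServices(String member);
    }

    void membershipChangedDuringMerge(List<String> allMembers, ServiceQuery query,
                                      Map<String, Set<String>> cache) {
        for (String member : allMembers) {               // all members, not just the new ones
            for (String service : query.getServices(member)) {
                Set<String> providers = cache.get(service);
                if (providers == null) {
                    providers = new HashSet<String>();
                    cache.put(service, providers);
                }
                providers.add(member);
            }
        }
    }
}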

Comment 20 Tomas Hofman 2015-04-08 12:25:33 UTC
PR: https://github.com/jbossas/jboss-eap/pull/2380

@dereed, carlo: Could you please review?

Comment 21 dereed 2015-04-08 17:26:37 UTC
I think the "if (newMembers.isEmpty())" block also needs to be removed, as we may still need to rebuild the cache in an asymetrical cluster split.
(When {A,B} and {B} merge, the cache might contain only B's data after the merge).

And since we're completely rebuilding the cache, it might be best to clear it out first just to make sure no stale data is left over.
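
Continuing the hypothetical sketch from comment 7 (same illustrative ServiceQuery interface), the two suggestions above would look roughly like this: no early return when there are no new members, and the mapping is wiped before being rebuilt.

// Illustrative only (hypothetical names), extending the sketch after comment 7.
void membershipChangedDuringMerge(List<String> allMembers, MergeHandlerSketch.ServiceQuery query,
                                  Map<String, Set<String>> cache) {
    // No "if (newMembers.isEmpty()) return;" short-circuit: after an asymmetrical merge such
    // as {A,B} + {B} there may be no new members, yet the cache can still hold only B's data.
    cache.clear();                                       // start from a clean slate
    for (String member : allMembers) {
        for (String service : query.getServices(member)) {
            Set<String> providers = cache.get(service);
            if (providers == null) {
                providers = new HashSet<String>();
                cache.put(service, providers);
            }
            providers.add(member);
        }
    }
}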

Comment 23 Tomas Hofman 2015-04-13 13:55:59 UTC
Here is a test case for this issue: https://github.com/TomasHofman/jboss-eap/commit/b637847585cac5af122d4cc6a6139cd6ea9549e1

It is being discussed whether to make it part of the official test suite.

Comment 28 Tomas Hofman 2015-04-20 14:23:34 UTC
Created new PR against 6.4.x: https://github.com/jbossas/jboss-eap/pull/2389

Old PR was closed.

Comment 29 Tomas Hofman 2015-04-23 08:51:50 UTC
Note that in the PR the test case was added to the extendedTests profile of the EAP test suite.

Comment 31 Rostislav Svoboda 2015-04-27 08:50:35 UTC
qa_acking to avoid regressions; we have received one-off BZ 1213425 in the week beginning April 27th.

Comment 33 Ladislav Thon 2015-05-15 08:20:44 UTC
Verified with EAP 6.4.1.CP.CR2.

Comment 34 JBoss JIRA Server 2015-06-05 09:08:54 UTC
Tomas Hofman <thofman> updated the status of jira WFLY-4724 to Coding In Progress

Comment 35 Petr Penicka 2017-01-17 09:59:58 UTC
Retroactively bulk-closing issues from released EAP 6.4 cumulative patches.

Comment 36 Petr Penicka 2017-01-17 10:00:39 UTC
Retroactively bulk-closing issues from released EAP 6.4 cumulative patches.