Bug 1190029 - [GSS] (6.4.z) Multiple Cluster Singletons after a merge
Summary: [GSS] (6.4.z) Multiple Cluster Singletons after a merge
Keywords:
Status: CLOSED CURRENTRELEASE
Alias: None
Product: JBoss Enterprise Application Platform 6
Classification: JBoss
Component: Clustering
Version: 6.3.2
Hardware: Unspecified
OS: Unspecified
Priority: unspecified
Severity: urgent
Target Milestone: CR2
Target Release: EAP 6.4.1
Assignee: Tomas Hofman
QA Contact: Jitka Kozana
URL:
Whiteboard:
Depends On:
Blocks: eap641-payload 1213425
 
Reported: 2015-02-06 06:14 UTC by dereed
Modified: 2019-07-11 08:37 UTC
CC List: 8 users

Fixed In Version:
Doc Type: Bug Fix
Doc Text:
Clone Of:
Cloned to: 1213425
Environment:
Last Closed:
Type: Bug
Embargoed:


Attachments


Links
System ID Private Priority Status Summary Last Updated
Red Hat Bugzilla 1223665 0 unspecified CLOSED Modify SingletonTunnelTestCase to allow GossipRouter interface to be configurable 2021-02-22 00:41:40 UTC
Red Hat Issue Tracker WFLY-4724 0 Major Resolved Port SingletonTunnelTestCase from EAP6 to WildFly 2017-10-23 10:12:14 UTC

Internal Links: 1223665

Description dereed 2015-02-06 06:14:28 UTC
Cluster Singleton services can run on multiple instances after a cluster merge.

===================================================================

The service->node mappings used to determine which nodes are eligible to run a singleton are kept in Infinispan.
During a cluster split/merge this data will get corrupted: due to Infinispan limitations, changes made during the split survive the merge from only one partition (and which partition wins can differ per key), so changes made in the other partitions are lost.

The mapping is rebuilt after a merge by asking new members for their data, and fixing the list stored in Infinispan.

But the new data is looked up in the same Infinispan store that has been corrupted!
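
To make the failure mode concrete, here is a minimal sketch of the flawed lookup; the class and member names (FlawedProviderRegistry, replicatedCache, getServices) are illustrative stand-ins, not the actual EAP classes. When the coordinator asks a node which services it provides after a merge, the node answers from the very cache whose entries the split just lost:

import java.util.*;
import java.util.concurrent.*;

class FlawedProviderRegistry {

    // Stand-in for the replicated Infinispan cache: node name -> services it registered.
    private final ConcurrentMap<String, Set<String>> replicatedCache = new ConcurrentHashMap<>();
    private final String localNode;

    FlawedProviderRegistry(String localNode) {
        this.localNode = localNode;
    }

    void register(String service) {
        replicatedCache.computeIfAbsent(localNode, n -> ConcurrentHashMap.newKeySet()).add(service);
    }

    // Answer to the coordinator's post-merge query. If the split dropped this
    // node's cache entry, the answer is empty and the rebuilt mapping stays corrupted.
    Set<String> getServices() {
        return replicatedCache.getOrDefault(localNode, Collections.emptySet());
    }
}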

Comment 1 dereed 2015-02-06 06:15:04 UTC
This was previously fixed upstream in 678cf687823ec189ad199b78645c99dc71e00272.
The fix is to instead look it up in a local store on each node (which already exists).

org.jboss.as.clustering.service.ServiceProviderRegistryService#getServices needs to be replaced with the fixed version from
org.wildfly.clustering.server.service.ServiceProviderRegistryFactoryService#getServices from that commit.
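
In outline, the fixed lookup answers the post-merge query from a node-local record instead of the replicated cache. A sketch under the same illustrative naming as above (not the actual ServiceProviderRegistryFactoryService code):

import java.util.*;
import java.util.concurrent.*;

class FixedProviderRegistry {

    // The replicated cache is still maintained for the shared service -> nodes mapping...
    private final ConcurrentMap<String, Set<String>> replicatedCache = new ConcurrentHashMap<>();
    // ...but the node also keeps a purely local record of its own registrations.
    private final Set<String> localServices = ConcurrentHashMap.newKeySet();
    private final String localNode;

    FixedProviderRegistry(String localNode) {
        this.localNode = localNode;
    }

    void register(String service) {
        localServices.add(service);
        replicatedCache.computeIfAbsent(localNode, n -> ConcurrentHashMap.newKeySet()).add(service);
    }

    // The coordinator rebuilds the shared mapping from each node's local view,
    // which cannot be affected by whatever the replicated cache lost during the split.
    Set<String> getServices() {
        return Collections.unmodifiableSet(localServices);
    }
}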

Comment 2 dereed 2015-02-06 06:24:54 UTC
(Not just a copy/paste fix, since there are some minor differences in the method API between the versions, but it still looks like a pretty basic change.)

Comment 3 Tomas Hofman 2015-03-19 15:17:11 UTC
Couldn't reproduce this on the current EAP 6.4.x or 6.3.x branches.

The following method was used:

1) started two instances of EAP using standalone-ha.xml,
2) deployed cluster-ha-singleton from jboss-as-quickstarts, which contains a singleton timer bean, on both instances,
3) the timer was running on the first instance only,
4) froze the first instance,
5) after the split, the singleton bean was instantiated on the second instance,
6) woke the first instance again,
7) after the merge, the singleton bean on the second instance was stopped (a minimal model of this election behaviour is sketched below this list).
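
For reference, a tiny self-contained model of the election behaviour observed in steps 3-7. This is not the quickstart's code; the node names, view lists, and the "first member of the current view" rule are illustrative assumptions:

import java.util.List;

class SingletonElectionModel {

    // Models the election as "the singleton runs on the first member of the
    // current view", which matches what steps 3-7 observe.
    static String electedProvider(List<String> view) {
        return view.isEmpty() ? null : view.get(0);
    }

    public static void main(String[] args) {
        List<String> joined = List.of("node1", "node2"); // steps 1-3: timer on node1
        List<String> split  = List.of("node2");          // steps 4-5: node1 frozen, node2 takes over
        List<String> merged = List.of("node1", "node2"); // steps 6-7: merge, node2 stops the timer

        System.out.println("before split: " + electedProvider(joined)); // node1
        System.out.println("during split: " + electedProvider(split));  // node2
        System.out.println("after merge:  " + electedProvider(merged)); // node1
    }
}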

Comment 7 dereed 2015-03-27 21:41:27 UTC
I suspect there's another fix needed also (that's not upstream yet).

membershipChangedDuringMerge only loops through the new members, since their data has been removed from the cache; for the members of the current partition it relies on the existing Infinispan data.

But after the merge the cache will contain data from one of the sub-partitions (I believe it can be different per cache entry?), but not necessarily from the current partition.

I think membershipChangedDuringMerge will need to loop through *all* members to fix the data instead of just the new members.
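
A minimal sketch of that suggestion, using only standard collections and hypothetical names (Node, locallyRegisteredServices, rebuildProviders are illustrative, not the actual EAP API): rebuild the provider set for a service from every current member's locally reported registrations rather than from the new members alone.

import java.util.*;

class MergeRepairSketch {

    // Hypothetical view of a cluster member exposing its node-local registrations.
    interface Node {
        String name();
        Set<String> locallyRegisteredServices();
    }

    // Rebuilds the provider set for one service from *all* current members,
    // not only from the members that are new in the merged view.
    static Set<String> rebuildProviders(String service, List<Node> allCurrentMembers) {
        Set<String> providers = new HashSet<>();
        for (Node node : allCurrentMembers) {
            if (node.locallyRegisteredServices().contains(service)) {
                providers.add(node.name());
            }
        }
        return providers;
    }
}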

Comment 20 Tomas Hofman 2015-04-08 12:25:33 UTC
PR: https://github.com/jbossas/jboss-eap/pull/2380

@dereed, carlo: Could you please review?

Comment 21 dereed 2015-04-08 17:26:37 UTC
I think the "if (newMembers.isEmpty())" block also needs to be removed, as we may still need to rebuild the cache in an asymetrical cluster split.
(When {A,B} and {B} merge, the cache might contain only B's data after the merge).

And since we're completely rebuilding the cache, it might be best to clear it out first just to make sure no stale data is left over.
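
A sketch of both suggestions together, again with hypothetical names and standard collections rather than the real membershipChangedDuringMerge signature: skip the early return on an empty newMembers list and replace the cache entry wholesale from what all members report locally.

import java.util.*;
import java.util.concurrent.*;

class FullRebuildSketch {

    // Stand-in for the replicated Infinispan cache: service name -> provider nodes.
    private final ConcurrentMap<String, Set<String>> replicatedCache = new ConcurrentHashMap<>();

    // 'reportedByAllMembers' maps each current member to the services it has
    // registered locally (the uncorrupted, node-local view).
    void onMerge(String service, List<String> newMembers, Map<String, Set<String>> reportedByAllMembers) {
        // No early return when newMembers is empty: in an asymmetrical merge such
        // as {A,B} + {B}, nobody is "new", yet the cache may hold only B's data.
        Set<String> rebuilt = ConcurrentHashMap.newKeySet();
        reportedByAllMembers.forEach((node, services) -> {
            if (services.contains(service)) {
                rebuilt.add(node);
            }
        });
        // Replace the entry wholesale instead of patching it, so no stale
        // providers from the split are left over.
        replicatedCache.put(service, rebuilt);
    }
}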

Comment 23 Tomas Hofman 2015-04-13 13:55:59 UTC
Here is a test case for this issue: https://github.com/TomasHofman/jboss-eap/commit/b637847585cac5af122d4cc6a6139cd6ea9549e1

It is being discussed whether to make it part of the official test suite.

Comment 28 Tomas Hofman 2015-04-20 14:23:34 UTC
Created new PR against 6.4.x: https://github.com/jbossas/jboss-eap/pull/2389

Old PR was closed.

Comment 29 Tomas Hofman 2015-04-23 08:51:50 UTC
Note that the test case was added to the PR, under the extendedTests profile of the EAP test suite.

Comment 31 Rostislav Svoboda 2015-04-27 08:50:35 UTC
Granting qa_ack to avoid regressions; we have received one-off BZ 1213425 in the week beginning April 27th.

Comment 33 Ladislav Thon 2015-05-15 08:20:44 UTC
Verified with EAP 6.4.1.CP.CR2.

Comment 34 JBoss JIRA Server 2015-06-05 09:08:54 UTC
Tomas Hofman <thofman> updated the status of jira WFLY-4724 to Coding In Progress

Comment 35 Petr Penicka 2017-01-17 09:59:58 UTC
Retroactively bulk-closing issues from released EAP 6.4 cumulative patches.

Comment 36 Petr Penicka 2017-01-17 10:00:39 UTC
Retroactively bulk-closing issues from released EAP 6.4 cumulative patches.

