Cluster Singleton services can run on multiple instances after a cluster merge
===================================================================

The service->node mappings used to determine which nodes are eligible to run a singleton are kept in Infinispan. During a cluster split/merge this data can become corrupted: any changes made during the split in all but one partition are lost due to Infinispan limitations, and which partition "wins" can differ per key. The mapping is rebuilt after a merge by asking new members for their data and fixing the list stored in Infinispan. But the new data is looked up in the same Infinispan store that has just been corrupted!
This was previously fixed upstream in commit 678cf687823ec189ad199b78645c99dc71e00272. The fix is to instead look the data up in a local store on each node (which already exists). org.jboss.as.clustering.service.ServiceProviderRegistryService#getServices needs to be replaced with the fixed version, org.wildfly.clustering.server.service.ServiceProviderRegistryFactoryService#getServices, from that commit.
(This is not just a copy/paste fix, since there are some minor differences in the method API between the versions, but it still looks like a fairly basic change.)
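To illustrate the shape of that change, here is a minimal sketch using plain Java collections as stand-ins for the Infinispan cache and the per-node local store. All class, field, and method names here are illustrative assumptions, not the actual EAP/WildFly API; the point is only that the post-merge lookup should read local per-node state, which a merge cannot corrupt, instead of the shared cache.

```java
import java.util.Collections;
import java.util.Map;
import java.util.Set;
import java.util.TreeSet;
import java.util.concurrent.ConcurrentHashMap;

// Hypothetical sketch: each node keeps a local record of the services it has
// registered, alongside the shared replicated cache. After a split/merge the
// shared cache may be corrupted, but the local record is always authoritative
// for this node.
class LocalLookupSketch {
    // Stand-in for the shared Infinispan cache: service name -> providing nodes.
    final Map<String, Set<String>> sharedCache = new ConcurrentHashMap<>();
    // Local, per-node record of this node's registered services.
    final Set<String> localServices = ConcurrentHashMap.newKeySet();
    final String localNode;

    LocalLookupSketch(String localNode) {
        this.localNode = localNode;
    }

    void register(String service) {
        localServices.add(service);
        sharedCache.computeIfAbsent(service, k -> ConcurrentHashMap.newKeySet())
                   .add(localNode);
    }

    // Broken variant: answers from the shared cache, which can be missing
    // this node's entries after a merge.
    Set<String> getServicesFromSharedCache() {
        Set<String> result = new TreeSet<>();
        sharedCache.forEach((service, nodes) -> {
            if (nodes.contains(localNode)) {
                result.add(service);
            }
        });
        return result;
    }

    // Fixed variant: answers from local state only, so the rebuild sees
    // correct data even when the shared cache is corrupted.
    Set<String> getServices() {
        return Collections.unmodifiableSet(localServices);
    }
}
```

The local store survives a merge intact because it is never replicated, which is why the rebuild can trust it.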
Couldn't reproduce this on the current EAP 6.4.x or 6.3.x branches. The following method was used:
1) started two instances of EAP using standalone-ha.xml,
2) deployed cluster-ha-singleton from jboss-as-quickstarts, which contains a singleton timer bean, on both instances,
3) the timer was running on the first instance only,
4) froze the first instance,
5) after the split, the singleton bean was instantiated on the second instance,
6) woke the first instance again,
7) after the merge, the singleton bean on the second instance was stopped.
I suspect another fix is needed as well (one that's not upstream yet). membershipChangedDuringMerge only loops through the new members, since their data has been removed from the cache; it keeps the existing Infinispan data for the current partition. But after the merge the cache will contain data from one of the sub-partitions (I believe it can differ per cache entry?), not necessarily from the current partition. I think membershipChangedDuringMerge will need to loop through *all* members to fix the data, instead of just the new members.
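A minimal sketch of that suggestion, again with plain collections standing in for Infinispan and with illustrative names rather than the actual EAP code: the post-merge rebuild asks every current member for its local data, because the surviving cache entries may have come from any sub-partition.

```java
import java.util.Collection;
import java.util.HashMap;
import java.util.Map;
import java.util.Set;
import java.util.TreeSet;

// Hypothetical sketch: rebuild the service -> providers mapping from the
// local data of *all* current members after a merge, not just the newly
// joined ones.
class MergeRebuildSketch {
    // Illustrative stand-in for a cluster member and its local service data.
    static class Member {
        final String name;
        final Set<String> localServices;

        Member(String name, Set<String> localServices) {
            this.name = name;
            this.localServices = localServices;
        }
    }

    static Map<String, Set<String>> rebuild(Collection<Member> allMembers) {
        Map<String, Set<String>> mapping = new HashMap<>();
        for (Member m : allMembers) {
            for (String service : m.localServices) {
                // Every member contributes its own entries, so the result is
                // correct regardless of which sub-partition's cache survived.
                mapping.computeIfAbsent(service, k -> new TreeSet<>()).add(m.name);
            }
        }
        return mapping;
    }
}
```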
PR: https://github.com/jbossas/jboss-eap/pull/2380 @dereed, carlo: Could you please review?
I think the "if (newMembers.isEmpty())" block also needs to be removed, as we may still need to rebuild the cache after an asymmetrical cluster split. (When {A,B} and {B} merge, the cache might contain only B's data after the merge.) And since we're completely rebuilding the cache, it might be best to clear it out first just to make sure no stale data is left over.
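A sketch of both suggestions together, under the same illustrative stand-ins as above (not the actual EAP code): no early return when there are no new members, and a cache clear before the rebuild.

```java
import java.util.HashMap;
import java.util.Map;
import java.util.Set;
import java.util.TreeSet;

// Hypothetical sketch of the merge handler: (1) always rebuild, even when no
// new members joined, since an asymmetrical merge like {A,B} + {B} can leave
// the cache holding only one sub-partition's data; (2) clear the cache first
// so no stale entries survive the rebuild.
class MergeCacheSketch {
    // Stand-in for the replicated cache: service name -> providing members.
    final Map<String, Set<String>> cache = new HashMap<>();

    void membershipChangedDuringMerge(Map<String, Set<String>> perMemberLocalData) {
        // Note: no "if (newMembers.isEmpty()) return;" short-circuit here.
        cache.clear(); // drop potentially stale sub-partition data first
        perMemberLocalData.forEach((member, services) -> {
            for (String service : services) {
                cache.computeIfAbsent(service, k -> new TreeSet<>()).add(member);
            }
        });
    }
}
```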
Here is a test case for this issue: https://github.com/TomasHofman/jboss-eap/commit/b637847585cac5af122d4cc6a6139cd6ea9549e1 It is being discussed whether to make it part of the official test suite.
Created a new PR against 6.4.x: https://github.com/jbossas/jboss-eap/pull/2389 The old PR was closed.
Note that the test case was added to the PR under the extendedTests profile of the EAP test suite.
qa_acking to avoid regressions; we have received one-off BZ 1213425 in the week beginning April 27th.
Verified with EAP 6.4.1.CP.CR2.
Tomas Hofman <thofman> updated the status of jira WFLY-4724 to Coding In Progress
Retroactively bulk-closing issues from released EAP 6.4 cumulative patches.