Bug 806855 - SuspectedException blocks cluster formation
Summary: SuspectedException blocks cluster formation
Keywords:
Status: VERIFIED
Alias: None
Product: JBoss Data Grid 6
Classification: JBoss
Component: Infinispan
Version: 6.0.0
Hardware: Unspecified
OS: Unspecified
Priority: high
Severity: high
Target Milestone: ---
: 6.0.0
Assignee: Tristan Tarrant
QA Contact: Nobody
URL:
Whiteboard:
Depends On:
Blocks:
Reported: 2012-03-26 11:12 UTC by Michal Linhard
Modified: 2023-03-02 08:28 UTC (History)

Fixed In Version:
Doc Type: Bug Fix
Doc Text:
When a new node joined an existing JBoss Data Grid cluster, in some instances the new node split into a separate network partition before state transfer concluded and logged a SuspectException. As a result, the existing nodes did not receive a new JGroups view for ten minutes, after which state transfer failed. The cluster did not form with all three members as expected.
This problem occurred rarely and had two causes. First, the SuspectException on the new node did not allow a new cluster with just the single new node to form. Second, the two existing nodes in the cluster did not install a new JGroups cluster view for ten minutes, during which time state transfer remained blocked.
Clone Of:
Environment:
Last Closed:
Type: Bug
Embargoed:


Attachments
view installation times (825 bytes, text/html)
2012-03-26 11:13 UTC, Michal Linhard


Links
System ID Private Priority Status Summary Last Updated
Red Hat Issue Tracker ISPN-1944 0 Critical Resolved SuspectedException blocks cluster formation 2013-03-07 10:23:44 UTC

Description Michal Linhard 2012-03-26 11:12:54 UTC
http://hudson.qa.jboss.com/hudson/view/EDG6/view/EDG-REPORTS-RESILIENCE/job/edg-60-elasticity-repl-basic/13/

When the 3rd node joins, it receives a SuspectedException. The node's startup fails, and this holds up the view installation for 10 minutes, until it times out.

Comment 1 Michal Linhard 2012-03-26 11:13:33 UTC
Created attachment 572740
view installation times

Comment 2 JBoss JIRA Server 2012-03-26 11:58:33 UTC
Tristan Tarrant <ttarrant> made a comment on jira ISPN-1944

This is probably one for Dan

Comment 3 JBoss JIRA Server 2012-03-27 10:04:53 UTC
Dan Berindei <dberinde> made a comment on jira ISPN-1944

With the single target optimization in CommandAwareRpcDispatcher and the changes in JGroups for JGRP-1428, there are some scenarios now where a SuspectException is thrown even though we're using the SYNCHRONOUS_IGNORE_LEAVERS ResponseMode.
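The behaviour expected from an "ignore leavers" mode can be illustrated with a small toy model. This is only a sketch: `IgnoreLeaversSketch`, its `Mode` enum, `TargetSuspectedException` and `collect` are hypothetical names invented for illustration, not Infinispan or JGroups classes, and the code does not reproduce CommandAwareRpcDispatcher. It only shows the intent that a suspected target is skipped rather than reported as an error, regardless of whether the call has one target or several.

```java
import java.util.LinkedHashMap;
import java.util.List;
import java.util.Map;
import java.util.Set;

// Toy model only: none of these types are Infinispan or JGroups classes.
final class IgnoreLeaversSketch {
    /** Hypothetical stand-in for a strict mode and an "ignore leavers" mode. */
    enum Mode { STRICT, IGNORE_LEAVERS }

    /** Hypothetical stand-in for the exception raised when a target is suspected. */
    static final class TargetSuspectedException extends RuntimeException {
        TargetSuspectedException(String target) {
            super("Suspected member: " + target);
        }
    }

    /**
     * Collects replies from the targets. A suspected target raises an
     * exception in STRICT mode and is silently skipped in IGNORE_LEAVERS mode.
     * The same path is used for one target and for many, so a single-target
     * call cannot behave differently from a multi-target call.
     */
    static Map<String, String> collect(List<String> targets,
                                       Map<String, String> replies,
                                       Set<String> suspected,
                                       Mode mode) {
        Map<String, String> result = new LinkedHashMap<>();
        for (String target : targets) {
            if (suspected.contains(target)) {
                if (mode == Mode.STRICT) {
                    throw new TargetSuspectedException(target);
                }
                continue; // leaver: no reply is recorded, no exception is thrown
            }
            result.put(target, replies.get(target));
        }
        return result;
    }
}
```

In this model, a call whose only target has been suspected returns an empty map in IGNORE_LEAVERS mode instead of throwing.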

Comment 4 JBoss JIRA Server 2012-03-27 15:34:18 UTC
Dan Berindei <dberinde> made a comment on jira ISPN-1944

After examining the logs more closely, it seems the SuspectException is only part of the problem (and it's actually described in ISPN-1934).

At some point node03 suspects both node01 and node02 and excludes them from its view, but node01 neither removes node03 from its view nor tries to merge the partitions for 10 minutes (see the attached ispn-1944.txt log).

We need to find whether there is anything that we can change in the JGroups configuration to prevent this kind of situation. Bela suggested using MERGE3 (see ISPN-1951), but it's not available in JGroups 3.0.8.

Comment 5 Michal Linhard 2012-03-29 12:35:25 UTC
Shouldn't the version here stay ER4, with only the target version changed to ER7? Because I haven't found this in ER7 yet :-)

Comment 6 mark yarborough 2012-03-29 13:37:48 UTC
I think Michal is correct. We accidentally changed "version" instead of "target release". I will correct to:

version: ER6 or less
target release: ER7

Comment 7 mark yarborough 2012-04-04 13:16:58 UTC
    Technical note added. If any revisions are required, please edit the "Technical Notes" field
    accordingly. All revisions will be proofread by the Engineering Content Services team.
    
    New Contents:
CCFR - Dan

Comment 8 Dan Berindei 2012-04-05 02:17:58 UTC
https://issues.jboss.org/browse/ISPN-1934 is fixed, so I think we can close this bug as well.

We still have the problem with the cluster splitting after a join, but that only means that the state transfer takes a lot longer than it should. State transfer does eventually end (even though it doesn't necessarily keep the data consistent, see bug #808623), so we can't say that a SuspectException blocks the cluster formation any more.

Comment 9 Dan Berindei 2012-04-05 08:32:15 UTC
Oops, I thought ISPN-1934 was fixed, but I hadn't issued the pull request yet...

Comment 10 JBoss JIRA Server 2012-04-13 09:06:37 UTC
Dan Berindei <dberinde> updated the status of jira ISPN-1944 to Resolved

Comment 11 JBoss JIRA Server 2012-04-13 09:06:37 UTC
Dan Berindei <dberinde> made a comment on jira ISPN-1944

With ISPN-1934 fixed, I'm marking this as resolved as well.

Comment 12 Michal Linhard 2012-05-03 08:32:58 UTC
I haven't seen this in ER7 Testing

Comment 13 Misha H. Ali 2012-06-06 03:29:43 UTC
    Technical note updated. If any revisions are required, please edit the "Technical Notes" field
    accordingly. All revisions will be proofread by the Engineering Content Services team.
    
    Diffed Contents:
@@ -1 +1 @@
-CCFR - Dan
+<remark>CCFR - Dan</remark>

Comment 14 Dan Berindei 2012-06-11 20:36:34 UTC
Michal's log actually shows two problems:

1. When the coordinator commits a cache view, the view should be installed even if some of the members of the view had already left. This was not happening, but it was fixed in Infinispan 5.1.4.CR1.

2. The test was just starting a new node. However, the 3rd node somehow got into a partition by itself and the other nodes didn't notice it even after 10 minutes. This could be a JGroups problem, but it's very hard to diagnose as this is the only time we've ever seen it.
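The intent of item 1 can be sketched with a toy model. Everything here is an assumption made for illustration: `CacheViewCommitSketch` and `membersToInstall` are hypothetical names, not Infinispan code, and whether the real fix prunes the leavers at commit time or handles them in a later view is not stated in this report. The sketch only shows a commit that proceeds for the members still present instead of failing because one member is gone.

```java
import java.util.ArrayList;
import java.util.Collections;
import java.util.List;
import java.util.Set;

// Toy model only: not Infinispan's cache view handling.
final class CacheViewCommitSketch {
    /**
     * Returns the members that install the committed view. Members of the
     * committed view that are no longer live are skipped (assumption: they
     * are treated as leavers), so their absence does not block the commit.
     */
    static List<String> membersToInstall(List<String> committedView,
                                         Set<String> liveMembers) {
        List<String> installed = new ArrayList<>();
        for (String member : committedView) {
            if (liveMembers.contains(member)) {
                installed.add(member); // still present: installs the view
            }
            // otherwise: already left, ignored rather than treated as a failure
        }
        return Collections.unmodifiableList(installed);
    }
}
```

For a committed view [A, B, C] where C has already left, the model installs the view on A and B.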

Comment 15 Dan Berindei 2012-06-11 22:12:02 UTC
I have tried to replicate the 2nd issue with a JGroups-only test, but I was not able to.

FD_SOCK is enabled in the configuration, so when node C forms a new view by itself it closes the FD_SOCK connections to A and B, which immediately suspect it and form a new view by themselves. So I don't think it's possible to have the situation in the logs unless there was a real network problem that prevented C from closing the FD_SOCK connection to A and B gracefully.

Comment 16 Dan Berindei 2012-06-12 06:30:16 UTC
It looks like I was wrong about FD_SOCK: it actually doesn't detect immediately when a node has left the cluster (and the channel is still up, but seeing only a part of the cluster).

However, when C leaves the cluster, FD_ALL on C notices that it is the only member and it stops broadcasting heartbeat messages. After the FD_ALL timeout expires, A and B will suspect C and eliminate it from their view. So it still shouldn't be possible for A and B to see view A2 [A, B, C] for 10 minutes after C left...
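The timeout reasoning above can be sketched as a minimal heartbeat tracker. This is a toy model under stated assumptions, not the FD_ALL implementation: `HeartbeatTrackerSketch`, `heartbeat` and `suspects` are hypothetical names, and time is passed in explicitly so the example stays deterministic.

```java
import java.util.LinkedHashMap;
import java.util.Map;
import java.util.Set;
import java.util.TreeSet;

// Toy model only: a heartbeat-timeout detector, not JGroups' FD_ALL.
final class HeartbeatTrackerSketch {
    private final long timeoutMillis;
    private final Map<String, Long> lastHeard = new LinkedHashMap<>();

    HeartbeatTrackerSketch(long timeoutMillis) {
        this.timeoutMillis = timeoutMillis;
    }

    /** Records that a heartbeat from the given member arrived at nowMillis. */
    void heartbeat(String member, long nowMillis) {
        lastHeard.put(member, nowMillis);
    }

    /** Returns the members whose last heartbeat is older than the timeout. */
    Set<String> suspects(long nowMillis) {
        Set<String> result = new TreeSet<>();
        for (Map.Entry<String, Long> entry : lastHeard.entrySet()) {
            if (nowMillis - entry.getValue() > timeoutMillis) {
                result.add(entry.getKey());
            }
        }
        return result;
    }
}
```

If C stops sending heartbeats while B keeps sending them, a tracker on A starts reporting C once the timeout has elapsed since C's last heartbeat, which is why a stale view lasting 10 minutes is unexpected under this model.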

Comment 17 Dan Berindei 2012-06-12 14:48:03 UTC
Technical notes:

This bug appeared only once, with version 6.0.0.ER4. A new node joined an existing cluster, but before state transfer ended it split into a separate network partition and logged a SuspectException. The existing nodes did not receive a new JGroups view for 10 minutes, after which state transfer failed. The cluster never formed with all 3 members.

There were two separate problems:
* Because of the SuspectException on the joiner, it was never able to form a cluster by itself. This was fixed in Infinispan 5.1.4.CR1 with https://issues.jboss.org/browse/ISPN-1934.
* The other two nodes did not install a new JGroups cluster view for 10 minutes, blocking state transfer for this time. It is very likely that this was fixed in AS7 7.1.3 with https://issues.jboss.org/browse/AS7-4933.

Comment 18 Misha H. Ali 2012-06-12 15:13:03 UTC
    Technical note updated. If any revisions are required, please edit the "Technical Notes" field
    accordingly. All revisions will be proofread by the Engineering Content Services team.
    
    Diffed Contents:
@@ -1 +1,3 @@
-<remark>CCFR - Dan</remark>
+When a new node joined an existing JBoss Data Grid cluster, in some instances the new node splits into a separate network partition before the state transfer concludes and logged a SuspectException. As a result, the existing nodes do not receive a new JGroups view for ten minutes, after which the state transfer fails. The cluster did not form with all three members as expected.
+</para><para>
+This problem occurs rarely, due to two problems. First, the SuspectException on the new node does not allow a new cluster with just the single new node to form. Secondly, the two existing nodes in the cluster did not install a new JGroups cluster view for ten minutes, during which time state transfer remains blocked.

