http://hudson.qa.jboss.com/hudson/view/EDG6/view/EDG-REPORTS-RESILIENCE/job/edg-60-elasticity-repl-basic/13/ When the 3rd node joins, it receives a SuspectException. The node's startup fails, and this holds up the view installation for 10 minutes until it times out.
Created attachment 572740 [details] view installation times
Tristan Tarrant <ttarrant> made a comment on jira ISPN-1944 This is probably one for Dan
Dan Berindei <dberinde> made a comment on jira ISPN-1944 With the single target optimization in CommandAwareRpcDispatcher and the changes in JGroups for JGRP-1428, there are some scenarios now where a SuspectException is thrown even though we're using the SYNCHRONOUS_IGNORE_LEAVERS ResponseMode.
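To illustrate what "ignoring leavers" is expected to mean here, the following is a minimal sketch in plain Java. It is not Infinispan or JGroups code: the class `IgnoreLeaversSketch`, the exception `SuspectedTargetException` and the method `collect` are hypothetical names, and a `null` response is only an assumed way of modelling a target that was suspected before it replied. The point is that in ignore-leavers mode a suspected target is skipped, whereas otherwise the caller gets an exception.

```java
import java.util.LinkedHashMap;
import java.util.Map;

// Hypothetical model of the expected semantics; not Infinispan or JGroups code.
class IgnoreLeaversSketch {

    /** Stand-in for the exception raised when a target is suspected. */
    static class SuspectedTargetException extends RuntimeException {
        SuspectedTargetException(String node) {
            super("Suspected member: " + node);
        }
    }

    /**
     * Collects the responses of all targets. A null value in rawResponses
     * models a target that was suspected before it replied (an assumption
     * made for this sketch only).
     */
    static Map<String, String> collect(Map<String, String> rawResponses,
                                       boolean ignoreLeavers) {
        Map<String, String> result = new LinkedHashMap<>();
        for (Map.Entry<String, String> e : rawResponses.entrySet()) {
            if (e.getValue() == null) {
                if (ignoreLeavers) {
                    continue; // skip the leaver, keep the other responses
                }
                throw new SuspectedTargetException(e.getKey());
            }
            result.put(e.getKey(), e.getValue());
        }
        return result;
    }

    public static void main(String[] args) {
        Map<String, String> raw = new LinkedHashMap<>();
        raw.put("node01", "ok");
        raw.put("node03", null); // suspected before replying
        System.out.println(collect(raw, true));
    }
}
```

In this model, the scenario described above corresponds to a code path that behaves as if `ignoreLeavers` were `false` even though the caller asked for the ignore-leavers mode.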
Dan Berindei <dberinde> made a comment on jira ISPN-1944 After examining the logs closer, it seems the SuspectException is only part of the problem (and it's actually described in ISPN-1934). At some point node3 suspects both node01 and node03 and excludes them from the view, but node01 neither removes node03 from its view nor tries to merge the partitions for 10 minutes (see the attached ispn-1944.txt log). We need to find whether there is anything that we can change in the JGroups configuration to prevent this kind of situation. Bela suggested using MERGE3 (see ISPN-1951), but it's not available in JGroups 3.0.8.
Shouldn't the version here stay ER4, with only the target version changed to ER7? I ask because I haven't found this in ER7 yet :-)
I think Michal is correct. We accidentally changed "version" instead of "target release". I will correct to: version: ER6 or less; target release: ER7
Technical note added. If any revisions are required, please edit the "Technical Notes" field accordingly. All revisions will be proofread by the Engineering Content Services team. New Contents: CCFR - Dan
https://issues.jboss.org/browse/ISPN-1934 is fixed, so I think we can close this bug as well. We still have the problem with the cluster splitting after a join, but that only means that the state transfer takes a lot longer than it should. State transfer does eventually end (even though it doesn't necessarily keep the data consistent, see bug #808623), so we can't say that a SuspectException blocks the cluster formation any more.
Oops, I thought ISPN-1934 was fixed, but I hadn't issued the pull request yet...
Dan Berindei <dberinde> updated the status of jira ISPN-1944 to Resolved
Dan Berindei <dberinde> made a comment on jira ISPN-1944 With ISPN-1934 fixed, I'm marking this as resolved as well.
I haven't seen this in ER7 Testing
Technical note updated. If any revisions are required, please edit the "Technical Notes" field accordingly. All revisions will be proofread by the Engineering Content Services team. Diffed Contents:
@@ -1 +1 @@
-CCFR - Dan
+<remark>CCFR - Dan</remark>
Michal's log actually shows two problems:
1. When the coordinator commits a cache view, the view should be installed even if some of the members of the view had already left. This was not happening, but it was fixed in Infinispan 5.1.4.CR1.
2. The test was just starting a new node. However, the 3rd node somehow got into a partition by itself and the other nodes didn't notice it even after 10 minutes. This could be a JGroups problem, but it's very hard to diagnose as this is the only time we've ever seen it.
I have tried to replicate the 2nd issue with a JGroups-only test, but I was not able to. FD_SOCK is enabled in the configuration, so when node C forms a new view by itself it closes the FD_SOCK connections to A and B, which immediately suspect it and form a new view by themselves. So I don't think it's possible to have the situation in the logs unless there was a real network problem that prevented C from closing the FD_SOCK connection to A and B gracefully.
It looks like I was wrong about FD_SOCK: it doesn't actually detect immediately when a node has left the cluster (while the channel is still up, but seeing only a part of the cluster). However, when C leaves the cluster, FD_ALL on C notices that it is the only member and it stops broadcasting heartbeat messages. After the FD_ALL timeout expires, A and B will suspect C and eliminate it from their view. So it still shouldn't be possible for A and B to see view A2 [A, B, C] for 10 minutes after C left...
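The heartbeat reasoning above can be sketched as a small plain-Java model. This is not JGroups code and does not reproduce the FD_ALL implementation: `HeartbeatDetectorSketch`, its methods and the 10-second timeout used in `main` are hypothetical, chosen only to show why a member that stops sending heartbeats should be suspected once the timeout has passed.

```java
import java.util.ArrayList;
import java.util.List;
import java.util.Map;
import java.util.TreeMap;

// Hypothetical model of heartbeat-based failure detection; not JGroups code.
class HeartbeatDetectorSketch {
    private final long timeoutMillis;
    // Last time a heartbeat was seen from each member (sorted by member name).
    private final Map<String, Long> lastHeartbeat = new TreeMap<>();

    HeartbeatDetectorSketch(long timeoutMillis) {
        this.timeoutMillis = timeoutMillis;
    }

    /** Records a heartbeat received from a member at the given time. */
    void heartbeat(String member, long nowMillis) {
        lastHeartbeat.put(member, nowMillis);
    }

    /** Returns the members whose last heartbeat is older than the timeout. */
    List<String> suspects(long nowMillis) {
        List<String> result = new ArrayList<>();
        for (Map.Entry<String, Long> e : lastHeartbeat.entrySet()) {
            if (nowMillis - e.getValue() > timeoutMillis) {
                result.add(e.getKey());
            }
        }
        return result;
    }

    public static void main(String[] args) {
        // View of node A: B keeps sending heartbeats, C stops after t=0.
        HeartbeatDetectorSketch onNodeA = new HeartbeatDetectorSketch(10_000L);
        onNodeA.heartbeat("B", 0L);
        onNodeA.heartbeat("C", 0L);
        onNodeA.heartbeat("B", 8_000L);
        System.out.println("Suspects at t=12s: " + onNodeA.suspects(12_000L));
    }
}
```

Under this model, A would suspect C shortly after the timeout, which is why a stale view lasting 10 minutes is unexpected.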
Technical notes: This bug appeared only once, with version 6.0.0.ER4. A new node joined an existing cluster, but before state transfer ended it split into a separate network partition and logged a SuspectException. The existing nodes did not receive a new JGroups view for 10 minutes, after which state transfer failed. The cluster never formed with all 3 members.
There were two separate problems:
* Because of the SuspectException on the joiner, it was never able to form a cluster by itself. This was fixed in Infinispan 5.1.4.CR1 with https://issues.jboss.org/browse/ISPN-1934.
* The other two nodes did not install a new JGroups cluster view for 10 minutes, blocking state transfer for this time. It is very likely that this was fixed in AS7 7.1.3 with https://issues.jboss.org/browse/AS7-4933.
Technical note updated. If any revisions are required, please edit the "Technical Notes" field accordingly. All revisions will be proofread by the Engineering Content Services team. Diffed Contents:
@@ -1 +1,3 @@
-<remark>CCFR - Dan</remark>
+When a new node joined an existing JBoss Data Grid cluster, in some instances the new node split into a separate network partition before the state transfer concluded and logged a SuspectException. As a result, the existing nodes did not receive a new JGroups view for ten minutes, after which the state transfer failed. The cluster did not form with all three members as expected.
+</para><para>
+This problem occurred rarely and had two causes. First, the SuspectException on the new node did not allow a new cluster with just the single new node to form. Second, the two existing nodes in the cluster did not install a new JGroups cluster view for ten minutes, during which time state transfer remained blocked.