Bug 750472

Summary: FAILED TO RECEIVE seen during 10 node startup
Product: [Retired] Corosync Cluster Engine
Component: totem
Version: 1.3
Hardware: Unspecified
OS: Linux
Status: CLOSED UPSTREAM
Severity: unspecified
Priority: unspecified
Reporter: John Thompson <thompa26>
Assignee: Steven Dake <sdake>
CC: agk, asalkeld, fdinitto, jfriesse, sdake
Doc Type: Bug Fix
Last Closed: 2011-11-29 17:32:58 UTC
Attachments:
  Blackbox information (flags: none)
  Blackbox information of original problem - fail to recv const = 50 (flags: none)

Description John Thompson 2011-11-01 09:08:51 UTC
Description of problem:
During bootup of a 10-node cluster (with all nodes starting corosync at roughly the same time), we are seeing a node hit the FAILED TO RECEIVE case.

This appears to happen because, after a number of the nodes have joined together and are transitioning to OPERATIONAL, another member join occurs while some of the joined nodes are still in RECOVERY.

This issue was first discussed in the mailing list (Linux Foundation) under the title:
[Openais] Problems forming cluster on corosync startup

When fail_recv_const is increased to 5000 the protocol appears to block forever.  This test was requested during the mailing list conversation, with 3 possible outcomes suggested:
1. the protocol blocks forever
2. the protocol enters operational after some short period
3. fail to recv is printed after a long period of time (1-10 minutes).

#2 is not occurring.  

I haven't properly confirmed whether it is #1 or #3, as the nodes healthcheck after a period and I am trying to leave it running long enough to be certain.  Six minutes after startup the FAILED TO RECEIVE message had not been seen.
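
For reference, fail_recv_const is a directive in the totem stanza of corosync.conf; a minimal sketch of such an override (values are illustrative only, not the actual configuration from this cluster):

    totem {
        version: 2
        # Number of token rotations without receiving any of the missing
        # messages before the FAILED TO RECEIVE state is entered.
        # The 1.3.x default is 50; 5000 is the value under test here.
        fail_recv_const: 5000
    }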

Version-Release number of selected component (if applicable):
Corosync 1.3.1
In addition, we have applied the following patches:
Revert "totemsrp: Remove recv_flush code"
Ignore memb_join messages during flush operations
A private patch from Steve Dake regarding the transitional membership calculation

How reproducible:
1 in 40 attempts

Steps to Reproduce:
1. Have a 10-node cluster start up at approximately the same time.
  
Actual results:
A FAILED TO RECEIVE state occurs and the cluster eventually recovers, but only after a lot of churn and node leave & join events.

Expected results:
The member join should be processed without the FAILED TO RECEIVE state occurring.

Additional info:
Blackbox output was requested for a few seconds after startup and again 2 minutes later, with fail_recv_const = 5000.  I will attach this soon.

We're using corosync 1.3.1 on an embedded Linux system (with a low-spec CPU).
Corosync is running over a basic Ethernet interface (no hubs/routers/etc).

Comment 1 John Thompson 2011-11-01 09:14:10 UTC
Created attachment 531087 [details]
Blackbox information

Blackbox information from each node just after startup & two minutes later

Comment 2 Steven Dake 2011-11-01 14:47:46 UTC
This implies your multicast is not working correctly in your network.  Can you try broadcast mode or udpu to verify it is not an environmental problem?

Thanks
-steve
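
For reference, both suggestions are totem-level changes in corosync.conf; a rough sketch, assuming corosync 1.3.x syntax and with hypothetical addresses drawn from the 192.168.255.x range seen in the attached logs:

    # Option A: UDP unicast (udpu) - avoids multicast entirely, but every
    # cluster member must be listed explicitly.
    totem {
        version: 2
        transport: udpu
        interface {
            ringnumber: 0
            bindnetaddr: 192.168.255.0      # hypothetical subnet
            member {
                memberaddr: 192.168.255.1
            }
            member {
                memberaddr: 192.168.255.3
            }
            # ...one member block per node...
        }
    }

    # Option B: keep the default UDP multicast transport but switch the
    # interface to broadcast instead of a multicast group.
    # interface {
    #     ringnumber: 0
    #     bindnetaddr: 192.168.255.0
    #     broadcast: yes
    #     mcastport: 5405
    # }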

Comment 3 John Thompson 2011-11-03 08:02:56 UTC
I have tried broadcast mode and it also fails in a similar way.  I haven't been able to get udpu working.

I will double check the underlying connections to see if any packets are being dropped.

Comment 4 John Thompson 2011-11-08 02:10:56 UTC
I have checked the underlying connections between the nodes and I do not see any errors or drops occurring when the failure happens.

This last time I captured the debug output 20 minutes after startup and the FAILED TO RECEIVE message still had not been seen.  The protocol appears to be blocking indefinitely.

What appears to be occurring is that only some of the nodes in the ring have moved from RECOVERY to OPERATIONAL when a member join is seen.  The problem then appears the next time the ring forms.

Let me know if there is any more information I can provide.

Thanks,
John

Comment 5 John Thompson 2011-11-21 02:05:41 UTC
Created attachment 534702 [details]
Blackbox information of original problem - fail to recv const = 50

This is blackbox information taken two minutes after startup on each of the nodes, with a fail_recv_const of 50.

It seems that when the cluster is larger and a new node joins, not all of the nodes in the cluster manage to transition to OPERATIONAL before another node joins.  This seems to lead to the FAILED TO RECEIVE.

Comment 6 Steven Dake 2011-11-21 19:08:44 UTC
Increase fail_recv_const to 1000; 50 is too low for large node counts.  I am pretty sure we changed this default in the source base.

Please report back if larger fail to recv const solves the problem.

Regards
-steve

Comment 7 John Thompson 2011-11-21 22:57:20 UTC
I have tried with a fail_recv_const of 1000 and the problem is not solved.

With the high fail_recv_const the protocol appears to just sit there and nothing else occurs - no totem debug comes out.  I kept it in this state for 10 minutes and nothing happened.

The attachment "Blackbox information", which was produced with fail_recv_const = 5000, shows the same type of result as what I reproduced this morning and should contain debug if that is helpful.

I have found the change that increased the fail_recv_const default in the source - it is only in 1.4 and above, not in 1.3, which we are using.

Comment 8 Steven Dake 2011-11-21 23:08:23 UTC
Is this a fedora 16 bug?  Fedora 16 does not ship corosync 1.3.z.  If this is not a fedora bug, I'll move it to the proper location (community bugs).

We have fixed many protocol bugs between 1.3.1 and 1.3.4.  The z stream is for bug fixes _only_ and is typically based upon thousands of field deployments hammering the software and unfortunately finding very difficult to reproduce cases, usually triggered by environmental issues.  We do zero feature development in a z stream.  As such, I'd recommend you give it a try.  The scenario you describe sounds like it could be one, or a combination, of the things fixed in the z streams.

Comment 9 John Thompson 2011-11-22 01:26:55 UTC
Sorry, this isn't a Fedora 16 bug; I wasn't too sure where to file the problem.

I understand that you are suggesting that we try out 1.3.4 as there have been a number of fixes since 1.3.1.

Tim Beale and I have tried pulling in a couple of the fixes that looked like they might have an effect on this problem, but there wasn't any improvement:
Revert "totemsrp: Remove recv_flush code"
Ignore memb_join messages during flush operations

Comment 10 Steven Dake 2011-11-22 05:47:55 UTC
Pulling in a couple of fixes likely won't get it done.  You need _all_ the patches (some work together or interact).  Just give it a try and see if it resolves the problem.  I can't spend time re-debugging issues that are already fixed.  If the problem persists then I will investigate further.

Regards
-steve

Comment 11 John Thompson 2011-11-29 08:41:10 UTC
I have performed the update to corosync 1.3.4 and retested with a fail to recv const = 1000.

I do not see a FAILED TO RECEIVE message.

There is still a problem when the scenario mentioned in this report occurs - a node joins just as the cluster is finishing a join for another node (some nodes have gone from RECOVERY to OPERATIONAL and some nodes are still in RECOVERY).

The problem is that we see a CLM & CPG config change with some of the nodes marked as left.  A second CLM config change is then seen with them having joined again.

For CPG the left nodes don't rejoin the group - as shown by corosync-cpgtool, they are no longer group members.  These nodes have not actually left and are still part of the cluster; further debug shows that they are still part of the membership.
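
For context, those left/joined lists are exactly what a CPG client sees in its configuration-change callback; a minimal sketch of a watcher using the libcpg API (group name and program structure are hypothetical, error handling mostly omitted):

    #include <stdio.h>
    #include <stdint.h>
    #include <corosync/cpg.h>

    /* No-op delivery callback; this sketch only watches membership. */
    static void deliver_cb(cpg_handle_t handle, const struct cpg_name *group_name,
                           uint32_t nodeid, uint32_t pid, void *msg, size_t msg_len)
    {
    }

    /* Called on every CPG configuration change.  member_list is the new group
     * membership; left_list/joined_list are the delta.  In the scenario above,
     * nodes that never really left the cluster show up in left_list but are
     * not reported again in joined_list. */
    static void confchg_cb(cpg_handle_t handle, const struct cpg_name *group_name,
                           const struct cpg_address *member_list, size_t member_list_entries,
                           const struct cpg_address *left_list, size_t left_list_entries,
                           const struct cpg_address *joined_list, size_t joined_list_entries)
    {
        size_t i;

        for (i = 0; i < left_list_entries; i++)
            printf("left:   nodeid %u pid %u reason %u\n",
                   (unsigned)left_list[i].nodeid, (unsigned)left_list[i].pid,
                   (unsigned)left_list[i].reason);
        for (i = 0; i < joined_list_entries; i++)
            printf("joined: nodeid %u pid %u\n",
                   (unsigned)joined_list[i].nodeid, (unsigned)joined_list[i].pid);
    }

    static cpg_callbacks_t callbacks = {
        .cpg_deliver_fn = deliver_cb,
        .cpg_confchg_fn = confchg_cb,
    };

    int main(void)
    {
        cpg_handle_t handle;
        struct cpg_name group = { .length = 5, .value = "watch" };  /* hypothetical group */

        if (cpg_initialize(&handle, &callbacks) != CS_OK)
            return 1;
        if (cpg_join(handle, &group) != CS_OK)
            return 1;
        cpg_dispatch(handle, CS_DISPATCH_BLOCKING);  /* run callbacks until interrupted */
        cpg_finalize(handle);
        return 0;
    }

Built with something like gcc -o cpg-watch cpg-watch.c -lcpg and left running on a member node, it would print the left/joined deltas as the joins described above are reproduced.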

The "left" nodes have a more advanced ring seq id because they have gone OPERATIONAL.  In this case the nodes that were still in RECOVERY for the last node join are on the previous ring seq id; this causes the transitional membership to be calculated such that the OPERATIONAL nodes are removed, and they therefore appear to have left the cluster.

When this type of situation occurs, is the cluster considered to have split and to be reforming?  Would this be why the nodes are reported as having left and then rejoined in CLM?

Should this be something I email the mailing list about or leave in this report?

[TOTEM ] totemsrp.c:1998 entering RECOVERY state.
[TOTEM ] totemsrp.c:2040 TRANS [0] member 192.168.255.1:
[TOTEM ] totemsrp.c:2040 TRANS [1] member 192.168.255.10:
[TOTEM ] totemsrp.c:2040 TRANS [2] member 192.168.255.11:
[TOTEM ] totemsrp.c:2040 TRANS [3] member 192.168.255.12:
[TOTEM ] totemsrp.c:2044 position [0] member 192.168.255.1:
[TOTEM ] totemsrp.c:2048 previous ring seq 20 rep 192.168.255.1
[TOTEM ] totemsrp.c:2054 aru 60 high delivered 60 received flag 1
[TOTEM ] totemsrp.c:2044 position [1] member 192.168.255.3:
[TOTEM ] totemsrp.c:2048 previous ring seq 24 rep 192.168.255.1
[TOTEM ] totemsrp.c:2054 aru 0 high delivered 0 received flag 1
[TOTEM ] totemsrp.c:2044 position [2] member 192.168.255.4:
[TOTEM ] totemsrp.c:2048 previous ring seq 24 rep 192.168.255.1
[TOTEM ] totemsrp.c:2054 aru 0 high delivered 0 received flag 1
[TOTEM ] totemsrp.c:2044 position [3] member 192.168.255.5:
[TOTEM ] totemsrp.c:2048 previous ring seq 0 rep 192.168.255.5
[TOTEM ] totemsrp.c:2054 aru 0 high delivered 0 received flag 1
[TOTEM ] totemsrp.c:2044 position [4] member 192.168.255.6:
[TOTEM ] totemsrp.c:2048 previous ring seq 24 rep 192.168.255.1
[TOTEM ] totemsrp.c:2054 aru 0 high delivered 0 received flag 1
[TOTEM ] totemsrp.c:2044 position [5] member 192.168.255.8:
[TOTEM ] totemsrp.c:2048 previous ring seq 24 rep 192.168.255.1
[TOTEM ] totemsrp.c:2054 aru 0 high delivered 0 received flag 1
[TOTEM ] totemsrp.c:2044 position [6] member 192.168.255.9:
[TOTEM ] totemsrp.c:2048 previous ring seq 24 rep 192.168.255.1
[TOTEM ] totemsrp.c:2054 aru 0 high delivered 0 received flag 1
[TOTEM ] totemsrp.c:2044 position [7] member 192.168.255.10:
[TOTEM ] totemsrp.c:2048 previous ring seq 20 rep 192.168.255.1
[TOTEM ] totemsrp.c:2054 aru 60 high delivered 60 received flag 1
[TOTEM ] totemsrp.c:2044 position [8] member 192.168.255.11:
[TOTEM ] totemsrp.c:2048 previous ring seq 20 rep 192.168.255.1
[TOTEM ] totemsrp.c:2054 aru 60 high delivered 60 received flag 1
[TOTEM ] totemsrp.c:2044 position [9] member 192.168.255.12:
[TOTEM ] totemsrp.c:2048 previous ring seq 20 rep 192.168.255.1
[TOTEM ] totemsrp.c:2054 aru 60 high delivered 60 received flag 1

Comment 12 Steven Dake 2011-11-29 17:32:58 UTC
Please file a new bug for Comment #11.  Closing the rest of this bug as UPSTREAM.