| Summary: | FAILED TO RECEIVE seen during 10 node startup | | |
|---|---|---|---|
| Product: | [Retired] Corosync Cluster Engine | Reporter: | John Thompson <thompa26> |
| Component: | totem | Assignee: | Steven Dake <sdake> |
| Status: | CLOSED UPSTREAM | QA Contact: | |
| Severity: | unspecified | Docs Contact: | |
| Priority: | unspecified | | |
| Version: | 1.3 | CC: | agk, asalkeld, fdinitto, jfriesse, sdake |
| Target Milestone: | --- | | |
| Target Release: | --- | | |
| Hardware: | Unspecified | | |
| OS: | Linux | | |
| Whiteboard: | | | |
| Fixed In Version: | | Doc Type: | Bug Fix |
| Doc Text: | | Story Points: | --- |
| Clone Of: | | Environment: | |
| Last Closed: | 2011-11-29 17:32:58 UTC | Type: | --- |
| Regression: | --- | Mount Type: | --- |
| Documentation: | --- | CRM: | |
| Verified Versions: | | Category: | --- |
| oVirt Team: | --- | RHEL 7.3 requirements from Atomic Host: | |
| Cloudforms Team: | --- | Target Upstream Version: | |
| Attachments: | Blackbox information (531087); Blackbox information of original problem - fail to recv const = 50 (534702) | | |
Description
John Thompson
2011-11-01 09:08:51 UTC
Created attachment 531087 [details]
Blackbox information
Blackbox information from each node just after startup & two minutes later
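For readers reproducing this, blackbox data of this kind is normally gathered with the flight-recorder tooling that ships with corosync. The commands below are a minimal sketch based on the stock corosync-blackbox and corosync-fplay utilities and the default flight-data location; they are an assumption, not a record of how these attachments were actually produced:

```
# Ask the running corosync daemon to dump its flight-recorder (blackbox) data
corosync-blackbox

# Replay the recorded flight data into a readable per-node report
# (reads the default flight-data file written by the dump above)
corosync-fplay > blackbox-$(hostname).txt
```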
This implies your multicast is not working correctly in your network. Can you try broadcast mode or udpu to verify it is not an environmental problem? Thanks -steve

I have tried broadcast mode and it also fails in a similar way. I haven't been able to get udpu working. I will double-check the underlying connections to see if any packets are being dropped.

I have checked the underlying connections between the nodes and I do not see any errors or drops occurring when the failure happens. This last time I captured the debug output 20 minutes after startup and the FAILED TO RECEIVE message had not been seen; the protocol appears to be blocking indefinitely. What appears to be occurring is that only some nodes in the ring have moved out of RECOVERY to OPERATIONAL when a member join is seen. The next time the ring forms seems to be when the problem occurs. Let me know if there is any more information I can provide. Thanks, John

Created attachment 534702 [details]
Blackbox information of original problem - fail to recv const = 50

This is blackbox information taken two minutes after startup on each of the nodes, captured with a fail to recv const of 50.
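For reference, the broadcast and udpu alternatives suggested above are selected in the totem section of corosync.conf. Below is a minimal sketch of a udpu configuration assuming the corosync 1.3.x member-list syntax; the addresses, netmask, and port are illustrative placeholders, not this cluster's actual layout:

```
totem {
    version: 2

    # Use UDP unicast instead of multicast; every peer must be listed.
    transport: udpu

    interface {
        ringnumber: 0
        bindnetaddr: 192.168.255.0
        mcastport: 5405

        # One member block per node in the cluster.
        member {
            memberaddr: 192.168.255.1
        }
        member {
            memberaddr: 192.168.255.3
        }
        # ... remaining nodes ...

        # Alternatively, for broadcast mode drop 'transport: udpu' and the
        # member list above and set:
        # broadcast: yes
    }
}
```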
It seems that when the cluster is larger and a new node joins, not all nodes in the cluster manage to transition to OPERATIONAL before yet another node joins. This appears to lead to the FAILED TO RECEIVE.
Increase fail to recv const to 1000. 50 is too low for large node counts. I am pretty sure we changed this in the source base defaults. Please report back if a larger fail to recv const solves the problem. Regards -steve

I have tried with a fail to recv const of 1000 and the problem is not solved. With the high fail to recv const the protocol appears to sit idle and nothing else occurs - no totem debug output comes out. I kept it in this state for 10 minutes and nothing happened. The attachment Blackbox information, which was produced with fail to recv const = 5000, shows the same type of result as what I reproduced this morning and should contain debug output if that is helpful. I have found the change that increased the fail to recv const in the source - it is only in 1.4 and above, not in 1.3, which we are using.

Is this a Fedora 16 bug? Fedora 16 does not ship corosync 1.3.z. If this is not a Fedora bug, I'll move it to the proper location (community bugs). We have fixed many protocol bugs between 1.3.1 and 1.3.4. The z stream is for bug fixes _only_ and is typically based upon thousands of field deployments hammering the software and unfortunately finding very difficult to reproduce cases, usually triggered by environmental issues. We do zero feature development in a z stream. As such, I'd recommend you give it a try. The scenario you describe sounds like it could be one or more, or a combination, of the things fixed in the z-streams.

Sorry, this isn't a Fedora 16 bug; I wasn't too sure what to file the problem under. I understand that you are suggesting we try out 1.3.4, as there have been a number of fixes since 1.3.1. Tim Beale and I have tried pulling in a couple of the fixes that looked like they might have an effect on this operation, but there wasn't any improvement:

Revert "totemsrp: Remove recv_flush code"
Ignore memb_join messages during flush operations

Pulling in a couple of fixes likely won't get it done. You need _all_ the patches (some work together or interact). Just give it a try and see if it resolves the problem. I can't spend time re-debugging issues that are already fixed. If the problem persists then I will investigate further. Regards -steve

I have performed the update to corosync 1.3.4 and retested with fail to recv const = 1000. I do not see a FAILED TO RECEIVE message. There is still a problem when the scenario mentioned in this report occurs - a node joins just as the cluster is finishing off a join for another node (some nodes have gone OPERATIONAL from RECOVERY and some nodes are still in RECOVERY). The problem is that we see a CLM and CPG config change with some of the nodes marked as having left. A second CLM config change is then seen with them having joined again. For CPG the left nodes don't rejoin the group - as shown in corosync-cpgtool, they are not group members any more. These nodes have not actually left but are still part of the cluster; further debug shows they are still part of the membership. The left nodes have a more advanced ring seq id, as they had gone operational, while the nodes that were in RECOVERY for the last node join are on the previous ring seq id. This causes the transitional membership to be calculated such that the operational nodes are removed, and they are therefore reported as having left the cluster (see the TOTEM recovery-state excerpt below). When this type of situation occurs, is the cluster considered to have split and be reforming? Would this be why the nodes are considered to have left and then rejoined in CLM? Should I email the mailing list about this, or leave it in this report?
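For reference, the constant being tuned in this exchange is fail_recv_const in the totem section of corosync.conf. A minimal sketch of raising it as suggested; the value mirrors the discussion above, while the rest of the stanza is assumed boilerplate rather than the reporter's actual configuration:

```
totem {
    version: 2

    # Number of token rotations without receiving an expected message that
    # are tolerated before the FAILED TO RECEIVE path forms a new
    # configuration. The old default of 50 is too low for larger clusters;
    # later corosync releases (1.4+) raised the default.
    fail_recv_const: 1000
}
```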
```
[TOTEM ] totemsrp.c:1998 entering RECOVERY state.
[TOTEM ] totemsrp.c:2040 TRANS [0] member 192.168.255.1:
[TOTEM ] totemsrp.c:2040 TRANS [1] member 192.168.255.10:
[TOTEM ] totemsrp.c:2040 TRANS [2] member 192.168.255.11:
[TOTEM ] totemsrp.c:2040 TRANS [3] member 192.168.255.12:
[TOTEM ] totemsrp.c:2044 position [0] member 192.168.255.1:
[TOTEM ] totemsrp.c:2048 previous ring seq 20 rep 192.168.255.1
[TOTEM ] totemsrp.c:2054 aru 60 high delivered 60 received flag 1
[TOTEM ] totemsrp.c:2044 position [1] member 192.168.255.3:
[TOTEM ] totemsrp.c:2048 previous ring seq 24 rep 192.168.255.1
[TOTEM ] totemsrp.c:2054 aru 0 high delivered 0 received flag 1
[TOTEM ] totemsrp.c:2044 position [2] member 192.168.255.4:
[TOTEM ] totemsrp.c:2048 previous ring seq 24 rep 192.168.255.1
[TOTEM ] totemsrp.c:2054 aru 0 high delivered 0 received flag 1
[TOTEM ] totemsrp.c:2044 position [3] member 192.168.255.5:
[TOTEM ] totemsrp.c:2048 previous ring seq 0 rep 192.168.255.5
[TOTEM ] totemsrp.c:2054 aru 0 high delivered 0 received flag 1
[TOTEM ] totemsrp.c:2044 position [4] member 192.168.255.6:
[TOTEM ] totemsrp.c:2048 previous ring seq 24 rep 192.168.255.1
[TOTEM ] totemsrp.c:2054 aru 0 high delivered 0 received flag 1
[TOTEM ] totemsrp.c:2044 position [5] member 192.168.255.8:
[TOTEM ] totemsrp.c:2048 previous ring seq 24 rep 192.168.255.1
[TOTEM ] totemsrp.c:2054 aru 0 high delivered 0 received flag 1
[TOTEM ] totemsrp.c:2044 position [6] member 192.168.255.9:
[TOTEM ] totemsrp.c:2048 previous ring seq 24 rep 192.168.255.1
[TOTEM ] totemsrp.c:2054 aru 0 high delivered 0 received flag 1
[TOTEM ] totemsrp.c:2044 position [7] member 192.168.255.10:
[TOTEM ] totemsrp.c:2048 previous ring seq 20 rep 192.168.255.1
[TOTEM ] totemsrp.c:2054 aru 60 high delivered 60 received flag 1
[TOTEM ] totemsrp.c:2044 position [8] member 192.168.255.11:
[TOTEM ] totemsrp.c:2048 previous ring seq 20 rep 192.168.255.1
[TOTEM ] totemsrp.c:2054 aru 60 high delivered 60 received flag 1
[TOTEM ] totemsrp.c:2044 position [9] member 192.168.255.12:
[TOTEM ] totemsrp.c:2048 previous ring seq 20 rep 192.168.255.1
[TOTEM ] totemsrp.c:2054 aru 60 high delivered 60 received flag 1
```

Please file a new bug for Comment #11. Closing rest of this bug as upstream.