| Summary: | corosync crashes with combo of lossy network and config changes | | |
|---|---|---|---|
| Product: | Red Hat Enterprise Linux 6 | Reporter: | Steven Dake <sdake> |
| Component: | corosync | Assignee: | Jan Friesse <jfriesse> |
| Status: | CLOSED ERRATA | QA Contact: | Cluster QE <mspqa-list> |
| Severity: | urgent | Docs Contact: | |
| Priority: | urgent | | |
| Version: | 6.1 | CC: | cluster-maint, djansa, jkortus, jwest, syeghiay |
| Target Milestone: | rc | Keywords: | ZStream |
| Target Release: | --- | | |
| Hardware: | All | | |
| OS: | Linux | | |
| Whiteboard: | | | |
| Fixed In Version: | corosync-1.4.1-4.el6 | Doc Type: | Bug Fix |
| Doc Text: | Previously, when a combination of a lossy network and a large number of configuration changes was used with corosync, corosync sometimes terminated unexpectedly. This bug has been fixed, and corosync no longer crashes in the described scenario. | | |
| Story Points: | --- | | |
| Clone Of: | | | |
| | 729081 (view as bug list) | Environment: | |
| Last Closed: | 2011-12-06 11:51:25 UTC | Type: | --- |
| Regression: | --- | Mount Type: | --- |
| Documentation: | --- | CRM: | |
| Verified Versions: | | Category: | --- |
| oVirt Team: | --- | RHEL 7.3 requirements from Atomic Host: | |
| Cloudforms Team: | --- | Target Upstream Version: | |
| Bug Depends On: | | | |
| Bug Blocks: | 727960, 727962, 729081 | | |
| Attachments: | | | |
Description Steven Dake 2011-07-15 15:06:52 UTC
My response:

A proper fix should be in commit:

master: 7d5e588931e4393c06790995a995ea69e6724c54
flatiron-1.3: 8603ff6e9a270ecec194f4e13780927ebeb9f5b2

A new flatiron-1.3 release is in the works. There are other totem bugs you may wish to backport in the meantime. Let us know if that commit fixes the problem you encountered.

Regards,
-steve

(Note: these are in RHEL products.)

His response:

Which is why I was retesting this issue. But I still see the problem even with the above change. The recovery code seems to work most of the time, but occasionally it doesn't free all of the recovery messages on the queue. It seems there are recovery messages left with seq numbers higher than instance->my_high_delivered / instance->my_aru. In the last crash I saw, there were 12 messages on the recovery queue but only 5 of them got freed by the above patch/code. I think a node leave event usually occurs at the same time.

I can reproduce the problem reasonably reliably in a 2-node cluster with:

    #define TEST_DROP_ORF_TOKEN_PERCENTAGE 40
    #define TEST_DROP_MCAST_PERCENTAGE 20

But I suspect it's reliant on timing/messaging specific to my system. Let me know if there's any debug or anything you want me to try out.

Thanks,
Tim

(Note: there are reproducer instructions in this message, but they require binary changes.)

My response:

I speculate there are gaps in the recovery queue. For example, my_aru = 5, but there are messages at 7 and 8, and 8 = my_high_seq_received, which results in data slots being taken up in the new message queue. What should really happen is that these last messages are delivered after a transitional configuration to maintain SAFE agreement. We don't have support for SAFE at the moment, so it is probably safe just to throw these messages away.

Could you test my speculative patch against your test case?

Thanks!
-steve

His response:

Hi Steve,

Thanks for your help. I've tried out your patch and confirmed it fixes the problem.

Cheers,
Tim

Engineering recommends full zstreams.

Created attachment 513405 [details]
upstream master version patch
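
The attached patch itself is not reproduced in this report. The following self-contained model is only a rough sketch of the idea described in the exchange above: once the transitional configuration has been delivered, recovery messages whose sequence numbers lie beyond my_aru are freed rather than carried into the new message queue, since SAFE agreement is not supported. The structure and function names (recovery_slot, discard_trailing_recovery_messages) are simplified stand-ins for illustration; they are not corosync internals and not the contents of the attachment.

```c
/*
 * Illustrative model only. It mimics the situation described above:
 * my_aru = 5, undelivered recovery messages sit at sequence numbers
 * 7 and 8, and my_high_seq_received = 8. The idea is to free any
 * recovery message beyond my_aru instead of letting it occupy slots
 * in the new ring's message queue.
 */
#include <stdio.h>
#include <stdlib.h>

#define QUEUE_SIZE 16           /* toy recovery queue, indexed by seq number */

struct recovery_slot {
	void *msg;              /* NULL means no message stored at this seq */
};

static void discard_trailing_recovery_messages (
	struct recovery_slot *queue,
	unsigned int my_aru,
	unsigned int my_high_seq_received)
{
	unsigned int seq;

	for (seq = my_aru + 1;
	     seq <= my_high_seq_received && seq < QUEUE_SIZE; seq++) {

		if (queue[seq].msg == NULL) {
			continue;       /* gap in the queue */
		}
		printf ("discarding undelivered recovery message seq %u\n", seq);
		free (queue[seq].msg);
		queue[seq].msg = NULL;
	}
}

int main (void)
{
	struct recovery_slot queue[QUEUE_SIZE] = { { NULL } };
	unsigned int my_aru = 5;
	unsigned int my_high_seq_received = 8;

	/* messages beyond my_aru that would otherwise be left behind */
	queue[7].msg = malloc (1);
	queue[8].msg = malloc (1);

	discard_trailing_recovery_messages (queue, my_aru, my_high_seq_received);
	return 0;
}
```

As the next comment notes, this simple cut-off turned out to discard messages that still needed to be delivered in the whiplash test case, which is why a refined patch (attachment 523928) followed.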
This patch fails QE. The problem is that, for some reason, in some circumstances early messages are lost in the whiplash test case. The lost messages result in barriers not synchronizing, blocking corosync. As an example: with nodes 1, 2, 3, 4, 5, node 2 loses 2 messages as a result of this patch because my_aru is set to 2. In some way, my_high_seq_received had reached 2 before operational_enter was called. This needs more investigation. Reverting this patch results in 500 iterations of whiplash passing, although the segfault scenario would then occur in that situation.

Created attachment 523928 [details]
Refined patch which resolves problem
Made it through 500 iterations of whiplash plus all of the CMAN tests and revolver. Marking this as VERIFIED.
Technical note added. If any revisions are required, please edit the "Technical Notes" field
accordingly. All revisions will be proofread by the Engineering Content Services team.
New Contents:
Previously, when a combination of a lossy network and a large number of configuration changes was used with corosync, corosync sometimes terminated unexpectedly. This bug has been fixed, and corosync no longer crashes in the described scenario.
Since the problem described in this bug report should be resolved in a recent advisory, it has been closed with a resolution of ERRATA. For information on the advisory, and where to find the updated files, follow the link below. If the solution does not work for you, open a new bug report. http://rhn.redhat.com/errata/RHBA-2011-1515.html