Red Hat Bugzilla – Bug 722522
corosync crashes with combo of lossy network and config changes
Last modified: 2016-04-26 09:31:29 EDT
Description of problem:

From Tim Beale on the mailing list:

Hi,

We've hit a problem in the recovery code and I'm struggling to understand why
we do the following:

    /*
     * The recovery sort queue now becomes the regular
     * sort queue.  It is necessary to copy the state
     * into the regular sort queue.
     */
    sq_copy (&instance->regular_sort_queue, &instance->recovery_sort_queue);

The problem we're seeing is that sometimes an encapsulated message from the
recovery queue gets copied onto the regular queue, and corosync then crashes
trying to process it. (When it strips off the totemsrp header it finds another
totemsrp header rather than the totempg header it expects.)

The problem seems to happen when we only do the sq_items_release() for a
subset of the recovery messages, e.g. there are 12 messages on the recovery
queue and we only free/release 5 of them. The remaining encapsulated recovery
messages are left on the regular queue and corosync crashes trying to deliver
them.

It looks to me like deliver_messages_from_recovery_to_regular() handles the
encapsulation correctly, stripping the extra header and adding the recovery
messages to the regular queue. But then the sq_copy() just seems to overwrite
the regular queue.

We've avoided the crash in the past by simply reinitializing both queues, but
I don't think this is the best solution. Any advice would be appreciated.

Thanks,
Tim

Version-Release number of selected component (if applicable):
all versions of openais and corosync

How reproducible:
not sure

Steps to Reproduce:
1. unclear how to reproduce without modifying the binary
2. reproducer in further message exchanges below

Actual results:
crashes with a lossy network and lots of config changes

Expected results:
no crash

Additional info:
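To illustrate the failure mode described above, here is a small self-contained
sketch. The struct names and header layouts are hypothetical illustrations,
not the real totemsrp/totempg structures: a recovery-queue item carries an
extra encapsulating totemsrp header, so if sq_copy() carries the raw item over
to the regular queue, regular delivery strips one header and then misreads a
second totemsrp header where a totempg header should be.

    /* Sketch only: illustrative headers, not corosync's real layouts. */
    #include <stdio.h>
    #include <string.h>

    struct srp_header { unsigned int seq; unsigned int encapsulated; };
    struct pg_header  { unsigned int msg_count; };

    int main (void)
    {
        unsigned char buf[64];
        size_t off = 0;

        /* A recovery item: outer (recovery) srp header wrapping the original
         * message, which is itself an srp header followed by a pg header. */
        struct srp_header outer = { .seq = 12, .encapsulated = 1 };
        struct srp_header inner = { .seq = 12, .encapsulated = 0 };
        struct pg_header  pg    = { .msg_count = 3 };

        memcpy (buf + off, &outer, sizeof outer); off += sizeof outer;
        memcpy (buf + off, &inner, sizeof inner); off += sizeof inner;
        memcpy (buf + off, &pg,    sizeof pg);    off += sizeof pg;

        struct pg_header seen;

        /* deliver_messages_from_recovery_to_regular() path: the recovery
         * (outer) header is stripped first, so regular delivery later strips
         * the inner srp header and finds the pg header it expects. */
        memcpy (&seen, buf + 2 * sizeof (struct srp_header), sizeof seen);
        printf ("encapsulation stripped: msg_count = %u\n", seen.msg_count);

        /* sq_copy() path: the raw item lands on the regular queue, so regular
         * delivery strips only one srp header and misreads the inner srp
         * header as a pg header -- the crash described above. */
        memcpy (&seen, buf + sizeof (struct srp_header), sizeof seen);
        printf ("raw copy: bogus msg_count = %u\n", seen.msg_count);
        return 0;
    }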
My response:

A proper fix should be in commit:

    master:       7d5e588931e4393c06790995a995ea69e6724c54
    flatiron-1.3: 8603ff6e9a270ecec194f4e13780927ebeb9f5b2

A new flatiron-1.3 release is in the works. There are other totem bugs you may
wish to backport in the meantime. Let us know if that commit fixes the problem
you encountered.

Regards
-steve

(Note: these are in RHEL products.)
His response:

Which is why I was retesting this issue. But I still see the problem even with
the above change.

The recovery code seems to work most of the time, but occasionally it doesn't
free all of the recovery messages on the queue. There are recovery messages
left with seq numbers higher than instance->my_high_delivered /
instance->my_aru. In the last crash I saw, there were 12 messages on the
recovery queue but only 5 of them got freed by the above patch/code. A node
leave event usually seems to occur at the same time.

I can reproduce the problem reasonably reliably in a 2-node cluster with:

    #define TEST_DROP_ORF_TOKEN_PERCENTAGE 40
    #define TEST_DROP_MCAST_PERCENTAGE 20

But I suspect it's reliant on timing/messaging specific to my system. Let me
know if there's any debug or anything you want me to try out.

Thanks,
Tim

(Note: there are reproducer instructions in this message, but they require
binary changes.)
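For context on the reproducer, a percentage-based drop hook of this kind
typically works like the sketch below. The should_drop() helper and its
placement are illustrative assumptions, not the actual totemsrp test code:
each incoming ORF token or multicast message is discarded with the configured
probability, which forces frequent token loss and retransmit/recovery cycles
while configuration changes are in progress.

    /* Illustrative sketch of a percentage-based drop hook; not the real
     * totemsrp implementation. */
    #include <stdio.h>
    #include <stdlib.h>

    #define TEST_DROP_ORF_TOKEN_PERCENTAGE 40
    #define TEST_DROP_MCAST_PERCENTAGE     20

    /* Return 1 if this packet should be silently dropped, 0 otherwise. */
    static int should_drop (int drop_percentage)
    {
        return ((rand () % 100) < drop_percentage);
    }

    int main (void)
    {
        int i, dropped_tokens = 0, dropped_mcasts = 0;

        for (i = 0; i < 1000; i++) {
            dropped_tokens += should_drop (TEST_DROP_ORF_TOKEN_PERCENTAGE);
            dropped_mcasts += should_drop (TEST_DROP_MCAST_PERCENTAGE);
        }
        /* roughly 400 and 200 out of 1000 */
        printf ("dropped tokens: %d, dropped mcasts: %d\n",
            dropped_tokens, dropped_mcasts);
        return 0;
    }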
My response:

I speculate there are gaps in the recovery queue. For example, my_aru = 5, but
there are messages at 7 and 8, and 8 = my_high_seq_received, which results in
data slots being taken up in the new message queue.

What should really happen is that these last messages are delivered after a
transitional configuration, to maintain SAFE agreement. We don't have support
for SAFE at the moment, so it is probably safe just to throw these messages
away.

Could you test my speculative patch against your test case?

Thanks!
-steve
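A self-contained sketch of the idea behind that speculative patch follows. The
queue types and the handover() helper are hypothetical, not corosync's sq API
or the actual commit: during the recovery-to-regular handover, only messages
up to my_aru are carried over, and anything beyond that gap is released
instead of being left on the regular queue.

    /* Sketch only: hypothetical types, not corosync's sort-queue code. */
    #include <stdio.h>
    #include <stdlib.h>

    struct slot { unsigned int seq; void *msg; };

    struct queue {
        struct slot items[16];
        unsigned int count;
    };

    static void handover (struct queue *recovery, struct queue *regular,
        unsigned int my_aru)
    {
        unsigned int i;

        regular->count = 0;
        for (i = 0; i < recovery->count; i++) {
            if (recovery->items[i].seq <= my_aru) {
                /* deliverable: becomes part of the regular sort queue */
                regular->items[regular->count++] = recovery->items[i];
            } else {
                /* beyond my_aru: without SAFE agreement, drop it */
                free (recovery->items[i].msg);
                recovery->items[i].msg = NULL;
            }
        }
        recovery->count = 0;
    }

    int main (void)
    {
        struct queue recovery = { .count = 0 }, regular = { .count = 0 };
        unsigned int seqs[] = { 3, 4, 5, 7, 8 };   /* note the gap at 6 */
        unsigned int i;

        for (i = 0; i < 5; i++) {
            recovery.items[recovery.count].seq = seqs[i];
            recovery.items[recovery.count++].msg = malloc (16);
        }

        handover (&recovery, &regular, 5 /* my_aru */);
        printf ("messages kept on regular queue: %u\n", regular.count); /* 3 */

        for (i = 0; i < regular.count; i++) {
            free (regular.items[i].msg);
        }
        return 0;
    }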
His response:

Hi Steve,

Thanks for your help. I've tried out your patch and confirmed it fixes the
problem.

Cheers,
Tim
Engineering recommends full z-streams.
Created attachment 513405 [details] upstream master version patch
This patch fails QE. The problem is that, for some reason, in some
circumstances early messages are lost in the whiplash test case. The lost
messages result in barriers not synchronizing, blocking corosync.

As an example: nodes 1,2,3,4,5; node 2 loses 2 messages as a result of this
patch because my_aru is set to 2. Somehow, my_high_seq_received had reached 2
before operational_enter was called. Needs more investigation.

Reverting this patch results in 500 iterations of whiplash passing, although
the segfault scenario would occur in that situation.
Created attachment 523928 [details] Refined patch which resolves problem
Made it through 500 iterations of whiplash plus all of the CMAN tests and revolver. Marking this as VERIFIED.
Technical note added. If any revisions are required, please edit the "Technical Notes" field accordingly. All revisions will be proofread by the Engineering Content Services team. New Contents: Previously, when a combination of a lossy network and a large number of configuration changes was used with corosync, corosync sometimes terminated unexpectedly. This bug has been fixed, and corosync no longer crashes in the described scenario.
Since the problem described in this bug report should be resolved in a recent advisory, it has been closed with a resolution of ERRATA. For information on the advisory, and where to find the updated files, follow the link below. If the solution does not work for you, open a new bug report. http://rhn.redhat.com/errata/RHBA-2011-1515.html