Bug 722522 - corosync crashes with combo of lossy network and config changes
Status: CLOSED ERRATA
Product: Red Hat Enterprise Linux 6
Classification: Red Hat
Component: corosync
Version: 6.1
Hardware: All
OS: Linux
Priority: urgent
Severity: urgent
Target Milestone: rc
Target Release: ---
Assigned To: Jan Friesse
QA Contact: Cluster QE
Keywords: ZStream
Depends On:
Blocks: 727960 727962 729081

Reported: 2011-07-15 11:06 EDT by Steven Dake
Modified: 2016-04-26 09:31 EDT
CC: 5 users

See Also:
Fixed In Version: corosync-1.4.1-4.el6
Doc Type: Bug Fix
Doc Text:
Previously, when a combination of a lossy network and a large number of configuration changes was used with corosync, corosync sometimes terminated unexpectedly. This bug has been fixed, and corosync no longer crashes in the described scenario.
Story Points: ---
Clone Of:
Clones: 729081
Environment:
Last Closed: 2011-12-06 06:51:25 EST


Attachments
upstream master version patch (1.41 KB, patch) - 2011-07-15 11:18 EDT, Steven Dake
Refined patch which resolves problem (2.21 KB, patch) - 2011-09-19 19:41 EDT, Steven Dake


External Trackers
Red Hat Product Errata RHBA-2011:1515 (normal, SHIPPED_LIVE): corosync bug fix and enhancement update. Last Updated: 2011-12-05 19:38:47 EST

Description Steven Dake 2011-07-15 11:06:52 EDT
Description of problem:

From Tim Beale on the mailing list:

Hi,

We've hit a problem in the recovery code and I'm struggling to understand why
we do the following:

	/*
	 * The recovery sort queue now becomes the regular
	 * sort queue.  It is necessary to copy the state
	 * into the regular sort queue.
	 */
	sq_copy (&instance->regular_sort_queue, &instance->recovery_sort_queue);

The problem we're seeing is that sometimes an encapsulated message from the
recovery queue gets copied onto the regular queue, and corosync then crashes
trying to process it. (When it strips off the totemsrp header, it finds another
totemsrp header rather than the totempg header it expects.)

The problem seems to happen when we only do the sq_items_release() for a subset
of the recovery messages, e.g. there are 12 messages on the recovery queue and
we only free/release 5 of them. The remaining encapsulated recovery messages
get left on the regular queue and corosync crashes trying to deliver them.

It looks to me like deliver_messages_from_recovery_to_regular() handles the
encapsulation correctly, stripping the extra header and adding the recovery
messages to the regular queue. But then the sq_copy() just seems to overwrite
the regular queue.
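For illustration, here is a minimal sketch (hypothetical types and names, not
the real corosync code) of the two paths involved: recovery delivery strips the
inner header of encapsulated messages, while a raw copy leaves it in place, so
the regular delivery path misreads a totemsrp-style header as a totempg-style
one:

#include <stdio.h>
#include <string.h>

struct srp_header { int encapsulated; };   /* totemsrp-style header */
struct pg_header  { int msg_type; };       /* totempg-style header  */

struct queued_msg {
        struct srp_header srp;
        char payload[128];   /* a pg_header, or another srp_header when
                              * the message is encapsulated */
};

/* Recovery delivery strips the inner header before re-queueing. */
static void strip_encapsulation(struct queued_msg *m)
{
        if (m->srp.encapsulated) {
                memmove(m->payload,
                        m->payload + sizeof (struct srp_header),
                        sizeof (m->payload) - sizeof (struct srp_header));
                m->srp.encapsulated = 0;
        }
}

/* The regular delivery path assumes nothing is encapsulated. */
static void deliver_regular(const struct queued_msg *m)
{
        /* If a raw queue copy placed a still-encapsulated recovery entry
         * here, pg actually points at a second srp_header and msg_type
         * is garbage, which is the crash described above. */
        const struct pg_header *pg = (const struct pg_header *) m->payload;
        printf("delivering message type %d\n", pg->msg_type);
}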

We've avoided the crash in the past by just re-initializing both queues, but I
don't think this is the best solution.

Any advice would be appreciated.

Thanks,
Tim


Version-Release number of selected component (if applicable):
all versions of openais and corosync

How reproducible:
not sure

Steps to Reproduce:
1. Unclear how to reproduce without modifying the binary.
2. A reproducer is described in the message exchanges below (see Comment 2), but it requires binary changes.
Actual results:
corosync crashes with a lossy network and a large number of configuration changes

Expected results:
no crash

Additional info:
Comment 1 Steven Dake 2011-07-15 11:08:03 EDT
My response:

A proper fix should be in the following commits:
master: 7d5e588931e4393c06790995a995ea69e6724c54
flatiron-1.3: 8603ff6e9a270ecec194f4e13780927ebeb9f5b2

A new flatiron-1.3 release is in the works.  There are other totem bugs
you may wish to backport in the meantime.

Let us know if that commit fixes the problem you encountered.

Regards
-steve

(Note: these fixes are included in RHEL products.)
Comment 2 Steven Dake 2011-07-15 11:09:02 EDT
His response:

Which is why I was retesting this issue. But I still see the problem even with
the above change.

The recovery code seems to work most of the time. But occasionally it doesn't
free all of the recovery messages on the queue. It seems there are recovery
messages left with seq numbers higher than instance->my_high_delivered/
instance->my_aru.

In the last crash I saw, there were 12 messages on the recovery queue but only
5 of them got freed by the above patch/code. I think a node-leave event usually
occurs at the same time.

I can reproduce the problem reasonably reliably in a 2-node cluster with:
#define TEST_DROP_ORF_TOKEN_PERCENTAGE 40
#define TEST_DROP_MCAST_PERCENTAGE 20
But I suspect it's reliant on timing/messaging specific to my system. Let me
know if there's any debug or anything you want me to try out.

Thanks,
Tim

Note: this message contains reproducer instructions, but they require changes to the binary.
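For reference, the sketch below shows how a percentage-based drop hook of this
kind is typically wired in (the should_drop() helper is hypothetical; the real
test code in corosync differs in detail). With the defines above set to 40 and
20, a node randomly discards roughly 40% of ORF tokens and 20% of multicast
messages, simulating a lossy network:

#include <stdlib.h>

#define TEST_DROP_ORF_TOKEN_PERCENTAGE 40
#define TEST_DROP_MCAST_PERCENTAGE 20

/* Return nonzero roughly drop_percentage percent of the time. */
static int should_drop(int drop_percentage)
{
        return (rand() % 100) < drop_percentage;
}

/*
 * Roughly how such a hook would be used in a receive handler:
 *
 *      if (should_drop(TEST_DROP_MCAST_PERCENTAGE)) {
 *              return;    // pretend the packet was lost
 *      }
 */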
Comment 3 Steven Dake 2011-07-15 11:09:53 EDT
My response:

I speculate there are gaps in the recovery queue. For example, my_aru = 5, but
there are messages at sequence numbers 7 and 8, and 8 = my_high_seq_received,
which results in data slots being taken up in the new message queue. What
should really happen is that these last messages are delivered after a
transitional configuration to maintain SAFE agreement. We don't have support
for SAFE at the moment, so it is probably safe just to throw these messages away.

Could you test my speculative patch against your test case?

Thanks!
-steve
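A minimal sketch of the idea described above (releasing the recovery-queue
entries beyond my_aru instead of copying them) could look like the following;
the structures here are hypothetical, indexed directly by sequence number for
simplicity, and this is not the attached patch:

#include <stdlib.h>

struct sort_queue_item {
        void *mcast;    /* queued message, or NULL for a gap */
};

/*
 * Release every entry above my_aru. Without SAFE support these
 * messages cannot be agreed on, so they are simply thrown away rather
 * than copied into the regular sort queue.
 */
static void drop_unagreed_entries(struct sort_queue_item *items,
                                  unsigned int my_aru,
                                  unsigned int my_high_seq_received)
{
        unsigned int seq;

        for (seq = my_aru + 1; seq <= my_high_seq_received; seq++) {
                if (items[seq].mcast != NULL) {
                        free(items[seq].mcast);
                        items[seq].mcast = NULL;
                }
        }
}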
Comment 4 Steven Dake 2011-07-15 11:10:30 EDT
His response:


Hi Steve,

Thanks for your help. I've tried out your patch and confirmed it fixes
the problem.

Cheers,
Tim
Comment 5 Steven Dake 2011-07-15 11:11:59 EDT
Engineering recommends full z-streams.
Comment 6 Steven Dake 2011-07-15 11:18:20 EDT
Created attachment 513405 [details]
upstream master version patch
Comment 14 Steven Dake 2011-09-16 18:32:23 EDT
This patch fails QE. The problem is that, for some reason, in some circumstances early messages are lost in the whiplash test case.

The lost messages result in barriers not synchronizing, blocking corosync.

As an example:

nodes 1, 2, 3, 4, 5
Node 2 loses 2 messages as a result of this patch because my_aru is set to 2. Somehow, my_high_seq_received had reached 2 before operational_enter was called. This needs more investigation.

Reverting this patch results in 500 iterations of whiplash passing, although the segfault scenario would reoccur in that situation.
Comment 15 Steven Dake 2011-09-19 19:41:04 EDT
Created attachment 523928 [details]
Refined patch which resolves problem
Comment 23 Nate Straz 2011-10-03 10:03:56 EDT
Made it through 500 iterations of whiplash plus all of the CMAN tests and revolver.  Marking this as VERIFIED.
Comment 24 Tomas Capek 2011-10-06 09:38:48 EDT
    Technical note added. If any revisions are required, please edit the "Technical Notes" field
    accordingly. All revisions will be proofread by the Engineering Content Services team.
    
    New Contents:
Previously, when a combination of a lossy network and a large number of configuration changes was used with corosync, corosync sometimes terminated unexpectedly. This bug has been fixed, and corosync no longer crashes in the described scenario.
Comment 25 errata-xmlrpc 2011-12-06 06:51:25 EST
Since the problem described in this bug report should be
resolved in a recent advisory, it has been closed with a
resolution of ERRATA.

For information on the advisory, and where to find the updated
files, follow the link below.

If the solution does not work for you, open a new bug report.

http://rhn.redhat.com/errata/RHBA-2011-1515.html
