| Summary: | corosync crashes with combo of lossy network and config changes | | |
|---|---|---|---|
| Product: | Red Hat Enterprise Linux 6 | Reporter: | Steven Dake <sdake> |
| Component: | corosync | Assignee: | Jan Friesse <jfriesse> |
| Status: | CLOSED ERRATA | QA Contact: | Cluster QE <mspqa-list> |
| Severity: | urgent | Docs Contact: | |
| Priority: | urgent | | |
| Version: | 6.1 | CC: | cluster-maint, djansa, jkortus, jwest, syeghiay |
| Target Milestone: | rc | Keywords: | ZStream |
| Target Release: | --- | | |
| Hardware: | All | | |
| OS: | Linux | | |
| Whiteboard: | | | |
| Fixed In Version: | corosync-1.4.1-4.el6 | Doc Type: | Bug Fix |
| Doc Text: | Previously, when a combination of a lossy network and a large number of configuration changes was used with corosync, corosync sometimes terminated unexpectedly. This bug has been fixed, and corosync no longer crashes in the described scenario. | | |
| Story Points: | --- | | |
| Clone Of: | | | |
| | 729081 (view as bug list) | Environment: | |
| Last Closed: | 2011-12-06 11:51:25 UTC | Type: | --- |
| Regression: | --- | Mount Type: | --- |
| Documentation: | --- | CRM: | |
| Verified Versions: | | Category: | --- |
| oVirt Team: | --- | RHEL 7.3 requirements from Atomic Host: | |
| Cloudforms Team: | --- | Target Upstream Version: | |
| Bug Depends On: | | | |
| Bug Blocks: | 727960, 727962, 729081 | | |
| Attachments: | | | |
Description Steven Dake 2011-07-15 15:06:52 UTC
My response:

A proper fix should be in commit:

master: 7d5e588931e4393c06790995a995ea69e6724c54
flatiron-1.3: 8603ff6e9a270ecec194f4e13780927ebeb9f5b2

A new flatiron-1.3 release is in the works. There are other totem bugs you may wish to backport in the meantime. Let us know if that commit fixes the problem you encountered.

Regards,
-steve

(Note: these are in RHEL products.)

His response:

Which is why I was retesting this issue. But I still see the problem even with the above change. The recovery code seems to work most of the time, but occasionally it doesn't free all of the recovery messages on the queue. It seems there are recovery messages left with seq numbers higher than instance->my_high_delivered / instance->my_aru. In the last crash I saw, there were 12 messages on the recovery queue but only 5 of them got freed by the above patch/code. I think a node leave event usually occurs at the same time.

I can reproduce the problem reasonably reliably in a 2-node cluster with:

    #define TEST_DROP_ORF_TOKEN_PERCENTAGE 40
    #define TEST_DROP_MCAST_PERCENTAGE 20

But I suspect it's reliant on timing/messaging specific to my system. Let me know if there's any debug or anything you want me to try out.

Thanks,
Tim

(Note: there are reproducer instructions in this message, but they require binary changes.)

My response:

I speculate there are gaps in the recovery queue. For example, my_aru = 5, but there are messages at 7 and 8, and 8 = my_high_seq_received, which results in data slots being taken up in the new message queue. What should really happen is that these last messages are delivered after a transitional configuration to maintain SAFE agreement. We don't have support for SAFE at the moment, so it is probably safe just to throw these messages away.

Could you test my speculative patch against your test case?

Thanks!
-steve

His response:

Hi Steve,

Thanks for your help. I've tried out your patch and confirmed it fixes the problem.

Cheers,
Tim

Engineering recommends full zstreams.

Created attachment 513405 [details]
upstream master version patch
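
The attached patch itself is not reproduced in this report. The following self-contained model is only a rough sketch of the idea described in the exchange above: once the transitional configuration has been delivered, recovery messages whose sequence numbers lie beyond my_aru are freed rather than carried into the new message queue, since SAFE agreement is not supported. The structure and function names (recovery_slot, discard_trailing_recovery_messages) are simplified stand-ins for illustration; they are not corosync internals and not the contents of the attachment.

```c
/*
 * Illustrative model only. It mimics the situation described above:
 * my_aru = 5, undelivered recovery messages sit at sequence numbers
 * 7 and 8, and my_high_seq_received = 8. The idea is to free any
 * recovery message beyond my_aru instead of letting it occupy slots
 * in the new ring's message queue.
 */
#include <stdio.h>
#include <stdlib.h>

#define QUEUE_SIZE 16           /* toy recovery queue, indexed by seq number */

struct recovery_slot {
	void *msg;              /* NULL means no message stored at this seq */
};

static void discard_trailing_recovery_messages (
	struct recovery_slot *queue,
	unsigned int my_aru,
	unsigned int my_high_seq_received)
{
	unsigned int seq;

	for (seq = my_aru + 1;
	     seq <= my_high_seq_received && seq < QUEUE_SIZE; seq++) {

		if (queue[seq].msg == NULL) {
			continue;       /* gap in the queue */
		}
		printf ("discarding undelivered recovery message seq %u\n", seq);
		free (queue[seq].msg);
		queue[seq].msg = NULL;
	}
}

int main (void)
{
	struct recovery_slot queue[QUEUE_SIZE] = { { NULL } };
	unsigned int my_aru = 5;
	unsigned int my_high_seq_received = 8;

	/* messages beyond my_aru that would otherwise be left behind */
	queue[7].msg = malloc (1);
	queue[8].msg = malloc (1);

	discard_trailing_recovery_messages (queue, my_aru, my_high_seq_received);
	return 0;
}
```

As the next comment notes, this simple cut-off turned out to discard messages that still needed to be delivered in the whiplash test case, which is why a refined patch (attachment 523928) followed.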
This patch fails QE. The problem is that, for some reason, in some circumstances early messages are lost in the whiplash test case. The lost messages result in barriers not synchronizing, blocking corosync. As an example: with nodes 1, 2, 3, 4, 5, node 2 loses 2 messages as a result of this patch because my_aru is set to 2. In some way, my_high_seq_received had reached 2 before operational_enter was called. This needs more investigation. Reverting this patch results in 500 iterations of whiplash passing, although the segfault scenario would then occur in that situation.

Created attachment 523928 [details]
Refined patch which resolves problem
Made it through 500 iterations of whiplash plus all of the CMAN tests and revolver. Marking this as VERIFIED.
Technical note added. If any revisions are required, please edit the "Technical Notes" field
accordingly. All revisions will be proofread by the Engineering Content Services team.
New Contents:
Previously, when a combination of a lossy network and a large number of configuration changes was used with corosync, corosync sometimes terminated unexpectedly. This bug has been fixed, and corosync no longer crashes in the described scenario.
Since the problem described in this bug report should be resolved in a recent advisory, it has been closed with a resolution of ERRATA. For information on the advisory, and where to find the updated files, follow the link below. If the solution does not work for you, open a new bug report. http://rhn.redhat.com/errata/RHBA-2011-1515.html