Bug 675783 - token loss during recovery can trigger abort
Summary: token loss during recovery can trigger abort
Keywords:
Status: CLOSED ERRATA
Alias: None
Product: Red Hat Enterprise Linux 6
Classification: Red Hat
Component: corosync
Version: 6.1
Hardware: All
OS: Linux
Priority: urgent
Severity: urgent
Target Milestone: rc
Target Release: ---
Assignee: Steven Dake
QA Contact: Cluster QE
URL:
Whiteboard:
Depends On: 623176
Blocks: 696733
Reported: 2011-02-07 18:25 UTC by Steven Dake
Modified: 2016-04-26 15:30 UTC

Fixed In Version: corosync-1.2.3-32.el6
Doc Type: Bug Fix
Doc Text:
Clone Of: 623176
Environment:
Last Closed: 2011-05-19 14:24:23 UTC
Target Upstream Version:
Embargoed:


Attachments
Reproducer (4.08 KB, patch), 2011-03-03 14:43 UTC, Jan Friesse
the blackbox flight record prior to the resolution (11.05 KB, text/plain), 2011-03-04 18:09 UTC, Steven Dake
the blackbox flight record after the resolution (29.14 KB, text/plain), 2011-03-04 18:09 UTC, Steven Dake
patch to fix the problem (4.56 KB, patch), 2011-03-04 20:29 UTC, Steven Dake
a regression trace introduced by this patch (137.33 KB, application/x-gzip), 2011-03-23 18:37 UTC, Steven Dake
patch 1/2 to finish resolving this bug (2.45 KB, patch), 2011-03-24 15:53 UTC, Steven Dake
patch 2/2 to finish resolving this bug (2.36 KB, application/octet-stream), 2011-03-24 15:58 UTC, Steven Dake


Links
System | ID | Private | Priority | Status | Summary | Last Updated
Red Hat Product Errata | RHBA-2011:0764 | 0 | normal | SHIPPED_LIVE | corosync bug fix update | 2011-05-18 18:08:44 UTC

Comment 2 Jan Friesse 2011-03-03 14:43:14 UTC
Created attachment 482098
Reproducer

Reproducer patch. It's a code change and must be applied on the current master of corosync.

Actually, the patch needs to be applied on only ONE node, specifically the FIRST node (the node creating the commit token).

Comment 3 Jan Friesse 2011-03-03 14:45:11 UTC
(In reply to comment #2)
> Created attachment 482098
> Reproducer
> 
> Reproducer patch. It's a code change and must be applied on the current master
> of corosync.
> 
> Actually, the patch needs to be applied on only ONE node, specifically the
> FIRST node (the node creating the commit token).

One extra note: the reproducer works only for 3-node configurations, so:
node1 - with applied patch, lowest node id
node2 - no change
node3 - no change

Comment 4 Steven Dake 2011-03-04 18:09:01 UTC
Created attachment 482342
the blackbox flight record prior to the resolution

Comment 5 Steven Dake 2011-03-04 18:09:43 UTC
Created attachment 482343
the blackbox flight record after the resolution

Comment 6 Steven Dake 2011-03-04 18:12:45 UTC
pre-fix:
rec=[169] Log Message=The token was lost in the RECOVERY state.
rec=[170] Log Message=Restoring instance->my_aru 2 my high seq received 2
rec=[171] Log Message=entering GATHER state from 5.
rec=[172] Log Message=got commit token
Finishing replay: records found [172]
(asserts here)

post-fix:
rec=[169] Log Message=The token was lost in the RECOVERY state.
rec=[170] Log Message=Restoring instance->my_aru 2 my high seq received 2
rec=[171] Log Message=entering GATHER state from 5.
rec=[172] Log Message=Creating commit token because I am the rep.
rec=[173] Log Message=Saving state aru 2 high seq received 2
rec=[174] Log Message=Storing new sequence id for ring 3e8c
rec=[175] Log Message=entering COMMIT state.
rec=[176] Log Message=got commit token
rec=[177] Log Message=entering RECOVERY state.


In the pre-fix trace, record 172 shows "got commit token". This should NOT happen, and it matches the cause speculated at the end of comment #1. In the post-fix trace, that commit token is ignored: the node instead creates a new commit token (record 172), enters the COMMIT state (record 175), and only then receives a commit token (record 176).

This resolves the abort and fixes a serious problem we have had for some time in the code base.
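
As a rough illustration only (this is not part of the attached patch and not corosync code), the difference between the two traces can be checked mechanically. The Python sketch below parses the "rec=[N] Log Message=..." lines quoted above and flags any "got commit token" record logged after "entering GATHER state" but before "entering COMMIT state". That rule is an assumption taken from these two traces of the node that creates the commit token; on other nodes the same ordering may be legitimate. The function names and the PRE_FIX/POST_FIX samples are hypothetical helpers for this example.

```python
import re

# Matches flight-record lines such as: rec=[172] Log Message=got commit token
RECORD_RE = re.compile(r"^rec=\[(\d+)\] Log Message=(.*)$")


def parse_records(text):
    """Return (record_number, message) pairs found in a flight-record dump."""
    records = []
    for line in text.splitlines():
        match = RECORD_RE.match(line.strip())
        if match:
            records.append((int(match.group(1)), match.group(2).strip()))
    return records


def unexpected_commit_tokens(records):
    """Return record numbers where a commit token arrives while in GATHER.

    Assumed heuristic (derived only from the traces in this comment): after
    "entering GATHER state", a "got commit token" record is expected only
    once "entering COMMIT state" has been logged.
    """
    in_gather = False
    flagged = []
    for number, message in records:
        if message.startswith("entering GATHER state"):
            in_gather = True
        elif message.startswith("entering COMMIT state"):
            in_gather = False
        elif message == "got commit token" and in_gather:
            flagged.append(number)
    return flagged


# Samples copied from the pre-fix and post-fix traces in this comment.
PRE_FIX = """\
rec=[169] Log Message=The token was lost in the RECOVERY state.
rec=[170] Log Message=Restoring instance->my_aru 2 my high seq received 2
rec=[171] Log Message=entering GATHER state from 5.
rec=[172] Log Message=got commit token
"""

POST_FIX = """\
rec=[171] Log Message=entering GATHER state from 5.
rec=[172] Log Message=Creating commit token because I am the rep.
rec=[175] Log Message=entering COMMIT state.
rec=[176] Log Message=got commit token
rec=[177] Log Message=entering RECOVERY state.
"""

if __name__ == "__main__":
    print(unexpected_commit_tokens(parse_records(PRE_FIX)))   # [172]
    print(unexpected_commit_tokens(parse_records(POST_FIX)))  # []
```

Run against the full flight-record attachments, a non-empty result would point at the records worth inspecting by hand; an empty result does not prove the trace is correct.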

Comment 7 Steven Dake 2011-03-04 20:29:58 UTC
Created attachment 482370
patch to fix the problem

Comment 9 Steven Dake 2011-03-23 18:37:06 UTC
Created attachment 487114
a regression trace introduced by this patch

Comment 11 Steven Dake 2011-03-24 15:53:05 UTC
Created attachment 487380
patch 1/2 to finish resolving this bug

Comment 12 Steven Dake 2011-03-24 15:58:45 UTC
Created attachment 487384
patch 2/2 to finish resolving this bug

Comment 14 Nate Straz 2011-03-25 11:13:09 UTC
Made it through 500 iterations of whiplash with corosync-1.2.3-32.el6.

Comment 18 errata-xmlrpc 2011-05-19 14:24:23 UTC
An advisory has been issued which should address the problem
described in this bug report. This report is therefore being
closed with a resolution of ERRATA. For more information
on the solution and/or where to find the updated files,
please follow the link below. You may reopen this bug report
if the solution does not work for you.

http://rhn.redhat.com/errata/RHBA-2011-0764.html

