Bug 675783
| Summary: | token loss during recovery can trigger abort | ||
|---|---|---|---|
| Product: | Red Hat Enterprise Linux 6 | Reporter: | Steven Dake <sdake> |
| Component: | corosync | Assignee: | Steven Dake <sdake> |
| Status: | CLOSED ERRATA | QA Contact: | Cluster QE <mspqa-list> |
| Severity: | urgent | Docs Contact: | |
| Priority: | urgent | ||
| Version: | 6.1 | CC: | cluster-maint, djansa, edamato, jfriesse, jkortus, jwest |
| Target Milestone: | rc | Keywords: | ZStream |
| Target Release: | --- | ||
| Hardware: | All | ||
| OS: | Linux | ||
| Whiteboard: | |||
| Fixed In Version: | corosync-1.2.3-32.el6 | Doc Type: | Bug Fix |
| Doc Text: | | Story Points: | --- |
| Clone Of: | 623176 | Environment: | |
| Last Closed: | 2011-05-19 14:24:23 UTC | Type: | --- |
| Regression: | --- | Mount Type: | --- |
| Documentation: | --- | CRM: | |
| Verified Versions: | | Category: | --- |
| oVirt Team: | --- | RHEL 7.3 requirements from Atomic Host: | |
| Cloudforms Team: | --- | Target Upstream Version: | |
| Embargoed: | |||
| Bug Depends On: | 623176 | ||
| Bug Blocks: | 696733 | ||
| Attachments: | |||
(In reply to comment #2)
> Created attachment 482098 [details]
> Reproducer
>
> Reproduce patch. It's code change and must be applied on current master of
> corosync.
>
> Actually, patch needs to be applied only on only ONE node and specifically
> FIRST node (node creating commit token).

Also one extra note. It works only for 3 node configurations, so:

node1 - with applied patch, lowest node id
node2 - no change
node3 - no change

Created attachment 482342 [details]
the blackbox flight record prior to the resolution
Created attachment 482343 [details]
the blackbox flight record after the resolution
pre-fix:

rec=[169] Log Message=The token was lost in the RECOVERY state.
rec=[170] Log Message=Restoring instance->my_aru 2 my high seq received 2
rec=[171] Log Message=entering GATHER state from 5.
rec=[172] Log Message=got commit token
Finishing replay: records found [172]
(asserts here)

post-fix:

rec=[169] Log Message=The token was lost in the RECOVERY state.
rec=[170] Log Message=Restoring instance->my_aru 2 my high seq received 2
rec=[171] Log Message=entering GATHER state from 5.
rec=[172] Log Message=Creating commit token because I am the rep.
rec=[173] Log Message=Saving state aru 2 high seq received 2
rec=[174] Log Message=Storing new sequence id for ring 3e8c
rec=[175] Log Message=entering COMMIT state.
rec=[176] Log Message=got commit token
rec=[177] Log Message=entering RECOVERY state.

We see that in pre-fix, record 172 is showing a "got commit token". This should NOT be happening and matches the speculated cause in the last bit of comment #1.

We see in post-fix, the commit token is ignored, and instead we issue a new commit token on the node. This resolves the abort and fixes a serious problem we have had for some time in the code base.

Created attachment 482370 [details]
patch to fix the problem
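
The pre-fix and post-fix flight records quoted in this report differ in one observable way: before the fix, "got commit token" is logged after the token was lost in RECOVERY and before the node logs "entering COMMIT state." The following is a minimal sketch of a checker for that pattern in a dumped flight record. The function names `parse_records` and `stale_commit_tokens` are hypothetical, the one-record-per-line input format is an assumption, and the heuristic only mirrors the two traces quoted here; it is not corosync's own logic and is not part of the attached patch.

```python
import re

# Assumed dump format: one "rec=[N] Log Message=..." entry per line.
RECORD_RE = re.compile(r"rec=\[(\d+)\] Log Message=(.*)")


def parse_records(text):
    """Return (record_number, message) pairs; lines that do not match are skipped."""
    records = []
    for line in text.splitlines():
        match = RECORD_RE.match(line.strip())
        if match:
            records.append((int(match.group(1)), match.group(2).strip()))
    return records


def stale_commit_tokens(records):
    """Heuristic: flag 'got commit token' records seen after a token loss in
    RECOVERY and before the next 'entering COMMIT state.' record."""
    suspicious = []
    token_lost = False
    for rec_no, msg in records:
        if msg.startswith("The token was lost in the RECOVERY state"):
            token_lost = True
        elif msg.startswith("entering COMMIT state"):
            token_lost = False
        elif msg.startswith("got commit token") and token_lost:
            suspicious.append(rec_no)
    return suspicious


# Sample data copied from the two traces quoted in this report.
PRE_FIX = """
rec=[169] Log Message=The token was lost in the RECOVERY state.
rec=[170] Log Message=Restoring instance->my_aru 2 my high seq received 2
rec=[171] Log Message=entering GATHER state from 5.
rec=[172] Log Message=got commit token
"""

POST_FIX = """
rec=[169] Log Message=The token was lost in the RECOVERY state.
rec=[170] Log Message=Restoring instance->my_aru 2 my high seq received 2
rec=[171] Log Message=entering GATHER state from 5.
rec=[172] Log Message=Creating commit token because I am the rep.
rec=[173] Log Message=Saving state aru 2 high seq received 2
rec=[174] Log Message=Storing new sequence id for ring 3e8c
rec=[175] Log Message=entering COMMIT state.
rec=[176] Log Message=got commit token
rec=[177] Log Message=entering RECOVERY state.
"""

if __name__ == "__main__":
    print("pre-fix suspicious records:", stale_commit_tokens(parse_records(PRE_FIX)))
    print("post-fix suspicious records:", stale_commit_tokens(parse_records(POST_FIX)))
```

On the quoted traces this flags record 172 in the pre-fix dump and nothing in the post-fix dump. A check like this could be useful when scanning many whiplash iterations for the same symptom, with the caveat that it has only been reasoned about against the two traces above.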
Created attachment 487114 [details]
a regression trace introduced by this patch
Created attachment 487380 [details]
patch 1/2 to finish resolving this bug
Created attachment 487384 [details]
patch 2/2 to finish resolving this bug
Made it through 500 iterations of whiplash with corosync-1.2.3-32.el6.

An advisory has been issued which should help the problem described in this bug report. This report is therefore being closed with a resolution of ERRATA. For more information on the solution and/or where to find the updated files, please follow the link below. You may reopen this bug report if the solution does not work for you.

http://rhn.redhat.com/errata/RHBA-2011-0764.html
Created attachment 482098 [details]
Reproducer

Reproducer patch. It is a code change and must be applied on the current master of corosync.

The patch needs to be applied on only ONE node, specifically the FIRST node (the node creating the commit token).