Bug 907894
| Field | Value |
| --- | --- |
| Summary | CPG: Corosync can duplicate and/or lose messages - Multiple nodes problems |
| Product | Red Hat Enterprise Linux 6 |
| Component | corosync |
| Reporter | Jan Friesse <jfriesse> |
| Assignee | Jan Friesse <jfriesse> |
| Status | CLOSED ERRATA |
| QA Contact | Cluster QE <mspqa-list> |
| Severity | urgent |
| Priority | urgent |
| Version | 6.4 |
| CC | bmarson, cluster-maint, jkortus, perfbz, sdake, slevine, sradvan |
| Target Milestone | rc |
| Target Release | --- |
| Hardware | All |
| OS | All |
| Fixed In Version | corosync-1.4.1-16.el6 |
| Doc Type | Bug Fix |
| Clones | 922671 (view as bug list) |
| Last Closed | 2013-11-21 04:32:17 UTC |
| Type | Bug |
| Bug Depends On | 922671, 924261 |
| Bug Blocks | 960054 |

Doc Text:

Cause: Corosync is running on multiple nodes, and some of the nodes are killed (corosync dies, exits, a switch fails, ...).

Consequence: Very rarely, corosync can lose or duplicate messages.

Fix: Many race conditions were fixed.

Result: Corosync should no longer lose or duplicate messages.
Description

Jan Friesse, 2013-02-05 13:36:33 UTC

Created attachment 693380 [details]
Proposed patch - part 1

Created attachment 693381 [details]
Proposed patch - part 2 - check dispatch_put return code

With patches 1 and 2, the duplicate messages are gone; the problem with corrupted/lost messages persists.

Created attachment 706733 [details]
Proposed patch - part 3 - Take alignment into account for free_bytes in ring buffer

This patch solves another issue (CRITICAL). I believe it also solves the problem bmarson is hitting.
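The alignment problem named in the part 3 patch title can be sketched in isolation: if a ring buffer's free-space check compares the raw message size against free_bytes while writes are actually padded to an alignment boundary, a message can be accepted that overruns the available space. A minimal illustration (hypothetical names and alignment value; not the corosync source):

```c
#include <stddef.h>
#include <stdio.h>

/* Illustrative word alignment for ring-buffer writes (assumption). */
#define RB_ALIGN 8u

/* Round a requested message size up to the alignment boundary. */
static size_t rb_aligned_size(size_t bytes)
{
    return (bytes + (RB_ALIGN - 1)) & ~(size_t)(RB_ALIGN - 1);
}

/* A write fits only if its ALIGNED size fits: comparing the raw size
 * against free_bytes can accept a write that the padded store would
 * actually overrun. */
static int rb_write_fits(size_t free_bytes, size_t msg_bytes)
{
    return rb_aligned_size(msg_bytes) <= free_bytes;
}
```

For example, a 13-byte message occupies 16 aligned bytes, so 14 free bytes are not enough even though 14 > 13; checking the unaligned size would wrongly accept it.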
Reproducer:
- Start corosync on one node
- Run cpgload -q -n 500
- After (usually) less than three minutes, lost messages appear in the form:
  20130307T170546:(a126220a 3200):341:Incorrect msg seq 341 != 297
  20130307T170546:(a126220a 3200):342:Incorrect msg seq 342 != 297
- cpgload will then end with:
  Dispatch error 2
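The "Incorrect msg seq" lines come from the load client comparing the sequence number embedded in each received message against the next expected value. A minimal sketch of that detection logic (illustrative only; not the actual cpgload source):

```c
#include <stdio.h>
#include <stdint.h>

static uint32_t expected_seq = 0;   /* next sequence number we expect */
static unsigned lost_events = 0;    /* gaps (lost/reordered messages) seen */

/* Delivery callback: compare the message's sequence number with the
 * expected counter; a mismatch indicates lost or reordered messages. */
static void on_message(uint32_t msg_seq)
{
    if (msg_seq != expected_seq) {
        /* same shape as the report's "Incorrect msg seq 341 != 297" */
        printf("Incorrect msg seq %u != %u\n", msg_seq, expected_seq);
        lost_events++;
        expected_seq = msg_seq;   /* resynchronize on the received value */
    }
    expected_seq++;
}
```

Delivering 0, 1, 2 and then 5 would report one gap ("Incorrect msg seq 5 != 3") and resynchronize so that 6 is expected next.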
"Unit test" for proposed patch - part 3: https://github.com/jfriesse/csts/commit/5188f85d1956db1c14e37737d4dadd5935c78d52

With patch 3, messages are no longer lost on a single-node cluster, but sadly they may still be lost on a multiple-node cluster.

First run (10 iterations) showed no failures :) .. Running again for more confidence.

Barry, I've split this BZ into two different BZs. Bug #907894 (this one) is to solve the multiple-node message corruption/loss/out-of-order delivery problems. Bug #922671 is to solve the local IPC problems. I am therefore marking the patches in this BZ as obsolete (and moving them to Bug #922671).

Created attachment 713841 [details]
Proposed patch - part 1 - totempg: Make iov_delv local variable
Created attachment 713842 [details]
Proposed patch - part 2 - Remove exit thread and replace it by exit pipe
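The exit-pipe approach named in the part 2 patch title is a common pattern: instead of a dedicated thread whose only job is to signal shutdown, the main poll loop watches the read end of a pipe, and any thread can request an exit by writing a single byte. A generic sketch of the pattern (illustrative; not the corosync code):

```c
#include <poll.h>
#include <unistd.h>

static int exit_pipe[2];   /* [0] = read end, [1] = write end */

/* Create the pipe once at startup; returns 0 on success. */
static int exit_pipe_init(void)
{
    return pipe(exit_pipe);
}

/* Safe to call from any thread: wake the poll loop by writing a byte. */
static void request_exit(void)
{
    (void)write(exit_pipe[1], "x", 1);
}

/* One iteration of the main loop: returns 1 once an exit was requested,
 * 0 on timeout (where normal loop work would happen). */
static int poll_once(int timeout_ms)
{
    struct pollfd pfd = { .fd = exit_pipe[0], .events = POLLIN };
    if (poll(&pfd, 1, timeout_ms) > 0 && (pfd.revents & POLLIN))
        return 1;
    return 0;
}
```

This removes the races a separate exit thread can introduce, since shutdown is observed at a well-defined point in the same loop that handles all other events.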
Created attachment 713843 [details]
Proposed patch - part 3 - schedwrk: Set values before create callback
Created attachment 713844 [details]
Proposed patch - part 4 - Fix race for sending_allowed
Created attachment 713855 [details]
Proposed patch - part 5 - totempg: Store and restore global variables
Created attachment 754386 [details]
Proposed patch - part 6 - Lock sync_in_process variable
sync_in_process is changed by the coropoll thread (the main thread) but read by
all IPC connections. To ensure a correct value is read, a mutex was added.
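The fix pattern can be sketched generically (names are illustrative, not the corosync source): both the write from the main thread and the reads from the IPC handlers take the same mutex, so readers never observe a torn or stale value.

```c
#include <pthread.h>
#include <stdbool.h>

static bool sync_in_process = false;
static pthread_mutex_t sync_in_process_mutex = PTHREAD_MUTEX_INITIALIZER;

/* Called from the coropoll (main) thread. */
static void set_sync_in_process(bool value)
{
    pthread_mutex_lock(&sync_in_process_mutex);
    sync_in_process = value;
    pthread_mutex_unlock(&sync_in_process_mutex);
}

/* Called from any IPC connection thread. */
static bool get_sync_in_process(void)
{
    pthread_mutex_lock(&sync_in_process_mutex);
    bool value = sync_in_process;
    pthread_mutex_unlock(&sync_in_process_mutex);
    return value;
}
```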
Verified using the start-cfgstop-one-by-one-with-load.sh test:

FAIL on corosync-1.4.1-15.el6.x86_64 (RHEL 6.4)
PASS on corosync-1.4.1-17.el6.x86_64 (RHEL 6.5)

Since the problem described in this bug report should be resolved in a recent advisory, it has been closed with a resolution of ERRATA. For information on the advisory, and where to find the updated files, follow the link below. If the solution does not work for you, open a new bug report.

http://rhn.redhat.com/errata/RHBA-2013-1531.html