Bug 907894 - CPG: Corosync can duplicate and/or lose messages - Multiple nodes problems
Status: CLOSED ERRATA
Product: Red Hat Enterprise Linux 6
Classification: Red Hat
Component: corosync
Version: 6.4
Hardware: All
OS: All
Priority: urgent
Severity: urgent
Target Milestone: rc
Target Release: ---
Assigned To: Jan Friesse
QA Contact: Cluster QE
Depends On: 922671 924261
Blocks: 960054
Reported: 2013-02-05 08:36 EST by Jan Friesse
Modified: 2015-09-27 22:24 EDT

See Also:
Fixed In Version: corosync-1.4.1-16.el6
Doc Type: Bug Fix
Doc Text:
Cause: Corosync is running on multiple nodes and some of the nodes are killed (corosync dies or exits, switch failure, ...).
Consequence: Very rarely, corosync can lose or duplicate messages.
Fix: Many race conditions were fixed.
Result: Corosync should no longer lose or duplicate messages.
Story Points: ---
Clone Of:
: 922671 (view as bug list)
Environment:
Last Closed: 2013-11-20 23:32:17 EST
Type: Bug


Attachments
Proposed patch - part 1 (1.56 KB, patch), 2013-02-05 08:38 EST, Jan Friesse
Proposed patch - part 2 - check dispatch_put return code (4.58 KB, patch), 2013-02-05 08:38 EST, Jan Friesse
Proposed patch - part 3 - Take alignment in account for free_bytes in ring buffer (1.15 KB, patch), 2013-03-07 12:00 EST, Jan Friesse
Proposed patch - part 1 - totempg: Make iov_delv local variable (868 bytes, patch), 2013-03-21 09:23 EDT, Jan Friesse
Proposed patch - part 2 - Remove exit thread and replace it by exit pipe (4.38 KB, patch), 2013-03-21 09:27 EDT, Jan Friesse
Proposed patch - part 3 - schedwrk: Set values before create callback (1.11 KB, patch), 2013-03-21 09:28 EDT, Jan Friesse
Proposed patch - part 4 - Fix race for sending_allowed (1.42 KB, patch), 2013-03-21 09:29 EDT, Jan Friesse
Proposed patch - part 5 - totempg: Store and restore global variables (12.50 KB, patch), 2013-03-21 09:32 EDT, Jan Friesse
Proposed patch - part 6 - Lock sync_in_process variable (2.07 KB, patch), 2013-05-29 09:47 EDT, Jan Friesse



External Trackers
Tracker: Red Hat Product Errata
ID: RHBA-2013:1531
Priority: normal
Status: SHIPPED_LIVE
Summary: corosync bug fix and enhancement update
Last Updated: 2013-11-20 19:40:57 EST

Description Jan Friesse 2013-02-05 08:36:33 EST
Description of problem:
Corosync can duplicate messages when the corosync server exits, and it can lose (overwrite) messages, apparently at any time.

Version-Release number of selected component (if applicable):
EL 6.4

How reproducible:
Almost 100%

Steps to Reproduce:
1. Run https://github.com/jfriesse/csts/blob/master/tests/start-cfgstop-one-by-one-with-load.sh (this is a "unit" test)
  
Actual results:
- On corosync exit, multiple duplicate messages can be delivered to the application
- Messages are lost or overwritten

Expected results:
Test result is 0 ($? == 0)

Additional info:
Comment 1 Jan Friesse 2013-02-05 08:38:25 EST
Created attachment 693380
Proposed patch - part 1
Comment 2 Jan Friesse 2013-02-05 08:38:57 EST
Created attachment 693381
Proposed patch - part 2 - check dispatch_put return code
Comment 3 Jan Friesse 2013-02-05 08:39:36 EST
With patches 1 and 2, duplicate messages are gone. The problem with corrupted/lost messages persists.
Comment 4 Jan Friesse 2013-03-07 12:00:24 EST
Created attachment 706733
Proposed patch - part 3 - Take alignment in account for free_bytes in ring buffer

This patch solves another issue (CRITICAL). I believe it also solves the problems bmarson is hitting.
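
The kind of bug the patch title points at can be illustrated with a small sketch. This is not corosync's ring buffer code: `struct rb`, `rb_free_bytes`, `rb_can_write`, the 8-byte alignment and the one reserved byte are all assumptions made for the example. It shows why a free-space check has to use the aligned chunk size when the write offset advances by the aligned size.

```c
#include <assert.h>
#include <stdbool.h>
#include <stddef.h>

#define RB_ALIGN 8u /* assumed chunk alignment in bytes */

/* Hypothetical ring buffer bookkeeping; the data area itself is omitted. */
struct rb {
    size_t size;  /* total capacity in bytes, a multiple of RB_ALIGN */
    size_t read;  /* offset of the oldest unread byte */
    size_t write; /* offset where the next chunk will start */
};

/* Round len up to the next multiple of RB_ALIGN. */
static size_t rb_align_up(size_t len)
{
    return (len + RB_ALIGN - 1) & ~(size_t)(RB_ALIGN - 1);
}

/* Bytes that may still be written. One byte stays reserved so that
 * read == write always means "empty" and never "full". */
static size_t rb_free_bytes(const struct rb *rb)
{
    assert(rb->size > 0 && rb->size % RB_ALIGN == 0);
    size_t used = (rb->write + rb->size - rb->read) % rb->size;
    return rb->size - used - 1;
}

/* Buggy check: compares the raw payload length with the free space. */
static bool rb_can_write_unaligned(const struct rb *rb, size_t len)
{
    return len <= rb_free_bytes(rb);
}

/* Fixed check: compares the space the chunk really occupies. */
static bool rb_can_write(const struct rb *rb, size_t len)
{
    return rb_align_up(len) <= rb_free_bytes(rb);
}

/* Account for one chunk; returns false when it does not fit. */
static bool rb_put(struct rb *rb, size_t len)
{
    if (!rb_can_write(rb, len))
        return false;
    rb->write = (rb->write + rb_align_up(len)) % rb->size;
    return true;
}
```

With a 64-byte buffer holding 56 unread bytes, 7 bytes are free. A 7-byte message passes the unaligned check, but its chunk occupies 8 bytes, so the write offset would wrap onto the read offset and the buffer would look empty. The aligned check rejects the write instead.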

Reproducer:
- Start corosync on one node
- Run cpgload -q -n 500
- After (usually) less than three minutes, lost messages will appear in the form:
20130307T170546:(a126220a 3200):341:Incorrect msg seq 341 != 297
20130307T170546:(a126220a 3200):342:Incorrect msg seq 342 != 297
- cpgload will end with
Dispatch error 2
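
The "Incorrect msg seq" lines above suggest that the receiver compares each message's sequence number with the one it expects next. A minimal sketch of such a check follows. It is not cpgload's actual code, and `seq_tracker`, `seq_check` and the result names are hypothetical.

```c
#include <assert.h>
#include <stdint.h>

/* Outcome of checking one received sequence number. */
enum seq_result {
    SEQ_OK,        /* exactly the number that was expected */
    SEQ_DUPLICATE, /* older than expected: already delivered once */
    SEQ_LOST       /* newer than expected: something in between is missing */
};

/* Per-sender receive state (hypothetical). */
struct seq_tracker {
    uint32_t expected; /* next sequence number expected from this sender */
};

/* Classify seq and advance the expectation only on an exact match. */
static enum seq_result seq_check(struct seq_tracker *t, uint32_t seq)
{
    assert(t);
    if (seq == t->expected) {
        t->expected++;
        return SEQ_OK;
    }
    if (seq < t->expected)
        return SEQ_DUPLICATE;
    /* Gap: the expectation is left unchanged, so every later message
     * is reported against the same expected value. */
    return SEQ_LOST;
}
```

Leaving the expected value unchanged after a gap mirrors the two log lines above, where the right-hand side stays at 297 while the received number goes from 341 to 342.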
Comment 6 Jan Friesse 2013-03-08 04:21:11 EST
"Unit test" for proposed patch - part 3:
https://github.com/jfriesse/csts/commit/5188f85d1956db1c14e37737d4dadd5935c78d52
Comment 7 Jan Friesse 2013-03-08 05:00:41 EST
With patch 3, messages are no longer lost on a single-node cluster, but sadly they may still be lost on a multi-node cluster.
Comment 10 Barry Marson 2013-03-10 11:08:42 EDT
First run (10 iterations) showed no failures :) .. Running again for more confidence.

Barry
Comment 11 Jan Friesse 2013-03-18 05:30:20 EDT
I've split this BZ into two different BZs.
Bug #907894 (this one) is for multiple-node message corruption, loss, and out-of-order delivery.
Bug #922671 is for local IPC problems.

So I'm marking the patches in this BZ as obsolete (and moving them to Bug #922671).
Comment 12 Jan Friesse 2013-03-21 09:23:22 EDT
Created attachment 713841
Proposed patch - part 1 - totempg: Make iov_delv local variable
Comment 13 Jan Friesse 2013-03-21 09:27:13 EDT
Created attachment 713842
Proposed patch - part 2 - Remove exit thread and replace it by exit pipe
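
The patch itself is not reproduced in this report. As background only, the general "exit pipe" pattern the title names can be sketched as below: instead of a dedicated exit thread, a byte written to a pipe wakes the main poll loop, which then shuts down in its own context. All names here (`exit_pipe`, `exit_request`, `exit_requested`) are hypothetical and not taken from corosync.

```c
#include <assert.h>
#include <poll.h>
#include <stdbool.h>
#include <unistd.h>

/* Hypothetical exit pipe: [0] is polled by the main loop,
 * [1] is written to when an exit is requested. */
static int exit_pipe[2] = { -1, -1 };

/* Create the pipe; returns 0 on success, -1 on error (see pipe(2)). */
static int exit_pipe_init(void)
{
    return pipe(exit_pipe);
}

/* Request shutdown from any thread. A one-byte write to a pipe is
 * atomic, so no extra locking is needed here. */
static bool exit_request(void)
{
    char byte = 'x';
    assert(exit_pipe[1] >= 0);
    return write(exit_pipe[1], &byte, 1) == 1;
}

/* One main-loop check: true once an exit request is pending. */
static bool exit_requested(int timeout_ms)
{
    struct pollfd pfd = { .fd = exit_pipe[0], .events = POLLIN, .revents = 0 };
    int rc = poll(&pfd, 1, timeout_ms);
    return rc == 1 && (pfd.revents & POLLIN) != 0;
}
```

In a real main loop the read end would be one of many polled descriptors, so the exit request is handled in the same thread that owns the rest of the state.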
Comment 14 Jan Friesse 2013-03-21 09:28:25 EDT
Created attachment 713843
Proposed patch - part 3 - schedwrk: Set values before create callback
Comment 15 Jan Friesse 2013-03-21 09:29:47 EDT
Created attachment 713844
Proposed patch - part 4 - Fix race for sending_allowed
Comment 16 Jan Friesse 2013-03-21 09:32:17 EDT
Created attachment 713855
Proposed patch - part 5 - totempg: Store and restore global variables
Comment 18 Jan Friesse 2013-05-29 09:47:51 EDT
Created attachment 754386
Proposed patch - part 6 - Lock sync_in_process variable

sync_in_process is changed by the coropoll thread (main thread) but is read by
all IPC connections. A mutex is added to ensure that the correct value is read.
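
The locking described above can be sketched roughly as follows. This is an illustration under assumptions, not the attached patch: the flag and mutex names and the getter/setter split are made up for the example, and only the idea (one writer thread, many reader threads, one mutex around every access) comes from the comment.

```c
#include <assert.h>
#include <pthread.h>
#include <stdbool.h>

/* Hypothetical stand-ins for the shared flag and its lock. */
static bool sync_in_process_flag = false;
static pthread_mutex_t sync_in_process_mutex = PTHREAD_MUTEX_INITIALIZER;

/* Called by the main (poll) thread when synchronization starts or ends. */
static void sync_in_process_set(bool value)
{
    int rc = pthread_mutex_lock(&sync_in_process_mutex);
    assert(rc == 0);
    sync_in_process_flag = value;
    rc = pthread_mutex_unlock(&sync_in_process_mutex);
    assert(rc == 0);
    (void)rc; /* silence unused-variable warnings when NDEBUG is set */
}

/* Called by IPC connection threads before they act on the flag. */
static bool sync_in_process_get(void)
{
    bool value;
    int rc = pthread_mutex_lock(&sync_in_process_mutex);
    assert(rc == 0);
    value = sync_in_process_flag;
    rc = pthread_mutex_unlock(&sync_in_process_mutex);
    assert(rc == 0);
    (void)rc;
    return value;
}
```

Routing every read and write through the two helpers keeps the locking in one place, so a new caller cannot accidentally read the flag without holding the mutex.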
Comment 25 Jaroslav Kortus 2013-09-11 09:07:22 EDT
Verified using start-cfgstop-one-by-one-with-load.sh test

FAIL on corosync-1.4.1-15.el6.x86_64 (RHEL6.4)
PASS on corosync-1.4.1-17.el6.x86_64 (RHEL6.5)
Comment 27 errata-xmlrpc 2013-11-20 23:32:17 EST
Since the problem described in this bug report should be
resolved in a recent advisory, it has been closed with a
resolution of ERRATA.

For information on the advisory, and where to find the updated
files, follow the link below.

If the solution does not work for you, open a new bug report.

http://rhn.redhat.com/errata/RHBA-2013-1531.html
