Bug 922671 - CPG: Corosync can duplicate and/or lose messages - Local IPC
Status: CLOSED ERRATA
Product: Red Hat Enterprise Linux 6
Classification: Red Hat
Component: corosync
Version: 6.4
Hardware: All All
Priority: urgent
Severity: urgent
Target Milestone: rc
Target Release: ---
Assigned To: Jan Friesse
QA Contact: Cluster QE
Keywords: ZStream
Depends On:
Blocks: 907894 929096 929098 929100 929101
Reported: 2013-03-18 05:25 EDT by Jan Friesse
Modified: 2013-11-20 23:33 EST
Fixed In Version: corosync-1.4.1-16.el6
Doc Type: Bug Fix
Doc Text:
When running applications which used the Corosync IPC library, some messages in the dispatch() function were lost or duplicated. This update properly checks the return values of the dispatch_put() function, returns the correct remaining bytes in the IPC ring buffer, and ensures that the IPC client is correctly informed about the real number of messages in the ring buffer. Now, messages in the dispatch() function are no longer lost or duplicated.
Clone Of: 907894
Last Closed: 2013-11-20 23:33:08 EST
Type: Bug


Attachments
Proposed patch - part 1 (1.56 KB, patch)
2013-03-18 06:00 EDT, Jan Friesse
Proposed patch - part 2 - check dispatch_put return code (4.58 KB, patch)
2013-03-18 06:02 EDT, Jan Friesse
Proposed patch - part 3 - Take alignment into account for free_bytes in ring buffer (1.15 KB, patch)
2013-03-18 06:03 EDT, Jan Friesse
Proposed patch - part 4 - Properly lock pending_semops (2.86 KB, patch)
2013-03-18 06:10 EDT, Jan Friesse
Proposed patch - part 4 - Properly lock pending_semops - Try2 (2.86 KB, patch)
2013-03-18 06:33 EDT, Jan Friesse
Proposed patch - part 4 - Properly lock pending_semops - Try3 (2.97 KB, patch)
2013-03-18 06:37 EDT, Jan Friesse


External Trackers
Tracker ID Priority Status Summary Last Updated
Red Hat Knowledge Base (Solution) 389613 None None None Never
Comment 1 Jan Friesse 2013-03-18 05:26:32 EDT
This is a clone of Bug #907894 that covers the local IPC problems.
Comment 2 Jan Friesse 2013-03-18 06:00:18 EDT
Created attachment 711845
Proposed patch - part 1
Comment 3 Jan Friesse 2013-03-18 06:02:05 EDT
Created attachment 711846
Proposed patch - part 2 - check dispatch_put return code

The problems fixed by proposed patches 1 and 2 are reproducible by running https://github.com/jfriesse/csts/blob/master/tests/start-cfgstop-one-by-one-with-load.sh. When the bug appears, there are duplicated messages in the output (usually the last two are duplicates).
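
As a rough illustration of why the return code of a put operation matters, here is a minimal sketch. It is not the corosync code: `demo_ring`, `demo_dispatch_put`, and `demo_send` are hypothetical names, and the buffer is deliberately simplistic. The point is only that the message counter the client relies on should be advanced when, and only when, the put succeeded.

```c
#include <assert.h>
#include <stddef.h>
#include <string.h>

/* Hypothetical fixed-size dispatch buffer; not the corosync structure. */
#define DEMO_BUF_SIZE 64

struct demo_ring {
    unsigned char data[DEMO_BUF_SIZE];
    size_t used;      /* bytes currently stored */
    size_t messages;  /* number of messages the client would be told about */
};

/* Copy a message into the buffer. Returns 0 on success, -1 if it does not fit. */
int demo_dispatch_put(struct demo_ring *rb, const void *msg, size_t len)
{
    assert(rb != NULL && msg != NULL);
    if (len > DEMO_BUF_SIZE - rb->used) {
        return -1;  /* caller must keep the message and retry later */
    }
    memcpy(rb->data + rb->used, msg, len);
    rb->used += len;
    return 0;
}

/* Count (and announce) a message only after the put actually succeeded. */
int demo_send(struct demo_ring *rb, const void *msg, size_t len)
{
    if (demo_dispatch_put(rb, msg, len) != 0) {
        return -1;  /* nothing was queued, so nothing is announced */
    }
    rb->messages++;
    return 0;
}
```

If `demo_send` ignored the return value and incremented `messages` unconditionally, the reader would be told about a message that was never stored, which is one way a consumer can end up re-reading stale data or missing a message.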
Comment 4 Jan Friesse 2013-03-18 06:03:47 EDT
Created attachment 711848
Proposed patch - part 3 - Take alignment into account for free_bytes in ring buffer
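
The idea named in the patch title can be sketched as follows. This is an assumption-laden illustration, not the attached patch: the 8-byte alignment and the `demo_` helpers are hypothetical. If messages are padded to an alignment boundary when stored, the free-space check has to compare against the padded size rather than the raw message length.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>

#define DEMO_ALIGN 8u  /* assumed alignment, for illustration only */

/* Round len up to the next multiple of DEMO_ALIGN. */
size_t demo_align_up(size_t len)
{
    assert(len <= SIZE_MAX - (DEMO_ALIGN - 1));  /* guard against overflow */
    return (len + (DEMO_ALIGN - 1)) & ~(size_t)(DEMO_ALIGN - 1);
}

/* Does a message of msg_len bytes fit into free_bytes once it is padded? */
int demo_fits(size_t free_bytes, size_t msg_len)
{
    return demo_align_up(msg_len) <= free_bytes;
}
```

With these helpers, a 13-byte message needs 16 bytes of space, so a check against the unpadded length would wrongly accept it when only 14 bytes are free.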

"Unit test" https://github.com/jfriesse/csts/blob/master/tests/ipc-overflow.sh
Comment 5 Jan Friesse 2013-03-18 06:10:35 EDT
Created attachment 711850
Proposed patch - part 4 - Properly lock pending_semops

Sadly, this problem is a race, so it is quite hard to reproduce. I had moderate success with two nodes:
- node 1: running corosync, cpgload -q -n 500, and cpgload -l 1 -n 500 -q
- node 2: running corosync and cpgload -q -n 500

After 5+ hours, one of the cpgload instances terminates (it exits with return code 0, because it received CS_ERR_LIBRARY).

With the patch, I was able to run the configuration above for 3 days.

Keep in mind that, because of the extremely high load, cpgload may pause and corosync may then be terminated by the OOM killer. This is not a bug.
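
For readers unfamiliar with this class of race, the following is a generic sketch of guarding a shared counter with a mutex. It does not reproduce the attached patch: the structure, the field name `pending`, and the `demo_` functions are hypothetical, and the only assumption taken from the patch title is that a pending-operations counter is accessed from more than one thread.

```c
#include <assert.h>
#include <pthread.h>

/* Hypothetical per-connection state; names are illustrative only. */
struct demo_conn {
    pthread_mutex_t lock;
    unsigned int pending;  /* wakeups that could not be delivered yet */
};

/* Record one undelivered wakeup. */
void demo_pending_add(struct demo_conn *c)
{
    assert(c != NULL);
    pthread_mutex_lock(&c->lock);
    c->pending++;
    pthread_mutex_unlock(&c->lock);
}

/* Take all pending wakeups in one step and return how many there were. */
unsigned int demo_pending_take(struct demo_conn *c)
{
    unsigned int n;

    assert(c != NULL);
    pthread_mutex_lock(&c->lock);
    n = c->pending;
    c->pending = 0;
    pthread_mutex_unlock(&c->lock);
    return n;
}
```

Without the lock, an increment that lands between the read and the reset in `demo_pending_take` would be silently dropped, which matches the kind of rare, load-dependent failure described above.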
Comment 6 Jan Friesse 2013-03-18 06:16:37 EDT
Barry: Could you please give the scratch build at http://brewweb.devel.redhat.com/brew/taskinfo?taskID=5527619 a try? I was able to run the above test for 3 days, and I really hope it solves the problem you are hitting (please use the most unstable configuration, so irqbalance, corosync unpinned, errata kernel, ...).
Comment 7 Jan Friesse 2013-03-18 06:33:01 EDT
Created attachment 711875
Proposed patch - part 4 - Properly lock pending_semops - Try2
Comment 8 Jan Friesse 2013-03-18 06:37:33 EDT
Created attachment 711877
Proposed patch - part 4 - Properly lock pending_semops - Try3
Comment 9 Barry Marson 2013-03-25 10:33:36 EDT
I have run SAS calibration on my 4 node cluster with the latest bits.

Five different types of test configurations were run to stress the cluster interconnect in different ways.

Each configuration was run 10 times (each of which stresses the system for at least 2-3 hours).

After 6 days of testing, I can report ZERO failures.

Nice job.

Barry
Comment 16 Jaroslav Kortus 2013-09-11 09:15:33 EDT
Verified using ipc-overflow.sh test:

FAIL on corosync-1.4.1-15.el6.x86_64 (RHEL6.4)
PASS on corosync-1.4.1-17.el6.x86_64 (RHEL6.5)
Comment 18 errata-xmlrpc 2013-11-20 23:33:08 EST
Since the problem described in this bug report should be
resolved in a recent advisory, it has been closed with a
resolution of ERRATA.

For information on the advisory, and where to find the updated
files, follow the link below.

If the solution does not work for you, open a new bug report.

http://rhn.redhat.com/errata/RHBA-2013-1531.html
