Bug 907894 - CPG: Corosync can duplicate and/or lose messages - Multiple node problems
Summary: CPG: Corosync can duplicate and/or lose messages - Multiple node problems
Status: CLOSED ERRATA
Alias: None
Product: Red Hat Enterprise Linux 6
Classification: Red Hat
Component: corosync
Version: 6.4
Hardware: All
OS: All
Priority: urgent
Severity: urgent
Target Milestone: rc
Target Release: ---
Assignee: Jan Friesse
QA Contact: Cluster QE
URL:
Whiteboard:
Keywords:
Depends On: 922671 924261
Blocks: 960054
 
Reported: 2013-02-05 13:36 UTC by Jan Friesse
Modified: 2015-09-28 02:24 UTC (History)
CC List: 7 users

Cause: 
Corosync running on multiple nodes, and some of the nodes are killed (corosync dies/exits/switch failure/...)

Consequence: 
Very rarely, corosync can lose or duplicate messages.

Fix: 
Fixed many race conditions.

Result: 
Corosync should no longer lose or duplicate messages.
Clone Of:
: 922671 (view as bug list)
Last Closed: 2013-11-21 04:32:17 UTC


Attachments
Proposed patch - part 1 (1.56 KB, patch)
2013-02-05 13:38 UTC, Jan Friesse
no flags Details | Diff
Proposed patch - part 2 - check dispatch_put return code (4.58 KB, patch)
2013-02-05 13:38 UTC, Jan Friesse
no flags Details | Diff
Proposed patch - part 3 - Take alignment into account for free_bytes in ring buffer (1.15 KB, patch)
2013-03-07 17:00 UTC, Jan Friesse
no flags Details | Diff
Proposed patch - part 1 - totempg: Make iov_delv local variable (868 bytes, patch)
2013-03-21 13:23 UTC, Jan Friesse
no flags Details | Diff
Proposed patch - part 2 - Remove exit thread and replace it by exit pipe (4.38 KB, patch)
2013-03-21 13:27 UTC, Jan Friesse
no flags Details | Diff
Proposed patch - part 3 - schedwrk: Set values before create callback (1.11 KB, patch)
2013-03-21 13:28 UTC, Jan Friesse
no flags Details | Diff
Proposed patch - part 4 - Fix race for sending_allowed (1.42 KB, patch)
2013-03-21 13:29 UTC, Jan Friesse
no flags Details | Diff
Proposed patch - part 5 - totempg: Store and restore global variables (12.50 KB, patch)
2013-03-21 13:32 UTC, Jan Friesse
no flags Details | Diff
Proposed patch - part 6 - Lock sync_in_process variable (2.07 KB, patch)
2013-05-29 13:47 UTC, Jan Friesse
no flags Details | Diff


External Trackers
Tracker ID Priority Status Summary Last Updated
Red Hat Product Errata RHBA-2013:1531 normal SHIPPED_LIVE corosync bug fix and enhancement update 2013-11-21 00:40:57 UTC

Description Jan Friesse 2013-02-05 13:36:33 UTC
Description of problem:
Corosync can duplicate messages on corosync server exit and/or lose (overwrite) messages (apparently at any time)

Version-Release number of selected component (if applicable):
EL 6.4

How reproducible:
Almost 100%

Steps to Reproduce:
1. https://github.com/jfriesse/csts/blob/master/tests/start-cfgstop-one-by-one-with-load.sh (this is "Unit" test)
  
Actual results:
- On the corosync exit, multiple duplicated messages can be delivered to app
- Messages are lost/overwritten

Expected results:
Test result is 0 ($? == 0)

Additional info:

Comment 1 Jan Friesse 2013-02-05 13:38:25 UTC
Created attachment 693380 [details]
Proposed patch - part 1

Comment 2 Jan Friesse 2013-02-05 13:38:57 UTC
Created attachment 693381 [details]
Proposed patch - part 2 - check dispatch_put return code
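The patch title points at a classic bug class: ignoring the return code of a put into a dispatch queue. A minimal, hedged sketch of the idea (the queue and names here are illustrative, not corosync's actual dispatch code):

```c
#include <errno.h>
#include <stddef.h>

#define QCAP 4

/* Tiny illustrative bounded queue; corosync's dispatch queue differs. */
struct queue {
    int items[QCAP];
    size_t count;
};

/* Returns 0 on success, -ENOBUFS when the queue is full. */
static int dispatch_put(struct queue *q, int msg)
{
    if (q->count == QCAP)
        return -ENOBUFS;
    q->items[q->count++] = msg;
    return 0;
}

/*
 * Checking the return code is the point of the patch: silently
 * ignoring a failed put and retrying later is one way the same
 * message can end up delivered twice.
 */
static int deliver(struct queue *q, int msg)
{
    int res = dispatch_put(q, msg);
    if (res != 0)
        return res;   /* surface back-pressure to the caller */
    return 0;
}
```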

Comment 3 Jan Friesse 2013-02-05 13:39:36 UTC
With patches 1 and 2, duplicate messages are gone. Problem with corrupted/lost messages persists.

Comment 4 Jan Friesse 2013-03-07 17:00:24 UTC
Created attachment 706733 [details]
Proposed patch - part 3 - Take alignment into account for free_bytes in ring buffer

This patch solves another issue (CRITICAL). I believe it also solves the problems bmarson is hitting.

Reproducer:
- Start corosync on one node
- Run cpgload -q -n 500
- After (usually) less than three minutes, lost messages will appear in the form:
20130307T170546:(a126220a 3200):341:Incorrect msg seq 341 != 297
20130307T170546:(a126220a 3200):342:Incorrect msg seq 342 != 297
- cpgload will end with
Dispatch error 2
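The lost messages in this reproducer fit the patch title: a ring buffer that reports raw free space can let an aligned writer overrun unread data once message padding is added. A hedged sketch of taking alignment into account (hypothetical structure and names, not the corosync source):

```c
#include <stddef.h>

/* Hypothetical ring buffer; corosync's real structure differs. */
struct ring_buffer {
    size_t size;  /* total capacity in bytes */
    size_t head;  /* write index */
    size_t tail;  /* read index */
};

/* Raw free space, ignoring alignment (one slot kept empty). */
static size_t rb_free_bytes_raw(const struct ring_buffer *rb)
{
    if (rb->head >= rb->tail)
        return rb->size - (rb->head - rb->tail) - 1;
    return rb->tail - rb->head - 1;
}

/*
 * Free space usable by an aligned writer.  Every message is padded up
 * to ALIGN bytes, so advertising the raw count lets a writer start a
 * message whose padding overwrites data not yet read.
 */
#define ALIGN 8
static size_t rb_free_bytes_aligned(const struct ring_buffer *rb)
{
    size_t raw = rb_free_bytes_raw(rb);
    return raw - (raw % ALIGN);
}
```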

Comment 6 Jan Friesse 2013-03-08 09:21:11 UTC
"Unit test" for proposed patch - part 3:
https://github.com/jfriesse/csts/commit/5188f85d1956db1c14e37737d4dadd5935c78d52

Comment 7 Jan Friesse 2013-03-08 10:00:41 UTC
With patch 3, messages are no longer lost on a single-node cluster, but sadly may still be lost on a multi-node cluster.

Comment 10 Barry Marson 2013-03-10 15:08:42 UTC
First run (10 iterations) showed no failures :) .. Running again for more confidence.

Barry

Comment 11 Jan Friesse 2013-03-18 09:30:20 UTC
I've split this BZ to two different BZillas.
Bug #907894 (this one) is to solve multi-node message corruption/loss/out-of-order delivery.
Bug #922671 is to solve local IPC problems.

Thus, marking the patches in this BZ as obsolete (and moving them to Bug #922671).

Comment 12 Jan Friesse 2013-03-21 13:23:22 UTC
Created attachment 713841 [details]
Proposed patch - part 1 - totempg: Make iov_delv local variable

Comment 13 Jan Friesse 2013-03-21 13:27:13 UTC
Created attachment 713842 [details]
Proposed patch - part 2 - Remove exit thread and replace it by exit pipe
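The "exit pipe" named in this patch title is a common shutdown pattern: instead of a dedicated exit thread, the main loop watches the read end of a pipe, and any thread can request shutdown by writing one byte to the write end. A generic, hedged sketch of the pattern (not the actual corosync code):

```c
#include <poll.h>
#include <unistd.h>

static int exit_pipe[2];  /* [0] = read end, [1] = write end */

int exit_pipe_init(void)
{
    return pipe(exit_pipe);
}

/* Called from any thread to request shutdown. */
void request_exit(void)
{
    char c = 1;
    ssize_t r = write(exit_pipe[1], &c, 1);
    (void)r;  /* a one-byte write to a pipe does not short-write */
}

/* Main loop: poll the exit pipe (alongside the other fds). */
int should_exit(int timeout_ms)
{
    struct pollfd pfd = { .fd = exit_pipe[0], .events = POLLIN };
    return poll(&pfd, 1, timeout_ms) > 0 && (pfd.revents & POLLIN);
}
```

The pipe makes shutdown visible through the same poll loop as all other events, which removes the races inherent in a separate thread signalling exit.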

Comment 14 Jan Friesse 2013-03-21 13:28:25 UTC
Created attachment 713843 [details]
Proposed patch - part 3 - schedwrk: Set values before create callback

Comment 15 Jan Friesse 2013-03-21 13:29:47 UTC
Created attachment 713844 [details]
Proposed patch - part 4 - Fix race for sending_allowed

Comment 16 Jan Friesse 2013-03-21 13:32:17 UTC
Created attachment 713855 [details]
Proposed patch - part 5 - totempg: Store and restore global variables

Comment 18 Jan Friesse 2013-05-29 13:47:51 UTC
Created attachment 754386 [details]
Proposed patch - part 6 - Lock sync_in_process variable

sync_in_process is changed by coropoll thread (main thread) but used by
all IPC connections. To ensure correct value is read, mutex is added.
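The pattern described above (one writer thread, many reader connections, a mutex around the shared flag) can be sketched as follows; the wrapper names are hypothetical, not corosync's exact code:

```c
#include <pthread.h>

/*
 * sync_in_process is written by the main (coropoll) thread and read by
 * all IPC connection handlers; the mutex guarantees readers observe
 * the value the writer last stored.
 */
static int sync_in_process = 0;
static pthread_mutex_t sync_mutex = PTHREAD_MUTEX_INITIALIZER;

void set_sync_in_process(int value)
{
    pthread_mutex_lock(&sync_mutex);
    sync_in_process = value;
    pthread_mutex_unlock(&sync_mutex);
}

int get_sync_in_process(void)
{
    pthread_mutex_lock(&sync_mutex);
    int value = sync_in_process;
    pthread_mutex_unlock(&sync_mutex);
    return value;
}
```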

Comment 25 Jaroslav Kortus 2013-09-11 13:07:22 UTC
Verified using start-cfgstop-one-by-one-with-load.sh test

FAIL on corosync-1.4.1-15.el6.x86_64 (RHEL6.4)
PASS on corosync-1.4.1-17.el6.x86_64 (RHEL6.5)

Comment 27 errata-xmlrpc 2013-11-21 04:32:17 UTC
Since the problem described in this bug report should be
resolved in a recent advisory, it has been closed with a
resolution of ERRATA.

For information on the advisory, and where to find the updated
files, follow the link below.

If the solution does not work for you, open a new bug report.

http://rhn.redhat.com/errata/RHBA-2013-1531.html

