Bug 830799 - Nodes do not agree on CPG membership, messages lost
Summary: Nodes do not agree on CPG membership, messages lost
Keywords:
Status: CLOSED ERRATA
Alias: None
Product: Red Hat Enterprise Linux 6
Classification: Red Hat
Component: corosync
Version: 6.4
Hardware: x86_64
OS: Linux
Priority: urgent
Severity: urgent
Target Milestone: rc
Target Release: ---
Assignee: Jan Friesse
QA Contact: Cluster QE
URL:
Whiteboard:
Duplicates: 884770 (view as bug list)
Depends On: 820821 863940 869609
Blocks: 895654
 
Reported: 2012-06-11 12:47 UTC by Jan Friesse
Modified: 2018-12-02 18:51 UTC (History)
CC List: 12 users

Fixed In Version: corosync-1.4.1-13.el6
Doc Type: Bug Fix
Doc Text:
Clone Of: 820821
Environment:
Last Closed: 2013-02-21 07:50:27 UTC
Target Upstream Version:


Attachments
Proposed patch - part 1 - Never choose downlist with localnode (1.85 KB, patch)
2012-06-14 13:22 UTC, Jan Friesse
no flags
Proposed patch - part 2 - Process join list after downlists (6.08 KB, patch)
2012-06-14 13:23 UTC, Jan Friesse
no flags
Proposed patch - part 3 - Enhance downlist selection algorithm (1.63 KB, patch)
2012-06-14 13:24 UTC, Jan Friesse
no flags
Proposed patch - part 4 - Fix problem with sync operations under very rare circumstances (7.06 KB, patch)
2012-11-08 14:59 UTC, Jan Friesse
no flags
Proposed patch - part 5 - Handle segfault in backlog_get (1009 bytes, patch)
2012-11-08 15:00 UTC, Jan Friesse
no flags
Proposed patch - part 6 - Add waiting_trans_ack also to fragmentation layer (9.44 KB, patch)
2012-11-08 15:01 UTC, Jan Friesse
no flags


Links
System ID Priority Status Summary Last Updated
Red Hat Knowledge Base (Solution) 265823 None None None 2018-12-02 18:51:38 UTC
Red Hat Product Errata RHBA-2013:0497 normal SHIPPED_LIVE corosync bug fix and enhancement update 2013-02-20 21:18:24 UTC

Description Jan Friesse 2012-06-11 12:47:15 UTC
+++ This bug was initially created as a clone of Bug #820821 +++

Created attachment 583724 [details]
Full logs and versions

Description of problem:

4 node cluster
- pcmk-1 = 1702537408 
- pcmk-2 = 1719314624
- pcmk-3 = 1736091840
- pcmk-4 = 1752869056

All nodes see pcmk-1, pcmk-2, pcmk-4
pcmk-2 cannot see pcmk-3, everyone else can

Once we reach this state, the only messages logged by corosync are of the form:

corosync warning [CPG   ] Unknown node -> we will not deliver message

This repeats for about 15 minutes until pcmk-4 is shut down and things appear to return to normal.

The other nodes also show the same message at times, but the pacemaker logs show that many messages are getting through in between.

After talking with Angus, if nothing else, it is probably a good idea to add some additional information to that log message.

Version-Release number of selected component (if applicable):

Name        : corosync
Version     : 1.99.9
Release     : 1.fc17
Architecture: x86_64
Install Date: Fri 13 Apr 2012 01:58:59 CEST
Group       : System Environment/Base
Size        : 379855
License     : BSD
Signature   : RSA/SHA256, Fri 30 Mar 2012 02:35:59 CEST, Key ID 50e94c991aca3465
Source RPM  : corosync-1.99.9-1.fc17.src.rpm
Build Date  : Wed 28 Mar 2012 07:15:04 CEST
Build Host  : x86-07.phx2.fedoraproject.org

How reproducible:

Rare

Steps to Reproduce:
1.  Run CTS
2.  Pray
3.
  
Actual results:


Expected results:


Additional info:


May 11 04:55:10 [3718] pcmk-2 corosync notice  [TOTEM ] Initializing transport (UDP/IP Multicast).
May 11 04:55:10 [3718] pcmk-2 corosync notice  [TOTEM ] Initializing transmit/receive security (NSS) crypto: none hash: none
May 11 04:55:10 [3718] pcmk-2 corosync notice  [TOTEM ] The network interface [192.168.122.102] is now up.
May 11 04:55:10 [3718] pcmk-2 corosync notice  [SERV  ] Service engine loaded: corosync configuration map access [0]
May 11 04:55:10 [3718] pcmk-2 corosync info    [QB    ] server name: cmap
May 11 04:55:10 [3718] pcmk-2 corosync notice  [SERV  ] Service engine loaded: corosync configuration service [1]
May 11 04:55:10 [3718] pcmk-2 corosync info    [QB    ] server name: cfg
May 11 04:55:10 [3718] pcmk-2 corosync notice  [SERV  ] Service engine loaded: corosync cluster closed process group service v1.01 [2]
May 11 04:55:10 [3718] pcmk-2 corosync info    [QB    ] server name: cpg
May 11 04:55:10 [3718] pcmk-2 corosync notice  [SERV  ] Service engine loaded: corosync profile loading service [4]
May 11 04:55:10 [3718] pcmk-2 corosync notice  [QUORUM] Using quorum provider corosync_votequorum
May 11 04:55:10 [3718] pcmk-2 corosync notice  [SERV  ] Service engine loaded: corosync vote quorum service v1.0 [5]
May 11 04:55:10 [3718] pcmk-2 corosync info    [QB    ] server name: votequorum
May 11 04:55:10 [3718] pcmk-2 corosync notice  [SERV  ] Service engine loaded: corosync cluster quorum service v0.1 [3]
May 11 04:55:10 [3718] pcmk-2 corosync info    [QB    ] server name: quorum
May 11 04:55:10 [3718] pcmk-2 corosync notice  [QUORUM] Members[1]: 1719314624
May 11 04:55:10 [3718] pcmk-2 corosync notice  [TOTEM ] A processor joined or left the membership and a new membership was formed.
May 11 04:55:10 [3718] pcmk-2 corosync notice  [MAIN  ] Completed service synchronization, ready to provide service.
May 11 04:55:10 [3718] pcmk-2 corosync notice  [QUORUM] Members[2]: 1719314624 1736091840
May 11 04:55:10 [3718] pcmk-2 corosync notice  [TOTEM ] A processor joined or left the membership and a new membership was formed.
May 11 04:55:10 [3718] pcmk-2 corosync notice  [MAIN  ] Completed service synchronization, ready to provide service.
May 11 04:55:11 [3718] pcmk-2 corosync notice  [QUORUM] Members[4]: 1702537408 1719314624 1736091840 1752869056
May 11 04:55:11 [3718] pcmk-2 corosync notice  [TOTEM ] A processor joined or left the membership and a new membership was formed.
May 11 04:55:11 [3718] pcmk-2 corosync notice  [QUORUM] This node is within the primary component and will provide service.
May 11 04:55:11 [3718] pcmk-2 corosync notice  [QUORUM] Members[4]: 1702537408 1719314624 1736091840 1752869056
May 11 04:55:11 [3718] pcmk-2 corosync notice  [TOTEM ] Retransmit List: 1 2 3 
May 11 04:55:11 [3718] pcmk-2 corosync notice  [MAIN  ] Completed service synchronization, ready to provide service.
May 11 04:55:11 [3718] pcmk-2 corosync warning [CPG   ] Unknown node -> we will not deliver message
May 11 04:55:11 [3718] pcmk-2 corosync warning [CPG   ] Unknown node -> we will not deliver message
...
May 11 05:10:27 [3718] pcmk-2 corosync warning [CPG   ] Unknown node -> we will not deliver message
May 11 05:10:30 [3718] pcmk-2 corosync warning [CPG   ] Unknown node -> we will not deliver message
May 11 05:10:30 [3718] pcmk-2 corosync notice  [QUORUM] Members[3]: 1702537408 1719314624 1736091840

At this point pcmk-2 starts to receive messages from pcmk-3

--- Additional comment from abeekhof@redhat.com on 2012-05-11 04:45:38 EDT ---

Reproduced with the latest from f17. Now that I know what to look for, perhaps it's not so rare after all.

Name        : corosync
Version     : 2.0.0
Release     : 1.fc17
Architecture: x86_64
Install Date: Fri 11 May 2012 07:03:45 CEST
Group       : System Environment/Base
Size        : 383797
License     : BSD
Signature   : RSA/SHA256, Wed 11 Apr 2012 19:53:39 CEST, Key ID 50e94c991aca3465
Source RPM  : corosync-2.0.0-1.fc17.src.rpm
Build Date  : Wed 11 Apr 2012 08:55:29 CEST
Build Host  : x86-09.phx2.fedoraproject.org

--- Additional comment from abeekhof@redhat.com on 2012-05-14 22:10:52 EDT ---

I managed to catch a live cluster in this state.

pcmk-1 == 192.168.122.101
pcmk-2 == 192.168.122.102

I verified that the "we will not deliver message" log occurs whenever the cib on pcmk-2 tries to send a CPG message to the cib on pcmk-1.  It probably occurs in additional cases as well.

Black box data will be attached.


[root@pcmk-2 ~]# ps axf
...
24038 ?        Ssl   17:00 /usr/sbin/corosync
24051 ?        Ssl    0:28 /usr/sbin/pacemakerd -f
24055 ?        Ssl    0:45  \_ /usr/libexec/pacemaker/cib
24057 ?        Ss     0:30  \_ /usr/libexec/pacemaker/stonithd
24058 ?        Ss     2:35  \_ /usr/lib64/heartbeat/lrmd
24059 ?        Ss     0:52  \_ /usr/libexec/pacemaker/attrd
24060 ?        Ss     0:28  \_ /usr/libexec/pacemaker/pengine
24568 ?        Ss     0:00  \_ /usr/libexec/pacemaker/crmd
24501 ?        Ssl    0:00 /usr/sbin/console-kit-daemon --no-daemon
[root@pcmk-2 ~]# corosync-cpgtool 
Group Name	       PID	   Node ID
crmd\x00
		     17249	1702537408 (192.168.122.101)
		     24568	1719314624 (192.168.122.102)
attrd\x00
		     17247	1702537408 (192.168.122.101)
stonith-ng\x00
		     17245	1702537408 (192.168.122.101)
cib\x00
		     17243	1702537408 (192.168.122.101)
pcmk\x00
		     17242	1702537408 (192.168.122.101)
[root@pcmk-2 ~]# grep 24055 /var/log/messages | grep "info: pcmk_cpg_membership"
May 15 02:50:23 pcmk-2 cib[24055]:     info: pcmk_cpg_membership: Member[0] 1719314624 
May 15 02:50:23 pcmk-2 cib[24055]:     info: pcmk_cpg_membership: Left[0] 1702537408 
May 15 02:50:59 pcmk-2 cib[24055]:     info: pcmk_cpg_membership: Member[0] 1702537408 
May 15 02:50:59 pcmk-2 cib[24055]:     info: pcmk_cpg_membership: Member[1] 1719314624 
May 15 02:50:59 pcmk-2 cib[24055]:     info: pcmk_cpg_membership: Member[0] 1702537408 
May 15 02:50:59 pcmk-2 cib[24055]:     info: pcmk_cpg_membership: Left[0] 1719314624 
May 15 02:51:02 pcmk-2 cib[24055]:     info: pcmk_cpg_membership: Left[0] 1702537408 
May 15 02:51:05 pcmk-2 cib[24055]:     info: pcmk_cpg_membership: Member[0] 1702537408 
May 15 02:51:40 pcmk-2 cib[24055]:     info: pcmk_cpg_membership: Left[0] 1702537408 
May 15 02:51:41 pcmk-2 cib[24055]:     info: pcmk_cpg_membership: Member[0] 1702537408 
May 15 02:56:33 pcmk-2 cib[24055]:     info: pcmk_cpg_membership: Left[0] 1702537408 
May 15 02:56:47 pcmk-2 cib[24055]:     info: pcmk_cpg_membership: Member[0] 1702537408 
May 15 02:59:29 pcmk-2 cib[24055]:     info: pcmk_cpg_membership: Left[0] 1702537408 
May 15 02:59:31 pcmk-2 cib[24055]:     info: pcmk_cpg_membership: Member[0] 1702537408 




[root@pcmk-1 ~]# ps axf
...
17224 ?        Ssl   28:08 /usr/sbin/corosync
17242 ?        Ssl    0:35 /usr/sbin/pacemakerd -f
17243 ?        Ssl    0:48  \_ /usr/libexec/pacemaker/cib
17245 ?        Ssl    1:17  \_ /usr/libexec/pacemaker/stonithd
17246 ?        Ss     1:52  \_ /usr/lib64/heartbeat/lrmd
17247 ?        Ss     0:55  \_ /usr/libexec/pacemaker/attrd
17248 ?        Ss     0:46  \_ /usr/libexec/pacemaker/pengine
17249 ?        Ss     0:57  \_ /usr/libexec/pacemaker/crmd
22059 ?        Ssl    0:00 /usr/sbin/console-kit-daemon --no-daemon
[root@pcmk-1 ~]# corosync-cpgtool 
Group Name	       PID	   Node ID
crmd\x00
		     17249	1702537408 (192.168.122.101)
		     24568	1719314624 (192.168.122.102)
attrd\x00
		     17247	1702537408 (192.168.122.101)
stonith-ng\x00
		     17245	1702537408 (192.168.122.101)
cib\x00
		     17243	1702537408 (192.168.122.101)
pcmk\x00
		     17242	1702537408 (192.168.122.101)
[root@pcmk-1 ~]# grep 17243 /var/log/messages | grep "info: pcmk_cpg_membership"
May 15 02:51:02 pcmk-1 cib[17243]:     info: pcmk_cpg_membership: Member[0] 1702537408 
May 15 02:51:02 pcmk-1 cib[17243]:     info: pcmk_cpg_membership: Left[0] 1719314624 
May 15 02:51:02 pcmk-1 cib[17243]:     info: pcmk_cpg_membership: Member[0] 1702537408 
May 15 02:51:02 pcmk-1 cib[17243]:     info: pcmk_cpg_membership: Member[1] 1719314624 
May 15 02:51:09 pcmk-1 cib[17243]:     info: pcmk_cpg_membership: Member[0] 1702537408 
May 15 02:51:09 pcmk-1 cib[17243]:     info: pcmk_cpg_membership: Left[0] 1719314624

--- Additional comment from abeekhof@redhat.com on 2012-05-14 22:14:05 EDT ---

Created attachment 584502 [details]
Black box from pcmk-1

--- Additional comment from abeekhof@redhat.com on 2012-05-14 22:15:00 EDT ---

Created attachment 584503 [details]
Black box from pcmk-2

--- Additional comment from jfriesse@redhat.com on 2012-05-15 10:57:03 EDT ---

Andrew,
thanks for all the blackboxes/reports you've sent. I'm not ignoring this bug (even if it may look that way), and it's actually getting higher and higher priority (but sadly there were still things with even greater priority...). Anyway, I have one question and one appeal.

Question:
corosync-cpgtool 
Group Name        PID    Node ID
crmd\x00
       17249 1702537408 (192.168.122.101)
       24568 1719314624 (192.168.122.102)
attrd\x00
       17247 1702537408 (192.168.122.101)
stonith-ng\x00
       17245 1702537408 (192.168.122.101)
cib\x00
       17243 1702537408 (192.168.122.101)
pcmk\x00
       17242 1702537408 (192.168.122.101)

Is it normal that only 192.168.122.101 is connected to the pcmk, cib, ... groups, or should every node be?

Appeal:
do you think it's possible to create a core dump at the place where the "Unknown node -> we will not deliver message" message is printed?

I will try to prepare patches to enhance the logging from cpg tomorrow, so it should be easier to find out what is happening (and hopefully it will allow me to at least reproduce the bug myself in a shorter time).

--- Additional comment from abeekhof@redhat.com on 2012-05-15 18:03:23 EDT ---

(In reply to comment #5)
> Is this normal that only 192.168.122.101 is connected to pcmk, cib, ... group
> or every node should be?

The latter. All the processes should be connected on every node.

> 
> Appeal:
> do you think it's possible to create core dump on place where "Unknown node ->
> we will not deliver message" message is printed?

I can try, but it involves catching the cluster when it's in this state, which is tricky.
Yesterday I got lucky :)
If it happens again I'll set a breakpoint at that log and run the cibadmin command (which sends a CPG message) to trigger it.

What information do you want?

--- Additional comment from jfriesse@redhat.com on 2012-05-16 05:45:41 EDT ---

(In reply to comment #6)
> (In reply to comment #5)
> > Is this normal that only 192.168.122.101 is connected to pcmk, cib, ... group
> > or every node should be?
> 
> The latter. All the processes should be connected on every node.
> 

This is what I was afraid of :(

> > 
> > Appeal:
> > do you think it's possible to create core dump on place where "Unknown node ->
> > we will not deliver message" message is printed?
> 
> I can try, but it involves catching the cluster when its in this state which is
> tricky.
> Yesterday I got lucky :)
> If it happens again I'll set a breakpoint at that log and run the cibadmin
> command (which sends a CPG message) to trigger it.
> 
> What information do you want?

Actually, I wanted to find out what PIDs are in the internal structures ... but it seems less important now, because what I've seen has directed me down another path.

Can you please apply the patch in the following comment and send me blackboxes (ideally again from a 2-node cluster) when the problem appears?

For now it looks like a join message was not sent/delivered, or some kind of weird coincidence between the downlist and join list happened. Hopefully with more debug information I will be able to figure out more.

--- Additional comment from jfriesse@redhat.com on 2012-05-16 05:48:05 EDT ---

Created attachment 584918 [details]
Add more debug information

Patch should be applicable to current master (88dd3e1eeacd64701d665f10acbc40f3795dd32f)

--- Additional comment from jfriesse@redhat.com on 2012-05-16 07:50:19 EDT ---

Created attachment 584944 [details]
Add even more debug information

Patch should be applicable to current master
(88dd3e1eeacd64701d665f10acbc40f3795dd32f). It's a replacement for the previous one!

--- Additional comment from abeekhof@redhat.com on 2012-05-16 20:10:54 EDT ---

Built and running.

Question: Is it expected that the "unknown" message is ever seen?

I see it semi-regularly for a period of 1-3 seconds, but then it seems to resolve itself.
It is much rarer for it to persist for minutes or longer - but perhaps that's just because there were no additional membership changes to break the cycle.

--- Additional comment from abeekhof@redhat.com on 2012-05-17 06:33:52 EDT ---

I got this from the corosync logs today, but it doesn't seem like there is much more information:


May 17 06:03:41 [10190] pcmk-2 corosync notice  [QUORUM] Members[3]: 101 102 103
May 17 06:03:41 [10190] pcmk-2 corosync notice  [TOTEM ] A processor joined or left the membership and a new membership was formed.
May 17 06:03:41 [10190] pcmk-2 corosync notice  [QUORUM] This node is within the non-primary component and will NOT provide any services.
May 17 06:03:41 [10190] pcmk-2 corosync notice  [QUORUM] Members[1]: 102
May 17 06:03:41 [10190] pcmk-2 corosync notice  [QUORUM] Members[2]: 101 102
May 17 06:03:41 [10190] pcmk-2 corosync notice  [TOTEM ] A processor joined or left the membership and a new membership was formed.
May 17 06:03:41 [10190] pcmk-2 corosync warning [CPG   ] downlist left_list: 3 received in state 0
May 17 06:03:41 [10190] pcmk-2 corosync notice  [MAIN  ] Completed service synchronization, ready to provide service.
May 17 06:03:41 [10190] pcmk-2 corosync warning [CPG   ] Unknown node -> we will not deliver message
May 17 06:03:41 [10190] pcmk-2 corosync warning [CPG   ] Unknown node -> we will not deliver message
May 17 06:03:41 [10190] pcmk-2 corosync warning [CPG   ] Unknown node -> we will not deliver message
May 17 06:03:41 [10190] pcmk-2 corosync warning [CPG   ] Unknown node -> we will not deliver message
May 17 06:03:41 [10190] pcmk-2 corosync warning [CPG   ] Unknown node -> we will not deliver message
May 17 06:03:41 [10190] pcmk-2 corosync warning [CPG   ] Unknown node -> we will not deliver message
May 17 06:03:41 [10190] pcmk-2 corosync warning [CPG   ] Unknown node -> we will not deliver message
May 17 06:03:41 [10190] pcmk-2 corosync warning [CPG   ] Unknown node -> we will not deliver message
May 17 06:03:41 [10190] pcmk-2 corosync warning [CPG   ] Unknown node -> we will not deliver message
May 17 06:03:41 [10190] pcmk-2 corosync warning [CPG   ] Unknown node -> we will not deliver message
May 17 06:03:41 [10190] pcmk-2 corosync warning [CPG   ] Unknown node -> we will not deliver message
May 17 06:03:41 [10190] pcmk-2 corosync warning [CPG   ] Unknown node -> we will not deliver message
May 17 06:03:41 [10190] pcmk-2 corosync warning [CPG   ] Unknown node -> we will not deliver message
May 17 06:03:41 [10190] pcmk-2 corosync warning [CPG   ] Unknown node -> we will not deliver message
May 17 06:03:41 [10190] pcmk-2 corosync warning [CPG   ] Unknown node -> we will not deliver message
May 17 06:03:41 [10190] pcmk-2 corosync warning [CPG   ] Unknown node -> we will not deliver message
May 17 06:03:42 [10190] pcmk-2 corosync warning [CPG   ] Unknown node -> we will not deliver message
May 17 06:03:42 [10190] pcmk-2 corosync warning [CPG   ] Unknown node -> we will not deliver message
May 17 06:03:42 [10190] pcmk-2 corosync warning [CPG   ] Unknown node -> we will not deliver message
May 17 06:03:48 [10190] pcmk-2 corosync warning [CPG   ] Unknown node -> we will not deliver message
May 17 06:04:09 [10190] pcmk-2 corosync notice  [QUORUM] Members[3]: 101 102 104
May 17 06:04:09 [10190] pcmk-2 corosync notice  [TOTEM ] A processor joined or left the membership and a new membership was formed.
May 17 06:04:10 [10190] pcmk-2 corosync notice  [QUORUM] Members[4]: 101 102 103 104
May 17 06:04:10 [10190] pcmk-2 corosync notice  [TOTEM ] A processor joined or left the membership and a new membership was formed.
May 17 06:04:10 [10190] pcmk-2 corosync notice  [QUORUM] This node is within the primary component and will provide service.
May 17 06:04:10 [10190] pcmk-2 corosync notice  [QUORUM] Members[4]: 101 102 103 104
May 17 06:04:10 [10190] pcmk-2 corosync notice  [MAIN  ] Completed service synchronization, ready to provide service.

--- Additional comment from jfriesse@redhat.com on 2012-05-17 07:43:16 EDT ---

Andrew,

(In reply to comment #10)
> Built and running.
> 

Thanks

> Question: Is it expected that the "unknown" message is ever seen?
> 

NO. It shouldn't ever happen. Basically, all nodes should agree on membership (which is only partly true here), and messages cannot be lost without letting the user application know.

> I see it semi-regularly for a period of 1-3 seconds but then it seems to
> resolve itself.

This is the reason I need more logs (blackbox).

> It is much rarer that it persists for minutes or longer - but perhaps thats
> just because there were no additional membership changes to break the cycle.

Exactly. A membership change will make the problem disappear.

BTW, I've added all the new log messages at DEBUG level, so I need the blackbox output.

--- Additional comment from abeekhof@redhat.com on 2012-05-20 21:46:29 EDT ---

(In reply to comment #12)
> BTW. I've added all new log messages as a DEBUG, so need blackbox output.

That is a problem.

While the problem gets triggered reasonably often, it's not really practical to expect people to watch logs from N machines, in real time, for hours, so they can run additional commands in the 1-2 minute window before the processes exit (after all, nodes joining/leaving is what triggers this).

The blackbox is awesome, as is the ability to grab the contents while the process is running.  Pacemaker now uses that too. But it's not a cure-all and, particularly for problems like this, you need the capability to trigger it programmatically too.

Pacemaker added "dump the blackbox" to its non-fatal assertion code for this purpose, but it could equally be integrated into the error/warning logging code.

--- Additional comment from sdake@redhat.com on 2012-05-21 10:38:07 EDT ---

Angus,

Regarding Comment #13, any chance of implementing that as a flag to the setup of the flight recorder?

--- Additional comment from asalkeld@redhat.com on 2012-05-28 20:57:43 EDT ---

(In reply to comment #14)
> Angus,
> 
> Regarding Comment #13, any chance of implementing that as a flag to the
> setup of the flight recorder?

I am not sure what a flag would do in the blackbox world. What I think Andrew
is suggesting is that we set up a soft_assert() in corosync that runs
the blackbox dump (i.e. self-triggering). Then, instead of logging,
we could make a debug build that soft_assert()'s on this particular condition,
thus not requiring Andrew to stare at his screen for ages waiting for the
event to happen.

something like this:

#define soft_assert(_cond_) \
do { \
    if (_cond_) { \
        qb_log_blackbox_write_to_file (LOCALSTATEDIR "/lib/corosync/assert"); \
    } \
} while (0)

--- Additional comment from sdake@redhat.com on 2012-05-28 23:21:45 EDT ---

I think what Andrew was suggesting was a config option that automatically dumps the blackbox anytime a WARN or ERR log type occurs.

Andrew,
could you offer more info as to what you desire here?

Regards
-steve

--- Additional comment from abeekhof@redhat.com on 2012-06-01 00:56:40 EDT ---

(In reply to comment #16)
> I think what Andrew was suggesting was a config option that automatically
> dumps the blackbox anytime a WARN or ERR log type occurs.

Either this or Angus' idea would be good.
Anything that automates the writing out of blackbox data when something bad happens.

--- Additional comment from jfriesse@redhat.com on 2012-06-06 09:25:13 EDT ---

Created attachment 589886 [details]
Add even more debug information - try 2

Andrew,
can you please try applying the patch to master? It has an assert(0) on the "unknown node" failure, so it will bring corosync down but allow me to find out more information.

Thanks

--- Additional comment from jfriesse@redhat.com on 2012-06-11 08:28:02 EDT ---

Good news:
I was able to reproduce this bug very reliably and very easily. Reproducer:
node1, node2, both running testcpg and corosync

ctrl+z on node1

wait for node2 to create membership

fg %1 on node1

-> node2 accepts a downlist which looks like:

sender r(0) ip(10.34.38.107) ; members(old:1 left:0)
comparing: sender r(0) ip(10.34.38.106) ; members(old:2 left:1)

chosen downlist: sender r(0) ip(10.34.38.106) ; members(old:2 left:1)

and also ignores its own join messages.

--- Additional comment from jfriesse@redhat.com on 2012-06-11 08:34:07 EDT ---

The main question here is what membership changes we want to see. It looks like Angus's idea was to have the SAME downlist across all nodes, in other words:

node1 - was sleeping; old membership was 1,2 - new membership is 1,2, transitional was 1 -> left_list = 2
node2 - was not sleeping, and was able to correctly deliver the loss of node 1; when node 1 reconnected, old membership = 2 - new membership is 1,2, transitional was nothing -> left_list = nothing.

Angus's idea seems to be that on each node we want to see node 1's point of view.

Even though that may make some kind of sense, it looks more logical that:
node 1 sees that node 2 left
node 2 sees that NOBODY left

A simple patch is included in the following comment. It just changes the selection of the "best list" by removing lists which include the local node (because such a node never left the building).

But it's definitely something to discuss. So Angus, Steve (and probably Andrew): which view of the world do you think is more correct?

--- Additional comment from jfriesse@redhat.com on 2012-06-11 08:40:50 EDT ---

The same bug also exists in flatiron.

--- Additional comment from jfriesse@redhat.com on 2012-06-11 08:41:31 EDT ---

Created attachment 590918 [details]
Proposed patch

Comment 1 Jan Friesse 2012-06-14 13:22:05 UTC
Created attachment 591832 [details]
Proposed patch - part 1 - Never choose downlist with localnode

The test scenario is as follows:
- node 1, node 2
- node 1 is paused
- node 2 sees node 1 dead
- node 1 is unpaused
- node 1 and node 2 both choose the same downlist message, which includes
node 2 -> node 2 is effectively disconnected

The patch adds an additional test for whether left_node is the local node.
If so, such a downlist is ignored.
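The check described above can be pictured as a small filter. The following is an illustrative sketch under assumed names (downlist_is_acceptable, left_list), not the actual patch code:

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>

/* Reject any downlist whose left_list contains the local node:
 * this node is processing the message, so it obviously never left. */
static int downlist_is_acceptable(const uint32_t *left_list,
                                  size_t left_nodes,
                                  uint32_t local_nodeid)
{
    for (size_t i = 0; i < left_nodes; i++) {
        if (left_list[i] == local_nodeid) {
            return 0; /* downlist claims we left -> ignore it */
        }
    }
    return 1;
}
```

In the pause scenario above, node 2 would reject the shared downlist because it names node 2 itself, while node 1 could still accept it.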

Comment 2 Jan Friesse 2012-06-14 13:23:03 UTC
Created attachment 591833 [details]
Proposed patch - part 2 - Process join list after downlists

let's say following situation will happen:
- we have 3 nodes
- on wire messages looks like D1,J1,D2,J2,D3,J3 (D is downlist, J is
  joinlist)
- let's say, D1 and D3 contains node 2
- it means that J2 is applied, but right after that, D1 (or D3) is
  applied what means, node 2 is again considered down

It's solved by collecting joinlists and apply them after downlist, so
order is:
- apply best matching downlist
- apply all joinlists
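The ordering problem can be demonstrated with a toy membership table (simplified structures for illustration, not corosync's actual code): applying D and J messages interleaved leaves node 2 down, while applying the chosen downlist first and all joinlists afterwards keeps it up.

```c
#include <assert.h>
#include <stdbool.h>
#include <string.h>

#define MAX_NODES 8

/* Toy membership table; illustration only. */
static bool member[MAX_NODES];

/* Mark every node in a downlist as down. */
static void apply_downlist(const int *left, int n)
{
    for (int i = 0; i < n; i++)
        member[left[i]] = false;
}

/* Mark every node in a joinlist as up. */
static void apply_joinlist(const int *joined, int n)
{
    for (int i = 0; i < n; i++)
        member[joined[i]] = true;
}
```

With D1 and D3 both naming node 2, the interleaved order D1, J2, D3 ends with node 2 marked down even though it rejoined; downlist-then-joinlists ends with node 2 up.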

Comment 3 Jan Friesse 2012-06-14 13:24:15 UTC
Created attachment 591834 [details]
Proposed patch - part 3 - Enhance downlist selection algorithm

Let's say we have 2 nodes:
- node 2 is paused
- node 1 creates a membership (one node)
- node 2 is unpaused

The result is that node 1's downlist is selected, which means that
from node 2's point of view, node 1 was never down.

The patch solves this situation by adding an additional check for the
largest previous membership.

So the current tests are:
1) largest (previous #nodes - #nodes known to have left)
2) (then) largest previous membership
3) (and last, as a tie-breaker) the node with the smallest nodeid
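The three-step selection can be sketched as a comparison function. Field names here are assumptions for illustration, not the actual corosync struct:

```c
#include <assert.h>
#include <stdint.h>

/* Simplified view of a received downlist. */
struct downlist {
    uint32_t sender_nodeid; /* node that sent this downlist */
    uint32_t old_members;   /* size of the sender's previous membership */
    uint32_t left_nodes;    /* how many nodes the sender saw leave */
};

/* Return 1 if `a` should be preferred over `b`, applying the tests in order:
 * 1) largest (previous #nodes - #nodes known to have left)
 * 2) then largest previous membership
 * 3) then smallest nodeid as a tie-breaker */
static int downlist_better(const struct downlist *a, const struct downlist *b)
{
    uint32_t a_alive = a->old_members - a->left_nodes;
    uint32_t b_alive = b->old_members - b->left_nodes;

    if (a_alive != b_alive)
        return a_alive > b_alive;
    if (a->old_members != b->old_members)
        return a->old_members > b->old_members;
    return a->sender_nodeid < b->sender_nodeid;
}
```

In the log from comment 19, the two downlists (old:1 left:0 and old:2 left:1) tie on test 1 with one surviving member each; test 2 then prefers the one with the larger previous membership.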

Comment 8 Jan Friesse 2012-11-08 14:59:38 UTC
Created attachment 640856 [details]
Proposed patch - part 4 - Fix problem with sync operations under very rare circumstances

This patch creates a special message queue for synchronization messages.
This prevents messages that are queued in the new_message_queue but have
not yet been originated from corrupting the synchronization process.

Comment 9 Jan Friesse 2012-11-08 15:00:41 UTC
Created attachment 640861 [details]
Proposed patch - part 5 - Handle segfault in backlog_get

If instance->memb_state is not OPERATION or RECOVERY, we were passing
NULL to the cs_queue_used call.

Comment 10 Jan Friesse 2012-11-08 15:01:33 UTC
Created attachment 640862 [details]
Proposed patch - part 6 - Add waiting_trans_ack also to fragmentation layer

The patch supporting waiting_trans_ack may fail if synchronization
happens in between deliveries of a fragmented message. In such a
situation, the fragmentation layer is waiting for the message with the
correct number, but it will never arrive.

The solution is to handle (via callback) the change of waiting_trans_ack
and use a different queue.
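The queue switch can be pictured as follows. This is a toy sketch with assumed names (waiting_trans_ack_cb, queue_put), not the actual fragmentation-layer code:

```c
#include <assert.h>
#include <stdbool.h>

#define QUEUE_LEN 16

/* Minimal FIFO used only for this illustration. */
struct msg_queue {
    int items[QUEUE_LEN];
    int count;
};

static struct msg_queue regular_queue;
static struct msg_queue trans_ack_queue;
static bool waiting_trans_ack;

/* Callback invoked when the sync layer toggles waiting_trans_ack. */
static void waiting_trans_ack_cb(bool waiting)
{
    waiting_trans_ack = waiting;
}

/* Messages produced while waiting for the transitional ack go to a
 * separate queue, so they cannot interleave with regular traffic. */
static void queue_put(int msg)
{
    struct msg_queue *q =
        waiting_trans_ack ? &trans_ack_queue : &regular_queue;
    q->items[q->count++] = msg;
}
```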

Comment 11 Jan Friesse 2013-01-03 09:09:55 UTC
*** Bug 889564 has been marked as a duplicate of this bug. ***

Comment 17 errata-xmlrpc 2013-02-21 07:50:27 UTC
Since the problem described in this bug report should be
resolved in a recent advisory, it has been closed with a
resolution of ERRATA.

For information on the advisory, and where to find the updated
files, follow the link below.

If the solution does not work for you, open a new bug report.

http://rhn.redhat.com/errata/RHBA-2013-0497.html

Comment 18 Jan Friesse 2013-05-13 07:11:20 UTC
*** Bug 884770 has been marked as a duplicate of this bug. ***

