Bug 820821 - Nodes do not agree on CPG membership, messages lost
Summary: Nodes do not agree on CPG membership, messages lost
Keywords:
Status: CLOSED UPSTREAM
Alias: None
Product: Fedora
Classification: Fedora
Component: corosync
Version: 17
Hardware: x86_64
OS: Linux
Priority: unspecified
Severity: high
Target Milestone: ---
Assignee: Jan Friesse
QA Contact: Fedora Extras Quality Assurance
URL:
Whiteboard:
Depends On:
Blocks: 830799
 
Reported: 2012-05-11 04:57 UTC by Andrew Beekhof
Modified: 2012-11-28 07:39 UTC
CC: 7 users

Fixed In Version:
Doc Type: Bug Fix
Doc Text:
Clone Of:
Clones: 830799
Environment:
Last Closed: 2012-11-28 07:39:30 UTC
Type: Bug
Embargoed:


Attachments
Full logs and versions (541.40 KB, application/x-bzip2)
2012-05-11 04:57 UTC, Andrew Beekhof
no flags Details
Black box from pcmk-1 (2.23 MB, application/octet-stream)
2012-05-15 02:14 UTC, Andrew Beekhof
no flags Details
Black box from pcmk-2 (2.23 MB, application/octet-stream)
2012-05-15 02:15 UTC, Andrew Beekhof
no flags Details
Add more debug informations (3.44 KB, patch)
2012-05-16 09:48 UTC, Jan Friesse
no flags Details | Diff
Add even more debug informations (6.92 KB, patch)
2012-05-16 11:50 UTC, Jan Friesse
no flags Details | Diff
Add even more debug informations - try 2 (7.23 KB, patch)
2012-06-06 13:25 UTC, Jan Friesse
no flags Details | Diff
Proposed patch (1.77 KB, patch)
2012-06-11 12:41 UTC, Jan Friesse
no flags Details | Diff
Proposed patch - part 2 (5.26 KB, patch)
2012-06-12 13:42 UTC, Jan Friesse
no flags Details | Diff
Proposed patch - part 3 (1.61 KB, patch)
2012-06-13 14:10 UTC, Jan Friesse
no flags Details | Diff
Proposed patch - part 3 - try 2 (1.61 KB, patch)
2012-06-13 14:50 UTC, Jan Friesse
no flags Details | Diff
Add assert if we were delivering to unknown node (733 bytes, patch)
2012-06-15 07:54 UTC, Jan Friesse
no flags Details | Diff
Blackbox contents from pcmk-1 after "unknown node abort" (69.16 KB, text/plain)
2012-06-20 05:42 UTC, Andrew Beekhof
no flags Details
Backtrace with all variables printed (9.83 KB, text/plain)
2012-06-20 05:47 UTC, Andrew Beekhof
no flags Details
Proposed patch - part 4 - Fix problem with sync operations under very rare circumstances (7.53 KB, patch)
2012-11-08 15:05 UTC, Jan Friesse
no flags Details | Diff
Proposed patch - part 5 - Handle segfault in backlog_get (975 bytes, patch)
2012-11-08 15:06 UTC, Jan Friesse
no flags Details | Diff
Proposed patch - part 6 - Add waiting_trans_ack also to fragmentation layer (9.32 KB, patch)
2012-11-08 15:07 UTC, Jan Friesse
no flags Details | Diff

Description Andrew Beekhof 2012-05-11 04:57:19 UTC
Created attachment 583724 [details]
Full logs and versions

Description of problem:

4 node cluster
- pcmk-1 = 1702537408 
- pcmk-2 = 1719314624
- pcmk-3 = 1736091840
- pcmk-4 = 1752869056

All nodes see pcmk-1, pcmk-2, pcmk-4
pcmk-2 cannot see pcmk-3, everyone else can

Once we reach this state, the only messages logged by corosync are of the form:

corosync warning [CPG   ] Unknown node -> we will not deliver message

This repeats for about 15 minutes until pcmk-4 is shut down and things appear to return to normal.

The other nodes also show the same message at times, but the pacemaker logs show that many messages are getting through in between.

After talking with Angus, if nothing else, it is probably a good idea to add some additional information to that log message.

Version-Release number of selected component (if applicable):

Name        : corosync
Version     : 1.99.9
Release     : 1.fc17
Architecture: x86_64
Install Date: Fri 13 Apr 2012 01:58:59 CEST
Group       : System Environment/Base
Size        : 379855
License     : BSD
Signature   : RSA/SHA256, Fri 30 Mar 2012 02:35:59 CEST, Key ID 50e94c991aca3465
Source RPM  : corosync-1.99.9-1.fc17.src.rpm
Build Date  : Wed 28 Mar 2012 07:15:04 CEST
Build Host  : x86-07.phx2.fedoraproject.org

How reproducible:

Rare

Steps to Reproduce:
1.  Run CTS
2.  Pray
3.
  
Actual results:


Expected results:


Additional info:


May 11 04:55:10 [3718] pcmk-2 corosync notice  [TOTEM ] Initializing transport (UDP/IP Multicast).
May 11 04:55:10 [3718] pcmk-2 corosync notice  [TOTEM ] Initializing transmit/receive security (NSS) crypto: none hash: none
May 11 04:55:10 [3718] pcmk-2 corosync notice  [TOTEM ] The network interface [192.168.122.102] is now up.
May 11 04:55:10 [3718] pcmk-2 corosync notice  [SERV  ] Service engine loaded: corosync configuration map access [0]
May 11 04:55:10 [3718] pcmk-2 corosync info    [QB    ] server name: cmap
May 11 04:55:10 [3718] pcmk-2 corosync notice  [SERV  ] Service engine loaded: corosync configuration service [1]
May 11 04:55:10 [3718] pcmk-2 corosync info    [QB    ] server name: cfg
May 11 04:55:10 [3718] pcmk-2 corosync notice  [SERV  ] Service engine loaded: corosync cluster closed process group service v1.01 [2]
May 11 04:55:10 [3718] pcmk-2 corosync info    [QB    ] server name: cpg
May 11 04:55:10 [3718] pcmk-2 corosync notice  [SERV  ] Service engine loaded: corosync profile loading service [4]
May 11 04:55:10 [3718] pcmk-2 corosync notice  [QUORUM] Using quorum provider corosync_votequorum
May 11 04:55:10 [3718] pcmk-2 corosync notice  [SERV  ] Service engine loaded: corosync vote quorum service v1.0 [5]
May 11 04:55:10 [3718] pcmk-2 corosync info    [QB    ] server name: votequorum
May 11 04:55:10 [3718] pcmk-2 corosync notice  [SERV  ] Service engine loaded: corosync cluster quorum service v0.1 [3]
May 11 04:55:10 [3718] pcmk-2 corosync info    [QB    ] server name: quorum
May 11 04:55:10 [3718] pcmk-2 corosync notice  [QUORUM] Members[1]: 1719314624
May 11 04:55:10 [3718] pcmk-2 corosync notice  [TOTEM ] A processor joined or left the membership and a new membership was formed.
May 11 04:55:10 [3718] pcmk-2 corosync notice  [MAIN  ] Completed service synchronization, ready to provide service.
May 11 04:55:10 [3718] pcmk-2 corosync notice  [QUORUM] Members[2]: 1719314624 1736091840
May 11 04:55:10 [3718] pcmk-2 corosync notice  [TOTEM ] A processor joined or left the membership and a new membership was formed.
May 11 04:55:10 [3718] pcmk-2 corosync notice  [MAIN  ] Completed service synchronization, ready to provide service.
May 11 04:55:11 [3718] pcmk-2 corosync notice  [QUORUM] Members[4]: 1702537408 1719314624 1736091840 1752869056
May 11 04:55:11 [3718] pcmk-2 corosync notice  [TOTEM ] A processor joined or left the membership and a new membership was formed.
May 11 04:55:11 [3718] pcmk-2 corosync notice  [QUORUM] This node is within the primary component and will provide service.
May 11 04:55:11 [3718] pcmk-2 corosync notice  [QUORUM] Members[4]: 1702537408 1719314624 1736091840 1752869056
May 11 04:55:11 [3718] pcmk-2 corosync notice  [TOTEM ] Retransmit List: 1 2 3 
May 11 04:55:11 [3718] pcmk-2 corosync notice  [MAIN  ] Completed service synchronization, ready to provide service.
May 11 04:55:11 [3718] pcmk-2 corosync warning [CPG   ] Unknown node -> we will not deliver message
May 11 04:55:11 [3718] pcmk-2 corosync warning [CPG   ] Unknown node -> we will not deliver message
...
May 11 05:10:27 [3718] pcmk-2 corosync warning [CPG   ] Unknown node -> we will not deliver message
May 11 05:10:30 [3718] pcmk-2 corosync warning [CPG   ] Unknown node -> we will not deliver message
May 11 05:10:30 [3718] pcmk-2 corosync notice  [QUORUM] Members[3]: 1702537408 1719314624 1736091840

At this point pcmk-2 starts to receive messages from pcmk-3

Comment 1 Andrew Beekhof 2012-05-11 08:45:38 UTC
Reproduced with the latest from f17. Now that I know what to look for, perhaps it's not so rare after all.

Name        : corosync
Version     : 2.0.0
Release     : 1.fc17
Architecture: x86_64
Install Date: Fri 11 May 2012 07:03:45 CEST
Group       : System Environment/Base
Size        : 383797
License     : BSD
Signature   : RSA/SHA256, Wed 11 Apr 2012 19:53:39 CEST, Key ID 50e94c991aca3465
Source RPM  : corosync-2.0.0-1.fc17.src.rpm
Build Date  : Wed 11 Apr 2012 08:55:29 CEST
Build Host  : x86-09.phx2.fedoraproject.org

Comment 2 Andrew Beekhof 2012-05-15 02:10:52 UTC
I managed to catch a live cluster in this state.

pcmk-1 == 192.168.122.101
pcmk-2 == 192.168.122.102

I verified that the "we will not deliver message" log occurs whenever the cib on pcmk-2 tries to send a CPG message to the cib on pcmk-1.  It probably occurs in additional cases as well.

Black box data will be attached.


[root@pcmk-2 ~]# ps axf
...
24038 ?        Ssl   17:00 /usr/sbin/corosync
24051 ?        Ssl    0:28 /usr/sbin/pacemakerd -f
24055 ?        Ssl    0:45  \_ /usr/libexec/pacemaker/cib
24057 ?        Ss     0:30  \_ /usr/libexec/pacemaker/stonithd
24058 ?        Ss     2:35  \_ /usr/lib64/heartbeat/lrmd
24059 ?        Ss     0:52  \_ /usr/libexec/pacemaker/attrd
24060 ?        Ss     0:28  \_ /usr/libexec/pacemaker/pengine
24568 ?        Ss     0:00  \_ /usr/libexec/pacemaker/crmd
24501 ?        Ssl    0:00 /usr/sbin/console-kit-daemon --no-daemon
[root@pcmk-2 ~]# corosync-cpgtool 
Group Name	       PID	   Node ID
crmd\x00
		     17249	1702537408 (192.168.122.101)
		     24568	1719314624 (192.168.122.102)
attrd\x00
		     17247	1702537408 (192.168.122.101)
stonith-ng\x00
		     17245	1702537408 (192.168.122.101)
cib\x00
		     17243	1702537408 (192.168.122.101)
pcmk\x00
		     17242	1702537408 (192.168.122.101)
[root@pcmk-2 ~]# grep 24055 /var/log/messages | grep "info: pcmk_cpg_membership"
May 15 02:50:23 pcmk-2 cib[24055]:     info: pcmk_cpg_membership: Member[0] 1719314624 
May 15 02:50:23 pcmk-2 cib[24055]:     info: pcmk_cpg_membership: Left[0] 1702537408 
May 15 02:50:59 pcmk-2 cib[24055]:     info: pcmk_cpg_membership: Member[0] 1702537408 
May 15 02:50:59 pcmk-2 cib[24055]:     info: pcmk_cpg_membership: Member[1] 1719314624 
May 15 02:50:59 pcmk-2 cib[24055]:     info: pcmk_cpg_membership: Member[0] 1702537408 
May 15 02:50:59 pcmk-2 cib[24055]:     info: pcmk_cpg_membership: Left[0] 1719314624 
May 15 02:51:02 pcmk-2 cib[24055]:     info: pcmk_cpg_membership: Left[0] 1702537408 
May 15 02:51:05 pcmk-2 cib[24055]:     info: pcmk_cpg_membership: Member[0] 1702537408 
May 15 02:51:40 pcmk-2 cib[24055]:     info: pcmk_cpg_membership: Left[0] 1702537408 
May 15 02:51:41 pcmk-2 cib[24055]:     info: pcmk_cpg_membership: Member[0] 1702537408 
May 15 02:56:33 pcmk-2 cib[24055]:     info: pcmk_cpg_membership: Left[0] 1702537408 
May 15 02:56:47 pcmk-2 cib[24055]:     info: pcmk_cpg_membership: Member[0] 1702537408 
May 15 02:59:29 pcmk-2 cib[24055]:     info: pcmk_cpg_membership: Left[0] 1702537408 
May 15 02:59:31 pcmk-2 cib[24055]:     info: pcmk_cpg_membership: Member[0] 1702537408 




[root@pcmk-1 ~]# ps axf
...
17224 ?        Ssl   28:08 /usr/sbin/corosync
17242 ?        Ssl    0:35 /usr/sbin/pacemakerd -f
17243 ?        Ssl    0:48  \_ /usr/libexec/pacemaker/cib
17245 ?        Ssl    1:17  \_ /usr/libexec/pacemaker/stonithd
17246 ?        Ss     1:52  \_ /usr/lib64/heartbeat/lrmd
17247 ?        Ss     0:55  \_ /usr/libexec/pacemaker/attrd
17248 ?        Ss     0:46  \_ /usr/libexec/pacemaker/pengine
17249 ?        Ss     0:57  \_ /usr/libexec/pacemaker/crmd
22059 ?        Ssl    0:00 /usr/sbin/console-kit-daemon --no-daemon
[root@pcmk-1 ~]# corosync-cpgtool 
Group Name	       PID	   Node ID
crmd\x00
		     17249	1702537408 (192.168.122.101)
		     24568	1719314624 (192.168.122.102)
attrd\x00
		     17247	1702537408 (192.168.122.101)
stonith-ng\x00
		     17245	1702537408 (192.168.122.101)
cib\x00
		     17243	1702537408 (192.168.122.101)
pcmk\x00
		     17242	1702537408 (192.168.122.101)
[root@pcmk-1 ~]# grep 17243 /var/log/messages | grep "info: pcmk_cpg_membership"
May 15 02:51:02 pcmk-1 cib[17243]:     info: pcmk_cpg_membership: Member[0] 1702537408 
May 15 02:51:02 pcmk-1 cib[17243]:     info: pcmk_cpg_membership: Left[0] 1719314624 
May 15 02:51:02 pcmk-1 cib[17243]:     info: pcmk_cpg_membership: Member[0] 1702537408 
May 15 02:51:02 pcmk-1 cib[17243]:     info: pcmk_cpg_membership: Member[1] 1719314624 
May 15 02:51:09 pcmk-1 cib[17243]:     info: pcmk_cpg_membership: Member[0] 1702537408 
May 15 02:51:09 pcmk-1 cib[17243]:     info: pcmk_cpg_membership: Left[0] 1719314624

Comment 3 Andrew Beekhof 2012-05-15 02:14:05 UTC
Created attachment 584502 [details]
Black box from pcmk-1

Comment 4 Andrew Beekhof 2012-05-15 02:15:00 UTC
Created attachment 584503 [details]
Black box from pcmk-2

Comment 5 Jan Friesse 2012-05-15 14:57:03 UTC
Andrew,
thanks for all the blackboxes/reports you've sent. I'm not ignoring this bug (even if it may look that way), and it's actually getting higher and higher priority (but sadly there are still things with even greater priority...). Anyway, I have one question and one appeal.

Question:
corosync-cpgtool 
Group Name        PID    Node ID
crmd\x00
       17249 1702537408 (192.168.122.101)
       24568 1719314624 (192.168.122.102)
attrd\x00
       17247 1702537408 (192.168.122.101)
stonith-ng\x00
       17245 1702537408 (192.168.122.101)
cib\x00
       17243 1702537408 (192.168.122.101)
pcmk\x00
       17242 1702537408 (192.168.122.101)

Is it normal that only 192.168.122.101 is connected to the pcmk, cib, ... groups, or should every node be?

Appeal:
do you think it's possible to create a core dump at the place where the "Unknown node -> we will not deliver message" message is printed?

I will try to prepare patches tomorrow to enhance the logging from cpg, so it should be easier to find out what is happening (and hopefully it will allow me to at least reproduce the bug myself in less time).

Comment 6 Andrew Beekhof 2012-05-15 22:03:23 UTC
(In reply to comment #5)
> Is this normal that only 192.168.122.101 is connected to pcmk, cib, ... group
> or every node should be?

The latter. All the processes should be connected on every node.

> 
> Appeal:
> do you think it's possible to create core dump on place where "Unknown node ->
> we will not deliver message" message is printed?

I can try, but it involves catching the cluster when it's in this state, which is tricky.
Yesterday I got lucky :)
If it happens again I'll set a breakpoint at that log and run the cibadmin command (which sends a CPG message) to trigger it.

What information do you want?

Comment 7 Jan Friesse 2012-05-16 09:45:41 UTC
(In reply to comment #6)
> (In reply to comment #5)
> > Is this normal that only 192.168.122.101 is connected to pcmk, cib, ... group
> > or every node should be?
> 
> The latter. All the processes should be connected on every node.
> 

This is what I was afraid of :(

> > 
> > Appeal:
> > do you think it's possible to create core dump on place where "Unknown node ->
> > we will not deliver message" message is printed?
> 
> I can try, but it involves catching the cluster when its in this state which is
> tricky.
> Yesterday I got lucky :)
> If it happens again I'll set a breakpoint at that log and run the cibadmin
> command (which sends a CPG message) to trigger it.
> 
> What information do you want?

Actually I wanted to find out what pids are in the internal structures ... but that no longer seems so important, because what I've seen has pointed me down another path.

Can you please apply the patch in the following comment and send me blackboxes (ideally again from a 2-node cluster) when the problem appears?

For now it looks like a join msg was not sent/delivered, or some kind of weird coincidence between the downlist and the join list happened. Hopefully with more debug information I will know more.

Comment 8 Jan Friesse 2012-05-16 09:48:05 UTC
Created attachment 584918 [details]
Add more debug informations

Patch should be applicable to current master (88dd3e1eeacd64701d665f10acbc40f3795dd32f)

Comment 9 Jan Friesse 2012-05-16 11:50:19 UTC
Created attachment 584944 [details]
Add even more debug informations

Patch should be applicable to current master
(88dd3e1eeacd64701d665f10acbc40f3795dd32f). It's a replacement for the previous one!

Comment 10 Andrew Beekhof 2012-05-17 00:10:54 UTC
Built and running.

Question: Is it expected that the "unknown" message is ever seen?

I see it semi-regularly for a period of 1-3 seconds but then it seems to resolve itself.
It is much rarer for it to persist for minutes or longer - but perhaps that's just because there were no additional membership changes to break the cycle.

Comment 11 Andrew Beekhof 2012-05-17 10:33:52 UTC
I got this from the corosync logs today, but it doesn't seem like there is much more information:


May 17 06:03:41 [10190] pcmk-2 corosync notice  [QUORUM] Members[3]: 101 102 103
May 17 06:03:41 [10190] pcmk-2 corosync notice  [TOTEM ] A processor joined or left the membership and a new membership was formed.
May 17 06:03:41 [10190] pcmk-2 corosync notice  [QUORUM] This node is within the non-primary component and will NOT provide any services.
May 17 06:03:41 [10190] pcmk-2 corosync notice  [QUORUM] Members[1]: 102
May 17 06:03:41 [10190] pcmk-2 corosync notice  [QUORUM] Members[2]: 101 102
May 17 06:03:41 [10190] pcmk-2 corosync notice  [TOTEM ] A processor joined or left the membership and a new membership was formed.
May 17 06:03:41 [10190] pcmk-2 corosync warning [CPG   ] downlist left_list: 3 received in state 0
May 17 06:03:41 [10190] pcmk-2 corosync notice  [MAIN  ] Completed service synchronization, ready to provide service.
May 17 06:03:41 [10190] pcmk-2 corosync warning [CPG   ] Unknown node -> we will not deliver message
May 17 06:03:41 [10190] pcmk-2 corosync warning [CPG   ] Unknown node -> we will not deliver message
May 17 06:03:41 [10190] pcmk-2 corosync warning [CPG   ] Unknown node -> we will not deliver message
May 17 06:03:41 [10190] pcmk-2 corosync warning [CPG   ] Unknown node -> we will not deliver message
May 17 06:03:41 [10190] pcmk-2 corosync warning [CPG   ] Unknown node -> we will not deliver message
May 17 06:03:41 [10190] pcmk-2 corosync warning [CPG   ] Unknown node -> we will not deliver message
May 17 06:03:41 [10190] pcmk-2 corosync warning [CPG   ] Unknown node -> we will not deliver message
May 17 06:03:41 [10190] pcmk-2 corosync warning [CPG   ] Unknown node -> we will not deliver message
May 17 06:03:41 [10190] pcmk-2 corosync warning [CPG   ] Unknown node -> we will not deliver message
May 17 06:03:41 [10190] pcmk-2 corosync warning [CPG   ] Unknown node -> we will not deliver message
May 17 06:03:41 [10190] pcmk-2 corosync warning [CPG   ] Unknown node -> we will not deliver message
May 17 06:03:41 [10190] pcmk-2 corosync warning [CPG   ] Unknown node -> we will not deliver message
May 17 06:03:41 [10190] pcmk-2 corosync warning [CPG   ] Unknown node -> we will not deliver message
May 17 06:03:41 [10190] pcmk-2 corosync warning [CPG   ] Unknown node -> we will not deliver message
May 17 06:03:41 [10190] pcmk-2 corosync warning [CPG   ] Unknown node -> we will not deliver message
May 17 06:03:41 [10190] pcmk-2 corosync warning [CPG   ] Unknown node -> we will not deliver message
May 17 06:03:42 [10190] pcmk-2 corosync warning [CPG   ] Unknown node -> we will not deliver message
May 17 06:03:42 [10190] pcmk-2 corosync warning [CPG   ] Unknown node -> we will not deliver message
May 17 06:03:42 [10190] pcmk-2 corosync warning [CPG   ] Unknown node -> we will not deliver message
May 17 06:03:48 [10190] pcmk-2 corosync warning [CPG   ] Unknown node -> we will not deliver message
May 17 06:04:09 [10190] pcmk-2 corosync notice  [QUORUM] Members[3]: 101 102 104
May 17 06:04:09 [10190] pcmk-2 corosync notice  [TOTEM ] A processor joined or left the membership and a new membership was formed.
May 17 06:04:10 [10190] pcmk-2 corosync notice  [QUORUM] Members[4]: 101 102 103 104
May 17 06:04:10 [10190] pcmk-2 corosync notice  [TOTEM ] A processor joined or left the membership and a new membership was formed.
May 17 06:04:10 [10190] pcmk-2 corosync notice  [QUORUM] This node is within the primary component and will provide service.
May 17 06:04:10 [10190] pcmk-2 corosync notice  [QUORUM] Members[4]: 101 102 103 104
May 17 06:04:10 [10190] pcmk-2 corosync notice  [MAIN  ] Completed service synchronization, ready to provide service.

Comment 12 Jan Friesse 2012-05-17 11:43:16 UTC
Andrew,

(In reply to comment #10)
> Built and running.
> 

Thanks

> Question: Is it expected that the "unknown" message is ever seen?
> 

No. It should never happen. Basically, all nodes should agree on membership (which is only partly true here), and messages cannot be lost without letting the user app know.

> I see it semi-regularly for a period of 1-3 seconds but then it seems to
> resolve itself.

This is the reason why I need more logs (blackbox).

> It is much rarer that it persists for minutes or longer - but perhaps thats
> just because there were no additional membership changes to break the cycle.

Exactly. A membership change will make the problem disappear.

BTW, I've added all the new log messages at DEBUG level, so I need the blackbox output.

Comment 13 Andrew Beekhof 2012-05-21 01:46:29 UTC
(In reply to comment #12)
> BTW. I've added all new log messages as a DEBUG, so need blackbox output.

That is a problem.

While the problem gets triggered reasonably often, it's not really practical to expect people to watch logs from N machines in real time, for hours, so they can run additional commands in the 1-2 minute window before the processes exit (after all, nodes joining/leaving is what triggers this).

The blackbox is awesome, as is the ability to grab the contents while the process is running.  Pacemaker now uses that too. But it's not a cure-all and, particularly for problems like this, you need the capability to trigger it programmatically too.

Pacemaker added "dump the blackbox" to its non-fatal assertion code for this purpose but it could equally be integrated into the error/warning logging code.

Comment 14 Steven Dake 2012-05-21 14:38:07 UTC
Angus,

Regarding Comment #13, any chance of implementing that as a flag to the setup of the flight recorder?

Comment 15 Angus Salkeld 2012-05-29 00:57:43 UTC
(In reply to comment #14)
> Angus,
> 
> Regarding Comment #13, any chance of implementing that as a flag to the
> setup of the flight recorder?

I am not sure what a flag would do in the blackbox world. What I think Andrew
is suggesting is that we set up a soft_assert() in corosync that runs
the blackbox dump (i.e. self-triggering). Then, instead of logging,
we could make a debug build that soft_assert()'s on this particular condition,
thus not requiring Andrew to stare at his screen for ages waiting for this
event to happen.

something like this:

#define soft_assert(_cond_) \
do { \
	if (!(_cond_)) { /* assert semantics: dump when the condition fails */ \
		qb_log_blackbox_write_to_file (LOCALSTATEDIR "/lib/corosync/assert"); \
	} \
} while (0)

Comment 16 Steven Dake 2012-05-29 03:21:45 UTC
I think what Andrew was suggesting was a config option that automatically dumps the blackbox anytime a WARN or ERR log type occurs.

Andrew,
could you offer more info as to what you desire here?

Regards
-steve

Comment 17 Andrew Beekhof 2012-06-01 04:56:40 UTC
(In reply to comment #16)
> I think what Andrew was suggesting was a config option that automatically
> dumps the blackbox anytime a WARN or ERR log type occurs.

Either this or Angus' idea would be good.
Anything that automates the writing out of blackbox data when something bad happens.

Comment 18 Jan Friesse 2012-06-06 13:25:13 UTC
Created attachment 589886 [details]
Add even more debug informations - try 2

Andrew,
can you please try applying the patch to master? It has an assert(0) on the "unknown node" failure, so it will abort corosync but allow me to find out more information.

Thanks

Comment 19 Jan Friesse 2012-06-11 12:28:02 UTC
Good news,
I was able to reproduce this bug very reliably and very easily. Reproducer:
node1, node2, both running testcpg and corosync

ctrl+z on node1

wait for node2 to create membership

fg %1 on node1

-> node2 accepts a downlist which looks like:

sender r(0) ip(10.34.38.107) ; members(old:1 left:0)
comparing: sender r(0) ip(10.34.38.106) ; members(old:2 left:1)

chosen downlist: sender r(0) ip(10.34.38.106) ; members(old:2 left:1)

and also ignores its own join messages.

Comment 20 Jan Friesse 2012-06-11 12:34:07 UTC
The main question here is what membership changes we want to see. It looks like Angus's idea was to have the SAME downlist across all nodes, in other words:

node1 - was sleeping; old membership was 1,2, new membership is 1,2, transitional was 1 -> left_list = 2
node2 - was not sleeping and was able to correctly deliver the loss of node 1; when node 1 reconnected, it sees old membership = 2, new membership = 1,2, transitional was nothing -> left_list = nothing.

Angus's idea seems to be that we want every node to see node 1's point of view.

Even though that may make some kind of sense, it looks more logical that:
node 1 sees that node 2 left
node 2 sees that NOBODY left

A simple patch is included in the following comment. It just changes the selection of the "best list" by removing lists which include the local node (because such a node never left the building).

But it's for sure something to discuss - so Angus, Steve (and probably Andrew), which view of the world do you think is more correct?

Comment 21 Jan Friesse 2012-06-11 12:40:50 UTC
The same bug also exists in flatiron.

Comment 22 Jan Friesse 2012-06-11 12:41:31 UTC
Created attachment 590918 [details]
Proposed patch

Comment 23 Jan Friesse 2012-06-12 13:42:21 UTC
Created attachment 591194 [details]
Proposed patch - part 2

let's say the following situation happens:
- we have 3 nodes
- on the wire, messages look like D1,J1,D2,J2,D3,J3 (D is a downlist, J is
  a joinlist)
- let's say D1 and D3 contain node 2
- it means that J2 is applied, but right after that, D1 (or D3) is
  applied, which means node 2 is again considered down

It's solved by collecting joinlists and applying them after the downlist, so
the order is:
- apply the best matching downlist
- apply all joinlists


This patch is not a replacement for the first patch; it is an additional one.

Comment 25 Angus Salkeld 2012-06-12 23:50:54 UTC
Honza, I think we should chat with dct about what events make sense; I honestly
can't remember why I went in this direction. I have seen questions (and patches) about this behavior
(http://lists.corosync.org/pipermail/discuss/2011-November/000190.html).

I have no problem with the change, as long as the client apps can deal with the change in behavior. That will probably mean heaps of integration testing :(

Comment 26 Angus Salkeld 2012-06-13 07:36:23 UTC
Honza, your main question was "why does every node receive the same downlist"?

I honestly don't remember (it was 2 years ago); it might not have been intentional, and therefore a bug.

Comment 27 Jan Friesse 2012-06-13 14:10:10 UTC
Created attachment 591509 [details]
Proposed patch - part 3

It looks like I found another problem with the downlist selection algorithm (tested and confirmed also WITHOUT the previous two patches).

Let's say we have 2 nodes:
- node 2 is paused
- node 1 creates a membership (one node)
- node 2 is unpaused

The result is that node 1's downlist is selected, which means that from node 2's
point of view, node 1 was never down.

The patch solves the situation by adding an additional check for the largest
previous membership.

So the current tests are:
1) largest (previous #nodes - #nodes known to have left)
2) (then) largest previous membership
3) (and last, as a tie-breaker) the node with the smallest nodeid

Comment 28 Jan Friesse 2012-06-13 14:50:33 UTC
Created attachment 591532 [details]
Proposed patch - part 3 - try 2

Make conditions work correctly.

Comment 29 Jan Friesse 2012-06-15 07:54:49 UTC
Created attachment 592063 [details]
Add assert if we were delivering to unknown node

Andrew,
can you please apply the patch and then send me the blackbox dumps?

Comment 30 Andrew Beekhof 2012-06-20 05:39:51 UTC
Ok, I hit it

Jun 20 12:31:17 [18009] pcmk-1 corosync debug   [TOTEM ] Delivering MCAST message with seq 1e to pending delivery queue
Jun 20 12:31:17 [18009] pcmk-1 corosync warning [CPG   ] Unknown node 66 -> we will not deliver message
Jun 20 12:31:17 [18009] pcmk-1 corosync debug   [QB    ]  writing total of: 8388620

Just trying to track down the core file and blackbox.

Interestingly, my node IDs are 1-4, not 66.

Comment 31 Andrew Beekhof 2012-06-20 05:42:58 UTC
Created attachment 593132 [details]
Blackbox contents from pcmk-1 after "unknown node abort"

Took me a while to remember the pid in corosync.log is off-by-one.

Comment 32 Andrew Beekhof 2012-06-20 05:47:03 UTC
Backtrace from the core file

Core was generated by `corosync'.
Program terminated with signal 6, Aborted.
#0  0x0000003d5aa35965 in __GI_raise (sig=sig@entry=6) at ../nptl/sysdeps/unix/sysv/linux/raise.c:64
64	  return INLINE_SYSCALL (tgkill, 3, pid, selftid, sig);
(gdb) where
#0  0x0000003d5aa35965 in __GI_raise (sig=sig@entry=6) at ../nptl/sysdeps/unix/sysv/linux/raise.c:64
#1  0x0000003d5aa37118 in __GI_abort () at abort.c:91
#2  0x0000003d5aa2e6e2 in __assert_fail_base (fmt=0x3d5ab78448 "%s%s%s:%u: %s%sAssertion `%s' failed.\n%n", assertion=assertion@entry=0x42546a "0", file=file@entry=0x425451 "cpg.c", line=line@entry=1371, function=function@entry=0x425980 "message_handler_req_exec_cpg_mcast") at assert.c:94
#3  0x0000003d5aa2e792 in __GI___assert_fail (assertion=assertion@entry=0x42546a "0", file=file@entry=0x425451 "cpg.c", line=line@entry=1371, function=function@entry=0x425980 "message_handler_req_exec_cpg_mcast") at assert.c:103
#4  0x0000000000412c30 in message_handler_req_exec_cpg_mcast (message=0x7f1872c0e019, nodeid=102) at cpg.c:1371
#5  0x00000030382159dd in app_deliver_fn (endian_conversion_required=0, msg_len=<optimized out>, msg=0x7f1872c0e014, nodeid=102) at totempg.c:530
#6  totempg_deliver_fn (nodeid=102, msg=0x1214f43, msg_len=<optimized out>, endian_conversion_required=0) at totempg.c:639
#7  0x000000303820c276 in messages_deliver_to_app (instance=instance@entry=0x7f1872dd2010, end_point=<optimized out>, skip=0) at totemsrp.c:3842
#8  0x00000030382123af in message_handler_mcast (instance=0x7f1872dd2010, msg=<optimized out>, msg_len=<optimized out>, endian_conversion_needed=<optimized out>) at totemsrp.c:3960
#9  0x000000303820b63c in rrp_deliver_fn (context=0xb657c0, msg=0xba72d8, msg_len=187) at totemrrp.c:1792
#10 0x0000003038206a11 in net_deliver_fn (fd=<optimized out>, revents=<optimized out>, data=0xba7270) at totemudp.c:465
#11 0x000000365880b3ff in _poll_dispatch_and_take_back_ (item=0xb5c1b8, p=<optimized out>) at loop_poll.c:98
#12 0x000000365880afe7 in qb_loop_run_level (level=0xb5bde8) at loop.c:45
#13 qb_loop_run (lp=<optimized out>) at loop.c:206
#14 0x00000000004078d5 in main (argc=<optimized out>, argv=<optimized out>, envp=<optimized out>) at main.c:1229

Comment 33 Andrew Beekhof 2012-06-20 05:47:57 UTC
Created attachment 593133 [details]
Backtrace with all variables printed

Comment 34 Jan Friesse 2012-06-20 09:07:00 UTC
Ok,
so today it's also very simple to reproduce, and 2 nodes are enough.

1. Node 1,2 running corosync and testcpg
2. Pause node 1 (ctrl+z)
3. On node 1, send some messages via testcpg (it's not answering, but this doesn't matter). Simply hitting the ENTER key a few times is enough.
4. Unpause node 1 (fg %1)

Result:

node1 -
Jun 20 10:57:57 debug   [CPG   ] got mcast request on 0x213f790
Jun 20 10:57:57 debug   [CPG   ] got mcast request on 0x213f790
...
Jun 20 10:57:57 debug   [TOTEM ] entering GATHER state from 9.
Jun 20 10:57:57 debug   [TOTEM ] mcasted message added to pending queue
Jun 20 10:57:57 debug   [TOTEM ] Creating commit token because I am the rep.
Jun 20 10:57:57 debug   [TOTEM ] Saving state aru e high seq received e
Jun 20 10:57:57 debug   [TOTEM ] Storing new sequence id for ring a00000000000b30
Jun 20 10:57:57 debug   [TOTEM ] entering COMMIT state.
Jun 20 10:57:57 debug   [TOTEM ] got commit token
Jun 20 10:57:57 debug   [TOTEM ] entering RECOVERY state.
...
Jun 20 10:57:57 debug   [TOTEM ] Did not need to originate any messages in recovery.
Jun 20 10:57:57 debug   [TOTEM ] got commit token
Jun 20 10:57:57 debug   [TOTEM ] Sending initial ORF token
Jun 20 10:57:57 debug   [TOTEM ] token retrans flag is 0 my set retrans flag0 retrans queue empty 1 count 0, aru 0
Jun 20 10:57:57 debug   [TOTEM ] install seq 0 aru 0 high seq received 0
...
Jun 20 10:57:57 debug   [TOTEM ] Resetting old ring state
Jun 20 10:57:57 debug   [TOTEM ] recovery to regular 1-0
Jun 20 10:57:57 debug   [TOTEM ] Delivering to app f to e
Jun 20 10:57:57 debug   [MAIN  ] Member left: r(0) ip(10.34.38.108)
Jun 20 10:57:57 debug   [MAIN  ] Member joined: r(0) ip(10.34.38.108)
Jun 20 10:57:57 debug   [TOTEM ] entering OPERATIONAL state.
Jun 20 10:57:57 notice  [TOTEM ] A processor joined or left the membership and a new membership (10.34.38.107:720575940379282224) was formed.
Jun 20 10:57:57 debug   [TOTEM ] Received ringid(10.34.38.107:720575940379282224) seq 1
Jun 20 10:57:57 debug   [TOTEM ] Delivering 0 to 1
Jun 20 10:57:57 debug   [TOTEM ] Delivering MCAST message with seq 1 to pending delivery queue
Jun 20 10:57:57 debug   [TOTEM ] mcasted message added to pending queue
Jun 20 10:57:57 debug   [TOTEM ] Delivering 1 to 4
Jun 20 10:57:57 debug   [TOTEM ] Delivering MCAST message with seq 2 to pending delivery queue
Jun 20 10:57:57 debug   [TOTEM ] Delivering MCAST message with seq 3 to pending delivery queue
Jun 20 10:57:57 debug   [TOTEM ] Delivering MCAST message with seq 4 to pending delivery queue

^^^ These messages are the result of multiple ENTER key hits ... as you can see, these messages were sent BEFORE the sync barrier

node 2:
(Log starts from the time when node 1 was successfully removed from the membership)
Jun 20 10:57:57 debug   [TOTEM ] entering GATHER state from 11.
Jun 20 10:57:57 debug   [TOTEM ] got commit token
Jun 20 10:57:57 debug   [TOTEM ] Saving state aru 3 high seq received 3
Jun 20 10:57:57 debug   [TOTEM ] Storing new sequence id for ring a00000000000b30
Jun 20 10:57:57 debug   [TOTEM ] entering COMMIT state.
Jun 20 10:57:57 debug   [TOTEM ] got commit token
Jun 20 10:57:57 debug   [TOTEM ] entering RECOVERY state.
....
Jun 20 10:57:57 debug   [TOTEM ] Did not need to originate any messages in recovery.
Jun 20 10:57:57 debug   [TOTEM ] token retrans flag is 0 my set retrans flag0 retrans queue empty 1 count 0, aru ffffffff
Jun 20 10:57:57 debug   [TOTEM ] install seq 0 aru 0 high seq received 0
...
Jun 20 10:57:57 debug   [MAIN  ] Member joined: r(0) ip(10.34.38.107)
Jun 20 10:57:57 debug   [TOTEM ] entering OPERATIONAL state.
Jun 20 10:57:57 notice  [TOTEM ] A processor joined or left the membership and a new membership (10.34.38.107:720575940379282224) was formed.
Jun 20 10:57:57 debug   [TOTEM ] mcasted message added to pending queue
Jun 20 10:57:57 debug   [TOTEM ] Delivering 0 to 1
Jun 20 10:57:57 debug   [TOTEM ] Delivering MCAST message with seq 1 to pending delivery queue
Jun 20 10:57:57 debug   [TOTEM ] Received ringid(10.34.38.107:720575940379282224) seq 1
Jun 20 10:57:57 debug   [TOTEM ] Received ringid(10.34.38.107:720575940379282224) seq 2
Jun 20 10:57:57 debug   [TOTEM ] Delivering 1 to 2
Jun 20 10:57:57 debug   [TOTEM ] Delivering MCAST message with seq 2 to pending delivery queue
Jun 20 10:57:57 warning [CPG   ] Unknown node -> we will not deliver message
Jun 20 10:57:57 warning [CPG   ] Unknown node -> we will not deliver message

^^^ These are messages created on node 1 (the result of hitting the ENTER key)

This problem appears even without previous patches (master git 99faa3b864e3ff12f78cdfd8be56e4e6393e3c2e).

The question is, HOW should corosync behave in such a situation? (And no, dropping the messages is probably not a good answer, or at least not a good answer at the CPG layer.)

Comment 35 Andrew Beekhof 2012-06-21 01:42:34 UTC
(In reply to comment #34)
> Ok,
> so today, it's also very simple to reproduce, and also 2 nodes are enough.
> 
> 1. Node 1,2 running corosync and testcpg
> 2. Pause node 1 (ctrl+z)

I'm still trying to get my head around the sequence here... 
Does this step mean ctrl+z on the corosync process?
If so, I wonder if the IPC code can detect that somehow and reject the messages with -EAGAIN.

> 3. On node 1, send some messages by testcpg (it's not answering but this
> doesn't matter). Simple hit ENTER key few times is enough.
> 4. Unpause node 1 (fg %1)

Comment 36 Jan Friesse 2012-06-21 05:55:30 UTC
(In reply to comment #35)
> (In reply to comment #34)
> > Ok,
> > so today, it's also very simple to reproduce, and also 2 nodes are enough.
> > 
> > 1. Node 1,2 running corosync and testcpg
> > 2. Pause node 1 (ctrl+z)
> 
> I'm still trying to get my head around the sequence here... 
> Does this step mean ctrl+z on the corosync process?

Ya, the corosync process (I believe it's easy to understand, because step 3 is to send something to testcpg, and there are only 2 processes running ;) ). I usually run corosync in the foreground (when developing / testing), so ctrl+z makes sense here (I understand that in daemon mode this doesn't apply).

> If so, I wonder if the IPC code can detect that somehow and reject the
> messages with -EAGAIN.

There are two things. First, we must make sure we don't fix only this specific case (corosync paused by ctrl+z), because the same thing can happen because of the scheduler, a slow CPU, ... Second, IPC is now more or less non-blocking (the only blocking part is putting the message onto a queue, which is a fast operation), so returning -EAGAIN is almost impossible now.

> 
> > 3. On node 1, send some messages by testcpg (it's not answering but this
> > doesn't matter). Simple hit ENTER key few times is enough.
> > 4. Unpause node 1 (fg %1)

Comment 37 Jan Friesse 2012-11-08 15:05:18 UTC
Created attachment 640873 [details]
Proposed patch - part 4 - Fix problem with sync operations under very rare circumstances

This patch creates a special message queue for synchronization messages.
This prevents messages that are queued in the new_message_queue but not
yet originated from corrupting the synchronization process.

Comment 38 Jan Friesse 2012-11-08 15:06:04 UTC
Created attachment 640874 [details]
Proposed patch - part 5 - Handle segfault in backlog_get

If instance->memb_state is not OPERATIONAL or RECOVERY, we were passing
NULL to the cs_queue_used call.

Comment 39 Jan Friesse 2012-11-08 15:07:41 UTC
Created attachment 640875 [details]
Proposed patch - part 6 - Add waiting_trans_ack also to fragmentation layer

Support for waiting_trans_ack may fail if synchronization happens
between deliveries of the fragments of a message. In such a situation,
the fragmentation layer waits for a message with the correct sequence
number, but it will never arrive.

The solution is to handle (via callback) the change of waiting_trans_ack and use a different queue.

Comment 40 Jan Friesse 2012-11-28 07:39:30 UTC
All patches are now included in both the flatiron git branch (will be released as 1.4.5) and the needle git branch (will be released as 2.1.1 and 2.2.0). Closing this bug as fixed upstream.

