Created attachment 583724 [details]
Full logs and versions

Description of problem:

4 node cluster:
- pcmk-1 = 1702537408
- pcmk-2 = 1719314624
- pcmk-3 = 1736091840
- pcmk-4 = 1752869056

All nodes see pcmk-1, pcmk-2, pcmk-4.
pcmk-2 cannot see pcmk-3; everyone else can.

Once we reach this state, the only messages logged by corosync are of the form:

  corosync warning [CPG ] Unknown node -> we will not deliver message

This repeats for about 15 minutes until pcmk-4 is shut down and things appear to return to normal.

The other nodes also show the same message at times, but the pacemaker logs show that many messages are getting through in between.

After talking with Angus: if nothing else, it is probably a good idea to add some additional information to that log message.

Version-Release number of selected component (if applicable):

Name        : corosync
Version     : 1.99.9
Release     : 1.fc17
Architecture: x86_64
Install Date: Fri 13 Apr 2012 01:58:59 CEST
Group       : System Environment/Base
Size        : 379855
License     : BSD
Signature   : RSA/SHA256, Fri 30 Mar 2012 02:35:59 CEST, Key ID 50e94c991aca3465
Source RPM  : corosync-1.99.9-1.fc17.src.rpm
Build Date  : Wed 28 Mar 2012 07:15:04 CEST
Build Host  : x86-07.phx2.fedoraproject.org

How reproducible:
Rare

Steps to Reproduce:
1. Run CTS
2. Pray
3.

Actual results:

Expected results:

Additional info:

May 11 04:55:10 [3718] pcmk-2 corosync notice [TOTEM ] Initializing transport (UDP/IP Multicast).
May 11 04:55:10 [3718] pcmk-2 corosync notice [TOTEM ] Initializing transmit/receive security (NSS) crypto: none hash: none
May 11 04:55:10 [3718] pcmk-2 corosync notice [TOTEM ] The network interface [192.168.122.102] is now up.
May 11 04:55:10 [3718] pcmk-2 corosync notice [SERV ] Service engine loaded: corosync configuration map access [0]
May 11 04:55:10 [3718] pcmk-2 corosync info [QB ] server name: cmap
May 11 04:55:10 [3718] pcmk-2 corosync notice [SERV ] Service engine loaded: corosync configuration service [1]
May 11 04:55:10 [3718] pcmk-2 corosync info [QB ] server name: cfg
May 11 04:55:10 [3718] pcmk-2 corosync notice [SERV ] Service engine loaded: corosync cluster closed process group service v1.01 [2]
May 11 04:55:10 [3718] pcmk-2 corosync info [QB ] server name: cpg
May 11 04:55:10 [3718] pcmk-2 corosync notice [SERV ] Service engine loaded: corosync profile loading service [4]
May 11 04:55:10 [3718] pcmk-2 corosync notice [QUORUM] Using quorum provider corosync_votequorum
May 11 04:55:10 [3718] pcmk-2 corosync notice [SERV ] Service engine loaded: corosync vote quorum service v1.0 [5]
May 11 04:55:10 [3718] pcmk-2 corosync info [QB ] server name: votequorum
May 11 04:55:10 [3718] pcmk-2 corosync notice [SERV ] Service engine loaded: corosync cluster quorum service v0.1 [3]
May 11 04:55:10 [3718] pcmk-2 corosync info [QB ] server name: quorum
May 11 04:55:10 [3718] pcmk-2 corosync notice [QUORUM] Members[1]: 1719314624
May 11 04:55:10 [3718] pcmk-2 corosync notice [TOTEM ] A processor joined or left the membership and a new membership was formed.
May 11 04:55:10 [3718] pcmk-2 corosync notice [MAIN ] Completed service synchronization, ready to provide service.
May 11 04:55:10 [3718] pcmk-2 corosync notice [QUORUM] Members[2]: 1719314624 1736091840
May 11 04:55:10 [3718] pcmk-2 corosync notice [TOTEM ] A processor joined or left the membership and a new membership was formed.
May 11 04:55:10 [3718] pcmk-2 corosync notice [MAIN ] Completed service synchronization, ready to provide service.
May 11 04:55:11 [3718] pcmk-2 corosync notice [QUORUM] Members[4]: 1702537408 1719314624 1736091840 1752869056
May 11 04:55:11 [3718] pcmk-2 corosync notice [TOTEM ] A processor joined or left the membership and a new membership was formed.
May 11 04:55:11 [3718] pcmk-2 corosync notice [QUORUM] This node is within the primary component and will provide service.
May 11 04:55:11 [3718] pcmk-2 corosync notice [QUORUM] Members[4]: 1702537408 1719314624 1736091840 1752869056
May 11 04:55:11 [3718] pcmk-2 corosync notice [TOTEM ] Retransmit List: 1 2 3
May 11 04:55:11 [3718] pcmk-2 corosync notice [MAIN ] Completed service synchronization, ready to provide service.
May 11 04:55:11 [3718] pcmk-2 corosync warning [CPG ] Unknown node -> we will not deliver message
May 11 04:55:11 [3718] pcmk-2 corosync warning [CPG ] Unknown node -> we will not deliver message
...
May 11 05:10:27 [3718] pcmk-2 corosync warning [CPG ] Unknown node -> we will not deliver message
May 11 05:10:30 [3718] pcmk-2 corosync warning [CPG ] Unknown node -> we will not deliver message
May 11 05:10:30 [3718] pcmk-2 corosync notice [QUORUM] Members[3]: 1702537408 1719314624 1736091840

At this point pcmk-2 starts to receive messages from pcmk-3.
Reproduced with the latest from f17. Now that I know what to look for, perhaps it's not so rare after all.

Name        : corosync
Version     : 2.0.0
Release     : 1.fc17
Architecture: x86_64
Install Date: Fri 11 May 2012 07:03:45 CEST
Group       : System Environment/Base
Size        : 383797
License     : BSD
Signature   : RSA/SHA256, Wed 11 Apr 2012 19:53:39 CEST, Key ID 50e94c991aca3465
Source RPM  : corosync-2.0.0-1.fc17.src.rpm
Build Date  : Wed 11 Apr 2012 08:55:29 CEST
Build Host  : x86-09.phx2.fedoraproject.org
I managed to catch a live cluster in this state.

pcmk-1 == 192.168.122.101
pcmk-2 == 192.168.122.102

I verified that the "we will not deliver message" log occurs whenever the cib on pcmk-2 tries to send a CPG message to the cib on pcmk-1. Probably in other cases as well. Black box data will be attached.

[root@pcmk-2 ~]# ps axf
...
24038 ?        Ssl   17:00 /usr/sbin/corosync
24051 ?        Ssl    0:28 /usr/sbin/pacemakerd -f
24055 ?        Ssl    0:45  \_ /usr/libexec/pacemaker/cib
24057 ?        Ss     0:30  \_ /usr/libexec/pacemaker/stonithd
24058 ?        Ss     2:35  \_ /usr/lib64/heartbeat/lrmd
24059 ?        Ss     0:52  \_ /usr/libexec/pacemaker/attrd
24060 ?        Ss     0:28  \_ /usr/libexec/pacemaker/pengine
24568 ?        Ss     0:00  \_ /usr/libexec/pacemaker/crmd
24501 ?        Ssl    0:00 /usr/sbin/console-kit-daemon --no-daemon

[root@pcmk-2 ~]# corosync-cpgtool
Group Name           PID       Node ID
crmd\x00
                   17249    1702537408 (192.168.122.101)
                   24568    1719314624 (192.168.122.102)
attrd\x00
                   17247    1702537408 (192.168.122.101)
stonith-ng\x00
                   17245    1702537408 (192.168.122.101)
cib\x00
                   17243    1702537408 (192.168.122.101)
pcmk\x00
                   17242    1702537408 (192.168.122.101)

[root@pcmk-2 ~]# grep 24055 /var/log/messages | grep "info: pcmk_cpg_membership"
May 15 02:50:23 pcmk-2 cib[24055]: info: pcmk_cpg_membership: Member[0] 1719314624
May 15 02:50:23 pcmk-2 cib[24055]: info: pcmk_cpg_membership: Left[0] 1702537408
May 15 02:50:59 pcmk-2 cib[24055]: info: pcmk_cpg_membership: Member[0] 1702537408
May 15 02:50:59 pcmk-2 cib[24055]: info: pcmk_cpg_membership: Member[1] 1719314624
May 15 02:50:59 pcmk-2 cib[24055]: info: pcmk_cpg_membership: Member[0] 1702537408
May 15 02:50:59 pcmk-2 cib[24055]: info: pcmk_cpg_membership: Left[0] 1719314624
May 15 02:51:02 pcmk-2 cib[24055]: info: pcmk_cpg_membership: Left[0] 1702537408
May 15 02:51:05 pcmk-2 cib[24055]: info: pcmk_cpg_membership: Member[0] 1702537408
May 15 02:51:40 pcmk-2 cib[24055]: info: pcmk_cpg_membership: Left[0] 1702537408
May 15 02:51:41 pcmk-2 cib[24055]: info: pcmk_cpg_membership: Member[0] 1702537408
May 15 02:56:33 pcmk-2 cib[24055]: info: pcmk_cpg_membership: Left[0] 1702537408
May 15 02:56:47 pcmk-2 cib[24055]: info: pcmk_cpg_membership: Member[0] 1702537408
May 15 02:59:29 pcmk-2 cib[24055]: info: pcmk_cpg_membership: Left[0] 1702537408
May 15 02:59:31 pcmk-2 cib[24055]: info: pcmk_cpg_membership: Member[0] 1702537408

[root@pcmk-1 ~]# ps axf
...
17224 ?        Ssl   28:08 /usr/sbin/corosync
17242 ?        Ssl    0:35 /usr/sbin/pacemakerd -f
17243 ?        Ssl    0:48  \_ /usr/libexec/pacemaker/cib
17245 ?        Ssl    1:17  \_ /usr/libexec/pacemaker/stonithd
17246 ?        Ss     1:52  \_ /usr/lib64/heartbeat/lrmd
17247 ?        Ss     0:55  \_ /usr/libexec/pacemaker/attrd
17248 ?        Ss     0:46  \_ /usr/libexec/pacemaker/pengine
17249 ?        Ss     0:57  \_ /usr/libexec/pacemaker/crmd
22059 ?        Ssl    0:00 /usr/sbin/console-kit-daemon --no-daemon

[root@pcmk-1 ~]# corosync-cpgtool
Group Name           PID       Node ID
crmd\x00
                   17249    1702537408 (192.168.122.101)
                   24568    1719314624 (192.168.122.102)
attrd\x00
                   17247    1702537408 (192.168.122.101)
stonith-ng\x00
                   17245    1702537408 (192.168.122.101)
cib\x00
                   17243    1702537408 (192.168.122.101)
pcmk\x00
                   17242    1702537408 (192.168.122.101)

[root@pcmk-1 ~]# grep 17243 /var/log/messages | grep "info: pcmk_cpg_membership"
May 15 02:51:02 pcmk-1 cib[17243]: info: pcmk_cpg_membership: Member[0] 1702537408
May 15 02:51:02 pcmk-1 cib[17243]: info: pcmk_cpg_membership: Left[0] 1719314624
May 15 02:51:02 pcmk-1 cib[17243]: info: pcmk_cpg_membership: Member[0] 1702537408
May 15 02:51:02 pcmk-1 cib[17243]: info: pcmk_cpg_membership: Member[1] 1719314624
May 15 02:51:09 pcmk-1 cib[17243]: info: pcmk_cpg_membership: Member[0] 1702537408
May 15 02:51:09 pcmk-1 cib[17243]: info: pcmk_cpg_membership: Left[0] 1719314624
Created attachment 584502 [details] Black box from pcmk-1
Created attachment 584503 [details] Black box from pcmk-2
Andrew, thanks for all the blackboxes/reports you've sent. I'm not ignoring this bug (even if it may look that way); it's actually getting higher and higher priority (but sadly there were still things with even greater priority...).

Anyway, I have one question and one appeal.

Question:

corosync-cpgtool
Group Name           PID       Node ID
crmd\x00
                   17249    1702537408 (192.168.122.101)
                   24568    1719314624 (192.168.122.102)
attrd\x00
                   17247    1702537408 (192.168.122.101)
stonith-ng\x00
                   17245    1702537408 (192.168.122.101)
cib\x00
                   17243    1702537408 (192.168.122.101)
pcmk\x00
                   17242    1702537408 (192.168.122.101)

Is it normal that only 192.168.122.101 is connected to the pcmk, cib, ... groups, or should every node be?

Appeal:
Do you think it's possible to create a core dump at the place where the "Unknown node -> we will not deliver message" message is printed?

I will try to prepare patches to enhance the logging from cpg tomorrow, so it should be easier to find out what is happening (and hopefully it will allow me to at least reproduce the bug myself in a shorter time).
(In reply to comment #5)
> Is this normal that only 192.168.122.101 is connected to pcmk, cib, ... group
> or every node should be?

The latter. All the processes should be connected on every node.

> Appeal:
> do you think it's possible to create core dump on place where "Unknown node ->
> we will not deliver message" message is printed?

I can try, but it involves catching the cluster while it's in this state, which is tricky. Yesterday I got lucky :)

If it happens again I'll set a breakpoint at that log and run the cibadmin command (which sends a CPG message) to trigger it.

What information do you want?
(In reply to comment #6)
> > Is this normal that only 192.168.122.101 is connected to pcmk, cib, ... group
> > or every node should be?
>
> The latter. All the processes should be connected on every node.

This is what I was afraid of :(

> > Appeal:
> > do you think it's possible to create core dump on place where "Unknown node ->
> > we will not deliver message" message is printed?
>
> I can try, but it involves catching the cluster when its in this state which is
> tricky.
> Yesterday I got lucky :)
> If it happens again I'll set a breakpoint at that log and run the cibadmin
> command (which sends a CPG message) to trigger it.
>
> What information do you want?

Actually, I wanted to find out what pids are in the internal structures... but that seems less important now, because what I've seen directed me onto another path.

Can you please apply the patch in the following comment and send me blackboxes (ideally again from a 2 node cluster) when the problem appears?

For now it looks like the join msg was not sent / delivered, or some kind of weird coincidence between the downlist and the join list happened. Hopefully with more debug information I will be smarter.
Created attachment 584918 [details]
Add more debug information

The patch should apply to current master (88dd3e1eeacd64701d665f10acbc40f3795dd32f).
Created attachment 584944 [details]
Add even more debug information

The patch should apply to current master (88dd3e1eeacd64701d665f10acbc40f3795dd32f). It's a replacement for the previous one!
Built and running.

Question: Is it expected that the "unknown" message is ever seen?

I see it semi-regularly for a period of 1-3 seconds, but then it seems to resolve itself. It is much rarer that it persists for minutes or longer - but perhaps that's just because there were no additional membership changes to break the cycle.
I got this from the corosync logs today, but it doesn't seem like there is much more information:

May 17 06:03:41 [10190] pcmk-2 corosync notice [QUORUM] Members[3]: 101 102 103
May 17 06:03:41 [10190] pcmk-2 corosync notice [TOTEM ] A processor joined or left the membership and a new membership was formed.
May 17 06:03:41 [10190] pcmk-2 corosync notice [QUORUM] This node is within the non-primary component and will NOT provide any services.
May 17 06:03:41 [10190] pcmk-2 corosync notice [QUORUM] Members[1]: 102
May 17 06:03:41 [10190] pcmk-2 corosync notice [QUORUM] Members[2]: 101 102
May 17 06:03:41 [10190] pcmk-2 corosync notice [TOTEM ] A processor joined or left the membership and a new membership was formed.
May 17 06:03:41 [10190] pcmk-2 corosync warning [CPG ] downlist left_list: 3 received in state 0
May 17 06:03:41 [10190] pcmk-2 corosync notice [MAIN ] Completed service synchronization, ready to provide service.
May 17 06:03:41 [10190] pcmk-2 corosync warning [CPG ] Unknown node -> we will not deliver message
May 17 06:03:41 [10190] pcmk-2 corosync warning [CPG ] Unknown node -> we will not deliver message
May 17 06:03:41 [10190] pcmk-2 corosync warning [CPG ] Unknown node -> we will not deliver message
May 17 06:03:41 [10190] pcmk-2 corosync warning [CPG ] Unknown node -> we will not deliver message
May 17 06:03:41 [10190] pcmk-2 corosync warning [CPG ] Unknown node -> we will not deliver message
May 17 06:03:41 [10190] pcmk-2 corosync warning [CPG ] Unknown node -> we will not deliver message
May 17 06:03:41 [10190] pcmk-2 corosync warning [CPG ] Unknown node -> we will not deliver message
May 17 06:03:41 [10190] pcmk-2 corosync warning [CPG ] Unknown node -> we will not deliver message
May 17 06:03:41 [10190] pcmk-2 corosync warning [CPG ] Unknown node -> we will not deliver message
May 17 06:03:41 [10190] pcmk-2 corosync warning [CPG ] Unknown node -> we will not deliver message
May 17 06:03:41 [10190] pcmk-2 corosync warning [CPG ] Unknown node -> we will not deliver message
May 17 06:03:41 [10190] pcmk-2 corosync warning [CPG ] Unknown node -> we will not deliver message
May 17 06:03:41 [10190] pcmk-2 corosync warning [CPG ] Unknown node -> we will not deliver message
May 17 06:03:41 [10190] pcmk-2 corosync warning [CPG ] Unknown node -> we will not deliver message
May 17 06:03:41 [10190] pcmk-2 corosync warning [CPG ] Unknown node -> we will not deliver message
May 17 06:03:41 [10190] pcmk-2 corosync warning [CPG ] Unknown node -> we will not deliver message
May 17 06:03:42 [10190] pcmk-2 corosync warning [CPG ] Unknown node -> we will not deliver message
May 17 06:03:42 [10190] pcmk-2 corosync warning [CPG ] Unknown node -> we will not deliver message
May 17 06:03:42 [10190] pcmk-2 corosync warning [CPG ] Unknown node -> we will not deliver message
May 17 06:03:48 [10190] pcmk-2 corosync warning [CPG ] Unknown node -> we will not deliver message
May 17 06:04:09 [10190] pcmk-2 corosync notice [QUORUM] Members[3]: 101 102 104
May 17 06:04:09 [10190] pcmk-2 corosync notice [TOTEM ] A processor joined or left the membership and a new membership was formed.
May 17 06:04:10 [10190] pcmk-2 corosync notice [QUORUM] Members[4]: 101 102 103 104
May 17 06:04:10 [10190] pcmk-2 corosync notice [TOTEM ] A processor joined or left the membership and a new membership was formed.
May 17 06:04:10 [10190] pcmk-2 corosync notice [QUORUM] This node is within the primary component and will provide service.
May 17 06:04:10 [10190] pcmk-2 corosync notice [QUORUM] Members[4]: 101 102 103 104
May 17 06:04:10 [10190] pcmk-2 corosync notice [MAIN ] Completed service synchronization, ready to provide service.
Andrew,

(In reply to comment #10)
> Built and running.

Thanks.

> Question: Is it expected that the "unknown" message is ever seen?

NO. It should never happen. Basically, all nodes should agree on the membership (which is partly true here), and messages cannot be lost without letting the user app know.

> I see it semi-regularly for a period of 1-3 seconds but then it seems to
> resolve itself.

This is the reason why I need more logs (blackbox).

> It is much rarer that it persists for minutes or longer - but perhaps thats
> just because there were no additional membership changes to break the cycle.

Exactly. A membership change will make the problem disappear.

BTW, I've added all the new log messages as DEBUG, so I need the blackbox output.
(In reply to comment #12)
> BTW. I've added all new log messages as a DEBUG, so need blackbox output.

That is a problem. While the problem gets triggered reasonably often, it's not really practical to expect people to watch logs from N machines, in real time, for hours, so that they can run additional commands in the 1-2 minute window before the processes exit (after all, nodes joining/leaving is what triggers this).

The blackbox is awesome, as is the ability to grab its contents while the process is running. Pacemaker now uses that too. But it's not a cure-all and, particularly for problems like this, you need the capability to trigger it programmatically too.

Pacemaker added "dump the blackbox" to its non-fatal assertion code for this purpose, but it could equally be integrated into the error/warning logging code.
Angus, Regarding Comment #13, any chance of implementing that as a flag to the setup of the flight recorder?
(In reply to comment #14)
> Angus,
>
> Regarding Comment #13, any chance of implementing that as a flag to the
> setup of the flight recorder?

I am not sure what a flag would do in the blackbox world.

What I think Andrew is suggesting is that we set up a soft_assert() in corosync that runs the blackbox dump (i.e. self triggering). Then, instead of logging, we could make a debug build that soft_assert()'s in this particular condition, thus not requiring Andrew to stare at his screen for ages waiting for this event to happen.

Something like this:

#define soft_assert(_cond_)                                                     \
do {                                                                            \
    if (_cond_) {                                                               \
        qb_log_blackbox_write_to_file (LOCALSTATEDIR "/lib/corosync/assert");   \
    }                                                                           \
} while (0)
I think what Andrew was suggesting was a config option that automatically dumps the blackbox anytime a WARN or ERR log type occurs. Andrew, could you offer more info as to what you desire here? Regards -steve
(In reply to comment #16) > I think what Andrew was suggesting was a config option that automatically > dumps the blackbox anytime a WARN or ERR log type occurs. Either this or Angus' idea would be good. Anything that automates the writing out of blackbox data when something bad happens.
Created attachment 589886 [details]
Add even more debug information - try 2

Andrew, can you please try applying this patch to master? It has assert(0) on the "unknown node" failure, so it will abort corosync but allow me to find out more information.

Thanks
Good news, I was able to reproduce this bug very reliably and very easily.

Reproducer:
- node1, node2, both running testcpg and corosync
- ctrl+z corosync on node1
- wait for node2 to create a membership
- fg %1 on node1

-> node2 accepts a downlist which looks like:

sender r(0) ip(10.34.38.107) ; members(old:1 left:0)
comparing: sender r(0) ip(10.34.38.106) ; members(old:2 left:1)
chosen downlist: sender r(0) ip(10.34.38.106) ; members(old:2 left:1)

and also ignores its own join messages.
The main question here is: what membership changes do we want to see?

It looks like Angus's idea was to have the SAME downlist across all nodes. In other words:

node1 - was sleeping; old membership was 1,2 - new membership is 1,2, transitional was 1 -> left_list = 2
node2 - was not sleeping, and was able to correctly deliver the loss of node 1; when node 1 reconnected, it sees old membership = 2 - new membership is 1,2, transitional was nothing -> left_list = nothing

Angus's idea seems to be that we want every node to see node 1's point of view. Even if that makes some kind of sense, it looks more logical that:

node 1 sees that node 2 left
node 2 sees that NOBODY left

A simple patch is included in the following comment. It just changes the selection of the "best list" by removing lists which include the local node (because such a node never left the building). But it's certainly something to discuss, so Angus, Steve (and probably Andrew): which view of the world do you think is more correct?
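The selection change described above can be sketched roughly like this (illustrative C only; the struct layout and function names are assumptions, not corosync's actual cpg.c code): any candidate downlist that names the local node as having left is discarded before the usual "most surviving nodes" comparison.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>

/* Simplified model of a received downlist: the sender's previous
 * membership size and the nodes it saw leave. Illustrative only. */
struct downlist {
    uint32_t sender_nodeid;
    uint32_t old_members;
    uint32_t left_count;
    uint32_t left_nodes[16];
};

static int downlist_contains_node(const struct downlist *dl, uint32_t nodeid)
{
    for (uint32_t i = 0; i < dl->left_count; i++) {
        if (dl->left_nodes[i] == nodeid) {
            return 1;
        }
    }
    return 0;
}

/* Pick the "best" downlist, skipping any list that claims the local
 * node left: from our own point of view we never left the building. */
static const struct downlist *pick_downlist(const struct downlist *lists,
                                            size_t n, uint32_t local_nodeid)
{
    const struct downlist *best = NULL;

    for (size_t i = 0; i < n; i++) {
        if (downlist_contains_node(&lists[i], local_nodeid)) {
            continue;   /* bogus from our perspective */
        }
        if (best == NULL ||
            lists[i].old_members - lists[i].left_count >
            best->old_members - best->left_count) {
            best = &lists[i];
        }
    }
    return best;
}
```

With the two downlists from comment #20 (node 1: old:1 left:0; node 2: old:2 left:1, naming node 1), node 1 would now ignore the second list instead of accepting a claim that it had itself left.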
The same bug also exists in flatiron.
Created attachment 590918 [details] Proposed patch
Created attachment 591194 [details]
Proposed patch - part 2

Let's say the following situation happens:
- we have 3 nodes
- on the wire, the messages look like D1,J1,D2,J2,D3,J3 (D is a downlist, J is a joinlist)
- let's say D1 and D3 contain node 2
- it means that J2 is applied, but right after that D1 (or D3) is applied, which means node 2 is again considered down

It's solved by collecting the joinlists and applying them after the downlist, so the order is:
- apply the best matching downlist
- apply all joinlists

This patch is not a replacement for the first patch; it is an additional one.
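The failure mode and the fix can be illustrated with a toy up/down table (this is a sketch, not corosync code; the helper names are made up): applying joinlists in arrival order lets a later downlist knock a just-joined node down again, while deferring all joinlists until after the single chosen downlist keeps it up.

```c
#include <assert.h>

#define MAX_NODES 8

/* Toy membership model: up[n] == 1 means node n is considered up. */
static void apply_downlist(int up[], const int *left, int n)
{
    for (int i = 0; i < n; i++) {
        up[left[i]] = 0;
    }
}

static void apply_joinlist(int up[], const int *joined, int n)
{
    for (int i = 0; i < n; i++) {
        up[joined[i]] = 1;
    }
}
```

In the old scheme the wire order D1, J2, D3 (with node 2 in both D1 and D3) leaves node 2 marked down even though its joinlist arrived; in the fixed scheme only the chosen downlist is applied, then every joinlist, so node 2 ends up up.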
Honza,

I think chat to dct about what events make sense; I honestly can't remember why I went in this direction. I have seen questions (and patches) about this behavior (http://lists.corosync.org/pipermail/discuss/2011-November/000190.html).

I have no problem with the change, provided that the client apps can deal with the change in behavior. That will probably mean heaps of integration testing :(
Honza, your main question was "why does every node receive the same downlist"? I honestly don't remember (it was 2 years ago); it might not have been intentional, and therefore a bug.
Created attachment 591509 [details]
Proposed patch - part 3

It looks like I found another problem with the downlist selection algorithm (tested and confirmed also WITHOUT the previous two patches).

Let's say we have 2 nodes:
- node 2 is paused
- node 1 creates a membership (one node)
- node 2 is unpaused

The result is that node 1's downlist is selected, which means that from node 2's point of view, node 1 was never down.

The patch solves the situation by adding an additional check for the largest previous membership. So the current tests are:
1) largest (previous #nodes - #nodes known to have left)
2) (then) largest previous membership
3) (and last, as a tie-breaker) node with smallest nodeid
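The three checks above could be expressed as a comparator along these lines (an illustrative sketch; the struct fields are assumptions and do not match corosync's real downlist structures):

```c
#include <assert.h>
#include <stdint.h>

/* Summary of one sender's downlist, illustrative only. */
struct downlist_sum {
    uint32_t sender_nodeid;
    uint32_t prev_members;   /* size of the sender's previous membership */
    uint32_t left_count;     /* nodes the sender saw leave */
};

/* Returns 1 if candidate `a` should be chosen over current best `b`,
 * applying the three tests in order. */
static int downlist_better(const struct downlist_sum *a,
                           const struct downlist_sum *b)
{
    uint32_t a_alive = a->prev_members - a->left_count;
    uint32_t b_alive = b->prev_members - b->left_count;

    if (a_alive != b_alive) {                  /* 1) most surviving nodes */
        return a_alive > b_alive;
    }
    if (a->prev_members != b->prev_members) {  /* 2) largest previous membership */
        return a->prev_members > b->prev_members;
    }
    return a->sender_nodeid < b->sender_nodeid; /* 3) smallest nodeid wins ties */
}
```

The second test is the new one: a sender whose previous ring was larger is preferred over one that had already shrunk to a singleton membership, so a paused node's view no longer loses out incorrectly.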
Created attachment 591532 [details] Proposed patch - part 3 - try 2 Make conditions work correctly.
Created attachment 592063 [details]
Add assert if we were delivering to unknown node

Andrew, can you please apply the patch and then send me the blackbox dumps?
Ok, I hit it:

Jun 20 12:31:17 [18009] pcmk-1 corosync debug [TOTEM ] Delivering MCAST message with seq 1e to pending delivery queue
Jun 20 12:31:17 [18009] pcmk-1 corosync warning [CPG ] Unknown node 66 -> we will not deliver message
Jun 20 12:31:17 [18009] pcmk-1 corosync debug [QB ] writing total of: 8388620

Just trying to track down the core file and blackbox.

Interestingly, my node IDs are 1-4, not 66.
Created attachment 593132 [details] Blackbox contents from pcmk-1 after "unknown node abort" Took me a while to remember the pid in corosync.log is off-by-one.
Backtrace from the core file:

Core was generated by `corosync'.
Program terminated with signal 6, Aborted.
#0  0x0000003d5aa35965 in __GI_raise (sig=sig@entry=6) at ../nptl/sysdeps/unix/sysv/linux/raise.c:64
64        return INLINE_SYSCALL (tgkill, 3, pid, selftid, sig);
(gdb) where
#0  0x0000003d5aa35965 in __GI_raise (sig=sig@entry=6) at ../nptl/sysdeps/unix/sysv/linux/raise.c:64
#1  0x0000003d5aa37118 in __GI_abort () at abort.c:91
#2  0x0000003d5aa2e6e2 in __assert_fail_base (fmt=0x3d5ab78448 "%s%s%s:%u: %s%sAssertion `%s' failed.\n%n", assertion=assertion@entry=0x42546a "0", file=file@entry=0x425451 "cpg.c", line=line@entry=1371, function=function@entry=0x425980 "message_handler_req_exec_cpg_mcast") at assert.c:94
#3  0x0000003d5aa2e792 in __GI___assert_fail (assertion=assertion@entry=0x42546a "0", file=file@entry=0x425451 "cpg.c", line=line@entry=1371, function=function@entry=0x425980 "message_handler_req_exec_cpg_mcast") at assert.c:103
#4  0x0000000000412c30 in message_handler_req_exec_cpg_mcast (message=0x7f1872c0e019, nodeid=102) at cpg.c:1371
#5  0x00000030382159dd in app_deliver_fn (endian_conversion_required=0, msg_len=<optimized out>, msg=0x7f1872c0e014, nodeid=102) at totempg.c:530
#6  totempg_deliver_fn (nodeid=102, msg=0x1214f43, msg_len=<optimized out>, endian_conversion_required=0) at totempg.c:639
#7  0x000000303820c276 in messages_deliver_to_app (instance=instance@entry=0x7f1872dd2010, end_point=<optimized out>, skip=0) at totemsrp.c:3842
#8  0x00000030382123af in message_handler_mcast (instance=0x7f1872dd2010, msg=<optimized out>, msg_len=<optimized out>, endian_conversion_needed=<optimized out>) at totemsrp.c:3960
#9  0x000000303820b63c in rrp_deliver_fn (context=0xb657c0, msg=0xba72d8, msg_len=187) at totemrrp.c:1792
#10 0x0000003038206a11 in net_deliver_fn (fd=<optimized out>, revents=<optimized out>, data=0xba7270) at totemudp.c:465
#11 0x000000365880b3ff in _poll_dispatch_and_take_back_ (item=0xb5c1b8, p=<optimized out>) at loop_poll.c:98
#12 0x000000365880afe7 in qb_loop_run_level (level=0xb5bde8) at loop.c:45
#13 qb_loop_run (lp=<optimized out>) at loop.c:206
#14 0x00000000004078d5 in main (argc=<optimized out>, argv=<optimized out>, envp=<optimized out>) at main.c:1229
Created attachment 593133 [details] Backtrace with all variables printed
Ok, so today it's also very simple to reproduce, and 2 nodes are enough.

1. Node 1,2 running corosync and testcpg
2. Pause node 1 (ctrl+z)
3. On node 1, send some messages via testcpg (it's not answering, but this doesn't matter). Simply hitting the ENTER key a few times is enough.
4. Unpause node 1 (fg %1)

Result:

node 1:

Jun 20 10:57:57 debug [CPG ] got mcast request on 0x213f790
Jun 20 10:57:57 debug [CPG ] got mcast request on 0x213f790
...
Jun 20 10:57:57 debug [TOTEM ] entering GATHER state from 9.
Jun 20 10:57:57 debug [TOTEM ] mcasted message added to pending queue
Jun 20 10:57:57 debug [TOTEM ] Creating commit token because I am the rep.
Jun 20 10:57:57 debug [TOTEM ] Saving state aru e high seq received e
Jun 20 10:57:57 debug [TOTEM ] Storing new sequence id for ring a00000000000b30
Jun 20 10:57:57 debug [TOTEM ] entering COMMIT state.
Jun 20 10:57:57 debug [TOTEM ] got commit token
Jun 20 10:57:57 debug [TOTEM ] entering RECOVERY state.
...
Jun 20 10:57:57 debug [TOTEM ] Did not need to originate any messages in recovery.
Jun 20 10:57:57 debug [TOTEM ] got commit token
Jun 20 10:57:57 debug [TOTEM ] Sending initial ORF token
Jun 20 10:57:57 debug [TOTEM ] token retrans flag is 0 my set retrans flag0 retrans queue empty 1 count 0, aru 0
Jun 20 10:57:57 debug [TOTEM ] install seq 0 aru 0 high seq received 0
...
Jun 20 10:57:57 debug [TOTEM ] Resetting old ring state
Jun 20 10:57:57 debug [TOTEM ] recovery to regular 1-0
Jun 20 10:57:57 debug [TOTEM ] Delivering to app f to e
Jun 20 10:57:57 debug [MAIN ] Member left: r(0) ip(10.34.38.108)
Jun 20 10:57:57 debug [MAIN ] Member joined: r(0) ip(10.34.38.108)
Jun 20 10:57:57 debug [TOTEM ] entering OPERATIONAL state.
Jun 20 10:57:57 notice [TOTEM ] A processor joined or left the membership and a new membership (10.34.38.107:720575940379282224) was formed.
Jun 20 10:57:57 debug [TOTEM ] Received ringid(10.34.38.107:720575940379282224) seq 1
Jun 20 10:57:57 debug [TOTEM ] Delivering 0 to 1
Jun 20 10:57:57 debug [TOTEM ] Delivering MCAST message with seq 1 to pending delivery queue
Jun 20 10:57:57 debug [TOTEM ] mcasted message added to pending queue
Jun 20 10:57:57 debug [TOTEM ] Delivering 1 to 4
Jun 20 10:57:57 debug [TOTEM ] Delivering MCAST message with seq 2 to pending delivery queue
Jun 20 10:57:57 debug [TOTEM ] Delivering MCAST message with seq 3 to pending delivery queue
Jun 20 10:57:57 debug [TOTEM ] Delivering MCAST message with seq 4 to pending delivery queue

^^^ These messages are the result of multiple ENTER key hits... as you can see, they are sent BEFORE the sync barrier.

node 2 (logs start from the time when it successfully removed node 1 from the membership):

Jun 20 10:57:57 debug [TOTEM ] entering GATHER state from 11.
Jun 20 10:57:57 debug [TOTEM ] got commit token
Jun 20 10:57:57 debug [TOTEM ] Saving state aru 3 high seq received 3
Jun 20 10:57:57 debug [TOTEM ] Storing new sequence id for ring a00000000000b30
Jun 20 10:57:57 debug [TOTEM ] entering COMMIT state.
Jun 20 10:57:57 debug [TOTEM ] got commit token
Jun 20 10:57:57 debug [TOTEM ] entering RECOVERY state.
...
Jun 20 10:57:57 debug [TOTEM ] Did not need to originate any messages in recovery.
Jun 20 10:57:57 debug [TOTEM ] token retrans flag is 0 my set retrans flag0 retrans queue empty 1 count 0, aru ffffffff
Jun 20 10:57:57 debug [TOTEM ] install seq 0 aru 0 high seq received 0
...
Jun 20 10:57:57 debug [MAIN ] Member joined: r(0) ip(10.34.38.107)
Jun 20 10:57:57 debug [TOTEM ] entering OPERATIONAL state.
Jun 20 10:57:57 notice [TOTEM ] A processor joined or left the membership and a new membership (10.34.38.107:720575940379282224) was formed.
Jun 20 10:57:57 debug [TOTEM ] mcasted message added to pending queue
Jun 20 10:57:57 debug [TOTEM ] Delivering 0 to 1
Jun 20 10:57:57 debug [TOTEM ] Delivering MCAST message with seq 1 to pending delivery queue
Jun 20 10:57:57 debug [TOTEM ] Received ringid(10.34.38.107:720575940379282224) seq 1
Jun 20 10:57:57 debug [TOTEM ] Received ringid(10.34.38.107:720575940379282224) seq 2
Jun 20 10:57:57 debug [TOTEM ] Delivering 1 to 2
Jun 20 10:57:57 debug [TOTEM ] Delivering MCAST message with seq 2 to pending delivery queue
Jun 20 10:57:57 warning [CPG ] Unknown node -> we will not deliver message
Jun 20 10:57:57 warning [CPG ] Unknown node -> we will not deliver message

^^^ These are the messages created on node 1 (the result of hitting the ENTER key).

This problem appears even without the previous patches (master git 99faa3b864e3ff12f78cdfd8be56e4e6393e3c2e).

The question is, HOW should corosync behave in such a situation? (And no, dropping messages is probably not a good answer - or at least, not a good answer at the CPG layer.)
(In reply to comment #34)
> Ok,
> so today, it's also very simple to reproduce, and also 2 nodes are enough.
>
> 1. Node 1,2 running testcpg and corosync
> 2. Pause node 1 (ctrl+z)

I'm still trying to get my head around the sequence here...
Does this step mean ctrl+z on the corosync process? If so, I wonder if the IPC code can detect that somehow and reject the messages with -EAGAIN.

> 3. On node 1, send some messages by testcpg (it's not answering but this
> doesn't matter). Simple hit ENTER key few times is enough.
> 4. Unpause node 1 (fg %1)
(In reply to comment #35)
> (In reply to comment #34)
> > Ok,
> > so today, it's also very simple to reproduce, and also 2 nodes are enough.
> >
> > 1. Node 1,2 running testcpg and corosync
> > 2. Pause node 1 (ctrl+z)
>
> I'm still trying to get my head around the sequence here...
> Does this step mean ctrl+z on the corosync process?

Ya, the corosync process (I believe it's easy to understand, because step 3 is to send something via testcpg, and there are only 2 running processes ;) ). I usually run corosync in the foreground (when developing / testing), so ctrl+z makes sense there (I can understand that in daemon mode it doesn't).

> If so, I wonder if the IPC code can detect that somehow and reject the
> messages with -EAGAIN.

There are two things. We must make sure we don't fix only this specific case (the corosync process paused by ctrl+z), because the same thing can happen because of the scheduler, a slow cpu, ...

Second, IPC is now more or less non-blocking (the only blocking part is putting the message into the queue, which is a fast operation), so returning -EAGAIN is almost impossible now.

> > 3. On node 1, send some messages by testcpg (it's not answering but this
> > doesn't matter). Simple hit ENTER key few times is enough.
> > 4. Unpause node 1 (fg %1)
Created attachment 640873 [details]
Proposed patch - part 4 - Fix problem with sync operations under very rare circumstances

This patch creates a special message queue for synchronization messages. This prevents messages that are queued in the new_message_queue, but have not yet been originated, from corrupting the synchronization process.
Created attachment 640874 [details]
Proposed patch - part 5 - Handle segfault in backlog_get

If instance->memb_state is not OPERATIONAL or RECOVERY, we were passing NULL to the cs_queue_used call.
Created attachment 640875 [details]
Proposed patch - part 6 - Add waiting_trans_ack also to fragmentation layer

The waiting_trans_ack support may fail if synchronization happens between deliveries of a fragmented message. In such a situation, the fragmentation layer waits for a message with the correct number, but that message will never arrive.

The solution is to handle (via callback) the change of waiting_trans_ack and use a different queue.
All patches are now included in both flatiron git (will be released as 1.4.5) and needle git (will be released as 2.1.1 and 2.2.0). Closing this bug as fixed upstream.