Red Hat Bugzilla – Bug 854216
[TOTEM] FAILED TO RECEIVE + corosync crash
Last modified: 2015-09-27 22:28:34 EDT
This seems to be very similar to [bug 671575] (RHEL5). The observation was made in a cluster comprising 3 RHEL6 nodes [*] running as VM guests on a Fedora 17 host [**]. Unfortunately, some parts of the story got lost (I would have been more cautious if I had expected something like this), but the rest of the information should be comprehensive.

The plot:
a. the cluster is up and running, all seems OK
b. the cluster is added into luci, possibly with some implied consequences (see /var/log/messages in the sosreports to be attached as a follow-up)
c. messages like "[TOTEM ] Retransmit List: 65 66" (always a pair of IDs, seemingly corresponding to the other 2 nodes out of 3) immediately start to show up in syslog on all the affected nodes
d. fast-forward 1h20m, the affected node:

> corosync[3245]: [TOTEM ] FAILED TO RECEIVE
> abrt[8968]: File '/usr/sbin/corosync' seems to be deleted
> abrtd: Directory 'ccpp-2012-08-16-23:40:39-3245' creation detected
> abrt[8968]: Saved core dump of pid 3245 (/usr/sbin/corosync) to
>   /var/spool/abrt/ccpp-2012-08-16-23:40:39-3245
> rgmanager[3492]: #67: Shutting down uncleanly
> fenced[3301]: cluster is down, exiting
> fenced[3301]: daemon cpg_dispatch error 2
> fenced[3301]: cpg_dispatch error 2
> dlm_controld[3321]: cluster is down, exiting
> dlm_controld[3321]: daemon cpg_dispatch error 2
> dlm_controld[3321]: cpg_dispatch error 2
> gfs_controld[3374]: cluster is down, exiting
> gfs_controld[3374]: daemon cpg_dispatch error 2
> kernel: dlm: closing connection to node 3
> kernel: dlm: closing connection to node 2
> kernel: dlm: closing connection to node 1
> kernel: dlm: rgmanager: no userland control daemon, stopping lockspace
> kernel: Bridge firewalling registered
> kernel: ip_tables: (C) 2000-2006 Netfilter Core Team

[note: I do not understand how the firewall is related to this; neither the VMs nor the host was in use at the moment of the accident, the host had last been used shortly before it]

The other nodes around the time of the accident:

> corosync[3249]: [TOTEM ] A processor failed, forming new configuration.
> kernel: dlm: closing connection to node 2
> kernel: dlm: closing connection to node 3
> corosync[3249]: [QUORUM] Members[2]: 1 3
> corosync[3249]: [CMAN ] quorum lost, blocking activity
> corosync[3249]: [QUORUM] This node is within the non-primary component
>   and will NOT provide any services.
> corosync[3249]: [QUORUM] Members[1]: 1
> corosync[3249]: [TOTEM ] A processor joined or left the membership
>   and a new membership was formed.
> corosync[3249]: [TOTEM ] A processor joined or left the membership
>   and a new membership was formed.
> rgmanager[3496]: #1: Quorum Dissolved
> corosync[3249]: [CMAN ] quorum regained, resuming activity
> corosync[3249]: [QUORUM] This node is within the primary component
>   and will provide service.
> corosync[3249]: [QUORUM] Members[2]: 1 3
> corosync[3249]: [QUORUM] Members[2]: 1 3
> corosync[3249]: [TOTEM ] A processor joined or left the membership
>   and a new membership was formed.
> rgmanager[3496]: Quorum Regained
> rgmanager[3496]: State change: Local UP
> rgmanager[3496]: State change: rhel63-64kvm-3 UP
> rgmanager[3496]: Loading Service Data
> corosync[3249]: [CPG ] chosen downlist: sender r(0) ip(192.168.122.121) ; members(old:3 left:1)
> corosync[3249]: [MAIN ] Completed service synchronization, ready to provide service.
> rgmanager[3496]: Skipping stop-before-start: overridden by administrator

Interestingly (from my standpoint), the affected node was pretty much burning CPU cycles with two processes:

  PID USER PR NI VIRT RES  SHR S %CPU %MEM   TIME+   COMMAND
 3301 root 20  0 192m 2028 976 R 52.2  0.3 505:01.26 fenced
    4 root 20  0    0    0    0 R 44.5  0.0 432:25.04 ksoftirqd/0

IIRC, fenced was trying to connect to some socket and kept failing, because corosync was no longer alive. This is perhaps a problem of not configuring fencing properly; otherwise it is another thing to look at (perhaps add sleeps between tries or limit the number of tries? -- see the sketch after the host details below).

Digging info from the coredump (yet to be attached as well):

Thread 1:
#0  0x00007fa6ec28b8a5 in raise (sig=6) at ../nptl/sysdeps/unix/sysv/linux/raise.c:64
#1  0x00007fa6ec28d085 in abort () at abort.c:92
#2  0x00007fa6ec284a1e in __assert_fail_base (fmt=<value optimized out>,
    assertion=0x7fa6ed045ef5 "token_memb_entries >= 1", file=0x7fa6ed045e8d "totemsrp.c",
    line=<value optimized out>, function=<value optimized out>) at assert.c:96
#3  0x00007fa6ec284ae0 in __assert_fail (assertion=0x7fa6ed045ef5 "token_memb_entries >= 1",
    file=0x7fa6ed045e8d "totemsrp.c", line=1211,
    function=0x7fa6ed047150 "memb_consensus_agreed") at assert.c:105
#4  0x00007fa6ed034e86 in memb_consensus_agreed (instance=0x7fa6ed40d010) at totemsrp.c:1211
#5  0x00007fa6ed039513 in memb_join_process (instance=0x7fa6ed40d010, memb_join=0x1ba561c) at totemsrp.c:4007
#6  0x00007fa6ed039839 in message_handler_memb_join (instance=0x7fa6ed40d010, msg=<value optimized out>,
    msg_len=<value optimized out>, endian_conversion_needed=<value optimized out>) at totemsrp.c:4250
#7  0x00007fa6ed032d18 in rrp_deliver_fn (context=0x1b59050, msg=0x1ba561c, msg_len=333) at totemrrp.c:1747
#8  0x00007fa6ed02d9a8 in net_deliver_fn (handle=<value optimized out>, fd=<value optimized out>,
    revents=<value optimized out>, data=0x1ba4f50) at totemudp.c:1252
#9  0x00007fa6ed029132 in poll_run (handle=4105205169732452352) at coropoll.c:513
#10 0x0000000000406eb9 in main (argc=<value optimized out>, argv=<value optimized out>,
    envp=<value optimized out>) at main.c:1852

Thread 2:
#0  0x00007fa6ec336fc3 in __poll (fds=<value optimized out>, nfds=<value optimized out>,
    timeout=<value optimized out>) at ../sysdeps/unix/sysv/linux/poll.c:87
#1  0x0000000000409b6e in prioritized_timer_thread (data=<value optimized out>) at timer.c:127
#2  0x00007fa6ec7f6851 in start_thread (arg=0x7fa6ed460da0) at pthread_create.c:301
#3  0x00007fa6ec34067d in clone () at ../sysdeps/unix/sysv/linux/x86_64/clone.S:115

Thread 3:
#0  sem_wait () at ../nptl/sysdeps/unix/sysv/linux/x86_64/sem_wait.S:86
#1  0x0000000000407c30 in corosync_exit_thread_handler (arg=0x0) at main.c:199
#2  0x00007fa6ec7f6851 in start_thread (arg=0x7fa6e8c26700) at pthread_create.c:301
#3  0x00007fa6ec34067d in clone () at ../sysdeps/unix/sysv/linux/x86_64/clone.S:115

Threads 4-8:
#0  sem_wait () at ../nptl/sysdeps/unix/sysv/linux/x86_64/sem_wait.S:86
#1  0x00007fa6ecc16638 in ipc_sem_wait (conn=0x1bc4ba0) at ../include/corosync/coroipc_ipc.h:209
#2  pthread_ipc_consumer (conn=0x1bc4ba0) at coroipcs.c:672
#3  0x00007fa6ec7f6851 in start_thread (arg=0x7fa6ed349440) at pthread_create.c:301
#4  0x00007fa6ec34067d in clone () at ../sysdeps/unix/sysv/linux/x86_64/clone.S:115

[*] guest details

# rpm -q corosync cman rgmanager fence-agents
corosync-1.4.1-7.el6.x86_64
cman-3.0.12.1-32.el6.x86_64
rgmanager-3.0.12.1-12.el6.x86_64
fence-agents-3.1.5-17.el6.x86_64
# uname -mr
2.6.32-279.el6.x86_64 x86_64

[**] host details
(may have changed slightly and I can back-track it if this matters):

$ repoquery --tree-requires libvirt --level 1 | \
    sed -e 's|\.noarch||' -e 's|\[.*\]||'
libvirt-0:0.9.11.4-3.fc17.x86_64
 \_ bash-0:4.2.37-2.fc17.x86_64
 \_ libvirt-client-0:0.9.11.4-3.fc17.i686
 \_ libvirt-client-0:0.9.11.4-3.fc17.x86_64
 \_ libvirt-daemon-0:0.9.11.4-3.fc17.x86_64
 \_ libvirt-daemon-config-network-0:0.9.11.4-3.fc17.x86_64
 \_ libvirt-daemon-config-nwfilter-0:0.9.11.4-3.fc17.x86_64
$ uname -mr
3.5.1-1.fc17.x86_64 x86_64
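Regarding the fenced busy-loop remark above ("perhaps add sleeps between tries or limit the number of tries?"): the following is only a minimal sketch of that idea, not fenced's actual code. The socket path, the constants and the try_connect() helper are made up for illustration.

/*
 * Hypothetical sketch: bounded reconnect attempts with a delay between
 * tries, instead of spinning when the peer socket is gone.
 */
#include <stdio.h>
#include <errno.h>
#include <string.h>
#include <unistd.h>
#include <sys/socket.h>
#include <sys/un.h>

#define MAX_RETRIES     10   /* give up after this many attempts */
#define RETRY_DELAY_SEC 1    /* back off between attempts */

/* illustrative helper; the path below is not a real corosync socket */
static int try_connect(const char *path)
{
	struct sockaddr_un addr;
	int err;
	int fd = socket(AF_UNIX, SOCK_STREAM, 0);

	if (fd < 0)
		return -1;

	memset(&addr, 0, sizeof(addr));
	addr.sun_family = AF_UNIX;
	strncpy(addr.sun_path, path, sizeof(addr.sun_path) - 1);

	if (connect(fd, (struct sockaddr *)&addr, sizeof(addr)) < 0) {
		err = errno;
		close(fd);
		errno = err;
		return -1;
	}
	return fd;
}

int main(void)
{
	int i, fd = -1;

	for (i = 0; i < MAX_RETRIES; i++) {
		fd = try_connect("/var/run/example.sock");
		if (fd >= 0)
			break;
		fprintf(stderr, "connect attempt %d failed: %s\n",
			i + 1, strerror(errno));
		sleep(RETRY_DELAY_SEC); /* avoid burning CPU in a tight loop */
	}

	if (fd < 0) {
		fprintf(stderr, "giving up after %d attempts\n", MAX_RETRIES);
		return 1;
	}

	/* ... use fd ... */
	close(fd);
	return 0;
}

Either capping the attempts or sleeping between them would have kept fenced from pegging a CPU once corosync had already died; the exact policy is of course up to the fenced maintainers.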
Honza, please try the patch in Bug #636583.
Created attachment 638650 [details]
Proposed patch

If failed_to_recv is set, consensus can be empty

If failed_to_recv is set (the node detected that it is not able to receive messages), we can end up hitting the assert, because my_failed_list and my_member_list are the same list. This happens because we do not follow the specification and allow a node to mark itself as failed. Because a single-node membership is created (ignoring both the failed list and the member list) when failed_to_recv is set and consensus is reached across nodes, we can safely skip the assert.

Signed-off-by: Jan Friesse <jfriesse@redhat.com>
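To make the description above easier to follow, here is a self-contained toy model of the guard it describes. This is NOT the attached patch; the names (toy_instance, consensus_agreed, in_list) are invented for illustration. It only shows why the "members minus failed" set can become empty when a node marks itself as failed, and how returning early when failed_to_recv is set avoids tripping the assert.

/*
 * Toy model of the described fix: if the node marked itself failed
 * (failed_to_recv), the member-minus-failed set may be empty, so bail
 * out before the assert; a single-node membership is formed anyway.
 */
#include <assert.h>
#include <stdio.h>

#define PROCESSOR_COUNT_MAX 16

struct toy_instance {
	int member_list[PROCESSOR_COUNT_MAX];
	int member_entries;
	int failed_list[PROCESSOR_COUNT_MAX];
	int failed_entries;
	int failed_to_recv;
};

static int in_list(const int *list, int entries, int id)
{
	for (int i = 0; i < entries; i++)
		if (list[i] == id)
			return 1;
	return 0;
}

/* consensus itself is trivially "agreed" here; the point is the guard */
static int consensus_agreed(const struct toy_instance *inst)
{
	int token_memb_entries = 0;

	/* member list minus failed list */
	for (int i = 0; i < inst->member_entries; i++)
		if (!in_list(inst->failed_list, inst->failed_entries,
			     inst->member_list[i]))
			token_memb_entries++;

	if (inst->failed_to_recv) {
		/*
		 * The node marked itself as failed, so the subtraction can
		 * legitimately be empty; skip the assert.
		 */
		return 1;
	}

	assert(token_memb_entries >= 1);
	return 1;
}

int main(void)
{
	/* after FAILED TO RECEIVE the node has marked every member failed */
	struct toy_instance inst = {
		.member_list = { 1, 2, 3 }, .member_entries = 3,
		.failed_list = { 1, 2, 3 }, .failed_entries = 3,
		.failed_to_recv = 1,
	};

	printf("agreed: %d\n", consensus_agreed(&inst));
	return 0;
}

With failed_to_recv cleared, the same input would hit the assert, which is the crash shown in the backtrace above (memb_consensus_agreed at totemsrp.c:1211).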
"Unit" test: https://github.com/jfriesse/csts/commit/9d2491f8dc1bedd1cd7a6f4e7b8f12380a0926b0
*** Bug 917914 has been marked as a duplicate of this bug. ***
*** Bug 981111 has been marked as a duplicate of this bug. ***
Verified using failed-to-receive-crash.sh.

On corosync-1.4.1-15.el6.x86_64 (RHEL 6.4), FAILED TO RECEIVE messages are logged and, for me, in roughly 30% of the runs corosync hits the assert and dumps a core.

On corosync-1.4.1-17.el6.x86_64 (RHEL 6.5), the messages are just logged, with no core during the test period:

Sep 11 15:52:01 virt-014 corosync[22115]: [TOTEM ] FAILED TO RECEIVE
Sep 11 15:52:02 virt-014 corosync[22115]: [TOTEM ] A processor joined or left the membership and a new membership was formed.
Sep 11 15:52:02 virt-014 corosync[22115]: [CPG ] chosen downlist: sender r(0) ip(10.34.71.14) ; members(old:2 left:1)
Sep 11 15:52:02 virt-014 corosync[22115]: [MAIN ] Completed service synchronization, ready to provide service.
Sep 11 15:52:03 virt-014 corosync[22115]: [TOTEM ] A processor joined or left the membership and a new membership was formed.
Sep 11 15:52:11 virt-014 corosync[22115]: [CPG ] chosen downlist: sender r(0) ip(10.34.71.14) ; members(old:1 left:0)
Sep 11 15:52:12 virt-014 corosync[22115]: [MAIN ] Completed service synchronization, ready to provide service.
Since the problem described in this bug report should be resolved in a recent advisory, it has been closed with a resolution of ERRATA. For information on the advisory, and where to find the updated files, follow the link below. If the solution does not work for you, open a new bug report. http://rhn.redhat.com/errata/RHBA-2013-1531.html