Bug 854216
Summary: [TOTEM] FAILED TO RECEIVE + corosync crash
Product: Red Hat Enterprise Linux 6
Reporter: Jan Pokorný [poki] <jpokorny>
Component: corosync
Assignee: Jan Friesse <jfriesse>
Status: CLOSED ERRATA
QA Contact: Cluster QE <mspqa-list>
Severity: high
Priority: high
Version: 6.3
CC: abeekhof, cphillip, dubrsl, fyjm2010, jfriesse, jkortus, jruemker, sbradley, sdake, slevine, sradvan
Target Milestone: rc
Target Release: ---
Hardware: Unspecified
OS: Unspecified
Fixed In Version: corosync-1.4.1-16.el6
Doc Type: Bug Fix
Doc Text:
Cause: Corosync is running on a faulty (or rate-limited) network.
Consequence: Message packets can be lost. By itself this is not a problem, because corosync is designed to handle lost and duplicated packets. However, if a node fails to receive a message failed_to_recv times (configurable in the configuration file), corosync logs "FAILED TO RECEIVE" and begins a new round of the membership algorithm. Very often (roughly a 75% chance) corosync then segfaults.
Fix: There is an assert inside the function that compares memberships. Because packets may be lost at this point, we can end up in a situation where the failed list is identical to the membership list. Normally this cannot happen and the behavior is unspecified, hence the assert. In the FAILED TO RECEIVE case, however, the assert can be ignored, because right after this function returns, a single-node membership is created anyway.
Result: No segfaults.
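The "failed_to_recv" threshold described above corresponds to the `fail_recv_const` option in the `totem` section of corosync.conf. A minimal sketch of how it might be tuned (the value shown is illustrative; consult corosync.conf(5) for the actual default and semantics):

```
totem {
    version: 2
    # How many times a message may go unreceived before the node
    # declares FAILED TO RECEIVE and starts a new membership round.
    fail_recv_const: 2500
}
```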
Story Points: ---
Clones: 875922 (view as bug list)
Last Closed: 2013-11-21 04:31:26 UTC
Type: Bug
Regression: ---
Mount Type: ---
Documentation: ---
Category: ---
oVirt Team: ---
Cloudforms Team: ---
Bug Blocks: 835616, 883504, 960054
Description
Jan Pokorný [poki]
2012-09-04 12:05:55 UTC
Honza, please try the patch in Bug #636583 for Jan.

Created attachment 638650 [details]
Proposed patch
If failed_to_recv is set, consensus can be empty

If failed_to_recv is set (a node detects that it is unable to receive
messages), we can end up hitting the assert, because my_failed_list and
my_member_list are the same list. This happens because we do not follow
the specification and allow a node to mark itself as failed.
Because a single-node membership is created when failed_to_recv is set
and consensus is reached across nodes (ignoring both the fail list and
the member list), we can skip the assert.
Signed-off-by: Jan Friesse <jfriesse>
*** Bug 917914 has been marked as a duplicate of this bug. ***

*** Bug 981111 has been marked as a duplicate of this bug. ***

Verified using failed-to-receive-crash.sh. On corosync-1.4.1-15.el6.x86_64 (RHEL6.4), FAILED TO RECEIVE messages are logged and (for me) in roughly 30% of the cases corosync asserts and dumps a core. On corosync-1.4.1-17.el6.x86_64 (RHEL6.5), the messages are just logged, with no core during the test period:

Sep 11 15:52:01 virt-014 corosync[22115]: [TOTEM ] FAILED TO RECEIVE
Sep 11 15:52:02 virt-014 corosync[22115]: [TOTEM ] A processor joined or left the membership and a new membership was formed.
Sep 11 15:52:02 virt-014 corosync[22115]: [CPG ] chosen downlist: sender r(0) ip(10.34.71.14) ; members(old:2 left:1)
Sep 11 15:52:02 virt-014 corosync[22115]: [MAIN ] Completed service synchronization, ready to provide service.
Sep 11 15:52:03 virt-014 corosync[22115]: [TOTEM ] A processor joined or left the membership and a new membership was formed.
Sep 11 15:52:11 virt-014 corosync[22115]: [CPG ] chosen downlist: sender r(0) ip(10.34.71.14) ; members(old:1 left:0)
Sep 11 15:52:12 virt-014 corosync[22115]: [MAIN ] Completed service synchronization, ready to provide service.

Since the problem described in this bug report should be resolved in a recent advisory, it has been closed with a resolution of ERRATA. For information on the advisory, and where to find the updated files, follow the link below. If the solution does not work for you, open a new bug report.

http://rhn.redhat.com/errata/RHBA-2013-1531.html