Bug 917914

Summary: the cluster is down because of [TOTEM ] FAILED TO RECEIVE and INFO: task clvmd:4741 blocked
Product: Red Hat Enterprise Linux 6
Component: corosync
Version: 6.4
Hardware: x86_64
OS: Linux
Status: CLOSED DUPLICATE
Severity: high
Priority: unspecified
Target Milestone: rc
Reporter: muse <fyjm2010>
Assignee: Jan Friesse <jfriesse>
QA Contact: Cluster QE <mspqa-list>
CC: ccaulfie, cluster-maint, fdinitto, rpeterso, sdake, teigland
Doc Type: Bug Fix
Type: Bug
Last Closed: 2013-03-05 10:27:18 UTC

Description muse 2013-03-05 06:25:39 UTC
Description of problem:
RHEL 6.4 is installed on a Cisco UCS B200 (NIC 1240). Before the cluster is created, the OS works well. Within 2 minutes of creating the cluster nodes from luci, the cluster service becomes unresponsive and crashes. The cluster services cannot be stopped with "service XXX stop", and a reboot is very slow and does not complete. After several hours, the cluster service management shuts down with errors.
The cluster contains only the node names. No other resources can be added because the service has crashed.

OS info 
rpm -q corosync cman rgmanager fence-agents gfs2-utils lvm2-cluster
corosync-1.4.1-15.el6.x86_64
cman-3.0.12.1-49.el6.x86_64
rgmanager-3.0.12.1-17.el6.x86_64
fence-agents-3.1.5-25.el6.x86_64
gfs2-utils-3.0.12.1-49.el6.x86_64
lvm2-cluster-2.02.98-9.el6.x86_64

uname -a
Linux 2.6.32-358.el6.x86_64 #1 SMP Tue Jan 29 11:47:41 EST 2013 x86_64 x86_64 x86_64 GNU/Linux


Error Log:
Mar  4 16:10:22 ARCUCSB2007712M corosync[4519]:   [TOTEM ] Retransmit List: 78 7a 7b 7c 7e 7f 80 82 83
Mar  4 16:10:24 ARCUCSB2007712M corosync[4519]:   [TOTEM ] Retransmit List: 78 7a 7b 7c 7e 7f 80 82 83
Mar  4 16:10:26 ARCUCSB2007712M corosync[4519]:   [TOTEM ] Retransmit List: 78 7a 7b 7c 7e 7f 80 82 83
Mar  4 16:10:28 ARCUCSB2007712M corosync[4519]:   [TOTEM ] Retransmit List: 78 7a 7b 7c 7e 7f 80 82 83
Mar  4 16:10:28 ARCUCSB2007712M corosync[4519]:   [TOTEM ] FAILED TO RECEIVE
Mar  4 16:10:30 ARCUCSB2007712M abrt[8051]: File '/usr/sbin/corosync' seems to be deleted
Mar  4 16:10:30 ARCUCSB2007712M abrt[8051]: Saved core dump of pid 4519 (/usr/sbin/corosync) to /var/spool/abrt/ccpp-2013-03-04-16:10:30-4519 (67997696 bytes)
Mar  4 16:10:30 ARCUCSB2007712M abrtd: Directory 'ccpp-2013-03-04-16:10:30-4519' creation detected
Mar  4 16:10:30 ARCUCSB2007712M fenced[4582]: cluster is down, exiting
Mar  4 16:10:30 ARCUCSB2007712M fenced[4582]: daemon cpg_dispatch error 2
Mar  4 16:10:30 ARCUCSB2007712M dlm_controld[4604]: cluster is down, exiting
Mar  4 16:10:30 ARCUCSB2007712M dlm_controld[4604]: daemon cpg_dispatch error 2
Mar  4 16:10:30 ARCUCSB2007712M dlm_controld[4604]: cpg_dispatch error 2
Mar  4 16:10:30 ARCUCSB2007712M gfs_controld[4657]: cluster is down, exiting
Mar  4 16:10:30 ARCUCSB2007712M gfs_controld[4657]: daemon cpg_dispatch error 2
Mar  4 16:10:31 ARCUCSB2007712M kernel: Bridge firewalling registered
Mar  4 16:10:36 ARCUCSB2007712M kernel: dlm: closing connection to node 2
Mar  4 16:10:36 ARCUCSB2007712M kernel: dlm: closing connection to node 1
Mar  4 16:10:36 ARCUCSB2007712M kernel: dlm: rgmanager: no userland control daemon, stopping lockspace
Mar  4 16:10:36 ARCUCSB2007712M kernel: dlm: clvmd: no userland control daemon, stopping lockspace
Mar  4 16:12:56 ARCUCSB2007712M kernel: INFO: task clvmd:4741 blocked for more than 120 seconds.
Mar  4 16:12:56 ARCUCSB2007712M kernel: "echo 0 > /proc/sys/kernel/hung_task_timeout_secs" disables this message.
Mar  4 16:12:56 ARCUCSB2007712M kernel: clvmd         D 0000000000000002     0  4741      1 0x00000080
Mar  4 16:12:56 ARCUCSB2007712M kernel: ffff881070defd68 0000000000000082 ffff881070defd30 ffff881070defd2c
Mar  4 16:12:56 ARCUCSB2007712M kernel: 0000000000000000 ffff88087fc28800 ffff880028276700 0000000000000400
Mar  4 16:12:56 ARCUCSB2007712M kernel: ffff881070c4d058 ffff881070deffd8 000000000000fb88 ffff881070c4d058
Mar  4 16:12:56 ARCUCSB2007712M kernel: Call Trace:
Mar  4 16:12:56 ARCUCSB2007712M kernel: [<ffffffff8150ed3e>] __mutex_lock_slowpath+0x13e/0x180
Mar  4 16:12:56 ARCUCSB2007712M kernel: [<ffffffff8150ebdb>] mutex_lock+0x2b/0x50
Mar  4 16:12:56 ARCUCSB2007712M kernel: [<ffffffffa027cea8>] dlm_release_lockspace+0x48/0x480 [dlm]
Mar  4 16:12:56 ARCUCSB2007712M kernel: [<ffffffffa0285f92>] device_write+0x2b2/0x720 [dlm]
Mar  4 16:12:56 ARCUCSB2007712M kernel: [<ffffffff811a1ae0>] ? mntput_no_expire+0x30/0x110
Mar  4 16:12:56 ARCUCSB2007712M kernel: [<ffffffff8121baf6>] ? security_file_permission+0x16/0x20
Mar  4 16:12:56 ARCUCSB2007712M kernel: [<ffffffff81180f98>] vfs_write+0xb8/0x1a0
Mar  4 16:12:56 ARCUCSB2007712M kernel: [<ffffffff81181891>] sys_write+0x51/0x90
Mar  4 16:12:56 ARCUCSB2007712M kernel: [<ffffffff810dc565>] ? __audit_syscall_exit+0x265/0x290
Mar  4 16:12:56 ARCUCSB2007712M kernel: [<ffffffff8100b072>] system_call_fastpath+0x16/0x1b
Mar  4 16:14:56 ARCUCSB2007712M kernel: INFO: task clvmd:4741 blocked for more than 120 seconds.
Mar  4 16:14:56 ARCUCSB2007712M kernel: "echo 0 > /proc/sys/kernel/hung_task_timeout_secs" disables this message.
Mar  4 16:14:56 ARCUCSB2007712M kernel: clvmd         D 0000000000000002     0  4741      1 0x00000080
Mar  4 16:14:56 ARCUCSB2007712M kernel: ffff881070defd68 0000000000000082 ffff881070defd30 ffff881070defd2c
Mar  4 16:14:56 ARCUCSB2007712M kernel: 0000000000000000 ffff88087fc28800 ffff880028276700 0000000000000400
Mar  4 16:14:56 ARCUCSB2007712M kernel: ffff881070c4d058 ffff881070deffd8 000000000000fb88 ffff881070c4d058
Mar  4 16:14:56 ARCUCSB2007712M kernel: Call Trace:
Mar  4 16:14:56 ARCUCSB2007712M kernel: [<ffffffff8150ed3e>] __mutex_lock_slowpath+0x13e/0x180
Mar  4 16:14:56 ARCUCSB2007712M kernel: [<ffffffff8150ebdb>] mutex_lock+0x2b/0x50
Mar  4 16:14:56 ARCUCSB2007712M kernel: [<ffffffffa027cea8>] dlm_release_lockspace+0x48/0x480 [dlm]
Mar  4 16:14:56 ARCUCSB2007712M kernel: [<ffffffffa0285f92>] device_write+0x2b2/0x720 [dlm]
Mar  4 16:14:56 ARCUCSB2007712M kernel: [<ffffffff811a1ae0>] ? mntput_no_expire+0x30/0x110
Mar  4 16:14:56 ARCUCSB2007712M kernel: [<ffffffff8121baf6>] ? security_file_permission+0x16/0x20
Mar  4 16:14:56 ARCUCSB2007712M kernel: [<ffffffff81180f98>] vfs_write+0xb8/0x1a0
Mar  4 16:14:56 ARCUCSB2007712M kernel: [<ffffffff81181891>] sys_write+0x51/0x90
Mar  4 16:14:56 ARCUCSB2007712M kernel: [<ffffffff810dc565>] ? __audit_syscall_exit+0x265/0x290
Mar  4 16:14:56 ARCUCSB2007712M kernel: [<ffffffff8100b072>] system_call_fastpath+0x16/0x1b
Mar  4 16:16:56 ARCUCSB2007712M kernel: INFO: task clvmd:4741 blocked for more than 120 seconds.
Mar  4 16:16:56 ARCUCSB2007712M kernel: "echo 0 > /proc/sys/kernel/hung_task_timeout_secs" disables this message.
Mar  4 16:16:56 ARCUCSB2007712M kernel: clvmd         D 0000000000000002     0  4741      1 0x00000080
Mar  4 16:16:56 ARCUCSB2007712M kernel: ffff881070defd68 0000000000000082 ffff881070defd30 ffff881070defd2c
Mar  4 16:16:56 ARCUCSB2007712M kernel: 0000000000000000 ffff88087fc28800 ffff880028276700 0000000000000400
Mar  4 16:16:56 ARCUCSB2007712M kernel: ffff881070c4d058 ffff881070deffd8 000000000000fb88 ffff881070c4d058
Mar  4 16:16:56 ARCUCSB2007712M kernel: Call Trace:
Mar  4 16:16:56 ARCUCSB2007712M kernel: [<ffffffff8150ed3e>] __mutex_lock_slowpath+0x13e/0x180
Mar  4 16:16:56 ARCUCSB2007712M kernel: [<ffffffff8150ebdb>] mutex_lock+0x2b/0x50
Mar  4 16:16:56 ARCUCSB2007712M kernel: [<ffffffffa027cea8>] dlm_release_lockspace+0x48/0x480 [dlm]
Mar  4 16:16:56 ARCUCSB2007712M kernel: [<ffffffffa0285f92>] device_write+0x2b2/0x720 [dlm]
Mar  4 16:16:56 ARCUCSB2007712M kernel: [<ffffffff811a1ae0>] ? mntput_no_expire+0x30/0x110
Mar  4 16:16:56 ARCUCSB2007712M kernel: [<ffffffff8121baf6>] ? security_file_permission+0x16/0x20
Mar  4 16:16:56 ARCUCSB2007712M kernel: [<ffffffff81180f98>] vfs_write+0xb8/0x1a0
Mar  4 16:16:56 ARCUCSB2007712M kernel: [<ffffffff81181891>] sys_write+0x51/0x90
Mar  4 16:16:56 ARCUCSB2007712M kernel: [<ffffffff810dc565>] ? __audit_syscall_exit+0x265/0x290
Mar  4 16:16:56 ARCUCSB2007712M kernel: [<ffffffff8100b072>] system_call_fastpath+0x16/0x1b
Mar  4 16:18:56 ARCUCSB2007712M kernel: INFO: task clvmd:4741 blocked for more than 120 seconds.
Mar  4 16:18:56 ARCUCSB2007712M kernel: "echo 0 > /proc/sys/kernel/hung_task_timeout_secs" disables this message.
Mar  4 16:18:56 ARCUCSB2007712M kernel: clvmd         D 0000000000000002     0  4741      1 0x00000080
Mar  4 16:18:56 ARCUCSB2007712M kernel: ffff881070defd68 0000000000000082 ffff881070defd30 ffff881070defd2c
Mar  4 16:18:56 ARCUCSB2007712M kernel: 0000000000000000 ffff88087fc28800 ffff880028276700 0000000000000400
Mar  4 16:18:56 ARCUCSB2007712M kernel: ffff881070c4d058 ffff881070deffd8 000000000000fb88 ffff881070c4d058
Mar  4 16:18:56 ARCUCSB2007712M kernel: Call Trace:
Mar  4 16:18:56 ARCUCSB2007712M kernel: [<ffffffff8150ed3e>] __mutex_lock_slowpath+0x13e/0x180
Mar  4 16:18:56 ARCUCSB2007712M kernel: [<ffffffff8150ebdb>] mutex_lock+0x2b/0x50
Mar  4 16:18:56 ARCUCSB2007712M kernel: [<ffffffffa027cea8>] dlm_release_lockspace+0x48/0x480 [dlm]
Mar  4 16:18:56 ARCUCSB2007712M kernel: [<ffffffffa0285f92>] device_write+0x2b2/0x720 [dlm]
Mar  4 16:18:56 ARCUCSB2007712M kernel: [<ffffffff811a1ae0>] ? mntput_no_expire+0x30/0x110
Mar  4 16:18:56 ARCUCSB2007712M kernel: [<ffffffff8121baf6>] ? security_file_permission+0x16/0x20
Mar  4 16:18:56 ARCUCSB2007712M kernel: [<ffffffff81180f98>] vfs_write+0xb8/0x1a0
Mar  4 16:18:56 ARCUCSB2007712M kernel: [<ffffffff81181891>] sys_write+0x51/0x90
Mar  4 16:18:56 ARCUCSB2007712M kernel: [<ffffffff810dc565>] ? __audit_syscall_exit+0x265/0x290
Mar  4 16:18:56 ARCUCSB2007712M kernel: [<ffffffff8100b072>] system_call_fastpath+0x16/0x1b

Comment 1 Fabio Massimo Di Nitto 2013-03-05 06:32:09 UTC
(In reply to comment #0)
> Mar  4 16:10:22 ARCUCSB2007712M corosync[4519]:   [TOTEM ] Retransmit List: 78 7a 7b 7c 7e 7f 80 82 83
> Mar  4 16:10:24 ARCUCSB2007712M corosync[4519]:   [TOTEM ] Retransmit List: 78 7a 7b 7c 7e 7f 80 82 83
> Mar  4 16:10:26 ARCUCSB2007712M corosync[4519]:   [TOTEM ] Retransmit List: 78 7a 7b 7c 7e 7f 80 82 83
> Mar  4 16:10:28 ARCUCSB2007712M corosync[4519]:   [TOTEM ] Retransmit List: 78 7a 7b 7c 7e 7f 80 82 83
> Mar  4 16:10:28 ARCUCSB2007712M corosync[4519]:   [TOTEM ] FAILED TO RECEIVE
> Mar  4 16:10:30 ARCUCSB2007712M abrt[8051]: File '/usr/sbin/corosync' seems to be deleted


This generally indicates a network (multicast) issue between the nodes.

Also, why would abrt report that corosync has been deleted?
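
For anyone triaging similar logs, the pattern in the quoted excerpt (the same Retransmit List repeated several times, then FAILED TO RECEIVE) can be picked out mechanically. The following is only a rough sketch, not a corosync tool: the function name summarize_totem and the stuck_threshold of 3 are assumptions made for illustration, and the regular expressions assume the syslog format shown in this report.

```python
import re

# Patterns assume the syslog layout seen in this report (hypothetical helper, not part of corosync).
RETRANSMIT_RE = re.compile(r"corosync\[\d+\]:\s+\[TOTEM \] Retransmit List:\s*(.*)$")
FAILED_RE = re.compile(r"corosync\[\d+\]:\s+\[TOTEM \] FAILED TO RECEIVE")


def summarize_totem(lines, stuck_threshold=3):
    """Count consecutive identical retransmit lists and note FAILED TO RECEIVE.

    stuck_threshold is an arbitrary illustrative value, not a corosync setting.
    """
    last_list = None
    repeats = 0
    max_repeats = 0
    failed = False
    for line in lines:
        match = RETRANSMIT_RE.search(line)
        if match:
            current = tuple(match.group(1).split())
            if current == last_list:
                repeats += 1
            else:
                last_list, repeats = current, 1
            max_repeats = max(max_repeats, repeats)
        elif FAILED_RE.search(line):
            failed = True
    return {
        "max_identical_retransmits": max_repeats,
        "stuck": max_repeats >= stuck_threshold,
        "failed_to_receive": failed,
    }


# Lines copied from the error log in the description.
SAMPLE = [
    "Mar  4 16:10:22 ARCUCSB2007712M corosync[4519]:   [TOTEM ] Retransmit List: 78 7a 7b 7c 7e 7f 80 82 83",
    "Mar  4 16:10:24 ARCUCSB2007712M corosync[4519]:   [TOTEM ] Retransmit List: 78 7a 7b 7c 7e 7f 80 82 83",
    "Mar  4 16:10:26 ARCUCSB2007712M corosync[4519]:   [TOTEM ] Retransmit List: 78 7a 7b 7c 7e 7f 80 82 83",
    "Mar  4 16:10:28 ARCUCSB2007712M corosync[4519]:   [TOTEM ] Retransmit List: 78 7a 7b 7c 7e 7f 80 82 83",
    "Mar  4 16:10:28 ARCUCSB2007712M corosync[4519]:   [TOTEM ] FAILED TO RECEIVE",
]

if __name__ == "__main__":
    print(summarize_totem(SAMPLE))
```

A retransmit list that never changes suggests the missing messages are not arriving at all, which is consistent with the multicast suspicion above, but the script only summarizes the log and does not test the network itself.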

Comment 2 Jan Friesse 2013-03-05 10:27:18 UTC

*** This bug has been marked as a duplicate of bug 854216 ***