Bug 1104899
| Summary: | Starting clvmd hangs with incorrect firewall rules | | |
|---|---|---|---|
| Product: | Red Hat Enterprise Linux 6 | Reporter: | Madison Kelly <mkelly> |
| Component: | cluster | Assignee: | Christine Caulfield <ccaulfie> |
| Status: | CLOSED WONTFIX | QA Contact: | cluster-qe <cluster-qe> |
| Severity: | low | Docs Contact: | |
| Priority: | unspecified | | |
| Version: | 6.5 | CC: | abeekhof, agk, ccaulfie, cluster-maint, dvossel, dwysocha, heinzm, jbrassow, msnitzer, prajnoha, prockai, rpeterso, teigland, thornber, tlavigne, zkabelac |
| Target Milestone: | rc | | |
| Target Release: | --- | | |
| Hardware: | Unspecified | | |
| OS: | Unspecified | | |
| Whiteboard: | | | |
| Fixed In Version: | | Doc Type: | Bug Fix |
| Doc Text: | | Story Points: | --- |
| Clone Of: | | Environment: | |
| Last Closed: | 2015-08-20 08:13:15 UTC | Type: | Bug |
| Regression: | --- | Mount Type: | --- |
| Documentation: | --- | CRM: | |
| Verified Versions: | | Category: | --- |
| oVirt Team: | --- | RHEL 7.3 requirements from Atomic Host: | |
| Cloudforms Team: | --- | Target Upstream Version: | |
| Embargoed: | | | |
| Attachments: | | | |
Description
Madison Kelly
2014-06-04 23:48:24 UTC
After cleanly resetting both nodes, I changed lvm.conf to:

====
[root@an-a04n01 ~]# diff -U0 /etc/lvm/lvm.conf.anvil /etc/lvm/lvm.conf
--- /etc/lvm/lvm.conf.anvil	2013-10-30 04:10:42.000000000 -0400
+++ /etc/lvm/lvm.conf	2014-06-04 19:50:14.823159186 -0400
@@ -84,0 +85 @@
+    #filter = [ "a|/dev/drbd*|", "r/.*/" ]
@@ -462 +463 @@
-    locking_type = 1
+    locking_type = 3
====

(Only 'locking_type = 3' changed.) I tried starting clvmd again, and this time it started on 'an-a04n01' but timed out (as it did in comment #1) on 'an-a04n02'...

====
[root@an-a04n01 ~]# cman_tool status
Version: 6.2.0
Config Version: 9
Cluster Name: an-anvil-04
Cluster Id: 6054
Cluster Member: Yes
Cluster Generation: 92
Membership state: Cluster-Member
Nodes: 2
Expected votes: 1
Total votes: 2
Node votes: 1
Quorum: 1
Active subsystems: 8
Flags: 2node
Ports Bound: 0
Node name: an-a04n01.alteeve.ca
Node ID: 1
Multicast addresses: 239.192.23.189
Node addresses: 10.20.40.1

[root@an-a04n01 ~]# vim /etc/lvm/lvm.conf
[root@an-a04n01 ~]# diff -U0 /etc/lvm/lvm.conf.anvil /etc/lvm/lvm.conf
--- /etc/lvm/lvm.conf.anvil	2013-10-30 04:10:42.000000000 -0400
+++ /etc/lvm/lvm.conf	2014-06-04 19:50:14.823159186 -0400
@@ -84,0 +85 @@
+    #filter = [ "a|/dev/drbd*|", "r/.*/" ]
@@ -462 +463 @@
-    locking_type = 1
+    locking_type = 3

[root@an-a04n01 ~]# rsync -av /etc/lvm/lvm.conf root@an-a04n02:/etc/lvm/
sending incremental file list
lvm.conf

sent 1758 bytes  received 421 bytes  1452.67 bytes/sec
total size is 45034  speedup is 20.67

[root@an-a04n01 ~]# /etc/init.d/clvmd status
clvmd is stopped
[root@an-a04n01 ~]# /etc/init.d/clvmd start
Starting clvmd: Activating VG(s):   No volume groups found  [  OK  ]
====

In an-a04n01's syslog:

====
Jun  4 19:50:57 an-a04n01 kernel: dlm: Using TCP for communications
Jun  4 19:50:58 an-a04n01 clvmd: Cluster LVM daemon started - connected to CMAN
Jun  4 19:51:06 an-a04n01 kernel: dlm: connecting to 2
====

I then started clvmd on 'an-a04n02' about 2~3 seconds later:

====
[root@an-a04n02 ~]# diff -U0 /etc/lvm/lvm.conf.anvil /etc/lvm/lvm.conf
--- /etc/lvm/lvm.conf.anvil	2013-10-30 04:10:42.000000000 -0400
+++ /etc/lvm/lvm.conf	2014-06-04 19:50:14.000000000 -0400
@@ -84,0 +85 @@
+    #filter = [ "a|/dev/drbd*|", "r/.*/" ]
@@ -462 +463 @@
-    locking_type = 1
+    locking_type = 3

[root@an-a04n02 ~]# /etc/init.d/clvmd status
clvmd is stopped
[root@an-a04n02 ~]# /etc/init.d/clvmd start
Starting clvmd: clvmd startup timed out
====

In the system log:

====
Jun  4 19:51:06 an-a04n02 kernel: dlm: Using TCP for communications
Jun  4 19:51:06 an-a04n02 kernel: dlm: connecting to 1
Jun  4 19:54:04 an-a04n02 kernel: INFO: task clvmd:3156 blocked for more than 120 seconds.
Jun  4 19:54:04 an-a04n02 kernel: Not tainted 2.6.32-431.17.1.el6.x86_64 #1
Jun  4 19:54:04 an-a04n02 kernel: "echo 0 > /proc/sys/kernel/hung_task_timeout_secs" disables this message.
Jun  4 19:54:04 an-a04n02 kernel: clvmd         D 0000000000000001     0  3156      1 0x00000080
Jun  4 19:54:04 an-a04n02 kernel: ffff880338a07c98 0000000000000082 0000000000000000 0000000000000000
Jun  4 19:54:04 an-a04n02 kernel: 0000000000000000 0000000000000000 0000000000000000 0000000000000000
Jun  4 19:54:04 an-a04n02 kernel: ffff880337aedaf8 ffff880338a07fd8 000000000000fbc8 ffff880337aedaf8
Jun  4 19:54:04 an-a04n02 kernel: Call Trace:
Jun  4 19:54:04 an-a04n02 kernel: [<ffffffff81528a95>] schedule_timeout+0x215/0x2e0
Jun  4 19:54:04 an-a04n02 kernel: [<ffffffff81527bfe>] ? thread_return+0x4e/0x760
Jun  4 19:54:04 an-a04n02 kernel: [<ffffffff81285172>] ? kobject_uevent_env+0x202/0x620
Jun  4 19:54:04 an-a04n02 kernel: [<ffffffff81528713>] wait_for_common+0x123/0x180
Jun  4 19:54:04 an-a04n02 kernel: [<ffffffff81061d00>] ? default_wake_function+0x0/0x20
Jun  4 19:54:04 an-a04n02 kernel: [<ffffffff8152882d>] wait_for_completion+0x1d/0x20
Jun  4 19:54:04 an-a04n02 kernel: [<ffffffffa032af79>] dlm_new_lockspace+0x999/0xa30 [dlm]
Jun  4 19:54:04 an-a04n02 kernel: [<ffffffffa0332ff1>] device_write+0x311/0x720 [dlm]
Jun  4 19:54:04 an-a04n02 kernel: [<ffffffff81061d00>] ? default_wake_function+0x0/0x20
Jun  4 19:54:04 an-a04n02 kernel: [<ffffffff8123316b>] ? selinux_file_permission+0xfb/0x150
Jun  4 19:54:04 an-a04n02 kernel: [<ffffffff81226056>] ? security_file_permission+0x16/0x20
Jun  4 19:54:04 an-a04n02 kernel: [<ffffffff81188c38>] vfs_write+0xb8/0x1a0
Jun  4 19:54:04 an-a04n02 kernel: [<ffffffff81189531>] sys_write+0x51/0x90
Jun  4 19:54:04 an-a04n02 kernel: [<ffffffff810e1abe>] ? __audit_syscall_exit+0x25e/0x290
Jun  4 19:54:04 an-a04n02 kernel: [<ffffffff8100b072>] system_call_fastpath+0x16/0x1b
====

Running 'pvscan' on 'an-a04n01' shows this:

====
[root@an-a04n01 ~]# pvscan
  clvmd not running on node an-a04n02.alteeve.ca
  Unable to obtain global lock.
====

Trying to stop clvmd on an-a04n02:

====
[root@an-a04n02 ~]# /etc/init.d/clvmd stop
Signaling clvmd to exit                                    [  OK  ]
Waiting for clvmd to exit:                                 [FAILED]
clvmd failed to exit                                       [FAILED]
[root@an-a04n02 ~]# echo $?
1
====

Nothing further was written to syslog (though the 120-second timeout error is still being printed). Trying to stop pacemaker on an-a04n02 also fails:

====
[root@an-a04n02 ~]# /etc/init.d/pacemaker stop
Waiting for shutdown of managed resources.                 [  OK  ]
DLM lockspace still in use
====

(The shell doesn't return, but it does return on ^C.) The system log shows basically the same thing:

====
Jun  4 19:58:46 an-a04n02 attrd[3075]:   notice: attrd_trigger_update: Sending flush op to all hosts for: standby (true)
Jun  4 19:58:46 an-a04n02 attrd[3075]:   notice: attrd_perform_update: Sent update 10: standby=true
Jun  4 19:58:46 an-a04n02 pacemaker: Waiting for shutdown of managed resources
Jun  4 19:58:46 an-a04n02 crmd[3077]:   notice: process_lrm_event: LRM operation fence_n02_ipmi_stop_0 (call=20, rc=0, cib-update=14, confirmed=true) ok
Jun  4 19:58:48 an-a04n02 pacemaker: DLM lockspace still in use
====

If I crash an-a04n02 with 'echo c > /proc/sysrq-trigger', it gets fenced properly (fence_ipmilan, in case it matters). Then 'pvscan' on 'an-a04n01' works:

====
[root@an-a04n01 ~]# pvscan
  No matching physical volumes found
====

As per Bob's comments in #linux-cluster, here is the group_tool output from 'an-a04n01' with clvmd started and 'an-a04n02' not yet back in the cluster:

====
[root@an-a04n01 ~]# group_tool status
fence domain
member count  1
victim count  0
victim now    0
master nodeid 1
wait state    none
members       1

dlm lockspaces
name          clvmd
id            0x4104eefa
flags         0x00000000
change        member 1 joined 0 remove 1 failed 1 seq 3,3
members       1

[root@an-a04n01 ~]# group_tool dump
[root@an-a04n01 ~]# group_tool dump dlm
dump dlm
1401924707 logging mode 3 syslog f 160 p 6 logfile p 6 /var/log/cluster/dlm_controld.log
1401924707 dlm_controld 3.0.12.1 started
1401924707 logging mode 3 syslog f 160 p 6 logfile p 6 /var/log/cluster/dlm_controld.log
1401924707 found /dev/misc/dlm-control minor 57
1401924707 found /dev/misc/dlm-monitor minor 56
1401924707 found /dev/misc/dlm_plock minor 55
1401924707 /dev/misc/dlm-monitor fd 12
1401924707 /sys/kernel/config/dlm/cluster/comms: opendir failed: 2
1401924707 /sys/kernel/config/dlm/cluster/spaces: opendir failed: 2
1401924707 cluster node 1 added seq 84
1401924707 set_configfs_node 1 10.20.40.1 local 1
1401924707 cluster node 2 added seq 84
1401924707 set_configfs_node 2 10.20.40.2 local 0
1401924707 totem/rrp_mode = 'none'
1401924707 set protocol 0
1401924707 group_mode 3 compat 0
1401924707 setup_cpg_daemon 15
1401924707 dlm:controld conf 2 1 0 memb 1 2 join 1 left
1401924707 run protocol from nodeid 2
1401924707 daemon run 1.1.1 max 1.1.1 kernel run 1.1.1 max 1.1.1
1401924707 plocks 18
1401924707 plock cpg message size: 104 bytes
1401924709 client connection 5 fd 19
1401924740 cluster node 2 removed seq 88
1401924740 del_configfs_node rmdir "/sys/kernel/config/dlm/cluster/comms/2"
1401924740 dlm:controld conf 1 0 1 memb 1 join left 2
1401924891 cluster node 2 added seq 92
1401924891 set_configfs_node 2 10.20.40.2 local 0
1401924895 dlm:controld conf 2 1 0 memb 1 2 join 2 left
1401925857 uevent: add@/kernel/dlm/clvmd
1401925857 kernel: add@ clvmd
1401925857 uevent: online@/kernel/dlm/clvmd
1401925857 kernel: online@ clvmd
1401925857 dlm:ls:clvmd conf 1 1 0 memb 1 join 1 left
1401925857 clvmd add_change cg 1 joined nodeid 1
1401925857 clvmd add_change cg 1 we joined
1401925857 clvmd add_change cg 1 counts member 1 joined 1 remove 0 failed 0
1401925857 clvmd check_fencing done
1401925857 clvmd check_quorum disabled
1401925857 clvmd check_fs none registered
1401925857 clvmd send_start cg 1 flags 1 data2 0 counts 0 1 1 0 0
1401925857 clvmd receive_start 1:1 len 76
1401925857 clvmd match_change 1:1 matches cg 1
1401925857 clvmd wait_messages cg 1 got all 1
1401925857 clvmd start_kernel cg 1 member_count 1
1401925857 write "1090842362" to "/sys/kernel/dlm/clvmd/id"
1401925857 set_members mkdir "/sys/kernel/config/dlm/cluster/spaces/clvmd/nodes/1"
1401925857 write "1" to "/sys/kernel/dlm/clvmd/control"
1401925857 write "0" to "/sys/kernel/dlm/clvmd/event_done"
1401925857 uevent: add@/devices/virtual/misc/dlm_clvmd
1401925866 dlm:ls:clvmd conf 2 1 0 memb 1 2 join 2 left
1401925866 clvmd add_change cg 2 joined nodeid 2
1401925866 clvmd add_change cg 2 counts member 2 joined 1 remove 0 failed 0
1401925866 clvmd stop_kernel cg 2
1401925866 write "0" to "/sys/kernel/dlm/clvmd/control"
1401925866 clvmd check_fencing done
1401925866 clvmd check_quorum disabled
1401925866 clvmd check_fs none registered
1401925866 clvmd send_start cg 2 flags 2 data2 0 counts 1 2 1 0 0
1401925866 clvmd receive_start 2:1 len 80
1401925866 clvmd match_change 2:1 matches cg 2
1401925866 clvmd wait_messages cg 2 need 1 of 2
1401925866 clvmd receive_start 1:2 len 80
1401925866 clvmd match_change 1:2 matches cg 2
1401925866 clvmd wait_messages cg 2 got all 2
1401925866 clvmd start_kernel cg 2 member_count 2
1401925866 dir_member 1
1401925866 set_members mkdir "/sys/kernel/config/dlm/cluster/spaces/clvmd/nodes/2"
1401925866 write "1" to "/sys/kernel/dlm/clvmd/control"
1401925866 clvmd set_plock_ckpt_node from 1 to 1
1401925866 clvmd unlink ckpt 0
1401925866 clvmd store_plocks r_count 0 p_count 0 total_size 0 max_section_size 0
1401925866 clvmd store_plocks open ckpt handle 6a2342ec00000000
1401925866 clvmd store_plocks first 0 last 0 r_count 0 p_count 0 sig 0
1401925866 clvmd send_plocks_stored cg 2 flags a data2 0 counts 1 2 1 0 0
1401925866 clvmd receive_plocks_stored 1:2 flags a sig 0 need_plocks 0
1401926434 cluster node 2 removed seq 96
1401926434 del_configfs_node rmdir "/sys/kernel/config/dlm/cluster/comms/2"
1401926434 dlm:controld conf 1 0 1 memb 1 join left 2
1401926434 dlm:ls:clvmd conf 1 0 1 memb 1 join left 2
1401926434 clvmd add_change cg 3 remove nodeid 2 reason 3
1401926434 clvmd add_change cg 3 counts member 1 joined 0 remove 1 failed 1
1401926434 clvmd stop_kernel cg 3
1401926434 write "0" to "/sys/kernel/dlm/clvmd/control"
1401926434 clvmd check_fencing 2 wait add 1401925866 fail 1401926434 last 1401924759
1401926451 clvmd check_fencing 2 done add 1401925866 fail 1401926434 last 1401926451
1401926451 clvmd check_fencing done
1401926451 clvmd check_quorum disabled
1401926451 clvmd check_fs none registered
1401926451 clvmd send_start cg 3 flags 2 data2 0 counts 2 1 0 1 1
1401926451 clvmd receive_start 1:3 len 76
1401926451 clvmd match_change 1:3 matches cg 3
1401926451 clvmd wait_messages cg 3 got all 1
1401926451 clvmd start_kernel cg 3 member_count 1
1401926451 dir_member 2
1401926451 dir_member 1
1401926451 set_members rmdir "/sys/kernel/config/dlm/cluster/spaces/clvmd/nodes/2"
1401926451 write "1" to "/sys/kernel/dlm/clvmd/control"
====

After starting 'an-a04n02':

====
[root@an-a04n01 ~]# pcs status
Cluster name: an-anvil-04
Last updated: Wed Jun  4 20:08:03 2014
Last change: Wed May 28 23:56:25 2014 via cibadmin on an-a04n01.alteeve.ca
Stack: cman
Current DC: an-a04n01.alteeve.ca - partition with quorum
Version: 1.1.10-14.el6_5.3-368c726
2 Nodes configured
2 Resources configured

Online: [ an-a04n01.alteeve.ca an-a04n02.alteeve.ca ]

Full list of resources:

 fence_n01_ipmi (stonith:fence_ipmilan): Started an-a04n01.alteeve.ca
 fence_n02_ipmi (stonith:fence_ipmilan): Started an-a04n02.alteeve.ca

[root@an-a04n01 ~]# cman_tool status
Version: 6.2.0
Config Version: 9
Cluster Name: an-anvil-04
Cluster Id: 6054
Cluster Member: Yes
Cluster Generation: 100
Membership state: Cluster-Member
Nodes: 2
Expected votes: 1
Total votes: 2
Node votes: 1
Quorum: 1
Active subsystems: 9
Flags: 2node
Ports Bound: 0 11
Node name: an-a04n01.alteeve.ca
Node ID: 1
Multicast addresses: 239.192.23.189
Node addresses: 10.20.40.1

[root@an-a04n01 ~]# group_tool status
fence domain
member count  2
victim count  0
victim now    0
master nodeid 1
wait state    none
members       1 2

dlm lockspaces
name          clvmd
id            0x4104eefa
flags         0x00000000
change        member 1 joined 0 remove 1 failed 1 seq 3,3
members       1

[root@an-a04n01 ~]# group_tool dump
[root@an-a04n01 ~]# group_tool dump dlm
dump dlm
[ ...entries 1401924707 through 1401926451 identical to the previous an-a04n01 dump... ]
1401926860 cluster node 2 added seq 100
1401926860 set_configfs_node 2 10.20.40.2 local 0
1401926864 dlm:controld conf 2 1 0 memb 1 2 join 2 left
====

Same commands from an-a04n02:

====
[root@an-a04n02 ~]# pcs status
Cluster name: an-anvil-04
Last updated: Wed Jun  4 20:10:15 2014
Last change: Wed May 28 23:56:25 2014 via cibadmin on an-a04n01.alteeve.ca
Stack: cman
Current DC: an-a04n01.alteeve.ca - partition with quorum
Version: 1.1.10-14.el6_5.3-368c726
2 Nodes configured
2 Resources configured

Online: [ an-a04n01.alteeve.ca an-a04n02.alteeve.ca ]

Full list of resources:

 fence_n01_ipmi (stonith:fence_ipmilan): Started an-a04n01.alteeve.ca
 fence_n02_ipmi (stonith:fence_ipmilan): Started an-a04n02.alteeve.ca

[root@an-a04n02 ~]# cman_tool status
Version: 6.2.0
Config Version: 9
Cluster Name: an-anvil-04
Cluster Id: 6054
Cluster Member: Yes
Cluster Generation: 100
Membership state: Cluster-Member
Nodes: 2
Expected votes: 1
Total votes: 2
Node votes: 1
Quorum: 1
Active subsystems: 8
Flags: 2node
Ports Bound: 0
Node name: an-a04n02.alteeve.ca
Node ID: 2
Multicast addresses: 239.192.23.189
Node addresses: 10.20.40.2

[root@an-a04n02 ~]# group_tool status
fence domain
member count  2
victim count  0
victim now    0
master nodeid 1
wait state    none
members       1 2

[root@an-a04n02 ~]# group_tool dump
[root@an-a04n02 ~]# group_tool dump dlm
dump dlm
1401926863 logging mode 3 syslog f 160 p 6 logfile p 6 /var/log/cluster/dlm_controld.log
1401926863 dlm_controld 3.0.12.1 started
1401926864 logging mode 3 syslog f 160 p 6 logfile p 6 /var/log/cluster/dlm_controld.log
1401926864 found /dev/misc/dlm-control minor 57
1401926864 found /dev/misc/dlm-monitor minor 56
1401926864 found /dev/misc/dlm_plock minor 55
1401926864 /dev/misc/dlm-monitor fd 12
1401926864 /sys/kernel/config/dlm/cluster/comms: opendir failed: 2
1401926864 /sys/kernel/config/dlm/cluster/spaces: opendir failed: 2
1401926864 cluster node 1 added seq 100
1401926864 set_configfs_node 1 10.20.40.1 local 0
1401926864 cluster node 2 added seq 100
1401926864 set_configfs_node 2 10.20.40.2 local 1
1401926864 totem/rrp_mode = 'none'
1401926864 set protocol 0
1401926864 group_mode 3 compat 0
1401926864 setup_cpg_daemon 15
1401926864 dlm:controld conf 2 1 0 memb 1 2 join 2 left
1401926864 run protocol from nodeid 1
1401926864 daemon run 1.1.1 max 1.1.1 kernel run 1.1.1 max 1.1.1
1401926864 plocks 18
1401926864 plock cpg message size: 104 bytes
1401926865 client connection 5 fd 19
====

Now starting 'clvmd' on an-a04n02:

====
[root@an-a04n02 ~]# /etc/init.d/clvmd start
Starting clvmd: clvmd startup timed out
====

Same group_tool calls from an-a04n01:

====
[root@an-a04n01 ~]# group_tool status
fence domain
member count  2
victim count  0
victim now    0
master nodeid 1
wait state    none
members       1 2

dlm lockspaces
name          clvmd
id            0x4104eefa
flags         0x00000000
change        member 2 joined 1 remove 0 failed 0 seq 4,4
members       1 2

[root@an-a04n01 ~]# group_tool dump
[root@an-a04n01 ~]# group_tool dump dlm
dump dlm
[ ...entries 1401924707 through 1401926864 identical to the previous an-a04n01 dump... ]
1401927111 dlm:ls:clvmd conf 2 1 0 memb 1 2 join 2 left
1401927111 clvmd add_change cg 4 joined nodeid 2
1401927111 clvmd add_change cg 4 counts member 2 joined 1 remove 0 failed 0
1401927111 clvmd stop_kernel cg 4
1401927111 write "0" to "/sys/kernel/dlm/clvmd/control"
1401927111 clvmd check_fencing done
1401927111 clvmd check_quorum disabled
1401927111 clvmd check_fs none registered
1401927111 clvmd send_start cg 4 flags 2 data2 0 counts 3 2 1 0 0
1401927111 clvmd receive_start 2:1 len 80
1401927111 clvmd match_change 2:1 matches cg 4
1401927111 clvmd wait_messages cg 4 need 1 of 2
1401927111 clvmd receive_start 1:4 len 80
1401927111 clvmd match_change 1:4 matches cg 4
1401927111 clvmd wait_messages cg 4 got all 2
1401927111 clvmd start_kernel cg 4 member_count 2
1401927111 dir_member 1
1401927111 set_members mkdir "/sys/kernel/config/dlm/cluster/spaces/clvmd/nodes/2"
1401927111 write "1" to "/sys/kernel/dlm/clvmd/control"
1401927111 clvmd set_plock_ckpt_node from 1 to 1
1401927111 clvmd store_plocks saved ckpt uptodate
1401927111 clvmd store_plocks first 0 last 0 r_count 0 p_count 0 sig 0
1401927111 clvmd send_plocks_stored cg 4 flags a data2 0 counts 3 2 1 0 0
1401927111 clvmd receive_plocks_stored 1:4 flags a sig 0 need_plocks 0
====

and an-a04n02:

====
[root@an-a04n02 ~]# group_tool status
fence domain
member count  2
victim count  0
victim now    0
master nodeid 1
wait state    none
members       1 2

dlm lockspaces
name          clvmd
id            0x4104eefa
flags         0x00000000
change        member 2 joined 1 remove 0 failed 0 seq 1,1
members       1 2

[root@an-a04n02 ~]# group_tool dump
[root@an-a04n02 ~]# group_tool dump dlm
dump dlm
[ ...entries 1401926863 through 1401926865 identical to the previous an-a04n02 dump... ]
1401927111 uevent: add@/kernel/dlm/clvmd
1401927111 kernel: add@ clvmd
1401927111 uevent: online@/kernel/dlm/clvmd
1401927111 kernel: online@ clvmd
1401927111 dlm:ls:clvmd conf 2 1 0 memb 1 2 join 2 left
1401927111 clvmd add_change cg 1 joined nodeid 2
1401927111 clvmd add_change cg 1 we joined
1401927111 clvmd add_change cg 1 counts member 2 joined 1 remove 0 failed 0
1401927111 clvmd check_fencing done
1401927111 clvmd check_quorum disabled
1401927111 clvmd check_fs none registered
1401927111 clvmd send_start cg 1 flags 1 data2 0 counts 0 2 1 0 0
1401927111 clvmd receive_start 2:1 len 80
1401927111 clvmd match_change 2:1 matches cg 1
1401927111 clvmd wait_messages cg 1 need 1 of 2
1401927111 clvmd receive_start 1:4 len 80
1401927111 clvmd match_change 1:4 matches cg 1
1401927111 clvmd wait_messages cg 1 got all 2
1401927111 clvmd start_kernel cg 1 member_count 2
1401927111 write "1090842362" to "/sys/kernel/dlm/clvmd/id"
1401927111 set_members mkdir "/sys/kernel/config/dlm/cluster/spaces/clvmd/nodes/1"
1401927111 set_members mkdir "/sys/kernel/config/dlm/cluster/spaces/clvmd/nodes/2"
1401927111 write "1" to "/sys/kernel/dlm/clvmd/control"
1401927111 write "0" to "/sys/kernel/dlm/clvmd/event_done"
1401927111 clvmd set_plock_ckpt_node from 0 to 1
1401927111 clvmd receive_plocks_stored 1:4 flags a sig 0 need_plocks 1
1401927111 clvmd match_change 1:4 matches cg 1
1401927111 clvmd retrieve_plocks
1401927111 clvmd retrieve_plocks first 0 last 0 r_count 0 p_count 0 sig 0
====

Here are the cluster configs:
====
<cluster config_version="9" name="an-anvil-04">
  <fence_daemon/>
  <clusternodes>
    <clusternode name="an-a04n01.alteeve.ca" nodeid="1">
      <fence>
        <method name="pcmk-redirect">
          <device name="pcmk" port="an-a04n01.alteeve.ca"/>
        </method>
      </fence>
    </clusternode>
    <clusternode name="an-a04n02.alteeve.ca" nodeid="2">
      <fence>
        <method name="pcmk-redirect">
          <device name="pcmk" port="an-a04n02.alteeve.ca"/>
        </method>
      </fence>
    </clusternode>
  </clusternodes>
  <cman expected_votes="1" two_node="1"/>
  <fencedevices>
    <fencedevice agent="fence_pcmk" name="pcmk"/>
  </fencedevices>
  <rm>
    <failoverdomains/>
    <resources/>
  </rm>
</cluster>
====
====
[root@an-a04n01 ~]# pcs config
Cluster Name: an-anvil-04
Corosync Nodes:
Pacemaker Nodes:
an-a04n01.alteeve.ca an-a04n02.alteeve.ca
Resources:
Stonith Devices:
Resource: fence_n01_ipmi (class=stonith type=fence_ipmilan)
Attributes: pcmk_host_list=an-a04n01.alteeve.ca ipaddr=an-a04n01.ipmi action=reboot login=admin passwd=Initial1 delay=15
Operations: monitor interval=60s (fence_n01_ipmi-monitor-interval-60s)
Resource: fence_n02_ipmi (class=stonith type=fence_ipmilan)
Attributes: pcmk_host_list=an-a04n02.alteeve.ca ipaddr=an-a04n02.ipmi action=reboot login=admin passwd=Initial1
Operations: monitor interval=60s (fence_n02_ipmi-monitor-interval-60s)
Fencing Levels:
Location Constraints:
Ordering Constraints:
Colocation Constraints:
Cluster Properties:
cluster-infrastructure: cman
dc-version: 1.1.10-14.el6_5.3-368c726
expected-quorum-votes: 2
stonith-enabled: true
====
Created attachment 902376 [details]
crm_report from the clean start of pacemaker
This report starts with the first log entry from starting pacemaker (with chkconfig clvmd off, as enabling it caused pacemaker to hang on start).
Created attachment 902377 [details]
crm_report after starting clvmd and it hanging
This report starts where the last attachment ends. It covers the failed attempt to start clvmd, and ends after both nodes printed a 'hung process' timeout message to syslog.
I can't reproduce this on my cluster here - but it looks to be in the DLM new lockspace code somewhere. Can you get a sysrq-t from the machine after it's hung so I can see what sort of state the DLM processes are in, please?

I changed cluster.conf to:
====
<cluster config_version="10" name="an-anvil-04">
<fence_daemon/>
<clusternodes>
<clusternode name="an-a04n01.alteeve.ca" nodeid="1">
<fence>
<method name="pcmk-redirect">
<device name="pcmk" port="an-a04n01.alteeve.ca"/>
</method>
</fence>
</clusternode>
<clusternode name="an-a04n02.alteeve.ca" nodeid="2">
<fence>
<method name="pcmk-redirect">
<device name="pcmk" port="an-a04n02.alteeve.ca"/>
</method>
</fence>
</clusternode>
</clusternodes>
<cman expected_votes="1" two_node="1"/>
<fencedevices>
<fencedevice agent="fence_pcmk" name="pcmk"/>
</fencedevices>
<logging>
<logging_daemon name="dlm_controld" debug="on"/>
<logging_daemon name="groupd" debug="on"/>
<logging_daemon name="fenced" debug="on"/>
</logging>
<rm>
<failoverdomains/>
<resources/>
</rm>
</cluster>
====
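For reference, pushing an updated cluster.conf out on the RHEL 6 cman stack typically looks like the following sketch (hypothetical session; exact steps depend on whether ricci/ccs is in use):

```shell
# Sanity-check the edited /etc/cluster/cluster.conf against the schema first
ccs_config_validate

# Propagate the new config_version to all cluster members
cman_tool version -r
```

`cman_tool version -r` only works once the bumped config_version is higher than the running one, which is why the config above was changed from "9" to "10".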
I pushed it out and enabled clvmd to start on boot so that pacemaker would start it. Then I started pacemaker on both nodes:
====
[root@an-a04n01 ~]# /etc/init.d/pacemaker start
Starting cluster:
Checking if cluster has been disabled at boot... [ OK ]
Checking Network Manager... [ OK ]
Global setup... [ OK ]
Loading kernel modules... [ OK ]
Mounting configfs... [ OK ]
Starting cman... [ OK ]
Waiting for quorum... [ OK ]
Starting fenced... [ OK ]
Starting dlm_controld... [ OK ]
Tuning DLM kernel config... [ OK ]
Starting gfs_controld... [ OK ]
Unfencing self... [ OK ]
Joining fence domain... [ OK ]
Starting clvmd:
Activating VG(s): No volume groups found
[ OK ]
Starting Pacemaker Cluster Manager [ OK ]
====
Jun 5 13:55:10 an-a04n01 kernel: DLM (built Apr 11 2014 17:28:07) installed
Jun 5 13:55:10 an-a04n01 corosync[2885]: [MAIN ] Corosync Cluster Engine ('1.4.1'): started and ready to provide service.
Jun 5 13:55:10 an-a04n01 corosync[2885]: [MAIN ] Corosync built-in features: nss dbus rdma snmp
Jun 5 13:55:10 an-a04n01 corosync[2885]: [MAIN ] Successfully read config from /etc/cluster/cluster.conf
Jun 5 13:55:10 an-a04n01 corosync[2885]: [MAIN ] Successfully parsed cman config
Jun 5 13:55:10 an-a04n01 corosync[2885]: [TOTEM ] Initializing transport (UDP/IP Multicast).
Jun 5 13:55:10 an-a04n01 corosync[2885]: [TOTEM ] Initializing transmit/receive security: libtomcrypt SOBER128/SHA1HMAC (mode 0).
Jun 5 13:55:10 an-a04n01 corosync[2885]: [TOTEM ] The network interface [10.20.40.1] is now up.
Jun 5 13:55:11 an-a04n01 corosync[2885]: [QUORUM] Using quorum provider quorum_cman
Jun 5 13:55:11 an-a04n01 corosync[2885]: [SERV ] Service engine loaded: corosync cluster quorum service v0.1
Jun 5 13:55:11 an-a04n01 corosync[2885]: [CMAN ] CMAN 3.0.12.1 (built Apr 3 2014 05:12:26) started
Jun 5 13:55:11 an-a04n01 corosync[2885]: [SERV ] Service engine loaded: corosync CMAN membership service 2.90
Jun 5 13:55:11 an-a04n01 corosync[2885]: [SERV ] Service engine loaded: openais checkpoint service B.01.01
Jun 5 13:55:11 an-a04n01 corosync[2885]: [SERV ] Service engine loaded: corosync extended virtual synchrony service
Jun 5 13:55:11 an-a04n01 corosync[2885]: [SERV ] Service engine loaded: corosync configuration service
Jun 5 13:55:11 an-a04n01 corosync[2885]: [SERV ] Service engine loaded: corosync cluster closed process group service v1.01
Jun 5 13:55:11 an-a04n01 corosync[2885]: [SERV ] Service engine loaded: corosync cluster config database access v1.01
Jun 5 13:55:11 an-a04n01 corosync[2885]: [SERV ] Service engine loaded: corosync profile loading service
Jun 5 13:55:11 an-a04n01 corosync[2885]: [QUORUM] Using quorum provider quorum_cman
Jun 5 13:55:11 an-a04n01 corosync[2885]: [SERV ] Service engine loaded: corosync cluster quorum service v0.1
Jun 5 13:55:11 an-a04n01 corosync[2885]: [MAIN ] Compatibility mode set to whitetank. Using V1 and V2 of the synchronization engine.
Jun 5 13:55:11 an-a04n01 corosync[2885]: [TOTEM ] A processor joined or left the membership and a new membership was formed.
Jun 5 13:55:11 an-a04n01 corosync[2885]: [CMAN ] quorum regained, resuming activity
Jun 5 13:55:11 an-a04n01 corosync[2885]: [QUORUM] This node is within the primary component and will provide service.
Jun 5 13:55:11 an-a04n01 corosync[2885]: [QUORUM] Members[1]: 1
Jun 5 13:55:11 an-a04n01 corosync[2885]: [QUORUM] Members[1]: 1
Jun 5 13:55:11 an-a04n01 corosync[2885]: [CPG ] chosen downlist: sender r(0) ip(10.20.40.1) ; members(old:0 left:0)
Jun 5 13:55:11 an-a04n01 corosync[2885]: [MAIN ] Completed service synchronization, ready to provide service.
Jun 5 13:55:12 an-a04n01 corosync[2885]: [TOTEM ] A processor joined or left the membership and a new membership was formed.
Jun 5 13:55:12 an-a04n01 corosync[2885]: [QUORUM] Members[2]: 1 2
Jun 5 13:55:12 an-a04n01 corosync[2885]: [QUORUM] Members[2]: 1 2
Jun 5 13:55:12 an-a04n01 corosync[2885]: [CPG ] chosen downlist: sender r(0) ip(10.20.40.1) ; members(old:1 left:0)
Jun 5 13:55:12 an-a04n01 corosync[2885]: [MAIN ] Completed service synchronization, ready to provide service.
Jun 5 13:55:14 an-a04n01 fenced[2940]: fenced 3.0.12.1 started
Jun 5 13:55:14 an-a04n01 dlm_controld[2964]: dlm_controld 3.0.12.1 started
Jun 5 13:55:16 an-a04n01 gfs_controld[3011]: gfs_controld 3.0.12.1 started
Jun 5 13:55:17 an-a04n01 pacemaker: Attempting to start clvmd
Jun 5 13:55:18 an-a04n01 kernel: dlm: Using TCP for communications
Jun 5 13:55:19 an-a04n01 clvmd: Cluster LVM daemon started - connected to CMAN
Jun 5 13:55:19 an-a04n01 kernel: dlm: connecting to 2
Jun 5 13:55:19 an-a04n01 pacemaker: Starting Pacemaker Cluster Manager
Jun 5 13:55:19 an-a04n01 pacemakerd[3141]: notice: crm_add_logfile: Additional logging available in /var/log/cluster/corosync.log
Jun 5 13:55:19 an-a04n01 pacemakerd[3141]: notice: main: Starting Pacemaker 1.1.10-14.el6_5.3 (Build: 368c726): generated-manpages agent-manpages ascii-docs publican-docs ncurses libqb-logging libqb-ipc nagios corosync-plugin cman
Jun 5 13:55:19 an-a04n01 pengine[3151]: notice: crm_add_logfile: Additional logging available in /var/log/cluster/corosync.log
Jun 5 13:55:19 an-a04n01 attrd[3150]: notice: crm_add_logfile: Additional logging available in /var/log/cluster/corosync.log
Jun 5 13:55:19 an-a04n01 attrd[3150]: notice: crm_cluster_connect: Connecting to cluster infrastructure: cman
Jun 5 13:55:19 an-a04n01 cib[3147]: notice: crm_add_logfile: Additional logging available in /var/log/cluster/corosync.log
Jun 5 13:55:19 an-a04n01 stonith-ng[3148]: notice: crm_add_logfile: Additional logging available in /var/log/cluster/corosync.log
Jun 5 13:55:19 an-a04n01 stonith-ng[3148]: notice: crm_cluster_connect: Connecting to cluster infrastructure: cman
Jun 5 13:55:19 an-a04n01 attrd[3150]: notice: main: Starting mainloop...
Jun 5 13:55:20 an-a04n01 lrmd[3149]: notice: crm_add_logfile: Additional logging available in /var/log/cluster/corosync.log
Jun 5 13:55:20 an-a04n01 crmd[3152]: notice: crm_add_logfile: Additional logging available in /var/log/cluster/corosync.log
Jun 5 13:55:20 an-a04n01 crmd[3152]: notice: main: CRM Git Version: 368c726
Jun 5 13:55:20 an-a04n01 cib[3147]: notice: crm_cluster_connect: Connecting to cluster infrastructure: cman
Jun 5 13:55:20 an-a04n01 stonith-ng[3148]: notice: setup_cib: Watching for stonith topology changes
Jun 5 13:55:21 an-a04n01 crmd[3152]: notice: crm_cluster_connect: Connecting to cluster infrastructure: cman
Jun 5 13:55:21 an-a04n01 crmd[3152]: notice: cman_event_callback: Membership 132: quorum acquired
Jun 5 13:55:21 an-a04n01 crmd[3152]: notice: crm_update_peer_state: cman_event_callback: Node an-a04n01.alteeve.ca[1] - state is now member (was (null))
Jun 5 13:55:21 an-a04n01 crmd[3152]: notice: crm_update_peer_state: cman_event_callback: Node an-a04n02.alteeve.ca[2] - state is now member (was (null))
Jun 5 13:55:21 an-a04n01 crmd[3152]: notice: do_started: The local CRM is operational
Jun 5 13:55:21 an-a04n01 crmd[3152]: notice: do_state_transition: State transition S_STARTING -> S_PENDING [ input=I_PENDING cause=C_FSA_INTERNAL origin=do_started ]
Jun 5 13:55:21 an-a04n01 stonith-ng[3148]: notice: stonith_device_register: Added 'fence_n01_ipmi' to the device list (1 active devices)
Jun 5 13:55:22 an-a04n01 stonith-ng[3148]: notice: stonith_device_register: Added 'fence_n02_ipmi' to the device list (2 active devices)
Jun 5 13:55:42 an-a04n01 crmd[3152]: warning: do_log: FSA: Input I_DC_TIMEOUT from crm_timer_popped() received in state S_PENDING
Jun 5 13:55:42 an-a04n01 crmd[3152]: notice: do_state_transition: State transition S_ELECTION -> S_INTEGRATION [ input=I_ELECTION_DC cause=C_FSA_INTERNAL origin=do_election_check ]
Jun 5 13:55:42 an-a04n01 attrd[3150]: notice: attrd_local_callback: Sending full refresh (origin=crmd)
Jun 5 13:55:43 an-a04n01 pengine[3151]: warning: custom_action: Action fence_n01_ipmi_monitor_0 on an-a04n02.alteeve.ca is unrunnable (pending)
Jun 5 13:55:43 an-a04n01 pengine[3151]: warning: custom_action: Action fence_n02_ipmi_monitor_0 on an-a04n02.alteeve.ca is unrunnable (pending)
Jun 5 13:55:43 an-a04n01 pengine[3151]: notice: LogActions: Start fence_n01_ipmi#011(an-a04n01.alteeve.ca)
Jun 5 13:55:43 an-a04n01 pengine[3151]: notice: LogActions: Start fence_n02_ipmi#011(an-a04n01.alteeve.ca)
Jun 5 13:55:43 an-a04n01 pengine[3151]: notice: process_pe_message: Calculated Transition 0: /var/lib/pacemaker/pengine/pe-input-27.bz2
Jun 5 13:55:43 an-a04n01 crmd[3152]: notice: te_rsc_command: Initiating action 4: monitor fence_n01_ipmi_monitor_0 on an-a04n01.alteeve.ca (local)
Jun 5 13:55:43 an-a04n01 crmd[3152]: notice: te_rsc_command: Initiating action 5: monitor fence_n02_ipmi_monitor_0 on an-a04n01.alteeve.ca (local)
Jun 5 13:55:44 an-a04n01 crmd[3152]: notice: te_rsc_command: Initiating action 3: probe_complete probe_complete on an-a04n01.alteeve.ca (local) - no waiting
Jun 5 13:55:44 an-a04n01 crmd[3152]: warning: run_graph: Transition 0 (Complete=3, Pending=0, Fired=0, Skipped=0, Incomplete=5, Source=/var/lib/pacemaker/pengine/pe-input-27.bz2): Terminated
Jun 5 13:55:44 an-a04n01 attrd[3150]: notice: attrd_trigger_update: Sending flush op to all hosts for: probe_complete (true)
Jun 5 13:55:44 an-a04n01 crmd[3152]: warning: te_graph_trigger: Transition failed: terminated
Jun 5 13:55:44 an-a04n01 crmd[3152]: notice: print_graph: Graph 0 with 8 actions: batch-limit=8 jobs, network-delay=30ms
Jun 5 13:55:44 an-a04n01 crmd[3152]: notice: print_synapse: [Action 10]: Pending rsc op fence_n01_ipmi_monitor_60000 on an-a04n01.alteeve.ca (priority: 0, waiting: 9)
Jun 5 13:55:44 an-a04n01 crmd[3152]: notice: print_synapse: [Action 9]: Pending rsc op fence_n01_ipmi_start_0 on an-a04n01.alteeve.ca (priority: 0, waiting: 2)
Jun 5 13:55:44 an-a04n01 crmd[3152]: notice: print_synapse: [Action 4]: Completed rsc op fence_n01_ipmi_monitor_0 on an-a04n01.alteeve.ca (priority: 0, waiting: none)
Jun 5 13:55:44 an-a04n01 crmd[3152]: notice: print_synapse: [Action 12]: Pending rsc op fence_n02_ipmi_monitor_60000 on an-a04n01.alteeve.ca (priority: 0, waiting: 11)
Jun 5 13:55:44 an-a04n01 crmd[3152]: notice: print_synapse: [Action 11]: Pending rsc op fence_n02_ipmi_start_0 on an-a04n01.alteeve.ca (priority: 0, waiting: 2)
Jun 5 13:55:44 an-a04n01 crmd[3152]: notice: print_synapse: [Action 5]: Completed rsc op fence_n02_ipmi_monitor_0 on an-a04n01.alteeve.ca (priority: 0, waiting: none)
Jun 5 13:55:44 an-a04n01 crmd[3152]: notice: print_synapse: [Action 3]: Completed rsc op probe_complete on an-a04n01.alteeve.ca (priority: 1000000, waiting: none)
Jun 5 13:55:44 an-a04n01 crmd[3152]: notice: print_synapse: [Action 2]: Pending pseduo op probe_complete on N/A (priority: 0, waiting: none)
Jun 5 13:55:44 an-a04n01 crmd[3152]: notice: print_synapse: * [Input 6]: Unresolved dependancy rsc op probe_complete on an-a04n02.alteeve.ca
Jun 5 13:55:44 an-a04n01 crmd[3152]: notice: do_state_transition: State transition S_TRANSITION_ENGINE -> S_IDLE [ input=I_TE_SUCCESS cause=C_FSA_INTERNAL origin=notify_crmd ]
Jun 5 13:55:44 an-a04n01 attrd[3150]: notice: attrd_perform_update: Sent update 4: probe_complete=true
Jun 5 13:55:49 an-a04n01 crmd[3152]: notice: do_state_transition: State transition S_IDLE -> S_INTEGRATION [ input=I_NODE_JOIN cause=C_FSA_INTERNAL origin=peer_update_callback ]
Jun 5 13:55:51 an-a04n01 attrd[3150]: notice: attrd_local_callback: Sending full refresh (origin=crmd)
Jun 5 13:55:51 an-a04n01 attrd[3150]: notice: attrd_trigger_update: Sending flush op to all hosts for: probe_complete (true)
Jun 5 13:55:51 an-a04n01 pengine[3151]: notice: LogActions: Start fence_n01_ipmi#011(an-a04n01.alteeve.ca)
Jun 5 13:55:51 an-a04n01 pengine[3151]: notice: LogActions: Start fence_n02_ipmi#011(an-a04n02.alteeve.ca)
Jun 5 13:55:51 an-a04n01 pengine[3151]: notice: process_pe_message: Calculated Transition 1: /var/lib/pacemaker/pengine/pe-input-28.bz2
Jun 5 13:55:51 an-a04n01 crmd[3152]: notice: te_rsc_command: Initiating action 5: monitor fence_n01_ipmi_monitor_0 on an-a04n02.alteeve.ca
Jun 5 13:55:51 an-a04n01 crmd[3152]: notice: te_rsc_command: Initiating action 6: monitor fence_n02_ipmi_monitor_0 on an-a04n02.alteeve.ca
Jun 5 13:55:52 an-a04n01 crmd[3152]: notice: te_rsc_command: Initiating action 4: probe_complete probe_complete on an-a04n02.alteeve.ca - no waiting
Jun 5 13:55:52 an-a04n01 crmd[3152]: notice: te_rsc_command: Initiating action 7: start fence_n01_ipmi_start_0 on an-a04n01.alteeve.ca (local)
Jun 5 13:55:52 an-a04n01 crmd[3152]: notice: te_rsc_command: Initiating action 9: start fence_n02_ipmi_start_0 on an-a04n02.alteeve.ca
Jun 5 13:55:53 an-a04n01 stonith-ng[3148]: notice: stonith_device_register: Device 'fence_n01_ipmi' already existed in device list (2 active devices)
Jun 5 13:55:54 an-a04n01 crmd[3152]: notice: process_lrm_event: LRM operation fence_n01_ipmi_start_0 (call=13, rc=0, cib-update=41, confirmed=true) ok
Jun 5 13:55:54 an-a04n01 crmd[3152]: notice: te_rsc_command: Initiating action 10: monitor fence_n02_ipmi_monitor_60000 on an-a04n02.alteeve.ca
Jun 5 13:55:54 an-a04n01 crmd[3152]: notice: te_rsc_command: Initiating action 8: monitor fence_n01_ipmi_monitor_60000 on an-a04n01.alteeve.ca (local)
Jun 5 13:55:54 an-a04n01 crmd[3152]: notice: process_lrm_event: LRM operation fence_n01_ipmi_monitor_60000 (call=16, rc=0, cib-update=42, confirmed=false) ok
Jun 5 13:55:54 an-a04n01 crmd[3152]: notice: run_graph: Transition 1 (Complete=8, Pending=0, Fired=0, Skipped=0, Incomplete=0, Source=/var/lib/pacemaker/pengine/pe-input-28.bz2): Complete
Jun 5 13:55:54 an-a04n01 crmd[3152]: notice: do_state_transition: State transition S_TRANSITION_ENGINE -> S_IDLE [ input=I_TE_SUCCESS cause=C_FSA_INTERNAL origin=notify_crmd ]
====
Second node (pacemaker started ~1 second later):
====
[root@an-a04n02 ~]# /etc/init.d/pacemaker start
Starting cluster:
Checking if cluster has been disabled at boot... [ OK ]
Checking Network Manager... [ OK ]
Global setup... [ OK ]
Loading kernel modules... [ OK ]
Mounting configfs... [ OK ]
Starting cman... [ OK ]
Waiting for quorum... [ OK ]
Starting fenced... [ OK ]
Starting dlm_controld... [ OK ]
Tuning DLM kernel config... [ OK ]
Starting gfs_controld... [ OK ]
Unfencing self... [ OK ]
Joining fence domain... [ OK ]
Starting clvmd: clvmd startup timed out
Starting Pacemaker Cluster Manager [ OK ]
====
Jun 5 13:47:55 an-a04n02 ntpd[2487]: 0.0.0.0 c612 02 freq_set kernel 27.924 PPM
Jun 5 13:47:55 an-a04n02 ntpd[2487]: 0.0.0.0 c615 05 clock_sync
Jun 5 13:55:11 an-a04n02 kernel: DLM (built Apr 11 2014 17:28:07) installed
Jun 5 13:55:11 an-a04n02 corosync[2831]: [MAIN ] Corosync Cluster Engine ('1.4.1'): started and ready to provide service.
Jun 5 13:55:11 an-a04n02 corosync[2831]: [MAIN ] Corosync built-in features: nss dbus rdma snmp
Jun 5 13:55:11 an-a04n02 corosync[2831]: [MAIN ] Successfully read config from /etc/cluster/cluster.conf
Jun 5 13:55:11 an-a04n02 corosync[2831]: [MAIN ] Successfully parsed cman config
Jun 5 13:55:11 an-a04n02 corosync[2831]: [TOTEM ] Initializing transport (UDP/IP Multicast).
Jun 5 13:55:11 an-a04n02 corosync[2831]: [TOTEM ] Initializing transmit/receive security: libtomcrypt SOBER128/SHA1HMAC (mode 0).
Jun 5 13:55:11 an-a04n02 corosync[2831]: [TOTEM ] The network interface [10.20.40.2] is now up.
Jun 5 13:55:12 an-a04n02 corosync[2831]: [QUORUM] Using quorum provider quorum_cman
Jun 5 13:55:12 an-a04n02 corosync[2831]: [SERV ] Service engine loaded: corosync cluster quorum service v0.1
Jun 5 13:55:12 an-a04n02 corosync[2831]: [CMAN ] CMAN 3.0.12.1 (built Apr 3 2014 05:12:26) started
Jun 5 13:55:12 an-a04n02 corosync[2831]: [SERV ] Service engine loaded: corosync CMAN membership service 2.90
Jun 5 13:55:12 an-a04n02 corosync[2831]: [SERV ] Service engine loaded: openais checkpoint service B.01.01
Jun 5 13:55:12 an-a04n02 corosync[2831]: [SERV ] Service engine loaded: corosync extended virtual synchrony service
Jun 5 13:55:12 an-a04n02 corosync[2831]: [SERV ] Service engine loaded: corosync configuration service
Jun 5 13:55:12 an-a04n02 corosync[2831]: [SERV ] Service engine loaded: corosync cluster closed process group service v1.01
Jun 5 13:55:12 an-a04n02 corosync[2831]: [SERV ] Service engine loaded: corosync cluster config database access v1.01
Jun 5 13:55:12 an-a04n02 corosync[2831]: [SERV ] Service engine loaded: corosync profile loading service
Jun 5 13:55:12 an-a04n02 corosync[2831]: [QUORUM] Using quorum provider quorum_cman
Jun 5 13:55:12 an-a04n02 corosync[2831]: [SERV ] Service engine loaded: corosync cluster quorum service v0.1
Jun 5 13:55:12 an-a04n02 corosync[2831]: [MAIN ] Compatibility mode set to whitetank. Using V1 and V2 of the synchronization engine.
Jun 5 13:55:12 an-a04n02 corosync[2831]: [TOTEM ] A processor joined or left the membership and a new membership was formed.
Jun 5 13:55:12 an-a04n02 corosync[2831]: [TOTEM ] A processor joined or left the membership and a new membership was formed.
Jun 5 13:55:12 an-a04n02 corosync[2831]: [CMAN ] quorum regained, resuming activity
Jun 5 13:55:12 an-a04n02 corosync[2831]: [QUORUM] This node is within the primary component and will provide service.
Jun 5 13:55:12 an-a04n02 corosync[2831]: [QUORUM] Members[1]: 2
Jun 5 13:55:12 an-a04n02 corosync[2831]: [QUORUM] Members[1]: 2
Jun 5 13:55:12 an-a04n02 corosync[2831]: [QUORUM] Members[2]: 1 2
Jun 5 13:55:12 an-a04n02 corosync[2831]: [QUORUM] Members[2]: 1 2
Jun 5 13:55:12 an-a04n02 corosync[2831]: [CPG ] chosen downlist: sender r(0) ip(10.20.40.1) ; members(old:1 left:0)
Jun 5 13:55:12 an-a04n02 corosync[2831]: [MAIN ] Completed service synchronization, ready to provide service.
Jun 5 13:55:15 an-a04n02 fenced[2886]: fenced 3.0.12.1 started
Jun 5 13:55:16 an-a04n02 dlm_controld[2910]: dlm_controld 3.0.12.1 started
Jun 5 13:55:17 an-a04n02 gfs_controld[2957]: gfs_controld 3.0.12.1 started
Jun 5 13:55:18 an-a04n02 pacemaker: Attempting to start clvmd
Jun 5 13:55:19 an-a04n02 kernel: dlm: Using TCP for communications
Jun 5 13:55:19 an-a04n02 kernel: dlm: connecting to 1
Jun 5 13:55:48 an-a04n02 pacemaker: Starting Pacemaker Cluster Manager
Jun 5 13:55:48 an-a04n02 pacemakerd[3081]: notice: crm_add_logfile: Additional logging available in /var/log/cluster/corosync.log
Jun 5 13:55:48 an-a04n02 pacemakerd[3081]: notice: main: Starting Pacemaker 1.1.10-14.el6_5.3 (Build: 368c726): generated-manpages agent-manpages ascii-docs publican-docs ncurses libqb-logging libqb-ipc nagios corosync-plugin cman
Jun 5 13:55:48 an-a04n02 cib[3087]: notice: crm_add_logfile: Additional logging available in /var/log/cluster/corosync.log
Jun 5 13:55:48 an-a04n02 attrd[3090]: notice: crm_add_logfile: Additional logging available in /var/log/cluster/corosync.log
Jun 5 13:55:48 an-a04n02 attrd[3090]: notice: crm_cluster_connect: Connecting to cluster infrastructure: cman
Jun 5 13:55:48 an-a04n02 attrd[3090]: notice: main: Starting mainloop...
Jun 5 13:55:48 an-a04n02 stonith-ng[3088]: notice: crm_add_logfile: Additional logging available in /var/log/cluster/corosync.log
Jun 5 13:55:48 an-a04n02 pengine[3091]: notice: crm_add_logfile: Additional logging available in /var/log/cluster/corosync.log
Jun 5 13:55:48 an-a04n02 stonith-ng[3088]: notice: crm_cluster_connect: Connecting to cluster infrastructure: cman
Jun 5 13:55:48 an-a04n02 lrmd[3089]: notice: crm_add_logfile: Additional logging available in /var/log/cluster/corosync.log
Jun 5 13:55:48 an-a04n02 crmd[3092]: notice: crm_add_logfile: Additional logging available in /var/log/cluster/corosync.log
Jun 5 13:55:48 an-a04n02 crmd[3092]: notice: main: CRM Git Version: 368c726
Jun 5 13:55:48 an-a04n02 cib[3087]: notice: crm_cluster_connect: Connecting to cluster infrastructure: cman
Jun 5 13:55:49 an-a04n02 stonith-ng[3088]: notice: setup_cib: Watching for stonith topology changes
Jun 5 13:55:49 an-a04n02 crmd[3092]: notice: crm_cluster_connect: Connecting to cluster infrastructure: cman
Jun 5 13:55:49 an-a04n02 crmd[3092]: notice: cman_event_callback: Membership 132: quorum acquired
Jun 5 13:55:49 an-a04n02 crmd[3092]: notice: crm_update_peer_state: cman_event_callback: Node an-a04n01.alteeve.ca[1] - state is now member (was (null))
Jun 5 13:55:49 an-a04n02 crmd[3092]: notice: crm_update_peer_state: cman_event_callback: Node an-a04n02.alteeve.ca[2] - state is now member (was (null))
Jun 5 13:55:49 an-a04n02 crmd[3092]: notice: do_started: The local CRM is operational
Jun 5 13:55:49 an-a04n02 crmd[3092]: notice: do_state_transition: State transition S_STARTING -> S_PENDING [ input=I_PENDING cause=C_FSA_INTERNAL origin=do_started ]
Jun 5 13:55:50 an-a04n02 stonith-ng[3088]: notice: stonith_device_register: Added 'fence_n01_ipmi' to the device list (1 active devices)
Jun 5 13:55:51 an-a04n02 stonith-ng[3088]: notice: stonith_device_register: Added 'fence_n02_ipmi' to the device list (2 active devices)
Jun 5 13:55:51 an-a04n02 crmd[3092]: notice: do_state_transition: State transition S_PENDING -> S_NOT_DC [ input=I_NOT_DC cause=C_HA_MESSAGE origin=do_cl_join_finalize_respond ]
Jun 5 13:55:51 an-a04n02 attrd[3090]: notice: attrd_local_callback: Sending full refresh (origin=crmd)
Jun 5 13:55:52 an-a04n02 attrd[3090]: notice: attrd_trigger_update: Sending flush op to all hosts for: probe_complete (true)
Jun 5 13:55:53 an-a04n02 attrd[3090]: notice: attrd_perform_update: Sent update 6: probe_complete=true
Jun 5 13:55:53 an-a04n02 stonith-ng[3088]: notice: stonith_device_register: Device 'fence_n02_ipmi' already existed in device list (2 active devices)
Jun 5 13:55:53 an-a04n02 crmd[3092]: notice: process_lrm_event: LRM operation fence_n02_ipmi_start_0 (call=13, rc=0, cib-update=12, confirmed=true) ok
Jun 5 13:55:54 an-a04n02 crmd[3092]: notice: process_lrm_event: LRM operation fence_n02_ipmi_monitor_60000 (call=16, rc=0, cib-update=13, confirmed=false) ok
Jun 5 13:58:10 an-a04n02 kernel: INFO: task clvmd:3060 blocked for more than 120 seconds.
Jun 5 13:58:10 an-a04n02 kernel: Not tainted 2.6.32-431.17.1.el6.x86_64 #1
Jun 5 13:58:10 an-a04n02 kernel: "echo 0 > /proc/sys/kernel/hung_task_timeout_secs" disables this message.
Jun 5 13:58:10 an-a04n02 kernel: clvmd D 000000000000000c 0 3060 1 0x00000080
Jun 5 13:58:10 an-a04n02 kernel: ffff88033827fc98 0000000000000086 0000000000000000 0000000000000000
Jun 5 13:58:10 an-a04n02 kernel: 0000000000000000 0000000000000000 0000000000000000 0000000000000000
Jun 5 13:58:10 an-a04n02 kernel: ffff880339569058 ffff88033827ffd8 000000000000fbc8 ffff880339569058
Jun 5 13:58:10 an-a04n02 kernel: Call Trace:
Jun 5 13:58:10 an-a04n02 kernel: [<ffffffff81528a95>] schedule_timeout+0x215/0x2e0
Jun 5 13:58:10 an-a04n02 kernel: [<ffffffff81527bfe>] ? thread_return+0x4e/0x760
Jun 5 13:58:10 an-a04n02 kernel: [<ffffffff81285172>] ? kobject_uevent_env+0x202/0x620
Jun 5 13:58:10 an-a04n02 kernel: [<ffffffff81528713>] wait_for_common+0x123/0x180
Jun 5 13:58:10 an-a04n02 kernel: [<ffffffff81061d00>] ? default_wake_function+0x0/0x20
Jun 5 13:58:10 an-a04n02 kernel: [<ffffffff8152882d>] wait_for_completion+0x1d/0x20
Jun 5 13:58:10 an-a04n02 kernel: [<ffffffffa032af79>] dlm_new_lockspace+0x999/0xa30 [dlm]
Jun 5 13:58:10 an-a04n02 kernel: [<ffffffffa0332ff1>] device_write+0x311/0x720 [dlm]
Jun 5 13:58:10 an-a04n02 kernel: [<ffffffff81061d00>] ? default_wake_function+0x0/0x20
Jun 5 13:58:10 an-a04n02 kernel: [<ffffffff8123316b>] ? selinux_file_permission+0xfb/0x150
Jun 5 13:58:10 an-a04n02 kernel: [<ffffffff81226056>] ? security_file_permission+0x16/0x20
Jun 5 13:58:10 an-a04n02 kernel: [<ffffffff81188c38>] vfs_write+0xb8/0x1a0
Jun 5 13:58:10 an-a04n02 kernel: [<ffffffff81189531>] sys_write+0x51/0x90
Jun 5 13:58:10 an-a04n02 kernel: [<ffffffff810e1abe>] ? __audit_syscall_exit+0x25e/0x290
Jun 5 13:58:10 an-a04n02 kernel: [<ffffffff8100b072>] system_call_fastpath+0x16/0x1b
====
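Since the summary implicates firewall rules: clvmd blocking inside dlm_new_lockspace() like this is consistent with inter-node DLM traffic being dropped. A minimal sketch of iptables rules that open the relevant RHEL 6 cluster ports (the ports are the documented defaults; verify against your own configuration before relying on this):

```shell
# corosync/cman membership traffic (UDP 5404-5405, including multicast)
iptables -I INPUT -m state --state NEW -p udp -m multiport --dports 5404,5405 -j ACCEPT
# DLM inter-node lock traffic (TCP 21064) - blocking this hangs lockspace creation
iptables -I INPUT -m state --state NEW -p tcp --dport 21064 -j ACCEPT
# Persist the rules across reboots
service iptables save
```

With 21064/tcp blocked on one node, the other side logs "dlm: connecting to N" and then waits indefinitely, matching the traces above.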
I will attach the output of 'echo t > /proc/sysrq-trigger' from both nodes in a moment.
Bugzilla isn't letting me add attachments (it says the .cgi file is missing). Here is the syslog output of the two nodes after running 'echo t > /proc/sysrq-trigger' once pacemaker had tried to start clvmd:

https://alteeve.ca/files/messages.an-a04n01.sysrq-t
https://alteeve.ca/files/messages.an-a04n02.sysrq-t

Here are the other logs:

an-a04n01's dlm_controld.log:
====
Jun 05 13:55:14 dlm_controld dlm_controld 3.0.12.1 started
Jun 05 13:55:15 dlm_controld found /dev/misc/dlm-control minor 57
Jun 05 13:55:15 dlm_controld found /dev/misc/dlm-monitor minor 56
Jun 05 13:55:15 dlm_controld found /dev/misc/dlm_plock minor 55
Jun 05 13:55:15 dlm_controld /dev/misc/dlm-monitor fd 12
Jun 05 13:55:15 dlm_controld /sys/kernel/config/dlm/cluster/comms: opendir failed: 2
Jun 05 13:55:15 dlm_controld /sys/kernel/config/dlm/cluster/spaces: opendir failed: 2
Jun 05 13:55:15 dlm_controld cluster node 1 added seq 132
Jun 05 13:55:15 dlm_controld set_configfs_node 1 10.20.40.1 local 1
Jun 05 13:55:15 dlm_controld cluster node 2 added seq 132
Jun 05 13:55:15 dlm_controld set_configfs_node 2 10.20.40.2 local 0
Jun 05 13:55:15 dlm_controld totem/rrp_mode = 'none'
Jun 05 13:55:15 dlm_controld set protocol 0
Jun 05 13:55:15 dlm_controld group_mode 3 compat 0
Jun 05 13:55:15 dlm_controld setup_cpg_daemon 15
Jun 05 13:55:15 dlm_controld dlm:controld conf 1 1 0 memb 1 join 1 left
Jun 05 13:55:15 dlm_controld set_protocol member_count 1 propose daemon 1.1.1 kernel 1.1.1
Jun 05 13:55:15 dlm_controld run protocol from nodeid 1
Jun 05 13:55:15 dlm_controld daemon run 1.1.1 max 1.1.1 kernel run 1.1.1 max 1.1.1
Jun 05 13:55:15 dlm_controld plocks 18
Jun 05 13:55:15 dlm_controld plock cpg message size: 104 bytes
Jun 05 13:55:16 dlm_controld client connection 5 fd 19
Jun 05 13:55:16 dlm_controld dlm:controld conf 2 1 0 memb 1 2 join 2 left
Jun 05 13:55:18 dlm_controld uevent: add@/kernel/dlm/clvmd
Jun 05 13:55:18 dlm_controld kernel: add@ clvmd
Jun 05 13:55:18 dlm_controld uevent: online@/kernel/dlm/clvmd
Jun 05 13:55:18 dlm_controld kernel: online@ clvmd
Jun 05 13:55:18 dlm_controld dlm:ls:clvmd conf 1 1 0 memb 1 join 1 left
Jun 05 13:55:18 dlm_controld clvmd add_change cg 1 joined nodeid 1
Jun 05 13:55:18 dlm_controld clvmd add_change cg 1 we joined
Jun 05 13:55:18 dlm_controld clvmd add_change cg 1 counts member 1 joined 1 remove 0 failed 0
Jun 05 13:55:18 dlm_controld clvmd check_fencing done
Jun 05 13:55:18 dlm_controld clvmd check_quorum disabled
Jun 05 13:55:18 dlm_controld clvmd check_fs none registered
Jun 05 13:55:18 dlm_controld clvmd send_start cg 1 flags 1 data2 0 counts 0 1 1 0 0
Jun 05 13:55:18 dlm_controld clvmd receive_start 1:1 len 76
Jun 05 13:55:18 dlm_controld clvmd match_change 1:1 matches cg 1
Jun 05 13:55:18 dlm_controld clvmd wait_messages cg 1 got all 1
Jun 05 13:55:18 dlm_controld clvmd start_kernel cg 1 member_count 1
Jun 05 13:55:18 dlm_controld write "1090842362" to "/sys/kernel/dlm/clvmd/id"
Jun 05 13:55:18 dlm_controld set_members mkdir "/sys/kernel/config/dlm/cluster/spaces/clvmd/nodes/1"
Jun 05 13:55:18 dlm_controld write "1" to "/sys/kernel/dlm/clvmd/control"
Jun 05 13:55:18 dlm_controld write "0" to "/sys/kernel/dlm/clvmd/event_done"
Jun 05 13:55:18 dlm_controld uevent: add@/devices/virtual/misc/dlm_clvmd
Jun 05 13:55:19 dlm_controld dlm:ls:clvmd conf 2 1 0 memb 1 2 join 2 left
Jun 05 13:55:19 dlm_controld clvmd add_change cg 2 joined nodeid 2
Jun 05 13:55:19 dlm_controld clvmd add_change cg 2 counts member 2 joined 1 remove 0 failed 0
Jun 05 13:55:19 dlm_controld clvmd stop_kernel cg 2
Jun 05 13:55:19 dlm_controld write "0" to "/sys/kernel/dlm/clvmd/control"
Jun 05 13:55:19 dlm_controld clvmd check_fencing done
Jun 05 13:55:19 dlm_controld clvmd check_quorum disabled
Jun 05 13:55:19 dlm_controld clvmd check_fs none registered
Jun 05 13:55:19 dlm_controld clvmd send_start cg 2 flags 2 data2 0 counts 1 2 1 0 0
Jun 05 13:55:19 dlm_controld clvmd receive_start 2:1 len 80
Jun 05 13:55:19 dlm_controld clvmd match_change 2:1 matches cg 2
Jun 05 13:55:19 dlm_controld clvmd wait_messages cg 2 need 1 of 2
Jun 05 13:55:19 dlm_controld clvmd receive_start 1:2 len 80
Jun 05 13:55:19 dlm_controld clvmd match_change 1:2 matches cg 2
Jun 05 13:55:19 dlm_controld clvmd wait_messages cg 2 got all 2
Jun 05 13:55:19 dlm_controld clvmd start_kernel cg 2 member_count 2
Jun 05 13:55:19 dlm_controld dir_member 1
Jun 05 13:55:19 dlm_controld set_members mkdir "/sys/kernel/config/dlm/cluster/spaces/clvmd/nodes/2"
Jun 05 13:55:19 dlm_controld write "1" to "/sys/kernel/dlm/clvmd/control"
Jun 05 13:55:19 dlm_controld clvmd set_plock_ckpt_node from 1 to 1
Jun 05 13:55:19 dlm_controld clvmd unlink ckpt 0
Jun 05 13:55:19 dlm_controld clvmd unlink ckpt error 12 clvmd
Jun 05 13:55:19 dlm_controld clvmd unlink ckpt status error 9 clvmd
Jun 05 13:55:19 dlm_controld clvmd store_plocks r_count 0 p_count 0 total_size 0 max_section_size 0
Jun 05 13:55:19 dlm_controld clvmd store_plocks open ckpt handle 580bd78f00000000
Jun 05 13:55:19 dlm_controld clvmd store_plocks first 0 last 0 r_count 0 p_count 0 sig 0
Jun 05 13:55:19 dlm_controld clvmd send_plocks_stored cg 2 flags a data2 0 counts 1 2 1 0 0
Jun 05 13:55:19 dlm_controld clvmd receive_plocks_stored 1:2 flags a sig 0 need_plocks 0
====

an-a04n02's dlm_controld.log:
====
Jun 05 13:55:16 dlm_controld found /dev/misc/dlm-control minor 57
Jun 05 13:55:16 dlm_controld found /dev/misc/dlm-monitor minor 56
Jun 05 13:55:16 dlm_controld found /dev/misc/dlm_plock minor 55
Jun 05 13:55:16 dlm_controld /dev/misc/dlm-monitor fd 7
Jun 05 13:55:16 dlm_controld /sys/kernel/config/dlm/cluster/comms: opendir failed: 2
Jun 05 13:55:16 dlm_controld /sys/kernel/config/dlm/cluster/spaces: opendir failed: 2
Jun 05 13:55:16 dlm_controld cluster node 1 added seq 132
Jun 05 13:55:16 dlm_controld set_configfs_node 1 10.20.40.1 local 0
Jun 05 13:55:16 dlm_controld cluster node 2 added seq 132
Jun 05 13:55:16 dlm_controld set_configfs_node 2 10.20.40.2 local 1
Jun 05 13:55:16 dlm_controld totem/rrp_mode = 'none'
Jun 05 13:55:16 dlm_controld set protocol 0
Jun 05 13:55:16 dlm_controld group_mode 3 compat 0
Jun 05 13:55:16 dlm_controld setup_cpg_daemon 15
Jun 05 13:55:16 dlm_controld dlm:controld conf 2 1 0 memb 1 2 join 2 left
Jun 05 13:55:16 dlm_controld run protocol from nodeid 1
Jun 05 13:55:16 dlm_controld daemon run 1.1.1 max 1.1.1 kernel run 1.1.1 max 1.1.1
Jun 05 13:55:16 dlm_controld plocks 18
Jun 05 13:55:16 dlm_controld plock cpg message size: 104 bytes
Jun 05 13:55:17 dlm_controld client connection 5 fd 19
Jun 05 13:55:19 dlm_controld uevent: add@/kernel/dlm/clvmd
Jun 05 13:55:19 dlm_controld kernel: add@ clvmd
Jun 05 13:55:19 dlm_controld uevent: online@/kernel/dlm/clvmd
Jun 05 13:55:19 dlm_controld kernel: online@ clvmd
Jun 05 13:55:19 dlm_controld dlm:ls:clvmd conf 2 1 0 memb 1 2 join 2 left
Jun 05 13:55:19 dlm_controld clvmd add_change cg 1 joined nodeid 2
Jun 05 13:55:19 dlm_controld clvmd add_change cg 1 we joined
Jun 05 13:55:19 dlm_controld clvmd add_change cg 1 counts member 2 joined 1 remove 0 failed 0
Jun 05 13:55:19 dlm_controld clvmd check_fencing done
Jun 05 13:55:19 dlm_controld clvmd check_quorum disabled
Jun 05 13:55:19 dlm_controld clvmd check_fs none registered
Jun 05 13:55:19 dlm_controld clvmd send_start cg 1 flags 1 data2 0 counts 0 2 1 0 0
Jun 05 13:55:19 dlm_controld clvmd receive_start 2:1 len 80
Jun 05 13:55:19 dlm_controld clvmd match_change 2:1 matches cg 1
Jun 05 13:55:19 dlm_controld clvmd wait_messages cg 1 need 1 of 2
Jun 05 13:55:19 dlm_controld clvmd receive_start 1:2 len 80
Jun 05 13:55:19 dlm_controld clvmd match_change 1:2 matches cg 1
Jun 05 13:55:19 dlm_controld clvmd wait_messages cg 1 got all 2
Jun 05 13:55:19 dlm_controld clvmd start_kernel cg 1 member_count 2
Jun 05 13:55:19 dlm_controld write "1090842362" to "/sys/kernel/dlm/clvmd/id"
Jun 05 13:55:19 dlm_controld set_members mkdir "/sys/kernel/config/dlm/cluster/spaces/clvmd/nodes/1"
Jun 05 13:55:19 dlm_controld set_members mkdir "/sys/kernel/config/dlm/cluster/spaces/clvmd/nodes/2"
Jun 05 13:55:19 dlm_controld write "1" to "/sys/kernel/dlm/clvmd/control"
Jun 05 13:55:19 dlm_controld write "0" to "/sys/kernel/dlm/clvmd/event_done"
Jun 05 13:55:19 dlm_controld clvmd set_plock_ckpt_node from 0 to 1
Jun 05 13:55:19 dlm_controld clvmd receive_plocks_stored 1:2 flags a sig 0 need_plocks 1
Jun 05 13:55:19 dlm_controld clvmd match_change 1:2 matches cg 1
Jun 05 13:55:19 dlm_controld clvmd retrieve_plocks
Jun 05 13:55:19 dlm_controld clvmd retrieve_plocks first 0 last 0 r_count 0 p_count 0 sig 0
====

an-a04n01's fenced.log:
====
Jun 05 13:55:14 fenced fenced 3.0.12.1 started
Jun 05 13:55:14 fenced logfile cur mode 100644
Jun 05 13:55:14 fenced cpg_join fenced:daemon ...
Jun 05 13:55:14 fenced setup_cpg_daemon 14
Jun 05 13:55:14 fenced group_mode 3 compat 0
Jun 05 13:55:14 fenced fenced:daemon conf 1 1 0 memb 1 join 1 left
Jun 05 13:55:14 fenced fenced:daemon ring 1:132 2 memb 1 2
Jun 05 13:55:14 fenced receive_protocol from 1 max 1.1.1.0 run 0.0.0.0
Jun 05 13:55:14 fenced daemon node 1 max 0.0.0.0 run 0.0.0.0
Jun 05 13:55:14 fenced daemon node 1 join 1401990914 left 0 local quorum 1401990914
Jun 05 13:55:14 fenced set_protocol member_count 1 propose daemon 1.1.1
Jun 05 13:55:14 fenced receive_protocol from 1 max 1.1.1.0 run 1.1.1.0
Jun 05 13:55:14 fenced daemon node 1 max 1.1.1.0 run 0.0.0.0
Jun 05 13:55:14 fenced daemon node 1 join 1401990914 left 0 local quorum 1401990914
Jun 05 13:55:14 fenced run protocol from nodeid 1
Jun 05 13:55:14 fenced daemon run 1.1.1 max 1.1.1
Jun 05 13:55:14 fenced receive_protocol from 1 max 1.1.1.0 run 1.1.1.0
Jun 05 13:55:14 fenced daemon node 1 max 1.1.1.0 run 1.1.1.0
Jun 05 13:55:14 fenced daemon node 1 join 1401990914 left 0 local quorum 1401990914
Jun 05 13:55:16 fenced fenced:daemon conf 2 1 0 memb 1 2 join 2 left
Jun 05 13:55:16 fenced receive_protocol from 2 max 1.1.1.0 run 0.0.0.0
Jun 05 13:55:16 fenced daemon node 2 max 0.0.0.0 run 0.0.0.0
Jun 05 13:55:16
fenced daemon node 2 join 1401990916 left 0 local quorum 1401990914 Jun 05 13:55:16 fenced receive_protocol from 1 max 1.1.1.0 run 1.1.1.0 Jun 05 13:55:16 fenced daemon node 1 max 1.1.1.0 run 1.1.1.0 Jun 05 13:55:16 fenced daemon node 1 join 1401990914 left 0 local quorum 1401990914 Jun 05 13:55:16 fenced receive_protocol from 2 max 1.1.1.0 run 1.1.1.0 Jun 05 13:55:16 fenced daemon node 2 max 1.1.1.0 run 0.0.0.0 Jun 05 13:55:16 fenced daemon node 2 join 1401990916 left 0 local quorum 1401990914 Jun 05 13:55:16 fenced client connection 3 fd 15 Jun 05 13:55:16 fenced /cluster/cman/@two_node is 1 Jun 05 13:55:16 fenced added 2 nodes from ccs Jun 05 13:55:16 fenced cpg_join fenced:default ... Jun 05 13:55:16 fenced fenced:default conf 1 1 0 memb 1 join 1 left Jun 05 13:55:16 fenced add_change cg 1 joined nodeid 1 Jun 05 13:55:16 fenced add_change cg 1 m 1 j 1 r 0 f 0 Jun 05 13:55:16 fenced check_ringid cluster 132 cpg 0:0 Jun 05 13:55:16 fenced fenced:default ring 1:132 2 memb 1 2 Jun 05 13:55:16 fenced check_ringid done cluster 132 cpg 1:132 Jun 05 13:55:16 fenced check_quorum done Jun 05 13:55:16 fenced send_start 1:1 flags 1 started 0 m 1 j 1 r 0 f 0 Jun 05 13:55:16 fenced receive_start 1:1 len 152 Jun 05 13:55:16 fenced match_change 1:1 matches cg 1 Jun 05 13:55:16 fenced wait_messages cg 1 got all 1 Jun 05 13:55:16 fenced set_master from 0 to low node 1 Jun 05 13:55:16 fenced send_complete 1:1 flags 1 started 0 m 1 j 1 r 0 f 0 Jun 05 13:55:16 fenced receive_complete 1:1 len 152 Jun 05 13:55:17 fenced fenced:default conf 2 1 0 memb 1 2 join 2 left Jun 05 13:55:17 fenced add_change cg 2 joined nodeid 2 Jun 05 13:55:17 fenced add_change cg 2 m 2 j 1 r 0 f 0 Jun 05 13:55:17 fenced check_ringid done cluster 132 cpg 1:132 Jun 05 13:55:17 fenced check_quorum done Jun 05 13:55:17 fenced send_start 1:2 flags 2 started 1 m 2 j 1 r 0 f 0 Jun 05 13:55:17 fenced receive_start 1:2 len 152 Jun 05 13:55:17 fenced match_change 1:2 matches cg 2 Jun 05 13:55:17 fenced wait_messages 
cg 2 need 1 of 2 Jun 05 13:55:17 fenced receive_start 2:1 len 152 Jun 05 13:55:17 fenced match_change 2:1 matches cg 2 Jun 05 13:55:17 fenced wait_messages cg 2 got all 2 Jun 05 13:55:17 fenced set_master from 1 to complete node 1 Jun 05 13:55:17 fenced send_complete 1:2 flags 2 started 1 m 2 j 1 r 0 f 0 Jun 05 13:55:17 fenced receive_complete 1:2 len 152 ==== an-a04n02's fenced.log: ==== Jun 05 13:55:15 fenced fenced 3.0.12.1 started Jun 05 13:55:16 fenced logfile cur mode 100644 Jun 05 13:55:16 fenced cpg_join fenced:daemon ... Jun 05 13:55:16 fenced setup_cpg_daemon 14 Jun 05 13:55:16 fenced group_mode 3 compat 0 Jun 05 13:55:16 fenced fenced:daemon conf 2 1 0 memb 1 2 join 2 left Jun 05 13:55:16 fenced fenced:daemon ring 1:132 2 memb 1 2 Jun 05 13:55:16 fenced receive_protocol from 2 max 1.1.1.0 run 0.0.0.0 Jun 05 13:55:16 fenced daemon node 2 max 0.0.0.0 run 0.0.0.0 Jun 05 13:55:16 fenced daemon node 2 join 1401990916 left 0 local quorum 1401990915 Jun 05 13:55:16 fenced receive_protocol from 1 max 1.1.1.0 run 1.1.1.0 Jun 05 13:55:16 fenced daemon node 1 max 0.0.0.0 run 0.0.0.0 Jun 05 13:55:16 fenced daemon node 1 join 1401990916 left 0 local quorum 1401990915 Jun 05 13:55:16 fenced run protocol from nodeid 1 Jun 05 13:55:16 fenced daemon run 1.1.1 max 1.1.1 Jun 05 13:55:16 fenced receive_protocol from 2 max 1.1.1.0 run 1.1.1.0 Jun 05 13:55:16 fenced daemon node 2 max 1.1.1.0 run 0.0.0.0 Jun 05 13:55:16 fenced daemon node 2 join 1401990916 left 0 local quorum 1401990915 Jun 05 13:55:17 fenced client connection 3 fd 15 Jun 05 13:55:17 fenced /cluster/cman/@two_node is 1 Jun 05 13:55:17 fenced added 2 nodes from ccs Jun 05 13:55:17 fenced cpg_join fenced:default ... 
Jun 05 13:55:17 fenced fenced:default conf 2 1 0 memb 1 2 join 2 left
Jun 05 13:55:17 fenced add_change cg 1 joined nodeid 2
Jun 05 13:55:17 fenced add_change cg 1 m 2 j 1 r 0 f 0
Jun 05 13:55:17 fenced check_ringid cluster 132 cpg 0:0
Jun 05 13:55:17 fenced fenced:default ring 1:132 2 memb 1 2
Jun 05 13:55:17 fenced check_ringid done cluster 132 cpg 1:132
Jun 05 13:55:17 fenced check_quorum done
Jun 05 13:55:17 fenced send_start 2:1 flags 1 started 0 m 2 j 1 r 0 f 0
Jun 05 13:55:17 fenced receive_start 1:2 len 152
Jun 05 13:55:17 fenced match_change 1:2 matches cg 1
Jun 05 13:55:17 fenced wait_messages cg 1 need 1 of 2
Jun 05 13:55:17 fenced receive_start 2:1 len 152
Jun 05 13:55:17 fenced match_change 2:1 matches cg 1
Jun 05 13:55:17 fenced wait_messages cg 1 got all 2
Jun 05 13:55:17 fenced set_master from 0 to complete node 1
Jun 05 13:55:17 fenced receive_complete 1:2 len 152
====

I think it's probably in the kernel - though you're using exactly the same kernel as I am - the final uevent isn't reaching dlm_controld. Can you enable kernel debugging please?

echo 1 > /sys/kernel/config/dlm/cluster/log_debug

You'll need to do this between loading the DLM module and starting clvmd, as the files won't be there until the module is loaded. You might also need to enable LOG_DEBUG level logging in syslog.

I ran 'chkconfig clvmd off' on both nodes, started pacemaker, ran 'echo 1 > /sys/kernel/config/dlm/cluster/log_debug', and then started clvmd. I started clvmd on an-a04n02 first, then a second later on an-a04n01 (the reverse of the previous attempt). This time, an-a04n02 *appeared* to start, and an-a04n01 timed out. So start order matters in some fashion.

I will (try to) attach the syslog in a moment.

Created attachment 903015 [details]
syslog with kernel/dlm debugging enabled
Created attachment 903016 [details]
syslog with kernel/dlm debugging enabled, an-a04n02
Created attachment 903017 [details]
dlm_controld.log with kernel/dlm debugging enabled, an-a04n01
Created attachment 903018 [details]
dlm_controld.log with kernel/dlm debugging enabled, an-a04n02
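The debug-enable step Christine describes can be wrapped in a small guard, since the configfs file only appears after the dlm kernel module is loaded. A sketch under that assumption (RHEL 6 module name and path; run as root before clvmd starts):

```shell
# Load the dlm module first; the configfs entries do not exist until then.
modprobe dlm 2>/dev/null || true   # no-op if already loaded or not root

debug_file=/sys/kernel/config/dlm/cluster/log_debug
if [ -w "$debug_file" ]; then
    # Module is loaded and we have permission: turn on kernel dlm debugging.
    echo 1 > "$debug_file"
    dlm_debug=enabled
else
    # Module not loaded yet (or not running as root); nothing written.
    dlm_debug=unavailable
fi
echo "dlm kernel debugging: $dlm_debug"
```

This has to run in the window between module load and clvmd start, exactly as the comment says; on a node where the module is absent it simply reports "unavailable" instead of failing.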
In hopes of getting back to studying, I wiped both nodes and did a fresh, near-minimal install of RHEL 6.5. I created a skeleton cluster.conf, configured stonith in pacemaker, changed 'locking_type = 3' in lvm.conf and started clvmd... Same results. So whatever it is, it persists through reinstalls.

If you're having trouble reproducing, I can set up remote access to the affected nodes. They are purely test systems and can be wiped at any time.

I decided to play with this some more today... Turns out, it was a firewall issue:

====
[root@an-a04n01 ~]# diff -u iptables.broken iptables.dlm
--- iptables.broken	2014-06-07 16:03:41.120294967 -0400
+++ iptables.dlm	2014-06-07 16:04:20.444294917 -0400
@@ -1,8 +1,9 @@
-# Generated by iptables-save v1.4.7 on Sat Jun 7 16:03:41 2014
+# Generated by iptables-save v1.4.7 on Sat Jun 7 16:04:20 2014
 *filter
 :INPUT ACCEPT [0:0]
 :FORWARD ACCEPT [0:0]
-:OUTPUT ACCEPT [37:4556]
+:OUTPUT ACCEPT [22:3824]
+-A INPUT -s 10.20.0.0/16 -d 10.20.0.0/16 -p tcp -m state --state NEW -m tcp --dport 21064 -j ACCEPT
 -A INPUT -s 10.10.0.0/16 -d 10.10.0.0/16 -p tcp -m state --state NEW -m tcp --dport 7789 -j ACCEPT
 -A INPUT -s 10.10.0.0/16 -d 10.10.0.0/16 -p tcp -m state --state NEW -m tcp --dport 7788 -j ACCEPT
 -A INPUT -s 10.20.0.0/16 -p udp -m addrtype --dst-type MULTICAST -m state --state NEW -m multiport --dports 5404,5405 -j ACCEPT
@@ -14,4 +15,4 @@
 -A INPUT -j REJECT --reject-with icmp-host-prohibited
 -A FORWARD -j REJECT --reject-with icmp-host-prohibited
 COMMIT
-# Completed on Sat Jun 7 16:03:41 2014
+# Completed on Sat Jun 7 16:04:20 2014
====

So, "user error". My question now is: can dlm/clvmd handle this better? It strikes me that it shouldn't lock up the system so hard when the firewall isn't open. Could it test whether a firewall is in the way, then complain and exit instead?

I'm more worried that the default firewall rules were wrong, TBH. That would be a much easier and far more useful fix, I think.
Were you using the default firewall rules?

I'll ask the resource agent developers if it's feasible to add a check when the DLM is started.

(In reply to Christine Caulfield from comment #22)
> I'll ask the resource agent developers if it's feasible to add a check when
> the DLM is started.

I don't maintain the dlm or clvmd init scripts for rhel6. The dlm is actually started by the CMAN init script. We shouldn't need to be checking firewall rules; all of this should just work. If it doesn't just work, then we need to focus on fixing that.

-- Vossel

OK, my mistake, sorry. It just looks like what might be a massively complicated change in the DLM to fix a rare configuration error.

Reluctantly closing this because it's getting too late in the RHEL-6 lifecycle to contemplate intrusive DLM changes.
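For anyone landing here with the same symptom: the one-line resolution visible in the reporter's diff is admitting DLM's inter-node traffic, which on this stack uses TCP port 21064. As an iptables-save style fragment (the 10.20.0.0/16 subnet is the reporter's cluster network; substitute your own):

```
# Allow DLM inter-node traffic (TCP port 21064) on the cluster network.
# The 10.20.0.0/16 subnet is specific to the reporter's setup.
-A INPUT -s 10.20.0.0/16 -d 10.20.0.0/16 -p tcp -m state --state NEW -m tcp --dport 21064 -j ACCEPT
```

On RHEL 6 the equivalent live command is `iptables -I INPUT` with the same match arguments, followed by `service iptables save` to persist it. Note that corosync's multicast ports (UDP 5404/5405) were already open in the broken rule set, which is why cman and fenced came up fine and only the DLM-backed clvmd hung.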