Note: This bug is displayed in read-only format because the product is no longer active in Red Hat Bugzilla.
RHEL Engineering is moving the tracking of its product development work on RHEL 6 through RHEL 9 to Red Hat Jira (issues.redhat.com). If you're a Red Hat customer, please continue to file support cases via the Red Hat customer portal. If you're not, please head to the "RHEL project" in Red Hat Jira and file new tickets there.

Individual Bugzilla bugs in the statuses "NEW", "ASSIGNED", and "POST" are being migrated throughout September 2023. Bugs of Red Hat partners with an assigned Engineering Partner Manager (EPM) are migrated in late September on pre-agreed dates. Bugs against the components "kernel", "kernel-rt", and "kpatch" are only migrated if still in "NEW" or "ASSIGNED".

If you cannot log in to RH Jira, please consult article #7032570. Failing that, please send an e-mail to the RH Jira admins at rh-issues@redhat.com to troubleshoot your issue as a user management inquiry; the e-mail creates a ServiceNow ticket with Red Hat.

Migrated Bugzilla bugs will be moved to status "CLOSED", resolution "MIGRATED", and have "MigratedToJIRA" added to "Keywords". The link to the successor Jira issue appears under "Links" with a small "two-footprint" icon next to it and leads to the "RHEL project" in Red Hat Jira (issue links are of the form "https://issues.redhat.com/browse/RHEL-XXXX", where "X" is a digit). The same link appears in a blue banner at the top of the page informing you that the bug has been migrated.

Bug 1104899

Summary: Starting clvmd hangs with incorrect firewall rules
Product: Red Hat Enterprise Linux 6
Reporter: Madison Kelly <mkelly>
Component: cluster
Assignee: Christine Caulfield <ccaulfie>
Status: CLOSED WONTFIX
QA Contact: cluster-qe <cluster-qe>
Severity: low
Priority: unspecified
Version: 6.5
CC: abeekhof, agk, ccaulfie, cluster-maint, dvossel, dwysocha, heinzm, jbrassow, msnitzer, prajnoha, prockai, rpeterso, teigland, thornber, tlavigne, zkabelac
Target Milestone: rc
Hardware: Unspecified
OS: Unspecified
Doc Type: Bug Fix
Type: Bug
Last Closed: 2015-08-20 08:13:15 UTC
Attachments:
- crm_report from the clean start of pacemaker
- crm_report after starting clvmd and it hanging
- syslog with kernel/dlm debugging enabled
- syslog with kernel/dlm debugging enabled, an-a04n02
- dlm_controld.log with kernel/dlm debugging enabled, an-a04n01
- dlm_controld.log with kernel/dlm debugging enabled, an-a04n02

Description Madison Kelly 2014-06-04 23:48:24 UTC
Description of problem:

I've built a 2-node cluster with cman + pacemaker, with stonith/fencing enabled and tested on both nodes. The cluster forms fine, as confirmed with multiple tools:

====
[root@an-a04n01 ~]# pcs status
Cluster name: an-anvil-04
Last updated: Wed Jun  4 19:31:59 2014
Last change: Wed May 28 23:56:25 2014 via cibadmin on an-a04n01.alteeve.ca
Stack: cman
Current DC: an-a04n02.alteeve.ca - partition with quorum
Version: 1.1.10-14.el6_5.3-368c726
2 Nodes configured
2 Resources configured


Online: [ an-a04n01.alteeve.ca an-a04n02.alteeve.ca ]

Full list of resources:

 fence_n01_ipmi	(stonith:fence_ipmilan):	Started an-a04n02.alteeve.ca 
 fence_n02_ipmi	(stonith:fence_ipmilan):	Started an-a04n01.alteeve.ca 


[root@an-a04n01 ~]# /etc/init.d/cman status
cluster is running.

[root@an-a04n01 ~]# cman_tool status
Version: 6.2.0
Config Version: 9
Cluster Name: an-anvil-04
Cluster Id: 6054
Cluster Member: Yes
Cluster Generation: 92
Membership state: Cluster-Member
Nodes: 2
Expected votes: 1
Total votes: 2
Node votes: 1
Quorum: 1  
Active subsystems: 8
Flags: 2node 
Ports Bound: 0  
Node name: an-a04n01.alteeve.ca
Node ID: 1
Multicast addresses: 239.192.23.189 
Node addresses: 10.20.40.1 

[root@an-a04n01 ~]# pcs status
Cluster name: an-anvil-04
Last updated: Wed Jun  4 19:35:15 2014
Last change: Wed May 28 23:56:25 2014 via cibadmin on an-a04n01.alteeve.ca
Stack: cman
Current DC: an-a04n01.alteeve.ca - partition with quorum
Version: 1.1.10-14.el6_5.3-368c726
2 Nodes configured
2 Resources configured


Online: [ an-a04n01.alteeve.ca an-a04n02.alteeve.ca ]

Full list of resources:

 fence_n01_ipmi	(stonith:fence_ipmilan):	Started an-a04n01.alteeve.ca 
 fence_n02_ipmi	(stonith:fence_ipmilan):	Started an-a04n02.alteeve.ca 


[root@an-a04n01 ~]# dlm_tool dump | grep node
1401924707 cluster node 1 added seq 84
1401924707 set_configfs_node 1 10.20.40.1 local 1
1401924707 cluster node 2 added seq 84
1401924707 set_configfs_node 2 10.20.40.2 local 0
1401924707 run protocol from nodeid 2
1401924740 cluster node 2 removed seq 88
1401924740 del_configfs_node rmdir "/sys/kernel/config/dlm/cluster/comms/2"
1401924891 cluster node 2 added seq 92
1401924891 set_configfs_node 2 10.20.40.2 local 0
====

I configured /etc/lvm/lvm.conf on both nodes (lvm.conf.anvil is the original lvm.conf):

====
sed -i.anvil 's^filter = \[ "a/\.\*/" \]^filter = \[ "a|/dev/drbd*|", "r/.*/" \]^' /etc/lvm/lvm.conf
sed -i 's/locking_type = 1$/locking_type = 3/' /etc/lvm/lvm.conf
sed -i 's/fallback_to_local_locking = 1$/fallback_to_local_locking = 0/' /etc/lvm/lvm.conf 
diff -U0 /etc/lvm/lvm.conf.anvil /etc/lvm/lvm.conf

--- /etc/lvm/lvm.conf.anvil	2013-10-30 04:10:42.000000000 -0400
+++ /etc/lvm/lvm.conf	2014-06-04 18:38:15.545166869 -0400
@@ -85 +85 @@
-    filter = [ "a/.*/" ]
+    filter = [ "a|/dev/drbd*|", "r/.*/" ]
@@ -462 +462 @@
-    locking_type = 1
+    locking_type = 3
@@ -478 +478 @@
-    fallback_to_local_locking = 1
+    fallback_to_local_locking = 0
====

This is copied to both nodes and verified on node 2 with a matching 'diff' call.
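The same verification can be scripted. A minimal sketch, assuming nothing beyond POSIX shell and awk; a temporary sample file stands in for /etc/lvm/lvm.conf here, so the path is illustrative:

```shell
# Minimal sketch: extract the two lvm.conf settings clvmd depends on.
# A temporary sample stands in for /etc/lvm/lvm.conf (illustrative).
conf=$(mktemp)
cat > "$conf" <<'EOF'
    locking_type = 3
    fallback_to_local_locking = 0
EOF
# Split each line on '=' and strip spaces to get the effective values.
locking=$(awk -F= '/locking_type/ {gsub(/ /, "", $2); print $2; exit}' "$conf")
fallback=$(awk -F= '/fallback_to_local_locking/ {gsub(/ /, "", $2); print $2; exit}' "$conf")
echo "locking_type=$locking fallback_to_local_locking=$fallback"
rm -f "$conf"
```

On a live node, 'lvm dumpconfig' reports the values LVM actually parsed, which catches typos that a diff against a backup copy cannot.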

Starting 'clvmd' hangs on both nodes:

====
[root@an-a04n02 ~]# /etc/init.d/clvmd start
Starting clvmd: clvmd startup timed out
====

In syslog, I see:

====
Jun  4 19:15:16 an-a04n02 kernel: dlm: Using TCP for communications
Jun  4 19:15:16 an-a04n02 kernel: dlm: connecting to 1
Jun  4 19:18:27 an-a04n02 kernel: INFO: task clvmd:3228 blocked for more than 120 seconds.
Jun  4 19:18:27 an-a04n02 kernel:      Not tainted 2.6.32-431.17.1.el6.x86_64 #1
Jun  4 19:18:27 an-a04n02 kernel: "echo 0 > /proc/sys/kernel/hung_task_timeout_secs" disables this message.
Jun  4 19:18:27 an-a04n02 kernel: clvmd         D 0000000000000006     0  3228      1 0x00000080
Jun  4 19:18:27 an-a04n02 kernel: ffff880339ba5c98 0000000000000086 0000000000000000 ffffffff810699d3
Jun  4 19:18:27 an-a04n02 kernel: ffff880339ba5c38 ffff880339a740b8 0000000000000000 ffff8800282168a8
Jun  4 19:18:27 an-a04n02 kernel: ffff880339a74638 ffff880339ba5fd8 000000000000fbc8 ffff880339a74638
Jun  4 19:18:27 an-a04n02 kernel: Call Trace:
Jun  4 19:18:27 an-a04n02 kernel: [<ffffffff810699d3>] ? dequeue_entity+0x113/0x2e0
Jun  4 19:18:27 an-a04n02 kernel: [<ffffffff81528a95>] schedule_timeout+0x215/0x2e0
Jun  4 19:18:27 an-a04n02 kernel: [<ffffffff81527bfe>] ? thread_return+0x4e/0x760
Jun  4 19:18:27 an-a04n02 kernel: [<ffffffff81285172>] ? kobject_uevent_env+0x202/0x620
Jun  4 19:18:27 an-a04n02 kernel: [<ffffffff81528713>] wait_for_common+0x123/0x180
Jun  4 19:18:27 an-a04n02 kernel: [<ffffffff81061d00>] ? default_wake_function+0x0/0x20
Jun  4 19:18:27 an-a04n02 kernel: [<ffffffff8152882d>] wait_for_completion+0x1d/0x20
Jun  4 19:18:27 an-a04n02 kernel: [<ffffffffa032af79>] dlm_new_lockspace+0x999/0xa30 [dlm]
Jun  4 19:18:27 an-a04n02 kernel: [<ffffffffa0332ff1>] device_write+0x311/0x720 [dlm]
Jun  4 19:18:27 an-a04n02 kernel: [<ffffffff81061d00>] ? default_wake_function+0x0/0x20
Jun  4 19:18:27 an-a04n02 kernel: [<ffffffff8123316b>] ? selinux_file_permission+0xfb/0x150
Jun  4 19:18:27 an-a04n02 kernel: [<ffffffff81226056>] ? security_file_permission+0x16/0x20
Jun  4 19:18:27 an-a04n02 kernel: [<ffffffff81188c38>] vfs_write+0xb8/0x1a0
Jun  4 19:18:27 an-a04n02 kernel: [<ffffffff81189531>] sys_write+0x51/0x90
Jun  4 19:18:27 an-a04n02 kernel: [<ffffffff8100b072>] system_call_fastpath+0x16/0x1b
 
<repeats>
====

If you run '/etc/init.d/clvmd status', it thinks it is running:

====
[root@an-a04n01 ~]# /etc/init.d/clvmd status
clvmd (pid  3307) is running...
Clustered Volume Groups: (none)
Active clustered Logical Volumes: (none)
====

But running 'pvscan' afterward hangs.
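The trace above shows clvmd blocked in dlm_new_lockspace right after "dlm: connecting to 1", i.e. waiting on the inter-node DLM connection. DLM defaults to TCP port 21064, so checking that the port is reachable between the nodes is a quick way to tell whether a firewall is dropping it. A hedged sketch; the peer IP is the one from this report, and the live commands are commented out because they need the actual nodes:

```shell
# Diagnostic sketch: DLM's default inter-node port is TCP 21064. If a
# firewall silently drops it, dlm_new_lockspace blocks and clvmd
# startup times out, matching the trace above.
dlm_port=21064
peer=10.20.40.1   # node 1's cluster IP, taken from this report
echo "would test TCP port $dlm_port on $peer"
# On the live cluster (commented out; requires the real nodes):
#   iptables -L -n | grep "$dlm_port"                          # rule present?
#   nc -z -w 3 "$peer" "$dlm_port" && echo open || echo blocked
```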


Version-Release number of selected component (if applicable):

[root@an-a04n01 ~]# rpm -q lvm2-cluster cman corosync pacemaker
lvm2-cluster-2.02.100-8.el6.x86_64
cman-3.0.12.1-59.el6_5.2.x86_64
corosync-1.4.1-17.el6_5.1.x86_64
pacemaker-1.1.10-14.el6_5.3.x86_64


How reproducible:

100% reproducible, persisting through reboots.


Steps to Reproduce:
1. Set up a basic cluster:

ccs -f /etc/cluster/cluster.conf --createcluster an-anvil-04
ccs -f /etc/cluster/cluster.conf --setcman two_node="1" expected_votes="1"
ccs -f /etc/cluster/cluster.conf --addnode an-a04n01.alteeve.ca
ccs -f /etc/cluster/cluster.conf --addnode an-a04n02.alteeve.ca
ccs -f /etc/cluster/cluster.conf --addfencedev pcmk agent=fence_pcmk 
ccs -f /etc/cluster/cluster.conf --addmethod pcmk-redirect an-a04n01.alteeve.ca
ccs -f /etc/cluster/cluster.conf --addmethod pcmk-redirect an-a04n02.alteeve.ca
ccs -f /etc/cluster/cluster.conf --addfenceinst pcmk an-a04n01.alteeve.ca pcmk-redirect port=an-a04n01.alteeve.ca
ccs -f /etc/cluster/cluster.conf --addfenceinst pcmk an-a04n02.alteeve.ca pcmk-redirect port=an-a04n02.alteeve.ca

2. Start pacemaker
3. Edit lvm.conf as above; verify with pcs, cman_tool, etc. that both nodes are in the cluster.
4. Start clvmd
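As a sanity check between steps 1 and 2, cluster.conf can be validated before anything is started. A sketch, assuming ccs_config_validate from the RHEL 6 cman package is available; the call itself is commented out since it needs a real cluster node:

```shell
# Sketch: validate cluster.conf before starting cman/pacemaker.
conf=/etc/cluster/cluster.conf   # the file written by the ccs commands above
echo "would validate $conf"
# On a cluster node (commented out; ships with RHEL 6 cman):
#   ccs_config_validate && echo "cluster.conf OK"
```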


Actual results:

Starting clvmd hangs


Expected results:

Starting clvmd should not hang


Additional info:

Fully up-to-date RHEL 6.5 install as of the time this ticket was filed, installed from the minimal package set.

I filed this against 'cluster' because I don't see the 'lvm2-cluster' package in the component list.
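Given that the summary points at incorrect firewall rules, the usual remedy on RHEL 6 is to admit the cluster traffic before starting clvmd. Below is a sketch of /etc/sysconfig/iptables entries using the documented RHEL 6 cluster defaults (corosync on UDP 5404-5405, DLM on TCP 21064); the exact rules for this setup may differ, so treat these as illustrative:

```
# Illustrative fragment for /etc/sysconfig/iptables. Ports are the
# documented RHEL 6 cluster defaults; restrict to the cluster subnet
# in a real setup.
# corosync/cman membership traffic:
-A INPUT -p udp --dport 5404:5405 -j ACCEPT
# dlm inter-node traffic (what clvmd is waiting on in this report):
-A INPUT -p tcp --dport 21064 -j ACCEPT
```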

Comment 1 Madison Kelly 2014-06-04 23:55:22 UTC
After cleanly resetting both nodes, I changed lvm.conf to:

====
[root@an-a04n01 ~]# diff -U0 /etc/lvm/lvm.conf.anvil /etc/lvm/lvm.conf
--- /etc/lvm/lvm.conf.anvil	2013-10-30 04:10:42.000000000 -0400
+++ /etc/lvm/lvm.conf	2014-06-04 19:50:14.823159186 -0400
@@ -84,0 +85 @@
+    #filter = [ "a|/dev/drbd*|", "r/.*/" ]
@@ -462 +463 @@
-    locking_type = 1
+    locking_type = 3
====

(only 'locking_type = 3' changed; the filter line is now just a comment). I tried starting clvmd again, and this time it started on 'an-a04n01' but timed out on 'an-a04n02' as it did originally...

====
[root@an-a04n01 ~]# cman_tool status
Version: 6.2.0
Config Version: 9
Cluster Name: an-anvil-04
Cluster Id: 6054
Cluster Member: Yes
Cluster Generation: 92
Membership state: Cluster-Member
Nodes: 2
Expected votes: 1
Total votes: 2
Node votes: 1
Quorum: 1  
Active subsystems: 8
Flags: 2node 
Ports Bound: 0  
Node name: an-a04n01.alteeve.ca
Node ID: 1
Multicast addresses: 239.192.23.189 
Node addresses: 10.20.40.1 

[root@an-a04n01 ~]# vim /etc/lvm/lvm.conf
[root@an-a04n01 ~]# diff -U0 /etc/lvm/lvm.conf.anvil /etc/lvm/lvm.conf
--- /etc/lvm/lvm.conf.anvil	2013-10-30 04:10:42.000000000 -0400
+++ /etc/lvm/lvm.conf	2014-06-04 19:50:14.823159186 -0400
@@ -84,0 +85 @@
+    #filter = [ "a|/dev/drbd*|", "r/.*/" ]
@@ -462 +463 @@
-    locking_type = 1
+    locking_type = 3

[root@an-a04n01 ~]# rsync -av /etc/lvm/lvm.conf root@an-a04n02:/etc/lvm/
sending incremental file list
lvm.conf

sent 1758 bytes  received 421 bytes  1452.67 bytes/sec
total size is 45034  speedup is 20.67
[root@an-a04n01 ~]# /etc/init.d/clvmd status
clvmd is stopped
[root@an-a04n01 ~]# /etc/init.d/clvmd start
Starting clvmd: 
Activating VG(s):   No volume groups found
                                                           [  OK  ]
====

In an-a04n01's syslog:

====
Jun  4 19:50:57 an-a04n01 kernel: dlm: Using TCP for communications
Jun  4 19:50:58 an-a04n01 clvmd: Cluster LVM daemon started - connected to CMAN
Jun  4 19:51:06 an-a04n01 kernel: dlm: connecting to 2
====

I then started clvmd on 'an-a04n02' about 2~3 seconds later:

====
[root@an-a04n02 ~]# diff -U0 /etc/lvm/lvm.conf.anvil /etc/lvm/lvm.conf
--- /etc/lvm/lvm.conf.anvil	2013-10-30 04:10:42.000000000 -0400
+++ /etc/lvm/lvm.conf	2014-06-04 19:50:14.000000000 -0400
@@ -84,0 +85 @@
+    #filter = [ "a|/dev/drbd*|", "r/.*/" ]
@@ -462 +463 @@
-    locking_type = 1
+    locking_type = 3

[root@an-a04n02 ~]# /etc/init.d/clvmd status
clvmd is stopped

[root@an-a04n02 ~]# /etc/init.d/clvmd start
Starting clvmd: clvmd startup timed out
====

In system log:

====
Jun  4 19:51:06 an-a04n02 kernel: dlm: Using TCP for communications
Jun  4 19:51:06 an-a04n02 kernel: dlm: connecting to 1
Jun  4 19:54:04 an-a04n02 kernel: INFO: task clvmd:3156 blocked for more than 120 seconds.
Jun  4 19:54:04 an-a04n02 kernel:      Not tainted 2.6.32-431.17.1.el6.x86_64 #1
Jun  4 19:54:04 an-a04n02 kernel: "echo 0 > /proc/sys/kernel/hung_task_timeout_secs" disables this message.
Jun  4 19:54:04 an-a04n02 kernel: clvmd         D 0000000000000001     0  3156      1 0x00000080
Jun  4 19:54:04 an-a04n02 kernel: ffff880338a07c98 0000000000000082 0000000000000000 0000000000000000
Jun  4 19:54:04 an-a04n02 kernel: 0000000000000000 0000000000000000 0000000000000000 0000000000000000
Jun  4 19:54:04 an-a04n02 kernel: ffff880337aedaf8 ffff880338a07fd8 000000000000fbc8 ffff880337aedaf8
Jun  4 19:54:04 an-a04n02 kernel: Call Trace:
Jun  4 19:54:04 an-a04n02 kernel: [<ffffffff81528a95>] schedule_timeout+0x215/0x2e0
Jun  4 19:54:04 an-a04n02 kernel: [<ffffffff81527bfe>] ? thread_return+0x4e/0x760
Jun  4 19:54:04 an-a04n02 kernel: [<ffffffff81285172>] ? kobject_uevent_env+0x202/0x620
Jun  4 19:54:04 an-a04n02 kernel: [<ffffffff81528713>] wait_for_common+0x123/0x180
Jun  4 19:54:04 an-a04n02 kernel: [<ffffffff81061d00>] ? default_wake_function+0x0/0x20
Jun  4 19:54:04 an-a04n02 kernel: [<ffffffff8152882d>] wait_for_completion+0x1d/0x20
Jun  4 19:54:04 an-a04n02 kernel: [<ffffffffa032af79>] dlm_new_lockspace+0x999/0xa30 [dlm]
Jun  4 19:54:04 an-a04n02 kernel: [<ffffffffa0332ff1>] device_write+0x311/0x720 [dlm]
Jun  4 19:54:04 an-a04n02 kernel: [<ffffffff81061d00>] ? default_wake_function+0x0/0x20
Jun  4 19:54:04 an-a04n02 kernel: [<ffffffff8123316b>] ? selinux_file_permission+0xfb/0x150
Jun  4 19:54:04 an-a04n02 kernel: [<ffffffff81226056>] ? security_file_permission+0x16/0x20
Jun  4 19:54:04 an-a04n02 kernel: [<ffffffff81188c38>] vfs_write+0xb8/0x1a0
Jun  4 19:54:04 an-a04n02 kernel: [<ffffffff81189531>] sys_write+0x51/0x90
Jun  4 19:54:04 an-a04n02 kernel: [<ffffffff810e1abe>] ? __audit_syscall_exit+0x25e/0x290
Jun  4 19:54:04 an-a04n02 kernel: [<ffffffff8100b072>] system_call_fastpath+0x16/0x1b
====

Comment 3 Madison Kelly 2014-06-05 00:04:37 UTC
Running 'pvscan' on 'an-a04n01' shows this:

====
[root@an-a04n01 ~]# pvscan
  clvmd not running on node an-a04n02.alteeve.ca
  Unable to obtain global lock.
====

Trying to stop clvmd on an-a04n02:

====
[root@an-a04n02 ~]# /etc/init.d/clvmd stop
Signaling clvmd to exit                                    [  OK  ]
Waiting for clvmd to exit:                                 [FAILED]
clvmd failed to exit                                       [FAILED]

[root@an-a04n02 ~]# echo $?
1
====

Nothing further was written to syslog (though the 120 second timeout error is still being printed).

Trying to stop pacemaker on an-a04n02 also fails:

====
[root@an-a04n02 ~]# /etc/init.d/pacemaker stop
Waiting for shutdown of managed resources.                 [  OK  ]
DLM lockspace still in use
(shell doesn't return, but does return on ^C)
====

System log shows basically the same thing:

====
Jun  4 19:58:46 an-a04n02 attrd[3075]:   notice: attrd_trigger_update: Sending flush op to all hosts for: standby (true)
Jun  4 19:58:46 an-a04n02 attrd[3075]:   notice: attrd_perform_update: Sent update 10: standby=true
Jun  4 19:58:46 an-a04n02 pacemaker: Waiting for shutdown of managed resources
Jun  4 19:58:46 an-a04n02 crmd[3077]:   notice: process_lrm_event: LRM operation fence_n02_ipmi_stop_0 (call=20, rc=0, cib-update=14, confirmed=true) ok
Jun  4 19:58:48 an-a04n02 pacemaker: DLM lockspace still in use
====

If I crash an-a04n02 with 'echo c > /proc/sysrq-trigger', it gets fenced properly (fence_ipmilan, in case it matters). Then 'pvscan' on 'an-a04n01' works:

====
[root@an-a04n01 ~]# pvscan
  No matching physical volumes found
====

Comment 4 Madison Kelly 2014-06-05 00:14:04 UTC
As per Bob's comments in #linux-cluster, here is the group_tool output from 'an-a04n01' with clvmd started and 'an-a04n02' not yet back in the cluster:

====
[root@an-a04n01 ~]# group_tool status
fence domain
member count  1
victim count  0
victim now    0
master nodeid 1
wait state    none
members       1 

dlm lockspaces
name          clvmd
id            0x4104eefa
flags         0x00000000 
change        member 1 joined 0 remove 1 failed 1 seq 3,3
members       1 

[root@an-a04n01 ~]# group_tool dump

[root@an-a04n01 ~]# group_tool dump dlm
dump dlm
1401924707 logging mode 3 syslog f 160 p 6 logfile p 6 /var/log/cluster/dlm_controld.log
1401924707 dlm_controld 3.0.12.1 started
1401924707 logging mode 3 syslog f 160 p 6 logfile p 6 /var/log/cluster/dlm_controld.log
1401924707 found /dev/misc/dlm-control minor 57
1401924707 found /dev/misc/dlm-monitor minor 56
1401924707 found /dev/misc/dlm_plock minor 55
1401924707 /dev/misc/dlm-monitor fd 12
1401924707 /sys/kernel/config/dlm/cluster/comms: opendir failed: 2
1401924707 /sys/kernel/config/dlm/cluster/spaces: opendir failed: 2
1401924707 cluster node 1 added seq 84
1401924707 set_configfs_node 1 10.20.40.1 local 1
1401924707 cluster node 2 added seq 84
1401924707 set_configfs_node 2 10.20.40.2 local 0
1401924707 totem/rrp_mode = 'none'
1401924707 set protocol 0
1401924707 group_mode 3 compat 0
1401924707 setup_cpg_daemon 15
1401924707 dlm:controld conf 2 1 0 memb 1 2 join 1 left
1401924707 run protocol from nodeid 2
1401924707 daemon run 1.1.1 max 1.1.1 kernel run 1.1.1 max 1.1.1
1401924707 plocks 18
1401924707 plock cpg message size: 104 bytes
1401924709 client connection 5 fd 19
1401924740 cluster node 2 removed seq 88
1401924740 del_configfs_node rmdir "/sys/kernel/config/dlm/cluster/comms/2"
1401924740 dlm:controld conf 1 0 1 memb 1 join left 2
1401924891 cluster node 2 added seq 92
1401924891 set_configfs_node 2 10.20.40.2 local 0
1401924895 dlm:controld conf 2 1 0 memb 1 2 join 2 left
1401925857 uevent: add@/kernel/dlm/clvmd
1401925857 kernel: add@ clvmd
1401925857 uevent: online@/kernel/dlm/clvmd
1401925857 kernel: online@ clvmd
1401925857 dlm:ls:clvmd conf 1 1 0 memb 1 join 1 left
1401925857 clvmd add_change cg 1 joined nodeid 1
1401925857 clvmd add_change cg 1 we joined
1401925857 clvmd add_change cg 1 counts member 1 joined 1 remove 0 failed 0
1401925857 clvmd check_fencing done
1401925857 clvmd check_quorum disabled
1401925857 clvmd check_fs none registered
1401925857 clvmd send_start cg 1 flags 1 data2 0 counts 0 1 1 0 0
1401925857 clvmd receive_start 1:1 len 76
1401925857 clvmd match_change 1:1 matches cg 1
1401925857 clvmd wait_messages cg 1 got all 1
1401925857 clvmd start_kernel cg 1 member_count 1
1401925857 write "1090842362" to "/sys/kernel/dlm/clvmd/id"
1401925857 set_members mkdir "/sys/kernel/config/dlm/cluster/spaces/clvmd/nodes/1"
1401925857 write "1" to "/sys/kernel/dlm/clvmd/control"
1401925857 write "0" to "/sys/kernel/dlm/clvmd/event_done"
1401925857 uevent: add@/devices/virtual/misc/dlm_clvmd
1401925866 dlm:ls:clvmd conf 2 1 0 memb 1 2 join 2 left
1401925866 clvmd add_change cg 2 joined nodeid 2
1401925866 clvmd add_change cg 2 counts member 2 joined 1 remove 0 failed 0
1401925866 clvmd stop_kernel cg 2
1401925866 write "0" to "/sys/kernel/dlm/clvmd/control"
1401925866 clvmd check_fencing done
1401925866 clvmd check_quorum disabled
1401925866 clvmd check_fs none registered
1401925866 clvmd send_start cg 2 flags 2 data2 0 counts 1 2 1 0 0
1401925866 clvmd receive_start 2:1 len 80
1401925866 clvmd match_change 2:1 matches cg 2
1401925866 clvmd wait_messages cg 2 need 1 of 2
1401925866 clvmd receive_start 1:2 len 80
1401925866 clvmd match_change 1:2 matches cg 2
1401925866 clvmd wait_messages cg 2 got all 2
1401925866 clvmd start_kernel cg 2 member_count 2
1401925866 dir_member 1
1401925866 set_members mkdir "/sys/kernel/config/dlm/cluster/spaces/clvmd/nodes/2"
1401925866 write "1" to "/sys/kernel/dlm/clvmd/control"
1401925866 clvmd set_plock_ckpt_node from 1 to 1
1401925866 clvmd unlink ckpt 0
1401925866 clvmd store_plocks r_count 0 p_count 0 total_size 0 max_section_size 0
1401925866 clvmd store_plocks open ckpt handle 6a2342ec00000000
1401925866 clvmd store_plocks first 0 last 0 r_count 0 p_count 0 sig 0
1401925866 clvmd send_plocks_stored cg 2 flags a data2 0 counts 1 2 1 0 0
1401925866 clvmd receive_plocks_stored 1:2 flags a sig 0 need_plocks 0
1401926434 cluster node 2 removed seq 96
1401926434 del_configfs_node rmdir "/sys/kernel/config/dlm/cluster/comms/2"
1401926434 dlm:controld conf 1 0 1 memb 1 join left 2
1401926434 dlm:ls:clvmd conf 1 0 1 memb 1 join left 2
1401926434 clvmd add_change cg 3 remove nodeid 2 reason 3
1401926434 clvmd add_change cg 3 counts member 1 joined 0 remove 1 failed 1
1401926434 clvmd stop_kernel cg 3
1401926434 write "0" to "/sys/kernel/dlm/clvmd/control"
1401926434 clvmd check_fencing 2 wait add 1401925866 fail 1401926434 last 1401924759
1401926451 clvmd check_fencing 2 done add 1401925866 fail 1401926434 last 1401926451
1401926451 clvmd check_fencing done
1401926451 clvmd check_quorum disabled
1401926451 clvmd check_fs none registered
1401926451 clvmd send_start cg 3 flags 2 data2 0 counts 2 1 0 1 1
1401926451 clvmd receive_start 1:3 len 76
1401926451 clvmd match_change 1:3 matches cg 3
1401926451 clvmd wait_messages cg 3 got all 1
1401926451 clvmd start_kernel cg 3 member_count 1
1401926451 dir_member 2
1401926451 dir_member 1
1401926451 set_members rmdir "/sys/kernel/config/dlm/cluster/spaces/clvmd/nodes/2"
1401926451 write "1" to "/sys/kernel/dlm/clvmd/control"
====

After starting 'an-a04n02':

====
[root@an-a04n01 ~]# pcs status
Cluster name: an-anvil-04
Last updated: Wed Jun  4 20:08:03 2014
Last change: Wed May 28 23:56:25 2014 via cibadmin on an-a04n01.alteeve.ca
Stack: cman
Current DC: an-a04n01.alteeve.ca - partition with quorum
Version: 1.1.10-14.el6_5.3-368c726
2 Nodes configured
2 Resources configured


Online: [ an-a04n01.alteeve.ca an-a04n02.alteeve.ca ]

Full list of resources:

 fence_n01_ipmi	(stonith:fence_ipmilan):	Started an-a04n01.alteeve.ca 
 fence_n02_ipmi	(stonith:fence_ipmilan):	Started an-a04n02.alteeve.ca 

[root@an-a04n01 ~]# cman_tool status
Version: 6.2.0
Config Version: 9
Cluster Name: an-anvil-04
Cluster Id: 6054
Cluster Member: Yes
Cluster Generation: 100
Membership state: Cluster-Member
Nodes: 2
Expected votes: 1
Total votes: 2
Node votes: 1
Quorum: 1  
Active subsystems: 9
Flags: 2node 
Ports Bound: 0 11  
Node name: an-a04n01.alteeve.ca
Node ID: 1
Multicast addresses: 239.192.23.189 
Node addresses: 10.20.40.1 

[root@an-a04n01 ~]# group_tool status
fence domain
member count  2
victim count  0
victim now    0
master nodeid 1
wait state    none
members       1 2 

dlm lockspaces
name          clvmd
id            0x4104eefa
flags         0x00000000 
change        member 1 joined 0 remove 1 failed 1 seq 3,3
members       1 

[root@an-a04n01 ~]# group_tool dump

[root@an-a04n01 ~]# group_tool dump dlm
dump dlm
1401924707 logging mode 3 syslog f 160 p 6 logfile p 6 /var/log/cluster/dlm_controld.log
1401924707 dlm_controld 3.0.12.1 started
1401924707 logging mode 3 syslog f 160 p 6 logfile p 6 /var/log/cluster/dlm_controld.log
1401924707 found /dev/misc/dlm-control minor 57
1401924707 found /dev/misc/dlm-monitor minor 56
1401924707 found /dev/misc/dlm_plock minor 55
1401924707 /dev/misc/dlm-monitor fd 12
1401924707 /sys/kernel/config/dlm/cluster/comms: opendir failed: 2
1401924707 /sys/kernel/config/dlm/cluster/spaces: opendir failed: 2
1401924707 cluster node 1 added seq 84
1401924707 set_configfs_node 1 10.20.40.1 local 1
1401924707 cluster node 2 added seq 84
1401924707 set_configfs_node 2 10.20.40.2 local 0
1401924707 totem/rrp_mode = 'none'
1401924707 set protocol 0
1401924707 group_mode 3 compat 0
1401924707 setup_cpg_daemon 15
1401924707 dlm:controld conf 2 1 0 memb 1 2 join 1 left
1401924707 run protocol from nodeid 2
1401924707 daemon run 1.1.1 max 1.1.1 kernel run 1.1.1 max 1.1.1
1401924707 plocks 18
1401924707 plock cpg message size: 104 bytes
1401924709 client connection 5 fd 19
1401924740 cluster node 2 removed seq 88
1401924740 del_configfs_node rmdir "/sys/kernel/config/dlm/cluster/comms/2"
1401924740 dlm:controld conf 1 0 1 memb 1 join left 2
1401924891 cluster node 2 added seq 92
1401924891 set_configfs_node 2 10.20.40.2 local 0
1401924895 dlm:controld conf 2 1 0 memb 1 2 join 2 left
1401925857 uevent: add@/kernel/dlm/clvmd
1401925857 kernel: add@ clvmd
1401925857 uevent: online@/kernel/dlm/clvmd
1401925857 kernel: online@ clvmd
1401925857 dlm:ls:clvmd conf 1 1 0 memb 1 join 1 left
1401925857 clvmd add_change cg 1 joined nodeid 1
1401925857 clvmd add_change cg 1 we joined
1401925857 clvmd add_change cg 1 counts member 1 joined 1 remove 0 failed 0
1401925857 clvmd check_fencing done
1401925857 clvmd check_quorum disabled
1401925857 clvmd check_fs none registered
1401925857 clvmd send_start cg 1 flags 1 data2 0 counts 0 1 1 0 0
1401925857 clvmd receive_start 1:1 len 76
1401925857 clvmd match_change 1:1 matches cg 1
1401925857 clvmd wait_messages cg 1 got all 1
1401925857 clvmd start_kernel cg 1 member_count 1
1401925857 write "1090842362" to "/sys/kernel/dlm/clvmd/id"
1401925857 set_members mkdir "/sys/kernel/config/dlm/cluster/spaces/clvmd/nodes/1"
1401925857 write "1" to "/sys/kernel/dlm/clvmd/control"
1401925857 write "0" to "/sys/kernel/dlm/clvmd/event_done"
1401925857 uevent: add@/devices/virtual/misc/dlm_clvmd
1401925866 dlm:ls:clvmd conf 2 1 0 memb 1 2 join 2 left
1401925866 clvmd add_change cg 2 joined nodeid 2
1401925866 clvmd add_change cg 2 counts member 2 joined 1 remove 0 failed 0
1401925866 clvmd stop_kernel cg 2
1401925866 write "0" to "/sys/kernel/dlm/clvmd/control"
1401925866 clvmd check_fencing done
1401925866 clvmd check_quorum disabled
1401925866 clvmd check_fs none registered
1401925866 clvmd send_start cg 2 flags 2 data2 0 counts 1 2 1 0 0
1401925866 clvmd receive_start 2:1 len 80
1401925866 clvmd match_change 2:1 matches cg 2
1401925866 clvmd wait_messages cg 2 need 1 of 2
1401925866 clvmd receive_start 1:2 len 80
1401925866 clvmd match_change 1:2 matches cg 2
1401925866 clvmd wait_messages cg 2 got all 2
1401925866 clvmd start_kernel cg 2 member_count 2
1401925866 dir_member 1
1401925866 set_members mkdir "/sys/kernel/config/dlm/cluster/spaces/clvmd/nodes/2"
1401925866 write "1" to "/sys/kernel/dlm/clvmd/control"
1401925866 clvmd set_plock_ckpt_node from 1 to 1
1401925866 clvmd unlink ckpt 0
1401925866 clvmd store_plocks r_count 0 p_count 0 total_size 0 max_section_size 0
1401925866 clvmd store_plocks open ckpt handle 6a2342ec00000000
1401925866 clvmd store_plocks first 0 last 0 r_count 0 p_count 0 sig 0
1401925866 clvmd send_plocks_stored cg 2 flags a data2 0 counts 1 2 1 0 0
1401925866 clvmd receive_plocks_stored 1:2 flags a sig 0 need_plocks 0
1401926434 cluster node 2 removed seq 96
1401926434 del_configfs_node rmdir "/sys/kernel/config/dlm/cluster/comms/2"
1401926434 dlm:controld conf 1 0 1 memb 1 join left 2
1401926434 dlm:ls:clvmd conf 1 0 1 memb 1 join left 2
1401926434 clvmd add_change cg 3 remove nodeid 2 reason 3
1401926434 clvmd add_change cg 3 counts member 1 joined 0 remove 1 failed 1
1401926434 clvmd stop_kernel cg 3
1401926434 write "0" to "/sys/kernel/dlm/clvmd/control"
1401926434 clvmd check_fencing 2 wait add 1401925866 fail 1401926434 last 1401924759
1401926451 clvmd check_fencing 2 done add 1401925866 fail 1401926434 last 1401926451
1401926451 clvmd check_fencing done
1401926451 clvmd check_quorum disabled
1401926451 clvmd check_fs none registered
1401926451 clvmd send_start cg 3 flags 2 data2 0 counts 2 1 0 1 1
1401926451 clvmd receive_start 1:3 len 76
1401926451 clvmd match_change 1:3 matches cg 3
1401926451 clvmd wait_messages cg 3 got all 1
1401926451 clvmd start_kernel cg 3 member_count 1
1401926451 dir_member 2
1401926451 dir_member 1
1401926451 set_members rmdir "/sys/kernel/config/dlm/cluster/spaces/clvmd/nodes/2"
1401926451 write "1" to "/sys/kernel/dlm/clvmd/control"
1401926860 cluster node 2 added seq 100
1401926860 set_configfs_node 2 10.20.40.2 local 0
1401926864 dlm:controld conf 2 1 0 memb 1 2 join 2 left
====

Same commands from an-a04n02:

====
[root@an-a04n02 ~]# pcs status
Cluster name: an-anvil-04
Last updated: Wed Jun  4 20:10:15 2014
Last change: Wed May 28 23:56:25 2014 via cibadmin on an-a04n01.alteeve.ca
Stack: cman
Current DC: an-a04n01.alteeve.ca - partition with quorum
Version: 1.1.10-14.el6_5.3-368c726
2 Nodes configured
2 Resources configured


Online: [ an-a04n01.alteeve.ca an-a04n02.alteeve.ca ]

Full list of resources:

 fence_n01_ipmi	(stonith:fence_ipmilan):	Started an-a04n01.alteeve.ca 
 fence_n02_ipmi	(stonith:fence_ipmilan):	Started an-a04n02.alteeve.ca 

[root@an-a04n02 ~]# cman_tool status
Version: 6.2.0
Config Version: 9
Cluster Name: an-anvil-04
Cluster Id: 6054
Cluster Member: Yes
Cluster Generation: 100
Membership state: Cluster-Member
Nodes: 2
Expected votes: 1
Total votes: 2
Node votes: 1
Quorum: 1  
Active subsystems: 8
Flags: 2node 
Ports Bound: 0  
Node name: an-a04n02.alteeve.ca
Node ID: 2
Multicast addresses: 239.192.23.189 
Node addresses: 10.20.40.2 

[root@an-a04n02 ~]# group_tool status
fence domain
member count  2
victim count  0
victim now    0
master nodeid 1
wait state    none
members       1 2 

[root@an-a04n02 ~]# group_tool dump

[root@an-a04n02 ~]# group_tool dump dlm
dump dlm
1401926863 logging mode 3 syslog f 160 p 6 logfile p 6 /var/log/cluster/dlm_controld.log
1401926863 dlm_controld 3.0.12.1 started
1401926864 logging mode 3 syslog f 160 p 6 logfile p 6 /var/log/cluster/dlm_controld.log
1401926864 found /dev/misc/dlm-control minor 57
1401926864 found /dev/misc/dlm-monitor minor 56
1401926864 found /dev/misc/dlm_plock minor 55
1401926864 /dev/misc/dlm-monitor fd 12
1401926864 /sys/kernel/config/dlm/cluster/comms: opendir failed: 2
1401926864 /sys/kernel/config/dlm/cluster/spaces: opendir failed: 2
1401926864 cluster node 1 added seq 100
1401926864 set_configfs_node 1 10.20.40.1 local 0
1401926864 cluster node 2 added seq 100
1401926864 set_configfs_node 2 10.20.40.2 local 1
1401926864 totem/rrp_mode = 'none'
1401926864 set protocol 0
1401926864 group_mode 3 compat 0
1401926864 setup_cpg_daemon 15
1401926864 dlm:controld conf 2 1 0 memb 1 2 join 2 left
1401926864 run protocol from nodeid 1
1401926864 daemon run 1.1.1 max 1.1.1 kernel run 1.1.1 max 1.1.1
1401926864 plocks 18
1401926864 plock cpg message size: 104 bytes
1401926865 client connection 5 fd 19
====

Now starting 'clvmd' on an-a04n02:

====
[root@an-a04n02 ~]# /etc/init.d/clvmd start
Starting clvmd: clvmd startup timed out
====


Same group_tool calls from an-a04n01:

====
[root@an-a04n01 ~]# group_tool status
fence domain
member count  2
victim count  0
victim now    0
master nodeid 1
wait state    none
members       1 2 

dlm lockspaces
name          clvmd
id            0x4104eefa
flags         0x00000000 
change        member 2 joined 1 remove 0 failed 0 seq 4,4
members       1 2 

[root@an-a04n01 ~]# group_tool dump

[root@an-a04n01 ~]# group_tool dump dlm
dump dlm
1401924707 logging mode 3 syslog f 160 p 6 logfile p 6 /var/log/cluster/dlm_controld.log
1401924707 dlm_controld 3.0.12.1 started
1401924707 logging mode 3 syslog f 160 p 6 logfile p 6 /var/log/cluster/dlm_controld.log
1401924707 found /dev/misc/dlm-control minor 57
1401924707 found /dev/misc/dlm-monitor minor 56
1401924707 found /dev/misc/dlm_plock minor 55
1401924707 /dev/misc/dlm-monitor fd 12
1401924707 /sys/kernel/config/dlm/cluster/comms: opendir failed: 2
1401924707 /sys/kernel/config/dlm/cluster/spaces: opendir failed: 2
1401924707 cluster node 1 added seq 84
1401924707 set_configfs_node 1 10.20.40.1 local 1
1401924707 cluster node 2 added seq 84
1401924707 set_configfs_node 2 10.20.40.2 local 0
1401924707 totem/rrp_mode = 'none'
1401924707 set protocol 0
1401924707 group_mode 3 compat 0
1401924707 setup_cpg_daemon 15
1401924707 dlm:controld conf 2 1 0 memb 1 2 join 1 left
1401924707 run protocol from nodeid 2
1401924707 daemon run 1.1.1 max 1.1.1 kernel run 1.1.1 max 1.1.1
1401924707 plocks 18
1401924707 plock cpg message size: 104 bytes
1401924709 client connection 5 fd 19
1401924740 cluster node 2 removed seq 88
1401924740 del_configfs_node rmdir "/sys/kernel/config/dlm/cluster/comms/2"
1401924740 dlm:controld conf 1 0 1 memb 1 join left 2
1401924891 cluster node 2 added seq 92
1401924891 set_configfs_node 2 10.20.40.2 local 0
1401924895 dlm:controld conf 2 1 0 memb 1 2 join 2 left
1401925857 uevent: add@/kernel/dlm/clvmd
1401925857 kernel: add@ clvmd
1401925857 uevent: online@/kernel/dlm/clvmd
1401925857 kernel: online@ clvmd
1401925857 dlm:ls:clvmd conf 1 1 0 memb 1 join 1 left
1401925857 clvmd add_change cg 1 joined nodeid 1
1401925857 clvmd add_change cg 1 we joined
1401925857 clvmd add_change cg 1 counts member 1 joined 1 remove 0 failed 0
1401925857 clvmd check_fencing done
1401925857 clvmd check_quorum disabled
1401925857 clvmd check_fs none registered
1401925857 clvmd send_start cg 1 flags 1 data2 0 counts 0 1 1 0 0
1401925857 clvmd receive_start 1:1 len 76
1401925857 clvmd match_change 1:1 matches cg 1
1401925857 clvmd wait_messages cg 1 got all 1
1401925857 clvmd start_kernel cg 1 member_count 1
1401925857 write "1090842362" to "/sys/kernel/dlm/clvmd/id"
1401925857 set_members mkdir "/sys/kernel/config/dlm/cluster/spaces/clvmd/nodes/1"
1401925857 write "1" to "/sys/kernel/dlm/clvmd/control"
1401925857 write "0" to "/sys/kernel/dlm/clvmd/event_done"
1401925857 uevent: add@/devices/virtual/misc/dlm_clvmd
1401925866 dlm:ls:clvmd conf 2 1 0 memb 1 2 join 2 left
1401925866 clvmd add_change cg 2 joined nodeid 2
1401925866 clvmd add_change cg 2 counts member 2 joined 1 remove 0 failed 0
1401925866 clvmd stop_kernel cg 2
1401925866 write "0" to "/sys/kernel/dlm/clvmd/control"
1401925866 clvmd check_fencing done
1401925866 clvmd check_quorum disabled
1401925866 clvmd check_fs none registered
1401925866 clvmd send_start cg 2 flags 2 data2 0 counts 1 2 1 0 0
1401925866 clvmd receive_start 2:1 len 80
1401925866 clvmd match_change 2:1 matches cg 2
1401925866 clvmd wait_messages cg 2 need 1 of 2
1401925866 clvmd receive_start 1:2 len 80
1401925866 clvmd match_change 1:2 matches cg 2
1401925866 clvmd wait_messages cg 2 got all 2
1401925866 clvmd start_kernel cg 2 member_count 2
1401925866 dir_member 1
1401925866 set_members mkdir "/sys/kernel/config/dlm/cluster/spaces/clvmd/nodes/2"
1401925866 write "1" to "/sys/kernel/dlm/clvmd/control"
1401925866 clvmd set_plock_ckpt_node from 1 to 1
1401925866 clvmd unlink ckpt 0
1401925866 clvmd store_plocks r_count 0 p_count 0 total_size 0 max_section_size 0
1401925866 clvmd store_plocks open ckpt handle 6a2342ec00000000
1401925866 clvmd store_plocks first 0 last 0 r_count 0 p_count 0 sig 0
1401925866 clvmd send_plocks_stored cg 2 flags a data2 0 counts 1 2 1 0 0
1401925866 clvmd receive_plocks_stored 1:2 flags a sig 0 need_plocks 0
1401926434 cluster node 2 removed seq 96
1401926434 del_configfs_node rmdir "/sys/kernel/config/dlm/cluster/comms/2"
1401926434 dlm:controld conf 1 0 1 memb 1 join left 2
1401926434 dlm:ls:clvmd conf 1 0 1 memb 1 join left 2
1401926434 clvmd add_change cg 3 remove nodeid 2 reason 3
1401926434 clvmd add_change cg 3 counts member 1 joined 0 remove 1 failed 1
1401926434 clvmd stop_kernel cg 3
1401926434 write "0" to "/sys/kernel/dlm/clvmd/control"
1401926434 clvmd check_fencing 2 wait add 1401925866 fail 1401926434 last 1401924759
1401926451 clvmd check_fencing 2 done add 1401925866 fail 1401926434 last 1401926451
1401926451 clvmd check_fencing done
1401926451 clvmd check_quorum disabled
1401926451 clvmd check_fs none registered
1401926451 clvmd send_start cg 3 flags 2 data2 0 counts 2 1 0 1 1
1401926451 clvmd receive_start 1:3 len 76
1401926451 clvmd match_change 1:3 matches cg 3
1401926451 clvmd wait_messages cg 3 got all 1
1401926451 clvmd start_kernel cg 3 member_count 1
1401926451 dir_member 2
1401926451 dir_member 1
1401926451 set_members rmdir "/sys/kernel/config/dlm/cluster/spaces/clvmd/nodes/2"
1401926451 write "1" to "/sys/kernel/dlm/clvmd/control"
1401926860 cluster node 2 added seq 100
1401926860 set_configfs_node 2 10.20.40.2 local 0
1401926864 dlm:controld conf 2 1 0 memb 1 2 join 2 left
1401927111 dlm:ls:clvmd conf 2 1 0 memb 1 2 join 2 left
1401927111 clvmd add_change cg 4 joined nodeid 2
1401927111 clvmd add_change cg 4 counts member 2 joined 1 remove 0 failed 0
1401927111 clvmd stop_kernel cg 4
1401927111 write "0" to "/sys/kernel/dlm/clvmd/control"
1401927111 clvmd check_fencing done
1401927111 clvmd check_quorum disabled
1401927111 clvmd check_fs none registered
1401927111 clvmd send_start cg 4 flags 2 data2 0 counts 3 2 1 0 0
1401927111 clvmd receive_start 2:1 len 80
1401927111 clvmd match_change 2:1 matches cg 4
1401927111 clvmd wait_messages cg 4 need 1 of 2
1401927111 clvmd receive_start 1:4 len 80
1401927111 clvmd match_change 1:4 matches cg 4
1401927111 clvmd wait_messages cg 4 got all 2
1401927111 clvmd start_kernel cg 4 member_count 2
1401927111 dir_member 1
1401927111 set_members mkdir "/sys/kernel/config/dlm/cluster/spaces/clvmd/nodes/2"
1401927111 write "1" to "/sys/kernel/dlm/clvmd/control"
1401927111 clvmd set_plock_ckpt_node from 1 to 1
1401927111 clvmd store_plocks saved ckpt uptodate
1401927111 clvmd store_plocks first 0 last 0 r_count 0 p_count 0 sig 0
1401927111 clvmd send_plocks_stored cg 4 flags a data2 0 counts 3 2 1 0 0
1401927111 clvmd receive_plocks_stored 1:4 flags a sig 0 need_plocks 0
====

and an-a04n02:

====
[root@an-a04n02 ~]# group_tool status
fence domain
member count  2
victim count  0
victim now    0
master nodeid 1
wait state    none
members       1 2 

dlm lockspaces
name          clvmd
id            0x4104eefa
flags         0x00000000 
change        member 2 joined 1 remove 0 failed 0 seq 1,1
members       1 2 

[root@an-a04n02 ~]# group_tool dump

[root@an-a04n02 ~]# group_tool dump dlm
dump dlm
1401926863 logging mode 3 syslog f 160 p 6 logfile p 6 /var/log/cluster/dlm_controld.log
1401926863 dlm_controld 3.0.12.1 started
1401926864 logging mode 3 syslog f 160 p 6 logfile p 6 /var/log/cluster/dlm_controld.log
1401926864 found /dev/misc/dlm-control minor 57
1401926864 found /dev/misc/dlm-monitor minor 56
1401926864 found /dev/misc/dlm_plock minor 55
1401926864 /dev/misc/dlm-monitor fd 12
1401926864 /sys/kernel/config/dlm/cluster/comms: opendir failed: 2
1401926864 /sys/kernel/config/dlm/cluster/spaces: opendir failed: 2
1401926864 cluster node 1 added seq 100
1401926864 set_configfs_node 1 10.20.40.1 local 0
1401926864 cluster node 2 added seq 100
1401926864 set_configfs_node 2 10.20.40.2 local 1
1401926864 totem/rrp_mode = 'none'
1401926864 set protocol 0
1401926864 group_mode 3 compat 0
1401926864 setup_cpg_daemon 15
1401926864 dlm:controld conf 2 1 0 memb 1 2 join 2 left
1401926864 run protocol from nodeid 1
1401926864 daemon run 1.1.1 max 1.1.1 kernel run 1.1.1 max 1.1.1
1401926864 plocks 18
1401926864 plock cpg message size: 104 bytes
1401926865 client connection 5 fd 19
1401927111 uevent: add@/kernel/dlm/clvmd
1401927111 kernel: add@ clvmd
1401927111 uevent: online@/kernel/dlm/clvmd
1401927111 kernel: online@ clvmd
1401927111 dlm:ls:clvmd conf 2 1 0 memb 1 2 join 2 left
1401927111 clvmd add_change cg 1 joined nodeid 2
1401927111 clvmd add_change cg 1 we joined
1401927111 clvmd add_change cg 1 counts member 2 joined 1 remove 0 failed 0
1401927111 clvmd check_fencing done
1401927111 clvmd check_quorum disabled
1401927111 clvmd check_fs none registered
1401927111 clvmd send_start cg 1 flags 1 data2 0 counts 0 2 1 0 0
1401927111 clvmd receive_start 2:1 len 80
1401927111 clvmd match_change 2:1 matches cg 1
1401927111 clvmd wait_messages cg 1 need 1 of 2
1401927111 clvmd receive_start 1:4 len 80
1401927111 clvmd match_change 1:4 matches cg 1
1401927111 clvmd wait_messages cg 1 got all 2
1401927111 clvmd start_kernel cg 1 member_count 2
1401927111 write "1090842362" to "/sys/kernel/dlm/clvmd/id"
1401927111 set_members mkdir "/sys/kernel/config/dlm/cluster/spaces/clvmd/nodes/1"
1401927111 set_members mkdir "/sys/kernel/config/dlm/cluster/spaces/clvmd/nodes/2"
1401927111 write "1" to "/sys/kernel/dlm/clvmd/control"
1401927111 write "0" to "/sys/kernel/dlm/clvmd/event_done"
1401927111 clvmd set_plock_ckpt_node from 0 to 1
1401927111 clvmd receive_plocks_stored 1:4 flags a sig 0 need_plocks 1
1401927111 clvmd match_change 1:4 matches cg 1
1401927111 clvmd retrieve_plocks
1401927111 clvmd retrieve_plocks first 0 last 0 r_count 0 p_count 0 sig 0
====

Comment 5 Madison Kelly 2014-06-05 00:49:58 UTC
Here are the cluster configs:

====
<cluster config_version="9" name="an-anvil-04">
  <fence_daemon/>
  <clusternodes>
    <clusternode name="an-a04n01.alteeve.ca" nodeid="1">
      <fence>
        <method name="pcmk-redirect">
          <device name="pcmk" port="an-a04n01.alteeve.ca"/>
        </method>
      </fence>
    </clusternode>
    <clusternode name="an-a04n02.alteeve.ca" nodeid="2">
      <fence>
        <method name="pcmk-redirect">
          <device name="pcmk" port="an-a04n02.alteeve.ca"/>
        </method>
      </fence>
    </clusternode>
  </clusternodes>
  <cman expected_votes="1" two_node="1"/>
  <fencedevices>
    <fencedevice agent="fence_pcmk" name="pcmk"/>
  </fencedevices>
  <rm>
    <failoverdomains/>
    <resources/>
  </rm>
</cluster>
====

====
[root@an-a04n01 ~]# pcs config
Cluster Name: an-anvil-04
Corosync Nodes:
 
Pacemaker Nodes:
 an-a04n01.alteeve.ca an-a04n02.alteeve.ca 

Resources: 

Stonith Devices: 
 Resource: fence_n01_ipmi (class=stonith type=fence_ipmilan)
  Attributes: pcmk_host_list=an-a04n01.alteeve.ca ipaddr=an-a04n01.ipmi action=reboot login=admin passwd=Initial1 delay=15 
  Operations: monitor interval=60s (fence_n01_ipmi-monitor-interval-60s)
 Resource: fence_n02_ipmi (class=stonith type=fence_ipmilan)
  Attributes: pcmk_host_list=an-a04n02.alteeve.ca ipaddr=an-a04n02.ipmi action=reboot login=admin passwd=Initial1 
  Operations: monitor interval=60s (fence_n02_ipmi-monitor-interval-60s)
Fencing Levels: 

Location Constraints:
Ordering Constraints:
Colocation Constraints:

Cluster Properties:
 cluster-infrastructure: cman
 dc-version: 1.1.10-14.el6_5.3-368c726
 expected-quorum-votes: 2
 stonith-enabled: true
====

Comment 6 Madison Kelly 2014-06-05 01:02:45 UTC
Created attachment 902376 [details]
crm_report from the clean start of pacemaker

This report starts with the first log entry from starting pacemaker (with clvmd turned off in chkconfig, since enabling it caused pacemaker to hang on start).

Comment 7 Madison Kelly 2014-06-05 01:04:03 UTC
Created attachment 902377 [details]
crm_report after starting clvmd and it hanging

This report starts where the last attachment ends. It covers the failed attempt to start clvmd, and ends after both nodes printed a 'hung process' timeout message.

Comment 8 Christine Caulfield 2014-06-05 14:41:24 UTC
I can't reproduce this on my cluster here - but it looks to be in the DLM new lockspace code somewhere. 


Can you get a sysrq-t from the machine after it's hung, please, so I can see what sort of state the DLM processes are in?
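For reference, a sysrq-t dump can be captured on the hung node like this (a generic sketch; requires root, and the /tmp path is only an example):

```shell
# allow sysrq triggers in case the sysctl has them disabled
echo 1 > /proc/sys/kernel/sysrq
# 't' dumps the state and stack trace of every task to the kernel log
echo t > /proc/sysrq-trigger
# collect the backtraces from the kernel ring buffer
dmesg > /tmp/sysrq-t.txt
```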

Comment 9 Madison Kelly 2014-06-05 18:00:47 UTC
I changed cluster.conf to:

====
<cluster config_version="10" name="an-anvil-04">
  <fence_daemon/>
  <clusternodes>
    <clusternode name="an-a04n01.alteeve.ca" nodeid="1">
      <fence>
        <method name="pcmk-redirect">
          <device name="pcmk" port="an-a04n01.alteeve.ca"/>
        </method>
      </fence>
    </clusternode>
    <clusternode name="an-a04n02.alteeve.ca" nodeid="2">
      <fence>
        <method name="pcmk-redirect">
          <device name="pcmk" port="an-a04n02.alteeve.ca"/>
        </method>
      </fence>
    </clusternode>
  </clusternodes>
  <cman expected_votes="1" two_node="1"/>
  <fencedevices>
    <fencedevice agent="fence_pcmk" name="pcmk"/>
  </fencedevices>
  <logging>
    <logging_daemon name="dlm_controld" debug="on"/>
    <logging_daemon name="groupd" debug="on"/>
    <logging_daemon name="fenced" debug="on"/>
  </logging>
  <rm>
    <failoverdomains/>
    <resources/>
  </rm>
</cluster>
====

I pushed it out and enabled clvmd to start on boot so that pacemaker would start it, then started pacemaker on both nodes:

====
[root@an-a04n01 ~]# /etc/init.d/pacemaker start
Starting cluster: 
   Checking if cluster has been disabled at boot...        [  OK  ]
   Checking Network Manager...                             [  OK  ]
   Global setup...                                         [  OK  ]
   Loading kernel modules...                               [  OK  ]
   Mounting configfs...                                    [  OK  ]
   Starting cman...                                        [  OK  ]
   Waiting for quorum...                                   [  OK  ]
   Starting fenced...                                      [  OK  ]
   Starting dlm_controld...                                [  OK  ]
   Tuning DLM kernel config...                             [  OK  ]
   Starting gfs_controld...                                [  OK  ]
   Unfencing self...                                       [  OK  ]
   Joining fence domain...                                 [  OK  ]
Starting clvmd: 
Activating VG(s):   No volume groups found
                                                           [  OK  ]
Starting Pacemaker Cluster Manager                         [  OK  ]
====
Jun  5 13:55:10 an-a04n01 kernel: DLM (built Apr 11 2014 17:28:07) installed
Jun  5 13:55:10 an-a04n01 corosync[2885]:   [MAIN  ] Corosync Cluster Engine ('1.4.1'): started and ready to provide service.
Jun  5 13:55:10 an-a04n01 corosync[2885]:   [MAIN  ] Corosync built-in features: nss dbus rdma snmp
Jun  5 13:55:10 an-a04n01 corosync[2885]:   [MAIN  ] Successfully read config from /etc/cluster/cluster.conf
Jun  5 13:55:10 an-a04n01 corosync[2885]:   [MAIN  ] Successfully parsed cman config
Jun  5 13:55:10 an-a04n01 corosync[2885]:   [TOTEM ] Initializing transport (UDP/IP Multicast).
Jun  5 13:55:10 an-a04n01 corosync[2885]:   [TOTEM ] Initializing transmit/receive security: libtomcrypt SOBER128/SHA1HMAC (mode 0).
Jun  5 13:55:10 an-a04n01 corosync[2885]:   [TOTEM ] The network interface [10.20.40.1] is now up.
Jun  5 13:55:11 an-a04n01 corosync[2885]:   [QUORUM] Using quorum provider quorum_cman
Jun  5 13:55:11 an-a04n01 corosync[2885]:   [SERV  ] Service engine loaded: corosync cluster quorum service v0.1
Jun  5 13:55:11 an-a04n01 corosync[2885]:   [CMAN  ] CMAN 3.0.12.1 (built Apr  3 2014 05:12:26) started
Jun  5 13:55:11 an-a04n01 corosync[2885]:   [SERV  ] Service engine loaded: corosync CMAN membership service 2.90
Jun  5 13:55:11 an-a04n01 corosync[2885]:   [SERV  ] Service engine loaded: openais checkpoint service B.01.01
Jun  5 13:55:11 an-a04n01 corosync[2885]:   [SERV  ] Service engine loaded: corosync extended virtual synchrony service
Jun  5 13:55:11 an-a04n01 corosync[2885]:   [SERV  ] Service engine loaded: corosync configuration service
Jun  5 13:55:11 an-a04n01 corosync[2885]:   [SERV  ] Service engine loaded: corosync cluster closed process group service v1.01
Jun  5 13:55:11 an-a04n01 corosync[2885]:   [SERV  ] Service engine loaded: corosync cluster config database access v1.01
Jun  5 13:55:11 an-a04n01 corosync[2885]:   [SERV  ] Service engine loaded: corosync profile loading service
Jun  5 13:55:11 an-a04n01 corosync[2885]:   [QUORUM] Using quorum provider quorum_cman
Jun  5 13:55:11 an-a04n01 corosync[2885]:   [SERV  ] Service engine loaded: corosync cluster quorum service v0.1
Jun  5 13:55:11 an-a04n01 corosync[2885]:   [MAIN  ] Compatibility mode set to whitetank.  Using V1 and V2 of the synchronization engine.
Jun  5 13:55:11 an-a04n01 corosync[2885]:   [TOTEM ] A processor joined or left the membership and a new membership was formed.
Jun  5 13:55:11 an-a04n01 corosync[2885]:   [CMAN  ] quorum regained, resuming activity
Jun  5 13:55:11 an-a04n01 corosync[2885]:   [QUORUM] This node is within the primary component and will provide service.
Jun  5 13:55:11 an-a04n01 corosync[2885]:   [QUORUM] Members[1]: 1
Jun  5 13:55:11 an-a04n01 corosync[2885]:   [QUORUM] Members[1]: 1
Jun  5 13:55:11 an-a04n01 corosync[2885]:   [CPG   ] chosen downlist: sender r(0) ip(10.20.40.1) ; members(old:0 left:0)
Jun  5 13:55:11 an-a04n01 corosync[2885]:   [MAIN  ] Completed service synchronization, ready to provide service.
Jun  5 13:55:12 an-a04n01 corosync[2885]:   [TOTEM ] A processor joined or left the membership and a new membership was formed.
Jun  5 13:55:12 an-a04n01 corosync[2885]:   [QUORUM] Members[2]: 1 2
Jun  5 13:55:12 an-a04n01 corosync[2885]:   [QUORUM] Members[2]: 1 2
Jun  5 13:55:12 an-a04n01 corosync[2885]:   [CPG   ] chosen downlist: sender r(0) ip(10.20.40.1) ; members(old:1 left:0)
Jun  5 13:55:12 an-a04n01 corosync[2885]:   [MAIN  ] Completed service synchronization, ready to provide service.
Jun  5 13:55:14 an-a04n01 fenced[2940]: fenced 3.0.12.1 started
Jun  5 13:55:14 an-a04n01 dlm_controld[2964]: dlm_controld 3.0.12.1 started
Jun  5 13:55:16 an-a04n01 gfs_controld[3011]: gfs_controld 3.0.12.1 started
Jun  5 13:55:17 an-a04n01 pacemaker: Attempting to start clvmd
Jun  5 13:55:18 an-a04n01 kernel: dlm: Using TCP for communications
Jun  5 13:55:19 an-a04n01 clvmd: Cluster LVM daemon started - connected to CMAN
Jun  5 13:55:19 an-a04n01 kernel: dlm: connecting to 2
Jun  5 13:55:19 an-a04n01 pacemaker: Starting Pacemaker Cluster Manager
Jun  5 13:55:19 an-a04n01 pacemakerd[3141]:   notice: crm_add_logfile: Additional logging available in /var/log/cluster/corosync.log
Jun  5 13:55:19 an-a04n01 pacemakerd[3141]:   notice: main: Starting Pacemaker 1.1.10-14.el6_5.3 (Build: 368c726):  generated-manpages agent-manpages ascii-docs publican-docs ncurses libqb-logging libqb-ipc nagios  corosync-plugin cman
Jun  5 13:55:19 an-a04n01 pengine[3151]:   notice: crm_add_logfile: Additional logging available in /var/log/cluster/corosync.log
Jun  5 13:55:19 an-a04n01 attrd[3150]:   notice: crm_add_logfile: Additional logging available in /var/log/cluster/corosync.log
Jun  5 13:55:19 an-a04n01 attrd[3150]:   notice: crm_cluster_connect: Connecting to cluster infrastructure: cman
Jun  5 13:55:19 an-a04n01 cib[3147]:   notice: crm_add_logfile: Additional logging available in /var/log/cluster/corosync.log
Jun  5 13:55:19 an-a04n01 stonith-ng[3148]:   notice: crm_add_logfile: Additional logging available in /var/log/cluster/corosync.log
Jun  5 13:55:19 an-a04n01 stonith-ng[3148]:   notice: crm_cluster_connect: Connecting to cluster infrastructure: cman
Jun  5 13:55:19 an-a04n01 attrd[3150]:   notice: main: Starting mainloop...
Jun  5 13:55:20 an-a04n01 lrmd[3149]:   notice: crm_add_logfile: Additional logging available in /var/log/cluster/corosync.log
Jun  5 13:55:20 an-a04n01 crmd[3152]:   notice: crm_add_logfile: Additional logging available in /var/log/cluster/corosync.log
Jun  5 13:55:20 an-a04n01 crmd[3152]:   notice: main: CRM Git Version: 368c726
Jun  5 13:55:20 an-a04n01 cib[3147]:   notice: crm_cluster_connect: Connecting to cluster infrastructure: cman
Jun  5 13:55:20 an-a04n01 stonith-ng[3148]:   notice: setup_cib: Watching for stonith topology changes
Jun  5 13:55:21 an-a04n01 crmd[3152]:   notice: crm_cluster_connect: Connecting to cluster infrastructure: cman
Jun  5 13:55:21 an-a04n01 crmd[3152]:   notice: cman_event_callback: Membership 132: quorum acquired
Jun  5 13:55:21 an-a04n01 crmd[3152]:   notice: crm_update_peer_state: cman_event_callback: Node an-a04n01.alteeve.ca[1] - state is now member (was (null))
Jun  5 13:55:21 an-a04n01 crmd[3152]:   notice: crm_update_peer_state: cman_event_callback: Node an-a04n02.alteeve.ca[2] - state is now member (was (null))
Jun  5 13:55:21 an-a04n01 crmd[3152]:   notice: do_started: The local CRM is operational
Jun  5 13:55:21 an-a04n01 crmd[3152]:   notice: do_state_transition: State transition S_STARTING -> S_PENDING [ input=I_PENDING cause=C_FSA_INTERNAL origin=do_started ]
Jun  5 13:55:21 an-a04n01 stonith-ng[3148]:   notice: stonith_device_register: Added 'fence_n01_ipmi' to the device list (1 active devices)
Jun  5 13:55:22 an-a04n01 stonith-ng[3148]:   notice: stonith_device_register: Added 'fence_n02_ipmi' to the device list (2 active devices)
Jun  5 13:55:42 an-a04n01 crmd[3152]:  warning: do_log: FSA: Input I_DC_TIMEOUT from crm_timer_popped() received in state S_PENDING
Jun  5 13:55:42 an-a04n01 crmd[3152]:   notice: do_state_transition: State transition S_ELECTION -> S_INTEGRATION [ input=I_ELECTION_DC cause=C_FSA_INTERNAL origin=do_election_check ]
Jun  5 13:55:42 an-a04n01 attrd[3150]:   notice: attrd_local_callback: Sending full refresh (origin=crmd)
Jun  5 13:55:43 an-a04n01 pengine[3151]:  warning: custom_action: Action fence_n01_ipmi_monitor_0 on an-a04n02.alteeve.ca is unrunnable (pending)
Jun  5 13:55:43 an-a04n01 pengine[3151]:  warning: custom_action: Action fence_n02_ipmi_monitor_0 on an-a04n02.alteeve.ca is unrunnable (pending)
Jun  5 13:55:43 an-a04n01 pengine[3151]:   notice: LogActions: Start   fence_n01_ipmi#011(an-a04n01.alteeve.ca)
Jun  5 13:55:43 an-a04n01 pengine[3151]:   notice: LogActions: Start   fence_n02_ipmi#011(an-a04n01.alteeve.ca)
Jun  5 13:55:43 an-a04n01 pengine[3151]:   notice: process_pe_message: Calculated Transition 0: /var/lib/pacemaker/pengine/pe-input-27.bz2
Jun  5 13:55:43 an-a04n01 crmd[3152]:   notice: te_rsc_command: Initiating action 4: monitor fence_n01_ipmi_monitor_0 on an-a04n01.alteeve.ca (local)
Jun  5 13:55:43 an-a04n01 crmd[3152]:   notice: te_rsc_command: Initiating action 5: monitor fence_n02_ipmi_monitor_0 on an-a04n01.alteeve.ca (local)
Jun  5 13:55:44 an-a04n01 crmd[3152]:   notice: te_rsc_command: Initiating action 3: probe_complete probe_complete on an-a04n01.alteeve.ca (local) - no waiting
Jun  5 13:55:44 an-a04n01 crmd[3152]:  warning: run_graph: Transition 0 (Complete=3, Pending=0, Fired=0, Skipped=0, Incomplete=5, Source=/var/lib/pacemaker/pengine/pe-input-27.bz2): Terminated
Jun  5 13:55:44 an-a04n01 attrd[3150]:   notice: attrd_trigger_update: Sending flush op to all hosts for: probe_complete (true)
Jun  5 13:55:44 an-a04n01 crmd[3152]:  warning: te_graph_trigger: Transition failed: terminated
Jun  5 13:55:44 an-a04n01 crmd[3152]:   notice: print_graph: Graph 0 with 8 actions: batch-limit=8 jobs, network-delay=30ms
Jun  5 13:55:44 an-a04n01 crmd[3152]:   notice: print_synapse: [Action   10]: Pending rsc op fence_n01_ipmi_monitor_60000        on an-a04n01.alteeve.ca (priority: 0, waiting:  9)
Jun  5 13:55:44 an-a04n01 crmd[3152]:   notice: print_synapse: [Action    9]: Pending rsc op fence_n01_ipmi_start_0              on an-a04n01.alteeve.ca (priority: 0, waiting:  2)
Jun  5 13:55:44 an-a04n01 crmd[3152]:   notice: print_synapse: [Action    4]: Completed rsc op fence_n01_ipmi_monitor_0          on an-a04n01.alteeve.ca (priority: 0, waiting: none)
Jun  5 13:55:44 an-a04n01 crmd[3152]:   notice: print_synapse: [Action   12]: Pending rsc op fence_n02_ipmi_monitor_60000        on an-a04n01.alteeve.ca (priority: 0, waiting:  11)
Jun  5 13:55:44 an-a04n01 crmd[3152]:   notice: print_synapse: [Action   11]: Pending rsc op fence_n02_ipmi_start_0              on an-a04n01.alteeve.ca (priority: 0, waiting:  2)
Jun  5 13:55:44 an-a04n01 crmd[3152]:   notice: print_synapse: [Action    5]: Completed rsc op fence_n02_ipmi_monitor_0          on an-a04n01.alteeve.ca (priority: 0, waiting: none)
Jun  5 13:55:44 an-a04n01 crmd[3152]:   notice: print_synapse: [Action    3]: Completed rsc op probe_complete                    on an-a04n01.alteeve.ca (priority: 1000000, waiting: none)
Jun  5 13:55:44 an-a04n01 crmd[3152]:   notice: print_synapse: [Action    2]: Pending pseduo op probe_complete                   on N/A (priority: 0, waiting: none)
Jun  5 13:55:44 an-a04n01 crmd[3152]:   notice: print_synapse:  * [Input  6]: Unresolved dependancy rsc op probe_complete on an-a04n02.alteeve.ca
Jun  5 13:55:44 an-a04n01 crmd[3152]:   notice: do_state_transition: State transition S_TRANSITION_ENGINE -> S_IDLE [ input=I_TE_SUCCESS cause=C_FSA_INTERNAL origin=notify_crmd ]
Jun  5 13:55:44 an-a04n01 attrd[3150]:   notice: attrd_perform_update: Sent update 4: probe_complete=true
Jun  5 13:55:49 an-a04n01 crmd[3152]:   notice: do_state_transition: State transition S_IDLE -> S_INTEGRATION [ input=I_NODE_JOIN cause=C_FSA_INTERNAL origin=peer_update_callback ]
Jun  5 13:55:51 an-a04n01 attrd[3150]:   notice: attrd_local_callback: Sending full refresh (origin=crmd)
Jun  5 13:55:51 an-a04n01 attrd[3150]:   notice: attrd_trigger_update: Sending flush op to all hosts for: probe_complete (true)
Jun  5 13:55:51 an-a04n01 pengine[3151]:   notice: LogActions: Start   fence_n01_ipmi#011(an-a04n01.alteeve.ca)
Jun  5 13:55:51 an-a04n01 pengine[3151]:   notice: LogActions: Start   fence_n02_ipmi#011(an-a04n02.alteeve.ca)
Jun  5 13:55:51 an-a04n01 pengine[3151]:   notice: process_pe_message: Calculated Transition 1: /var/lib/pacemaker/pengine/pe-input-28.bz2
Jun  5 13:55:51 an-a04n01 crmd[3152]:   notice: te_rsc_command: Initiating action 5: monitor fence_n01_ipmi_monitor_0 on an-a04n02.alteeve.ca
Jun  5 13:55:51 an-a04n01 crmd[3152]:   notice: te_rsc_command: Initiating action 6: monitor fence_n02_ipmi_monitor_0 on an-a04n02.alteeve.ca
Jun  5 13:55:52 an-a04n01 crmd[3152]:   notice: te_rsc_command: Initiating action 4: probe_complete probe_complete on an-a04n02.alteeve.ca - no waiting
Jun  5 13:55:52 an-a04n01 crmd[3152]:   notice: te_rsc_command: Initiating action 7: start fence_n01_ipmi_start_0 on an-a04n01.alteeve.ca (local)
Jun  5 13:55:52 an-a04n01 crmd[3152]:   notice: te_rsc_command: Initiating action 9: start fence_n02_ipmi_start_0 on an-a04n02.alteeve.ca
Jun  5 13:55:53 an-a04n01 stonith-ng[3148]:   notice: stonith_device_register: Device 'fence_n01_ipmi' already existed in device list (2 active devices)
Jun  5 13:55:54 an-a04n01 crmd[3152]:   notice: process_lrm_event: LRM operation fence_n01_ipmi_start_0 (call=13, rc=0, cib-update=41, confirmed=true) ok
Jun  5 13:55:54 an-a04n01 crmd[3152]:   notice: te_rsc_command: Initiating action 10: monitor fence_n02_ipmi_monitor_60000 on an-a04n02.alteeve.ca
Jun  5 13:55:54 an-a04n01 crmd[3152]:   notice: te_rsc_command: Initiating action 8: monitor fence_n01_ipmi_monitor_60000 on an-a04n01.alteeve.ca (local)
Jun  5 13:55:54 an-a04n01 crmd[3152]:   notice: process_lrm_event: LRM operation fence_n01_ipmi_monitor_60000 (call=16, rc=0, cib-update=42, confirmed=false) ok
Jun  5 13:55:54 an-a04n01 crmd[3152]:   notice: run_graph: Transition 1 (Complete=8, Pending=0, Fired=0, Skipped=0, Incomplete=0, Source=/var/lib/pacemaker/pengine/pe-input-28.bz2): Complete
Jun  5 13:55:54 an-a04n01 crmd[3152]:   notice: do_state_transition: State transition S_TRANSITION_ENGINE -> S_IDLE [ input=I_TE_SUCCESS cause=C_FSA_INTERNAL origin=notify_crmd ]
====

Second node (pacemaker started ~1 second later):

====
[root@an-a04n02 ~]# /etc/init.d/pacemaker start
Starting cluster: 
   Checking if cluster has been disabled at boot...        [  OK  ]
   Checking Network Manager...                             [  OK  ]
   Global setup...                                         [  OK  ]
   Loading kernel modules...                               [  OK  ]
   Mounting configfs...                                    [  OK  ]
   Starting cman...                                        [  OK  ]
   Waiting for quorum...                                   [  OK  ]
   Starting fenced...                                      [  OK  ]
   Starting dlm_controld...                                [  OK  ]
   Tuning DLM kernel config...                             [  OK  ]
   Starting gfs_controld...                                [  OK  ]
   Unfencing self...                                       [  OK  ]
   Joining fence domain...                                 [  OK  ]
Starting clvmd: clvmd startup timed out
Starting Pacemaker Cluster Manager                         [  OK  ]
====
Jun  5 13:47:55 an-a04n02 ntpd[2487]: 0.0.0.0 c612 02 freq_set kernel 27.924 PPM
Jun  5 13:47:55 an-a04n02 ntpd[2487]: 0.0.0.0 c615 05 clock_sync
Jun  5 13:55:11 an-a04n02 kernel: DLM (built Apr 11 2014 17:28:07) installed
Jun  5 13:55:11 an-a04n02 corosync[2831]:   [MAIN  ] Corosync Cluster Engine ('1.4.1'): started and ready to provide service.
Jun  5 13:55:11 an-a04n02 corosync[2831]:   [MAIN  ] Corosync built-in features: nss dbus rdma snmp
Jun  5 13:55:11 an-a04n02 corosync[2831]:   [MAIN  ] Successfully read config from /etc/cluster/cluster.conf
Jun  5 13:55:11 an-a04n02 corosync[2831]:   [MAIN  ] Successfully parsed cman config
Jun  5 13:55:11 an-a04n02 corosync[2831]:   [TOTEM ] Initializing transport (UDP/IP Multicast).
Jun  5 13:55:11 an-a04n02 corosync[2831]:   [TOTEM ] Initializing transmit/receive security: libtomcrypt SOBER128/SHA1HMAC (mode 0).
Jun  5 13:55:11 an-a04n02 corosync[2831]:   [TOTEM ] The network interface [10.20.40.2] is now up.
Jun  5 13:55:12 an-a04n02 corosync[2831]:   [QUORUM] Using quorum provider quorum_cman
Jun  5 13:55:12 an-a04n02 corosync[2831]:   [SERV  ] Service engine loaded: corosync cluster quorum service v0.1
Jun  5 13:55:12 an-a04n02 corosync[2831]:   [CMAN  ] CMAN 3.0.12.1 (built Apr  3 2014 05:12:26) started
Jun  5 13:55:12 an-a04n02 corosync[2831]:   [SERV  ] Service engine loaded: corosync CMAN membership service 2.90
Jun  5 13:55:12 an-a04n02 corosync[2831]:   [SERV  ] Service engine loaded: openais checkpoint service B.01.01
Jun  5 13:55:12 an-a04n02 corosync[2831]:   [SERV  ] Service engine loaded: corosync extended virtual synchrony service
Jun  5 13:55:12 an-a04n02 corosync[2831]:   [SERV  ] Service engine loaded: corosync configuration service
Jun  5 13:55:12 an-a04n02 corosync[2831]:   [SERV  ] Service engine loaded: corosync cluster closed process group service v1.01
Jun  5 13:55:12 an-a04n02 corosync[2831]:   [SERV  ] Service engine loaded: corosync cluster config database access v1.01
Jun  5 13:55:12 an-a04n02 corosync[2831]:   [SERV  ] Service engine loaded: corosync profile loading service
Jun  5 13:55:12 an-a04n02 corosync[2831]:   [QUORUM] Using quorum provider quorum_cman
Jun  5 13:55:12 an-a04n02 corosync[2831]:   [SERV  ] Service engine loaded: corosync cluster quorum service v0.1
Jun  5 13:55:12 an-a04n02 corosync[2831]:   [MAIN  ] Compatibility mode set to whitetank.  Using V1 and V2 of the synchronization engine.
Jun  5 13:55:12 an-a04n02 corosync[2831]:   [TOTEM ] A processor joined or left the membership and a new membership was formed.
Jun  5 13:55:12 an-a04n02 corosync[2831]:   [TOTEM ] A processor joined or left the membership and a new membership was formed.
Jun  5 13:55:12 an-a04n02 corosync[2831]:   [CMAN  ] quorum regained, resuming activity
Jun  5 13:55:12 an-a04n02 corosync[2831]:   [QUORUM] This node is within the primary component and will provide service.
Jun  5 13:55:12 an-a04n02 corosync[2831]:   [QUORUM] Members[1]: 2
Jun  5 13:55:12 an-a04n02 corosync[2831]:   [QUORUM] Members[1]: 2
Jun  5 13:55:12 an-a04n02 corosync[2831]:   [QUORUM] Members[2]: 1 2
Jun  5 13:55:12 an-a04n02 corosync[2831]:   [QUORUM] Members[2]: 1 2
Jun  5 13:55:12 an-a04n02 corosync[2831]:   [CPG   ] chosen downlist: sender r(0) ip(10.20.40.1) ; members(old:1 left:0)
Jun  5 13:55:12 an-a04n02 corosync[2831]:   [MAIN  ] Completed service synchronization, ready to provide service.
Jun  5 13:55:15 an-a04n02 fenced[2886]: fenced 3.0.12.1 started
Jun  5 13:55:16 an-a04n02 dlm_controld[2910]: dlm_controld 3.0.12.1 started
Jun  5 13:55:17 an-a04n02 gfs_controld[2957]: gfs_controld 3.0.12.1 started
Jun  5 13:55:18 an-a04n02 pacemaker: Attempting to start clvmd
Jun  5 13:55:19 an-a04n02 kernel: dlm: Using TCP for communications
Jun  5 13:55:19 an-a04n02 kernel: dlm: connecting to 1
Jun  5 13:55:48 an-a04n02 pacemaker: Starting Pacemaker Cluster Manager
Jun  5 13:55:48 an-a04n02 pacemakerd[3081]:   notice: crm_add_logfile: Additional logging available in /var/log/cluster/corosync.log
Jun  5 13:55:48 an-a04n02 pacemakerd[3081]:   notice: main: Starting Pacemaker 1.1.10-14.el6_5.3 (Build: 368c726):  generated-manpages agent-manpages ascii-docs publican-docs ncurses libqb-logging libqb-ipc nagios  corosync-plugin cman
Jun  5 13:55:48 an-a04n02 cib[3087]:   notice: crm_add_logfile: Additional logging available in /var/log/cluster/corosync.log
Jun  5 13:55:48 an-a04n02 attrd[3090]:   notice: crm_add_logfile: Additional logging available in /var/log/cluster/corosync.log
Jun  5 13:55:48 an-a04n02 attrd[3090]:   notice: crm_cluster_connect: Connecting to cluster infrastructure: cman
Jun  5 13:55:48 an-a04n02 attrd[3090]:   notice: main: Starting mainloop...
Jun  5 13:55:48 an-a04n02 stonith-ng[3088]:   notice: crm_add_logfile: Additional logging available in /var/log/cluster/corosync.log
Jun  5 13:55:48 an-a04n02 pengine[3091]:   notice: crm_add_logfile: Additional logging available in /var/log/cluster/corosync.log
Jun  5 13:55:48 an-a04n02 stonith-ng[3088]:   notice: crm_cluster_connect: Connecting to cluster infrastructure: cman
Jun  5 13:55:48 an-a04n02 lrmd[3089]:   notice: crm_add_logfile: Additional logging available in /var/log/cluster/corosync.log
Jun  5 13:55:48 an-a04n02 crmd[3092]:   notice: crm_add_logfile: Additional logging available in /var/log/cluster/corosync.log
Jun  5 13:55:48 an-a04n02 crmd[3092]:   notice: main: CRM Git Version: 368c726
Jun  5 13:55:48 an-a04n02 cib[3087]:   notice: crm_cluster_connect: Connecting to cluster infrastructure: cman
Jun  5 13:55:49 an-a04n02 stonith-ng[3088]:   notice: setup_cib: Watching for stonith topology changes
Jun  5 13:55:49 an-a04n02 crmd[3092]:   notice: crm_cluster_connect: Connecting to cluster infrastructure: cman
Jun  5 13:55:49 an-a04n02 crmd[3092]:   notice: cman_event_callback: Membership 132: quorum acquired
Jun  5 13:55:49 an-a04n02 crmd[3092]:   notice: crm_update_peer_state: cman_event_callback: Node an-a04n01.alteeve.ca[1] - state is now member (was (null))
Jun  5 13:55:49 an-a04n02 crmd[3092]:   notice: crm_update_peer_state: cman_event_callback: Node an-a04n02.alteeve.ca[2] - state is now member (was (null))
Jun  5 13:55:49 an-a04n02 crmd[3092]:   notice: do_started: The local CRM is operational
Jun  5 13:55:49 an-a04n02 crmd[3092]:   notice: do_state_transition: State transition S_STARTING -> S_PENDING [ input=I_PENDING cause=C_FSA_INTERNAL origin=do_started ]
Jun  5 13:55:50 an-a04n02 stonith-ng[3088]:   notice: stonith_device_register: Added 'fence_n01_ipmi' to the device list (1 active devices)
Jun  5 13:55:51 an-a04n02 stonith-ng[3088]:   notice: stonith_device_register: Added 'fence_n02_ipmi' to the device list (2 active devices)
Jun  5 13:55:51 an-a04n02 crmd[3092]:   notice: do_state_transition: State transition S_PENDING -> S_NOT_DC [ input=I_NOT_DC cause=C_HA_MESSAGE origin=do_cl_join_finalize_respond ]
Jun  5 13:55:51 an-a04n02 attrd[3090]:   notice: attrd_local_callback: Sending full refresh (origin=crmd)
Jun  5 13:55:52 an-a04n02 attrd[3090]:   notice: attrd_trigger_update: Sending flush op to all hosts for: probe_complete (true)
Jun  5 13:55:53 an-a04n02 attrd[3090]:   notice: attrd_perform_update: Sent update 6: probe_complete=true
Jun  5 13:55:53 an-a04n02 stonith-ng[3088]:   notice: stonith_device_register: Device 'fence_n02_ipmi' already existed in device list (2 active devices)
Jun  5 13:55:53 an-a04n02 crmd[3092]:   notice: process_lrm_event: LRM operation fence_n02_ipmi_start_0 (call=13, rc=0, cib-update=12, confirmed=true) ok
Jun  5 13:55:54 an-a04n02 crmd[3092]:   notice: process_lrm_event: LRM operation fence_n02_ipmi_monitor_60000 (call=16, rc=0, cib-update=13, confirmed=false) ok
Jun  5 13:58:10 an-a04n02 kernel: INFO: task clvmd:3060 blocked for more than 120 seconds.
Jun  5 13:58:10 an-a04n02 kernel:      Not tainted 2.6.32-431.17.1.el6.x86_64 #1
Jun  5 13:58:10 an-a04n02 kernel: "echo 0 > /proc/sys/kernel/hung_task_timeout_secs" disables this message.
Jun  5 13:58:10 an-a04n02 kernel: clvmd         D 000000000000000c     0  3060      1 0x00000080
Jun  5 13:58:10 an-a04n02 kernel: ffff88033827fc98 0000000000000086 0000000000000000 0000000000000000
Jun  5 13:58:10 an-a04n02 kernel: 0000000000000000 0000000000000000 0000000000000000 0000000000000000
Jun  5 13:58:10 an-a04n02 kernel: ffff880339569058 ffff88033827ffd8 000000000000fbc8 ffff880339569058
Jun  5 13:58:10 an-a04n02 kernel: Call Trace:
Jun  5 13:58:10 an-a04n02 kernel: [<ffffffff81528a95>] schedule_timeout+0x215/0x2e0
Jun  5 13:58:10 an-a04n02 kernel: [<ffffffff81527bfe>] ? thread_return+0x4e/0x760
Jun  5 13:58:10 an-a04n02 kernel: [<ffffffff81285172>] ? kobject_uevent_env+0x202/0x620
Jun  5 13:58:10 an-a04n02 kernel: [<ffffffff81528713>] wait_for_common+0x123/0x180
Jun  5 13:58:10 an-a04n02 kernel: [<ffffffff81061d00>] ? default_wake_function+0x0/0x20
Jun  5 13:58:10 an-a04n02 kernel: [<ffffffff8152882d>] wait_for_completion+0x1d/0x20
Jun  5 13:58:10 an-a04n02 kernel: [<ffffffffa032af79>] dlm_new_lockspace+0x999/0xa30 [dlm]
Jun  5 13:58:10 an-a04n02 kernel: [<ffffffffa0332ff1>] device_write+0x311/0x720 [dlm]
Jun  5 13:58:10 an-a04n02 kernel: [<ffffffff81061d00>] ? default_wake_function+0x0/0x20
Jun  5 13:58:10 an-a04n02 kernel: [<ffffffff8123316b>] ? selinux_file_permission+0xfb/0x150
Jun  5 13:58:10 an-a04n02 kernel: [<ffffffff81226056>] ? security_file_permission+0x16/0x20
Jun  5 13:58:10 an-a04n02 kernel: [<ffffffff81188c38>] vfs_write+0xb8/0x1a0
Jun  5 13:58:10 an-a04n02 kernel: [<ffffffff81189531>] sys_write+0x51/0x90
Jun  5 13:58:10 an-a04n02 kernel: [<ffffffff810e1abe>] ? __audit_syscall_exit+0x25e/0x290
Jun  5 13:58:10 an-a04n02 kernel: [<ffffffff8100b072>] system_call_fastpath+0x16/0x1b
====

I will attach 'echo t > /proc/sysrq-trigger' from both nodes in a moment.

Comment 10 Madison Kelly 2014-06-05 18:07:08 UTC
Bugzilla isn't letting me add attachments (it says the .cgi file is missing). Here is the syslog output of the two nodes after running 'echo t > /proc/sysrq-trigger' after pacemaker tried to start clvmd.

https://alteeve.ca/files/messages.an-a04n01.sysrq-t
https://alteeve.ca/files/messages.an-a04n02.sysrq-t

Comment 11 Madison Kelly 2014-06-05 20:34:02 UTC
Here are the other logs:

an-a04n01's dlm_controld.log:
====
Jun 05 13:55:14 dlm_controld dlm_controld 3.0.12.1 started
Jun 05 13:55:15 dlm_controld found /dev/misc/dlm-control minor 57
Jun 05 13:55:15 dlm_controld found /dev/misc/dlm-monitor minor 56
Jun 05 13:55:15 dlm_controld found /dev/misc/dlm_plock minor 55
Jun 05 13:55:15 dlm_controld /dev/misc/dlm-monitor fd 12
Jun 05 13:55:15 dlm_controld /sys/kernel/config/dlm/cluster/comms: opendir failed: 2
Jun 05 13:55:15 dlm_controld /sys/kernel/config/dlm/cluster/spaces: opendir failed: 2
Jun 05 13:55:15 dlm_controld cluster node 1 added seq 132
Jun 05 13:55:15 dlm_controld set_configfs_node 1 10.20.40.1 local 1
Jun 05 13:55:15 dlm_controld cluster node 2 added seq 132
Jun 05 13:55:15 dlm_controld set_configfs_node 2 10.20.40.2 local 0
Jun 05 13:55:15 dlm_controld totem/rrp_mode = 'none'
Jun 05 13:55:15 dlm_controld set protocol 0
Jun 05 13:55:15 dlm_controld group_mode 3 compat 0
Jun 05 13:55:15 dlm_controld setup_cpg_daemon 15
Jun 05 13:55:15 dlm_controld dlm:controld conf 1 1 0 memb 1 join 1 left
Jun 05 13:55:15 dlm_controld set_protocol member_count 1 propose daemon 1.1.1 kernel 1.1.1
Jun 05 13:55:15 dlm_controld run protocol from nodeid 1
Jun 05 13:55:15 dlm_controld daemon run 1.1.1 max 1.1.1 kernel run 1.1.1 max 1.1.1
Jun 05 13:55:15 dlm_controld plocks 18
Jun 05 13:55:15 dlm_controld plock cpg message size: 104 bytes
Jun 05 13:55:16 dlm_controld client connection 5 fd 19
Jun 05 13:55:16 dlm_controld dlm:controld conf 2 1 0 memb 1 2 join 2 left
Jun 05 13:55:18 dlm_controld uevent: add@/kernel/dlm/clvmd
Jun 05 13:55:18 dlm_controld kernel: add@ clvmd
Jun 05 13:55:18 dlm_controld uevent: online@/kernel/dlm/clvmd
Jun 05 13:55:18 dlm_controld kernel: online@ clvmd
Jun 05 13:55:18 dlm_controld dlm:ls:clvmd conf 1 1 0 memb 1 join 1 left
Jun 05 13:55:18 dlm_controld clvmd add_change cg 1 joined nodeid 1
Jun 05 13:55:18 dlm_controld clvmd add_change cg 1 we joined
Jun 05 13:55:18 dlm_controld clvmd add_change cg 1 counts member 1 joined 1 remove 0 failed 0
Jun 05 13:55:18 dlm_controld clvmd check_fencing done
Jun 05 13:55:18 dlm_controld clvmd check_quorum disabled
Jun 05 13:55:18 dlm_controld clvmd check_fs none registered
Jun 05 13:55:18 dlm_controld clvmd send_start cg 1 flags 1 data2 0 counts 0 1 1 0 0
Jun 05 13:55:18 dlm_controld clvmd receive_start 1:1 len 76
Jun 05 13:55:18 dlm_controld clvmd match_change 1:1 matches cg 1
Jun 05 13:55:18 dlm_controld clvmd wait_messages cg 1 got all 1
Jun 05 13:55:18 dlm_controld clvmd start_kernel cg 1 member_count 1
Jun 05 13:55:18 dlm_controld write "1090842362" to "/sys/kernel/dlm/clvmd/id"
Jun 05 13:55:18 dlm_controld set_members mkdir "/sys/kernel/config/dlm/cluster/spaces/clvmd/nodes/1"
Jun 05 13:55:18 dlm_controld write "1" to "/sys/kernel/dlm/clvmd/control"
Jun 05 13:55:18 dlm_controld write "0" to "/sys/kernel/dlm/clvmd/event_done"
Jun 05 13:55:18 dlm_controld uevent: add@/devices/virtual/misc/dlm_clvmd
Jun 05 13:55:19 dlm_controld dlm:ls:clvmd conf 2 1 0 memb 1 2 join 2 left
Jun 05 13:55:19 dlm_controld clvmd add_change cg 2 joined nodeid 2
Jun 05 13:55:19 dlm_controld clvmd add_change cg 2 counts member 2 joined 1 remove 0 failed 0
Jun 05 13:55:19 dlm_controld clvmd stop_kernel cg 2
Jun 05 13:55:19 dlm_controld write "0" to "/sys/kernel/dlm/clvmd/control"
Jun 05 13:55:19 dlm_controld clvmd check_fencing done
Jun 05 13:55:19 dlm_controld clvmd check_quorum disabled
Jun 05 13:55:19 dlm_controld clvmd check_fs none registered
Jun 05 13:55:19 dlm_controld clvmd send_start cg 2 flags 2 data2 0 counts 1 2 1 0 0
Jun 05 13:55:19 dlm_controld clvmd receive_start 2:1 len 80
Jun 05 13:55:19 dlm_controld clvmd match_change 2:1 matches cg 2
Jun 05 13:55:19 dlm_controld clvmd wait_messages cg 2 need 1 of 2
Jun 05 13:55:19 dlm_controld clvmd receive_start 1:2 len 80
Jun 05 13:55:19 dlm_controld clvmd match_change 1:2 matches cg 2
Jun 05 13:55:19 dlm_controld clvmd wait_messages cg 2 got all 2
Jun 05 13:55:19 dlm_controld clvmd start_kernel cg 2 member_count 2
Jun 05 13:55:19 dlm_controld dir_member 1
Jun 05 13:55:19 dlm_controld set_members mkdir "/sys/kernel/config/dlm/cluster/spaces/clvmd/nodes/2"
Jun 05 13:55:19 dlm_controld write "1" to "/sys/kernel/dlm/clvmd/control"
Jun 05 13:55:19 dlm_controld clvmd set_plock_ckpt_node from 1 to 1
Jun 05 13:55:19 dlm_controld clvmd unlink ckpt 0
Jun 05 13:55:19 dlm_controld clvmd unlink ckpt error 12 clvmd
Jun 05 13:55:19 dlm_controld clvmd unlink ckpt status error 9 clvmd
Jun 05 13:55:19 dlm_controld clvmd store_plocks r_count 0 p_count 0 total_size 0 max_section_size 0
Jun 05 13:55:19 dlm_controld clvmd store_plocks open ckpt handle 580bd78f00000000
Jun 05 13:55:19 dlm_controld clvmd store_plocks first 0 last 0 r_count 0 p_count 0 sig 0
Jun 05 13:55:19 dlm_controld clvmd send_plocks_stored cg 2 flags a data2 0 counts 1 2 1 0 0
Jun 05 13:55:19 dlm_controld clvmd receive_plocks_stored 1:2 flags a sig 0 need_plocks 0
====

an-a04n02's dlm_controld.log:
====
Jun 05 13:55:16 dlm_controld found /dev/misc/dlm-control minor 57
Jun 05 13:55:16 dlm_controld found /dev/misc/dlm-monitor minor 56
Jun 05 13:55:16 dlm_controld found /dev/misc/dlm_plock minor 55
Jun 05 13:55:16 dlm_controld /dev/misc/dlm-monitor fd 7
Jun 05 13:55:16 dlm_controld /sys/kernel/config/dlm/cluster/comms: opendir failed: 2
Jun 05 13:55:16 dlm_controld /sys/kernel/config/dlm/cluster/spaces: opendir failed: 2
Jun 05 13:55:16 dlm_controld cluster node 1 added seq 132
Jun 05 13:55:16 dlm_controld set_configfs_node 1 10.20.40.1 local 0
Jun 05 13:55:16 dlm_controld cluster node 2 added seq 132
Jun 05 13:55:16 dlm_controld set_configfs_node 2 10.20.40.2 local 1
Jun 05 13:55:16 dlm_controld totem/rrp_mode = 'none'
Jun 05 13:55:16 dlm_controld set protocol 0
Jun 05 13:55:16 dlm_controld group_mode 3 compat 0
Jun 05 13:55:16 dlm_controld setup_cpg_daemon 15
Jun 05 13:55:16 dlm_controld dlm:controld conf 2 1 0 memb 1 2 join 2 left
Jun 05 13:55:16 dlm_controld run protocol from nodeid 1
Jun 05 13:55:16 dlm_controld daemon run 1.1.1 max 1.1.1 kernel run 1.1.1 max 1.1.1
Jun 05 13:55:16 dlm_controld plocks 18
Jun 05 13:55:16 dlm_controld plock cpg message size: 104 bytes
Jun 05 13:55:17 dlm_controld client connection 5 fd 19
Jun 05 13:55:19 dlm_controld uevent: add@/kernel/dlm/clvmd
Jun 05 13:55:19 dlm_controld kernel: add@ clvmd
Jun 05 13:55:19 dlm_controld uevent: online@/kernel/dlm/clvmd
Jun 05 13:55:19 dlm_controld kernel: online@ clvmd
Jun 05 13:55:19 dlm_controld dlm:ls:clvmd conf 2 1 0 memb 1 2 join 2 left
Jun 05 13:55:19 dlm_controld clvmd add_change cg 1 joined nodeid 2
Jun 05 13:55:19 dlm_controld clvmd add_change cg 1 we joined
Jun 05 13:55:19 dlm_controld clvmd add_change cg 1 counts member 2 joined 1 remove 0 failed 0
Jun 05 13:55:19 dlm_controld clvmd check_fencing done
Jun 05 13:55:19 dlm_controld clvmd check_quorum disabled
Jun 05 13:55:19 dlm_controld clvmd check_fs none registered
Jun 05 13:55:19 dlm_controld clvmd send_start cg 1 flags 1 data2 0 counts 0 2 1 0 0
Jun 05 13:55:19 dlm_controld clvmd receive_start 2:1 len 80
Jun 05 13:55:19 dlm_controld clvmd match_change 2:1 matches cg 1
Jun 05 13:55:19 dlm_controld clvmd wait_messages cg 1 need 1 of 2
Jun 05 13:55:19 dlm_controld clvmd receive_start 1:2 len 80
Jun 05 13:55:19 dlm_controld clvmd match_change 1:2 matches cg 1
Jun 05 13:55:19 dlm_controld clvmd wait_messages cg 1 got all 2
Jun 05 13:55:19 dlm_controld clvmd start_kernel cg 1 member_count 2
Jun 05 13:55:19 dlm_controld write "1090842362" to "/sys/kernel/dlm/clvmd/id"
Jun 05 13:55:19 dlm_controld set_members mkdir "/sys/kernel/config/dlm/cluster/spaces/clvmd/nodes/1"
Jun 05 13:55:19 dlm_controld set_members mkdir "/sys/kernel/config/dlm/cluster/spaces/clvmd/nodes/2"
Jun 05 13:55:19 dlm_controld write "1" to "/sys/kernel/dlm/clvmd/control"
Jun 05 13:55:19 dlm_controld write "0" to "/sys/kernel/dlm/clvmd/event_done"
Jun 05 13:55:19 dlm_controld clvmd set_plock_ckpt_node from 0 to 1
Jun 05 13:55:19 dlm_controld clvmd receive_plocks_stored 1:2 flags a sig 0 need_plocks 1
Jun 05 13:55:19 dlm_controld clvmd match_change 1:2 matches cg 1
Jun 05 13:55:19 dlm_controld clvmd retrieve_plocks
Jun 05 13:55:19 dlm_controld clvmd retrieve_plocks first 0 last 0 r_count 0 p_count 0 sig 0
====


an-a04n01's fenced.log:
====
Jun 05 13:55:14 fenced fenced 3.0.12.1 started
Jun 05 13:55:14 fenced logfile cur mode 100644
Jun 05 13:55:14 fenced cpg_join fenced:daemon ...
Jun 05 13:55:14 fenced setup_cpg_daemon 14
Jun 05 13:55:14 fenced group_mode 3 compat 0
Jun 05 13:55:14 fenced fenced:daemon conf 1 1 0 memb 1 join 1 left
Jun 05 13:55:14 fenced fenced:daemon ring 1:132 2 memb 1 2
Jun 05 13:55:14 fenced receive_protocol from 1 max 1.1.1.0 run 0.0.0.0
Jun 05 13:55:14 fenced daemon node 1 max 0.0.0.0 run 0.0.0.0
Jun 05 13:55:14 fenced daemon node 1 join 1401990914 left 0 local quorum 1401990914
Jun 05 13:55:14 fenced set_protocol member_count 1 propose daemon 1.1.1
Jun 05 13:55:14 fenced receive_protocol from 1 max 1.1.1.0 run 1.1.1.0
Jun 05 13:55:14 fenced daemon node 1 max 1.1.1.0 run 0.0.0.0
Jun 05 13:55:14 fenced daemon node 1 join 1401990914 left 0 local quorum 1401990914
Jun 05 13:55:14 fenced run protocol from nodeid 1
Jun 05 13:55:14 fenced daemon run 1.1.1 max 1.1.1
Jun 05 13:55:14 fenced receive_protocol from 1 max 1.1.1.0 run 1.1.1.0
Jun 05 13:55:14 fenced daemon node 1 max 1.1.1.0 run 1.1.1.0
Jun 05 13:55:14 fenced daemon node 1 join 1401990914 left 0 local quorum 1401990914
Jun 05 13:55:16 fenced fenced:daemon conf 2 1 0 memb 1 2 join 2 left
Jun 05 13:55:16 fenced receive_protocol from 2 max 1.1.1.0 run 0.0.0.0
Jun 05 13:55:16 fenced daemon node 2 max 0.0.0.0 run 0.0.0.0
Jun 05 13:55:16 fenced daemon node 2 join 1401990916 left 0 local quorum 1401990914
Jun 05 13:55:16 fenced receive_protocol from 1 max 1.1.1.0 run 1.1.1.0
Jun 05 13:55:16 fenced daemon node 1 max 1.1.1.0 run 1.1.1.0
Jun 05 13:55:16 fenced daemon node 1 join 1401990914 left 0 local quorum 1401990914
Jun 05 13:55:16 fenced receive_protocol from 2 max 1.1.1.0 run 1.1.1.0
Jun 05 13:55:16 fenced daemon node 2 max 1.1.1.0 run 0.0.0.0
Jun 05 13:55:16 fenced daemon node 2 join 1401990916 left 0 local quorum 1401990914
Jun 05 13:55:16 fenced client connection 3 fd 15
Jun 05 13:55:16 fenced /cluster/cman/@two_node is 1
Jun 05 13:55:16 fenced added 2 nodes from ccs
Jun 05 13:55:16 fenced cpg_join fenced:default ...
Jun 05 13:55:16 fenced fenced:default conf 1 1 0 memb 1 join 1 left
Jun 05 13:55:16 fenced add_change cg 1 joined nodeid 1
Jun 05 13:55:16 fenced add_change cg 1 m 1 j 1 r 0 f 0
Jun 05 13:55:16 fenced check_ringid cluster 132 cpg 0:0
Jun 05 13:55:16 fenced fenced:default ring 1:132 2 memb 1 2
Jun 05 13:55:16 fenced check_ringid done cluster 132 cpg 1:132
Jun 05 13:55:16 fenced check_quorum done
Jun 05 13:55:16 fenced send_start 1:1 flags 1 started 0 m 1 j 1 r 0 f 0
Jun 05 13:55:16 fenced receive_start 1:1 len 152
Jun 05 13:55:16 fenced match_change 1:1 matches cg 1
Jun 05 13:55:16 fenced wait_messages cg 1 got all 1
Jun 05 13:55:16 fenced set_master from 0 to low node 1
Jun 05 13:55:16 fenced send_complete 1:1 flags 1 started 0 m 1 j 1 r 0 f 0
Jun 05 13:55:16 fenced receive_complete 1:1 len 152
Jun 05 13:55:17 fenced fenced:default conf 2 1 0 memb 1 2 join 2 left
Jun 05 13:55:17 fenced add_change cg 2 joined nodeid 2
Jun 05 13:55:17 fenced add_change cg 2 m 2 j 1 r 0 f 0
Jun 05 13:55:17 fenced check_ringid done cluster 132 cpg 1:132
Jun 05 13:55:17 fenced check_quorum done
Jun 05 13:55:17 fenced send_start 1:2 flags 2 started 1 m 2 j 1 r 0 f 0
Jun 05 13:55:17 fenced receive_start 1:2 len 152
Jun 05 13:55:17 fenced match_change 1:2 matches cg 2
Jun 05 13:55:17 fenced wait_messages cg 2 need 1 of 2
Jun 05 13:55:17 fenced receive_start 2:1 len 152
Jun 05 13:55:17 fenced match_change 2:1 matches cg 2
Jun 05 13:55:17 fenced wait_messages cg 2 got all 2
Jun 05 13:55:17 fenced set_master from 1 to complete node 1
Jun 05 13:55:17 fenced send_complete 1:2 flags 2 started 1 m 2 j 1 r 0 f 0
Jun 05 13:55:17 fenced receive_complete 1:2 len 152
====

an-a04n02's fenced.log:
====
Jun 05 13:55:15 fenced fenced 3.0.12.1 started
Jun 05 13:55:16 fenced logfile cur mode 100644
Jun 05 13:55:16 fenced cpg_join fenced:daemon ...
Jun 05 13:55:16 fenced setup_cpg_daemon 14
Jun 05 13:55:16 fenced group_mode 3 compat 0
Jun 05 13:55:16 fenced fenced:daemon conf 2 1 0 memb 1 2 join 2 left
Jun 05 13:55:16 fenced fenced:daemon ring 1:132 2 memb 1 2
Jun 05 13:55:16 fenced receive_protocol from 2 max 1.1.1.0 run 0.0.0.0
Jun 05 13:55:16 fenced daemon node 2 max 0.0.0.0 run 0.0.0.0
Jun 05 13:55:16 fenced daemon node 2 join 1401990916 left 0 local quorum 1401990915
Jun 05 13:55:16 fenced receive_protocol from 1 max 1.1.1.0 run 1.1.1.0
Jun 05 13:55:16 fenced daemon node 1 max 0.0.0.0 run 0.0.0.0
Jun 05 13:55:16 fenced daemon node 1 join 1401990916 left 0 local quorum 1401990915
Jun 05 13:55:16 fenced run protocol from nodeid 1
Jun 05 13:55:16 fenced daemon run 1.1.1 max 1.1.1
Jun 05 13:55:16 fenced receive_protocol from 2 max 1.1.1.0 run 1.1.1.0
Jun 05 13:55:16 fenced daemon node 2 max 1.1.1.0 run 0.0.0.0
Jun 05 13:55:16 fenced daemon node 2 join 1401990916 left 0 local quorum 1401990915
Jun 05 13:55:17 fenced client connection 3 fd 15
Jun 05 13:55:17 fenced /cluster/cman/@two_node is 1
Jun 05 13:55:17 fenced added 2 nodes from ccs
Jun 05 13:55:17 fenced cpg_join fenced:default ...
Jun 05 13:55:17 fenced fenced:default conf 2 1 0 memb 1 2 join 2 left
Jun 05 13:55:17 fenced add_change cg 1 joined nodeid 2
Jun 05 13:55:17 fenced add_change cg 1 m 2 j 1 r 0 f 0
Jun 05 13:55:17 fenced check_ringid cluster 132 cpg 0:0
Jun 05 13:55:17 fenced fenced:default ring 1:132 2 memb 1 2
Jun 05 13:55:17 fenced check_ringid done cluster 132 cpg 1:132
Jun 05 13:55:17 fenced check_quorum done
Jun 05 13:55:17 fenced send_start 2:1 flags 1 started 0 m 2 j 1 r 0 f 0
Jun 05 13:55:17 fenced receive_start 1:2 len 152
Jun 05 13:55:17 fenced match_change 1:2 matches cg 1
Jun 05 13:55:17 fenced wait_messages cg 1 need 1 of 2
Jun 05 13:55:17 fenced receive_start 2:1 len 152
Jun 05 13:55:17 fenced match_change 2:1 matches cg 1
Jun 05 13:55:17 fenced wait_messages cg 1 got all 2
Jun 05 13:55:17 fenced set_master from 0 to complete node 1
Jun 05 13:55:17 fenced receive_complete 1:2 len 152
====

Comment 12 Christine Caulfield 2014-06-06 10:06:23 UTC
I think it's probably in the kernel - though you're using exactly the same kernel as I am - the final uevent isn't reaching dlm_controld.

Can you enable kernel debugging, please?

echo 1 > /sys/kernel/config/dlm/cluster/log_debug 

You'll need to do this between loading the DLM module and starting clvmd, as the files won't be there until the module is loaded. You might also need to enable LOG_DEBUG level logging in syslog.
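In shell form, the ordering described above might look like the sketch below. This is a hedged illustration, not part of the original report: the `log_debug` file only appears in configfs once the dlm kernel module is loaded, so the script checks for it before writing rather than assuming it exists.

```shell
#!/bin/sh
# Sketch: enable kernel DLM debugging, guarding against the module not
# being loaded yet (the configfs knob does not exist until it is).
LOG_DEBUG=/sys/kernel/config/dlm/cluster/log_debug

if [ -e "$LOG_DEBUG" ]; then
    echo 1 > "$LOG_DEBUG"   # enable kernel DLM debug messages
    echo "DLM kernel debugging enabled"
else
    echo "dlm module not loaded yet; run 'modprobe dlm' first"
fi
```

Run this after loading the dlm module but before starting clvmd, as noted above.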

Comment 13 Madison Kelly 2014-06-06 19:21:29 UTC
I ran 'chkconfig clvmd off' on both nodes, started pacemaker, ran 'echo 1 > /sys/kernel/config/dlm/cluster/log_debug', and then started clvmd. I started clvmd on an-a04n02 first, then a second later on an-a04n01 (the reverse of the previous order). This time, an-a04n02 *appeared* to start, and an-a04n01 timed out. So start order matters, in some fashion.

I will (try to) attach the syslog in a moment.

Comment 14 Madison Kelly 2014-06-06 19:22:19 UTC
Created attachment 903015 [details]
syslog with kernel/dlm debugging enabled

Comment 15 Madison Kelly 2014-06-06 19:22:38 UTC
Created attachment 903016 [details]
syslog with kernel/dlm debugging enabled, an-a04n02

Comment 16 Madison Kelly 2014-06-06 19:25:08 UTC
Created attachment 903017 [details]
dlm_controld.log with kernel/dlm debugging enabled, an-a04n01

Comment 17 Madison Kelly 2014-06-06 19:25:29 UTC
Created attachment 903018 [details]
dlm_controld.log with kernel/dlm debugging enabled, an-a04n02

Comment 18 Madison Kelly 2014-06-06 22:15:33 UTC
In hopes of getting back to studying, I wiped both nodes and did a fresh near-minimal install of RHEL 6.5. I created a skeleton cluster.conf, configured stonith in pacemaker, set 'locking_type = 3' in lvm.conf and started clvmd... Same results.

So whatever it is, it persists across reinstalls.

Comment 19 Madison Kelly 2014-06-06 22:17:03 UTC
If you're having trouble reproducing, I can set up remote access to the affected nodes. They are purely test systems and can be wiped at any time.

Comment 20 Madison Kelly 2014-06-07 20:06:37 UTC
I decided to play with this some more today... Turns out, it was a firewall issue:

====
[root@an-a04n01 ~]# diff -u iptables.broken iptables.dlm 
--- iptables.broken	2014-06-07 16:03:41.120294967 -0400
+++ iptables.dlm	2014-06-07 16:04:20.444294917 -0400
@@ -1,8 +1,9 @@
-# Generated by iptables-save v1.4.7 on Sat Jun  7 16:03:41 2014
+# Generated by iptables-save v1.4.7 on Sat Jun  7 16:04:20 2014
 *filter
 :INPUT ACCEPT [0:0]
 :FORWARD ACCEPT [0:0]
-:OUTPUT ACCEPT [37:4556]
+:OUTPUT ACCEPT [22:3824]
+-A INPUT -s 10.20.0.0/16 -d 10.20.0.0/16 -p tcp -m state --state NEW -m tcp --dport 21064 -j ACCEPT 
 -A INPUT -s 10.10.0.0/16 -d 10.10.0.0/16 -p tcp -m state --state NEW -m tcp --dport 7789 -j ACCEPT 
 -A INPUT -s 10.10.0.0/16 -d 10.10.0.0/16 -p tcp -m state --state NEW -m tcp --dport 7788 -j ACCEPT 
 -A INPUT -s 10.20.0.0/16 -p udp -m addrtype --dst-type MULTICAST -m state --state NEW -m multiport --dports 5404,5405 -j ACCEPT 
@@ -14,4 +15,4 @@
 -A INPUT -j REJECT --reject-with icmp-host-prohibited 
 -A FORWARD -j REJECT --reject-with icmp-host-prohibited 
 COMMIT
-# Completed on Sat Jun  7 16:03:41 2014
+# Completed on Sat Jun  7 16:04:20 2014
====

So, "user error".

My question now is: can dlm/clvmd handle this better? It strikes me that it shouldn't lock up the system so hard when the firewall isn't open. Could it test whether the port is blocked, then complain and exit instead?
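For reference, the rule that was missing in the diff above opens the DLM's default TCP port, 21064, between the cluster-network subnets. The sketch below just builds and prints that rule string; the subnet 10.20.0.0/16 is taken from the reporter's setup, and the commented commands show the equivalent one-shot fix a root user would run on a live RHEL 6 node.

```shell
#!/bin/sh
# Build the iptables rule that was missing: allow new TCP connections
# to the DLM's default port (21064) on the cluster network.
DLM_PORT=21064
CLUSTER_NET=10.20.0.0/16   # the reporter's cluster/corosync network

RULE="-A INPUT -s ${CLUSTER_NET} -d ${CLUSTER_NET} -p tcp -m state --state NEW -m tcp --dport ${DLM_PORT} -j ACCEPT"
echo "$RULE"

# As root on a live node, the equivalent one-shot commands would be:
#   iptables -I INPUT -s 10.20.0.0/16 -d 10.20.0.0/16 -p tcp \
#     -m state --state NEW -m tcp --dport 21064 -j ACCEPT
#   service iptables save   # persist the rule across reboots on RHEL 6
```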

Comment 21 Christine Caulfield 2014-06-09 07:57:36 UTC
I'm more worried that the default firewall rules were wrong, TBH. That's a much easier and far more useful fix, I think.

Were you using the default firewall rules?

Comment 22 Christine Caulfield 2014-06-09 08:57:14 UTC
I'll ask the resource agent developers if it's feasible to add a check when the DLM is started.

Comment 23 David Vossel 2014-06-30 22:18:41 UTC
(In reply to Christine Caulfield from comment #22)
> I'll ask resource agent developers if its feasible to add a check when the
> DLM is started

I don't maintain the dlm or clvmd init scripts for rhel6.  The dlm is actually started by the CMAN init script.

We shouldn't need to be checking firewall rules; all of this should just work. If it doesn't just work, then we need to focus on fixing that.

-- Vossel

Comment 24 Christine Caulfield 2014-07-01 08:48:40 UTC
OK, my mistake, sorry. It just looks like fixing this in the DLM would be a massively complicated change for what is a rare configuration error.

Comment 27 Christine Caulfield 2015-08-20 08:13:15 UTC
Reluctantly closing this because it's getting too late in the RHEL-6 lifecycle to contemplate intrusive DLM changes.