Bug 1452894
| Summary: | Clustered LVM hangs hard after node is fenced | | |
|---|---|---|---|
| Product: | Red Hat Enterprise Linux 6 | Reporter: | Madison Kelly <mkelly> |
| Component: | lvm2 | Assignee: | Alasdair Kergon <agk> |
| Status: | CLOSED UPSTREAM | QA Contact: | cluster-qe <cluster-qe> |
| Severity: | high | Docs Contact: | |
| Priority: | unspecified | | |
| Version: | 6.9 | CC: | agk, ccaulfie, cluster-maint, heinzm, jbrassow, mcsontos, msnitzer, prajnoha, rpeterso, teigland, thornber, zkabelac |
| Target Milestone: | rc | | |
| Target Release: | --- | | |
| Hardware: | Unspecified | | |
| OS: | Unspecified | | |
| Whiteboard: | | | |
| Fixed In Version: | | Doc Type: | If docs needed, set a value |
| Doc Text: | | Story Points: | --- |
| Clone Of: | | Environment: | |
| Last Closed: | 2017-09-29 21:03:16 UTC | Type: | Bug |
| Regression: | --- | Mount Type: | --- |
| Documentation: | --- | CRM: | |
| Verified Versions: | | Category: | --- |
| oVirt Team: | --- | RHEL 7.3 requirements from Atomic Host: | |
| Cloudforms Team: | --- | Target Upstream Version: | |
| Embargoed: | | | |
| Attachments: | | | |
Description
Madison Kelly
2017-05-19 22:31:26 UTC
Forgot to mention, rgmanager is still responsive;

====
[root@an-a01n01 ~]# clustat
Cluster Status for an-anvil-01 @ Fri May 19 18:31:48 2017
Member Status: Quorate

 Member Name                  ID   Status
 ------ ----                  ---- ------
 an-a01n01.alteeve.com        1    Online, Local, rgmanager
 an-a01n02.alteeve.com        2    Offline

 Service Name                 Owner (Last)                  State
 ------- ----                 ----- ------                  -----
 service:libvirtd_n01         an-a01n01.alteeve.com         started
 service:libvirtd_n02         (an-a01n02.alteeve.com)       stopped
 service:storage_n01          an-a01n01.alteeve.com         started
 service:storage_n02          (an-a01n02.alteeve.com)       stopped
 vm:srv01-rhel7               an-a01n01.alteeve.com         started
 vm:srv02-win2012             an-a01n01.alteeve.com         started
 vm:srv03-sles12              an-a01n01.alteeve.com         started
 vm:srv04-freebsd11           an-a01n01.alteeve.com         started
 vm:srv05-win2016             an-a01n01.alteeve.com         started
 vm:srv06-centos4             an-a01n01.alteeve.com         started
====

It's just clvmd that is hung. I ran 'pvscan' again and then rebooted node 2. It stayed hung until node #2 started cman and clvmd; at that point clvmd recovered and pvscan completed.

====
[root@an-a01n01 ~]# pvscan
  PV /dev/drbd0   VG an-a01n01_vg0   lvm2 [636.57 GiB / 236.57 GiB free]
  Total: 1 [636.57 GiB] / in use: 1 [636.57 GiB] / in no VG: 0 [0   ]
====

I just had this happen again on a different cluster. This time, however, the node wasn't fenced; it was simply withdrawn from the cluster and then restarted. Node 1 was withdrawn from the cluster with '/etc/init.d/rgmanager stop && /etc/init.d/cman stop', then a few minutes later rejoined with '/etc/init.d/cman start' followed by '/etc/init.d/clvmd start'. At this point, 'pvscan' locked up again.

====
May 21 21:26:40 an-a03n01 rgmanager[15368]: Member 2 shutting down
May 21 21:26:40 an-a03n01 rgmanager[15368]: Marking service:storage_n02 as stopped: Restricted domain unavailable
May 21 21:26:41 an-a03n01 rgmanager[15368]: Marking service:libvirtd_n02 as stopped: Restricted domain unavailable
May 21 21:26:58 an-a03n01 corosync[11331]: [QUORUM] Members[1]: 1
May 21 21:26:58 an-a03n01 corosync[11331]: [TOTEM ] A processor joined or left the membership and a new membership was formed.
May 21 21:26:58 an-a03n01 corosync[11331]: [CPG ] chosen downlist: sender r(0) ip(10.20.30.1) ; members(old:2 left:1)
May 21 21:26:58 an-a03n01 corosync[11331]: [MAIN ] Completed service synchronization, ready to provide service.
May 21 21:26:58 an-a03n01 kernel: dlm: closing connection to node 2
May 21 21:28:18 an-a03n01 corosync[11331]: [TOTEM ] A processor joined or left the membership and a new membership was formed.
May 21 21:28:18 an-a03n01 corosync[11331]: [QUORUM] Members[2]: 1 2
May 21 21:28:18 an-a03n01 corosync[11331]: [QUORUM] Members[2]: 1 2
May 21 21:28:18 an-a03n01 corosync[11331]: [CPG ] chosen downlist: sender r(0) ip(10.20.30.1) ; members(old:1 left:0)
May 21 21:28:18 an-a03n01 corosync[11331]: [MAIN ] Completed service synchronization, ready to provide service.
May 21 21:28:26 an-a03n01 kernel: dlm: got connection from 2
May 21 21:28:38 an-a03n01 kernel: drbd r0: Handshake successful: Agreed network protocol version 101
May 21 21:28:38 an-a03n01 kernel: drbd r0: Feature flags enabled on protocol level: 0x7 TRIM THIN_RESYNC WRITE_SAME.
May 21 21:28:38 an-a03n01 kernel: drbd r0: conn( WFConnection -> WFReportParams ) May 21 21:28:38 an-a03n01 kernel: drbd r0: Starting ack_recv thread (from drbd_r_r0 [16726]) May 21 21:28:38 an-a03n01 kernel: block drbd0: drbd_sync_handshake: May 21 21:28:38 an-a03n01 kernel: block drbd0: self 421EA68818FE0737:B470D4A67EBFB993:4FE1E2CEE9AE2C3C:4FE0E2CEE9AE2C3C bits:0 flags:0 May 21 21:28:38 an-a03n01 kernel: block drbd0: peer B470D4A67EBFB992:0000000000000000:4FE1E2CEE9AE2C3D:4FE0E2CEE9AE2C3C bits:0 flags:0 May 21 21:28:38 an-a03n01 kernel: block drbd0: uuid_compare()=1 by rule 70 May 21 21:28:38 an-a03n01 kernel: block drbd0: peer( Unknown -> Secondary ) conn( WFReportParams -> WFBitMapS ) pdsk( Outdated -> Consistent ) May 21 21:28:38 an-a03n01 kernel: block drbd0: send bitmap stats [Bytes(packets)]: plain 0(0), RLE 23(1), total 23; compression: 100.0% May 21 21:28:38 an-a03n01 kernel: block drbd0: receive bitmap stats [Bytes(packets)]: plain 0(0), RLE 23(1), total 23; compression: 100.0% May 21 21:28:38 an-a03n01 kernel: block drbd0: helper command: /sbin/drbdadm before-resync-source minor-0 May 21 21:28:38 an-a03n01 kernel: block drbd0: helper command: /sbin/drbdadm before-resync-source minor-0 exit code 0 (0x0) May 21 21:28:38 an-a03n01 kernel: block drbd0: conn( WFBitMapS -> SyncSource ) pdsk( Consistent -> Inconsistent ) May 21 21:28:38 an-a03n01 kernel: block drbd0: Began resync as SyncSource (will sync 0 KB [0 bits set]). May 21 21:28:38 an-a03n01 kernel: block drbd0: updated sync UUID 421EA68818FE0737:B471D4A67EBFB993:B470D4A67EBFB993:4FE1E2CEE9AE2C3C May 21 21:28:38 an-a03n01 kernel: block drbd0: Resync done (total 1 sec; paused 0 sec; 0 K/sec) May 21 21:28:38 an-a03n01 kernel: block drbd0: updated UUIDs 421EA68818FE0737:0000000000000000:B471D4A67EBFB993:B470D4A67EBFB993 May 21 21:28:38 an-a03n01 kernel: block drbd0: conn( SyncSource -> Connected ) pdsk( Inconsistent -> UpToDate ) May 21 21:28:38 an-a03n01 kernel: drbd r0: helper command: /sbin/drbdadm unfence-peer r0 May 21 21:28:38 an-a03n01 kernel: drbd r0: helper command: /sbin/drbdadm unfence-peer r0 exit code 0 (0x0) May 21 21:28:38 an-a03n01 kernel: block drbd0: peer( Secondary -> Primary ) May 21 21:28:52 an-a03n01 rgmanager[15368]: Shutting down May 21 21:28:53 an-a03n01 rgmanager[15368]: Stopping service service:storage_n01 May 21 21:28:53 an-a03n01 rgmanager[15368]: Stopping service service:libvirtd_n01 May 21 21:28:53 an-a03n01 rgmanager[15368]: Service service:libvirtd_n01 is stopped May 21 21:28:54 an-a03n01 kernel: block drbd0: role( Primary -> Secondary ) May 21 21:28:54 an-a03n01 kernel: block drbd0: 0 KB (0 bits) marked out-of-sync by on disk bit-map. May 21 21:28:54 an-a03n01 kernel: drbd r0: Requested state change failed by peer: Refusing to be Primary while peer is not outdated (-7) May 21 21:28:54 an-a03n01 kernel: drbd r0: peer( Primary -> Unknown ) conn( Connected -> Disconnecting ) disk( UpToDate -> Outdated ) pdsk( UpToDate -> DUnknown ) May 21 21:28:54 an-a03n01 kernel: drbd r0: ack_receiver terminated May 21 21:28:54 an-a03n01 kernel: drbd r0: Terminating drbd_a_r0 May 21 21:28:54 an-a03n01 kernel: drbd r0: Connection closed May 21 21:28:54 an-a03n01 kernel: drbd r0: conn( Disconnecting -> StandAlone ) May 21 21:28:54 an-a03n01 kernel: drbd r0: receiver terminated May 21 21:28:54 an-a03n01 kernel: drbd r0: Terminating drbd_r_r0 May 21 21:28:54 an-a03n01 kernel: block drbd0: disk( Outdated -> Failed ) May 21 21:28:54 an-a03n01 kernel: block drbd0: 0 KB (0 bits) marked out-of-sync by on disk bit-map. 
May 21 21:28:54 an-a03n01 kernel: block drbd0: disk( Failed -> Diskless ) May 21 21:28:54 an-a03n01 kernel: drbd r0: Terminating drbd_w_r0 May 21 21:28:54 an-a03n01 kernel: drbd: module cleanup done. May 21 21:28:54 an-a03n01 rgmanager[15368]: Service service:storage_n01 is stopped May 21 21:28:54 an-a03n01 rgmanager[15368]: Shutting down May 21 21:28:54 an-a03n01 rgmanager[15368]: Disconnecting from CMAN May 21 21:28:54 an-a03n01 rgmanager[15368]: Exiting May 21 21:28:56 an-a03n01 kernel: dlm: closing connection to node 2 May 21 21:28:56 an-a03n01 kernel: dlm: closing connection to node 1 May 21 21:28:57 an-a03n01 corosync[11331]: [SERV ] Unloading all Corosync service engines. May 21 21:28:57 an-a03n01 corosync[11331]: [SERV ] Service engine unloaded: corosync extended virtual synchrony service May 21 21:28:57 an-a03n01 corosync[11331]: [SERV ] Service engine unloaded: corosync configuration service May 21 21:28:57 an-a03n01 corosync[11331]: [SERV ] Service engine unloaded: corosync cluster closed process group service v1.01 May 21 21:28:57 an-a03n01 corosync[11331]: [SERV ] Service engine unloaded: corosync cluster config database access v1.01 May 21 21:28:57 an-a03n01 corosync[11331]: [SERV ] Service engine unloaded: corosync profile loading service May 21 21:28:57 an-a03n01 corosync[11331]: [SERV ] Service engine unloaded: openais checkpoint service B.01.01 May 21 21:28:57 an-a03n01 corosync[11331]: [SERV ] Service engine unloaded: corosync CMAN membership service 2.90 May 21 21:28:57 an-a03n01 corosync[11331]: [SERV ] Service engine unloaded: corosync cluster quorum service v0.1 May 21 21:28:57 an-a03n01 corosync[11331]: [MAIN ] Corosync Cluster Engine exiting with status 0 at main.c:2055. May 21 21:29:47 an-a03n01 corosync[22810]: [MAIN ] Corosync Cluster Engine ('1.4.7'): started and ready to provide service. May 21 21:29:47 an-a03n01 corosync[22810]: [MAIN ] Corosync built-in features: nss dbus rdma snmp May 21 21:29:47 an-a03n01 corosync[22810]: [MAIN ] Successfully read config from /etc/cluster/cluster.conf May 21 21:29:47 an-a03n01 corosync[22810]: [MAIN ] Successfully parsed cman config May 21 21:29:47 an-a03n01 corosync[22810]: [TOTEM ] Initializing transport (UDP/IP Unicast). May 21 21:29:47 an-a03n01 corosync[22810]: [TOTEM ] Initializing transmit/receive security: libtomcrypt SOBER128/SHA1HMAC (mode 0). May 21 21:29:47 an-a03n01 corosync[22810]: [TOTEM ] The network interface [10.20.30.1] is now up. 
May 21 21:29:47 an-a03n01 corosync[22810]: [QUORUM] Using quorum provider quorum_cman May 21 21:29:47 an-a03n01 corosync[22810]: [SERV ] Service engine loaded: corosync cluster quorum service v0.1 May 21 21:29:47 an-a03n01 corosync[22810]: [CMAN ] CMAN 3.0.12.1 (built Mar 24 2017 12:40:51) started May 21 21:29:47 an-a03n01 corosync[22810]: [SERV ] Service engine loaded: corosync CMAN membership service 2.90 May 21 21:29:47 an-a03n01 corosync[22810]: [SERV ] Service engine loaded: openais checkpoint service B.01.01 May 21 21:29:47 an-a03n01 corosync[22810]: [SERV ] Service engine loaded: corosync extended virtual synchrony service May 21 21:29:47 an-a03n01 corosync[22810]: [SERV ] Service engine loaded: corosync configuration service May 21 21:29:47 an-a03n01 corosync[22810]: [SERV ] Service engine loaded: corosync cluster closed process group service v1.01 May 21 21:29:47 an-a03n01 corosync[22810]: [SERV ] Service engine loaded: corosync cluster config database access v1.01 May 21 21:29:47 an-a03n01 corosync[22810]: [SERV ] Service engine loaded: corosync profile loading service May 21 21:29:47 an-a03n01 corosync[22810]: [QUORUM] Using quorum provider quorum_cman May 21 21:29:47 an-a03n01 corosync[22810]: [SERV ] Service engine loaded: corosync cluster quorum service v0.1 May 21 21:29:47 an-a03n01 corosync[22810]: [MAIN ] Compatibility mode set to whitetank. Using V1 and V2 of the synchronization engine. May 21 21:29:47 an-a03n01 corosync[22810]: [TOTEM ] adding new UDPU member {10.20.30.1} May 21 21:29:47 an-a03n01 corosync[22810]: [TOTEM ] adding new UDPU member {10.20.30.2} May 21 21:29:47 an-a03n01 corosync[22810]: [TOTEM ] A processor joined or left the membership and a new membership was formed. May 21 21:29:47 an-a03n01 corosync[22810]: [CMAN ] quorum regained, resuming activity May 21 21:29:47 an-a03n01 corosync[22810]: [QUORUM] This node is within the primary component and will provide service. May 21 21:29:47 an-a03n01 corosync[22810]: [QUORUM] Members[1]: 1 May 21 21:29:47 an-a03n01 corosync[22810]: [QUORUM] Members[1]: 1 May 21 21:29:47 an-a03n01 corosync[22810]: [CPG ] chosen downlist: sender r(0) ip(10.20.30.1) ; members(old:0 left:0) May 21 21:29:47 an-a03n01 corosync[22810]: [MAIN ] Completed service synchronization, ready to provide service. May 21 21:29:47 an-a03n01 corosync[22810]: [TOTEM ] A processor joined or left the membership and a new membership was formed. May 21 21:29:47 an-a03n01 corosync[22810]: [QUORUM] Members[2]: 1 2 May 21 21:29:47 an-a03n01 corosync[22810]: [QUORUM] Members[2]: 1 2 May 21 21:29:47 an-a03n01 corosync[22810]: [CPG ] chosen downlist: sender r(0) ip(10.20.30.1) ; members(old:1 left:0) May 21 21:29:47 an-a03n01 corosync[22810]: [MAIN ] Completed service synchronization, ready to provide service. 
May 21 21:29:51 an-a03n01 fenced[22923]: fenced 3.0.12.1 started May 21 21:29:51 an-a03n01 dlm_controld[22946]: dlm_controld 3.0.12.1 started May 21 21:29:52 an-a03n01 gfs_controld[23021]: gfs_controld 3.0.12.1 started May 21 21:29:55 an-a03n01 kernel: dlm: Using TCP for communications May 21 21:29:55 an-a03n01 kernel: dlm: connecting to 2 May 21 21:29:56 an-a03n01 clvmd: Cluster LVM daemon started - connected to CMAN ==== From node 2's perspective; ==== May 21 21:28:46 an-a03n02 rgmanager[17840]: I am node #2 May 21 21:28:46 an-a03n02 rgmanager[17840]: Resource Group Manager Starting May 21 21:28:48 an-a03n02 rgmanager[17840]: Starting stopped service service:storage_n02 May 21 21:28:49 an-a03n02 rgmanager[17840]: Starting stopped service service:libvirtd_n02 May 21 21:28:49 an-a03n02 rgmanager[17840]: Service service:libvirtd_n02 started May 21 21:28:49 an-a03n02 rgmanager[17840]: Service service:storage_n02 started May 21 21:28:49 an-a03n02 kernel: lo: Disabled Privacy Extensions May 21 21:28:54 an-a03n02 kernel: block drbd0: peer( Primary -> Secondary ) May 21 21:28:54 an-a03n02 kernel: block drbd0: State change failed: Refusing to be Primary while peer is not outdated May 21 21:28:54 an-a03n02 kernel: block drbd0: state = { cs:Connected ro:Primary/Secondary ds:UpToDate/UpToDate r----- } May 21 21:28:54 an-a03n02 kernel: block drbd0: wanted = { cs:TearDown ro:Primary/Unknown ds:UpToDate/DUnknown s---F- } May 21 21:28:54 an-a03n02 kernel: drbd r0: State change failed: Refusing to be Primary while peer is not outdated May 21 21:28:54 an-a03n02 kernel: drbd r0: mask = 0x1f0 val = 0x70 May 21 21:28:54 an-a03n02 kernel: drbd r0: old_conn:WFReportParams wanted_conn:TearDown May 21 21:28:54 an-a03n02 kernel: drbd r0: peer( Secondary -> Unknown ) conn( Connected -> TearDown ) pdsk( UpToDate -> Outdated ) May 21 21:28:54 an-a03n02 kernel: drbd r0: ack_receiver terminated May 21 21:28:54 an-a03n02 kernel: drbd r0: Terminating drbd_a_r0 May 21 21:28:54 an-a03n02 kernel: block drbd0: new current UUID AB97F17A92B4AA55:421EA68818FE0737:B471D4A67EBFB992:B470D4A67EBFB993 May 21 21:28:54 an-a03n02 kernel: drbd r0: Connection closed May 21 21:28:54 an-a03n02 kernel: drbd r0: conn( TearDown -> Unconnected ) May 21 21:28:54 an-a03n02 kernel: drbd r0: receiver terminated May 21 21:28:54 an-a03n02 kernel: drbd r0: Restarting receiver thread May 21 21:28:54 an-a03n02 kernel: drbd r0: receiver (re)started May 21 21:28:54 an-a03n02 kernel: drbd r0: conn( Unconnected -> WFConnection ) May 21 21:28:54 an-a03n02 rgmanager[17840]: Member 1 shutting down May 21 21:28:54 an-a03n02 rgmanager[17840]: Marking service:storage_n01 as stopped: Restricted domain unavailable May 21 21:28:57 an-a03n02 corosync[16913]: [QUORUM] Members[1]: 2 May 21 21:28:57 an-a03n02 corosync[16913]: [TOTEM ] A processor joined or left the membership and a new membership was formed. May 21 21:28:57 an-a03n02 corosync[16913]: [CPG ] chosen downlist: sender r(0) ip(10.20.30.2) ; members(old:2 left:1) May 21 21:28:57 an-a03n02 corosync[16913]: [MAIN ] Completed service synchronization, ready to provide service. May 21 21:28:57 an-a03n02 kernel: dlm: closing connection to node 1 May 21 21:29:47 an-a03n02 corosync[16913]: [TOTEM ] A processor joined or left the membership and a new membership was formed. 
May 21 21:29:47 an-a03n02 corosync[16913]: [QUORUM] Members[2]: 1 2
May 21 21:29:47 an-a03n02 corosync[16913]: [QUORUM] Members[2]: 1 2
May 21 21:29:47 an-a03n02 corosync[16913]: [CPG ] chosen downlist: sender r(0) ip(10.20.30.1) ; members(old:1 left:0)
May 21 21:29:47 an-a03n02 corosync[16913]: [MAIN ] Completed service synchronization, ready to provide service.
May 21 21:29:55 an-a03n02 kernel: dlm: got connection from 1
====

At this point 'pvscan' locks up on both nodes, but 'clustat' still works.

====
[root@an-a03n01 ~]# clustat
Cluster Status for an-anvil-03 @ Sun May 21 21:50:11 2017
Member Status: Quorate

 Member Name                  ID   Status
 ------ ----                  ---- ------
 an-a03n01.alteeve.com        1    Online, Local
 an-a03n02.alteeve.com        2    Online
====

And:

====
[root@an-a03n02 ~]# clustat
Cluster Status for an-anvil-03 @ Sun May 21 21:50:14 2017
Member Status: Quorate

 Member Name                  ID   Status
 ------ ----                  ---- ------
 an-a03n01.alteeve.com        1    Online
 an-a03n02.alteeve.com        2    Online, Local, rgmanager

 Service Name                 Owner (Last)                  State
 ------- ----                 ----- ------                  -----
 service:libvirtd_n01         (an-a03n01.alteeve.com)       stopped
 service:libvirtd_n02         an-a03n02.alteeve.com         started
 service:storage_n01          (an-a03n01.alteeve.com)       stopped
 service:storage_n02          an-a03n02.alteeve.com         started
====

Both nodes were freshly booted, and neither node had been fenced or removed/rejoined since the cluster was first started on both nodes. Note that the nodes had not yet been updated from base 6.9.

====
[root@an-a03n01 ~]# rpm -q lvm2-cluster cman rgmanager gfs2-utils
lvm2-cluster-2.02.143-12.el6.x86_64
cman-3.0.12.1-84.el6.x86_64
rgmanager-3.0.12.1-33.el6.x86_64
gfs2-utils-3.0.12.1-84.el6.x86_64
====

I just noticed that the rgmanager version wasn't reported in the initial comment. It is the same though, so I am guessing it hasn't been updated since 6.9 was released:

====
[root@an-a01n01 ~]# rpm -q rgmanager
rgmanager-3.0.12.1-33.el6.x86_64
====

Note also that this cluster is CentOS, whereas the previous comments were from RHEL 6.9 proper.

This is a pretty serious issue, so far as I can tell. Risking a cluster-wide LVM lockup from simply withdrawing and rejoining the cluster...

To try to recover, I fenced node 1 and tried to directly start cman followed by rgmanager, without manually starting clvmd first. It took a while, but it did eventually start and, once started, pvscan on node 2 finally returned. Could this have something to do with cman and clvmd running without rgmanager running? I suppose it's just coincidence, and fencing node 1 and rejoining it to the cluster was enough. I'll try to gather more data when I hit this bug again.

syslog from node 1 after it was fenced;

====
May 21 21:59:39 an-a03n01 corosync[2360]: [MAIN ] Corosync Cluster Engine ('1.4.7'): started and ready to provide service.
May 21 21:59:39 an-a03n01 corosync[2360]: [MAIN ] Corosync built-in features: nss dbus rdma snmp
May 21 21:59:39 an-a03n01 corosync[2360]: [MAIN ] Successfully read config from /etc/cluster/cluster.conf
May 21 21:59:39 an-a03n01 corosync[2360]: [MAIN ] Successfully parsed cman config
May 21 21:59:39 an-a03n01 corosync[2360]: [TOTEM ] Initializing transport (UDP/IP Unicast).
May 21 21:59:39 an-a03n01 corosync[2360]: [TOTEM ] Initializing transmit/receive security: libtomcrypt SOBER128/SHA1HMAC (mode 0).
May 21 21:59:39 an-a03n01 corosync[2360]: [TOTEM ] The network interface [10.20.30.1] is now up.
May 21 21:59:40 an-a03n01 corosync[2360]: [QUORUM] Using quorum provider quorum_cman May 21 21:59:40 an-a03n01 corosync[2360]: [SERV ] Service engine loaded: corosync cluster quorum service v0.1 May 21 21:59:40 an-a03n01 corosync[2360]: [CMAN ] CMAN 3.0.12.1 (built Mar 24 2017 12:40:51) started May 21 21:59:40 an-a03n01 corosync[2360]: [SERV ] Service engine loaded: corosync CMAN membership service 2.90 May 21 21:59:40 an-a03n01 corosync[2360]: [SERV ] Service engine loaded: openais checkpoint service B.01.01 May 21 21:59:40 an-a03n01 corosync[2360]: [SERV ] Service engine loaded: corosync extended virtual synchrony service May 21 21:59:40 an-a03n01 corosync[2360]: [SERV ] Service engine loaded: corosync configuration service May 21 21:59:40 an-a03n01 corosync[2360]: [SERV ] Service engine loaded: corosync cluster closed process group service v1.01 May 21 21:59:40 an-a03n01 corosync[2360]: [SERV ] Service engine loaded: corosync cluster config database access v1.01 May 21 21:59:40 an-a03n01 corosync[2360]: [SERV ] Service engine loaded: corosync profile loading service May 21 21:59:40 an-a03n01 corosync[2360]: [QUORUM] Using quorum provider quorum_cman May 21 21:59:40 an-a03n01 corosync[2360]: [SERV ] Service engine loaded: corosync cluster quorum service v0.1 May 21 21:59:40 an-a03n01 corosync[2360]: [MAIN ] Compatibility mode set to whitetank. Using V1 and V2 of the synchronization engine. May 21 21:59:40 an-a03n01 corosync[2360]: [TOTEM ] adding new UDPU member {10.20.30.1} May 21 21:59:40 an-a03n01 corosync[2360]: [TOTEM ] adding new UDPU member {10.20.30.2} May 21 21:59:40 an-a03n01 corosync[2360]: [TOTEM ] A processor joined or left the membership and a new membership was formed. May 21 21:59:40 an-a03n01 corosync[2360]: [CMAN ] quorum regained, resuming activity May 21 21:59:40 an-a03n01 corosync[2360]: [QUORUM] This node is within the primary component and will provide service. May 21 21:59:40 an-a03n01 corosync[2360]: [QUORUM] Members[1]: 1 May 21 21:59:40 an-a03n01 corosync[2360]: [QUORUM] Members[1]: 1 May 21 21:59:40 an-a03n01 corosync[2360]: [TOTEM ] A processor joined or left the membership and a new membership was formed. May 21 21:59:40 an-a03n01 corosync[2360]: [QUORUM] Members[2]: 1 2 May 21 21:59:40 an-a03n01 corosync[2360]: [QUORUM] Members[2]: 1 2 May 21 21:59:40 an-a03n01 corosync[2360]: [CPG ] chosen downlist: sender r(0) ip(10.20.30.2) ; members(old:1 left:0) May 21 21:59:40 an-a03n01 corosync[2360]: [MAIN ] Completed service synchronization, ready to provide service. May 21 21:59:44 an-a03n01 fenced[2427]: fenced 3.0.12.1 started May 21 21:59:44 an-a03n01 dlm_controld[2442]: dlm_controld 3.0.12.1 started May 21 21:59:45 an-a03n01 gfs_controld[2504]: gfs_controld 3.0.12.1 started May 21 22:00:01 an-a03n01 kernel: dlm: Using TCP for communications May 21 22:00:01 an-a03n01 kernel: dlm: connecting to 2 May 21 22:00:02 an-a03n01 rgmanager[2605]: I am node #1 May 21 22:00:02 an-a03n01 rgmanager[2605]: Resource Group Manager Starting May 21 22:00:04 an-a03n01 rgmanager[2605]: Starting stopped service service:storage_n01 May 21 22:00:04 an-a03n01 kernel: drbd: events: mcg drbd: 2 May 21 22:00:04 an-a03n01 kernel: drbd: initialized. 
Version: 8.4.9-1 (api:1/proto:86-101) May 21 22:00:04 an-a03n01 kernel: drbd: GIT-hash: 9976da086367a2476503ef7f6b13d4567327a280 build by root.ca, 2017-02-05 15:36:19 May 21 22:00:04 an-a03n01 kernel: drbd: registered as block device major 147 May 21 22:00:04 an-a03n01 kernel: drbd r0: Starting worker thread (from drbdsetup-84 [3718]) May 21 22:00:04 an-a03n01 kernel: block drbd0: disk( Diskless -> Attaching ) May 21 22:00:04 an-a03n01 kernel: drbd r0: Method to ensure write ordering: drain May 21 22:00:04 an-a03n01 kernel: block drbd0: max BIO size = 1048576 May 21 22:00:04 an-a03n01 kernel: block drbd0: drbd_bm_resize called with capacity == 326093016 May 21 22:00:04 an-a03n01 kernel: block drbd0: resync bitmap: bits=40761627 words=636901 pages=1244 May 21 22:00:04 an-a03n01 kernel: block drbd0: size = 155 GB (163046508 KB) May 21 22:00:04 an-a03n01 kernel: block drbd0: recounting of set bits took additional 3 jiffies May 21 22:00:04 an-a03n01 kernel: block drbd0: 0 KB (0 bits) marked out-of-sync by on disk bit-map. May 21 22:00:04 an-a03n01 kernel: block drbd0: disk( Attaching -> Outdated ) May 21 22:00:04 an-a03n01 kernel: block drbd0: attached to UUIDs 421EA68818FE0736:0000000000000000:B471D4A67EBFB993:B470D4A67EBFB993 May 21 22:00:04 an-a03n01 kernel: drbd r0: conn( StandAlone -> Unconnected ) May 21 22:00:04 an-a03n01 kernel: drbd r0: Starting receiver thread (from drbd_w_r0 [3720]) May 21 22:00:04 an-a03n01 kernel: drbd r0: receiver (re)started May 21 22:00:04 an-a03n01 kernel: drbd r0: conn( Unconnected -> WFConnection ) May 21 22:00:04 an-a03n01 rgmanager[2605]: Starting stopped service service:libvirtd_n01 May 21 22:00:05 an-a03n01 rgmanager[2605]: Service service:libvirtd_n01 started May 21 22:00:05 an-a03n01 kernel: Ebtables v2.0 registered May 21 22:00:05 an-a03n01 kernel: ip6_tables: (C) 2000-2006 Netfilter Core Team May 21 22:00:05 an-a03n01 kernel: lo: Disabled Privacy Extensions May 21 22:00:14 an-a03n01 kernel: drbd r0: Handshake successful: Agreed network protocol version 101 May 21 22:00:14 an-a03n01 kernel: drbd r0: Feature flags enabled on protocol level: 0x7 TRIM THIN_RESYNC WRITE_SAME. 
May 21 22:00:14 an-a03n01 kernel: drbd r0: conn( WFConnection -> WFReportParams ) May 21 22:00:14 an-a03n01 kernel: drbd r0: Starting ack_recv thread (from drbd_r_r0 [3726]) May 21 22:00:14 an-a03n01 kernel: block drbd0: drbd_sync_handshake: May 21 22:00:14 an-a03n01 kernel: block drbd0: self 421EA68818FE0736:0000000000000000:B471D4A67EBFB993:B470D4A67EBFB993 bits:0 flags:0 May 21 22:00:14 an-a03n01 kernel: block drbd0: peer AB97F17A92B4AA55:421EA68818FE0737:B471D4A67EBFB992:B470D4A67EBFB993 bits:15 flags:0 May 21 22:00:14 an-a03n01 kernel: block drbd0: uuid_compare()=-1 by rule 50 May 21 22:00:14 an-a03n01 kernel: block drbd0: peer( Unknown -> Primary ) conn( WFReportParams -> WFBitMapT ) pdsk( DUnknown -> UpToDate ) May 21 22:00:14 an-a03n01 kernel: block drbd0: receive bitmap stats [Bytes(packets)]: plain 0(0), RLE 27(1), total 27; compression: 100.0% May 21 22:00:14 an-a03n01 kernel: block drbd0: send bitmap stats [Bytes(packets)]: plain 0(0), RLE 27(1), total 27; compression: 100.0% May 21 22:00:14 an-a03n01 kernel: block drbd0: conn( WFBitMapT -> WFSyncUUID ) May 21 22:00:14 an-a03n01 kernel: block drbd0: updated sync uuid 421FA68818FE0736:0000000000000000:B471D4A67EBFB993:B470D4A67EBFB993 May 21 22:00:14 an-a03n01 kernel: block drbd0: helper command: /sbin/drbdadm before-resync-target minor-0 May 21 22:00:14 an-a03n01 kernel: block drbd0: helper command: /sbin/drbdadm before-resync-target minor-0 exit code 0 (0x0) May 21 22:00:14 an-a03n01 kernel: block drbd0: conn( WFSyncUUID -> SyncTarget ) disk( Outdated -> Inconsistent ) May 21 22:00:14 an-a03n01 kernel: block drbd0: Began resync as SyncTarget (will sync 60 KB [15 bits set]). May 21 22:00:14 an-a03n01 kernel: block drbd0: role( Secondary -> Primary ) May 21 22:00:14 an-a03n01 kernel: block drbd0: Resync done (total 1 sec; paused 0 sec; 60 K/sec) May 21 22:00:14 an-a03n01 kernel: block drbd0: updated UUIDs AB97F17A92B4AA55:0000000000000000:421FA68818FE0737:421EA68818FE0737 May 21 22:00:14 an-a03n01 kernel: block drbd0: conn( SyncTarget -> Connected ) disk( Inconsistent -> UpToDate ) May 21 22:00:14 an-a03n01 kernel: block drbd0: helper command: /sbin/drbdadm after-resync-target minor-0 May 21 22:00:14 an-a03n01 kernel: block drbd0: helper command: /sbin/drbdadm after-resync-target minor-0 exit code 0 (0x0) May 21 22:00:16 an-a03n01 clvmd: Cluster LVM daemon started - connected to CMAN May 21 22:00:45 an-a03n01 kernel: bcn_bond1: link status definitely up for interface bcn_link2, 1000 Mbps full duplex May 21 22:00:50 an-a03n01 kernel: ifn_bond1: link status definitely up for interface ifn_link2, 1000 Mbps full duplex May 21 22:00:50 an-a03n01 kernel: sn_bond1: link status definitely up for interface sn_link2, 1000 Mbps full duplex May 21 22:01:16 an-a03n01 kernel: Slow work thread pool: Starting up May 21 22:01:16 an-a03n01 kernel: Slow work thread pool: Ready May 21 22:01:16 an-a03n01 kernel: GFS2 (built Mar 21 2017 19:30:12) installed May 21 22:01:16 an-a03n01 kernel: GFS2: fsid=: Trying to join cluster "lock_dlm", "an-anvil-03:shared" May 21 22:01:16 an-a03n01 kernel: GFS2: fsid=an-anvil-03:shared.0: Joined cluster. Now mounting FS... May 21 22:01:17 an-a03n01 kernel: GFS2: fsid=an-anvil-03:shared.0: jid=0, already locked for use May 21 22:01:17 an-a03n01 kernel: GFS2: fsid=an-anvil-03:shared.0: jid=0: Looking at journal... 
May 21 22:01:17 an-a03n01 kernel: GFS2: fsid=an-anvil-03:shared.0: jid=0: Done May 21 22:01:17 an-a03n01 rgmanager[2605]: Service service:storage_n01 started May 21 22:01:31 an-a03n01 modclusterd: startup succeeded May 21 22:01:31 an-a03n01 oddjobd: oddjobd startup succeeded May 21 22:01:32 an-a03n01 saslauthd[4937]: detach_tty : master pid is: 4937 May 21 22:01:32 an-a03n01 saslauthd[4937]: ipc_init : listening on socket: /var/run/saslauthd/mux May 21 22:01:32 an-a03n01 ricci: startup succeeded ==== The fence of node 1 and it coming back up, from node 2's perspective. ==== May 21 21:29:55 an-a03n02 kernel: dlm: got connection from 1 May 21 21:58:04 an-a03n02 fence_node[3571]: fence an-a03n01 success May 21 21:58:10 an-a03n02 corosync[16913]: [TOTEM ] A processor failed, forming new configuration. May 21 21:58:12 an-a03n02 corosync[16913]: [QUORUM] Members[1]: 2 May 21 21:58:12 an-a03n02 corosync[16913]: [TOTEM ] A processor joined or left the membership and a new membership was formed. May 21 21:58:12 an-a03n02 corosync[16913]: [CPG ] chosen downlist: sender r(0) ip(10.20.30.2) ; members(old:2 left:1) May 21 21:58:12 an-a03n02 corosync[16913]: [MAIN ] Completed service synchronization, ready to provide service. May 21 21:58:12 an-a03n02 kernel: dlm: closing connection to node 1 May 21 21:58:12 an-a03n02 fenced[17038]: node_history_fence_external no nodeid -1 May 21 21:58:12 an-a03n02 fenced[17038]: fencing node an-a03n01.alteeve.com May 21 21:58:30 an-a03n02 fenced[17038]: fence an-a03n01.alteeve.com success May 21 21:59:40 an-a03n02 corosync[16913]: [TOTEM ] A processor joined or left the membership and a new membership was formed. May 21 21:59:40 an-a03n02 corosync[16913]: [QUORUM] Members[2]: 1 2 May 21 21:59:40 an-a03n02 corosync[16913]: [QUORUM] Members[2]: 1 2 May 21 21:59:40 an-a03n02 corosync[16913]: [CPG ] chosen downlist: sender r(0) ip(10.20.30.2) ; members(old:1 left:0) May 21 21:59:40 an-a03n02 corosync[16913]: [MAIN ] Completed service synchronization, ready to provide service. May 21 22:00:01 an-a03n02 kernel: dlm: got connection from 1 May 21 22:00:15 an-a03n02 kernel: drbd r0: Handshake successful: Agreed network protocol version 101 May 21 22:00:15 an-a03n02 kernel: drbd r0: Feature flags enabled on protocol level: 0x7 TRIM THIN_RESYNC WRITE_SAME. 
May 21 22:00:15 an-a03n02 kernel: drbd r0: conn( WFConnection -> WFReportParams )
May 21 22:00:15 an-a03n02 kernel: drbd r0: Starting ack_recv thread (from drbd_r_r0 [17369])
May 21 22:00:15 an-a03n02 kernel: block drbd0: drbd_sync_handshake:
May 21 22:00:15 an-a03n02 kernel: block drbd0: self AB97F17A92B4AA55:421EA68818FE0737:B471D4A67EBFB992:B470D4A67EBFB993 bits:15 flags:0
May 21 22:00:15 an-a03n02 kernel: block drbd0: peer 421EA68818FE0736:0000000000000000:B471D4A67EBFB993:B470D4A67EBFB993 bits:0 flags:0
May 21 22:00:15 an-a03n02 kernel: block drbd0: uuid_compare()=1 by rule 70
May 21 22:00:15 an-a03n02 kernel: block drbd0: peer( Unknown -> Secondary ) conn( WFReportParams -> WFBitMapS ) pdsk( Outdated -> Consistent )
May 21 22:00:15 an-a03n02 kernel: block drbd0: send bitmap stats [Bytes(packets)]: plain 0(0), RLE 27(1), total 27; compression: 100.0%
May 21 22:00:15 an-a03n02 kernel: block drbd0: receive bitmap stats [Bytes(packets)]: plain 0(0), RLE 27(1), total 27; compression: 100.0%
May 21 22:00:15 an-a03n02 kernel: block drbd0: helper command: /sbin/drbdadm before-resync-source minor-0
May 21 22:00:15 an-a03n02 kernel: block drbd0: helper command: /sbin/drbdadm before-resync-source minor-0 exit code 0 (0x0)
May 21 22:00:15 an-a03n02 kernel: block drbd0: conn( WFBitMapS -> SyncSource ) pdsk( Consistent -> Inconsistent )
May 21 22:00:15 an-a03n02 kernel: block drbd0: Began resync as SyncSource (will sync 60 KB [15 bits set]).
May 21 22:00:15 an-a03n02 kernel: block drbd0: updated sync UUID AB97F17A92B4AA55:421FA68818FE0737:421EA68818FE0737:B471D4A67EBFB992
May 21 22:00:15 an-a03n02 kernel: block drbd0: peer( Secondary -> Primary )
May 21 22:00:15 an-a03n02 kernel: block drbd0: Resync done (total 1 sec; paused 0 sec; 60 K/sec)
May 21 22:00:15 an-a03n02 kernel: block drbd0: updated UUIDs AB97F17A92B4AA55:0000000000000000:421FA68818FE0737:421EA68818FE0737
May 21 22:00:15 an-a03n02 kernel: block drbd0: conn( SyncSource -> Connected ) pdsk( Inconsistent -> UpToDate )
May 21 22:00:15 an-a03n02 kernel: drbd r0: helper command: /sbin/drbdadm unfence-peer r0
May 21 22:00:15 an-a03n02 kernel: drbd r0: helper command: /sbin/drbdadm unfence-peer r0 exit code 0 (0x0)
====

One more note; cluster 1 (an-anvil-01) is on physical hardware and cluster 2 (an-anvil-03) is a KVM-based cluster. So cluster 1's fencing is IPMI -> PDUs and cluster 2 is just fence_virsh. Their configs are otherwise identical (same installer).

Created attachment 1280889 [details]
sos report from node 1
md5: 675c774212efa45d1e6b600f43d0b40c
Created attachment 1280890 [details]
sos report from node 2
md5: b6ebbf0390a346585b2181357496026b
Created attachment 1280953 [details]
sos report from an-anvil-01 - node 1
md5: 6c17bb2b210278bced00c081b6f51b2f
Created attachment 1280954 [details]
sos report from an-anvil-01 - node 2
md5: 689ac077104c7032e8d1146440d602a8
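For reference, the withdraw/rejoin sequence described in the comments above can be condensed into a small test script. This is only a sketch of the reported steps; the sleep and timeout values are arbitrary choices, not part of the original report:

====
#!/bin/bash
# Withdraw this node from the cluster, rejoin it, then test whether clvmd
# still answers. Mirrors the sequence reported above.
/etc/init.d/rgmanager stop && /etc/init.d/cman stop
sleep 120                      # "a few minutes later" in the report
/etc/init.d/cman start
/etc/init.d/clvmd start
# pvscan blocks forever when the bug triggers, so bound it rather than
# letting the test join the hang.
if ! timeout 30 pvscan; then
    echo "pvscan hung or failed; clvmd is likely stuck" >&2
fi
====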
For posterity (from freenode -> #clusterlabs) (unrelated chatter removed):

====
03:04 <@chrissie> digimer: can you enable clvmd debug logging please? (clvmd -d2)
03:39 < digimer> chrissie: it happens sporadically. is there a way to set -d2 permanently?
03:39 < digimer> I'll run a sosreport next time it happens
03:42 <@chrissie> You can set it in the sysconfig file I think. (this is a memory test, right?)
03:44 < digimer> which sysconfig file?
03:47 <@chrissie> add it to CLVMDOPTS in /etc/init.d/clvmd
03:48 <@chrissie> There's /etc/sysconfig/clvmd too, but it's not shipped by default and you'd need to remember to add -T30 (which is the default in the init script but which would be overridden)
03:49 < digimer> done
03:49 < digimer> ya, I looked in /etc/sysconfig/ but didn't see clvmd
03:50 < digimer> ok, restarted clvmd on both nodes
03:51 < digimer> storage back online
03:51 < digimer> I'll do more testing tomorrow of start/stop storage and fencing nodes
03:51 <@chrissie> It's odd that dlm is still OK, it's the most usual cause of clvmd hangs. The second is hung storage
03:51 < digimer> anything else I can do in the meantime?
03:52 <@chrissie> so I'm at a bit of a loss to explain it at the moment
03:52 < digimer> in this case, neither is the problem
03:52 <@chrissie> has it started happening since an upgrade?
03:52 < digimer> gfs2 still works fine, as do VMs using clvmd LVs as raw storage
03:52 < digimer> just can't touch pvscan or the like or else it locks everything up
03:52 < digimer> dunno
03:52 < digimer> I've recently started much more serious / heavy testing getting ready for release
03:53 < digimer> I didn't see it before
03:53 < digimer> or rather, I saw dlm hangs cause problems like this, as you said
03:53 < digimer> but I can't recall ever seeing just clvmd lock up when gfs2/rgmanager were still fine
03:54 < digimer> though it is scaring me pretty badly that deployed systems might face this
03:54 < digimer> given that recovery seems to require reconnecting the peer, and that won't be possible if a node is destroyed in the field
03:54 <@chrissie> yes, it scares me too. I
03:54 <@chrissie> I've not seen anything like this happen on RHEL6
03:56 < digimer> let me see if I can trigger it
03:56 <@chrissie> in the logs I've seen, there's no reboot between cluster shutdown and restart, is that common?
03:57 < digimer> that was the case in the most recent event
03:57 < digimer> on an-anvil-03
03:57 < digimer> it happened previously on a node that was fenced after I 'echo c >/proc/sysrq-trigger'ed it with IPMI unplugged (to force backup fencing to be used)
03:58 < digimer> so it doesn't seem to matter if it reboots between restarts
03:58 <@chrissie> ok
03:59 < digimer> stopped -> started (no reboot), OK
03:59 < digimer> I don't have a reproducer, so I have to just repeat until it happens...
04:02 < digimer> ok that time, too
04:02 * digimer tries fencing the node
04:03 < digimer> this is odd. could this be related?
04:03 < digimer> May 22 04:02:04 an-a01n01 rsyslogd-2177: imuxsock begins to drop messages from pid 1270 due to rate-limiting
04:03 < digimer> May 22 04:02:09 an-a01n01 rsyslogd-2177: imuxsock lost 1924 messages from pid 1270 due to rate-limiting
04:03 < digimer> repeats over and over
04:03 < digimer> happening on node 2, as well
04:04 < digimer> [root@an-a01n01 ~]# ps aux | grep 1270
04:04 < digimer> root      1270  0.8  0.0 257052 40176 ?   SLsl 03:58   0:02 clvmd -T30 -d2
04:04 < digimer> it's clvmd
04:04 <@chrissie> ahh, those will be the debug log messages
04:04 < digimer> ok, so don't worry about that?
04:04 < digimer> if not, I'll go back to fence testing
04:04 <@chrissie> so it's doing *something* in a loop!
04:05 < digimer> oh, so that isn't normal?
04:05 < digimer> should I create a sosreport now?
04:05 <@chrissie> it shouldn't be logging 2000 messages a second
04:05 <@chrissie> yes please
04:05 < digimer> ok. so I should stop the cluster on both nodes and generate the report
04:05 < digimer> give me a few
04:06 <@chrissie> clvmd should only log when it's been asked to do something (which mostly it isn't - it's a management tool)
04:06 < digimer> these reports will be from actual RHEL 6.9 on hardware, fully updated
04:07 < digimer> the logging started when I withdrew and then rejoined the nodes
04:08 <@chrissie> hmm interesting. I hope we have at least some of the clvmd firehose
04:08 < digimer> append them to the rhbz or just upload them where you can get at them?
04:08 < digimer> firehose?
04:09 <@chrissie> so it's doing *something* in a loop!mad amount of logging it seems to be doing
04:09 <@chrissie> oops
04:09 * chrissie needs more tea
04:09 * digimer nods
04:09 <@chrissie> best put it in the bz, then all the info is in one place
04:09 <@chrissie> what's the BZ#?
04:11 < digimer> 1452894
04:13 < digimer> uploaded; c#10 and c#11
04:18 <@chrissie> hmm nothing from clvmd
04:18 <@chrissie> That's cos *.debug logging is disabled #facepalm
04:18 < digimer> ?
04:18 < digimer> in the sosreport?
04:19 <@chrissie> in syslog.conf
04:19 < digimer> oh
04:19 < digimer> so I need to start over?
04:19 <@chrissie> sorry, yes :(
===
12:32 < digimer> chrissie_away: I enabled debug logging on rsyslog.conf
12:32 < digimer> and I noticed that there would be a flood of logs when I ran my program that scans the cluster (including calling pvs/vgs/lvs), formats and displays the state info
12:32 < digimer> there were bursts of logs when it ran
12:33 < digimer> so
12:33 < digimer> a) calling pvs/pvscan et. al. too often breaks things (which I imagine shouldn't happen), or these logs are not themselves related to the crash
12:33 < digimer> note that when LVM hung, I stopped all our code to try and rule it out
12:34 < digimer> I'll do more experiments though and try to see
12:36 < digimer> here is what our program calls; https://github.com/ClusterLabs/striker/blob/master/tools/anvil-report-state#L668
12:37 < digimer> basically, pvs + vgs + lvs
12:37 < digimer> there is a function above it that used to call pvscan + vgscan + lvscan, but that was disabled ages ago
12:38 < digimer> so if pvs + vgs + lvs causes a flood of debug logs, then that would explain it
12:38 < digimer> it == log rate limiting
12:38 < digimer> but it shouldn't cause a total lockup of cLVM
12:39 < digimer> I'll add this to the rhbz
====

Created attachment 1281166 [details]
sos report from an-anvil-01 - node 1 - Post-fence hang
This sosreport is from after fencing node #2 and the issue occurred. As before, I could write to the gfs2 partition and run clustat once the node was successfully fenced, but I couldn't call pvscan.
md5: c887244d74e6a40654e10f16c645f9c2
Created attachment 1281167 [details]
sos report from an-anvil-01 - node 2 - Post-fence hang
This is the node that was fenced. I crashed it with 'echo c > /proc/sysrq-trigger'.
md5: 6e50599634182408877e64f623da87b9
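The clvmd debug logging discussed in the transcript above can be made persistent instead of editing the init script. A sketch of the setup chrissie describes, assuming the stock RHEL 6 init script; the log file path is an arbitrary choice, and -T30 must be restated because the sysconfig file overrides the init script default:

====
# /etc/sysconfig/clvmd is not shipped by default; the init script sources
# it if present:
echo 'CLVMDOPTS="-T30 -d2"' > /etc/sysconfig/clvmd

# clvmd -d2 logs through syslog at debug priority, which the default
# RHEL 6 rsyslog.conf discards; store it somewhere:
echo '*.debug    /var/log/clvmd-debug.log' >> /etc/rsyslog.conf
# If the imuxsock rate-limiting seen in this bug drops messages,
# '$SystemLogRateLimitInterval 0' in rsyslog.conf disables it.
service rsyslog restart
/etc/init.d/clvmd restart
====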
Related to c#13 and c#14, before the hang, 'dmsetup info' returned;

====
[root@an-a01n01 ~]# dmsetup info
Name:              an--a01n01_vg0-srv01--centos7_0
State:             ACTIVE
Read Ahead:        256
Tables present:    LIVE
Open count:        0
Event number:      0
Major, minor:      253, 1
Number of targets: 1
UUID: LVM-8ZdU3fzKgMtr7523qxLmyBdAdiIuqd2h9YN9gLkz63EUc9WYdPqmI5Ue122Mm3Ek

Name:              an--a01n01_vg0-shared
State:             ACTIVE
Read Ahead:        256
Tables present:    LIVE
Open count:        1
Event number:      0
Major, minor:      253, 0
Number of targets: 1
UUID: LVM-8ZdU3fzKgMtr7523qxLmyBdAdiIuqd2hrVbI7JgURjpE0zKZ4kw0eq5in5tcLUwr
====

After node 2 was fenced, while clvmd was hung and gfs2/rgmanager worked, 'dmsetup info' returned;

====
[root@an-a01n01 ~]# dmsetup info
Name:              an--a01n01_vg0-srv01--centos7_0
State:             ACTIVE
Read Ahead:        256
Tables present:    LIVE
Open count:        0
Event number:      0
Major, minor:      253, 1
Number of targets: 1
UUID: LVM-8ZdU3fzKgMtr7523qxLmyBdAdiIuqd2h9YN9gLkz63EUc9WYdPqmI5Ue122Mm3Ek

Name:              an--a01n01_vg0-shared
State:             ACTIVE
Read Ahead:        256
Tables present:    LIVE
Open count:        1
Event number:      0
Major, minor:      253, 0
Number of targets: 1
UUID: LVM-8ZdU3fzKgMtr7523qxLmyBdAdiIuqd2hrVbI7JgURjpE0zKZ4kw0eq5in5tcLUwr
====

08:52 <@chrissie> So the sosreport managed to run pvs et al successfully (which it does by disabling locking) so I suspect a hung lock in the DLM somewhere. Annoyingly sosreport doesn't dump the dlm lockspace
09:00 <@chrissie> there's an un-unlocked lock on the VG showing in the log, I can't be sure it's not the rate-limiting that's removed the log line though!
09:06 <@chrissie> so it looks like clvmd hasn't got the next (unlock) command from the lvm command-line AND the command-line hadn't quit
09:11 <@chrissie> I'd be interested to hear an LVM developer's take on that
09:23 -!- mbecroft [~user.co.nz] has quit [Ping timeout: 240 seconds]
09:26 < digimer> do you know who to ping on that?
09:27 < digimer> I'm not sure how to monitor LVM without pvs/vgs/lvs...
09:27 < digimer> I loop about every 45 seconds, give or take
09:27 < digimer> and I agree, a total lock up shouldn't occur
09:28 < digimer> do you want me to do something different to make sosreport more useful?
09:28 * digimer won't be doing it tonight though, it's 4:30am... need some sleep
09:28 < digimer> but I can make any changes and try to reproduce when I get back to it in a few hours
09:29 <@chrissie> without more knowledge I think the thing to do is to strace the pvs command that hangs
09:30 <@chrissie> or (maybe better) add verbose lvm logging to the command
09:30 <@chrissie> or (maybe better) add verbose lvm logging to the command
09:31 <@chrissie> -vvvvvvvvvvvvvvvvvvvvvvvvvvvvvvvvvvvvvvv
09:31 <@chrissie> (not sure how many v's you need, I just repeat it :)
09:31 < digimer> ok
09:31 < digimer> strace -v^v pvscan
09:31 < digimer> yes?
09:32 <@chrissie> no, the -vvvvvv needs to go after the pvs command
09:32 <@chrissie> it's an argument to lvm
09:32 < digimer> ah
09:32 < digimer> strace pvscan -vvvv
09:32 < digimer> got it
09:32 <@chrissie> zkabelac is who I usually speak to on LVM matters. I'm not sure if he hangs out here or not. I'll ping him when I next see him on the internal channels
09:32 < digimer> want another sosreport, or just the strace?
09:32 < digimer> thanks!
09:33 <@chrissie> just the command output & strace should be fine

The part of the DLM log I'm referring to is this (from the sosreport in comment #13):

May 22 13:24:14 an-a01n01 lvm[5131]: Got new connection on fd 5 (0x7f0bb9a9ed00)
May 22 13:24:14 an-a01n01 lvm[5131]: Read on local socket 5, len = 35
May 22 13:24:14 an-a01n01 lvm[5131]: Creating pipe, [11, 12]
May 22 13:24:14 an-a01n01 lvm[5131]: Creating pre&post thread for pipe fd 11 (0x7f0bb9a9edf0)
May 22 13:24:14 an-a01n01 lvm[5131]: Created pre&post thread, state = 0
May 22 13:24:14 an-a01n01 lvm[5131]: Pre&post thread (0x7f0bb9a9ed00), pipe fd 12
May 22 13:24:14 an-a01n01 lvm[5131]: doing PRE command LOCK_VG 'V_an-a01n01_vg0' at 1 (client=0x7f0bb9a9ed00)
May 22 13:24:14 an-a01n01 lvm[5131]: sync_lock: 'V_an-a01n01_vg0' mode:3 flags=0
May 22 13:24:14 an-a01n01 lvm[5131]: sync_lock: returning lkid 9d0001
May 22 13:24:14 an-a01n01 lvm[5131]: Pre&post thread (0x7f0bb9a9ed00) writes status 0 down to pipe fd 12
May 22 13:24:14 an-a01n01 lvm[5131]: Pre&post thread (0x7f0bb9a9ed00) waiting to do post command, state = 0
May 22 13:24:14 an-a01n01 lvm[5131]: Read on pipe 11, 4 bytes, status 0
May 22 13:24:14 an-a01n01 lvm[5131]: Background routine status was 0, sock_client (0x7f0bb9a9ed00)
May 22 13:24:14 an-a01n01 lvm[5131]: distribute command: XID = 1373, flags=0x1 (LOCAL)
May 22 13:24:14 an-a01n01 lvm[5131]: Doing command explicitly on local node only
May 22 13:24:14 an-a01n01 lvm[5131]: add_to_lvmqueue: cmd=0x7f0bb9a9e480. client=0x7f0bb9a9ed00, msg=0x7f0bb9a960d0, len=35, csid=(nil), xid=1373
May 22 13:24:14 an-a01n01 lvm[5131]: process_work_item: local
May 22 13:24:14 an-a01n01 lvm[5131]: process_local_command: LOCK_VG (0x33) msg=0x7f0bb9a9e4c0, msglen =35, client=0x7f0bb9a9ed00
May 22 13:24:14 an-a01n01 lvm[5131]: do_lock_vg: resource 'V_an-a01n01_vg0', cmd = 0x1 LCK_VG (READ|VG), flags = 0x4 ( DMEVENTD_MONITOR ), critical_section = 0
May 22 13:24:14 an-a01n01 lvm[5131]: Invalidating cached metadata for VG an-a01n01_vg0
May 22 13:24:14 an-a01n01 lvm[5131]: Reply from node an-a01n01.alteeve.com: 0 bytes
May 22 13:24:14 an-a01n01 lvm[5131]: Got 1 replies, expecting: 1
May 22 13:24:14 an-a01n01 lvm[5131]: LVM thread waiting for work
May 22 13:24:14 an-a01n01 lvm[5131]: Pre&post thread (0x7f0bb9a9ed00) got post command condition...
May 22 13:24:14 an-a01n01 lvm[5131]: Pre&post thread (0x7f0bb9a9ed00) waiting for next pre command
May 22 13:24:14 an-a01n01 lvm[5131]: Read on pipe 11, 4 bytes, status 0
May 22 13:24:14 an-a01n01 lvm[5131]: Background routine status was 0, sock_client (0x7f0bb9a9ed00)
May 22 13:24:14 an-a01n01 lvm[5131]: Send local reply
May 22 13:24:14 an-a01n01 lvm[5131]: Read on local socket 5, len = 84
May 22 13:24:14 an-a01n01 lvm[5131]: check_all_clvmds_running
May 22 13:24:14 an-a01n01 lvm[5131]: Pre&post thread (0x7f0bb9a9ed00) got pre command condition...
May 22 13:24:14 an-a01n01 lvm[5131]: Pre&post thread (0x7f0bb9a9ed00) writes status 0 down to pipe fd 12
May 22 13:24:14 an-a01n01 lvm[5131]: Pre&post thread (0x7f0bb9a9ed00) waiting to do post command, state = 0
May 22 13:24:14 an-a01n01 lvm[5131]: Read on pipe 11, 4 bytes, status 0
...

There's more here, but the PRE command gets a lock on the VG, runs the command it wants to do, and then I don't see another PRE command to unlock the VG. The log ends here:

May 22 13:24:14 an-a01n01 lvm[5131]: add_to_lvmqueue: cmd=0x7f0bb9a9e480. client=0x7f0bb9a9ed00, msg=0x7f0bb9a960d0, len=31, csid=(nil), xid=1376
May 22 13:24:14 an-a01n01 lvm[5131]: Sending message to all cluster nodes
May 22 13:24:14 an-a01n01 lvm[5131]: process_work_item: local
May 22 13:24:14 an-a01n01 lvm[5131]: process_local_command: SYNC_NAMES (0x2d) msg=0x7f0bb9a9e4c0, msglen =31, client=0x7f0bb9a9ed00
May 22 13:24:14 an-a01n01 lvm[5131]: Syncing device names
May 22 13:24:14 an-a01n01 lvm[5131]: Reply from node an-a01n01.alteeve.com: 0 bytes
May 22 13:24:14 an-a01n01 lvm[5131]: Got 1 replies, expecting: 2
May 22 13:24:14 an-a01n01 lvm[5131]: LVM thread waiting for work
May 22 13:24:14 an-a01n01 lvm[5131]: Reply from node an-a01n02.alteeve.com: 0 bytes
May 22 13:24:14 an-a01n01 lvm[5131]: Got 2 replies, expecting: 2
May 22 13:24:14 an-a01n01 lvm[5131]: Pre&post thread (0x7f0bb9a9ed00) got post command condition...
May 22 13:24:14 an-a01n01 lvm[5131]: Pre&post thread (0x7f0bb9a9ed00) waiting for next pre command

It happened again. Here is the strace:

====
...
stat("/dev/drbd0", {st_mode=S_IFBLK|0660, st_rdev=makedev(147, 0), ...}) = 0
stat("/dev/block/253:0", 0x7ffc37313170) = -1 ENOENT (No such file or directory)
stat("/dev/block/253:0", 0x7ffc373130a0) = -1 ENOENT (No such file or directory)
stat("/dev/disk/by-label/ccrs:shared", 0x7ffc37313170) = -1 ENOENT (No such file or directory)
stat("/dev/disk/by-label/ccrs:shared", 0x7ffc373130a0) = -1 ENOENT (No such file or directory)
stat("/dev/disk/by-id/dm-name-node1_vg0-shared", 0x7ffc37313170) = -1 ENOENT (No such file or directory)
stat("/dev/disk/by-id/dm-name-node1_vg0-shared", 0x7ffc373130a0) = -1 ENOENT (No such file or directory)
stat("/dev/drbd/by-disk/vda5", {st_mode=S_IFBLK|0660, st_rdev=makedev(147, 0), ...}) = 0
stat("/dev/mapper/node1_vg0-shared", 0x7ffc37313170) = -1 ENOENT (No such file or directory)
stat("/dev/mapper/node1_vg0-shared", 0x7ffc373130a0) = -1 ENOENT (No such file or directory)
stat("/dev/node1_vg0/shared", 0x7ffc37313170) = -1 ENOENT (No such file or directory)
stat("/dev/node1_vg0/shared", 0x7ffc373130a0) = -1 ENOENT (No such file or directory)
stat("/dev/drbd/by-res/r0/0", {st_mode=S_IFBLK|0660, st_rdev=makedev(147, 0), ...}) = 0
stat("/dev/dm-0", 0x7ffc37313170) = -1 ENOENT (No such file or directory)
stat("/dev/dm-0", 0x7ffc373130a0) = -1 ENOENT (No such file or directory)
stat("/dev/disk/by-id/dm-uuid-LVM-u2b7folff7WHVHm9LKbB4lmPN5S6KfGYbMRcecVTfHUpeIQxsuKD5FqCAthuL8Ym", 0x7ffc37313170) = -1 ENOENT (No such file or directory)
stat("/dev/disk/by-id/dm-uuid-LVM-u2b7folff7WHVHm9LKbB4lmPN5S6KfGYbMRcecVTfHUpeIQxsuKD5FqCAthuL8Ym", 0x7ffc373130a0) = -1 ENOENT (No such file or directory)
open("/sys/dev/block", O_RDONLY|O_NONBLOCK|O_DIRECTORY|O_CLOEXEC) = 4
getdents(4, /* 34 entries */, 32768) = 872
open("/sys/dev/block/1:0/holders/", O_RDONLY|O_NONBLOCK|O_DIRECTORY|O_CLOEXEC) = 5  getdents(5, /* 2 entries */, 32768) = 48  getdents(5, /* 0 entries */, 32768) = 0  close(5) = 0
open("/sys/dev/block/1:1/holders/", O_RDONLY|O_NONBLOCK|O_DIRECTORY|O_CLOEXEC) = 5  getdents(5, /* 2 entries */, 32768) = 48  getdents(5, /* 0 entries */, 32768) = 0  close(5) = 0
open("/sys/dev/block/1:2/holders/", O_RDONLY|O_NONBLOCK|O_DIRECTORY|O_CLOEXEC) = 5  getdents(5, /* 2 entries */, 32768) = 48  getdents(5, /* 0 entries */, 32768) = 0  close(5) = 0
open("/sys/dev/block/1:3/holders/", O_RDONLY|O_NONBLOCK|O_DIRECTORY|O_CLOEXEC) = 5  getdents(5, /* 2 entries */, 32768) = 48  getdents(5, /* 0 entries */, 32768) = 0  close(5) = 0
open("/sys/dev/block/1:4/holders/", O_RDONLY|O_NONBLOCK|O_DIRECTORY|O_CLOEXEC) = 5  getdents(5, /* 2 entries */, 32768) = 48  getdents(5, /* 0 entries */, 32768) = 0  close(5) = 0
open("/sys/dev/block/1:5/holders/", O_RDONLY|O_NONBLOCK|O_DIRECTORY|O_CLOEXEC) = 5  getdents(5, /* 2 entries */, 32768) = 48  getdents(5, /* 0 entries */, 32768) = 0  close(5) = 0
open("/sys/dev/block/1:6/holders/", O_RDONLY|O_NONBLOCK|O_DIRECTORY|O_CLOEXEC) = 5  getdents(5, /* 2 entries */, 32768) = 48  getdents(5, /* 0 entries */, 32768) = 0  close(5) = 0
open("/sys/dev/block/1:7/holders/", O_RDONLY|O_NONBLOCK|O_DIRECTORY|O_CLOEXEC) = 5  getdents(5, /* 2 entries */, 32768) = 48  getdents(5, /* 0 entries */, 32768) = 0  close(5) = 0
open("/sys/dev/block/1:8/holders/", O_RDONLY|O_NONBLOCK|O_DIRECTORY|O_CLOEXEC) = 5  getdents(5, /* 2 entries */, 32768) = 48  getdents(5, /* 0 entries */, 32768) = 0  close(5) = 0
open("/sys/dev/block/1:9/holders/", O_RDONLY|O_NONBLOCK|O_DIRECTORY|O_CLOEXEC) = 5  getdents(5, /* 2 entries */, 32768) = 48  getdents(5, /* 0 entries */, 32768) = 0  close(5) = 0
open("/sys/dev/block/1:10/holders/", O_RDONLY|O_NONBLOCK|O_DIRECTORY|O_CLOEXEC) = 5  getdents(5, /* 2 entries */, 32768) = 48  getdents(5, /* 0 entries */, 32768) = 0  close(5) = 0
open("/sys/dev/block/1:11/holders/", O_RDONLY|O_NONBLOCK|O_DIRECTORY|O_CLOEXEC) = 5  getdents(5, /* 2 entries */, 32768) = 48  getdents(5, /* 0 entries */, 32768) = 0  close(5) = 0
open("/sys/dev/block/1:12/holders/", O_RDONLY|O_NONBLOCK|O_DIRECTORY|O_CLOEXEC) = 5  getdents(5, /* 2 entries */, 32768) = 48  getdents(5, /* 0 entries */, 32768) = 0  close(5) = 0
open("/sys/dev/block/1:13/holders/", O_RDONLY|O_NONBLOCK|O_DIRECTORY|O_CLOEXEC) = 5  getdents(5, /* 2 entries */, 32768) = 48  getdents(5, /* 0 entries */, 32768) = 0  close(5) = 0
open("/sys/dev/block/1:14/holders/", O_RDONLY|O_NONBLOCK|O_DIRECTORY|O_CLOEXEC) = 5  getdents(5, /* 2 entries */, 32768) = 48  getdents(5, /* 0 entries */, 32768) = 0  close(5) = 0
open("/sys/dev/block/1:15/holders/", O_RDONLY|O_NONBLOCK|O_DIRECTORY|O_CLOEXEC) = 5  getdents(5, /* 2 entries */, 32768) = 48  getdents(5, /* 0 entries */, 32768) = 0  close(5) = 0
open("/sys/dev/block/7:0/holders/", O_RDONLY|O_NONBLOCK|O_DIRECTORY|O_CLOEXEC) = 5  getdents(5, /* 2 entries */, 32768) = 48  getdents(5, /* 0 entries */, 32768) = 0  close(5) = 0
open("/sys/dev/block/7:1/holders/", O_RDONLY|O_NONBLOCK|O_DIRECTORY|O_CLOEXEC) = 5  getdents(5, /* 2 entries */, 32768) = 48  getdents(5, /* 0 entries */, 32768) = 0  close(5) = 0
open("/sys/dev/block/7:2/holders/", O_RDONLY|O_NONBLOCK|O_DIRECTORY|O_CLOEXEC) = 5  getdents(5, /* 2 entries */, 32768) = 48  getdents(5, /* 0 entries */, 32768) = 0  close(5) = 0
open("/sys/dev/block/7:3/holders/", O_RDONLY|O_NONBLOCK|O_DIRECTORY|O_CLOEXEC) = 5  getdents(5, /* 2 entries */, 32768) = 48  getdents(5, /* 0 entries */, 32768) = 0  close(5) = 0
open("/sys/dev/block/7:4/holders/", O_RDONLY|O_NONBLOCK|O_DIRECTORY|O_CLOEXEC) = 5  getdents(5, /* 2 entries */, 32768) = 48  getdents(5, /* 0 entries */, 32768) = 0  close(5) = 0
open("/sys/dev/block/7:5/holders/", O_RDONLY|O_NONBLOCK|O_DIRECTORY|O_CLOEXEC) = 5  getdents(5, /* 2 entries */, 32768) = 48  getdents(5, /* 0 entries */, 32768) = 0  close(5) = 0
open("/sys/dev/block/7:6/holders/", O_RDONLY|O_NONBLOCK|O_DIRECTORY|O_CLOEXEC) = 5  getdents(5, /* 2 entries */, 32768) = 48  getdents(5, /* 0 entries */, 32768) = 0  close(5) = 0
open("/sys/dev/block/7:7/holders/", O_RDONLY|O_NONBLOCK|O_DIRECTORY|O_CLOEXEC) = 5  getdents(5, /* 2 entries */, 32768) = 48  getdents(5, /* 0 entries */, 32768) = 0  close(5) = 0
open("/sys/dev/block/252:0/holders/", O_RDONLY|O_NONBLOCK|O_DIRECTORY|O_CLOEXEC) = 5  getdents(5, /* 2 entries */, 32768) = 48  getdents(5, /* 0 entries */, 32768) = 0  close(5) = 0
open("/sys/dev/block/252:1/holders/", O_RDONLY|O_NONBLOCK|O_DIRECTORY|O_CLOEXEC) = 5  getdents(5, /* 2 entries */, 32768) = 48  getdents(5, /* 0 entries */, 32768) = 0  close(5) = 0
open("/sys/dev/block/252:2/holders/", O_RDONLY|O_NONBLOCK|O_DIRECTORY|O_CLOEXEC) = 5  getdents(5, /* 2 entries */, 32768) = 48  getdents(5, /* 0 entries */, 32768) = 0  close(5) = 0
open("/sys/dev/block/252:3/holders/", O_RDONLY|O_NONBLOCK|O_DIRECTORY|O_CLOEXEC) = 5  getdents(5, /* 2 entries */, 32768) = 48  getdents(5, /* 0 entries */, 32768) = 0  close(5) = 0
open("/sys/dev/block/252:4/holders/", O_RDONLY|O_NONBLOCK|O_DIRECTORY|O_CLOEXEC) = 5  getdents(5, /* 2 entries */, 32768) = 48  getdents(5, /* 0 entries */, 32768) = 0  close(5) = 0
open("/sys/dev/block/252:5/holders/", O_RDONLY|O_NONBLOCK|O_DIRECTORY|O_CLOEXEC) = 5  getdents(5, /* 3 entries */, 32768) = 80  getdents(5, /* 0 entries */, 32768) = 0  close(5) = 0
open("/sys/dev/block/11:0/holders/", O_RDONLY|O_NONBLOCK|O_DIRECTORY|O_CLOEXEC) = 5  getdents(5, /* 2 entries */, 32768) = 48  getdents(5, /* 0 entries */, 32768) = 0  close(5) = 0
open("/sys/dev/block/147:0/holders/", O_RDONLY|O_NONBLOCK|O_DIRECTORY|O_CLOEXEC) = 5  getdents(5, /* 2 entries */, 32768) = 48  getdents(5, /* 0 entries */, 32768) = 0  close(5) = 0
getdents(4, /* 0 entries */, 32768) = 0
close(4) = 0
close(3) = 0
socket(PF_LOCAL, SOCK_STREAM, 0) = 3
connect(3, {sa_family=AF_LOCAL, sun_path="/var/run/lvm/clvmd.sock"}, 110) = 0
rt_sigprocmask(SIG_SETMASK, ~[RTMIN RT_1], [], 8) = 0
write(3, "3\0\0\0\0\0\0\0\0\0\0\0\v\0\0\0\0\4\4P_#global\0", 29) = 29
read(3,
====

This is where it sits. The full strace will be attached momentarily.

Created attachment 1288909 [details]
Full strace of pvscan
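The strace ends with pvscan blocked in read() on /var/run/lvm/clvmd.sock right after writing a request for the 'P_#global' lock, i.e. it is waiting for a reply that clvmd never sends. When the hang is live, a few checks can narrow down where each side is stuck before the node is fenced. A sketch, assuming debugfs is mounted for dlm_tool and a kernel that exposes /proc/<pid>/stack:

====
# Where is the hung pvscan sleeping in the kernel?
PID=$(pgrep -x pvscan | head -n1)
cat /proc/$PID/wchan; echo
cat /proc/$PID/stack

# What is clvmd itself doing?
cat /proc/$(pidof clvmd)/stack

# Dump the clvmd DLM lockspace - the piece chrissie noted the
# sosreports are missing:
dlm_tool ls
dlm_tool lockdebug clvmd
====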
I just hit a bug that seems to be related to this, but it didn't manifest in the normal way. I tried to update the cluster and, as part of that, called for rgmanager to stop; it hung there. When I investigated, I found that a 'vgs' process was stuck and I couldn't kill it.

====
[root@node01 ~]# pvscan
  PV /dev/drbd0   VG node01_vg0   lvm2 [1.18 TiB / 1.15 TiB free]
  Total: 1 [1.18 TiB] / in use: 1 [1.18 TiB] / in no VG: 0 [0   ]
[root@node01 ~]# vgscan
  Reading all physical volumes.  This may take a while...
  Found volume group "node01_vg0" using metadata type lvm2
[root@node01 ~]# lvscan
  ACTIVE            '/dev/node01_vg0/shared' [20.00 GiB] inherit
  ACTIVE            '/dev/node01_vg0/vm03-server_0' [20.00 GiB] inherit
[root@node01 ~]# lvchange -an /dev/node01_vg0/vm03-server_0
  Error locking on node node02.example.com: Logical volume node01_vg0/vm03-server_0 in use.
[root@node01 ~]# /etc/init.d/libvirtd start
Starting libvirtd daemon:                                  [  OK  ]
[root@node01 ~]# virsh list --all
 Id    Name                           State
----------------------------------------------------

[root@node01 ~]# ps aux | grep server
root     15614  0.0  0.0 105368   912 pts/4    S+   00:46   0:00 grep server
[root@node01 ~]# ps aux | grep vgs
root      1079  0.0  0.0 129420  2864 ?        S    Jun23   0:00 /sbin/vgs --noheadings -o vg_name -S vg_clustered=1
root      1360  0.0  0.0  32536  4780 ?        S<   Jun23   0:00 /sbin/vgs --noheadings -o vg_name -S vg_clustered=1
root     15649  0.0  0.0 105372   912 pts/4    S+   00:46   0:00 grep vgs
[root@node01 ~]# kill 1079
[root@node01 ~]# kill 1360
[root@node01 ~]# ps aux | grep vgs
root      1079  0.0  0.0 129420  2864 ?        S    Jun23   0:00 /sbin/vgs --noheadings -o vg_name -S vg_clustered=1
root      1360  0.0  0.0  32536  4780 ?        S<   Jun23   0:00 /sbin/vgs --noheadings -o vg_name -S vg_clustered=1
root     15711  0.0  0.0 105372   916 pts/4    S+   00:47   0:00 grep vgs
[root@node01 ~]# ps auxf | grep vgs -B 5 -A 5
root      9738  0.0  0.0 102072  4076 ?        Ss   00:25   0:00  \_ sshd: root@pts/3
root      9748  0.0  0.0 112068  3632 pts/3    Ss+  00:25   0:00  |   \_ -bash
root     13447  0.0  0.0 102712  4616 ?        Ss   00:38   0:00  \_ sshd: root@pts/4
root     13460  0.0  0.0 112188  3804 pts/4    Ss   00:38   0:00  |   \_ -bash
root     15777  0.0  0.0 112288  1600 pts/4    R+   00:47   0:00  |       \_ ps auxf
root     15778  0.0  0.0 105368   880 pts/4    S+   00:47   0:00  |       \_ grep vgs -B 5 -A 5
root     13455  0.0  0.0 102072  4100 ?        Ss   00:38   0:00  \_ sshd: root@pts/5
root     13490  0.0  0.0 112188  3780 pts/5    Ss   00:38   0:00      \_ -bash
root     14429  0.0  0.0 103004   672 pts/5    S+   00:42   0:00          \_ tail -f -n 100 /var/log/messages
ntp       3413  0.0  0.0  32788  2148 ?        Ss   Jun21   0:07 ntpd -u ntp:ntp -p /var/run/ntpd.pid -g
root      3492  0.0  0.0  83000  3464 ?        Ss   Jun21   0:16 /usr/libexec/postfix/master
--
root     24294  0.0  0.0  38992  7276 ?        S<Ls Jun21   0:00 rgmanager
root     24296  0.0  0.0 789756  5548 ?        S<l  Jun21   1:29  \_ rgmanager
root      1303  0.0  0.0  13352  1488 ?        S<   Jun23   0:00      \_ /bin/bash /usr/share/cluster/script.sh status
root      1353  0.0  0.0  13616  1792 ?        S<   Jun23   0:00          \_ /bin/bash /etc/init.d/clvmd status
root      1359  0.0  0.0  13616  1000 ?        S<   Jun23   0:00              \_ /bin/bash /etc/init.d/clvmd status
root      1360  0.0  0.0  32536  4780 ?        S<   Jun23   0:00                  \_ /sbin/vgs --noheadings -o vg_name -S vg_clustered=1
root      1079  0.0  0.0 129420  2864 ?        S    Jun23   0:00 /sbin/vgs --noheadings -o vg_name -S vg_clustered=1
root     11311  0.0  0.0 108124  1504 ?        Ss   00:29   0:00 bash -c /etc/init.d/rgmanager stop && /etc/init.d/cman stop
root     11318  0.1  0.0 108524  1896 ?        S    00:29   0:01  \_ /bin/bash /etc/init.d/rgmanager stop
root     15776  0.0  0.0 102968   620 ?        S    00:47   0:00      \_ sleep 1
root     15486  0.2  0.0 698516  8092 ?        Sl   00:46   0:00 libvirtd --daemon
[root@node01 ~]# exit
====

(slightly edited to anonymize data, but just domain/server names).
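The ps tree above also shows where the stuck vgs processes came from: rgmanager's periodic status check (rgmanager -> script.sh status -> /etc/init.d/clvmd status) runs the command below. With clvmd wedged, each poll presumably blocks on the clvmd socket just like the traced pvscan, which would explain why 'rgmanager stop' then hangs behind them:

====
# The status-check command visible in the ps tree. With clvmd healthy it
# lists clustered VGs and exits; with clvmd hung it never returns.
/sbin/vgs --noheadings -o vg_name -S vg_clustered=1
====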
Whatever I tried, I couldn't recover the node, so I manually fenced it off (just a 'fence_node ...'). Alasdair mentioned on IRC that this might be related to a bug where things can lock up if an LVM operation happens just as the cluster membership is changing. Here are the syslogs from the same node:

====
Jun 24 00:29:51 node01 rgmanager[24296]: Shutting down
Jun 24 00:29:51 node01 rgmanager[24296]: Stopping service service:libvirtd_n01
Jun 24 00:29:52 node01 rgmanager[24296]: Service service:libvirtd_n01 is stopped
Jun 24 00:30:01 node01 rgmanager[24296]: Shutting down
Jun 24 00:42:41 node01 rgmanager[24296]: Shutting down
Jun 24 00:42:47 node01 rgmanager[24296]: Shutting down
Jun 24 00:43:45 node01 kernel: Slow work thread pool: Shutting down
Jun 24 00:43:45 node01 kernel: Slow work thread pool: Shut down complete
Jun 24 00:43:52 node01 kernel: block drbd0: State change failed: Device is held open by someone
Jun 24 00:43:52 node01 kernel: block drbd0:   state = { cs:Connected ro:Primary/Primary ds:UpToDate/UpToDate r----- }
Jun 24 00:43:52 node01 kernel: block drbd0:  wanted = { cs:Connected ro:Secondary/Primary ds:UpToDate/UpToDate r----- }
Jun 24 00:43:52 node01 kernel: block drbd0: State change failed: Device is held open by someone
Jun 24 00:43:52 node01 kernel: block drbd0:   state = { cs:Connected ro:Primary/Primary ds:UpToDate/UpToDate r----- }
Jun 24 00:43:52 node01 kernel: block drbd0:  wanted = { cs:Connected ro:Secondary/Primary ds:UpToDate/UpToDate r----- }
Jun 24 00:44:29 node01 kernel: block drbd0: State change failed: Device is held open by someone
Jun 24 00:44:29 node01 kernel: block drbd0:   state = { cs:Connected ro:Primary/Primary ds:UpToDate/UpToDate r----- }
Jun 24 00:44:29 node01 kernel: block drbd0:  wanted = { cs:Connected ro:Secondary/Primary ds:UpToDate/UpToDate r----- }
Jun 24 00:46:28 node01 kernel: lo: Disabled Privacy Extensions
====

As this is a user's system, I couldn't install debuginfo, I'm afraid, so this is mainly a comment in case it helps. I should mention that after node 1 was fenced, rebooted and rejoined the cluster, things started working normally again.

There are 4 upstream patches now that might be relevant:

commit f2eda36cfa9baf3f7e132dbc3a3f34c908cb36d4
Author: Alasdair G Kergon <agk>
Date:   Sat Jul 1 01:34:38 2017 +0100

    clvmd: Fix client list corruption

    Centralise editing of the client list into _add_client() and _del_client().
    Introduce _local_client_count to track the size of the list for debugging
    purposes.  Simplify and standardise the various ways the list gets walked.

    While processing one element of the list in main_loop(), cleanup_zombie()
    may be called and remove a different element, so make sure main_loop()
    refreshes its list state on return.

    Prior to this patch, the list edits for clients disappearing could race
    against the list edits for new clients connecting and corrupt the list
    and cause a variety of segfaults.  An easy way to trigger such failures
    was by repeatedly running shell commands such as:
      lvs &; lvs &; lvs &;...;killall -9 lvs; lvs &; lvs &;...

    Situations that occasionally lead to the failures can be spotted by
    looking for 'EOF' with 'inprogress=1' in the clvmd debug logs.

commit af789fd6d0cba45cc4897772bae3ca73a3f15b38
Author: Alasdair G Kergon <agk>
Date:   Sat Jul 1 01:17:40 2017 +0100

    clvmd: add client id to debug log messages

    Use standard format to make it easier to find the client to which each
    debug log message refers.
commit 17ed25409136162d091809d412596a45e0a30fd4
Author: Alasdair G Kergon <agk>
Date:   Sat Jul 1 00:58:39 2017 +0100

    clvmd: add debuglog mutex

    Log messages issued by different threads occasionally got intertwined.

commit dae4f53acb269219e876c229c8f034fcdaf3ff5a
Author: Zdenek Kabelac <zkabelac>
Date:   Sat Feb 4 14:47:27 2017 +0100

    clvmd: add mutex protection for cpg_ call

    The library for corosync multicasting is not supporting multithread
    usage - add local mutex to avoid parallel call of cpg_mcast_joined().

(Minimal sketches of the list-editing and locking patterns these patches describe are included after the last comment below.)

There will be no further work on this for RHEL 6. If the fixes from comment 21 do not address the issue, you can open a bug against RHEL 7 (or reopen this bug and move it to RHEL 7).

RHEL 6 is still supported for bug and security fixes, isn't it? This bug causes production RHEL 6 clusters to hang badly. Does that not qualify for a bug fix under RHEL 6 support? With the change from cman/rgmanager to pacemaker, along with the change from SysV init to systemd, simply upgrading to RHEL 7 isn't viable at this time.
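To make the client-list fix concrete, here is a minimal sketch of the centralised list-editing pattern that commit f2eda36 describes. Only the names _add_client(), _del_client() and _local_client_count come from the commit message; the struct layout and function bodies below are simplified stand-ins, not clvmd's actual code:

====
#include <stddef.h>

/* Simplified stand-in for clvmd's per-client state. */
struct local_client {
	int fd;                     /* client socket */
	struct local_client *next;  /* next entry in the singly-linked list */
};

static struct local_client *_client_list = NULL;
static int _local_client_count = 0;   /* list size, for debugging */

/* All additions to the client list go through here. */
static void _add_client(struct local_client *client)
{
	client->next = _client_list;
	_client_list = client;
	_local_client_count++;
}

/* All removals go through here.  Returns 0 on success, -1 if the
 * client was not on the list. */
static int _del_client(struct local_client *client)
{
	struct local_client **link = &_client_list;

	while (*link) {
		if (*link == client) {
			*link = client->next;   /* unlink */
			_local_client_count--;
			return 0;
		}
		link = &(*link)->next;
	}
	return -1;
}
====

Funnelling every edit through two helpers means the list can only change at known call sites. Per the commit message, cleanup_zombie() may still remove a different element while main_loop() is mid-walk, which is why main_loop() additionally has to refresh its position in the list after any call that might have edited it.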
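And a minimal sketch of the serialisation added by the two mutex patches (17ed254 and dae4f53a). The wrapper name _cpg_mcast_locked() is hypothetical; only cpg_mcast_joined() and the pthread calls are real APIs:

====
#include <pthread.h>
#include <sys/uio.h>
#include <corosync/cpg.h>

/* libcpg does not support multithreaded use, so every call to
 * cpg_mcast_joined() is serialised behind one process-wide mutex. */
static pthread_mutex_t _cpg_mutex = PTHREAD_MUTEX_INITIALIZER;

/* Hypothetical wrapper: callers use this instead of calling
 * cpg_mcast_joined() directly. */
static cs_error_t _cpg_mcast_locked(cpg_handle_t handle,
                                    const struct iovec *iov,
                                    unsigned int iov_len)
{
	cs_error_t err;

	pthread_mutex_lock(&_cpg_mutex);
	err = cpg_mcast_joined(handle, CPG_TYPE_AGREED, iov, iov_len);
	pthread_mutex_unlock(&_cpg_mutex);

	return err;
}
====

The debuglog mutex follows the same shape: one mutex held around the format-and-write step, so messages from different threads can no longer interleave.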