Bug 1349755
| Summary: | Crashed a node and a service in a restricted failoverdomain on the lost node remained showing 'started' | | |
|---|---|---|---|
| Product: | Red Hat Enterprise Linux 6 | Reporter: | Madison Kelly <mkelly> |
| Component: | rgmanager | Assignee: | Ryan McCabe <rmccabe> |
| Status: | CLOSED WONTFIX | QA Contact: | cluster-qe <cluster-qe> |
| Severity: | high | Docs Contact: | |
| Priority: | high | | |
| Version: | 6.7 | CC: | cluster-maint, fdinitto, jkortus, jruemker, mkelly, rmccabe |
| Target Milestone: | rc | Keywords: | Reopened |
| Target Release: | --- | | |
| Hardware: | Unspecified | | |
| OS: | Unspecified | | |
| Whiteboard: | | | |
| Fixed In Version: | | Doc Type: | If docs needed, set a value |
| Doc Text: | | Story Points: | --- |
| Clone Of: | | Environment: | |
| Last Closed: | 2017-12-06 10:28:20 UTC | Type: | Bug |
| Regression: | --- | Mount Type: | --- |
| Documentation: | --- | CRM: | |
| Verified Versions: | | Category: | --- |
| oVirt Team: | --- | RHEL 7.3 requirements from Atomic Host: | |
| Cloudforms Team: | --- | Target Upstream Version: | |
| Embargoed: | | | |
Description
Madison Kelly
2016-06-24 07:37:57 UTC
Stopping the service manually marked it as stopped.

====
[root@node2 ~]# clusvcadm -d libvirtd_n01
Local machine disabling service:libvirtd_n01...Success
====
[root@node2 ~]# clustat
Cluster Status for ccrs @ Fri Jun 24 07:39:12 2016
Member Status: Quorate

 Member Name                ID   Status
 ------ ----                ---- ------
 node1.ccrs.bcn             1    Offline
 node2.ccrs.bcn             2    Online, Local, rgmanager

 Service Name               Owner (Last)               State
 ------- ----               ----- ------               -----
 service:libvirtd_n01       (node1.ccrs.bcn)           stopped
 service:libvirtd_n02       node2.ccrs.bcn             started
 service:storage_n01        (node1.ccrs.bcn)           stopped
 service:storage_n02        node2.ccrs.bcn             started
 vm:server                  node2.ccrs.bcn             started
====

I checked the logs and noticed that there was no mention of the 'libvirtd_n01' service, where there was for 'storage_n02'. It would appear that rgmanager just missed that service entirely.

Hello,

Thank you for the problem report. We would need to investigate the details of this behavior further with knowledge of your configuration and the conditions present at the time of the incident, and those are best explored through our support process. Do you have a customer account with Red Hat Support through which you could open a case and engage us for further discussion on this matter? If so, please go ahead and create that case, reference this bug, and we can help drive the investigation forward to resolution.

Thanks,
John

I'm an ISV with NFR self-support subscriptions, so this channel is the only one available to me. Note that this happened in a test environment that has since been torn down.

Possibly related: I've been noticing additional problems with rgmanager misbehaving. I've not been able to capture details, but a common issue is that I will start rgmanager but clustat does not show any services. I can verify in these cases that '/etc/init.d/rgmanager status' shows that it is running. Restarting rgmanager recovers this. All of these tests are on fresh RHEL 6.7 (updated within the .z stream) and 6.8 installs.

Unfortunately there are a variety of conditions that can cause some of the symptoms you've described, so it's difficult to conclude this is a defect of any kind, or to give specific guidance as to how you should proceed, without being able to carry out a deeper investigation. The first problem you described, with the libvirtd_n01 service being left in a started state, is certainly not expected behavior and is not something that shows up in our testing, but it's also not obvious what could have led to it that we could immediately resolve. We'd need to work through reproductions of the issue while capturing additional diagnostic information in order to narrow the focus, and all of that would need to be done with consideration for your specific configuration and environment, which is where the support process would come in.

The code in question that evaluates resource groups to decide what to do with them after a node is fenced does iterate the entire config tree, and it should produce a log message for each one that fails or gets bypassed for some reason, so there's no obvious reason why one would get missed. We've seen similar issues when there have been past problems attempting to apply a configuration update that introduced the "missing" resources or services, so we would have to understand the history of events since the cluster started to rule certain possibilities in or out. If you wanted to enable debug logging, you might see more clearly whether it's actually attempting to process that resource group or not.
So again, without the ability to reproduce this in your original environment and work through it together through the support process, we're left without an obvious target for this bug report to focus on.

The other issue you described, with services not appearing in clustat, is more likely to be a result of some other problematic condition in the environment, and so is not immediately obvious as a bug. This can result from an initialization (stop) operation blocking when rgmanager starts, from membership changes requiring fencing that has not completed yet, from DLM problems, or from a variety of other conditions. It's not an uncommon symptom to see, so without the ability to investigate further with you through our support offering, I'm afraid we couldn't give you more specific suggestions other than to look for other problematic conditions in the cluster's state and try to resolve them. With Self-Support entitlements, you should have access to the Red Hat Customer Portal, so you may want to search thoroughly through the knowledgebase and/or initiate a discussion with other users to see if they have had similar experiences.

If you have any questions, let us know, or tell us whether you're OK with us closing this with INSUFFICIENT_DATA if you are unable to offer more direct evidence of a defect.

Thanks,
John Ruemker
Principal Software Maintenance Engineer
Red Hat Support

Can you give some advice on how to best enable debug logging? We have a fully automated cluster build system and a very static configuration, so I am sure that I can hammer on the system until the issue returns. At that time, I can append to this ticket. If you want to mark it as INSUFFICIENT_DATA for now, provided I can append to/reopen this later, that would be fine by me. I do understand the difficulty of debugging without a reliable reproducer. :)

(In reply to digimer from comment #6)
> Can you give some advice on how to best enable debug logging?

How do I configure logging for the various components of a RHEL 6 High Availability or Resilient Storage cluster?
https://access.redhat.com/solutions/35528

Red Hat Enterprise Linux 6 Cluster Administration - 4.5.6. Logging Configuration
https://access.redhat.com/documentation/en-US/Red_Hat_Enterprise_Linux/6/html-single/Cluster_Administration/index.html#s1-config-logging-conga-CA

Red Hat Enterprise Linux 6 Cluster Administration - 6.14.4. Logging
https://access.redhat.com/documentation/en-US/Red_Hat_Enterprise_Linux/6/html-single/Cluster_Administration/index.html#s2-logconfig-ccs-CA

man(5) cluster.conf

The following will achieve debug logging for rgmanager in a way that would show it evaluating each resource group after a node is fenced:

<logging_daemon name="rgmanager" debug="on"/>

I'll close this for now.

Thanks,
John

Follow-up: I'm not having much luck getting useful information out of rgmanager, but whatever is wrong with it seems to be getting worse. Today, I found myself in a situation where a storage service (drbd -> clvmd -> gfs2) failed on a node at start. After that, calling 'clusvcadm -d X' updated clustat to show that the storage service was stopped, but the shell call would not exit.
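As a side note for anyone following along, here is a minimal sketch of one way the <logging_daemon> directive John gives above could be applied on a RHEL 6 cman/rgmanager node. The commands are the standard config-validation and propagation tools for that stack, but treat the exact steps as an illustration rather than a procedure verified on this cluster:

====
# Sketch only: enable rgmanager debug logging in /etc/cluster/cluster.conf.
# Inside <cluster>, add (and bump config_version by one):
#   <logging>
#     <logging_daemon name="rgmanager" debug="on"/>
#   </logging>
vi /etc/cluster/cluster.conf            # edit the config, raise config_version
ccs_config_validate                     # sanity-check the edited cluster.conf
cman_tool version -r                    # propagate/activate the new config_version
tail -f /var/log/cluster/rgmanager.log  # default rgmanager log location on RHEL 6
====

If ricci-based propagation is not in use, the edited file would need to be copied to the peer node by hand before activating the new version.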
==== Aug 1 17:44:52 an-a05n02 rgmanager[30999]: [script] script:clvmd: status of /etc/init.d/clvmd failed (returned 3) Aug 1 17:44:52 an-a05n02 rgmanager[17323]: status on script "clvmd" returned 1 (generic error) Aug 1 17:44:52 an-a05n02 rgmanager[17323]: Stopping service service:storage_n02 Aug 1 17:44:52 an-a05n02 rgmanager[31036]: [clusterfs] stop: Could not match /dev/an-a05n01_vg0/shared with a real device Aug 1 17:44:52 an-a05n02 rgmanager[17323]: stop on clusterfs "sharedfs" returned 5 (program not installed) Aug 1 17:44:52 an-a05n02 rgmanager[17323]: #12: RG service:storage_n02 failed to stop; intervention required Aug 1 17:44:52 an-a05n02 rgmanager[17323]: Service service:storage_n02 is failed ==== There are no further messages. ==== root@an-a05n01 ~]# clustat Cluster Status for an-anvil-05 @ Mon Aug 1 17:45:45 2016 Member Status: Quorate Member Name ID Status ------ ---- ---- ------ an-a05n01.alteeve.ca 1 Online, Local, rgmanager an-a05n02.alteeve.ca 2 Online, rgmanager Service Name Owner (Last) State ------- ---- ----- ------ ----- service:libvirtd_n01 an-a05n01.alteeve.ca started service:libvirtd_n02 an-a05n02.alteeve.ca started service:storage_n01 (an-a05n01.alteeve.ca) disabled service:storage_n02 an-a05n02.alteeve.ca starting ==== And the attempt to disable: ==== [root@an-a05n01 ~]# timeout 30 /usr/sbin/clusvcadm -d storage_n02 || echo Timed out Local machine disabling service:storage_n02...Timed out ==== My gut tells me this is somehow related to DLM and/or clvmd. digimer Marking it as 'NEW' because the stability of rgmanager (directly or indirectly) is getting to be a big problem. These problems happen on multiple different systems of different eras over many different fresh installs. We've got an automated cluster build system that we're heavily testing at the moment, which is why we're doing so many reinstalls and various rebuild and failure tests. Ha! This might be useful, I am seeing new messages; ==== [root@an-a05n01 ~]# grep rgmanager /var/log/messages Aug 1 15:20:25 new-node01 yum[16134]: Installed: rgmanager-3.0.12.1-26.el6_8.3.x86_64 Aug 1 16:39:03 an-a05n01 rgmanager[20072]: I am node #1 Aug 1 16:39:03 an-a05n01 rgmanager[20072]: Resource Group Manager Starting Aug 1 16:39:05 an-a05n01 rgmanager[21212]: [clusterfs] stop: Could not match /dev/an-a05n01_vg0/shared with a real device Aug 1 16:39:05 an-a05n01 rgmanager[21266]: [clusterfs] stop: Could not match /dev/an-a05n01_vg0/shared with a real device Aug 1 16:39:06 an-a05n01 rgmanager[20072]: Starting stopped service service:storage_n01 Aug 1 16:39:06 an-a05n01 rgmanager[20072]: Starting stopped service service:libvirtd_n01 Aug 1 16:39:06 an-a05n01 rgmanager[20072]: Service service:libvirtd_n01 started Aug 1 16:39:07 an-a05n01 rgmanager[20072]: #6X: Invalid reply [-5] from member 2 during relocate operation! Aug 1 16:39:08 an-a05n01 rgmanager[20072]: #6X: Invalid reply [-5] from member 2 during relocate operation! Aug 1 16:39:08 an-a05n01 rgmanager[20072]: #6X: Invalid reply [-5] from member 2 during relocate operation! Aug 1 16:39:08 an-a05n01 rgmanager[20072]: #6X: Invalid reply [-5] from member 2 during relocate operation! Aug 1 16:39:08 an-a05n01 rgmanager[20072]: #6X: Invalid reply [-5] from member 2 during relocate operation! Aug 1 16:39:08 an-a05n01 rgmanager[20072]: #6X: Invalid reply [-5] from member 2 during relocate operation! Aug 1 16:39:09 an-a05n01 rgmanager[20072]: #6X: Invalid reply [-5] from member 2 during relocate operation! 
Aug 1 16:39:09 an-a05n01 rgmanager[20072]: #6X: Invalid reply [-5] from member 2 during relocate operation! Aug 1 16:39:09 an-a05n01 rgmanager[20072]: #6X: Invalid reply [-5] from member 2 during relocate operation! Aug 1 16:39:09 an-a05n01 rgmanager[20072]: #6X: Invalid reply [-5] from member 2 during relocate operation! Aug 1 16:39:09 an-a05n01 rgmanager[20072]: Service service:storage_n01 started Aug 1 16:39:10 an-a05n01 rgmanager[20072]: #6X: Invalid reply [-5] from member 2 during relocate operation! Aug 1 16:39:10 an-a05n01 rgmanager[20072]: #6X: Invalid reply [-5] from member 2 during relocate operation! Aug 1 17:03:16 an-a05n01 rgmanager[7788]: [script] script:drbd: status of /etc/init.d/drbd failed (returned 3) Aug 1 17:03:16 an-a05n01 rgmanager[20072]: status on script "drbd" returned 1 (generic error) Aug 1 17:03:16 an-a05n01 rgmanager[20072]: Stopping service service:storage_n01 Aug 1 17:03:16 an-a05n01 rgmanager[7826]: [clusterfs] stop: Could not match /dev/an-a05n01_vg0/shared with a real device Aug 1 17:03:16 an-a05n01 rgmanager[20072]: stop on clusterfs "sharedfs" returned 5 (program not installed) Aug 1 17:03:16 an-a05n01 rgmanager[20072]: #12: RG service:storage_n01 failed to stop; intervention required Aug 1 17:03:16 an-a05n01 rgmanager[20072]: Service service:storage_n01 is failed Aug 1 17:13:58 an-a05n01 rgmanager[20072]: Stopping service service:storage_n01 Aug 1 17:13:58 an-a05n01 rgmanager[13159]: [clusterfs] stop: Could not match /dev/an-a05n01_vg0/shared with a real device Aug 1 17:13:58 an-a05n01 rgmanager[20072]: stop on clusterfs "sharedfs" returned 5 (program not installed) Aug 1 17:13:59 an-a05n01 rgmanager[20072]: Marking service:storage_n01 as 'disabled', but some resources may still be allocated! Aug 1 17:13:59 an-a05n01 rgmanager[20072]: Service service:storage_n01 is disabled Aug 1 17:13:59 an-a05n01 rgmanager[20072]: Stopping service service:storage_n02 Aug 1 17:13:59 an-a05n01 rgmanager[13316]: [clusterfs] stop: Could not match /dev/an-a05n01_vg0/shared with a real device Aug 1 17:13:59 an-a05n01 rgmanager[20072]: Service service:storage_n02 is disabled Aug 1 17:27:20 an-a05n01 rgmanager[20072]: Starting disabled service service:storage_n01 Aug 1 17:27:23 an-a05n01 rgmanager[20072]: Service service:storage_n01 started Aug 1 17:27:23 an-a05n01 rgmanager[20072]: Shutting down Aug 1 17:27:23 an-a05n01 rgmanager[20072]: Stopping service service:storage_n01 Aug 1 17:27:23 an-a05n01 rgmanager[20072]: Stopping service service:libvirtd_n01 Aug 1 17:27:24 an-a05n01 rgmanager[17753]: [script] script:clvmd: stop of /etc/init.d/clvmd failed (returned 5) Aug 1 17:27:24 an-a05n01 rgmanager[20072]: stop on script "clvmd" returned 1 (generic error) Aug 1 17:27:24 an-a05n01 rgmanager[20072]: Service service:libvirtd_n01 is stopped Aug 1 17:27:24 an-a05n01 rgmanager[20072]: #12: RG service:storage_n01 failed to stop; intervention required Aug 1 17:27:24 an-a05n01 rgmanager[20072]: Service service:storage_n01 is failed Aug 1 17:27:24 an-a05n01 rgmanager[20072]: Disconnecting from CMAN Aug 1 17:27:39 an-a05n01 rgmanager[20072]: Exiting Aug 1 17:28:07 an-a05n01 rgmanager[18632]: I am node #1 Aug 1 17:28:07 an-a05n01 rgmanager[18632]: Resource Group Manager Starting Aug 1 17:28:11 an-a05n01 rgmanager[18632]: Marking service:storage_n02 as stopped: Restricted domain unavailable Aug 1 17:28:11 an-a05n01 rgmanager[18632]: Starting stopped service service:storage_n01 Aug 1 17:28:11 an-a05n01 rgmanager[18632]: Starting stopped service service:libvirtd_n01 Aug 1 17:28:11 
an-a05n01 rgmanager[18632]: Service service:libvirtd_n01 started Aug 1 17:28:13 an-a05n01 rgmanager[18632]: Service service:storage_n01 started Aug 1 17:28:13 an-a05n01 rgmanager[18632]: #6X: Invalid reply [-5] from member 2 during relocate operation! Aug 1 17:28:14 an-a05n01 rgmanager[18632]: #6X: Invalid reply [-5] from member 2 during relocate operation! Aug 1 17:28:14 an-a05n01 rgmanager[18632]: #6X: Invalid reply [-5] from member 2 during relocate operation! Aug 1 17:28:14 an-a05n01 rgmanager[18632]: #6X: Invalid reply [-5] from member 2 during relocate operation! Aug 1 17:28:14 an-a05n01 rgmanager[18632]: #6X: Invalid reply [-5] from member 2 during relocate operation! Aug 1 17:28:14 an-a05n01 rgmanager[18632]: #6X: Invalid reply [-5] from member 2 during relocate operation! Aug 1 17:28:15 an-a05n01 rgmanager[18632]: #6X: Invalid reply [-5] from member 2 during relocate operation! Aug 1 17:28:15 an-a05n01 rgmanager[18632]: #6X: Invalid reply [-5] from member 2 during relocate operation! Aug 1 17:28:15 an-a05n01 rgmanager[18632]: #6X: Invalid reply [-5] from member 2 during relocate operation! Aug 1 17:44:51 an-a05n01 rgmanager[32419]: [script] script:clvmd: status of /etc/init.d/clvmd failed (returned 3) Aug 1 17:44:51 an-a05n01 rgmanager[18632]: status on script "clvmd" returned 1 (generic error) Aug 1 17:44:51 an-a05n01 rgmanager[18632]: Stopping service service:storage_n01 Aug 1 17:44:51 an-a05n01 rgmanager[32456]: [clusterfs] stop: Could not match /dev/an-a05n01_vg0/shared with a real device Aug 1 17:44:51 an-a05n01 rgmanager[18632]: stop on clusterfs "sharedfs" returned 5 (program not installed) Aug 1 17:44:52 an-a05n01 rgmanager[18632]: #12: RG service:storage_n01 failed to stop; intervention required Aug 1 17:44:52 an-a05n01 rgmanager[18632]: Service service:storage_n01 is failed Aug 1 17:44:52 an-a05n01 rgmanager[18632]: Stopping service service:storage_n01 Aug 1 17:44:52 an-a05n01 rgmanager[32726]: [clusterfs] stop: Could not match /dev/an-a05n01_vg0/shared with a real device Aug 1 17:44:52 an-a05n01 rgmanager[18632]: stop on clusterfs "sharedfs" returned 5 (program not installed) Aug 1 17:44:52 an-a05n01 rgmanager[18632]: Marking service:storage_n01 as 'disabled', but some resources may still be allocated! Aug 1 17:44:53 an-a05n01 rgmanager[18632]: Service service:storage_n01 is disabled Aug 1 17:44:55 an-a05n01 rgmanager[18632]: Stopping service service:storage_n02 Aug 1 17:44:55 an-a05n01 rgmanager[545]: [clusterfs] stop: Could not match /dev/an-a05n01_vg0/shared with a real device Aug 1 17:44:55 an-a05n01 rgmanager[18632]: Service service:storage_n02 is disabled Aug 1 17:59:46 an-a05n01 rgmanager[18632]: Shutting down Aug 1 17:59:46 an-a05n01 rgmanager[18632]: Stopping service service:libvirtd_n01 Aug 1 17:59:46 an-a05n01 rgmanager[18632]: Service service:libvirtd_n01 is stopped Aug 1 17:59:55 an-a05n01 rgmanager[18632]: #XX: Cancelling relocation: Shutting down Aug 1 17:59:55 an-a05n01 rgmanager[18632]: #6X: Invalid reply [2] from member 2 during relocate operation! Aug 1 18:00:05 an-a05n01 rgmanager[18632]: #XX: Cancelling relocation: Shutting down Aug 1 18:00:05 an-a05n01 rgmanager[18632]: #6X: Invalid reply [2] from member 2 during relocate operation! Aug 1 18:00:16 an-a05n01 rgmanager[18632]: #XX: Cancelling relocation: Shutting down Aug 1 18:00:16 an-a05n01 rgmanager[18632]: #6X: Invalid reply [2] from member 2 during relocate operation! 
Aug 1 18:00:26 an-a05n01 rgmanager[18632]: #XX: Cancelling relocation: Shutting down Aug 1 18:00:26 an-a05n01 rgmanager[18632]: #6X: Invalid reply [2] from member 2 during relocate operation! Aug 1 18:00:36 an-a05n01 rgmanager[18632]: #XX: Cancelling relocation: Shutting down Aug 1 18:00:36 an-a05n01 rgmanager[18632]: #6X: Invalid reply [2] from member 2 during relocate operation! ==== This is on the node that I pasted output for in #c8. After posting, I tried to restart the rgmanager service, which also hung like the attempt to restart the storage service did. In case it helps, here is what 'ps' shows on the effected node (slightly trimmed to remove unrelated ps processes): ==== [root@an-a05n01 ~]# ps auxf |grep rgmanager -B 10 -A 10 root 2812 0.0 0.0 66240 1248 ? Ss 15:34 0:00 /usr/sbin/sshd root 8419 0.0 0.0 100016 4088 ? Ss 15:59 0:00 \_ sshd: root@pts/1 root 8424 0.0 0.0 110160 3848 pts/1 Ss 15:59 0:00 | \_ -bash root 5616 0.0 0.0 106472 1832 pts/1 S+ 17:59 0:00 | \_ /bin/bash /etc/init.d/rgmanager restart root 5621 0.1 0.0 106476 1880 pts/1 S+ 17:59 0:00 | \_ /bin/bash /etc/init.d/rgmanager stop root 6771 0.0 0.0 100920 624 pts/1 S+ 18:03 0:00 | \_ sleep 1 root 28730 0.0 0.0 100012 4044 ? Ss 17:42 0:00 \_ sshd: root@notty root 519 0.0 0.0 106076 1496 ? Ss 17:44 0:00 | \_ bash -c /usr/sbin/clusvcadm -d storage_n02 && /bin/sleep 10 && /usr/sbin/clusvcadm -F -e storage_n02 root 814 0.0 0.0 10404 688 ? S 17:45 0:00 | \_ /usr/sbin/clusvcadm -F -e storage_n02 root 5882 0.0 0.0 100016 4104 ? Ss 18:00 0:00 \_ sshd: root@pts/0 root 5885 0.0 0.0 110160 3812 pts/0 Ss 18:00 0:00 \_ -bash root 6772 0.0 0.0 110680 1596 pts/0 R+ 18:03 0:00 \_ ps auxf root 6773 0.0 0.0 103316 880 pts/0 S+ 18:03 0:00 \_ grep rgmanager -B 10 -A 10 -- root 18631 0.0 0.0 36952 6256 ? S<Ls 17:28 0:00 rgmanager root 18632 0.0 0.0 570640 3052 ? S<l 17:28 0:01 \_ rgmanager ==== Sorry for the comment spam... I'm adding notes as I try to fix things. I'm hoping "too much data" is the right side to err on. This comment alone is pretty long, but I wanted to document the entire recovery procedure. So I tried killing the pids related to rgmanager: ==== [root@an-a05n01 ~]# ps aux | grep rgmanager root 5616 0.0 0.0 106472 1832 pts/1 S+ 17:59 0:00 /bin/bash /etc/init.d/rgmanager restart root 5621 0.1 0.0 106476 1896 pts/1 S+ 17:59 0:00 /bin/bash /etc/init.d/rgmanager stop root 7516 0.0 0.0 103316 904 pts/0 S+ 18:06 0:00 grep rgmanager root 18631 0.0 0.0 36952 6256 ? S<Ls 17:28 0:00 rgmanager root 18632 0.0 0.0 580884 3060 ? S<l 17:28 0:01 rgmanager [root@an-a05n01 ~]# kill 5616 5621 18631 18632 [root@an-a05n01 ~]# grep rgmanager /var/log/messages ==== trimmed output: ==== Aug 1 18:00:36 an-a05n01 rgmanager[18632]: #6X: Invalid reply [2] from member 2 during relocate operation! Aug 1 18:00:46 an-a05n01 rgmanager[18632]: #XX: Cancelling relocation: Shutting down Aug 1 18:00:46 an-a05n01 rgmanager[18632]: #6X: Invalid reply [2] from member 2 during relocate operation! Aug 1 18:00:56 an-a05n01 rgmanager[18632]: #XX: Cancelling relocation: Shutting down Aug 1 18:00:56 an-a05n01 rgmanager[18632]: #6X: Invalid reply [2] from member 2 during relocate operation! Aug 1 18:01:07 an-a05n01 rgmanager[18632]: #XX: Cancelling relocation: Shutting down Aug 1 18:01:07 an-a05n01 rgmanager[18632]: #6X: Invalid reply [2] from member 2 during relocate operation! Aug 1 18:01:17 an-a05n01 rgmanager[18632]: #XX: Cancelling relocation: Shutting down Aug 1 18:01:17 an-a05n01 rgmanager[18632]: #6X: Invalid reply [2] from member 2 during relocate operation! 
Aug 1 18:01:27 an-a05n01 rgmanager[18632]: #XX: Cancelling relocation: Shutting down Aug 1 18:01:27 an-a05n01 rgmanager[18632]: #6X: Invalid reply [2] from member 2 during relocate operation! Aug 1 18:01:37 an-a05n01 rgmanager[18632]: #XX: Cancelling relocation: Shutting down Aug 1 18:01:37 an-a05n01 rgmanager[18632]: #6X: Invalid reply [2] from member 2 during relocate operation! Aug 1 18:01:47 an-a05n01 rgmanager[18632]: #XX: Cancelling relocation: Shutting down Aug 1 18:01:47 an-a05n01 rgmanager[18632]: #6X: Invalid reply [2] from member 2 during relocate operation! Aug 1 18:01:58 an-a05n01 rgmanager[18632]: #XX: Cancelling relocation: Shutting down Aug 1 18:01:58 an-a05n01 rgmanager[18632]: #6X: Invalid reply [2] from member 2 during relocate operation! Aug 1 18:02:08 an-a05n01 rgmanager[18632]: #XX: Cancelling relocation: Shutting down Aug 1 18:02:08 an-a05n01 rgmanager[18632]: #6X: Invalid reply [2] from member 2 during relocate operation! Aug 1 18:02:18 an-a05n01 rgmanager[18632]: #XX: Cancelling relocation: Shutting down Aug 1 18:02:18 an-a05n01 rgmanager[18632]: #6X: Invalid reply [2] from member 2 during relocate operation! Aug 1 18:02:28 an-a05n01 rgmanager[18632]: #XX: Cancelling relocation: Shutting down Aug 1 18:02:28 an-a05n01 rgmanager[18632]: #6X: Invalid reply [2] from member 2 during relocate operation! Aug 1 18:02:38 an-a05n01 rgmanager[18632]: #XX: Cancelling relocation: Shutting down Aug 1 18:02:38 an-a05n01 rgmanager[18632]: #6X: Invalid reply [2] from member 2 during relocate operation! Aug 1 18:02:49 an-a05n01 rgmanager[18632]: #XX: Cancelling relocation: Shutting down Aug 1 18:02:49 an-a05n01 rgmanager[18632]: #6X: Invalid reply [2] from member 2 during relocate operation! Aug 1 18:02:59 an-a05n01 rgmanager[18632]: #XX: Cancelling relocation: Shutting down Aug 1 18:02:59 an-a05n01 rgmanager[18632]: #6X: Invalid reply [2] from member 2 during relocate operation! Aug 1 18:03:09 an-a05n01 rgmanager[18632]: #XX: Cancelling relocation: Shutting down Aug 1 18:03:09 an-a05n01 rgmanager[18632]: #6X: Invalid reply [2] from member 2 during relocate operation! Aug 1 18:03:19 an-a05n01 rgmanager[18632]: #XX: Cancelling relocation: Shutting down Aug 1 18:03:19 an-a05n01 rgmanager[18632]: #6X: Invalid reply [2] from member 2 during relocate operation! Aug 1 18:03:30 an-a05n01 rgmanager[18632]: #XX: Cancelling relocation: Shutting down Aug 1 18:03:30 an-a05n01 rgmanager[18632]: #6X: Invalid reply [2] from member 2 during relocate operation! Aug 1 18:03:40 an-a05n01 rgmanager[18632]: #XX: Cancelling relocation: Shutting down Aug 1 18:03:40 an-a05n01 rgmanager[18632]: #6X: Invalid reply [2] from member 2 during relocate operation! Aug 1 18:03:50 an-a05n01 rgmanager[18632]: #XX: Cancelling relocation: Shutting down Aug 1 18:03:50 an-a05n01 rgmanager[18632]: #6X: Invalid reply [2] from member 2 during relocate operation! Aug 1 18:04:00 an-a05n01 rgmanager[18632]: #XX: Cancelling relocation: Shutting down Aug 1 18:04:00 an-a05n01 rgmanager[18632]: #6X: Invalid reply [2] from member 2 during relocate operation! Aug 1 18:04:10 an-a05n01 rgmanager[18632]: #XX: Cancelling relocation: Shutting down Aug 1 18:04:10 an-a05n01 rgmanager[18632]: #6X: Invalid reply [2] from member 2 during relocate operation! Aug 1 18:04:21 an-a05n01 rgmanager[18632]: #XX: Cancelling relocation: Shutting down Aug 1 18:04:21 an-a05n01 rgmanager[18632]: #6X: Invalid reply [2] from member 2 during relocate operation! 
Aug 1 18:04:31 an-a05n01 rgmanager[18632]: #XX: Cancelling relocation: Shutting down Aug 1 18:04:31 an-a05n01 rgmanager[18632]: #6X: Invalid reply [2] from member 2 during relocate operation! Aug 1 18:04:41 an-a05n01 rgmanager[18632]: #XX: Cancelling relocation: Shutting down Aug 1 18:04:41 an-a05n01 rgmanager[18632]: #6X: Invalid reply [2] from member 2 during relocate operation! Aug 1 18:04:51 an-a05n01 rgmanager[18632]: #XX: Cancelling relocation: Shutting down Aug 1 18:04:51 an-a05n01 rgmanager[18632]: #6X: Invalid reply [2] from member 2 during relocate operation! Aug 1 18:05:01 an-a05n01 rgmanager[18632]: #XX: Cancelling relocation: Shutting down Aug 1 18:05:01 an-a05n01 rgmanager[18632]: #6X: Invalid reply [2] from member 2 during relocate operation! Aug 1 18:05:12 an-a05n01 rgmanager[18632]: #XX: Cancelling relocation: Shutting down Aug 1 18:05:12 an-a05n01 rgmanager[18632]: #6X: Invalid reply [2] from member 2 during relocate operation! Aug 1 18:05:22 an-a05n01 rgmanager[18632]: #XX: Cancelling relocation: Shutting down Aug 1 18:05:22 an-a05n01 rgmanager[18632]: #6X: Invalid reply [2] from member 2 during relocate operation! Aug 1 18:05:32 an-a05n01 rgmanager[18632]: #XX: Cancelling relocation: Shutting down Aug 1 18:05:32 an-a05n01 rgmanager[18632]: #6X: Invalid reply [2] from member 2 during relocate operation! Aug 1 18:05:42 an-a05n01 rgmanager[18632]: #XX: Cancelling relocation: Shutting down Aug 1 18:05:42 an-a05n01 rgmanager[18632]: #6X: Invalid reply [2] from member 2 during relocate operation! Aug 1 18:05:51 an-a05n01 rgmanager[18632]: Shutting down Aug 1 18:05:52 an-a05n01 rgmanager[18632]: #XX: Cancelling relocation: Shutting down Aug 1 18:05:52 an-a05n01 rgmanager[18632]: #6X: Invalid reply [2] from member 2 during relocate operation! Aug 1 18:06:01 an-a05n01 rgmanager[18632]: Shutting down Aug 1 18:06:03 an-a05n01 rgmanager[18632]: #XX: Cancelling relocation: Shutting down Aug 1 18:06:03 an-a05n01 rgmanager[18632]: #6X: Invalid reply [2] from member 2 during relocate operation! Aug 1 18:06:13 an-a05n01 rgmanager[18632]: #XX: Cancelling relocation: Shutting down Aug 1 18:06:13 an-a05n01 rgmanager[18632]: #6X: Invalid reply [2] from member 2 during relocate operation! Aug 1 18:06:23 an-a05n01 rgmanager[18632]: #XX: Cancelling relocation: Shutting down Aug 1 18:06:23 an-a05n01 rgmanager[18632]: #6X: Invalid reply [2] from member 2 during relocate operation! Aug 1 18:06:23 an-a05n01 rgmanager[18632]: Shutting down ==== Now ps shows two instances of rgmanager still, despite the kill. ==== [root@an-a05n01 ~]# ps auxf | grep rgmanager root 8075 0.0 0.0 103316 868 pts/0 S+ 18:09 0:00 \_ grep rgmanager root 18631 0.0 0.0 36952 6256 ? S<Ls 17:28 0:00 rgmanager root 18632 0.0 0.0 797980 3104 ? S<l 17:28 0:01 \_ rgmanager ==== Shutting down still hangs though. So I checked, and there are hung clusvcadm calls; ==== [root@an-a05n01 ~]# ps aux | grep -e rgmanager -e clusvcadm root 519 0.0 0.0 106076 1496 ? Ss 17:44 0:00 bash -c /usr/sbin/clusvcadm -d storage_n02 && /bin/sleep 10 && /usr/sbin/clusvcadm -F -e storage_n02 root 814 0.0 0.0 10404 688 ? S 17:45 0:00 /usr/sbin/clusvcadm -F -e storage_n02 root 8341 0.0 0.0 103320 956 pts/0 S+ 18:10 0:00 grep -e rgmanager -e clusvcadm root 18631 0.0 0.0 36952 6256 ? S<Ls 17:28 0:00 rgmanager root 18632 0.0 0.0 949540 5168 ? S<l 17:28 0:01 rgmanager ==== So I tried to kill them as well, but no luck: ==== [root@an-a05n01 ~]# kill 519 814 18631 18632 [root@an-a05n01 ~]# ps aux | grep -e rgmanager -e clusvcadm root 8502 0.5 0.0 106076 1500 ? 
Ss 18:11 0:00 bash -c /etc/init.d/rgmanager stop; /bin/echo rc:$? root 8509 0.5 0.0 106476 1868 ? S 18:11 0:00 /bin/bash /etc/init.d/rgmanager stop root 8523 0.0 0.0 103316 936 pts/0 S+ 18:11 0:00 grep -e rgmanager -e clusvcadm root 18631 0.0 0.0 36952 6256 ? S<Ls 17:28 0:00 rgmanager root 18632 0.0 0.0 1035564 5188 ? S<l 17:28 0:01 rgmanager ==== So I try the dreaded kill -9; ==== [root@an-a05n01 ~]# kill -9 519 814 18631 18632 -bash: kill: (519) - No such process -bash: kill: (814) - No such process [root@an-a05n01 ~]# ps aux | grep -e rgmanager -e clusvcadm root 8771 0.0 0.0 103316 936 pts/0 S+ 18:12 0:00 grep -e rgmanager -e clusvcadm ==== And now it is dead. ==== Aug 1 18:06:23 an-a05n01 rgmanager[18632]: Shutting down Aug 1 18:06:33 an-a05n01 rgmanager[18632]: #XX: Cancelling relocation: Shutting down Aug 1 18:06:33 an-a05n01 rgmanager[18632]: #6X: Invalid reply [2] from member 2 during relocate operation! Aug 1 18:06:33 an-a05n01 rgmanager[18632]: Shutting down Aug 1 18:06:43 an-a05n01 rgmanager[18632]: #XX: Cancelling relocation: Shutting down Aug 1 18:06:43 an-a05n01 rgmanager[18632]: #6X: Invalid reply [2] from member 2 during relocate operation! Aug 1 18:06:54 an-a05n01 rgmanager[18632]: #XX: Cancelling relocation: Shutting down Aug 1 18:06:54 an-a05n01 rgmanager[18632]: #6X: Invalid reply [2] from member 2 during relocate operation! Aug 1 18:07:04 an-a05n01 rgmanager[18632]: #XX: Cancelling relocation: Shutting down Aug 1 18:07:04 an-a05n01 rgmanager[18632]: #6X: Invalid reply [2] from member 2 during relocate operation! Aug 1 18:07:14 an-a05n01 rgmanager[18632]: #XX: Cancelling relocation: Shutting down Aug 1 18:07:14 an-a05n01 rgmanager[18632]: #6X: Invalid reply [2] from member 2 during relocate operation! Aug 1 18:07:24 an-a05n01 rgmanager[18632]: #XX: Cancelling relocation: Shutting down Aug 1 18:07:24 an-a05n01 rgmanager[18632]: #6X: Invalid reply [2] from member 2 during relocate operation! Aug 1 18:07:35 an-a05n01 rgmanager[18632]: #XX: Cancelling relocation: Shutting down Aug 1 18:07:35 an-a05n01 rgmanager[18632]: #6X: Invalid reply [2] from member 2 during relocate operation! Aug 1 18:07:45 an-a05n01 rgmanager[18632]: #XX: Cancelling relocation: Shutting down Aug 1 18:07:45 an-a05n01 rgmanager[18632]: #6X: Invalid reply [2] from member 2 during relocate operation! Aug 1 18:07:55 an-a05n01 rgmanager[18632]: #XX: Cancelling relocation: Shutting down Aug 1 18:07:55 an-a05n01 rgmanager[18632]: #6X: Invalid reply [2] from member 2 during relocate operation! Aug 1 18:08:05 an-a05n01 rgmanager[18632]: #XX: Cancelling relocation: Shutting down Aug 1 18:08:05 an-a05n01 rgmanager[18632]: #6X: Invalid reply [2] from member 2 during relocate operation! Aug 1 18:08:15 an-a05n01 rgmanager[18632]: #XX: Cancelling relocation: Shutting down Aug 1 18:08:15 an-a05n01 rgmanager[18632]: #6X: Invalid reply [2] from member 2 during relocate operation! Aug 1 18:08:26 an-a05n01 rgmanager[18632]: #XX: Cancelling relocation: Shutting down Aug 1 18:08:26 an-a05n01 rgmanager[18632]: #6X: Invalid reply [2] from member 2 during relocate operation! Aug 1 18:08:36 an-a05n01 rgmanager[18632]: #XX: Cancelling relocation: Shutting down Aug 1 18:08:36 an-a05n01 rgmanager[18632]: #6X: Invalid reply [2] from member 2 during relocate operation! Aug 1 18:08:46 an-a05n01 rgmanager[18632]: #XX: Cancelling relocation: Shutting down Aug 1 18:08:46 an-a05n01 rgmanager[18632]: #6X: Invalid reply [2] from member 2 during relocate operation! 
Aug 1 18:08:56 an-a05n01 rgmanager[18632]: #XX: Cancelling relocation: Shutting down Aug 1 18:08:56 an-a05n01 rgmanager[18632]: #6X: Invalid reply [2] from member 2 during relocate operation! Aug 1 18:09:06 an-a05n01 rgmanager[18632]: #XX: Cancelling relocation: Shutting down Aug 1 18:09:06 an-a05n01 rgmanager[18632]: #6X: Invalid reply [2] from member 2 during relocate operation! Aug 1 18:09:17 an-a05n01 rgmanager[18632]: #XX: Cancelling relocation: Shutting down Aug 1 18:09:17 an-a05n01 rgmanager[18632]: #6X: Invalid reply [2] from member 2 during relocate operation! Aug 1 18:09:27 an-a05n01 rgmanager[18632]: #XX: Cancelling relocation: Shutting down Aug 1 18:09:27 an-a05n01 rgmanager[18632]: #6X: Invalid reply [2] from member 2 during relocate operation! Aug 1 18:09:37 an-a05n01 rgmanager[18632]: #XX: Cancelling relocation: Shutting down Aug 1 18:09:37 an-a05n01 rgmanager[18632]: #6X: Invalid reply [2] from member 2 during relocate operation! Aug 1 18:09:44 an-a05n01 rgmanager[18632]: Shutting down Aug 1 18:09:44 an-a05n01 rgmanager[18632]: Shutting down Aug 1 18:09:47 an-a05n01 rgmanager[18632]: #XX: Cancelling relocation: Shutting down Aug 1 18:09:47 an-a05n01 rgmanager[18632]: #6X: Invalid reply [2] from member 2 during relocate operation! Aug 1 18:09:57 an-a05n01 rgmanager[18632]: #XX: Cancelling relocation: Shutting down Aug 1 18:09:57 an-a05n01 rgmanager[18632]: #6X: Invalid reply [2] from member 2 during relocate operation! Aug 1 18:10:07 an-a05n01 rgmanager[18632]: #XX: Cancelling relocation: Shutting down Aug 1 18:10:07 an-a05n01 rgmanager[18632]: #6X: Invalid reply [2] from member 2 during relocate operation! Aug 1 18:10:18 an-a05n01 rgmanager[18632]: #XX: Cancelling relocation: Shutting down Aug 1 18:10:18 an-a05n01 rgmanager[18632]: #6X: Invalid reply [2] from member 2 during relocate operation! Aug 1 18:10:28 an-a05n01 rgmanager[18632]: #XX: Cancelling relocation: Shutting down Aug 1 18:10:28 an-a05n01 rgmanager[18632]: #6X: Invalid reply [2] from member 2 during relocate operation! Aug 1 18:10:38 an-a05n01 rgmanager[18632]: #XX: Cancelling relocation: Shutting down Aug 1 18:10:38 an-a05n01 rgmanager[18632]: #6X: Invalid reply [2] from member 2 during relocate operation! Aug 1 18:10:48 an-a05n01 rgmanager[18632]: #XX: Cancelling relocation: Shutting down Aug 1 18:10:48 an-a05n01 rgmanager[18632]: #6X: Invalid reply [2] from member 2 during relocate operation! Aug 1 18:10:58 an-a05n01 rgmanager[18632]: #XX: Cancelling relocation: Shutting down Aug 1 18:10:58 an-a05n01 rgmanager[18632]: #6X: Invalid reply [2] from member 2 during relocate operation! Aug 1 18:11:09 an-a05n01 rgmanager[18632]: #XX: Cancelling relocation: Shutting down Aug 1 18:11:09 an-a05n01 rgmanager[18632]: #6X: Invalid reply [2] from member 2 during relocate operation! Aug 1 18:11:19 an-a05n01 rgmanager[18632]: #XX: Cancelling relocation: Shutting down Aug 1 18:11:19 an-a05n01 rgmanager[18632]: #6X: Invalid reply [2] from member 2 during relocate operation! Aug 1 18:11:29 an-a05n01 rgmanager[18632]: #XX: Cancelling relocation: Shutting down Aug 1 18:11:29 an-a05n01 rgmanager[18632]: #6X: Invalid reply [2] from member 2 during relocate operation! Aug 1 18:11:29 an-a05n01 rgmanager[18632]: Shutting down Aug 1 18:11:30 an-a05n01 rgmanager[18632]: Shutting down Aug 1 18:11:39 an-a05n01 rgmanager[18632]: #XX: Cancelling relocation: Shutting down Aug 1 18:11:39 an-a05n01 rgmanager[18632]: #6X: Invalid reply [2] from member 2 during relocate operation! 
Aug 1 18:11:40 an-a05n01 rgmanager[18632]: Shutting down Aug 1 18:11:49 an-a05n01 rgmanager[18632]: #XX: Cancelling relocation: Shutting down Aug 1 18:11:49 an-a05n01 rgmanager[18632]: #6X: Invalid reply [2] from member 2 during relocate operation! Aug 1 18:12:00 an-a05n01 rgmanager[18632]: #XX: Cancelling relocation: Shutting down Aug 1 18:12:00 an-a05n01 rgmanager[18632]: #6X: Invalid reply [2] from member 2 during relocate operation! Aug 1 18:12:10 an-a05n01 rgmanager[18632]: #XX: Cancelling relocation: Shutting down Aug 1 18:12:10 an-a05n01 rgmanager[18632]: #6X: Invalid reply [2] from member 2 during relocate operation! ==== Now, 'clustat' shows nothing on node 2, which might be because it's storage had failed earlier so a hung stop operation probably finally exited. ==== [root@an-a05n01 ~]# clustat Cluster Status for an-anvil-05 @ Mon Aug 1 18:15:31 2016 Member Status: Quorate Member Name ID Status ------ ---- ---- ------ an-a05n01.alteeve.ca 1 Online, Local an-a05n02.alteeve.ca 2 Online [root@an-a05n01 ~]# /etc/init.d/drbd status drbd not loaded [root@an-a05n01 ~]# /etc/init.d/clvmd status clvmd is stopped [root@an-a05n01 ~]# /etc/init.d/gfs2 status GFS2: service is not running ---- [root@an-a05n02 ~]# clustat Cluster Status for an-anvil-05 @ Mon Aug 1 18:15:33 2016 Member Status: Quorate Member Name ID Status ------ ---- ---- ------ an-a05n01.alteeve.ca 1 Online an-a05n02.alteeve.ca 2 Online, Local [root@an-a05n02 ~]# /etc/init.d/drbd status drbd driver loaded OK; device status: version: 8.4.8-1 (api:1/proto:86-101) GIT-hash: 22b4c802192646e433d3f7399d578ec7fecc6272 build by root.ca, 2016-07-18 15:13:48 m:res cs ro ds p mounted fstype 0:r0 WFConnection Secondary/Unknown Inconsistent/DUnknown C [root@an-a05n02 ~]# /etc/init.d/clvmd status clvmd is stopped [root@an-a05n02 ~]# /etc/init.d/gfs2 status GFS2: service is not running ==== Here are the rgmanager logs from node 2: ==== [root@an-a05n02 ~]# grep rgmanager /var/log/messages Aug 1 15:24:23 new-node02 yum[15984]: Installed: rgmanager-3.0.12.1-26.el6_8.3.x86_64 Aug 1 16:39:03 an-a05n02 rgmanager[19470]: I am node #2 Aug 1 16:39:03 an-a05n02 rgmanager[19470]: Resource Group Manager Starting Aug 1 16:39:05 an-a05n02 rgmanager[20577]: [clusterfs] stop: Could not match /dev/an-a05n01_vg0/shared with a real device Aug 1 16:39:06 an-a05n02 rgmanager[20630]: [clusterfs] stop: Could not match /dev/an-a05n01_vg0/shared with a real device Aug 1 16:39:06 an-a05n02 rgmanager[19470]: Starting stopped service service:storage_n02 Aug 1 16:39:07 an-a05n02 rgmanager[19470]: Starting stopped service service:libvirtd_n02 Aug 1 16:39:07 an-a05n02 rgmanager[19470]: Ignoring M_CLOSE for destroyed context 9 Aug 1 16:39:07 an-a05n02 rgmanager[19470]: Service service:libvirtd_n02 started Aug 1 16:39:08 an-a05n02 rgmanager[19470]: Ignoring M_CLOSE for destroyed context 13 Aug 1 16:39:08 an-a05n02 rgmanager[19470]: Ignoring M_CLOSE for destroyed context 14 Aug 1 16:39:08 an-a05n02 rgmanager[19470]: Ignoring M_CLOSE for destroyed context 15 Aug 1 16:39:08 an-a05n02 rgmanager[19470]: Ignoring M_CLOSE for destroyed context 16 Aug 1 16:39:08 an-a05n02 rgmanager[19470]: Ignoring M_CLOSE for destroyed context 17 Aug 1 16:39:09 an-a05n02 rgmanager[19470]: Ignoring M_CLOSE for destroyed context 18 Aug 1 16:39:09 an-a05n02 rgmanager[19470]: Ignoring M_CLOSE for destroyed context 19 Aug 1 16:39:09 an-a05n02 rgmanager[19470]: Ignoring M_CLOSE for destroyed context 20 Aug 1 16:39:09 an-a05n02 rgmanager[19470]: Ignoring M_CLOSE for destroyed context 21 Aug 1 16:39:10 
an-a05n02 rgmanager[19470]: Ignoring M_CLOSE for destroyed context 22 Aug 1 16:39:10 an-a05n02 rgmanager[19470]: Ignoring M_CLOSE for destroyed context 24 Aug 1 16:39:10 an-a05n02 rgmanager[19470]: Service service:storage_n02 started Aug 1 17:03:17 an-a05n02 rgmanager[6759]: [script] script:drbd: status of /etc/init.d/drbd failed (returned 3) Aug 1 17:03:17 an-a05n02 rgmanager[19470]: status on script "drbd" returned 1 (generic error) Aug 1 17:03:17 an-a05n02 rgmanager[19470]: Stopping service service:storage_n02 Aug 1 17:03:17 an-a05n02 rgmanager[6796]: [clusterfs] stop: Could not match /dev/an-a05n01_vg0/shared with a real device Aug 1 17:03:17 an-a05n02 rgmanager[19470]: stop on clusterfs "sharedfs" returned 5 (program not installed) Aug 1 17:03:17 an-a05n02 rgmanager[19470]: #12: RG service:storage_n02 failed to stop; intervention required Aug 1 17:03:17 an-a05n02 rgmanager[19470]: Service service:storage_n02 is failed Aug 1 17:27:23 an-a05n02 rgmanager[19470]: Service service:storage_n02 started Aug 1 17:27:24 an-a05n02 rgmanager[19470]: Member 1 shutting down Aug 1 17:27:24 an-a05n02 rgmanager[19470]: Marking service:libvirtd_n01 as stopped: Restricted domain unavailable Aug 1 17:27:39 an-a05n02 rgmanager[19470]: Shutting down Aug 1 17:27:39 an-a05n02 rgmanager[19470]: Stopping service service:storage_n02 Aug 1 17:27:40 an-a05n02 rgmanager[19470]: Stopping service service:libvirtd_n02 Aug 1 17:27:40 an-a05n02 rgmanager[16465]: [script] script:clvmd: stop of /etc/init.d/clvmd failed (returned 5) Aug 1 17:27:40 an-a05n02 rgmanager[19470]: stop on script "clvmd" returned 1 (generic error) Aug 1 17:27:40 an-a05n02 rgmanager[19470]: Service service:libvirtd_n02 is stopped Aug 1 17:27:40 an-a05n02 rgmanager[19470]: #12: RG service:storage_n02 failed to stop; intervention required Aug 1 17:27:40 an-a05n02 rgmanager[19470]: Service service:storage_n02 is failed Aug 1 17:27:40 an-a05n02 rgmanager[19470]: Disconnecting from CMAN Aug 1 17:27:55 an-a05n02 rgmanager[19470]: Exiting Aug 1 17:28:07 an-a05n02 rgmanager[17323]: I am node #2 Aug 1 17:28:07 an-a05n02 rgmanager[17323]: Resource Group Manager Starting Aug 1 17:28:12 an-a05n02 rgmanager[17323]: Starting stopped service service:libvirtd_n02 Aug 1 17:28:12 an-a05n02 rgmanager[17323]: Starting stopped service service:storage_n02 Aug 1 17:28:12 an-a05n02 rgmanager[17323]: Service service:libvirtd_n02 started Aug 1 17:28:13 an-a05n02 rgmanager[17323]: Ignoring M_CLOSE for destroyed context 12 Aug 1 17:28:14 an-a05n02 rgmanager[17323]: Ignoring M_CLOSE for destroyed context 13 Aug 1 17:28:14 an-a05n02 rgmanager[17323]: Ignoring M_CLOSE for destroyed context 14 Aug 1 17:28:14 an-a05n02 rgmanager[17323]: Ignoring M_CLOSE for destroyed context 15 Aug 1 17:28:14 an-a05n02 rgmanager[17323]: Ignoring M_CLOSE for destroyed context 16 Aug 1 17:28:14 an-a05n02 rgmanager[17323]: Ignoring M_CLOSE for destroyed context 17 Aug 1 17:28:14 an-a05n02 rgmanager[17323]: Ignoring M_CLOSE for destroyed context 18 Aug 1 17:28:15 an-a05n02 rgmanager[17323]: Ignoring M_CLOSE for destroyed context 19 Aug 1 17:28:15 an-a05n02 rgmanager[17323]: Ignoring M_CLOSE for destroyed context 20 Aug 1 17:28:15 an-a05n02 rgmanager[17323]: Service service:storage_n02 started Aug 1 17:44:52 an-a05n02 rgmanager[30999]: [script] script:clvmd: status of /etc/init.d/clvmd failed (returned 3) Aug 1 17:44:52 an-a05n02 rgmanager[17323]: status on script "clvmd" returned 1 (generic error) Aug 1 17:44:52 an-a05n02 rgmanager[17323]: Stopping service service:storage_n02 Aug 1 17:44:52 
an-a05n02 rgmanager[31036]: [clusterfs] stop: Could not match /dev/an-a05n01_vg0/shared with a real device Aug 1 17:44:52 an-a05n02 rgmanager[17323]: stop on clusterfs "sharedfs" returned 5 (program not installed) Aug 1 17:44:52 an-a05n02 rgmanager[17323]: #12: RG service:storage_n02 failed to stop; intervention required Aug 1 17:44:52 an-a05n02 rgmanager[17323]: Service service:storage_n02 is failed Aug 1 18:12:17 an-a05n02 rgmanager[17323]: Marking service:libvirtd_n01 as stopped: Restricted domain unavailable Aug 1 18:12:17 an-a05n02 rgmanager[17323]: Shutting down Aug 1 18:12:17 an-a05n02 rgmanager[17323]: Stopping service service:libvirtd_n02 Aug 1 18:12:17 an-a05n02 rgmanager[17323]: Service service:libvirtd_n02 is stopped ==== So I tried to start rgmanager on both nodes. It started on node 1, but hung on node 2... ==== [root@an-a05n01 ~]# /etc/init.d/rgmanager start Starting Cluster Service Manager: [ OK ] [root@an-a05n01 ~]# clustat Cluster Status for an-anvil-05 @ Mon Aug 1 18:17:31 2016 Member Status: Quorate Member Name ID Status ------ ---- ---- ------ an-a05n01.alteeve.ca 1 Online, Local, rgmanager an-a05n02.alteeve.ca 2 Online, rgmanager Service Name Owner (Last) State ------- ---- ----- ------ ----- service:libvirtd_n01 an-a05n01.alteeve.ca started service:libvirtd_n02 (an-a05n02.alteeve.ca) stopped service:storage_n01 (an-a05n01.alteeve.ca) disabled service:storage_n02 an-a05n02.alteeve.ca starting ==== ==== [root@an-a05n02 ~]# /etc/init.d/rgmanager start Starting Cluster Service Manager: [ OK ] [root@an-a05n02 ~]# clustat Cluster Status for an-anvil-05 @ Mon Aug 1 18:17:34 2016 Member Status: Quorate Member Name ID Status ------ ---- ---- ------ an-a05n01.alteeve.ca 1 Online an-a05n02.alteeve.ca 2 Online, Local ==== And syslog from both node 1: ==== Aug 1 18:17:26 an-a05n01 rgmanager[9585]: I am node #1 Aug 1 18:17:26 an-a05n01 rgmanager[9585]: Resource Group Manager Starting Aug 1 18:17:28 an-a05n01 rgmanager[10652]: [clusterfs] stop: Could not match /dev/an-a05n01_vg0/shared with a real device Aug 1 18:17:28 an-a05n01 rgmanager[10705]: [clusterfs] stop: Could not match /dev/an-a05n01_vg0/shared with a real device Aug 1 18:17:29 an-a05n01 rgmanager[9585]: Starting stopped service service:libvirtd_n01 Aug 1 18:17:29 an-a05n01 rgmanager[9585]: Service service:libvirtd_n01 started ==== Node 2 had nothing new in its syslog, making me thing rgmanager is hung. So I checked it's ps auxf and sure enough, earlier clusvcadm and stop operations are there... ==== [root@an-a05n02 ~]# ps aux | grep -e rgmanager -e clusvcadm root 14436 0.0 0.0 106076 1492 ? Ss 18:12 0:00 bash -c /etc/init.d/rgmanager stop; /bin/echo rc:$? root 14443 0.0 0.0 106476 1896 ? S 18:12 0:00 /bin/bash /etc/init.d/rgmanager stop root 17321 0.0 0.0 36952 6256 ? S<Ls 17:28 0:00 rgmanager root 17323 0.0 0.0 570620 3044 ? S<l 17:28 0:02 rgmanager ==== So I kill those: ==== [root@an-a05n02 ~]# kill 14436 14443 17321 17323 [root@an-a05n02 ~]# ps aux | grep -e rgmanager -e clusvcadm root 17321 0.0 0.0 36952 6256 ? S<Ls 17:28 0:00 rgmanager root 17323 0.0 0.0 570620 3044 ? S<l 17:28 0:02 rgmanager ==== rgmanager eventually exits. 
Here's what appears in the logs:

====
Aug 1 18:21:15 an-a05n02 rgmanager[17323]: Shutting down
Aug 1 18:21:25 an-a05n02 rgmanager[17323]: Shutting down
Aug 1 18:21:30 an-a05n02 rgmanager[17323]: Service service:storage_n02 started
Aug 1 18:21:30 an-a05n02 rgmanager[17323]: Stopping service service:storage_n02
Aug 1 18:21:30 an-a05n02 rgmanager[18460]: [script] script:clvmd: stop of /etc/init.d/clvmd failed (returned 5)
Aug 1 18:21:30 an-a05n02 rgmanager[17323]: stop on script "clvmd" returned 1 (generic error)
Aug 1 18:21:30 an-a05n02 rgmanager[17323]: #12: RG service:storage_n02 failed to stop; intervention required
Aug 1 18:21:30 an-a05n02 rgmanager[17323]: Service service:storage_n02 is failed
Aug 1 18:21:31 an-a05n02 rgmanager[17323]: Disconnecting from CMAN
Aug 1 18:21:46 an-a05n02 rgmanager[17323]: Exiting
====

====
Aug 1 18:21:29 an-a05n01 rgmanager[9585]: Service service:storage_n01 started
Aug 1 18:21:30 an-a05n01 rgmanager[9585]: Member 2 shutting down
Aug 1 18:21:31 an-a05n01 rgmanager[9585]: Marking service:libvirtd_n02 as stopped: Restricted domain unavailable
Aug 1 18:21:31 an-a05n01 rgmanager[9585]: Stopping service service:storage_n02
Aug 1 18:21:32 an-a05n01 rgmanager[12670]: [script] script:clvmd: stop of /etc/init.d/clvmd failed (returned 5)
Aug 1 18:21:32 an-a05n01 rgmanager[9585]: stop on script "clvmd" returned 1 (generic error)
Aug 1 18:21:32 an-a05n01 rgmanager[9585]: Marking service:storage_n02 as 'disabled', but some resources may still be allocated!
Aug 1 18:21:32 an-a05n01 rgmanager[9585]: Service service:storage_n02 is disabled
====

Note that at this point, DRBD started on node 2, allowing node 1's storage service to finish coming up. So I try to start rgmanager again on node 2:

====
Aug 1 18:25:00 an-a05n02 rgmanager[19084]: I am node #2
Aug 1 18:25:00 an-a05n02 rgmanager[19084]: Resource Group Manager Starting
Aug 1 18:25:04 an-a05n02 rgmanager[19084]: Starting stopped service service:libvirtd_n02
Aug 1 18:25:05 an-a05n02 rgmanager[19084]: Service service:libvirtd_n02 started
====

This time it started, but it left the storage service off, despite it being set to 'autostart="1"':

====
Aug 1 18:25:00 an-a05n02 rgmanager[19084]: I am node #2
Aug 1 18:25:00 an-a05n02 rgmanager[19084]: Resource Group Manager Starting
Aug 1 18:25:04 an-a05n02 rgmanager[19084]: Starting stopped service service:libvirtd_n02
Aug 1 18:25:05 an-a05n02 rgmanager[19084]: Service service:libvirtd_n02 started
====

I'm guessing this is because node 1 marked it as stopped because it was in a restricted domain, but it should have started when a member of that restricted domain rejoined, shouldn't it? Anyway, manually starting it:

====
[root@an-a05n01 ~]# clusvcadm -F -e storage_n02
Local machine trying to enable service:storage_n02...Success
service:storage_n02 is now running on an-a05n02.alteeve.ca
====

That finally works and the cluster is operational again. What an ordeal! >_<

OK, I think this most recent issue's root cause is DRBD's '/etc/init.d/drbd status' returning non-zero. When this happens, rgmanager flags the service as 'failed' and tries to recover it, as it should. However, when it fails at that, things go sideways. The node doesn't get fenced (is it possible to fence a node if a service enters a failed state?) and DLM hangs.
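To illustrate digimer's point above about rgmanager keying off the init script's status return code, a quick spot check along these lines can be run outside the cluster stack (a sketch only; the service names are the ones used in this configuration):

====
# Show the raw LSB status codes that rgmanager's <script> resource agent sees.
# 0 means running; 3 means stopped. The agent treats any non-zero status as a
# monitor failure, which is what kicks off the stop/recover path seen in the
# surrounding logs.
for svc in drbd clvmd gfs2; do
    /etc/init.d/$svc status >/dev/null 2>&1
    echo "$svc status rc=$?"
done
====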
====
Aug 1 20:28:25 an-a05n01 rgmanager[9328]: [script] script:drbd: status of /etc/init.d/drbd failed (returned 3)
Aug 1 20:28:25 an-a05n01 rgmanager[22400]: status on script "drbd" returned 1 (generic error)
Aug 1 20:28:25 an-a05n01 rgmanager[22400]: Stopping service service:storage_n01
Aug 1 20:28:25 an-a05n01 rgmanager[9365]: [clusterfs] stop: Could not match /dev/an-a05n01_vg0/shared with a real device
Aug 1 20:28:25 an-a05n01 rgmanager[22400]: stop on clusterfs "sharedfs" returned 5 (program not installed)
Aug 1 20:28:25 an-a05n01 rgmanager[22400]: #12: RG service:storage_n01 failed to stop; intervention required
Aug 1 20:28:25 an-a05n01 rgmanager[22400]: Service service:storage_n01 is failed
Aug 1 22:24:03 an-a05n01 rgmanager[22400]: Stopping service service:storage_n01
Aug 1 22:24:03 an-a05n01 rgmanager[9383]: [clusterfs] stop: Could not match /dev/an-a05n01_vg0/shared with a real device
Aug 1 22:24:03 an-a05n01 rgmanager[22400]: stop on clusterfs "sharedfs" returned 5 (program not installed)
Aug 1 22:24:04 an-a05n01 rgmanager[22400]: Marking service:storage_n01 as 'disabled', but some resources may still be allocated!
Aug 1 22:24:04 an-a05n01 rgmanager[22400]: Service service:storage_n01 is disabled
Aug 1 22:24:09 an-a05n01 rgmanager[22400]: Starting disabled service service:storage_n01
====

Peer:

====
Aug 1 20:28:05 an-a05n02 rgmanager[12487]: [script] script:drbd: status of /etc/init.d/drbd failed (returned 3)
Aug 1 20:28:05 an-a05n02 rgmanager[26115]: status on script "drbd" returned 1 (generic error)
Aug 1 20:28:05 an-a05n02 rgmanager[26115]: Stopping service service:storage_n02
Aug 1 20:28:05 an-a05n02 rgmanager[12525]: [clusterfs] stop: Could not match /dev/an-a05n01_vg0/shared with a real device
Aug 1 20:28:05 an-a05n02 rgmanager[26115]: stop on clusterfs "sharedfs" returned 5 (program not installed)
Aug 1 20:28:05 an-a05n02 rgmanager[26115]: #12: RG service:storage_n02 failed to stop; intervention required
Aug 1 20:28:05 an-a05n02 rgmanager[26115]: Service service:storage_n02 is failed
====

The bit that stands out the most to me is:

====
[clusterfs] stop: Could not match /dev/an-a05n01_vg0/shared with a real device
stop on clusterfs "sharedfs" returned 5 (program not installed)
====

It can't match the LV because the backing device is gone. That's a problem, of course, but the cluster should be able to handle it, shouldn't it?

This is a test system (with no plans to put it into production). If you want access to poke away at it, I am happy to give it to you. It's already available over ssh; I can email the credentials.

Also note: the cluster.conf at the start of this ticket is not the one for the recent posts (insofar as node and cluster names go), but the structure is identical to the current one. It was the same installer; just different identification details were chosen at the time. The RPM versions of the nodes always matched each other exactly, but the two clusters differed (ccrs was 6.7.z updated and an-anvil-05 is RHEL 6.8.z updated as of today's available updates).

Well, there's quite a lot happening throughout your various comments that is difficult to fully explain with the information you've given, and things get complicated after you start doing things like attempting to kill daemons manually and then stopping and restarting them while they're not fully cleaned up, so it's difficult to respond to all of this with clear suggestions to solve all of your problems.
Again, I'll note that the nature of what you're looking for assistance with is something we need to be handling in a support engagement, not here in Bugzilla. Bugzilla is not a support tool; it is intended for the reporting of defects or feature enhancements to Red Hat products. There is no clear single defect in what you've described, and those unexplained areas that could have resulted from some unexpected flaw in rgmanager all require much deeper investigation to nail down the cause of, and could feasibly be a result of some other condition in your environment rather than a defect. It's clear that there are problems with rgmanager's behavior in your environment, but it is our support offering that is designed to guide users through assessing the nature of those problems and addressing them. Without a clear problem report here that highlights something that can be addressed in the product, I'm afraid there's not much we can do unless you're able to pursue this through our support team.

I will comment on a few items of note in your posts though, just to try to highlight areas where you need to focus.

- Indeed your problems start with drbd returning a failure code on a status op. Address that and perhaps the chain of events that follows will be prevented.

- rgmanager's recovery of the service after drbd failed results in stopping the other resources in the service, and clvmd fails to stop. A failure to stop puts the service in a failed state, and there is no further recovery after that. This requires manual intervention, as it indicates there was a condition that rgmanager could not automatically resolve to ensure the resources are no longer in use on that node.

- clvmd's stop failure indicates something was likely still using an open volume at that point, which is likely to be the GFS2 file system and libvirtd.

- DRBD having failed while clvmd had a mapped, active volume on it creates problems. If DRBD can fail in such a manner, then it leaves little recourse for rgmanager to recover the rest of the resources that depend on it in that chain. That is: if DRBD goes away, so too may the block device that backs clustered volumes in use by clvmd. If a GFS2 file system was using that volume, then the DRBD device going away means GFS2 can block or withdraw, leaving that filesystem unusable and stuck in a mounted state. If that happens, clvmd will still see its volumes as in use, preventing it from stopping, but the resource it needs in order to free itself up can't be made available until the service can stop, which is prevented by the volume being in use. Basically, I see no way to recover from this scenario, which calls into question the purpose of your cluster configuration.

- There are ways to cause the node to reboot itself and be fenced in the face of certain failures, like a GFS2 file system hitting a fatal error (use mount err=panic), a clusterfs resource failing to stop (force_unmount=1 self_fence=1 on the resource), or an lvm resource failing to stop (self_fence=1 on the resource). Our support team could help you better understand the appropriate course of action here. That's more a usage question or discussion about your preferences.

- After DRBD fails and clvmd fails to stop and is in a state where it's not expected to be able to get cleaned up, you begin attempting to kill daemons.
This is a bad idea for various reasons, not least of which is that it's not going to entirely clean up those blocked resources, but then when you start rgmanager again it's going to have no memory of the fact that resources were blocked, and can now once again try to interact with those resources, which may block or fail. This could block further interaction with rgmanager through clustat or clusvcadm, which could explain more of your problems from here on.

- After resources fail and you start killing daemons in this manner, we can't really predict what the behavior should be from that point on, so anything else unexpected is difficult to describe as a defect.

- You have 'invalid reply' messages through all of these events, which is troubling. These could certainly be contributing to the overall problematic behavior, as these are not expected and are generally not handled in any comprehensive way. If messages are coming across with invalid headers or data, then it indicates either a communication problem with packets being corrupted somehow, or rgmanager being in a problematic state on one node. But because it's not clear what state rgmanager was in on both nodes from the comments you've provided, and we don't have any additional diagnostic information about its internal state (like an application core of rgmanager, an internal debug dump, tcpdumps, a clear history showing what was done leading up to all of this, debug logs, etc.), there's not really much we can say about this condition other than that it's unexpected and is likely going to cause further problems after it occurs.

My suggestion is to speak with your partner account team regarding support subscriptions so that you can pursue a support engagement allowing us to look at these events in the necessary level of detail. Alternatively, you can reach out to Linbit to have them help you better understand why it failed and set off this chain of events. Aside from that, I don't see anything here that we can immediately address in the product.

-John

> Again, I'll note that the nature of what you're looking for assistance with is something we need to be handling in a support engagement, not here in bugzilla.

Which is why I offered up SSH access to the affected machines. Being an ISV, as I mentioned, I am on a self-support tier so I can't officially engage support.

> - Indeed your problems start with drbd returning a failure code on a status op. Address that and perhaps the chain of events that follows will be prevented.

No doubt, but the nature of HA is that you have to handle unexpected failures in a way that you can hopefully recover from. Handling a bad return status is exactly what rgmanager is designed to do. I am working in parallel to solve that, but when I do, it doesn't change the fact that rgmanager reacts badly to a bad status. The issue could arise again with a different component that throws an unexpected RC.

> - clvmd's stop failure indicates something was likely still using an open volume at that point, which is likely to be the GFS2 file system and libvirtd.

Reasonable assumption. That rgmanager itself stopped responding is why I think this is fundamentally a DLM issue. GFS2, clvmd and rgmanager are the three users of DLM. However, it's not hard-locked like in a failed fence, because it does respond to the kill commands, which is why I think rgmanager should have been able to handle it better. Leaving the system hung, when fencing is still an option, seems like a bad idea in an HA environment.
Fencing the failed node to unblock the cluster strikes me as a much saner option.

> - There are ways to cause the node to reboot itself and be fenced in the face of certain failures, like a GFS2 file system hitting a fatal error (mount with errors=panic), a clusterfs resource failing to stop (force_unmount=1 self_fence=1 on the resource), or an lvm resource failing to stop (self_fence=1 on the resource). Our support team could help you better understand the appropriate course of action here. That's more a usage question or discussion about your preferences.

I will try this; it might be just the ticket.

> - After DRBD fails and clvmd fails to stop and is in a state where it's not expected to be able to get cleaned up, you begin attempting to kill daemons. This is a bad idea for various reasons, not least of which is that it's not going to entirely clean up those blocked resources, and then when you start rgmanager again it's going to have no memory of the fact that resources were blocked, and can once again try to interact with those resources, which may block or fail. This could block further interaction with rgmanager through clustat or clusvcadm, which could explain more of your problems from here on.

I understand that 'kill' is bad, but faced with a non-responsive system, I was looking for the "least bad" solution.

So in general, if rgmanager fails, the node should be rebooted? Or would withdrawing the node from the cluster entirely (stopping cman and all resources) be enough to restore a clean state?

> - After resources fail and you start killing daemons in this manner, we can't really predict what the behavior should be from that point on, so anything else unexpected is difficult to describe as a defect.

Understandable.

> - You have 'invalid reply' messages through all of these events, which is troubling.

In all my years of using rgmanager, I've never seen those messages before. I understand the difficulty of diagnosing this without a reliable reproducer. When I tried to enable debug logging according to man cluster.conf, all I got was rate-limit messages to the tune of ~1000 messages being suppressed every few seconds. I'm much more of a user than a dev, so I am probably not doing a good job helping you here.

> My suggestion is to speak with your partner account team regarding support subscriptions so that you can pursue a support engagement allowing us to look at these events in the necessary level of detail.

I'll try. As for LINBIT, I've already engaged them as well. However, knowing that rgmanager reacts poorly to certain script RCs worries me, independent of the current root cause.

Very much appreciate your lengthy reply and effort.

(In reply to digimer from comment #17)
> Which is why I offered up SSH access to the affected machines. Being an ISV, as I mentioned, I am on a self-support tier so I can't officially engage support.

Sorry, I won't be able to connect directly in that manner.

> > - Indeed your problems start with drbd returning a failure code on a status op. Address that and perhaps the chain of events that follows will be prevented.
>
> No doubt, but the nature of HA is that you have to handle unexpected failures in a way that you can hopefully recover from. Handling a bad return status is exactly what rgmanager is designed to do. I am working in parallel to solve that, but when I do, it doesn't change the fact that rgmanager reacts badly to a bad status. The issue could arise again with a different component that throws an unexpected RC.
rgmanager did not, from what I see, respond badly to the DRBD status failure. It executed its stop procedures as it should during an attempt to recover, and then clvmd returned a failure during stop, which is expected to put the service in a failed state. As I said, if we can't stop, then we can't be sure the resource is free to be used throughout the cluster, so we require manual intervention in response. This is expected and is part of the design of rgmanager.

> > - clvmd's stop failure indicates something was likely still using an open volume at that point, which is likely to be the GFS2 file system and libvirtd.
>
> Reasonable assumption. That rgmanager itself stopped responding is why I think this is fundamentally a DLM issue. GFS2, clvmd and rgmanager are the three users of DLM. However, it's not hard locked like in a failed fence, because it does respond to the kill commands, which is why I think rgmanager should have been able to handle it better. Leaving the system hung, when fencing is still an option, seems like a bad idea in an HA environment. Fencing the failed node to unblock the cluster strikes me as a much saner option.

Based on what I see, I wouldn't agree that this is what it was blocked on. But without data about what rgmanager's state was internally, there's no point in me arguing it. We can't prove it one way or the other from the captured information. An application core capturing rgmanager's internal state on each node would be a good start. DLM's lock states, netstat connection information, full logs, and a complete description of every action taken would also be helpful. I can't guarantee that we'll be able to comprehensively investigate this and reach a conclusion without a support engagement, but having that data would at least give us the ability to see more of what is happening.

> > - There are ways to cause the node to reboot itself and be fenced in the face of certain failures, like a GFS2 file system hitting a fatal error (mount with errors=panic), a clusterfs resource failing to stop (force_unmount=1 self_fence=1 on the resource), or an lvm resource failing to stop (self_fence=1 on the resource). Our support team could help you better understand the appropriate course of action here. That's more a usage question or discussion about your preferences.
>
> I will try this; it might be just the ticket.

> > - After DRBD fails and clvmd fails to stop and is in a state where it's not expected to be able to get cleaned up, you begin attempting to kill daemons. This is a bad idea for various reasons, not least of which is that it's not going to entirely clean up those blocked resources, and then when you start rgmanager again it's going to have no memory of the fact that resources were blocked, and can once again try to interact with those resources, which may block or fail. This could block further interaction with rgmanager through clustat or clusvcadm, which could explain more of your problems from here on.
>
> I understand that 'kill' is bad, but faced with a non-responsive system, I was looking for the "least bad" solution.

Understood. My point was more that once you go down this path, everything that follows is off the rails, and we can't give much consideration to how everything else behaves following these actions.

> So in general, if rgmanager fails, the node should be rebooted? Or would withdrawing the node from the cluster entirely (stopping cman and all resources) be enough to restore a clean state?
Difficult to say without being able to pinpoint the exact reason for things being stuck. But with clvmd and GFS2 in the state they were in, you're unlikely to be able to stop those cleanly, so you'd likely need to reboot instead.

> > - After resources fail and you start killing daemons in this manner, we can't really predict what the behavior should be from that point on, so anything else unexpected is difficult to describe as a defect.
>
> Understandable.

> > - You have 'invalid reply' messages through all of these events, which is troubling.
>
> In all my years of using rgmanager, I've never seen those messages before. I understand the difficulty of diagnosing this without a reliable reproducer. When I tried to enable debug logging according to man cluster.conf, all I got was rate-limit messages to the tune of ~1000 messages being suppressed every few seconds. I'm much more of a user than a dev, so I am probably not doing a good job helping you here.

We've seen them before, and they've resulted from a variety of conditions, most often the two scenarios I described: something interfering with rgmanager communications but not with general token passing, or rgmanager put into some unexpected state through a problematic sequence of events. There's no single condition I can point to that I would expect to explain your situation.

> > My suggestion is to speak with your partner account team regarding support subscriptions so that you can pursue a support engagement allowing us to look at these events in the necessary level of detail.
>
> I'll try. As for LINBIT, I've already engaged them as well. However, knowing that rgmanager reacts poorly to certain script RCs worries me, independent of the current root cause.

Again, I don't see rgmanager reacting poorly to any script returns. The sequence that follows the drbd failure is expected. As I explained, the way you've got your resources laid out means that a DRBD failure pretty much spells the end of what rgmanager can do to recover. A DRBD failure means GFS2 withdraws or blocks, and clvmd can't find the device that backs its mapped volumes. I can say, without a doubt, that we should expect rgmanager to be unable to recover those resources on that node if this happens. So addressing why DRBD failed seems to me to be the most critical part of all this.

If there's something specific you think rgmanager did incorrectly in how it responded to that script failure, please restate it for me. Perhaps I'm missing your point.

-John

Thank you for your reply, and I understand your position.
> If there's something specific you think rgmanager did incorrectly in how it responded to that script failure, please restate it for me. Perhaps I'm missing your point.
I suppose the only real point I can make is that 'clusvcadm -e <foo>', when <foo> had failed and then been disabled, shouldn't hang indefinitely. A timeout of some sort that exits (reporting a failure) would be much better.
cheers
Sorry, I guess I did not address that point directly, but only hinted at it by mentioning the other conditions I felt were driving it.
clusvcadm -d and clusvcadm -e could have very easily (and somewhat expectedly) blocked trying to stop or start one of the resources that was left in a bad state following DRBD's unexpected exit. This is what I was trying to get at when I said that if DRBD goes away, the other resources are mostly hosed at that point. LVM operations could reasonably block depending on what state the volumes are left in. If GFS2 withdraws, it suspends the underlying volume, which can then cause further LVM operations to block when they attempt to scan that device (depending on settings, but this is not uncommon to see). So attempting to start clvmd after it had been left running when the service stop failed could encounter this blocking. Similarly, attempting to start the filesystem resource could block for a variety of reasons if it had withdrawn.
The point is: if DRBD goes away while these other resources are active, their design does not leave any simple way to then un-wedge them to start over fresh. clvmd and GFS2 both have components in the kernel, and so starting or stopping these types of resources requires a somewhat complex sequence of operations involving userspace and kernelspace. If one or the other gets stuck, such as if there is a withdrawal or a suspended device, then those components can be left in an uninterruptible state that can't just be cleared out and started fresh. So again, this is why I say your service design is problematic. If DRBD fails in this manner, there's not much else that can happen with the other components to clean them up.
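To make that concrete, here is a purely illustrative cluster.conf fragment showing where the self-fencing knobs mentioned earlier would sit. Every name, device and path below is a placeholder rather than anything taken from this cluster, and the service tree is only a rough approximation of the kind of layout being discussed:

====
<rm>
  <resources>
    <!-- Placeholder resources; adjust names, devices and init scripts to the real layout. -->
    <script name="drbd_example" file="/etc/init.d/drbd"/>
    <script name="clvmd_example" file="/etc/init.d/clvmd"/>
    <!-- errors=panic asks GFS2 to panic the node on a fatal error instead of withdrawing.
         force_unmount kills users of the mount during stop; self_fence reboots the node
         if the unmount still fails, so the node gets fenced and recovered cleanly. -->
    <clusterfs name="shared_fs_example" device="/dev/example_vg/example_lv"
               mountpoint="/shared" fstype="gfs2" options="errors=panic"
               force_unmount="1" self_fence="1"/>
  </resources>
  <service name="storage_example" autostart="1" recovery="restart">
    <!-- Children start after (and stop before) their parent, so the chain here is
         drbd, then clvmd, then the GFS2 mount. -->
    <script ref="drbd_example">
      <script ref="clvmd_example">
        <clusterfs ref="shared_fs_example"/>
      </script>
    </script>
  </service>
</rm>
====

The trade-off, of course, is that a stop failure then costs you the node rather than leaving the service wedged, which is generally the lesser evil in an HA setup.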
> A timeout of some sort that exits (reporting a failure) would be much better.
This can be configured. rgmanager doesn't enforce timeouts on resource operations by default, but it can if you enable __enforce_timeouts. That said, it doesn't resolve the problems I noted previously with those components being stuck in a state that can't be recovered. You'd simply have a more responsive clusvcadm command, but likely would still have no recourse to get those resources functional again.
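For reference, a minimal sketch of what enabling that might look like, again with placeholder names; the timeouts being enforced are the ones declared in the resource agent metadata, not values set here:

====
<service name="storage_example" recovery="restart">
  <!-- __enforce_timeouts="1" on a resource reference tells rgmanager to fail a
       start/stop/status operation that exceeds the agent's declared timeout
       instead of letting it block indefinitely. -->
  <script ref="drbd_example" __enforce_timeouts="1">
    <clusterfs ref="shared_fs_example" __enforce_timeouts="1"/>
  </script>
</service>
====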
Red Hat Enterprise Linux 6 is in the Production 3 Phase. During the Production 3 Phase, Critical impact Security Advisories (RHSAs) and selected Urgent Priority Bug Fix Advisories (RHBAs) may be released as they become available. The official life cycle policy can be reviewed here: http://redhat.com/rhel/lifecycle

This issue does not meet the inclusion criteria for the Production 3 Phase and will be marked as CLOSED/WONTFIX. If this remains a critical requirement, please contact Red Hat Customer Support to request a re-evaluation of the issue, citing a clear business justification. Note that a strong business justification will be required for re-evaluation. Red Hat Customer Support can be contacted via the Red Hat Customer Portal at the following URL: https://access.redhat.com/

The needinfo request[s] on this closed bug have been removed as they have been unresolved for 1000 days.