Bug 1030680 - depend_mode="soft" is not honoured in vm.sh (also, a typo)

Status: CLOSED WONTFIX
Product: Red Hat Enterprise Linux 6
Classification: Red Hat
Component: rgmanager
Version: 6.4
Hardware: Unspecified
OS: Unspecified
Priority: low
Severity: low
Target Milestone: rc
Assigned To: Ryan McCabe
QA Contact: Cluster QE
Doc Type: Bug Fix
Type: Bug
CC: 3 users
Reported: 2013-11-14 17:58 EST by digimer
Modified: 2014-02-28 09:50 EST
Last Closed: 2014-02-28 09:50:38 EST

Description digimer 2013-11-14 17:58:36 EST
Description of problem:

When a VM service is configured to soft-depend on another service, stopping that other service causes the VM to stop as well, even if the VM is running on another node.

According to vm.sh's meta-data:

====
        <parameter name="depend_mode">
            <longdesc lang="en">
		Service dependency mode.
		hard - This service is stopped/started if its dependency
		       is stopped/started
		soft - This service only depends on the other service for
		       initial startip.  If the other service stops, this
		       service is not stopped.
            </longdesc>
====

note: s/startip/startup/
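
For contrast, a hard dependency on the same service would be declared like this (illustrative fragment, not from my configuration; per the metadata above, stopping the dependency is then expected to stop the VM, while "soft" should only gate the initial start):

====
		<vm autostart="1" depend="service:storage_n01" depend_mode="hard" domain="primary_n01" exclusive="0" name="vm-hypothetical" path="/shared/definitions/" recovery="restart"/>
====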

Here are the relevant cluster.conf entries:

====
		<service autostart="1" domain="only_n01" exclusive="0" name="libvirtd_n01" recovery="restart">
			<script ref="libvirtd"/>
		</service>
		<service autostart="1" domain="only_n02" exclusive="0" name="libvirtd_n02" recovery="restart">
			<script ref="libvirtd"/>
		</service>
		<vm autostart="1" depend="service:storage_n01" depend_mode="soft" domain="primary_n01" exclusive="0" max_restarts="2" name="vm01-win2008" path="/shared/definitions/" recovery="restart" restart_expire_time="600">
			<action name="stop" timeout="30m"/>
		</vm>
		<vm autostart="1" depend="service:storage_n02" depend_mode="soft" domain="primary_n02" exclusive="0" max_restarts="2" name="vm02-win2012" path="/shared/definitions/" recovery="restart" restart_expire_time="600">
			<action name="stop" timeout="30m"/>
		</vm>
====

When I stop 'rgmanager' on 'an-c05n01' while 'vm01-win2008' is running on 'an-c05n02', the VM is shut down, despite 'depend_mode="soft"'.

====
Nov 14 17:43:16 an-c05n01 rgmanager[2963]: Service service:storage_n01 is stopped
Nov 14 17:43:16 an-c05n01 rgmanager[2963]: Shutting down
Nov 14 17:43:17 an-c05n01 rgmanager[2963]: Disconnecting from CMAN
Nov 14 17:43:17 an-c05n01 rgmanager[2963]: Exiting
Nov 14 17:43:19 an-c05n01 kernel: dlm: closing connection to node 2
Nov 14 17:43:19 an-c05n01 kernel: dlm: closing connection to node 1
---
Nov 14 17:43:16 an-c05n02 rgmanager[14414]: Member 1 shutting down
Nov 14 17:43:17 an-c05n02 rgmanager[14414]: Marking service:storage_n01 as stopped: Restricted domain unavailable
Nov 14 17:43:17 an-c05n02 rgmanager[14414]: Marking service:libvirtd_n01 as stopped: Restricted domain unavailable
Nov 14 17:43:19 an-c05n02 corosync[14181]:   [QUORUM] Members[1]: 2
Nov 14 17:43:19 an-c05n02 corosync[14181]:   [TOTEM ] A processor joined or left the membership and a new membership was formed.
Nov 14 17:43:19 an-c05n02 corosync[14181]:   [CPG   ] chosen downlist: sender r(0) ip(10.20.50.2) ; members(old:2 left:1)
Nov 14 17:43:19 an-c05n02 corosync[14181]:   [MAIN  ] Completed service synchronization, ready to provide service.
Nov 14 17:43:19 an-c05n02 kernel: dlm: closing connection to node 1
Nov 14 17:43:20 an-c05n02 kernel: vbr2: port 2(vnet0) entering disabled state
Nov 14 17:43:20 an-c05n02 kernel: device vnet0 left promiscuous mode
Nov 14 17:43:20 an-c05n02 kernel: vbr2: port 2(vnet0) entering disabled state
Nov 14 17:43:21 an-c05n02 rgmanager[14414]: Service vm:vm01-win2008 is stopped
Nov 14 17:43:22 an-c05n02 ntpd[2318]: Deleting interface #15 vnet0, fe80::fc54:ff:fe8e:6732#123, interface stats: received=0, sent=0, dropped=0, active_time=22 secs
====

This is a pretty big problem, as it makes 'depend' useless.

Version-Release number of selected component (if applicable):

Fully updated RHEL 6.4.

[root@an-c05n01 ~]# rpm -q rgmanager cman resource-agents
rgmanager-3.0.12.1-17.el6.x86_64
cman-3.0.12.1-49.el6_4.2.x86_64
resource-agents-3.9.2-21.el6_4.8.x86_64

[root@an-c05n02 ~]# rpm -q rgmanager cman resource-agents
rgmanager-3.0.12.1-17.el6.x86_64
cman-3.0.12.1-49.el6_4.2.x86_64
resource-agents-3.9.2-21.el6_4.8.x86_64

How reproducible:

100%

Steps to Reproduce:
1. Set up a VM service that soft-depends on another service in a restricted failover domain.
2. Stop rgmanager on the node hosting the depended-on service, or disable that service (example commands below).
3. The soft-dependent VM service stops.
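
For reference, either of these triggers it on my cluster (service name taken from the cluster.conf entries above; standard RHEL 6 commands, nothing bug-specific):

====
# On the node hosting the dependency (an-c05n01), either stop rgmanager outright:
service rgmanager stop
# ...or just disable the depended-on service:
clusvcadm -d service:storage_n01
====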

Actual results:

The dependent VM service stops, despite the soft dependency.

Expected results:

Once the service has started, a change in the state of the service it depends on should no longer affect the running service.

Additional info:

Full cluster.conf:

====
<?xml version="1.0"?>
<cluster config_version="13" name="an-cluster-05">
	<cman expected_votes="1" two_node="1"/>
	<clusternodes>
		<clusternode name="an-c05n01.alteeve.ca" nodeid="1">
			<fence>
				<method name="ipmi">
					<device action="reboot" delay="15" name="ipmi_n01"/>
				</method>
				<method name="pdu">
					<device action="reboot" name="pdu1" port="1"/>
					<device action="reboot" name="pdu2" port="1"/>
				</method>
			</fence>
		</clusternode>
		<clusternode name="an-c05n02.alteeve.ca" nodeid="2">
			<fence>
				<method name="ipmi">
					<device action="reboot" name="ipmi_n02"/>
				</method>
				<method name="pdu">
					<device action="reboot" name="pdu1" port="2"/>
					<device action="reboot" name="pdu2" port="2"/>
				</method>
			</fence>
		</clusternode>
	</clusternodes>
	<fencedevices>
		<fencedevice agent="fence_ipmilan" ipaddr="an-c05n01.ipmi" login="admin" name="ipmi_n01" passwd="secret"/>
		<fencedevice agent="fence_ipmilan" ipaddr="an-c05n02.ipmi" login="admin" name="ipmi_n02" passwd="secret"/>
		<fencedevice agent="fence_apc_snmp" ipaddr="an-p01.alteeve.ca" name="pdu1"/>
		<fencedevice agent="fence_apc_snmp" ipaddr="an-p02.alteeve.ca" name="pdu2"/>
	</fencedevices>
	<fence_daemon post_join_delay="30"/>
	<totem rrp_mode="none" secauth="off"/>
	<rm log_level="5">
		<resources>
			<script file="/etc/init.d/drbd" name="drbd"/>
			<script file="/etc/init.d/clvmd" name="clvmd"/>
			<script file="/etc/init.d/libvirtd" name="libvirtd"/>
			<clusterfs device="/dev/an-c05n01_vg0/shared" force_unmount="1" fstype="gfs2" mountpoint="/shared" name="sharedfs"/>
		</resources>
		<failoverdomains>
			<failoverdomain name="only_n01" nofailback="1" ordered="0" restricted="1">
				<failoverdomainnode name="an-c05n01.alteeve.ca"/>
			</failoverdomain>
			<failoverdomain name="only_n02" nofailback="1" ordered="0" restricted="1">
				<failoverdomainnode name="an-c05n02.alteeve.ca"/>
			</failoverdomain>
			<failoverdomain name="primary_n01" nofailback="1" ordered="1" restricted="1">
				<failoverdomainnode name="an-c05n01.alteeve.ca" priority="1"/>
				<failoverdomainnode name="an-c05n02.alteeve.ca" priority="2"/>
			</failoverdomain>
			<failoverdomain name="primary_n02" nofailback="1" ordered="1" restricted="1">
				<failoverdomainnode name="an-c05n01.alteeve.ca" priority="2"/>
				<failoverdomainnode name="an-c05n02.alteeve.ca" priority="1"/>
			</failoverdomain>
		</failoverdomains>
		<service autostart="1" domain="only_n01" exclusive="0" name="storage_n01" recovery="restart">
			<script ref="drbd">
				<script ref="clvmd">
					<clusterfs ref="sharedfs"/>
				</script>
			</script>
		</service>
		<service autostart="1" domain="only_n02" exclusive="0" name="storage_n02" recovery="restart">
			<script ref="drbd">
				<script ref="clvmd">
					<clusterfs ref="sharedfs"/>
				</script>
			</script>
		</service>
		<service autostart="1" domain="only_n01" exclusive="0" name="libvirtd_n01" recovery="restart">
			<script ref="libvirtd"/>
		</service>
		<service autostart="1" domain="only_n02" exclusive="0" name="libvirtd_n02" recovery="restart">
			<script ref="libvirtd"/>
		</service>
		<vm autostart="1" depend="service:storage_n01" depend_mode="soft" domain="primary_n01" exclusive="0" max_restarts="2" name="vm01-win2008" path="/shared/definitions/" recovery="restart" restart_expire_time="600">
			<action name="stop" timeout="30m"/>
		</vm>
		<vm autostart="1" depend="service:storage_n02" depend_mode="soft" domain="primary_n02" exclusive="0" max_restarts="2" name="vm02-win2012" path="/shared/definitions/" recovery="restart" restart_expire_time="600">
			<action name="stop" timeout="30m"/>
		</vm>
	</rm>
</cluster>
====
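
In case it helps with reproduction, I validate and push this config with the standard RHEL 6 utilities (shown for completeness; nothing unusual in the workflow):

====
ccs_config_validate      # check cluster.conf against the schema
cman_tool version -r     # propagate the updated config to the peer node
====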
Comment 2 Fabio Massimo Di Nitto 2013-11-15 00:49:53 EST
Ryan, did we ever support depend_mode for VMs at all?

AFAIR it's only exposed for services.
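
The parameter does appear in vm.sh's metadata, as quoted in the description; easy to confirm with a grep (assuming the stock agent path):

====
grep -A 6 depend_mode /usr/share/cluster/vm.sh
====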
Comment 3 RHEL Product and Program Management 2013-11-18 01:15:45 EST
This request was not resolved in time for the current release.
Red Hat invites you to ask your support representative to
propose this request, if still desired, for consideration in
the next release of Red Hat Enterprise Linux.
Comment 4 Fabio Massimo Di Nitto 2014-02-28 09:50:38 EST
RHEL never supported service interdependency. I strongly recommend using Pacemaker to address this issue.
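
In Pacemaker, an optional ordering constraint gives roughly the "soft" semantics requested here: the order is only enforced when both resources are started in the same transition, and stopping one resource later does not stop the other. A sketch with pcs (resource names are hypothetical):

====
pcs constraint order start my_storage then start my_vm kind=Optional
====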
