Bug 1079032 - Unable to add '<action name="stop" timeout="10m" />' child element to a <vm .../> service using 'ccs'
Keywords:
Status: CLOSED ERRATA
Alias: None
Product: Red Hat Enterprise Linux 6
Classification: Red Hat
Component: ricci
Version: 6.5
Hardware: Unspecified
OS: Unspecified
Priority: medium
Severity: medium
Target Milestone: rc
Assignee: Chris Feist
QA Contact: cluster-qe@redhat.com
URL:
Whiteboard:
Depends On:
Blocks: 1212516
 
Reported: 2014-03-20 19:28 UTC by digimer
Modified: 2015-07-22 07:33 UTC
CC: 4 users

Fixed In Version: ccs-0.16.2-80.el6.x86_64
Doc Type: Bug Fix
Doc Text:
Clone Of:
Clones: 1212516
Environment:
Last Closed: 2015-07-22 07:33:43 UTC




Links
System ID Priority Status Summary Last Updated
Red Hat Product Errata RHBA-2015:1405 normal SHIPPED_LIVE ricci bug fix and enhancement update 2015-07-20 18:07:08 UTC

Description digimer 2014-03-20 19:28:06 UTC
Description of problem:

This is something of a two-part issue;

By default, rgmanager (or vm.sh) only gives virtual machines two minutes to shut down after calling 'disable' against a VM service. If the VM isn't finished shutting down by then, the server is forced off. This behaviour can be modified by adding '<vm ....><action name="stop" timeout="10m" /></vm>' to give the server more time to shut down, which is good, though an option to say "wait indefinitely" would be safest.

The reason this is a concern is that, by default, MS Windows guests will download updates but not install them until the OS shuts down. This can cause Windows to take many minutes to actually shut down, during which time it warns the user not to turn off their computer. So if rgmanager forces the server off, the guest's OS can be damaged or destroyed.

So the first part of this bug is a request for a method of telling rgmanager to wait indefinitely for a VM service to stop. At the very least, the service could be marked as 'failed' instead of forced off, if blocking is a concern.

The second part of this bug, and the most pressing in my mind, is that 'ccs' cannot add/modify/remove the '<action .../>' sub-element under a VM resource. With this support, a user could at least set a long enough timeout to minimize the chances of killing a VM mid-OS-update. Without it, the user (or her app ;) ) has to edit the cluster.conf config directly, which is obviously not ideal in production environments.

To confirm this behaviour, I did the following test (copied from my linux-cluster ML post):

====
With 

<vm autostart="0" domain="primary_n01" exclusive="0" max_restarts="2" name="vm01-rhel6" path="/shared/definitions/" recovery="restart" restart_expire_time="600"/>

I called disable on a VM with GNOME running, so that I could abort the VM's shut down.

an-c05n01:~# date; clusvcadm -d vm:vm01-rhel6; date
Wed Mar 19 21:06:29 EDT 2014
Local machine disabling vm:vm01-rhel6...Success
Wed Mar 19 21:08:36 EDT 2014

2 minutes and 7 seconds, then rgmanager forced the VM off. Had this been a Windows guest in the middle of installing updates, it would be highly likely to be screwed now.

To confirm, I changed the config to:

<vm autostart="0" domain="primary_n01" exclusive="0" max_restarts="2" name="vm01-rhel6" path="/shared/definitions/" recovery="restart" restart_expire_time="600">
  <action name="stop" timeout="10m"/>
</vm>

Then I repeated the test:

an-c05n01:~# date; clusvcadm -d vm:vm01-rhel6; date
Wed Mar 19 21:13:18 EDT 2014
Local machine disabling vm:vm01-rhel6...Success
Wed Mar 19 21:23:31 EDT 2014

10 minutes and 13 seconds before the cluster killed the server, much less likely to interrupt an in-progress OS update (truth be told, I plan to set 30 minutes).
====
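For reference, rgmanager accepts suffixed timeout values such as "10m" in the '<action>' element. The following sketch shows how a suffixed timeout string like that maps to seconds; this is illustrative only (the exact set of suffixes and parsing rules is defined by rgmanager's own configuration parser, not by this code):

```python
import re

def parse_timeout(value: str) -> int:
    """Convert a suffixed timeout string such as '10m' or '30s' to seconds.

    Illustrative sketch: handles the common s/m/h suffixes and treats a
    bare number as seconds. rgmanager's real parser is authoritative.
    """
    units = {"s": 1, "m": 60, "h": 3600}
    match = re.fullmatch(r"(\d+)([smh]?)", value.strip())
    if not match:
        raise ValueError(f"unrecognized timeout: {value!r}")
    number, suffix = match.groups()
    return int(number) * units.get(suffix, 1)

print(parse_timeout("10m"))  # 600
```

So the '10m' in the test above corresponds to the roughly 600-second window observed between the two 'date' calls.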

Version-Release number of selected component (if applicable):

ricci-0.16.2-69.el6_5.1.x86_64
ccs-0.16.2-69.el6_5.1.x86_64
rgmanager-3.0.12.1-19.el6.x86_64
resource-agents-3.9.2-40.el6_5.6.x86_64


How reproducible:

100%


Steps to Reproduce:
1. ccs doesn't support adding an 'action' child element to <vm /> resources
2. rgmanager doesn't support unlimited vm stop timeouts

Actual results:

VMs are forced off after two minutes. There is no way to change that with 'ccs', forcing the user('s apps) to modify cluster.conf directly. Even then, only a longer-but-still-arbitrary timeout can be set.


Expected results:

1. ccs should be able to set <action ...> under a <vm> service.
2. rgmanager should not kill a VM that hasn't shut down by a given time (should mark it as 'failed' perhaps?).


Additional info:

Mailing list thread:
https://www.redhat.com/archives/linux-cluster/2014-March/msg00027.html

Comment 2 digimer 2014-03-20 19:49:42 UTC
Please ignore the rgmanager component of this bug. I will open a new bug for that.

Comment 3 digimer 2014-03-20 20:02:36 UTC
rgmanager section now here:

https://bugzilla.redhat.com/show_bug.cgi?id=1079039

Comment 5 Chris Feist 2015-03-02 23:29:11 UTC
Fixed upstream here:

https://github.com/feist/ccs/commit/10e7b6aedd41c59fee5416baa37c45c2f70fdb72

Comment 6 Chris Feist 2015-03-03 23:17:16 UTC
Before Fix:

[root@ask-03 ~]# rpm -q ccs  
ccs-0.16.2-75.el6.x86_64
[root@ask-03 ~]# ccs -f test.conf --createcluster test_cluster
[root@ask-03 ~]# ccs -f test.conf --addvm my_vm
[root@ask-03 ~]# ccs -f test.conf --addaction my_vm stop timeout=10m
Usage: ccs [OPTION]...
Cluster configuration system.
....


After Fix:
[root@ask-02 ~]# rpm -q ccs
ccs-0.16.2-77.el6.x86_64
[root@ask-02 ~]# ccs -f test.conf --createcluster test_cluster
[root@ask-02 ~]# ccs -f test.conf --addvm my_vm
[root@ask-02 ~]# ccs -f test.conf --addaction my_vm stop timeout=10m
[root@ask-02 ~]# cat test.conf
<cluster config_version="3" name="test_cluster">
  <fence_daemon/>
  <clusternodes/>
  <cman/>
  <fencedevices/>
  <rm>
    <failoverdomains/>
    <resources/>
    <vm name="my_vm">
      <action name="stop" timeout="10m"/>
    </vm>
  </rm>
</cluster>
[root@ask-02 ~]# ccs -h | grep '[add|rm]vm'
      --addvm <virtual machine name> [vm options] ...
      --rmvm <virtual machine name>

[root@ask-02 ~]# ccs -f test.conf --rmaction my_vm
[root@ask-02 ~]# cat test.conf
<cluster config_version="4" name="test_cluster">
  <fence_daemon/>
  <clusternodes/>
  <cman/>
  <fencedevices/>
  <rm>
    <failoverdomains/>
    <resources/>
    <vm name="my_vm">
    </vm>
  </rm>
</cluster>
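In effect, '--addaction' appends an '<action>' child to the matching '<vm>' element and '--rmaction' removes it. A minimal Python sketch of that transformation, using the standard library's ElementTree (the helper name 'add_action' is hypothetical; the real ccs tool also bumps config_version and validates against the cluster schema):

```python
import xml.etree.ElementTree as ET

def add_action(conf: ET.Element, vm_name: str, action: str, timeout: str) -> None:
    """Append an <action> child to the named <vm> element.

    Hypothetical helper mirroring what 'ccs --addaction' does to
    cluster.conf, as shown in the transcript above.
    """
    vm = conf.find(f".//vm[@name='{vm_name}']")
    if vm is None:
        raise ValueError(f"no <vm name={vm_name!r}> in config")
    ET.SubElement(vm, "action", {"name": action, "timeout": timeout})

# Minimal stand-in for the generated cluster.conf.
conf = ET.fromstring(
    '<cluster config_version="3" name="test_cluster">'
    '<rm><vm name="my_vm"/></rm></cluster>'
)
add_action(conf, "my_vm", "stop", "10m")
print(ET.tostring(conf, encoding="unicode"))
```

The resulting XML contains the same '<action name="stop" timeout="10m"/>' child under '<vm name="my_vm">' as in the 'cat test.conf' output above.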

Comment 11 Chris Feist 2015-04-21 21:53:25 UTC
Updated fix, disallowing duplicate actions for the same resource, here:

https://github.com/feist/ccs/commit/62bd05fdbbef0edd55968bdef767cdc81c6c506d

Comment 12 Chris Feist 2015-04-21 22:00:38 UTC
Before Fix:

[root@host-620 ~]# rpm -q ccs
ccs-0.16.2-79.el6.x86_64
[root@host-620 ~]# ccs -f test.conf --createcluster test
[root@host-620 ~]# ccs -f test.conf --addvm vmname1
[root@host-620 ~]# ccs -f test.conf --addaction vmname1 start timeout=5m
[root@host-620 ~]# ccs -f test.conf --addaction vmname1 start timeout=5m
[root@host-620 ~]# ccs -f test.conf --addaction vmname1 start timeout=5m
[root@host-620 ~]# ccs -f test.conf --lsservices
resources: 
virtual machines:
    vm: name=vmname1
      action: name=start, timeout=5m
      action: name=start, timeout=5m
      action: name=start, timeout=5m


After Fix:

[root@host-620 ~]# rpm -q ccs
ccs-0.16.2-80.el6.x86_64
[root@host-620 ~]# ccs -f test.conf --createcluster test
[root@host-620 ~]# ccs -f test.conf --addvm vmname1
[root@host-620 ~]# ccs -f test.conf --addaction vmname1 start timeout=5m
[root@host-620 ~]# ccs -f test.conf --addaction vmname1 start timeout=5m
Error: 'start' action already exists for 'vmname1'
[root@host-620 ~]# ccs -f test.conf --addaction vmname1 start timeout=5m
Error: 'start' action already exists for 'vmname1'
[root@host-620 ~]# ccs -f test.conf --lsservices
resources: 
virtual machines:
    vm: name=vmname1
      action: name=start, timeout=5m
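The guard the updated fix adds can be sketched as a pre-insert check for an existing '<action>' of the same name. This is an illustration of the behavior shown above, not the actual ccs code (the function name and error handling here are hypothetical):

```python
import xml.etree.ElementTree as ET

def add_action_unique(vm: ET.Element, action: str, timeout: str) -> None:
    """Add an <action> child only if none with the same name exists.

    Sketch of the duplicate check described in Comment 11; the real
    implementation lives in ccs itself.
    """
    if vm.find(f"action[@name='{action}']") is not None:
        raise ValueError(f"'{action}' action already exists for '{vm.get('name')}'")
    ET.SubElement(vm, "action", {"name": action, "timeout": timeout})

vm = ET.Element("vm", {"name": "vmname1"})
add_action_unique(vm, "start", "5m")       # first add succeeds
try:
    add_action_unique(vm, "start", "5m")   # duplicate is rejected
except ValueError as err:
    print(err)  # 'start' action already exists for 'vmname1'
```

Only one '<action name="start">' child remains after the second call is rejected, matching the '--lsservices' output in the after-fix transcript.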

Comment 16 errata-xmlrpc 2015-07-22 07:33:43 UTC
Since the problem described in this bug report should be
resolved in a recent advisory, it has been closed with a
resolution of ERRATA.

For information on the advisory, and where to find the updated
files, follow the link below.

If the solution does not work for you, open a new bug report.

https://rhn.redhat.com/errata/RHBA-2015-1405.html

