Bug 1079032 - Unable to add '<action name="stop" timeout="10m" />' child element to a <vm .../> service using 'ccs'
Keywords:
Status: CLOSED ERRATA
Alias: None
Product: Red Hat Enterprise Linux 6
Classification: Red Hat
Component: ricci
Version: 6.5
Hardware: Unspecified
OS: Unspecified
Priority: medium
Severity: medium
Target Milestone: rc
Assignee: Chris Feist
QA Contact: cluster-qe@redhat.com
URL:
Whiteboard:
Depends On:
Blocks: 1212516
 
Reported: 2014-03-20 19:28 UTC by digimer
Modified: 2015-07-22 07:33 UTC
CC: 4 users

Fixed In Version: ccs-0.16.2-80.el6.x86_64
Doc Type: Bug Fix
Doc Text:
Clone Of:
Clones: 1212516
Environment:
Last Closed: 2015-07-22 07:33:43 UTC




Links
System ID Priority Status Summary Last Updated
Red Hat Product Errata RHBA-2015:1405 normal SHIPPED_LIVE ricci bug fix and enhancement update 2015-07-20 18:07:08 UTC

Description digimer 2014-03-20 19:28:06 UTC
Description of problem:

This is something of a two-part issue;

By default, rgmanager (or vm.sh) only gives virtual machines two minutes to shut down after calling 'disable' against a VM service. If the VM isn't finished shutting down by then, the server is forced off. This behaviour can be modified by adding '<vm ....><action name="stop" timeout="10m" /></vm>' to give the server more time to shut down, which is good, though an option to say "wait indefinitely" would be safest.

The reason this is a concern is that, by default, MS Windows guests will download updates but not install them until the OS shuts down. This can cause Windows to take many minutes to actually shut down, during which time it warns the user not to turn off their computer. So if rgmanager forces the server off, the guest's OS can be damaged or destroyed.

So the first part of this bug is a request for a method of telling rgmanager to wait indefinitely for a VM service to stop. At the very least, the service could be marked as 'failed' instead of forced off, if blocking is a concern.

The second part of this bug, and the most pressing in my mind, is that 'ccs' cannot add/modify/remove the '<action .../>' sub-element under a VM resource. With this support, a user could at least set a long enough timeout to minimize the chances of killing a VM mid-OS-update. Without it, the user (or her app ;) ) has to edit the cluster.conf config directly, which is obviously not ideal in production environments.

To confirm this behaviour, I did the following test (copied from my linux-cluster ML post):

====
With 

<vm autostart="0" domain="primary_n01" exclusive="0" max_restarts="2" name="vm01-rhel6" path="/shared/definitions/" recovery="restart" restart_expire_time="600"/>

I called disable on a VM with GNOME running, so that I could abort the VM's shut down.

an-c05n01:~# date; clusvcadm -d vm:vm01-rhel6; date
Wed Mar 19 21:06:29 EDT 2014
Local machine disabling vm:vm01-rhel6...Success
Wed Mar 19 21:08:36 EDT 2014

2 minutes and 7 seconds, then rgmanager forced the VM off. Had this been a Windows guest in the middle of installing updates, it would be highly likely to be screwed now.

To confirm, I changed the config to:

<vm autostart="0" domain="primary_n01" exclusive="0" max_restarts="2" name="vm01-rhel6" path="/shared/definitions/" recovery="restart" restart_expire_time="600">
  <action name="stop" timeout="10m"/>
</vm>

Then I repeated the test:

an-c05n01:~# date; clusvcadm -d vm:vm01-rhel6; date
Wed Mar 19 21:13:18 EDT 2014
Local machine disabling vm:vm01-rhel6...Success
Wed Mar 19 21:23:31 EDT 2014

10 minutes and 13 seconds before the cluster killed the server, much less likely to interrupt an in-progress OS update (truth be told, I plan to set 30 minutes).
====
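For reference, rgmanager accepts suffixed timeout values such as "10m" in the '<action>' element. The following sketch shows how a suffixed timeout string like that maps to seconds; this is illustrative only (the exact set of suffixes and parsing rules is defined by rgmanager's own configuration parser, not by this code):

```python
import re

def parse_timeout(value: str) -> int:
    """Convert a suffixed timeout string such as '10m' or '30s' to seconds.

    Illustrative sketch: handles the common s/m/h suffixes and treats a
    bare number as seconds. rgmanager's real parser is authoritative.
    """
    units = {"s": 1, "m": 60, "h": 3600}
    match = re.fullmatch(r"(\d+)([smh]?)", value.strip())
    if not match:
        raise ValueError(f"unrecognized timeout: {value!r}")
    number, suffix = match.groups()
    return int(number) * units.get(suffix, 1)

print(parse_timeout("10m"))  # 600
```

So the '10m' in the test above corresponds to the roughly 600-second window observed between the two 'date' calls.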

Version-Release number of selected component (if applicable):

ricci-0.16.2-69.el6_5.1.x86_64
ccs-0.16.2-69.el6_5.1.x86_64
rgmanager-3.0.12.1-19.el6.x86_64
resource-agents-3.9.2-40.el6_5.6.x86_64


How reproducible:

100%


Steps to Reproduce:
1. ccs doesn't support adding an 'action' child element to <vm /> resources
2. rgmanager doesn't support unlimited vm stop timeouts

Actual results:

VMs are forced off after two minutes. There is no way to change that with 'ccs', forcing the user('s apps) to modify cluster.conf directly. Even then, only a longer-but-still-arbitrary timeout can be set.


Expected results:

1. ccs should be able to set <action ...> under a <vm> service.
2. rgmanager should not kill a VM that hasn't shut down by a given time (should mark it as 'failed' perhaps?).


Additional info:

Mailing list thread:
https://www.redhat.com/archives/linux-cluster/2014-March/msg00027.html

Comment 2 digimer 2014-03-20 19:49:42 UTC
Please ignore the rgmanager component of this bug. I will open a new bug for that.

Comment 3 digimer 2014-03-20 20:02:36 UTC
rgmanager section now here:

https://bugzilla.redhat.com/show_bug.cgi?id=1079039

Comment 5 Chris Feist 2015-03-02 23:29:11 UTC
Fixed upstream here:

https://github.com/feist/ccs/commit/10e7b6aedd41c59fee5416baa37c45c2f70fdb72

Comment 6 Chris Feist 2015-03-03 23:17:16 UTC
Before Fix:

[root@ask-03 ~]# rpm -q ccs  
ccs-0.16.2-75.el6.x86_64
[root@ask-03 ~]# ccs -f test.conf --createcluster test_cluster
[root@ask-03 ~]# ccs -f test.conf --addvm my_vm
[root@ask-03 ~]# ccs -f test.conf --addaction my_vm stop timeout=10m
Usage: ccs [OPTION]...
Cluster configuration system.
....


After Fix:
[root@ask-02 ~]# rpm -q ccs
ccs-0.16.2-77.el6.x86_64
[root@ask-02 ~]# ccs -f test.conf --createcluster test_cluster
[root@ask-02 ~]# ccs -f test.conf --addvm my_vm
[root@ask-02 ~]# ccs -f test.conf --addaction my_vm stop timeout=10m
[root@ask-02 ~]# cat test.conf
<cluster config_version="3" name="test_cluster">
  <fence_daemon/>
  <clusternodes/>
  <cman/>
  <fencedevices/>
  <rm>
    <failoverdomains/>
    <resources/>
    <vm name="my_vm">
      <action name="stop" timeout="10m"/>
    </vm>
  </rm>
</cluster>
[root@ask-02 ~]# ccs -h | grep '[add|rm]vm'
      --addvm <virtual machine name> [vm options] ...
      --rmvm <virtual machine name>

[root@ask-02 ~]# ccs -f test.conf --rmaction my_vm
[root@ask-02 ~]# cat test.conf
<cluster config_version="4" name="test_cluster">
  <fence_daemon/>
  <clusternodes/>
  <cman/>
  <fencedevices/>
  <rm>
    <failoverdomains/>
    <resources/>
    <vm name="my_vm">
    </vm>
  </rm>
</cluster>
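In effect, '--addaction' appends an '<action>' child to the matching '<vm>' element and '--rmaction' removes it. A minimal Python sketch of that transformation, using the standard library's ElementTree (the helper name 'add_action' is hypothetical; the real ccs tool also bumps config_version and validates against the cluster schema):

```python
import xml.etree.ElementTree as ET

def add_action(conf: ET.Element, vm_name: str, action: str, timeout: str) -> None:
    """Append an <action> child to the named <vm> element.

    Hypothetical helper mirroring what 'ccs --addaction' does to
    cluster.conf, as shown in the transcript above.
    """
    vm = conf.find(f".//vm[@name='{vm_name}']")
    if vm is None:
        raise ValueError(f"no <vm name={vm_name!r}> in config")
    ET.SubElement(vm, "action", {"name": action, "timeout": timeout})

# Minimal stand-in for the generated cluster.conf.
conf = ET.fromstring(
    '<cluster config_version="3" name="test_cluster">'
    '<rm><vm name="my_vm"/></rm></cluster>'
)
add_action(conf, "my_vm", "stop", "10m")
print(ET.tostring(conf, encoding="unicode"))
```

The resulting XML contains the same '<action name="stop" timeout="10m"/>' child under '<vm name="my_vm">' as in the 'cat test.conf' output above.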

Comment 11 Chris Feist 2015-04-21 21:53:25 UTC
Updated fix, disallowing duplicate actions for the same resource, here:

https://github.com/feist/ccs/commit/62bd05fdbbef0edd55968bdef767cdc81c6c506d

Comment 12 Chris Feist 2015-04-21 22:00:38 UTC
Before Fix:

[root@host-620 ~]# rpm -q ccs
ccs-0.16.2-79.el6.x86_64
[root@host-620 ~]# ccs -f test.conf --createcluster test
[root@host-620 ~]# ccs -f test.conf --addvm vmname1
[root@host-620 ~]# ccs -f test.conf --addaction vmname1 start timeout=5m
[root@host-620 ~]# ccs -f test.conf --addaction vmname1 start timeout=5m
[root@host-620 ~]# ccs -f test.conf --addaction vmname1 start timeout=5m
[root@host-620 ~]# ccs -f test.conf --lsservices
resources: 
virtual machines:
    vm: name=vmname1
      action: name=start, timeout=5m
      action: name=start, timeout=5m
      action: name=start, timeout=5m


After Fix:

[root@host-620 ~]# rpm -q ccs
ccs-0.16.2-80.el6.x86_64
[root@host-620 ~]# ccs -f test.conf --createcluster test
[root@host-620 ~]# ccs -f test.conf --addvm vmname1
[root@host-620 ~]# ccs -f test.conf --addaction vmname1 start timeout=5m
[root@host-620 ~]# ccs -f test.conf --addaction vmname1 start timeout=5m
Error: 'start' action already exists for 'vmname1'
[root@host-620 ~]# ccs -f test.conf --addaction vmname1 start timeout=5m
Error: 'start' action already exists for 'vmname1'
[root@host-620 ~]# ccs -f test.conf --lsservices
resources: 
virtual machines:
    vm: name=vmname1
      action: name=start, timeout=5m
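The guard the updated fix adds can be sketched as a pre-insert check for an existing '<action>' of the same name. This is an illustration of the behavior shown above, not the actual ccs code (the function name and error handling here are hypothetical):

```python
import xml.etree.ElementTree as ET

def add_action_unique(vm: ET.Element, action: str, timeout: str) -> None:
    """Add an <action> child only if none with the same name exists.

    Sketch of the duplicate check described in Comment 11; the real
    implementation lives in ccs itself.
    """
    if vm.find(f"action[@name='{action}']") is not None:
        raise ValueError(f"'{action}' action already exists for '{vm.get('name')}'")
    ET.SubElement(vm, "action", {"name": action, "timeout": timeout})

vm = ET.Element("vm", {"name": "vmname1"})
add_action_unique(vm, "start", "5m")       # first add succeeds
try:
    add_action_unique(vm, "start", "5m")   # duplicate is rejected
except ValueError as err:
    print(err)  # 'start' action already exists for 'vmname1'
```

Only one '<action name="start">' child remains after the second call is rejected, matching the '--lsservices' output in the after-fix transcript.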

Comment 16 errata-xmlrpc 2015-07-22 07:33:43 UTC
Since the problem described in this bug report should be
resolved in a recent advisory, it has been closed with a
resolution of ERRATA.

For information on the advisory, and where to find the updated
files, follow the link below.

If the solution does not work for you, open a new bug report.

https://rhn.redhat.com/errata/RHBA-2015-1405.html

