Bug 1079032
| Summary: | Unable to add '<action name="stop" timeout="10m" />' child element to a <vm .../> service using 'ccs' | |||
|---|---|---|---|---|
| Product: | Red Hat Enterprise Linux 6 | Reporter: | Madison Kelly <mkelly> | |
| Component: | ricci | Assignee: | Chris Feist <cfeist> | |
| Status: | CLOSED ERRATA | QA Contact: | cluster-qe <cluster-qe> | |
| Severity: | medium | Docs Contact: | ||
| Priority: | medium | |||
| Version: | 6.5 | CC: | cluster-maint, fdinitto, rsteiger, tlavigne | |
| Target Milestone: | rc | |||
| Target Release: | --- | |||
| Hardware: | Unspecified | |||
| OS: | Unspecified | |||
| Whiteboard: | ||||
| Fixed In Version: | ccs-0.16.2-80.el6.x86_64 | Doc Type: | Bug Fix | |
| Doc Text: | Story Points: | --- | ||
| Clone Of: | ||||
| : | 1212516 | Environment: | |
| Last Closed: | 2015-07-22 07:33:43 UTC | Type: | Bug | |
| Bug Blocks: | 1212516 | |||
Please ignore the rgmanager component of this bug. I will open a new bug for that.
rgmanager section now here: https://bugzilla.redhat.com/show_bug.cgi?id=1079039
Fixed upstream here: https://github.com/feist/ccs/commit/10e7b6aedd41c59fee5416baa37c45c2f70fdb72
Before Fix:
[root@ask-03 ~]# rpm -q ccs
ccs-0.16.2-75.el6.x86_64
[root@ask-03 ~]# ccs -f test.conf --createcluster test_cluster
[root@ask-03 ~]# ccs -f test.conf --addvm my_vm
[root@ask-03 ~]# ccs -f test.conf --addaction my_vm stop timeout=10m
Usage: ccs [OPTION]...
Cluster configuration system.
....
After Fix:
[root@ask-02 ~]# rpm -q ccs
ccs-0.16.2-77.el6.x86_64
[root@ask-02 ~]# ccs -f test.conf --createcluster test_cluster
[root@ask-02 ~]# ccs -f test.conf --addvm my_vm
[root@ask-02 ~]# ccs -f test.conf --addaction my_vm stop timeout=10m
[root@ask-02 ~]# cat test.conf
<cluster config_version="3" name="test_cluster">
<fence_daemon/>
<clusternodes/>
<cman/>
<fencedevices/>
<rm>
<failoverdomains/>
<resources/>
<vm name="my_vm">
<action name="stop" timeout="10m"/>
</vm>
</rm>
</cluster>
[root@ask-02 ~]# ccs -h | grep '[add|rm]vm'
--addvm <virtual machine name> [vm options] ...
--rmvm <virtual machine name>
[root@ask-02 ~]# ccs -f test.conf --rmaction my_vm
[root@ask-02 ~]# cat test.conf
<cluster config_version="4" name="test_cluster">
<fence_daemon/>
<clusternodes/>
<cman/>
<fencedevices/>
<rm>
<failoverdomains/>
<resources/>
<vm name="my_vm">
</vm>
</rm>
</cluster>
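The effect of the two commands above on the config file can be sketched in a few lines of Python. This is only an illustration of the XML change visible in the transcript, not the code from the upstream commit; the helper names `add_vm_action` and `remove_vm_actions` are hypothetical, and the starting `config_version="2"` is an assumption about the file's state right after `--addvm`.

```python
import xml.etree.ElementTree as ET


def _find_vm(root, vm_name):
    """Locate <vm name="..."> under <rm>; raise if it is missing."""
    vm = root.find("./rm/vm[@name='%s']" % vm_name)
    if vm is None:
        raise ValueError("no <vm> named %r" % vm_name)
    return vm


def _bump_version(root):
    """Increment the cluster's config_version attribute by one."""
    root.set("config_version", str(int(root.get("config_version", "1")) + 1))


def add_vm_action(root, vm_name, action_name, **attrs):
    """Hypothetical helper: append <action name=... /> under the named <vm>."""
    vm = _find_vm(root, vm_name)
    ET.SubElement(vm, "action", dict(name=action_name, **attrs))
    _bump_version(root)


def remove_vm_actions(root, vm_name):
    """Hypothetical helper: drop every <action> child of the named <vm>."""
    vm = _find_vm(root, vm_name)
    for action in vm.findall("action"):
        vm.remove(action)
    _bump_version(root)


# Assumed state of test.conf after --createcluster and --addvm.
root = ET.fromstring(
    '<cluster config_version="2" name="test_cluster">'
    '<rm><failoverdomains/><resources/><vm name="my_vm"/></rm>'
    '</cluster>'
)
add_vm_action(root, "my_vm", "stop", timeout="10m")
print(ET.tostring(root, encoding="unicode"))
remove_vm_actions(root, "my_vm")
print(ET.tostring(root, encoding="unicode"))
```

As in the transcript, removing without naming an action clears all `<action>` children of the VM, and each change bumps `config_version` by one.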
Updated fix, which no longer allows duplicate actions for the same resource, is here: https://github.com/feist/ccs/commit/62bd05fdbbef0edd55968bdef767cdc81c6c506d
Before Fix:
[root@host-620 ~]# rpm -q ccs
ccs-0.16.2-79.el6.x86_64
[root@host-620 ~]# ccs -f test.conf --createcluster test
[root@host-620 ~]# ccs -f test.conf --addvm vmname1
[root@host-620 ~]# ccs -f test.conf --addaction vmname1 start timeout=5m
[root@host-620 ~]# ccs -f test.conf --addaction vmname1 start timeout=5m
[root@host-620 ~]# ccs -f test.conf --addaction vmname1 start timeout=5m
[root@host-620 ~]# ccs -f test.conf --lsservices
resources:
virtual machines:
vm: name=vmname1
action: name=start, timeout=5m
action: name=start, timeout=5m
action: name=start, timeout=5m
After Fix:
[root@host-620 ~]# rpm -q ccs
ccs-0.16.2-80.el6.x86_64
[root@host-620 ~]# ccs -f test.conf --createcluster test
[root@host-620 ~]# ccs -f test.conf --addvm vmname1
[root@host-620 ~]# ccs -f test.conf --addaction vmname1 start timeout=5m
[root@host-620 ~]# ccs -f test.conf --addaction vmname1 start timeout=5m
Error: 'start' action already exists for 'vmname1'
[root@host-620 ~]# ccs -f test.conf --addaction vmname1 start timeout=5m
Error: 'start' action already exists for 'vmname1'
[root@host-620 ~]# ccs -f test.conf --lsservices
resources:
virtual machines:
vm: name=vmname1
action: name=start, timeout=5m
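The duplicate check shown in the "After Fix" output can be sketched as follows. This is a minimal illustration of the behaviour, not the upstream implementation; `add_action_once` and `DuplicateActionError` are hypothetical names, and only the error text is taken from the transcript.

```python
import xml.etree.ElementTree as ET


class DuplicateActionError(Exception):
    """Raised when an action with the same name already exists on the resource."""


def add_action_once(resource, action_name, **attrs):
    """Hypothetical helper: add <action> only if no action with that name exists."""
    for existing in resource.findall("action"):
        if existing.get("name") == action_name:
            raise DuplicateActionError(
                "Error: '%s' action already exists for '%s'"
                % (action_name, resource.get("name"))
            )
    return ET.SubElement(resource, "action", dict(name=action_name, **attrs))


vm = ET.fromstring('<vm name="vmname1"/>')
add_action_once(vm, "start", timeout="5m")
try:
    add_action_once(vm, "start", timeout="5m")
except DuplicateActionError as err:
    print(err)
```

The check keys on the action name alone, so a second `start` action is rejected even if its timeout differs; whether ccs treats that case the same way is not shown in the transcript.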
Since the problem described in this bug report should be resolved in a recent advisory, it has been closed with a resolution of ERRATA. For information on the advisory, and where to find the updated files, follow the link below. If the solution does not work for you, open a new bug report.
https://rhn.redhat.com/errata/RHBA-2015-1405.html
Description of problem:

This is something of a two-part issue.

By default, rgmanager (or vm.sh) only gives virtual machines two minutes to shut down after 'disable' is called against a VM service. If the VM hasn't finished shutting down by then, the server is forced off. This behaviour can be modified by adding '<vm ....><action name="stop" timeout="10m" /></vm>' to give the server more time to shut down, which is good, though an option to say "wait indefinitely" would be safest.

The reason this is a concern is that, by default, MS Windows guests will download updates but not install them until the OS shuts down. This can cause Windows to take many minutes to actually shut down, during which time it warns the user not to turn off their computer. So if rgmanager forces the server off, the guest's OS can be damaged or destroyed.

So the first part of this bug is a request for a method of telling rgmanager to wait indefinitely for a VM service to stop. At the very least, it could mark the service as 'failed' instead of forcing it off, if you are worried about blocking.

The second part of this bug, and the most pressing in my mind, is that 'ccs' cannot add/modify/remove the '<action .../>' sub-element under a VM resource. With this support, a user could at least add a long enough timeout to minimize the chances of killing a VM mid-OS update. Without this support, the user (or her app ;) ) would have to edit the cluster.conf config directly, which is not ideal in production environments, obviously.

To confirm this behaviour, I did the following test (copied from my linux-cluster ML post):

====
With

<vm autostart="0" domain="primary_n01" exclusive="0" max_restarts="2" name="vm01-rhel6" path="/shared/definitions/" recovery="restart" restart_expire_time="600"/>

I called disable on a VM with gnome running, so that I could abort the VM's shut down.

an-c05n01:~# date; clusvcadm -d vm:vm01-rhel6; date
Wed Mar 19 21:06:29 EDT 2014
Local machine disabling vm:vm01-rhel6...Success
Wed Mar 19 21:08:36 EDT 2014

2 minutes and 7 seconds, then rgmanager forced-off the VM. Had this been a Windows guest in the middle of installing updates, it would be highly likely to be screwed now.

To confirm, I changed the config to:

<vm autostart="0" domain="primary_n01" exclusive="0" max_restarts="2" name="vm01-rhel6" path="/shared/definitions/" recovery="restart" restart_expire_time="600">
  <action name="stop" timeout="10m"/>
</vm>

Then I repeated the test:

an-c05n01:~# date; clusvcadm -d vm:vm01-rhel6; date
Wed Mar 19 21:13:18 EDT 2014
Local machine disabling vm:vm01-rhel6...Success
Wed Mar 19 21:23:31 EDT 2014

10 minutes and 13 seconds before the cluster killed the server, much less likely to interrupt an in-progress OS update (truth be told, I plan to set 30 minutes).
====

Version-Release number of selected component (if applicable):

ricci-0.16.2-69.el6_5.1.x86_64
ccs-0.16.2-69.el6_5.1.x86_64
rgmanager-3.0.12.1-19.el6.x86_64
resource-agents-3.9.2-40.el6_5.6.x86_64

How reproducible:

100%

Steps to Reproduce:
1. ccs doesn't support adding an 'action' child element to <vm /> resources
2. rgmanager doesn't support unlimited vm stop timeouts

Actual results:

VMs are forced off after two minutes. There is no way to change that with 'ccs', forcing the user('s apps) to directly modify cluster.conf. Even so, only a longer-but-still-arbitrary timeout can be set.

Expected results:

1. ccs should be able to set <action ...> under a <vm> service.
2. rgmanager should not kill a VM that hasn't shut down by a given time (should mark it as 'failed' perhaps?).

Additional info:

Mailing list thread: https://www.redhat.com/archives/linux-cluster/2014-March/msg00027.html
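The two elapsed times quoted in the test (2 minutes 7 seconds and 10 minutes 13 seconds) can be re-derived from the `date` output with a short Python sketch. The `elapsed` helper is a hypothetical name; the format string treats "EDT" as a literal, which assumes both timestamps are in the same timezone and that an English locale is in use for the day and month names.

```python
from datetime import datetime

# Matches the `date` output in the transcript; "EDT" is matched literally.
FMT = "%a %b %d %H:%M:%S EDT %Y"


def elapsed(start, end):
    """Hypothetical helper: time between two `date` lines as a timedelta."""
    return datetime.strptime(end, FMT) - datetime.strptime(start, FMT)


# Default stop timeout, then the run with <action name="stop" timeout="10m"/>.
default_run = elapsed("Wed Mar 19 21:06:29 EDT 2014", "Wed Mar 19 21:08:36 EDT 2014")
extended_run = elapsed("Wed Mar 19 21:13:18 EDT 2014", "Wed Mar 19 21:23:31 EDT 2014")

print(default_run.total_seconds())   # 127.0 seconds, i.e. 2 min 7 s
print(extended_run.total_seconds())  # 613.0 seconds, i.e. 10 min 13 s
```

Both runs overshoot the configured timeout by a few seconds (7 s and 13 s), which is consistent with the forced-off step happening shortly after the timeout expires.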