Bug 1079039
| Field | Value |
|---|---|
| Summary | rgmanager forces VMs to power off after an arbitrary timeout after calling 'disable', potentially destroying MS Windows guests |
| Product | Red Hat Enterprise Linux 6 |
| Component | resource-agents |
| Version | 6.5 |
| Status | CLOSED ERRATA |
| Severity | high |
| Priority | high |
| Reporter | Madison Kelly <mkelly> |
| Assignee | David Vossel <dvossel> |
| QA Contact | Cluster QE <mspqa-list> |
| CC | agk, cluster-maint, dvossel, fdinitto, mnovacek, rmccabe, slevine |
| Target Milestone | rc |
| Hardware | Unspecified |
| OS | Unspecified |
| Fixed In Version | resource-agents-3.9.2-47.el6 |
| Doc Type | Bug Fix |
| Type | Bug |
| Clones | 1092726, 1116993 |
| Bug Blocks | 1055424, 1092726, 1116993, 1117040 |
| Last Closed | 2014-10-14 05:00:46 UTC |
| Attachments | Screenshot showing a Windows VM service killed mid-OS update (attachment 877018) |
Description (Madison Kelly, 2014-03-20 20:01:46 UTC)
Created attachment 877018: Screenshot showing a Windows VM service killed mid-OS update
The screenshot shows a VM on the cluster (viewed over in-browser VNC) that was forcibly powered off about 10% of the way into applying OS updates, after the stop timeout expired. Note the "Do not turn off your computer." warning behind the "Connection Closed" overlay.
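To make the failure mode concrete, the stop path behaves roughly like the sketch below. This is a simplified illustration, not the actual vm.sh code; the guest name and the 120-second timeout are placeholders, and the virsh details (an ACPI shutdown request followed by a forced 'virsh destroy') are taken from the discussion in the comments that follow.

```bash
#!/bin/bash
# Simplified illustration of the problematic stop path (not the real vm.sh).
# Guest name and timeout are placeholders; rgmanager's actual stop timeout is
# what the reporter wants to be able to adjust or disable.

GUEST="windows-guest"   # placeholder libvirt domain name
TIMEOUT=120             # placeholder stop timeout in seconds ("two minutes")

# Ask the guest to shut down cleanly (an ACPI power-off request).
virsh shutdown "$GUEST"

# Wait for the guest to power off on its own.
for ((i = 0; i < TIMEOUT; i++)); do
    if [ "$(virsh domstate "$GUEST")" = "shut off" ]; then
        exit 0
    fi
    sleep 1
done

# Timeout expired: force the guest off regardless of what it is doing,
# e.g. while Windows is still installing queued updates.
virsh destroy "$GUEST"
```

With the no_kill option added later in this bug, the agent skips that final forced power-off and instead reports the stop as failed after the timeout.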
One possibility is simply to add a no-kill flag to vm.sh and mark it as failed after the timeout.

That would work for me just fine. Note: even with "no-kill", I would still want an adjustable stop timeout.

(In reply to Lon Hohberger from comment #3)
> One possibility is simply to add a no-kill flag to vm.sh and mark it as
> failed after the timeout.

I agree; I prefer to handle this in the resource agents rather than in rgmanager. I added the 'no_kill' option to vm.sh. There is a patch upstream for this: https://github.com/ClusterLabs/resource-agents/pull/417

I'm not sure exactly how to test this -- please advise.

Install a Windows VM (I use Windows 7 Professional SP1; you can use it for 30 days before you need to activate it). Once installed, download the initial round of OS updates. This queues a large number of updates to install at the next shutdown. Of course, don't shut down yet. With the new VM under rgmanager control, use 'clusvcadm -d vm:foo' to initiate a power-off. This sends an ACPI power-off event to the Windows VM, which begins shutting down. However, instead of just powering off, Windows starts installing all of the queued OS updates. This takes much longer than two minutes, so rgmanager will terminate the VM.

(In reply to michal novacek from comment #10)
> I'm not sure exactly how to test this -- please advise.

This is a little tricky. What we need in order to test this is a way to prevent 'virsh shutdown <vm>' from succeeding. With a RHEL 6 VM, I performed a 'halt -fin' inside the VM and then used the resource agent on the host machine. For some reason, halting the guest manually like that prevented it from going down unless it was forced with 'virsh destroy <vm>'.

So, here are the steps I'd try (a scripted sketch follows this comment):

1. Start a RHEL 6 VM and execute 'halt -fin' inside it.
2. Use the resource agent to attempt to stop the VM with the new 'no_kill' option, which prevents the VM from being forced off.
3. Verify that the resource agent does not force the VM off during the timeout period.
4. Call the agent again with the same options. While the agent is waiting for the VM to stop, execute 'virsh destroy <vm>' in another terminal. Verify that the agent detects that the VM has stopped and exits.

I hope that helps.

-- Vossel
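A scripted approximation of those steps, run from the cluster node hosting the guest, is sketched below. It is illustration only: the guest and service names and the sleep durations are placeholder assumptions, the guest's hostname is assumed to be resolvable over ssh, and steps 2-4 are collapsed into a single 'clusvcadm -d' call with a manual 'virsh destroy' during the wait, matching the verification transcript in the next comment.

```bash
#!/bin/bash
# Sketch of the test steps above; all names and timings are placeholders,
# not values taken from this bug.

GUEST="rhel6-guest"       # assumed libvirt domain name (and resolvable hostname)
SERVICE="vm:${GUEST}"     # assumed rgmanager service name for the guest

# Step 1: wedge the guest so the ACPI shutdown request cannot complete.
ssh "$GUEST" 'halt -fin' &

# Step 2: ask rgmanager to stop the VM. With no_kill=yes configured on the
# vm resource, the agent must not fall back to 'virsh destroy'.
clusvcadm -d "$SERVICE" &

# Step 3: well past the old two-minute window, the guest should still be running.
sleep 150
virsh domstate "$GUEST"    # expected: "running", i.e. not forced off

# Step 4: destroy the guest manually and check that the agent notices the
# domain is gone and the disable completes.
virsh destroy "$GUEST"
sleep 10
clustat | grep "$SERVICE"  # expected: the service is reported as disabled
```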
I have verified that the new functionality works correctly with resource-agents-3.9.2-47.el6.x86_64, according to the instructions in comment #12.

```
[root@duck-01 ~]# ccs -h localhost --lsservices
service: name=le-service, autostart=0, recovery=relocate
  vm: ref=duck-01-node01
  vm: ref=duck-01-node02
resources:
  lvm: name=halvm, vg_name=ha-vg
  vm: name=duck-01-node01, xmlfile=/var/lib/libvirt/qemu/duck-01-node01.xml, no_kill=yes
  vm: name=duck-01-node02, xmlfile=/var/lib/libvirt/qemu/duck-01-node02.xml
virtual machines:
  vm: ref=duck-01-node01
  vm: ref=duck-01-node0

[root@duck-01 ~]# clustat
Cluster Status for STSRHTS8296 @ Wed May 21 15:54:36 2014
Member Status: Quorate

 Member Name                                   ID   Status
 ------ ----                                   ---- ------
 duck-01.cluster-qe.lab.eng.brq.redhat.com     1    Online, Local, rgmanager
 duck-02.cluster-qe.lab.eng.brq.redhat.com     2    Offline

 Service Name         Owner (Last)                                  State
 ------- ----         ----- ------                                  -----
 service:le-service   duck-01.cluster-qe.lab.eng.brq.redhat.com     started

[root@duck-01 ~]# virsh list --all
 Id    Name                           State
----------------------------------------------------
 14    duck-01-node01                 running
 15    duck-01-node02                 running

[root@duck-01 ~]# ssh duck-01-node01 'hostname; halt -fin' &
duck-01-node01
[1] 6518
[root@duck-01 ~]# ssh duck-01-node02 'hostname; halt -fin' &
duck-01-node02
[2] 6519

[16:25:20 run in another terminal: 'virsh destroy duck-01-node01']

(16:22:30)[root@duck-01 ~]$ clusvcadm -d le-service
Local machine disabling service:le-service...Success
(16:25:36)[root@duck-01 ~]$
```

Clearing needinfo -- provided in comment 12.

Is there a plan to add the no_kill option to vm.sh in RHEL 6 any time soon?

(In reply to digimer from comment #15)
> Is there a plan to add the no_kill option to vm.sh in RHEL 6 any time soon?

This is scheduled for the 6.6 release.

Steven, what info is needed? Perhaps I can provide it.

digimer: It looks from this BZ that there is a new parameter you can set for a virtual machine resource -- a no_kill option. However, when you go to the latest luci screen -- on a luci instance I built just last week -- there is no new parameter to specify for the virtual machine resource. These are the parameters that appear on the screen:

- Migration Type
- Migration Mapping
- Status Program
- Path to xmlfile Used to Create the VM
- VM Configuration File Path
- Path to the VM Snapshot Directory
- Hypervisor URI
- Migration URI
- Tunnel data over ssh during migration
- Independent Subtree
- Non-Critical Resource

(Plus some other parameters that are general service parameters.)

So it seems as if this BZ should have been cloned as a luci bug, but I don't think it was. My resource documentation is based on the luci screens.

digimer: I'm removing the needinfo flag, as I have now heard from the luci developer. There does need to be a new parameter in luci here, as I surmised, but I have to wait until that gets settled before I can document this correctly. My question has been answered -- yes, this will need to be documented, and I will follow the progress of the updated luci screens. Thanks, Steven

Since the problem described in this bug report should be resolved in a recent advisory, it has been closed with a resolution of ERRATA. For information on the advisory, and where to find the updated files, follow the link below. If the solution does not work for you, open a new bug report.

http://rhn.redhat.com/errata/RHBA-2014-1428.html
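After applying the advisory, a quick sanity check along the lines below can confirm on a node that the fixed build is installed and that the shipped agent recognizes the new parameter. The /usr/share/cluster/vm.sh path is the usual rgmanager agent location and, like the grep, is an assumption for illustration rather than something stated in this bug.

```bash
#!/bin/bash
# Confirm the fixed package is installed (resource-agents-3.9.2-47.el6 or
# later, per "Fixed In Version" above) and that the installed vm.sh agent
# mentions the new no_kill parameter.

rpm -q resource-agents

# rgmanager resource agents are normally installed under /usr/share/cluster;
# adjust the path if your layout differs.
grep -n 'no_kill' /usr/share/cluster/vm.sh
```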