Bug 1079039 - rgmanager forces VMs to power off after an arbitrary timeout after calling 'disable', potentially destroying MS Windows guests
Summary: rgmanager forces VMs to power off after an arbitrary timeout after calling 'disable', potentially destroying MS Windows guests
Keywords:
Status: CLOSED ERRATA
Alias: None
Product: Red Hat Enterprise Linux 6
Classification: Red Hat
Component: resource-agents
Version: 6.5
Hardware: Unspecified
OS: Unspecified
Priority: high
Severity: high
Target Milestone: rc
Target Release: ---
Assignee: David Vossel
QA Contact: Cluster QE
URL:
Whiteboard:
Depends On:
Blocks: 1055424 1092726 1116993 1117040
 
Reported: 2014-03-20 20:01 UTC by Madison Kelly
Modified: 2014-10-14 05:00 UTC
CC List: 7 users

Fixed In Version: resource-agents-3.9.2-47.el6
Doc Type: Bug Fix
Doc Text:
Clone Of:
: 1092726 1116993 (view as bug list)
Environment:
Last Closed: 2014-10-14 05:00:46 UTC
Target Upstream Version:
Embargoed:


Attachments
Screenshot showing a windows VM service killed mid-OS update (467.83 KB, image/png)
2014-03-20 20:04 UTC, Madison Kelly


Links
System ID: Red Hat Product Errata RHBA-2014:1428
Priority: normal
Status: SHIPPED_LIVE
Summary: resource-agents bug fix and enhancement update
Last Updated: 2014-10-14 01:06:18 UTC

Description Madison Kelly 2014-03-20 20:01:46 UTC
Description of problem:

By default, rgmanager (or vm.sh) only gives virtual machines two minutes to shut down after calling 'disable' against a VM service. If the VM isn't finished shutting down by then, the server is forced off. This behaviour can be modified by adding '<vm ....><action name="stop" timeout="10m" /></vm>' to give the server more time to shut down, which is good, though an option to say "wait indefinitely" would be safest.

The reason this is a concern is that, by default, MS Windows guests will download updates but not install them until the OS shuts down. This can cause Windows to take many minutes to actually shut down, during which time it warns the user not to turn off their computer. So if rgmanager forces the server off, the guest's OS can be damaged or destroyed.

So this is a request for a method of telling rgmanager to wait indefinitely for a VM service to stop. At the very least, the service could be marked 'failed' instead of being forced off, if blocking indefinitely is a concern.

To confirm this behaviour, I did the following test (copied from my linux-cluster ML post):

====
With 

<vm autostart="0" domain="primary_n01" exclusive="0" max_restarts="2" name="vm01-rhel6" path="/shared/definitions/" recovery="restart" restart_expire_time="600"/>

I called disable on a VM with gnome running, so that I could abort the VM's shut down.

an-c05n01:~# date; clusvcadm -d vm:vm01-rhel6; date
Wed Mar 19 21:06:29 EDT 2014
Local machine disabling vm:vm01-rhel6...Success
Wed Mar 19 21:08:36 EDT 2014

2 minutes and 7 seconds, then rgmanager forced the VM off. Had this been a Windows guest in the middle of installing updates, it would very likely be screwed now.

To confirm, I changed the config to:

<vm autostart="0" domain="primary_n01" exclusive="0" max_restarts="2" name="vm01-rhel6" path="/shared/definitions/" recovery="restart" restart_expire_time="600">
  <action name="stop" timeout="10m"/>
</vm>

Then I repeated the test:

an-c05n01:~# date; clusvcadm -d vm:vm01-rhel6; date
Wed Mar 19 21:13:18 EDT 2014
Local machine disabling vm:vm01-rhel6...Success
Wed Mar 19 21:23:31 EDT 2014

10 minutes and 13 seconds before the cluster killed the server, which is much less likely to interrupt an in-progress OS update (truth be told, I plan to set 30 minutes).
====

Version-Release number of selected component (if applicable):

ricci-0.16.2-69.el6_5.1.x86_64
rgmanager-3.0.12.1-19.el6.x86_64
resource-agents-3.9.2-40.el6_5.6.x86_64


How reproducible:

100%


Steps to Reproduce:
1. Create a cluster with a VM service
2. Set up a RHEL 6 guest running GNOME and connect to it over VNC.
3. Disable the service in rgmanager, then, when prompted in the VM to shut down, cancel the shutdown.
4. After two minutes (or X minutes if an <action ...> timeout is set), the server will be forcibly powered off; see the command sketch after this list.
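
For convenience, the reproduce steps boil down to roughly the following commands (reusing the vm:vm01-rhel6 service name from the test transcript above; substitute your own service name):

an-c05n01:~# date; clusvcadm -d vm:vm01-rhel6; date   # cancel the shutdown inside the guest when prompted
an-c05n01:~# virsh list --all                         # after ~2 minutes the guest has been forced off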


Actual results:

VMs are forced off after an arbitrary amount of time. This can destroy a Windows server that is installing OS updates, and the default of 2 minutes is quite short.


Expected results:

rgmanager should not kill a VM that hasn't shut down within a given time (perhaps it should mark the service as 'failed' instead?).


Additional info:

Mailing list thread:
https://www.redhat.com/archives/linux-cluster/2014-March/msg00027.html

Related rhbz#1079032

Comment 1 Madison Kelly 2014-03-20 20:04:55 UTC
Created attachment 877018 [details]
Screenshot showing a windows VM service killed mid-OS update

This shows a VM on a cluster (viewed over VNC in-browser) that was killed 10% into applying OS updates. Note the "Do not turn off your computer." warning behind the "Connection Closed" overlay.

Comment 3 Lon Hohberger 2014-03-20 20:14:41 UTC
One possibility is simply to add a no-kill flag to vm.sh and mark the service as failed after the timeout.
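
Purely as an illustration of that idea (this is not the actual patch), the stop path in vm.sh could look roughly like the sketch below. It assumes the agent's usual OCF_RESKEY_* parameter convention, and stop_domain is a hypothetical helper name:

stop_domain() {
    # Hypothetical sketch only, not the real vm.sh code.
    local name="$OCF_RESKEY_name" timeout="${1:-120}" waited=0

    virsh shutdown "$name"                            # ask the guest to shut down (ACPI)
    while virsh list --name | grep -qx "$name"; do    # domain still running?
        if [ "$waited" -ge "$timeout" ]; then
            if [ "${OCF_RESKEY_no_kill:-no}" = "yes" ]; then
                return 1                              # report failure; do not destroy the guest
            fi
            virsh destroy "$name"                     # current behaviour: force power-off
            break
        fi
        sleep 1
        waited=$((waited + 1))
    done
    return 0
}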

Comment 4 Madison Kelly 2014-03-20 20:21:18 UTC
That would work for me just fine.

Comment 5 Madison Kelly 2014-03-20 20:22:07 UTC
Note: Even with "no-kill", I would still want an adjustable stop timeout.

Comment 6 Fabio Massimo Di Nitto 2014-03-21 08:50:17 UTC
(In reply to Lon Hohberger from comment #3)
> One possibility is simply to add a no-kill flag to vm.sh and simply mark as
> failed after the timeout.

I agree; I prefer to handle this in resource-agents rather than in rgmanager.

Comment 7 David Vossel 2014-04-25 21:31:04 UTC
I added the 'no_kill' option to vm.sh.  There is a patch upstream for this.

https://github.com/ClusterLabs/resource-agents/pull/417
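
For anyone who wants to try it once the fix lands, a rough cluster.conf sketch combining the new option with the longer stop timeout from the description could look like this (the no_kill attribute spelling matches the ccs output in comment 13; the other attribute values are just the illustrative ones from the reporter's earlier test):

<vm autostart="0" domain="primary_n01" name="vm01-rhel6" path="/shared/definitions/"
    recovery="restart" no_kill="yes">
  <action name="stop" timeout="10m"/>
</vm>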

Comment 10 michal novacek 2014-05-13 11:06:53 UTC
I'm not sure exactly how to test this -- please advise.

Comment 11 Madison Kelly 2014-05-13 14:32:28 UTC
Install a Windows VM (I use Windows 7 Professional SP1; you can use it for 30 days before you need to activate it). Once installed, download the initial round of OS updates. This will queue a large number of updates to install at the next shutdown. Of course, don't shut down yet.

With the new VM under rgmanager control, use 'clusvcadm -d vm:foo' to initiate a power off. This will send an ACPI power-off event to the new Windows VM, which will start the shutdown. However, instead of just powering off, Windows will start installing all of the queued OS updates.

This will take much longer than two minutes, so rgmanager will terminate the VM.

Comment 12 David Vossel 2014-05-13 14:59:16 UTC
(In reply to michal novacek from comment #10)
> I'm not sure exactly on how to test this -- please advise.

This is a little tricky. What we need to test this is a way to prevent 'virsh shutdown <vm>' from succeeding.

In my experience with a rhel6 vm, I performed a 'halt -fin' within the vm, and then used the resource agent on the host machine. For some reason, executing the halt manually like that prevented the vm from going down without forcing it with a 'virsh destroy <vm>'.


So, here are the steps I'd try:

1. Start the rhel6 vm and execute 'halt -fin'.
2. Use the resource agent to attempt to stop the vm, using the new 'no_kill' option to prevent the vm from being forced off.
3. Verify the resource-agent doesn't force the vm off during the timeout period.
4. Call the agent again with the same options. While the agent is waiting for the vm to stop, in another terminal execute 'virsh destroy <vm>'. Verify the agent detects the vm has stopped and exits (see the sketch below).
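
In shell terms, those steps come out roughly as follows (vm:test-vm is a hypothetical service name configured with no_kill; the real verification run is in comment 13):

ssh test-vm 'halt -fin' &      # step 1: wedge the guest so the ACPI shutdown stalls
clusvcadm -d vm:test-vm        # steps 2-3: the agent should wait out the stop timeout without destroying the vm
virsh list --all               # the guest should still be listed while the agent waits
virsh destroy test-vm          # step 4, from another terminal: the agent should detect this and exit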


I hope that helps.
-- Vossel

Comment 13 michal novacek 2014-05-21 14:28:04 UTC
I have verified that the new functionality works correctly with resource-agents-3.9.2-47.el6.x86_64, according to the instructions in comment #12.

----

[root@duck-01 ~]# ccs -h localhost --lsservices
service: name=le-service, autostart=0, recovery=relocate
  vm: ref=duck-01-node01
  vm: ref=duck-01-node02
resources: 
  lvm: name=halvm, vg_name=ha-vg
  vm: name=duck-01-node01, xmlfile=/var/lib/libvirt/qemu/duck-01-node01.xml, no_kill=yes
  vm: name=duck-01-node02, xmlfile=/var/lib/libvirt/qemu/duck-01-node02.xml
virtual machines:
    vm: ref=duck-01-node01
    vm: ref=duck-01-node0

[root@duck-01 ~]# clustat
Cluster Status for STSRHTS8296 @ Wed May 21 15:54:36 2014
Member Status: Quorate

 Member Name                                                     ID   Status
 ------ ----                                                     ---- ------
 duck-01.cluster-qe.lab.eng.brq.redhat.com                           1 Online, Local, rgmanager
 duck-02.cluster-qe.lab.eng.brq.redhat.com                           2 Offline

 Service Name                                                   Owner (Last)                                                   State         
 ------- ----                                                   ----- ------                                                   -----         
 service:le-service                                             duck-01.cluster-qe.lab.eng.brq.redhat.com                      started       

[root@duck-01 ~]# virsh list --all
 Id    Name                           State
----------------------------------------------------
 14    duck-01-node01                 running
 15    duck-01-node02                 running

[root@duck-01 ~]# ssh duck-01-node01 'hostname; halt -fin' &
duck-01-node01
[1] 6518
[root@duck-01 ~]# ssh duck-01-node02 'hostname; halt -fin' &
duck-01-node02
[2] 6519

[16:25:20 run in another terminal 'virsh destroy duck-01-node01']

(16:22:30)[root@duck-01 ~]$ clusvcadm -d le-service
Local machine disabling service:le-service...Success


(16:25:36)[root@duck-01 ~]$

Comment 14 michal novacek 2014-05-21 14:29:19 UTC
Clearing needinfo -- provided in comment 12.

Comment 15 Madison Kelly 2014-06-08 18:21:52 UTC
Is there a plan to add the no_kill option to vm.sh in RHEL 6 any time soon?

Comment 16 David Vossel 2014-06-09 14:58:24 UTC
(In reply to digimer from comment #15)
> Is there a plan to add the no_kill option to vm.sh in RHEL 6 any time soon?

This is scheduled for 6.6 release.

Comment 18 Madison Kelly 2014-06-26 22:17:07 UTC
Steven,

  What info is needed? Perhaps I can provide.

Comment 19 Steven J. Levine 2014-07-01 19:12:36 UTC
digimer:

It looks from this BZ that there is a new parameter you can set for a virtual machine resource -- a no_kill option.

However, when you go to the latest luci screen -- on the luci I built just last week -- there is no new parameter to specify for the virtual machine resource. These are the parameters that appear on the screen:

Migration Type 	
Migration Mapping 	
Status Program 	
Path to xmlfile Used to Create the VM 	
VM Configuration File Path 	
Path to the VM Snapshot Directory 	
Hypervisor URI 	
Migration URI 	
Tunnel data over ssh during migration 	
Independent Subtree 	
Non-Critical Resource

(Plus some other parameters that are general service parameters)

So it seems as if this BZ should have been cloned as a luci bug, but I don't think it was. My resource documentation is based on the luci screens.

Comment 20 Steven J. Levine 2014-07-01 19:23:44 UTC
digimer:

I'm removing the needinfo flag, as I have now heard from the luci developer.

There does need to be a new parameter in luci here, as I surmised, but I have to wait until that gets settled before I can document this correctly. 

But my question has been answered -- yes, this will need to be documented, but I will follow the progress of the updated luci screens.

Thanks,

Steven

Comment 21 errata-xmlrpc 2014-10-14 05:00:46 UTC
Since the problem described in this bug report should be
resolved in a recent advisory, it has been closed with a
resolution of ERRATA.

For information on the advisory, and where to find the updated
files, follow the link below.

If the solution does not work for you, open a new bug report.

http://rhn.redhat.com/errata/RHBA-2014-1428.html

