Bug 1212516

Summary: Do not remove <action> tags under <vm> added by ccs
Product: Red Hat Enterprise Linux 6
Reporter: Radek Steiger <rsteiger>
Component: luci
Assignee: Ryan McCabe <rmccabe>
Status: CLOSED WONTFIX
QA Contact: cluster-qe <cluster-qe>
Severity: medium
Priority: medium
Version: 6.7
CC: cfeist, cluster-maint, cluster-qe, fdinitto, jpokorny, jruemker, mkelly, rmccabe, rsteiger
Target Milestone: rc
Keywords: Reopened
Clone Of: 1079032
Last Closed: 2017-12-06 13:03:30 UTC
Type: Bug
Bug Depends On: 1079032    

Description Radek Steiger 2015-04-16 14:41:10 UTC
+++ This bug was initially created as a clone of Bug #1079032 +++

Description of problem:

This is a two-part issue:

By default, rgmanager (or vm.sh) only gives virtual machines two minutes to shut down after calling 'disable' against a VM service. If the VM isn't finished shutting down by then, the server is forced off. This behaviour can be modified by adding '<vm ....><action name="stop" timeout="10m" /></vm>' to give the server more time to shut down, which is good, though an option to say "wait indefinitely" would be safest.
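
As I understand it, the agent's stop path amounts to something like this (a simplified sketch on my part, not vm.sh's actual code; the real agent drives the hypervisor through virsh):

virsh shutdown "$vm_name"                  # ask the guest OS to shut down cleanly
timeout=120                                # 2-minute default; <action name="stop" timeout=.../> overrides it
while [ "$timeout" -gt 0 ]; do
    [ "$(virsh domstate "$vm_name")" = "shut off" ] && exit 0   # clean shutdown finished in time
    sleep 1
    timeout=$((timeout - 1))
done
virsh destroy "$vm_name"                   # timer expired: hard power-off, the behaviour this bug is about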

The reason this is a concern is that, by default, MS Windows guests will download updates but not install them until the OS shuts down. This can cause Windows to take many minutes to actually shut down, during which time it warns the user not to turn off their computer. So if rgmanager forces the server off, the guest's OS can be damaged or destroyed.

So the first part of this bug is a request for a method of telling rgmanager to wait indefinitely for a VM service to stop. At the very least, it could mark the service as 'failed' instead of forcing it off, if blocking is a concern.

The second part of this bug, and the most pressing in my mind, is that 'ccs' cannot add, modify, or remove the '<action .../>' sub-element under a VM resource. With this support, a user could at least set a long enough timeout to minimize the chances of killing a VM mid-OS-update. Without it, the user (or her app ;) ) has to edit the cluster.conf config directly, which is obviously not ideal in production environments.
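
In the meantime the only way to get the longer timeout is to hand-edit the config. Roughly (a sketch of the usual RHEL 6 procedure; adjust to taste):

vi /etc/cluster/cluster.conf     # add <action name="stop" timeout="10m"/> under the <vm>, and bump config_version
ccs_config_validate              # check the edited file against the cluster schema
cman_tool version -r             # push the new config version out to the other nodes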

To confirm this behaviour, I did the following test (copied from my linux-cluster ML post):

====
With 

<vm autostart="0" domain="primary_n01" exclusive="0" max_restarts="2" name="vm01-rhel6" path="/shared/definitions/" recovery="restart" restart_expire_time="600"/>

I called disable on a VM with GNOME running, so that I could abort the VM's shutdown.

an-c05n01:~# date; clusvcadm -d vm:vm01-rhel6; date
Wed Mar 19 21:06:29 EDT 2014
Local machine disabling vm:vm01-rhel6...Success
Wed Mar 19 21:08:36 EDT 2014

2 minutes and 7 seconds, then rgmanager forced the VM off. Had this been a Windows guest in the middle of installing updates, it would be highly likely to be screwed now.

To confirm, I changed the config to:

<vm autostart="0" domain="primary_n01" exclusive="0" max_restarts="2" name="vm01-rhel6" path="/shared/definitions/" recovery="restart" restart_expire_time="600">
  <action name="stop" timeout="10m"/>
</vm>

Then I repeated the test:

an-c05n01:~# date; clusvcadm -d vm:vm01-rhel6; date
Wed Mar 19 21:13:18 EDT 2014
Local machine disabling vm:vm01-rhel6...Success
Wed Mar 19 21:23:31 EDT 2014

10 minutes and 13 seconds before the cluster killed the server, which is much less likely to interrupt an in-progress OS update (truth be told, I plan to set 30 minutes).
====

Version-Release number of selected component (if applicable):

ricci-0.16.2-69.el6_5.1.x86_64
ccs-0.16.2-69.el6_5.1.x86_64
rgmanager-3.0.12.1-19.el6.x86_64
resource-agents-3.9.2-40.el6_5.6.x86_64


How reproducible:

100%


Steps to Reproduce:
1. ccs doesn't support adding an 'action' child element to <vm /> resources
2. rgmanager doesn't support unlimited vm stop timeouts

Actual results:

VMs are forced off after two minutes. There is no way to change that with 'ccs', forcing the user (or her apps) to modify cluster.conf directly. Even then, only a longer-but-still-arbitrary timeout can be set.


Expected results:

1. ccs should be able to set <action ...> under a <vm> service.
2. rgmanager should not kill a VM that hasn't shut down by a given time (should mark it as 'failed' perhaps?).


Additional info:

Mailing list thread:
https://www.redhat.com/archives/linux-cluster/2014-March/msg00027.html

Comment 2 John Ruemker 2016-08-02 19:39:21 UTC
Frankly it's a bit problematic that vm takes the approach of enforcing its op stop timeout internally. With all of our other resource agents (that I'm aware of, at least; perhaps another behaves like vm that I haven't seen), rgmanager is responsible for monitoring the length of time the op is taking and enforcing it if configured to do so. Since __enforce_timeouts defaults to 0, typically that achieves what the reporter was looking for: waiting indefinitely. However, vm seems to take a different approach and watch that timeout internally, then take action regardless of whether timeouts are enforced or not. This deviates from typical expected behavior and, as was stated, now creates a need for additional tooling to ensure the necessary parameters can be configured. It creates a problem too, in that you can't easily just say wait as long as you need to, other than by setting a crazy-high timeout value.
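
For comparison, this is roughly how enforcement is opted into for the other agents; the device, mount point, and names below are made up for illustration:

<rm>
  <resources>
    <fs device="/dev/sdb3" mountpoint="/mnt/data1" name="data1"/>
  </resources>
  <service autostart="1" name="example_svc" recovery="relocate">
    <fs ref="data1" __enforce_timeouts="1"/>
  </service>
</rm>

With __enforce_timeouts left at its default of 0, rgmanager just keeps waiting; vm is the odd one out in destroying the domain itself when its internal timer expires.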

Looking at this logically, it would seem ideal to stay consistent with other agents and let the administrator decide whether they want timeouts enforced or not. But considering where we are in RHEL 6, and considering it from a customer perspective, it seems like we should probably leave the agent alone, since it's clearly been this way for some time. And I guess in that case, we should probably make it easier for users to tune these settings through Conga, as was requested.

I don't have any customer cases asking for this, so I'm merely commenting as I was passing by on the problem of vm implementing its own timeout handling.

Comment 4 Madison Kelly 2016-08-17 16:01:55 UTC
Can I get the rationale behind WONTFIX?

Comment 8 Ryan McCabe 2016-11-01 16:35:11 UTC
(In reply to digimer from comment #4)
> Can I get the rationale behind WONTFIX?

The WONTFIX was regarding support for adding them via luci. The rationale was that if it needed to be done, it would be done via ccs. I have renamed and reopened the bug for luci to clarify what needs to be done there.

Comment 9 Madison Kelly 2016-11-01 17:03:05 UTC
OK, thanks.

Comment 10 Jan Pokorný [poki] 2016-11-15 16:52:28 UTC
Digimer, a better overview of existing actions than nothing:
[bug 1173942]

Comment 11 Jan Pokorný [poki] 2016-11-15 16:57:21 UTC
That being said, we will ensure that no <action> stanza already configured
gets accidentally dropped on a configuration round trip through luci.

Comment 13 Jan Kurik 2017-12-06 13:03:30 UTC
Red Hat Enterprise Linux 6 is in the Production 3 Phase. During the Production 3 Phase, Critical impact Security Advisories (RHSAs) and selected Urgent Priority Bug Fix Advisories (RHBAs) may be released as they become available.

The official life cycle policy can be reviewed here:

http://redhat.com/rhel/lifecycle

This issue does not meet the inclusion criteria for the Production 3 Phase and will be marked as CLOSED/WONTFIX. If this remains a critical requirement, please contact Red Hat Customer Support to request a re-evaluation of the issue, citing a clear business justification. Note that a strong business justification will be required for re-evaluation. Red Hat Customer Support can be contacted via the Red Hat Customer Portal at the following URL:

https://access.redhat.com/