Bug 594476

Summary:	status check program for vm.sh & user-controlled error tolerance
Product:	Red Hat Enterprise Linux 5	Reporter:	Benjamin Kahn <bkahn>
Component:	rgmanager	Assignee:	Lon Hohberger <lhh>
Status:	CLOSED ERRATA	QA Contact:	Cluster QE <mspqa-list>
Severity:	medium	Docs Contact:
Priority:	urgent
Version:	5.4	CC:	cluster-maint, djansa, edamato, fnadge, jkortus, jwest, lhh, pm-eus, psubrama, yeylon, ykaul
Target Milestone:	rc	Keywords:	FutureFeature, ZStream
Target Release:	---
Hardware:	All
OS:	Linux
Whiteboard:
Fixed In Version:	rgmanager-2.0.52-6.el5_5.8	Doc Type:	Enhancement
Doc Text:	Previously, vm.sh only checked the status of the VM itself, not the status of any services inside. With this update, administrators may now use a newly provided status check program which checks the availability of services within virtual machines running Red Hat Enterprise Virtualization Manager. Timeouts for starting and stopping virtual machines are now configurable in cluster.conf. The start timeout is based on the status check program.	Story Points:	---
Clone Of:		Environment:
Last Closed:	2010-08-25 06:33:37 UTC	Type:	---
Regression:	---	Mount Type:	---
Documentation:	---	CRM:
Verified Versions:		Category:	---
oVirt Team:	---	RHEL 7.3 requirements from Atomic Host:
Cloudforms Team:	---	Target Upstream Version:
Embargoed:
Bug Depends On:	583788
Bug Blocks:

Description Benjamin Kahn 2010-05-20 19:27:56 UTC

This bug has been copied from bug #583788 and has been proposed
to be backported to 5.5 z-stream (EUS).

Comment 3 Lon Hohberger 2010-05-20 20:45:01 UTC

http://git.fedorahosted.org/git?p=cluster.git;a=commit;h=75c7ecae61d9400d084227304d5b7068fdaa310b
http://git.fedorahosted.org/git?p=cluster.git;a=commit;h=7a1e2af8cc37409b3a38fbe9b5945d186be9f2ec
http://git.fedorahosted.org/git?p=cluster.git;a=commit;h=61a7c0f26c4248778d656bcecb47375c419fafd6
http://git.fedorahosted.org/git?p=cluster.git;a=commit;h=ea76067bc05359c1582bfd84be4c1b3eb05b230a

Comment 4 Lon Hohberger 2010-06-03 18:12:29 UTC

Last patch has been updated:

http://git.fedorahosted.org/git?p=cluster.git;a=commit;h=805e0bae5db683fb33ec2e3c14b12c6380885494

Comment 11 yeylon@redhat.com 2010-06-21 15:43:58 UTC

We need some further improvements to the rhevm-check validation.

1. Currently there is only one timeout interval: rhev-check runs every X minutes.
Because a VM restart takes about 5 minutes, that is the minimum interval at which the check can run. This is unacceptable for the downtime of the rhevm node (a 5-minute interval plus 5 minutes of boot time means 10 minutes of downtime).

We need a way to reduce this timeout to a more reasonable value.
One way to do this is to add two different types of intervals:
a. interval = X - for regular testing
b. after_failure_interval = Y - time to wait after the VM is restarted before the initial test

2. Currently, after a single failure of rhev-check.sh the rhevm node is rebooted, which is not the best way to go. We need to take into account scenarios in which the VM did not respond because of load or for other reasons.

We need a way to test a few times before determining that the RHEVM VM is dead. For example, if rhev-check.sh returns an error once, keep retrying X times at intervals of Y, and migrate the VM only if all attempts have failed:

__max_failures="5" __failure_expire_time="60"
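
The tolerance asked for in point 2 can be sketched as a sliding-window failure counter. This is only an illustration in Python, not the rgmanager implementation: the class name FailureWindow and its method are hypothetical, and the sketch assumes that recovery is triggered once more than max_failures failures fall within expire_time seconds, which may differ from how rgmanager interprets __max_failures and __failure_expire_time.

```python
from collections import deque


class FailureWindow:
    """Count status-check failures inside a sliding time window (hypothetical helper)."""

    def __init__(self, max_failures, expire_time):
        self.max_failures = max_failures  # failures tolerated inside the window
        self.expire_time = expire_time    # seconds after which a failure is forgotten
        self._failures = deque()          # timestamps of failures still remembered

    def record_failure(self, now):
        """Record a failure at time `now` (seconds).

        Returns True when the tolerated number of failures has been
        exceeded and recovery (restart or migration) should start.
        """
        self._failures.append(now)
        # Forget failures that have aged out of the window.
        while self._failures and now - self._failures[0] >= self.expire_time:
            self._failures.popleft()
        return len(self._failures) > self.max_failures
```

With max_failures=5 and expire_time=60, a single failed check caused by load is tolerated, while a run of failures in quick succession still leads to recovery.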

3. Currently the VM is shut down using virsh shutdown, and after 15 seconds the KVM process is killed, so the VM does not have time to shut down properly. This can (and will) lead to corruption (I had one case). We need to increase the timeout between the shutdown of the VM and the killing of the process (100 to 120 seconds should be fine).
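
The escalation described in point 3 (request a clean stop, wait for a grace period, then kill) can be sketched generically as follows. This is a Python illustration on an ordinary child process, not the vm.sh code and not a virsh invocation: the function name stop_gracefully and the grace_seconds parameter are hypothetical.

```python
import subprocess


def stop_gracefully(proc, grace_seconds):
    """Ask `proc` to exit, wait up to `grace_seconds`, then kill it if needed.

    Returns True if the process exited within the grace period,
    False if it had to be killed.
    """
    proc.terminate()  # polite request (SIGTERM on POSIX)
    try:
        proc.wait(timeout=grace_seconds)
        return True
    except subprocess.TimeoutExpired:
        proc.kill()   # forced stop once the grace period has expired
        proc.wait()   # reap the process so it does not linger as a zombie
        return False
```

The point of the sketch is that the grace period is a parameter: a value of 15 seconds matches the behavior reported above, and a value in the 100 to 120 second range matches the request.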

Comment 15 yeylon@redhat.com 2010-06-29 11:20:02 UTC

It looks like rhev-check.sh does not work as expected at this stage. The 5-minute timeout for starting a VM never ends.

1. Migrate the VM service.
2. As soon as the VM has been relocated, kill the KVM process on the server.
3. Observe that rhev-check keeps getting errors, but the service is not migrated again after 5 minutes as expected, only after half an hour.

This will require a respin.

Comment 20 errata-xmlrpc 2010-08-25 06:33:37 UTC

An advisory has been issued which should help the problem
described in this bug report. This report is therefore being
closed with a resolution of ERRATA. For more information
on the solution and/or where to find the updated files,
please follow the link below. You may reopen this bug report
if the solution does not work for you.

http://rhn.redhat.com/errata/RHBA-2010-0647.html

Comment 21 Florian Nadge 2010-10-18 17:33:09 UTC

    Technical note added. If any revisions are required, please edit the "Technical Notes" field
    accordingly. All revisions will be proofread by the Engineering Content Services team.
    
    New Contents:
previously, vm.sh only checked the status of the VM itself, not the status of any services inside. With this update, administrators may now use a newly provided status check program which checks the availability of services within virtual machines running Red Hat Enterprise Virtualization Manager. Timeouts for starting and stopping virtual machines are now configurable in cluster.conf. The start timeout is based on the status check program.

Comment 22 Florian Nadge 2010-10-18 17:33:21 UTC

    Technical note updated. If any revisions are required, please edit the "Technical Notes" field
    accordingly. All revisions will be proofread by the Engineering Content Services team.
    
    Diffed Contents:
@@ -1 +1 @@
-previously, vm.sh only checked the status of the VM itself, not the status of any services inside. With this update, administrators may now use a newly provided status check program which checks the availability of services within virtual machines running Red Hat Enterprise Virtualization Manager. Timeouts for starting and stopping virtual machines are now configurable in cluster.conf. The start timeout is based on the status check program.
+Previously, vm.sh only checked the status of the VM itself, not the status of any services inside. With this update, administrators may now use a newly provided status check program which checks the availability of services within virtual machines running Red Hat Enterprise Virtualization Manager. Timeouts for starting and stopping virtual machines are now configurable in cluster.conf. The start timeout is based on the status check program.