Bug 594476

Summary: status check program for vm.sh & user-controlled error tolerance
Product: Red Hat Enterprise Linux 5
Reporter: Benjamin Kahn <bkahn>
Component: rgmanager
Assignee: Lon Hohberger <lhh>
Status: CLOSED ERRATA
QA Contact: Cluster QE <mspqa-list>
Severity: medium
Priority: urgent
Version: 5.4
CC: cluster-maint, djansa, edamato, fnadge, jkortus, jwest, lhh, pm-eus, psubrama, yeylon, ykaul
Target Milestone: rc
Keywords: FutureFeature, ZStream
Target Release: ---
Hardware: All
OS: Linux
Fixed In Version: rgmanager-2.0.52-6.el5_5.8
Doc Type: Enhancement
Doc Text:
Previously, vm.sh only checked the status of the VM itself, not the status of any services inside. With this update, administrators may now use a newly provided status check program which checks the availability of services within virtual machines running Red Hat Enterprise Virtualization Manager. Timeouts for starting and stopping virtual machines are now configurable in cluster.conf. The start timeout is based on the status check program.
Last Closed: 2010-08-25 06:33:37 UTC
Bug Depends On: 583788

Description Benjamin Kahn 2010-05-20 19:27:56 UTC
This bug has been copied from bug #583788 and has been proposed
to be backported to 5.5 z-stream (EUS).

Comment 11 yeylon@redhat.com 2010-06-21 15:43:58 UTC
We need some further improvements to the rhevm-check validation.

1. Currently there is only one timeout interval: rhev-check runs every X minutes.
Because a VM restart takes about 5 minutes, that is the minimum interval at which the check can run. This is unacceptable downtime for the rhevm node (a 5-minute interval plus 5 minutes of boot time means 10 minutes of downtime).

We need a way to reduce this timeout to something more reasonable.
One way to do this is to add two different intervals:
a. interval = X, for regular testing
b. after_failure_interval = Y, the time to wait after the VM is restarted before the first test

2. Currently, after a single rhev-check.sh failure the rhevm node is rebooted, which is not ideal. We need to take into account that the VM may have failed to respond because of load or for other reasons.

We need a way to test several times before deciding that the RHEVM VM is dead. For example, if rhev-check.sh returns an error once, retry X times at intervals of Y, and migrate the VM only if all attempts fail.

__max_failures="5" __failure_expire_time="60"

3. Currently the VM is shut down with virsh shutdown, and the KVM process is killed 15 seconds later, so the VM does not have time to shut down properly. This can (and will) lead to corruption; I have already had one case. We need to increase the timeout between the VM shutdown and the process kill (100 to 120 seconds should be fine).
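
The failure tolerance requested in item 2 could be sketched roughly as follows. This is only an illustration in Python, not the rgmanager implementation: the class name FailureWindow and its method are hypothetical, and the reading of __max_failures and __failure_expire_time (recover once that many failures fall within the expiry window, in seconds) is an assumption based on this comment.

```python
from collections import deque


class FailureWindow:
    """Hypothetical helper: tolerate failed status checks until
    max_failures of them occur within expire_time seconds."""

    def __init__(self, max_failures=5, expire_time=60):
        self.max_failures = max_failures
        self.expire_time = expire_time
        self._failures = deque()  # timestamps of failures still counted

    def record_failure(self, now):
        """Record a failed check at time `now` (seconds).

        Returns True when recovery (for example, migrating the VM)
        should be triggered, False when the failure is tolerated.
        """
        self._failures.append(now)
        # Forget failures that are older than the expiry window.
        while self._failures and now - self._failures[0] >= self.expire_time:
            self._failures.popleft()
        # Assumption: recovery starts once max_failures are on record.
        return len(self._failures) >= self.max_failures
```

With the values from the example line above (5 failures, 60 seconds), a single failed check caused by load would be tolerated, while five failures inside one minute would trigger recovery.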

Comment 15 yeylon@redhat.com 2010-06-29 11:20:02 UTC
It looks like rhev-check.sh does not work as expected at this stage. The 5-minute timeout for starting a VM never ends.

1. Migrate the VM service.
2. As soon as the VM has been relocated, kill the KVM process on the server.
3. Observe that rhev-check keeps getting errors, but the service is not migrated again after 5 minutes as expected, only after half an hour.

This will require a respin.

Comment 20 errata-xmlrpc 2010-08-25 06:33:37 UTC
An advisory has been issued which should help the problem
described in this bug report. This report is therefore being
closed with a resolution of ERRATA. For more information
on the solution and/or where to find the updated files,
please follow the link below. You may reopen this bug report
if the solution does not work for you.

http://rhn.redhat.com/errata/RHBA-2010-0647.html

Comment 21 Florian Nadge 2010-10-18 17:33:09 UTC
    Technical note added. If any revisions are required, please edit the "Technical Notes" field
    accordingly. All revisions will be proofread by the Engineering Content Services team.
    
    New Contents:
previously, vm.sh only checked the status of the VM itself, not the status of any services inside. With this update, administrators may now use a newly provided status check program which checks the availability of services within virtual machines running Red Hat Enterprise Virtualization Manager. Timeouts for starting and stopping virtual machines are now configurable in cluster.conf. The start timeout is based on the status check program.

Comment 22 Florian Nadge 2010-10-18 17:33:21 UTC
    Technical note updated. If any revisions are required, please edit the "Technical Notes" field
    accordingly. All revisions will be proofread by the Engineering Content Services team.
    
    Diffed Contents:
@@ -1 +1 @@
-previously, vm.sh only checked the status of the VM itself, not the status of any services inside. With this update, administrators may now use a newly provided status check program which checks the availability of services within virtual machines running Red Hat Enterprise Virtualization Manager. Timeouts for starting and stopping virtual machines are now configurable in cluster.conf. The start timeout is based on the status check program.
+Previously, vm.sh only checked the status of the VM itself, not the status of any services inside. With this update, administrators may now use a newly provided status check program which checks the availability of services within virtual machines running Red Hat Enterprise Virtualization Manager. Timeouts for starting and stopping virtual machines are now configurable in cluster.conf. The start timeout is based on the status check program.