1034787 – [RFE] Hosted-HA should not start infinite reboot loop when VDSM reports the engine VM as "Running"


Keywords:
Status:	CLOSED ERRATA
Alias:	None
Product:	Red Hat Enterprise Virtualization Manager
Classification:	Red Hat
Component:	ovirt-hosted-engine-ha
Sub Component:
Version:	3.3.0
Hardware:	Unspecified
OS:	Unspecified
Priority:	unspecified
Severity:	unspecified
Target Milestone:	---
Target Release:	3.3.0
Assignee:	Martin Sivák
QA Contact:	Artyom
Docs Contact:
URL:
Whiteboard:	sla
Depends On:
Blocks:	GSS_RHEV_33_BETA

Reported:	2013-11-26 14:04 UTC by Pablo Iranzo Gómez
Modified:	2016-02-10 20:17 UTC
CC List:	9 users
Fixed In Version:	ovirt-hosted-engine-ha-0.1.0-0.8.rc.el6ev
Doc Type:	Enhancement
Doc Text:
Clone Of:
Environment:
Last Closed:	2014-01-21 16:51:42 UTC
oVirt Team:	SLA
Target Upstream Version:
Embargoed:

Attachments
Agent log (89.35 KB, text/x-log) 2013-11-26 14:07 UTC, Pablo Iranzo Gómez	no flags
Broker log (259.23 KB, text/x-log) 2013-11-26 14:08 UTC, Pablo Iranzo Gómez	no flags
vdsm log (3.52 MB, text/x-log) 2013-11-26 14:11 UTC, Pablo Iranzo Gómez	no flags

Links
System	ID	Private	Priority	Status	Summary	Last Updated
Red Hat Product Errata	RHEA-2014:0080	0	normal	SHIPPED_LIVE	new package: ovirt-hosted-engine-ha	2014-01-21 21:00:07 UTC
oVirt gerrit	21734	0	None	None	None	Never

Description Pablo Iranzo Gómez 2013-11-26 14:04:45 UTC

Description of problem:
When monitoring the VM that provides RHEV-M in hosted mode, it may take longer than the configured default, sending the VM into a reboot loop even with just one host.

A configurable value that users could tune for lower-performance media would help when doing evaluations and give a better user experience.

It's possible to work around this by putting the engine into maintenance or by tweaking global variables.

Comment 2 Pablo Iranzo Gómez 2013-11-26 14:07:43 UTC

Created attachment 829281
Agent log

Comment 3 Pablo Iranzo Gómez 2013-11-26 14:08:31 UTC

Created attachment 829282
Broker log

Comment 4 Pablo Iranzo Gómez 2013-11-26 14:11:50 UTC

Created attachment 829295
vdsm log

Comment 5 Martin Sivák 2013-11-26 14:35:41 UTC

This is the result of a vmGetStats call to VDSM. Notice the 'status' field with the value 'Running'. That is not a valid status.

{'statsList': [{'acpiEnable': 'true',
                'appsList': [u'rhevm-guest-agent-common-1.0.8-6.el6ev',
                             u'kernel-2.6.32-431.el6'],
                'balloonInfo': {},
                'clientIp': '',
                'cpuSys': '1.79',
                'cpuUser': '3.86',
                'disks': {u'hdc': {'apparentsize': '0',
                                   'flushLatency': '0',
                                   'readLatency': '0',
                                   'readRate': '0.00',
                                   'truesize': '0',
                                   'writeLatency': '0',
                                   'writeRate': '0.00'},
                          u'vda': {'apparentsize': '32212254720',
                                   'flushLatency': '95602',
                                   'imageID': '59bc6b3e-9109-4bdc-8141-2e1a27149a05',
                                   'readLatency': '14869881',
                                   'readRate': '1375.47',
                                   'truesize': '6963359744',
                                   'writeLatency': '390780435',
                                   'writeRate': '9824.76'}},
                'disksUsage': [{u'fs': u'ext4',
                                u'path': u'/',
                                u'total': '26958753792',
                                u'used': '6088613888'},
                               {u'fs': u'ext4',
                                u'path': u'/boot',
                                u'total': '507744256',
                                u'used': '40357888'}],
                'displayIp': '0',
                'displayPort': u'5900',
                'displaySecurePort': u'5901',
                'displayType': 'qxl',
                'elapsedTime': '1998',
                'guestFQDN': u'rhevm.example.com',
                'guestIPs': u'192.168.2.115',
                'guestName': u'rhevm.example.com',
                'guestOs': u'2.6.32-431.el6.x86_64',
                'hash': '-7610418427437194438',
                'kvmEnable': 'true',
                'lastLogin': 1385472572.382862,
                'memUsage': '41',
                'memoryStats': {u'majflt': '0',
                                u'mem_free': '1851624',
                                u'mem_total': '2956596',
                                u'mem_unused': '1503868',
                                u'pageflt': '19',
                                u'swap_in': '0',
                                u'swap_out': '0',
                                u'swap_total': '4194296',
                                u'swap_usage': '0'},
                'monitorResponse': '0',
                'netIfaces': [{u'hw': u'00:16:3e:15:97:16',
                               u'inet': [u'192.168.2.115'],
                               u'inet6': [u'fe80::216:3eff:fe15:9716'],
                               u'name': u'eth0'}],
                'network': {u'vnet0': {'macAddr': '00:16:3e:15:97:16',
                                       'name': u'vnet0',
                                       'rxDropped': '0',
                                       'rxErrors': '0',
                                       'rxRate': '0.0',
                                       'speed': '1000',
                                       'state': 'unknown',
                                       'txDropped': '0',
                                       'txErrors': '0',
                                       'txRate': '0.0'}},
                'pauseCode': 'NOERR',
                'pid': '11507',
                'session': 'Unknown',
                'statsAge': '1.80',
                'status': 'Running',
                'timeOffset': '0',
                'username': u'None',
                'vmId': 'ebdc068e-a1b6-4403-a8f3-1a44db957e15',
                'vmType': 'kvm'}],
 'status': {'code': 0, 'message': 'Done'}}

Comment 6 Doron Fediuck 2013-11-26 16:24:58 UTC

The running status is a result of having a guest agent installed, and it is
valid. The HA should be able to handle it as well.
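
As a rough illustration of what "handling it" could look like on the consumer side, here is a minimal sketch. It is not the patch from the linked gerrit change; the names ALIVE_STATUSES and engine_vm_is_up are hypothetical, and the choice of which statuses count as alive is an assumption based only on the values seen in this report.

```python
# Assumption: statuses a monitor might accept as "the VM is alive".
# 'Up' comes from the documented VmStatus enum; 'Running' is the
# guest-agent value seen in the vmGetStats reply in comment 5.
ALIVE_STATUSES = frozenset(['Up', 'Running'])


def engine_vm_is_up(vm_stats):
    """Hypothetical check: True if the stats dict describes a live VM."""
    return vm_stats.get('status') in ALIVE_STATUSES
```

With a check like this, a reply carrying 'status': 'Running' would no longer be read as a failed VM, which is the condition that leads to the restart loop.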

Comment 7 Martin Sivák 2013-11-26 17:10:56 UTC

Here is the list of valid states:

vm.py:

VALID_STATES = ('Down', 'Migration Destination', 'Migration Source',
                'Paused', 'Powering down', 'RebootInProgress',
                'Restoring state', 'Saving State',
                'Up', 'WaitForLaunch')

and API description in the json file:

##
# @VmStatus:
#
# An enumeration of possible virtual machine statuses.
#
# @Down:                   The VM is powered off
#
# @Migration Destination:  The VM is migrating to this host
#
# @Migration Source:       The VM is migrating away from this host
#
# @Paused:                 The VM is paused
#
# @Powering down:          A shutdown command has been sent to the VM
#
# @RebootInProgress:       The VM is currently rebooting
#
# @Restoring state:        The VM is waking from hibernation
#
# @Saving State:           The VM is preparing for hibernation
#
# @Up:                     The VM is running
#
# @WaitForLaunch:          The VM is being created
#
# Since: 4.10.0
##
{'enum': 'VmStatus',
 'data': ['Down', 'Migration Destination', 'Migration Source', 'Paused',
          'Powering down', 'RebootInProgress', 'Restoring state',
          'Saving State', 'Up', 'WaitForLaunch']}

So if Running is valid, it is undocumented.
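
The mismatch can be checked mechanically. The sketch below copies VALID_STATES from the vm.py excerpt above and tests a trimmed copy of the reply from comment 5 against it; is_documented_status is a hypothetical helper name, not something that exists in VDSM.

```python
# VALID_STATES copied from the vm.py excerpt quoted above.
VALID_STATES = ('Down', 'Migration Destination', 'Migration Source',
                'Paused', 'Powering down', 'RebootInProgress',
                'Restoring state', 'Saving State',
                'Up', 'WaitForLaunch')


def is_documented_status(stats_reply):
    """Hypothetical helper: True if every VM in the reply has a documented status."""
    return all(vm.get('status') in VALID_STATES
               for vm in stats_reply.get('statsList', []))


# Trimmed copy of the reply from comment 5 (only the relevant keys kept).
reply = {'statsList': [{'vmId': 'ebdc068e-a1b6-4403-a8f3-1a44db957e15',
                        'status': 'Running'}],
         'status': {'code': 0, 'message': 'Done'}}

print(is_documented_status(reply))  # False
```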

And here is the code that causes the bug:

vm.py:

    def _getStatsInternal(self):
        # used by API.Vm.getStats

        def _getGuestStatus():
            GUEST_WAIT_TIMEOUT = 60
            now = time.time()
            if now - self._guestEventTime < 5 * GUEST_WAIT_TIMEOUT and \
                    self._guestEvent == 'Powering down':
                return self._guestEvent
            if self.guestAgent and self.guestAgent.isResponsive() and \
                    self.guestAgent.getStatus():
                return self.guestAgent.getStatus() # !! HERE !!
            if now - self._guestEventTime < GUEST_WAIT_TIMEOUT:
                return self._guestEvent
            return 'Up'
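
To make the code path easier to follow, here is a standalone rendering of the same decision logic with the object state passed in as plain arguments. The function name guest_status and its parameters are hypothetical; agent_status stands in for the result of guestAgent.getStatus() on a responsive agent, and the 'Powering up' event value in the usage lines is only an illustrative input.

```python
GUEST_WAIT_TIMEOUT = 60  # seconds, same constant as in the excerpt above


def guest_status(now, guest_event, guest_event_time, agent_status):
    """Standalone sketch of the _getGuestStatus logic quoted above.

    agent_status is the string a responsive guest agent reports;
    pass None when no agent is installed or it is not responding.
    """
    if (now - guest_event_time < 5 * GUEST_WAIT_TIMEOUT
            and guest_event == 'Powering down'):
        return guest_event
    if agent_status:
        # The agent-supplied string is returned without being checked
        # against VALID_STATES, so 'Running' passes straight through.
        return agent_status
    if now - guest_event_time < GUEST_WAIT_TIMEOUT:
        return guest_event
    return 'Up'


now = 10000.0
event_time = now - 1998  # last guest event long ago (illustrative value)
print(guest_status(now, 'Powering up', event_time, 'Running'))  # Running
print(guest_status(now, 'Powering up', event_time, None))       # Up
```

The two calls differ only in whether a guest agent reports a status, which matches the observation that the problem appears once the guest agent is installed in the VM.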

Comment 9 Artyom 2013-12-08 09:12:12 UTC

Please provide the correct steps for reproducing this bug.
Thanks

Comment 10 Artyom 2013-12-08 09:32:46 UTC

Also, I was unable to reach the status "Running" for the VM. Via getVmStats I saw the VM first in "Powering Up" mode and after that in "Up" mode.

Comment 11 Pablo Iranzo Gómez 2013-12-09 09:06:20 UTC

(In reply to Artyom from comment #9)
> Please provide the correct steps for reproducing this bug.
> Thanks

Artyom, install the oVirt guest agent inside the VM, and it will start behaving like that.

I had to manually apply the patch on each hosted-agent host to make it run stably.

Do you need anything else?

Comment 12 Artyom 2013-12-09 12:17:18 UTC

After installing rhevm-guest-agent, vmGetStats shows the VM status as "Running" (in vdsm.log), and the hosted VM runs without any trouble or restarts.
Verified on ovirt-hosted-engine-ha-0.1.0-0.8.rc.el6ev.noarch

Comment 13 errata-xmlrpc 2014-01-21 16:51:42 UTC

Since the problem described in this bug report should be
resolved in a recent advisory, it has been closed with a
resolution of ERRATA.

For information on the advisory, and where to find the updated
files, follow the link below.

If the solution does not work for you, open a new bug report.

http://rhn.redhat.com/errata/RHEA-2014-0080.html
