Created attachment 891547 [details]
target host agent log

Description of problem:
Migration of the hosted-engine vm puts the target host's score to zero; in my case it also does not update the source host's score.

Version-Release number of selected component (if applicable):
ovirt-hosted-engine-ha-1.1.2-2.el6ev.noarch

How reproducible:
Always

Steps to Reproduce:
1. Set up a hosted-engine environment with two hosts and, on the host where the engine vm runs, block the connection to the storage domain (via iptables -I INPUT -s sd_ip -j DROP)
2. Wait until the vm starts on the second host and check the host scores
3.

Actual results:
--== Host 1 status ==--

Status up-to-date                  : True
Hostname                           : 10.35.64.85
Host ID                            : 1
Engine status                      : {'reason': 'bad vm status', 'health': 'bad', 'vm': 'up', 'detail': 'waitforlaunch'}
Score                              : 0
Local maintenance                  : False
Host timestamp                     : 1398953165
Extra metadata (valid at timestamp):
        metadata_parse_version=1
        metadata_feature_version=1
        timestamp=1398953165 (Thu May 1 17:06:05 2014)
        host-id=1
        score=0
        maintenance=False
        state=EngineUpBadHealth
        timeout=Thu May 1 17:10:34 2014

--== Host 2 status ==--

Status up-to-date                  : False
Hostname                           : 10.35.97.36
Host ID                            : 2
Engine status                      : unknown stale-data
Score                              : 2400
Local maintenance                  : False
Host timestamp                     : 1398953033
Extra metadata (valid at timestamp):
        metadata_parse_version=1
        metadata_feature_version=1
        timestamp=1398953033 (Thu May 1 17:03:53 2014)
        host-id=2
        score=2400
        maintenance=False
        state=EngineUp

Expected results:
The target host should have a score of 2400 (as it was before the migration started), and the source host should have a low score (because it has problems connecting to the storage domain).

Additional info:
The target host receives engine state=EngineUnexpectedlyDown, and vdsm.log shows:

Thread-7780::ERROR::2014-05-01 18:28:25,314::vm::2285::vm.Vm::(_startUnderlyingVm) vmId=`a8d328ea-991a-4a06-ac3a-cf2c11d4f264`::The vm start process failed
Traceback (most recent call last):
  File "/usr/share/vdsm/vm.py", line 2245, in _startUnderlyingVm
    self._run()
  File "/usr/share/vdsm/vm.py", line 3172, in _run
    self._connection.createXML(domxml, flags),
  File "/usr/lib64/python2.6/site-packages/vdsm/libvirtconnection.py", line 92, in wrapper
    ret = f(*args, **kwargs)
  File "/usr/lib64/python2.6/site-packages/libvirt.py", line 2665, in createXML
    if ret is None:raise libvirtError('virDomainCreateXML() failed', conn=self)
libvirtError: internal error Failed to acquire lock: error -243
Thread-7780::DEBUG::2014-05-01 18:28:25,321::vm::2727::vm.Vm::(setDownStatus) vmId=`a8d328ea-991a-4a06-ac3a-cf2c11d4f264`::Changed state to Down: internal error Failed to acquire lock: error -243

For this reason the host score is zero.
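For illustration only, a toy sketch of the failure path above (all names here are hypothetical; this is not vdsm code): the target host tries to start the vm while the lease is still held by the source host, the start fails, and the vm is marked Down, which the agent then scores as zero.

# Toy model: the vm's lease is still owned by the source host, so a
# second start attempt on the target host fails, mimicking error -243.
held_leases = {"hosted-engine": "source-host"}

class LockError(Exception):
    pass

def create_vm(host):
    owner = held_leases.get("hosted-engine")
    if owner is not None and owner != host:
        raise LockError("internal error Failed to acquire lock: error -243")
    held_leases["hosted-engine"] = host

try:
    create_vm("target-host")
except LockError as err:
    vm_state = "Down"  # analogous to setDownStatus in the traceback
    score = 0          # the agent then zeroes the host score
    print(err, vm_state, score)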
sanlock client status of the source host:

daemon d341e1a7-1277-492e-a7b7-b1c6649427f6.rose05.qa.
p -1 helper
p -1 listener
p -1 status
s 21caf848-8e2c-4d24-b709-c4e189fa5f4b:2:/rhev/data-center/mnt/10.35.160.108\:_RHEV_artyom__hosted__engine/21caf848-8e2c-4d24-b709-c4e189fa5f4b/dom_md/ids:0
s hosted-engine:2:/rhev/data-center/mnt/10.35.160.108\:_RHEV_artyom__hosted__engine/21caf848-8e2c-4d24-b709-c4e189fa5f4b/ha_agent/hosted-engine.lockspace:0
In my case, this seems to happen after adding a third host to the cluster, and it happens even without a migration. Putting a host into local maintenance jumps its score back up to 2400, but it drops back down to 0 after 5 or so minutes of maintenance --mode=none.
I have 3 hosts too and have the same problem. I'll upload the logs of all hosts + engine (vdsm, supervdsm, sanlock, agent-ha, agent-broker).

Sequence: started ovirt01, which started engine01, while ovirt02 and 03 were powered off. Then started ovirt02 and waited until it was stable, meaning hosted-engine --vm-status gave a correct status. Then started ovirt03 and waited until the error -243 showed up. Collected the logs.
Created attachment 907423 [details]
engine logs

Created attachment 907424 [details]
host01 logs

Created attachment 907425 [details]
host02 logs

Created attachment 907426 [details]
host03 logs
I had the same problem when adding a third host.

According to hosted_engine.py, engine_status_score is

engine_status_score_lookup = {
    'None': 0,
    'vm-down': 1,
    'vm-up bad-health-status': 2,
    'vm-up good-health-status': 3,
}

It seems that in state_machine.py, the refresh function of the EngineStateMachine class sets best_engine to the host with the lowest engine_status_score. The problem is that, in the consume function of the EngineDown class, new_data.best_engine_status["vm"] can then never be up.

Here's what I understood: node1 is running the hosted engine, so it has the highest engine_status_score (vm-up good-health-status). When node2 refreshes its data, it becomes the best_engine since it has the lowest engine_status_score (None). It then tries to start the engine. The same applies to node3. They cannot do this since the engine is up and running on node1 and (I think) is locked. They finally transition to state EngineUnexpectedlyDown.

I think best_engine should be the host with the highest engine_status_score. Changing line 124 of state_machine.py from "best_engine = min(alive_hosts," to "best_engine = max(alive_hosts," solved the problem for me. It never happened again and I could migrate the engine from one node to another without issue (which was not the case without this change).

It may not be the ideal solution as I'm just starting with oVirt, but I hope it will help in solving this bug.
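To make the analysis concrete, here is a minimal runnable sketch of that selection logic (the host records are hypothetical; the real refresh code in state_machine.py is more involved):

# Hypothetical host records using the lookup table above.
alive_hosts = [
    {"host-id": 1, "engine-status-score": 3},  # vm-up good-health-status
    {"host-id": 2, "engine-status-score": 0},  # None (no vm running)
    {"host-id": 3, "engine-status-score": 0},  # None (no vm running)
]

# Current behavior: min() picks an idle host, so EngineDown never sees
# a host whose vm is up and tries to start a second copy of the engine.
best_engine = min(alive_hosts, key=lambda h: h["engine-status-score"])
assert best_engine["host-id"] in (2, 3)

# Proposed fix: max() picks the host actually running the engine.
best_engine = max(alive_hosts, key=lambda h: h["engine-status-score"])
assert best_engine["host-id"] == 1

With min(), the idle hosts shadow the host that is actually running the engine, which matches the spurious start attempts and the EngineUnexpectedlyDown transitions seen in the logs above.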
(In reply to Benoit Laniel from comment #9)

Nice catch and good analysis, Benoit! Thanks and welcome aboard!
Meital, can you check if you have capacity to test this for 3.4.1? This is a pretty serious bug whose fix we'd really like to get in.
I can confirm the bug on a freshly installed oVirt 3.4.1 with only two hosts in HA that rely on external NFSv4.
I think there might be an issue in the host score calculation when a VM is migrating away. My guess is that once the status changes to something other than Up, we drop the score.
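If that guess is right, the effect would look roughly like the sketch below (the function, constant and state names are hypothetical, not the agent's actual code):

BASE_SCORE = 2400
MIGRATION_STATES = ("Migration Source", "Migration Destination")

def host_score(vm_state):
    # Suspected current rule: anything other than 'Up' zeroes the score,
    # which also punishes a host whose vm is merely migrating.
    return BASE_SCORE if vm_state == "Up" else 0

def host_score_fixed(vm_state):
    # Possible fix: treat an in-flight migration like a healthy vm.
    if vm_state == "Up" or vm_state in MIGRATION_STATES:
        return BASE_SCORE
    return 0

print(host_score("Migration Source"))        # 0    -> score dropped
print(host_score_fixed("Migration Source"))  # 2400 -> score kept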
I have a three node setup w/ hosted engine, using gluster nfs fronted by ctdb for the engine's storage. Every hosted engine migration triggers the error:

VM HostedEngine is down. Exit message: internal error: Failed to acquire lock: error -243.

As with other reports here, the HostedEngine never actually goes down.

This is on ovirt 3.4.2 w/ 3 F20 hosts and a CentOS 6.5 hosted engine.

I can upload logs, etc. if needed.
I can check it; we have three hosts for hosted engine, and I can also change this line in state_machine.py (from min to max).
(In reply to Jason Brooks from comment #15)
> I have a three node setup w/ hosted engine, using gluster nfs fronted by
> ctdb for the engine's storage. Every hosted engine migration triggers the
> error:
>
> VM HostedEngine is down. Exit message: internal error: Failed to acquire
> lock: error -243.
>
> As with other reports here, the HostedEngine never actually goes down.
>
> This is on ovirt 3.4.2 w/ 3 F20 hosts and a CentOS 6.5 hosted engine.
>
> I can upload logs, etc. if needed.

This is a different bug. Please create a separate ticket for it, upload the logs, and describe how you ran the migration.
*** Bug 1093638 has been marked as a duplicate of this bug. ***
Verified on ovirt-hosted-engine-ha-1.2.1-0.2.master.20140805072346.el6.noarch

Checked with 3 hosts; all works fine. Also checked the scenario from the description: the vm migrated without dropping the destination host score to zero.
Since the problem described in this bug report should be resolved in a recent advisory, it has been closed with a resolution of ERRATA.

For information on the advisory, and where to find the updated files, follow the link below.

If the solution does not work for you, open a new bug report.

https://rhn.redhat.com/errata/RHBA-2015-0194.html