Created attachment 891755 [details]
agent broker and vdsm logs from two hosts

Description of problem:
As a result of bug https://bugzilla.redhat.com/show_bug.cgi?id=1093621, I had to mount the storage domains manually via hosted-engine --connect-storage and start the HA agent via service ovirt-ha-agent start. As a result, I have two hosts:

--== Host 1 status ==--

Status up-to-date       : True
Hostname                : 10.35.64.85
Host ID                 : 1
Engine status           : {'reason': 'vm not running on this host', 'health': 'bad', 'vm': 'down', 'detail': 'unknown'}
Score                   : 2400
Local maintenance       : False
Host timestamp          : 1399022514
Extra metadata (valid at timestamp):
    metadata_parse_version=1
    metadata_feature_version=1
    timestamp=1399022514 (Fri May 2 12:21:54 2014)
    host-id=1
    score=2400
    maintenance=False
    state=EngineDown

--== Host 2 status ==--

Status up-to-date       : True
Hostname                : 10.35.97.36
Host ID                 : 2
Engine status           : {'reason': 'vm not running on this host', 'health': 'bad', 'vm': 'down', 'detail': 'unknown'}
Score                   : 2400
Local maintenance       : False
Host timestamp          : 1399022507
Extra metadata (valid at timestamp):
    metadata_parse_version=1
    metadata_feature_version=1
    timestamp=1399022507 (Fri May 2 12:21:47 2014)
    host-id=2
    score=2400
    maintenance=False
    state=EngineDown

The HA agent does not start the engine VM automatically.

Version-Release number of selected component (if applicable):
ovirt-hosted-engine-ha-1.1.2-2.el6ev.noarch

How reproducible:
Always

Steps to Reproduce:
1. See above

Actual results:
The engine VM stays in the Down state, and there are no error messages in the agent, broker, or vdsm logs.

Expected results:
The HA agent must start the engine VM automatically; if it does not, it must log an ERROR message in the agent log.

Additional info:
It is also possible to start the engine VM manually (hosted-engine --vm-start); after doing so, the deadlock seems to vanish.
Jirka, can you please keep an eye on this and determine whether it is a duplicate of bug 1093366 as you progress with that fix?
I have the same problem. When all hosts have engine status "vm not running on this host", they all have a score of 2400. So when EngineDown's consume function is called, none of them has the "best score", and therefore none of them starts the engine VM.

As for starting the VM manually from the command line, there is a problem only when not in global maintenance mode. What happens is that EngineDown detects the VM is "unexpectedly running locally" and transitions to EngineUp. While the VM is starting, the state transitions to EngineUpBadHealth, since the VM is up but the engine has not finished starting (failed liveliness check). The score is then set to 0. Then, when EngineUpBadHealth's consume function calls EngineUp's consume function (if I understand correctly), the VM is immediately shut down because another host has a better score (which will always be the case, since we have multiple hosts with a high score and no running VM). We then enter a loop where each host starts and almost immediately stops the VM.

The only solution I have found so far is to check whether the class is an instance of EngineUpBadHealth, at line 347 of ovirt_hosted_engine_ha/agent/states.py:

    elif (new_data.best_score_host and
          new_data.best_score_host["host-id"] != new_data.host_id and
          new_data.best_score_host["score"] >=
              self.score(logger) + self.MIGRATION_THRESHOLD_SCORE and
          not isinstance(self, EngineUpBadHealth)):
        logger.error("Host %s (id %d) score is significantly better"
                     " than local score, shutting down VM on this host",
                     new_data.best_score_host['hostname'],
                     new_data.best_score_host["host-id"])
        return EngineStop(new_data)

I think it is harmless, since EngineUpBadHealth has a timeout that will stop the VM if there is a problem. The election can then start again.
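To make the election problem concrete, here is a minimal sketch of the deadlock; the function name and structure are hypothetical illustrations, not the actual ovirt-hosted-engine-ha code. With a strict "greater than" comparison, each host only starts the VM if its score beats every other host's, so two hosts that both report 2400 wait on each other forever:

    # Hypothetical sketch of the election deadlock described above.
    def should_start_vm(local_score, remote_scores):
        # A host starts the engine VM only if its score strictly beats
        # every remote host's score.
        return all(local_score > s for s in remote_scores)

    # Both hosts report 2400, so each sees a peer it cannot beat and
    # stays in EngineDown:
    print(should_start_vm(2400, [2400]))  # host 1 -> False
    print(should_start_vm(2400, [2400]))  # host 2 -> False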
I believe it's a duplicate of 1093366. When the VM is starting, the current code drops the score of the host starting it, which makes the other hosts better targets, so the agent stops starting the VM and leaves it for another host, and the cycle repeats...

*** This bug has been marked as a duplicate of bug 1093366 ***
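A rough simulation of that cycle (purely illustrative; the names, score values, and transitions are assumptions, not the agent's real code):

    # Hypothetical simulation of the start/stop cycle described above.
    scores = {1: 2400, 2: 2400}

    def try_to_start_vm(host):
        scores[host] = 0                   # score drops while the VM boots
        other = 2 if host == 1 else 1
        if scores[other] > scores[host]:   # the peer now looks better,
            scores[host] = 2400            # so give up, restore the score,
            return other                   # and let the other host try next
        return host

    host = 1
    for _ in range(4):
        host = try_to_start_vm(host)       # alternates 2, 1, 2, 1: the VM
                                           # never actually comes up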
Not a dupe after all.
Checked on ovirt-hosted-engine-ha-1.2.1-0.2.master.20140805072346.el6.noarch.

I tried the following scenario to reproduce the bug:
1) Start with a hosted-engine environment with 2 hosts and a running engine VM
2) Set global maintenance mode (hosted-engine --set-maintenance --mode=global)
3) Destroy the engine VM (vdsClient -s 0 destroy vm_id)
4) Set the maintenance mode back to none (hosted-engine --set-maintenance --mode=none)
5) Wait...

After about 10 minutes the VM is still down:

# hosted-engine --vm-status

--== Host 1 status ==--

Status up-to-date       : True
Hostname                : 10.35.64.85
Host ID                 : 1
Engine status           : {"reason": "vm not running on this host", "health": "bad", "vm": "down", "detail": "unknown"}
Score                   : 2400
Local maintenance       : False
Host timestamp          : 1137844
Extra metadata (valid at timestamp):
    metadata_parse_version=1
    metadata_feature_version=1
    timestamp=1137844 (Thu Aug 7 14:50:30 2014)
    host-id=1
    score=2400
    maintenance=False
    state=EngineDown

--== Host 2 status ==--

Status up-to-date       : True
Hostname                : 10.35.97.36
Host ID                 : 2
Engine status           : {"reason": "vm not running on this host", "health": "bad", "vm": "down", "detail": "unknown"}
Score                   : 2400
Local maintenance       : False
Host timestamp          : 962199
Extra metadata (valid at timestamp):
    metadata_parse_version=1
    metadata_feature_version=1
    timestamp=962199 (Thu Aug 7 14:49:28 2014)
    host-id=2
    score=2400
    maintenance=False
    state=EngineDown

From the agent log, I see that each host thinks the other host is a better candidate to run the VM. Again, I can start the VM manually.
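For convenience, the same reproduction steps as a small script. This is a hypothetical driver, not part of the product: VM_ID is a placeholder for the engine VM's UUID, and it assumes a 2-host setup with hosted-engine and vdsClient on PATH.

    # Hypothetical driver for the reproduction steps above.
    import subprocess
    import time

    VM_ID = "<engine-vm-uuid>"  # placeholder; look it up with `vdsClient -s 0 list`

    def run(*cmd):
        print("$", " ".join(cmd))
        subprocess.run(cmd, check=False)

    run("hosted-engine", "--set-maintenance", "--mode=global")  # step 2
    run("vdsClient", "-s", "0", "destroy", VM_ID)               # step 3
    run("hosted-engine", "--set-maintenance", "--mode=none")    # step 4
    time.sleep(600)                                             # step 5: wait ~10 minutes
    run("hosted-engine", "--vm-status")                         # engine VM is still down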
Created attachment 924891 [details]
agent logs
Correction: when I start the VM manually via hosted-engine --vm-start, it also fails, because:

MainThread::INFO::2014-08-07 14:55:45,661::states::567::ovirt_hosted_engine_ha.agent.hosted_engine.HostedEngine::(score) Score is 0 due to bad engine health at Thu Aug 7 14:55:45 2014
MainThread::INFO::2014-08-07 14:55:45,662::hosted_engine::326::ovirt_hosted_engine_ha.agent.hosted_engine.HostedEngine::(start_monitoring) Current state EngineUpBadHealth (score: 0)
MainThread::INFO::2014-08-07 14:55:45,662::hosted_engine::331::ovirt_hosted_engine_ha.agent.hosted_engine.HostedEngine::(start_monitoring) Best remote host 10.35.97.36 (id: 2, score: 2400)
MainThread::ERROR::2014-08-07 14:55:55,692::states::553::ovirt_hosted_engine_ha.agent.hosted_engine.HostedEngine::(consume) Engine VM has bad health status, timeout in 300 seconds
MainThread::INFO::2014-08-07 14:55:55,693::states::567::ovirt_hosted_engine_ha.agent.hosted_engine.HostedEngine::(score) Score is 0 due to bad engine health at Thu Aug 7 14:55:55 2014
MainThread::ERROR::2014-08-07 14:55:55,693::states::382::ovirt_hosted_engine_ha.agent.hosted_engine.HostedEngine::(consume) Host 10.35.97.36 (id 2) score is significantly better than local score, shutting down VM on this host
MainThread::INFO::2014-08-07 14:55:55,705::state_decorators::88::ovirt_hosted_engine_ha.agent.hosted_engine.HostedEngine::(check) Timeout cleared while transitioning <class 'ovirt_hosted_engine_ha.agent.states.EngineUpBadHealth'> -> <class 'ovirt_hosted_engine_ha.agent.states.EngineStop'>
MainThread::INFO::2014-08-07 14:55:55,717::brokerlink::111::ovirt_hosted_engine_ha.lib.brokerlink.BrokerLink::(notify) Trying: notify time=1407412555.72 type=state_transition detail=EngineUpBadHealth-EngineStop hostname='master-vds10.qa.lab.tlv.redhat.com'
MainThread::INFO::2014-08-07 14:55:56,481::brokerlink::120::ovirt_hosted_engine_ha.lib.brokerlink.BrokerLink::(notify) Success, was notification of state_transition (EngineUpBadHealth-EngineStop) sent? sent
MainThread::INFO::2014-08-07 14:55:56,910::hosted_engine::326::ovirt_hosted_engine_ha.agent.hosted_engine.HostedEngine::(start_monitoring) Current state EngineStop (score: 2400)
MainThread::INFO::2014-08-07 14:55:56,911::hosted_engine::331::ovirt_hosted_engine_ha.agent.hosted_engine.HostedEngine::(start_monitoring) Best remote host 10.35.97.36 (id: 2, score: 2400)
MainThread::INFO::2014-08-07 14:56:06,940::hosted_engine::949::ovirt_hosted_engine_ha.agent.hosted_engine.HostedEngine::(_stop_engine_vm) Shutting down vm using `/usr/sbin/hosted-engine --vm-shutdown`
MainThread::INFO::2014-08-07 14:56:07,133::hosted_engine::954::ovirt_hosted_engine_ha.agent.hosted_engine.HostedEngine::(_stop_engine_vm) stdout: Machine shutting down
MainThread::INFO::2014-08-07 14:56:07,134::hosted_engine::955::ovirt_hosted_engine_ha.agent.hosted_engine.HostedEngine::(_stop_engine_vm) stderr:
MainThread::ERROR::2014-08-07 14:56:07,134::hosted_engine::963::ovirt_hosted_engine_ha.agent.hosted_engine.HostedEngine::(_stop_engine_vm) Engine VM stopped on localhost
MainThread::INFO::2014-08-07 14:56:07,147::state_decorators::95::ovirt_hosted_engine_ha.agent.hosted_engine.HostedEngine::(check) Timeout set to Thu Aug 7 15:01:06 2014 while transitioning <class 'ovirt_hosted_engine_ha.agent.states.EngineStop'> -> <class 'ovirt_hosted_engine_ha.agent.states.EngineStop'>
MainThread::INFO::2014-08-07 14:56:07,161::brokerlink::111::ovirt_hosted_engine_ha.lib.brokerlink.BrokerLink::(notify) Trying: notify time=1407412567.16 type=state_transition detail=EngineStop-EngineStop hostname='master-vds10.qa.lab.tlv.redhat.com'
MainThread::INFO::2014-08-07 14:56:07,225::brokerlink::120::ovirt_hosted_engine_ha.lib.brokerlink.BrokerLink::(notify) Success, was notification of state_transition (EngineStop-EngineStop) sent? sent
MainThread::INFO::2014-08-07 14:56:07,699::hosted_engine::326::ovirt_hosted_engine_ha.agent.hosted_engine.HostedEngine::(start_monitoring) Current state EngineStop (score: 2400)
I also see that the agent shuts the VM down gracefully (hosted-engine --vm-shutdown) instead of powering it off, but a graceful shutdown works only when the guest agent is running. So you end up stuck with the VM in the Down state, and the agent does not start the VM on the other host until you destroy it manually.
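A minimal sketch of the kind of fallback this suggests: try the graceful shutdown first, then force the VM off if it is never acknowledged. The helper name, the polling loop, and the timeout value are assumptions for illustration, not the agent's actual _stop_engine_vm; only the hosted-engine --vm-shutdown, --vm-status, and --vm-poweroff commands are taken from the thread.

    # Hypothetical stop helper: graceful shutdown with a forced-poweroff
    # fallback when no guest agent answers.
    import subprocess
    import time

    SHUTDOWN_TIMEOUT = 300  # seconds; assumed value

    def stop_engine_vm():
        subprocess.run(["hosted-engine", "--vm-shutdown"], check=False)
        deadline = time.time() + SHUTDOWN_TIMEOUT
        while time.time() < deadline:
            out = subprocess.run(["hosted-engine", "--vm-status"],
                                 capture_output=True, text=True).stdout
            if '"vm": "down"' in out:
                return  # graceful shutdown completed
            time.sleep(10)
        # No guest agent, so the shutdown request was never honored:
        # force the VM off so another host can take over.
        subprocess.run(["hosted-engine", "--vm-poweroff"], check=False)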
Verified on ovirt-hosted-engine-ha-1.2.1-0.2.master.20140818121322.20140818121320.gitcbf096f.el6.noarch
Since the problem described in this bug report should be resolved in a recent advisory, it has been closed with a resolution of ERRATA. For information on the advisory, and where to find the updated files, follow the link below. If the solution does not work for you, open a new bug report. https://rhn.redhat.com/errata/RHBA-2015-0194.html