Created attachment 891755 [details]
agent broker and vdsm logs from two hosts

Description of problem:
As a result of bug https://bugzilla.redhat.com/show_bug.cgi?id=1093621, I had to mount the storage domains manually via hosted-engine --connect-storage and start the HA agent via service ovirt-ha-agent start. As a result, I have two hosts:

--== Host 1 status ==--

Status up-to-date       : True
Hostname                : 10.35.64.85
Host ID                 : 1
Engine status           : {'reason': 'vm not running on this host', 'health': 'bad', 'vm': 'down', 'detail': 'unknown'}
Score                   : 2400
Local maintenance       : False
Host timestamp          : 1399022514
Extra metadata (valid at timestamp):
    metadata_parse_version=1
    metadata_feature_version=1
    timestamp=1399022514 (Fri May 2 12:21:54 2014)
    host-id=1
    score=2400
    maintenance=False
    state=EngineDown

--== Host 2 status ==--

Status up-to-date       : True
Hostname                : 10.35.97.36
Host ID                 : 2
Engine status           : {'reason': 'vm not running on this host', 'health': 'bad', 'vm': 'down', 'detail': 'unknown'}
Score                   : 2400
Local maintenance       : False
Host timestamp          : 1399022507
Extra metadata (valid at timestamp):
    metadata_parse_version=1
    metadata_feature_version=1
    timestamp=1399022507 (Fri May 2 12:21:47 2014)
    host-id=2
    score=2400
    maintenance=False
    state=EngineDown

The HA agent does not start the engine VM automatically.

Version-Release number of selected component (if applicable):
ovirt-hosted-engine-ha-1.1.2-2.el6ev.noarch

How reproducible:
Always

Steps to Reproduce:
1. See above

Actual results:
The engine VM stays in the Down state, and there are no error messages in the agent, broker, or vdsm logs.

Expected results:
The HA agent must start the engine VM automatically; if it does not, it must log an ERROR message in the agent log.

Additional info:
It is also possible to start the engine VM manually (hosted-engine --vm-start); after doing so, the deadlock seems to vanish.
Jirka, can you please keep an eye on this and determine whether it is a duplicate of bug 1093366 as you progress with that fix?
I have the same problem. When all hosts have engine status "vm not running on this host", they all have a score of 2400. So when EngineDown's consume function is called, none of them has the "best score", and therefore none of them starts the engine VM.

As for starting the VM manually from the command line, there is a problem only when not in global maintenance mode. What happens is that EngineDown detects the VM is "unexpectedly running locally" and transitions to EngineUp. While the VM is starting, the state transitions to EngineUpBadHealth, since the VM is up but the engine has not finished starting (failed liveliness check). The score is then set to 0. Then, when EngineUpBadHealth's consume function calls EngineUp's consume function (if I understand correctly), the VM is immediately shut down because another host has a better score (which will always be the case, since we have multiple hosts with a high score and no running VM). We then enter a loop where each host starts and almost immediately stops the VM.

The only solution I have found so far is to check whether the class is an instance of EngineUpBadHealth, at line 347 of ovirt_hosted_engine_ha/agent/states.py:

    elif (new_data.best_score_host and
          new_data.best_score_host["host-id"] != new_data.host_id and
          new_data.best_score_host["score"] >=
              self.score(logger) + self.MIGRATION_THRESHOLD_SCORE and
          not isinstance(self, EngineUpBadHealth)):
        logger.error("Host %s (id %d) score is significantly better"
                     " than local score, shutting down VM on this host",
                     new_data.best_score_host['hostname'],
                     new_data.best_score_host["host-id"])
        return EngineStop(new_data)

I think it is harmless, since EngineUpBadHealth has a timeout that will stop the VM if there is a problem. The election can then start again.
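To make the election problem concrete, here is a minimal sketch of the deadlock; the function name and structure are hypothetical illustrations, not the actual ovirt-hosted-engine-ha code. With a strict "greater than" comparison, each host only starts the VM if its score beats every other host's, so two hosts that both report 2400 wait on each other forever:

    # Hypothetical sketch of the election deadlock described above.
    def should_start_vm(local_score, remote_scores):
        # A host starts the engine VM only if its score strictly beats
        # every remote host's score.
        return all(local_score > s for s in remote_scores)

    # Both hosts report 2400, so each sees a peer it cannot beat and
    # stays in EngineDown:
    print(should_start_vm(2400, [2400]))  # host 1 -> False
    print(should_start_vm(2400, [2400]))  # host 2 -> False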
I believe it's a duplicate of 1093366. When the VM is starting, the current code drops the score of the host starting it, which makes the other hosts better targets, so the agent stops starting the VM and leaves it for another host, and the cycle repeats...

*** This bug has been marked as a duplicate of bug 1093366 ***
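A rough simulation of that cycle (purely illustrative; the names, score values, and transitions are assumptions, not the agent's real code):

    # Hypothetical simulation of the start/stop cycle described above.
    scores = {1: 2400, 2: 2400}

    def try_to_start_vm(host):
        scores[host] = 0                   # score drops while the VM boots
        other = 2 if host == 1 else 1
        if scores[other] > scores[host]:   # the peer now looks better,
            scores[host] = 2400            # so give up, restore the score,
            return other                   # and let the other host try next
        return host

    host = 1
    for _ in range(4):
        host = try_to_start_vm(host)       # alternates 2, 1, 2, 1: the VM
                                           # never actually comes up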
Not a dupe after all.
Checked on ovirt-hosted-engine-ha-1.2.1-0.2.master.20140805072346.el6.noarch.

I tried the following scenario to reproduce the bug:
1) Start with a hosted-engine environment with 2 hosts and a running engine VM
2) Set global maintenance mode (hosted-engine --set-maintenance --mode=global)
3) Destroy the engine VM (vdsClient -s 0 destroy vm_id)
4) Set the maintenance mode back to none (hosted-engine --set-maintenance --mode=none)
5) Wait...

After about 10 minutes the VM is still down:

# hosted-engine --vm-status

--== Host 1 status ==--

Status up-to-date       : True
Hostname                : 10.35.64.85
Host ID                 : 1
Engine status           : {"reason": "vm not running on this host", "health": "bad", "vm": "down", "detail": "unknown"}
Score                   : 2400
Local maintenance       : False
Host timestamp          : 1137844
Extra metadata (valid at timestamp):
    metadata_parse_version=1
    metadata_feature_version=1
    timestamp=1137844 (Thu Aug 7 14:50:30 2014)
    host-id=1
    score=2400
    maintenance=False
    state=EngineDown

--== Host 2 status ==--

Status up-to-date       : True
Hostname                : 10.35.97.36
Host ID                 : 2
Engine status           : {"reason": "vm not running on this host", "health": "bad", "vm": "down", "detail": "unknown"}
Score                   : 2400
Local maintenance       : False
Host timestamp          : 962199
Extra metadata (valid at timestamp):
    metadata_parse_version=1
    metadata_feature_version=1
    timestamp=962199 (Thu Aug 7 14:49:28 2014)
    host-id=2
    score=2400
    maintenance=False
    state=EngineDown

From the agent log, I see that each host thinks the other host is a better candidate to run the VM. Again, I can start the VM manually.
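For convenience, the same reproduction steps as a small script. This is a hypothetical driver, not part of the product: VM_ID is a placeholder for the engine VM's UUID, and it assumes a 2-host setup with hosted-engine and vdsClient on PATH.

    # Hypothetical driver for the reproduction steps above.
    import subprocess
    import time

    VM_ID = "<engine-vm-uuid>"  # placeholder; look it up with `vdsClient -s 0 list`

    def run(*cmd):
        print("$", " ".join(cmd))
        subprocess.run(cmd, check=False)

    run("hosted-engine", "--set-maintenance", "--mode=global")  # step 2
    run("vdsClient", "-s", "0", "destroy", VM_ID)               # step 3
    run("hosted-engine", "--set-maintenance", "--mode=none")    # step 4
    time.sleep(600)                                             # step 5: wait ~10 minutes
    run("hosted-engine", "--vm-status")                         # engine VM is still down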
Created attachment 924891 [details]
agent logs
Correction: when I start the VM manually via hosted-engine --vm-start, it also fails, because:

MainThread::INFO::2014-08-07 14:55:45,661::states::567::ovirt_hosted_engine_ha.agent.hosted_engine.HostedEngine::(score) Score is 0 due to bad engine health at Thu Aug 7 14:55:45 2014
MainThread::INFO::2014-08-07 14:55:45,662::hosted_engine::326::ovirt_hosted_engine_ha.agent.hosted_engine.HostedEngine::(start_monitoring) Current state EngineUpBadHealth (score: 0)
MainThread::INFO::2014-08-07 14:55:45,662::hosted_engine::331::ovirt_hosted_engine_ha.agent.hosted_engine.HostedEngine::(start_monitoring) Best remote host 10.35.97.36 (id: 2, score: 2400)
MainThread::ERROR::2014-08-07 14:55:55,692::states::553::ovirt_hosted_engine_ha.agent.hosted_engine.HostedEngine::(consume) Engine VM has bad health status, timeout in 300 seconds
MainThread::INFO::2014-08-07 14:55:55,693::states::567::ovirt_hosted_engine_ha.agent.hosted_engine.HostedEngine::(score) Score is 0 due to bad engine health at Thu Aug 7 14:55:55 2014
MainThread::ERROR::2014-08-07 14:55:55,693::states::382::ovirt_hosted_engine_ha.agent.hosted_engine.HostedEngine::(consume) Host 10.35.97.36 (id 2) score is significantly better than local score, shutting down VM on this host
MainThread::INFO::2014-08-07 14:55:55,705::state_decorators::88::ovirt_hosted_engine_ha.agent.hosted_engine.HostedEngine::(check) Timeout cleared while transitioning <class 'ovirt_hosted_engine_ha.agent.states.EngineUpBadHealth'> -> <class 'ovirt_hosted_engine_ha.agent.states.EngineStop'>
MainThread::INFO::2014-08-07 14:55:55,717::brokerlink::111::ovirt_hosted_engine_ha.lib.brokerlink.BrokerLink::(notify) Trying: notify time=1407412555.72 type=state_transition detail=EngineUpBadHealth-EngineStop hostname='master-vds10.qa.lab.tlv.redhat.com'
MainThread::INFO::2014-08-07 14:55:56,481::brokerlink::120::ovirt_hosted_engine_ha.lib.brokerlink.BrokerLink::(notify) Success, was notification of state_transition (EngineUpBadHealth-EngineStop) sent? sent
MainThread::INFO::2014-08-07 14:55:56,910::hosted_engine::326::ovirt_hosted_engine_ha.agent.hosted_engine.HostedEngine::(start_monitoring) Current state EngineStop (score: 2400)
MainThread::INFO::2014-08-07 14:55:56,911::hosted_engine::331::ovirt_hosted_engine_ha.agent.hosted_engine.HostedEngine::(start_monitoring) Best remote host 10.35.97.36 (id: 2, score: 2400)
MainThread::INFO::2014-08-07 14:56:06,940::hosted_engine::949::ovirt_hosted_engine_ha.agent.hosted_engine.HostedEngine::(_stop_engine_vm) Shutting down vm using `/usr/sbin/hosted-engine --vm-shutdown`
MainThread::INFO::2014-08-07 14:56:07,133::hosted_engine::954::ovirt_hosted_engine_ha.agent.hosted_engine.HostedEngine::(_stop_engine_vm) stdout: Machine shutting down
MainThread::INFO::2014-08-07 14:56:07,134::hosted_engine::955::ovirt_hosted_engine_ha.agent.hosted_engine.HostedEngine::(_stop_engine_vm) stderr:
MainThread::ERROR::2014-08-07 14:56:07,134::hosted_engine::963::ovirt_hosted_engine_ha.agent.hosted_engine.HostedEngine::(_stop_engine_vm) Engine VM stopped on localhost
MainThread::INFO::2014-08-07 14:56:07,147::state_decorators::95::ovirt_hosted_engine_ha.agent.hosted_engine.HostedEngine::(check) Timeout set to Thu Aug 7 15:01:06 2014 while transitioning <class 'ovirt_hosted_engine_ha.agent.states.EngineStop'> -> <class 'ovirt_hosted_engine_ha.agent.states.EngineStop'>
MainThread::INFO::2014-08-07 14:56:07,161::brokerlink::111::ovirt_hosted_engine_ha.lib.brokerlink.BrokerLink::(notify) Trying: notify time=1407412567.16 type=state_transition detail=EngineStop-EngineStop hostname='master-vds10.qa.lab.tlv.redhat.com'
MainThread::INFO::2014-08-07 14:56:07,225::brokerlink::120::ovirt_hosted_engine_ha.lib.brokerlink.BrokerLink::(notify) Success, was notification of state_transition (EngineStop-EngineStop) sent? sent
MainThread::INFO::2014-08-07 14:56:07,699::hosted_engine::326::ovirt_hosted_engine_ha.agent.hosted_engine.HostedEngine::(start_monitoring) Current state EngineStop (score: 2400)
I also see that the agent shuts the VM down gracefully (hosted-engine --vm-shutdown) instead of powering it off, but a graceful shutdown works only when the guest agent is running. So you end up stuck with the VM in the Down state, and the agent does not start the VM on the other host until you destroy it manually.
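A minimal sketch of the kind of fallback this suggests: try the graceful shutdown first, then force the VM off if it is never acknowledged. The helper name, the polling loop, and the timeout value are assumptions for illustration, not the agent's actual _stop_engine_vm; only the hosted-engine --vm-shutdown, --vm-status, and --vm-poweroff commands are taken from the thread.

    # Hypothetical stop helper: graceful shutdown with a forced-poweroff
    # fallback when no guest agent answers.
    import subprocess
    import time

    SHUTDOWN_TIMEOUT = 300  # seconds; assumed value

    def stop_engine_vm():
        subprocess.run(["hosted-engine", "--vm-shutdown"], check=False)
        deadline = time.time() + SHUTDOWN_TIMEOUT
        while time.time() < deadline:
            out = subprocess.run(["hosted-engine", "--vm-status"],
                                 capture_output=True, text=True).stdout
            if '"vm": "down"' in out:
                return  # graceful shutdown completed
            time.sleep(10)
        # No guest agent, so the shutdown request was never honored:
        # force the VM off so another host can take over.
        subprocess.run(["hosted-engine", "--vm-poweroff"], check=False)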
Verified on ovirt-hosted-engine-ha-1.2.1-0.2.master.20140818121322.20140818121320.gitcbf096f.el6.noarch
Since the problem described in this bug report should be resolved in a recent advisory, it has been closed with a resolution of ERRATA. For information on the advisory, and where to find the updated files, follow the link below. If the solution does not work for you, open a new bug report. https://rhn.redhat.com/errata/RHBA-2015-0194.html