Created attachment 944564 [details]
logs

Description of problem:
I have a hosted-engine environment with two hosts. I put the environment into global maintenance, killed the engine VM, and then disabled global maintenance. The VM started on one of the hosts, but the second host went to state=EngineUnexpectedlyDown.

Version-Release number of selected component (if applicable):
ovirt-hosted-engine-ha-1.2.2-2.el6ev.noarch

How reproducible:
Always

Steps to Reproduce:
1. Set up a hosted-engine environment with two hosts
2. Put the environment into global maintenance and kill the engine VM
3. Disable global maintenance
(rough commands are sketched under Additional info below)

Actual results:

--== Host 1 status ==--

Status up-to-date                  : True
Hostname                           : 10.35.97.36
Host ID                            : 1
Engine status                      : {"health": "good", "vm": "up", "detail": "up"}
Score                              : 2400
Local maintenance                  : False
Host timestamp                     : 103047
Extra metadata (valid at timestamp):
	metadata_parse_version=1
	metadata_feature_version=1
	timestamp=103047 (Tue Oct 7 15:10:18 2014)
	host-id=1
	score=2400
	maintenance=False
	state=EngineUp

--== Host 2 status ==--

Status up-to-date                  : True
Hostname                           : 10.35.64.85
Host ID                            : 2
Engine status                      : {"reason": "vm not running on this host", "health": "bad", "vm": "down", "detail": "unknown"}
Score                              : 0
Local maintenance                  : False
Host timestamp                     : 102939
Extra metadata (valid at timestamp):
	metadata_parse_version=1
	metadata_feature_version=1
	timestamp=102939 (Tue Oct 7 15:12:25 2014)
	host-id=2
	score=0
	maintenance=False
	state=EngineUnexpectedlyDown
	timeout=Fri Jan 2 06:42:26 1970

The VM started, but the second host reports state=EngineUnexpectedlyDown.

Expected results:
The VM starts and both hosts have state=EngineUp.

Additional info:
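For reference, the reproduction steps correspond roughly to the following commands on one of the HA hosts (a sketch; the exact way of killing the engine VM is an assumption, any hard stop such as killing the qemu process should behave the same):

  hosted-engine --set-maintenance --mode=global    # enter global maintenance
  hosted-engine --vm-poweroff                      # kill the engine VM
  hosted-engine --set-maintenance --mode=none      # leave global maintenance
  hosted-engine --vm-status                        # observe the states reported by both hosts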
This is expected behavior. The agents should sync after some time. Also please note that only the agent on the host that is running the engine VM will have state=EngineUp; the other one will have state=EngineDown.
We should at least improve the log message so it does not confuse users.
3.5.1 is already full of bugs (over 80), and since none of these bugs were marked as urgent for the 3.5.1 release in the tracker bug, moving to 3.5.2.
Checked on ovirt-hosted-engine-ha-1.3.0-0.3.beta.git183a4ff.el7ev.noarch.

I ran the scenario above but do not see any explanatory log message. Just to clarify: from the code, the flow will never enter the except branch, because the VM start action goes through vdsm.
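To illustrate what I mean, a minimal sketch (simplified, not the actual agent code; start_vm_via_vdsm is a hypothetical stand-in): when the start request is handed to vdsm, an ordinary start failure comes back in the response payload rather than as a raised exception, so an except branch around the call never fires and any explanation logged there is never seen.

  import logging

  log = logging.getLogger(__name__)

  def start_vm_via_vdsm():
      # hypothetical stand-in for the vdsm call; a failed start is reported
      # in the response dict instead of raising an exception in the agent
      return {"status": {"code": 1, "message": "VM failed to start"}}

  def start_engine_vm():
      try:
          response = start_vm_via_vdsm()
      except Exception:
          # an explanatory message logged here is effectively unreachable
          # for an ordinary "VM failed to start"
          log.error("Engine VM start raised an exception")
          raise
      if response["status"]["code"] != 0:
          # the failure actually surfaces here, in the vdsm response
          log.error("vdsm reported start failure: %s",
                    response["status"]["message"])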
There are two pieces to the fix: the start command might fail (the message there is already fixed), and two agents might try to start the VM at the same time. The second case is a duplicate of bug 1207634, but I will close the other one to keep the flags.
*** Bug 1207634 has been marked as a duplicate of this bug. ***
*** Bug 1164572 has been marked as a duplicate of this bug. ***
Verified on ovirt-hosted-engine-ha-1.3.0-0.5.beta.git9a2bd43.el7ev.noarch

MainThread::INFO::2015-09-16 15:59:20,360::states::746::ovirt_hosted_engine_ha.agent.hosted_engine.HostedEngine::(consume) Another host already took over..
MainThread::INFO::2015-09-16 15:59:20,368::state_decorators::88::ovirt_hosted_engine_ha.agent.hosted_engine.HostedEngine::(check) Timeout cleared while transitioning <class 'ovirt_hosted_engine_ha.agent.states.EngineStarting'> -> <class 'ovirt_hosted_engine_ha.agent.states.EngineForceStop'>
Since the problem described in this bug report should be resolved in a recent advisory, it has been closed with a resolution of ERRATA.

For information on the advisory, and where to find the updated files, follow the link below. If the solution does not work for you, open a new bug report.

https://rhn.redhat.com/errata/RHEA-2016-0422.html