Created attachment 891547 [details]
target host agent log

Description of problem:
Migration of the hosted-engine vm puts the target host's score to zero; in my case it also does not update the source host's score.

Version-Release number of selected component (if applicable):
ovirt-hosted-engine-ha-1.1.2-2.el6ev.noarch

How reproducible:
Always

Steps to Reproduce:
1. Set up a hosted-engine environment with two hosts and, on the host where the engine vm runs, block the connection to the storage domain (via iptables -I INPUT -s sd_ip -j DROP)
2. Wait until the vm starts on the second host and check the host scores
3.

Actual results:
--== Host 1 status ==--

Status up-to-date                  : True
Hostname                           : 10.35.64.85
Host ID                            : 1
Engine status                      : {'reason': 'bad vm status', 'health': 'bad', 'vm': 'up', 'detail': 'waitforlaunch'}
Score                              : 0
Local maintenance                  : False
Host timestamp                     : 1398953165
Extra metadata (valid at timestamp):
        metadata_parse_version=1
        metadata_feature_version=1
        timestamp=1398953165 (Thu May 1 17:06:05 2014)
        host-id=1
        score=0
        maintenance=False
        state=EngineUpBadHealth
        timeout=Thu May 1 17:10:34 2014

--== Host 2 status ==--

Status up-to-date                  : False
Hostname                           : 10.35.97.36
Host ID                            : 2
Engine status                      : unknown stale-data
Score                              : 2400
Local maintenance                  : False
Host timestamp                     : 1398953033
Extra metadata (valid at timestamp):
        metadata_parse_version=1
        metadata_feature_version=1
        timestamp=1398953033 (Thu May 1 17:03:53 2014)
        host-id=2
        score=2400
        maintenance=False
        state=EngineUp

Expected results:
The target host should have a score of 2400 (as it was before the migration started), and the source host should have a low score (because it has problems connecting to the storage domain).

Additional info:
The target host receives engine state=EngineUnexpectedlyDown, and vdsm.log shows:

Thread-7780::ERROR::2014-05-01 18:28:25,314::vm::2285::vm.Vm::(_startUnderlyingVm) vmId=`a8d328ea-991a-4a06-ac3a-cf2c11d4f264`::The vm start process failed
Traceback (most recent call last):
  File "/usr/share/vdsm/vm.py", line 2245, in _startUnderlyingVm
    self._run()
  File "/usr/share/vdsm/vm.py", line 3172, in _run
    self._connection.createXML(domxml, flags),
  File "/usr/lib64/python2.6/site-packages/vdsm/libvirtconnection.py", line 92, in wrapper
    ret = f(*args, **kwargs)
  File "/usr/lib64/python2.6/site-packages/libvirt.py", line 2665, in createXML
    if ret is None:raise libvirtError('virDomainCreateXML() failed', conn=self)
libvirtError: internal error Failed to acquire lock: error -243
Thread-7780::DEBUG::2014-05-01 18:28:25,321::vm::2727::vm.Vm::(setDownStatus) vmId=`a8d328ea-991a-4a06-ac3a-cf2c11d4f264`::Changed state to Down: internal error Failed to acquire lock: error -243

For this reason the host score is zero.
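For illustration only, a toy sketch of the failure path above (all names here are hypothetical; this is not vdsm code): the target host tries to start the vm while the lease is still held by the source host, the start fails, and the vm is marked Down, which the agent then scores as zero.

# Toy model: the vm's lease is still owned by the source host, so a
# second start attempt on the target host fails, mimicking error -243.
held_leases = {"hosted-engine": "source-host"}

class LockError(Exception):
    pass

def create_vm(host):
    owner = held_leases.get("hosted-engine")
    if owner is not None and owner != host:
        raise LockError("internal error Failed to acquire lock: error -243")
    held_leases["hosted-engine"] = host

try:
    create_vm("target-host")
except LockError as err:
    vm_state = "Down"  # analogous to setDownStatus in the traceback
    score = 0          # the agent then zeroes the host score
    print(err, vm_state, score)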
sanlock client status of the source host:

daemon d341e1a7-1277-492e-a7b7-b1c6649427f6.rose05.qa.
p -1 helper
p -1 listener
p -1 status
s 21caf848-8e2c-4d24-b709-c4e189fa5f4b:2:/rhev/data-center/mnt/10.35.160.108\:_RHEV_artyom__hosted__engine/21caf848-8e2c-4d24-b709-c4e189fa5f4b/dom_md/ids:0
s hosted-engine:2:/rhev/data-center/mnt/10.35.160.108\:_RHEV_artyom__hosted__engine/21caf848-8e2c-4d24-b709-c4e189fa5f4b/ha_agent/hosted-engine.lockspace:0
In my case, this seems to happen after adding a third host to the cluster, and it happens even without a migration. Putting a host into local maintenance jumps its score back up to 2400, but it drops back down to 0 after 5 or so minutes of maintenance --mode=none.
I have 3 hosts too and have the same problem. I'll upload the logs of all hosts + engine (vdsm, supervdsm, sanlock, agent-ha, agent-broker).

Sequence: started ovirt01, which started engine01, while ovirt02 and 03 were powered off. Then started ovirt02 and waited until it was stable, meaning hosted-engine --vm-status gave a correct status. Then started ovirt03 and waited until the error -243 showed up. Collected the logs.
Created attachment 907423 [details]
engine logs

Created attachment 907424 [details]
host01 logs

Created attachment 907425 [details]
host02 logs

Created attachment 907426 [details]
host03 logs
I had the same problem when adding a third host.

According to hosted_engine.py, engine_status_score is

engine_status_score_lookup = {
    'None': 0,
    'vm-down': 1,
    'vm-up bad-health-status': 2,
    'vm-up good-health-status': 3,
}

It seems that in state_machine.py, the refresh function of the EngineStateMachine class sets best_engine to the host with the lowest engine_status_score. The problem is that, in the consume function of the EngineDown class, new_data.best_engine_status["vm"] can then never be up.

Here's what I understood: node1 is running the hosted engine, so it has the highest engine_status_score (vm-up good-health-status). When node2 refreshes its data, it becomes the best_engine since it has the lowest engine_status_score (None). It then tries to start the engine. The same applies to node3. They cannot do this since the engine is up and running on node1 and (I think) is locked. They finally transition to state EngineUnexpectedlyDown.

I think best_engine should be the host with the highest engine_status_score. Changing line 124 of state_machine.py from "best_engine = min(alive_hosts," to "best_engine = max(alive_hosts," solved the problem for me. It never happened again and I could migrate the engine from one node to another without issue (which was not the case without this change).

It may not be the ideal solution as I'm just starting with oVirt, but I hope it will help in solving this bug.
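To make the analysis concrete, here is a minimal runnable sketch of that selection logic (the host records are hypothetical; the real refresh code in state_machine.py is more involved):

# Hypothetical host records using the lookup table above.
alive_hosts = [
    {"host-id": 1, "engine-status-score": 3},  # vm-up good-health-status
    {"host-id": 2, "engine-status-score": 0},  # None (no vm running)
    {"host-id": 3, "engine-status-score": 0},  # None (no vm running)
]

# Current behavior: min() picks an idle host, so EngineDown never sees
# a host whose vm is up and tries to start a second copy of the engine.
best_engine = min(alive_hosts, key=lambda h: h["engine-status-score"])
assert best_engine["host-id"] in (2, 3)

# Proposed fix: max() picks the host actually running the engine.
best_engine = max(alive_hosts, key=lambda h: h["engine-status-score"])
assert best_engine["host-id"] == 1

With min(), the idle hosts shadow the host that is actually running the engine, which matches the spurious start attempts and the EngineUnexpectedlyDown transitions seen in the logs above.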
(In reply to Benoit Laniel from comment #9)

Nice catch and good analysis, Benoit! Thanks and welcome aboard!
Meital, can you check if you have capacity to test this for 3.4.1? This is a pretty serious bug whose fix we'd really like to get in.
I can confirm the bug on a freshly installed oVirt 3.4.1 with only two hosts in HA that rely on external NFSv4.
I think there might be an issue in the host score calculation when a VM is migrating away. My guess is that once the status changes to something other than Up, we drop the score.
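If that guess is right, the effect would look roughly like the sketch below (the function, constant and state names are hypothetical, not the agent's actual code):

BASE_SCORE = 2400
MIGRATION_STATES = ("Migration Source", "Migration Destination")

def host_score(vm_state):
    # Suspected current rule: anything other than 'Up' zeroes the score,
    # which also punishes a host whose vm is merely migrating.
    return BASE_SCORE if vm_state == "Up" else 0

def host_score_fixed(vm_state):
    # Possible fix: treat an in-flight migration like a healthy vm.
    if vm_state == "Up" or vm_state in MIGRATION_STATES:
        return BASE_SCORE
    return 0

print(host_score("Migration Source"))        # 0    -> score dropped
print(host_score_fixed("Migration Source"))  # 2400 -> score kept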
I have a three node setup w/ hosted engine, using gluster nfs fronted by ctdb for the engine's storage. Every hosted engine migration triggers the error:

VM HostedEngine is down. Exit message: internal error: Failed to acquire lock: error -243.

As with other reports here, the HostedEngine never actually goes down.

This is on ovirt 3.4.2 w/ 3 F20 hosts and a CentOS 6.5 hosted engine.

I can upload logs, etc. if needed.
I can check it; we have three hosts for hosted engine, and I can also change this line in state_machine.py (from min to max).
(In reply to Jason Brooks from comment #15)
> I have a three node setup w/ hosted engine, using gluster nfs fronted by
> ctdb for the engine's storage. Every hosted engine migration triggers the
> error:
>
> VM HostedEngine is down. Exit message: internal error: Failed to acquire
> lock: error -243.
>
> As with other reports here, the HostedEngine never actually goes down.
>
> This is on ovirt 3.4.2 w/ 3 F20 hosts and a CentOS 6.5 hosted engine.
>
> I can upload logs, etc. if needed.

This is a different bug. Please create a separate ticket for it, upload the logs, and describe how you ran the migration.
*** Bug 1093638 has been marked as a duplicate of this bug. ***
Verified on ovirt-hosted-engine-ha-1.2.1-0.2.master.20140805072346.el6.noarch

Checked with 3 hosts; all works fine. Also checked the scenario from the description: the vm migrated without dropping the destination host score to zero.
Since the problem described in this bug report should be resolved in a recent advisory, it has been closed with a resolution of ERRATA.

For information on the advisory, and where to find the updated files, follow the link below.

If the solution does not work for you, open a new bug report.

https://rhn.redhat.com/errata/RHBA-2015-0194.html