Description of problem:
When starting the hosted engine VM, either manually or by waiting for the HA agent, the start fails: the agent's state machine is put directly into the EngineUp state, which expects a fully operational engine and does not wait for the VM to finish booting. The agent therefore falls into the EngineUpBadHealth state and kills the VM. The same sequence then repeats on the other hosts.

Version-Release number of selected component (if applicable):
ovirt-hosted-engine-ha-1.2.1-0.2.master.20140805072346.el6.noarch

How reproducible:
100%

Steps to Reproduce:
1. install HE on a cluster with 3+ hosts
2. kill the HE guest (halt -p)
3. run hosted-engine --vm-start

Actual results:
the VM is killed while powering up

Expected results:
engine up and running after a while

Additional info:
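For illustration only, here is a minimal Python sketch of the behaviour described above. This is hypothetical code, not the agent's actual state machine; the state names are taken from this report, and the EngineStarting case is shown only to contrast with going straight to EngineUp:

# Minimal illustrative sketch, NOT the actual ovirt-hosted-engine-ha agent
# code; the transition logic is deliberately simplified.
def next_state(current, vm_up, engine_health_good):
    """Return the next agent state for a simplified two-state comparison."""
    if current == "EngineUp":
        # EngineUp expects a fully operational engine, so a VM that is still
        # powering up fails the health check and the state degrades.
        return "EngineUp" if (vm_up and engine_health_good) else "EngineUpBadHealth"
    if current == "EngineStarting":
        # A dedicated "starting" state tolerates "vm up, health bad" while
        # the engine boots (until some timeout, omitted here).
        return "EngineUp" if (vm_up and engine_health_good) else "EngineStarting"
    return current

# Right after "hosted-engine --vm-start": the VM is up, the engine is not yet healthy.
print(next_state("EngineUp", True, False))        # EngineUpBadHealth -> leads to the VM being killed
print(next_state("EngineStarting", True, False))  # EngineStarting -> the VM is left to finish booting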
Missing merge on 1.2 branch
Same behaviour as bug 1147411, here even with only two hosts. The host on which the VM is started manually first reports it as powering up, but then it goes to powering down:

[root@brown-vdsd ~]# hosted-engine --vm-status

--== Host 1 status ==--

Status up-to-date                  : True
Hostname                           : 10.35.103.12
Host ID                            : 1
Engine status                      : {"reason": "vm not running on this host", "health": "bad", "vm": "down", "detail": "unknown"}
Score                              : 2400
Local maintenance                  : False
Host timestamp                     : 96751
Extra metadata (valid at timestamp):
	metadata_parse_version=1
	metadata_feature_version=1
	timestamp=96751 (Wed Oct 22 11:10:48 2014)
	host-id=1
	score=2400
	maintenance=False
	state=EngineDown

--== Host 3 status ==--

Status up-to-date                  : True
Hostname                           : 10.35.106.13
Host ID                            : 3
Engine status                      : {"reason": "bad vm status", "health": "bad", "vm": "up", "detail": "powering up"}
Score                              : 2400
Local maintenance                  : False
Host timestamp                     : 77390
Extra metadata (valid at timestamp):
	metadata_parse_version=1
	metadata_feature_version=1
	timestamp=77390 (Wed Oct 22 08:10:47 2014)
	host-id=3
	score=2400
	maintenance=False
	state=EngineStop
	timeout=Thu Jan 1 23:34:27 1970

[root@brown-vdsd ~]# hosted-engine --vm-status
[root@brown-vdsd ~]# hosted-engine --vm-status

--== Host 1 status ==--

Status up-to-date                  : True
Hostname                           : 10.35.103.12
Host ID                            : 1
Engine status                      : {"reason": "vm not running on this host", "health": "bad", "vm": "down", "detail": "unknown"}
Score                              : 2400
Local maintenance                  : False
Host timestamp                     : 96961
Extra metadata (valid at timestamp):
	metadata_parse_version=1
	metadata_feature_version=1
	timestamp=96961 (Wed Oct 22 11:14:18 2014)
	host-id=1
	score=2400
	maintenance=False
	state=EngineDown

--== Host 3 status ==--

Status up-to-date                  : True
Hostname                           : 10.35.106.13
Host ID                            : 3
Engine status                      : {"health": "good", "vm": "up", "detail": "powering down"}
Score                              : 2400
Local maintenance                  : False
Host timestamp                     : 77593
Extra metadata (valid at timestamp):
	metadata_parse_version=1
	metadata_feature_version=1
	timestamp=77593 (Wed Oct 22 08:14:10 2014)
	host-id=3
	score=2400
	maintenance=False
	state=EngineStop
	timeout=Thu Jan 1 23:34:27 1970

The engine is actually up, while the HE VM is shown in the GUI as powering down, then powering up, and then it stays up.
After some time the engine goes back to UP in the GUI, and the CLI shows the following:

[root@brown-vdsd ~]# hosted-engine --vm-status

--== Host 1 status ==--

Status up-to-date                  : True
Hostname                           : 10.35.103.12
Host ID                            : 1
Engine status                      : {"reason": "bad vm status", "health": "bad", "vm": "down", "detail": "down"}
Score                              : 2400
Local maintenance                  : False
Host timestamp                     : 97097
Extra metadata (valid at timestamp):
	metadata_parse_version=1
	metadata_feature_version=1
	timestamp=97097 (Wed Oct 22 11:16:34 2014)
	host-id=1
	score=2400
	maintenance=False
	state=EngineStarting

--== Host 3 status ==--

Status up-to-date                  : True
Hostname                           : 10.35.106.13
Host ID                            : 3
Engine status                      : {"reason": "bad vm status", "health": "bad", "vm": "up", "detail": "powering up"}
Score                              : 2400
Local maintenance                  : False
Host timestamp                     : 77732
Extra metadata (valid at timestamp):
	metadata_parse_version=1
	metadata_feature_version=1
	timestamp=77732 (Wed Oct 22 08:16:28 2014)
	host-id=3
	score=2400
	maintenance=False
	state=EngineStarting

You have new mail in /var/spool/mail/root
[root@brown-vdsd ~]# hosted-engine --vm-status

--== Host 1 status ==--

Status up-to-date                  : True
Hostname                           : 10.35.103.12
Host ID                            : 1
Engine status                      : {"reason": "bad vm status", "health": "bad", "vm": "down", "detail": "down"}
Score                              : 2400
Local maintenance                  : False
Host timestamp                     : 97131
Extra metadata (valid at timestamp):
	metadata_parse_version=1
	metadata_feature_version=1
	timestamp=97131 (Wed Oct 22 11:17:08 2014)
	host-id=1
	score=2400
	maintenance=False
	state=EngineStarting

--== Host 3 status ==--

Status up-to-date                  : True
Hostname                           : 10.35.106.13
Host ID                            : 3
Engine status                      : {"reason": "failed liveliness check", "health": "bad", "vm": "up", "detail": "up"}
Score                              : 2400
Local maintenance                  : False
Host timestamp                     : 77765
Extra metadata (valid at timestamp):
	metadata_parse_version=1
	metadata_feature_version=1
	timestamp=77765 (Wed Oct 22 08:17:02 2014)
	host-id=3
	score=2400
	maintenance=False
	state=EngineStarting

The behaviour should be stable and the VM should stay powered up.
The behaviour is not the same: in this bug the liveliness check fails, which means the agent cannot communicate with the engine (it cannot access the health status page). My guess is that your network is somehow broken or the VM running the engine is overloaded. Either way, this is expected behaviour, and you should wait a while to see whether it comes back to 'up' with health 'good'. When you reproduce this again, please run this command to test the accessibility of the engine status page and check whether it fetches the page correctly:

curl http://{fqdn}/ovirt-engine/services/health

also please not
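For convenience, here is a rough Python 2 (EL6) sketch of the same probe the curl command above performs. This is an assumption for illustration, not the agent's actual liveliness-check implementation; the 10-second timeout and the engine_alive helper name are made up here, only the URL pattern comes from this comment:

# Hedged sketch: fetch the engine health status page and treat anything
# other than an HTTP 200 response as a failed liveliness check.
import sys
import urllib2  # the hosts in this report run EL6, i.e. Python 2


def engine_alive(fqdn, timeout=10):
    url = "http://%s/ovirt-engine/services/health" % fqdn
    try:
        response = urllib2.urlopen(url, timeout=timeout)
        return response.getcode() == 200
    except Exception as exc:
        sys.stderr.write("health check failed: %s\n" % exc)
        return False


if __name__ == "__main__":
    # Usage: python check_health.py engine.example.com
    print(engine_alive(sys.argv[1]))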
Works for me on these components:

ovirt-host-deploy-1.3.0-2.el6ev.noarch
ovirt-hosted-engine-setup-1.2.1-8.el6ev.noarch
qemu-kvm-rhev-0.12.1.2-2.448.el6.x86_64
mom-0.4.1-4.el6ev.noarch
sanlock-2.8-1.el6.x86_64
vdsm-4.16.8.1-3.el6ev.x86_64
libvirt-0.10.2-46.el6_6.2.x86_64
ovirt-hosted-engine-ha-1.2.4-3.el6ev.noarch
rhevm-3.5.0-0.25.el6ev.noarch
Since the problem described in this bug report should be resolved in a recent advisory, it has been closed with a resolution of ERRATA. For information on the advisory, and where to find the updated files, follow the link below. If the solution does not work for you, open a new bug report.

https://rhn.redhat.com/errata/RHBA-2015-0194.html