Description of problem:
The HE VM is not powered up on host alma03 after the ovirt-ha-broker service is stopped on alma04.

ovirt_hosted_engine_ha.agent.hosted_engine.HostedEngine::(score) Score is 0 due to unexpected vm shutdown.

[root@alma03 ~]# hosted-engine --vm-status

--== Host 1 status ==--

Status up-to-date                  : True
Hostname                           : alma03.qa.lab.tlv.redhat.com
Host ID                            : 1
Engine status                      : {"reason": "vm not running on this host", "health": "bad", "vm": "down", "detail": "unknown"}
Score                              : 2400
Local maintenance                  : False
Host timestamp                     : 66481
Extra metadata (valid at timestamp):
        metadata_parse_version=1
        metadata_feature_version=1
        timestamp=66481 (Tue Mar 31 10:45:29 2015)
        host-id=1
        score=2400
        maintenance=False
        state=EngineDown

--== Host 2 status ==--

Status up-to-date                  : True
Hostname                           : alma04.qa.lab.tlv.redhat.com
Host ID                            : 2
Engine status                      : {"health": "good", "vm": "up", "detail": "up"}
Score                              : 2400
Local maintenance                  : False
Host timestamp                     : 66442
Extra metadata (valid at timestamp):
        metadata_parse_version=1
        metadata_feature_version=1
        timestamp=66442 (Tue Mar 31 10:44:59 2015)
        host-id=2
        score=2400
        maintenance=False
        state=EngineUp

[root@alma03 ~]# hosted-engine --vm-status

--== Host 1 status ==--

Status up-to-date                  : True
Hostname                           : alma03.qa.lab.tlv.redhat.com
Host ID                            : 1
Engine status                      : {"reason": "vm not running on this host", "health": "bad", "vm": "down", "detail": "unknown"}
Score                              : 0
Local maintenance                  : False
Host timestamp                     : 66600
Extra metadata (valid at timestamp):
        metadata_parse_version=1
        metadata_feature_version=1
        timestamp=66600 (Tue Mar 31 10:47:28 2015)
        host-id=1
        score=0
        maintenance=False
        state=EngineUnexpectedlyDown
        timeout=Thu Jan 1 18:39:07 1970

--== Host 2 status ==--

Status up-to-date                  : False
Hostname                           : alma04.qa.lab.tlv.redhat.com
Host ID                            : 2
Engine status                      : unknown stale-data
Score                              : 2400
Local maintenance                  : False
Host timestamp                     : 66442
Extra metadata (valid at timestamp):
        metadata_parse_version=1
        metadata_feature_version=1
        timestamp=66442 (Tue Mar 31 10:44:59 2015)
        host-id=2
        score=2400
        maintenance=False
        state=EngineUp

Version-Release number of selected component (if applicable):

How reproducible:
100%

Steps to Reproduce:
1. Deploy HE on two RHEVH 6.6 hosts (20150304.0.el6ev).
2. Stop the ovirt-ha-broker service on the host that is currently running the HE VM.
3. Wait for the HE VM to start on the other host and observe that its score changes to 0 for no apparent reason.

Actual results:
The HE VM is not started on the second host after the ovirt-ha-broker service is stopped on the host that is running the HE VM.

Expected results:
The HE VM should be started on the second host, and the score should not be zero.

Additional info:
Logs attached.
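For anyone trying to spot this condition from a script, here is a minimal sketch that reads the key=value pairs shown under "Extra metadata" in the status output and flags a host that is sitting at score 0 in the EngineUnexpectedlyDown state. The helper names parse_extra_metadata and is_penalized are hypothetical and not part of ovirt-hosted-engine-ha; the sample lines are copied from the second status dump in this report.

```python
def parse_extra_metadata(lines):
    """Turn 'key=value' lines from the 'Extra metadata' block into a dict.

    Hypothetical helper: it only splits each line on the first '=' and
    keeps the values as strings.
    """
    meta = {}
    for line in lines:
        line = line.strip()
        if "=" not in line:
            continue  # skip blank lines or anything that is not key=value
        key, _, value = line.partition("=")
        meta[key] = value
    return meta


def is_penalized(meta):
    """True if the host reports score 0 together with EngineUnexpectedlyDown."""
    return meta.get("score") == "0" and meta.get("state") == "EngineUnexpectedlyDown"


# Metadata for Host 1 (alma03) from the second status dump in this report.
host1_lines = [
    "metadata_parse_version=1",
    "metadata_feature_version=1",
    "timestamp=66600 (Tue Mar 31 10:47:28 2015)",
    "host-id=1",
    "score=0",
    "maintenance=False",
    "state=EngineUnexpectedlyDown",
    "timeout=Thu Jan 1 18:39:07 1970",
]

host1_meta = parse_extra_metadata(host1_lines)
print(host1_meta["state"], is_penalized(host1_meta))
```

The sketch deliberately ignores the "Status up-to-date" field, so a stale entry for the other host would need to be checked separately.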
Created attachment 1008955 [details] agent.log
Created attachment 1008956 [details] broker.log
Created attachment 1008957 [details] alma03 logs
Created attachment 1008958 [details] alma03 logs
Components that were used on Red Hat Enterprise Virtualization Hypervisor 6.6 (20150304.0.el6ev):
sanlock-2.8-1.el6.x86_64
mom-0.4.1-4.el6ev.noarch
ovirt-node-selinux-3.2.1-9.el6.noarch
ovirt-host-deploy-offline-1.3.0-3.el6ev.x86_64
ovirt-node-plugin-vdsm-0.2.0-19.el6ev.noarch
ovirt-host-deploy-1.3.0-2.el6ev.noarch
libvirt-client-0.10.2-46.el6_6.3.x86_64
ovirt-node-plugin-rhn-3.2.1-9.el6.noarch
ovirt-node-3.2.1-9.el6.noarch
vdsm-4.16.8.1-7.el6ev.x86_64
ovirt-hosted-engine-ha-1.2.5-1.el6ev.noarch
ovirt-node-plugin-hosted-engine-0.2.0-9.0.el6ev.x86_64
ovirt-node-plugin-cim-3.2.1-9.el6.noarch
ovirt-node-branding-rhev-3.2.1-9.el6.noarch
qemu-kvm-rhev-0.12.1.2-2.446.el6.x86_64
ovirt-hosted-engine-setup-1.2.2-1.el6ev.noarch
ovirt-node-plugin-snmp-3.2.1-9.el6.noarch

On engine, Red Hat Enterprise Linux Server release 6.6 (Santiago):
rhevm-guest-agent-common-1.0.10-2.el6ev.noarch
rhevm-3.5.1-0.2.el6ev.noarch
This is not urgent at all, because I have never seen it in production.

When you stop the broker on the first host, the second host stops receiving updates from it, assumes the first host is dead, and tries to start the engine VM. But the engine is still running on the first host, so sanlock prevents the VM from starting on the second host. That puts the second host into the EngineUnexpectedlyDown state for ten minutes, and its score is reduced to 0 while it is in that state.

There is one known issue here: we do not know the reason for the VM crash, so we can't distinguish sanlock protection from a real crash.
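The sequence described in this comment can be sketched as a toy model. This is not the agent's actual code: the Host class, the try_start_engine function, and the lease_holder argument are hypothetical names made up for illustration. The 2400 base score and the ten-minute penalty are taken from this report; everything else is an assumption.

```python
from dataclasses import dataclass

BASE_SCORE = 2400          # score reported for a healthy host in this bug
PENALTY_SECONDS = 10 * 60  # "ten minutes" in EngineUnexpectedlyDown


@dataclass
class Host:
    """Hypothetical stand-in for one HA host and its agent state."""
    name: str
    state: str = "EngineDown"
    penalty_until: float = 0.0  # time at which the score penalty ends

    def score(self, now):
        # The score is 0 for as long as the host is being penalized.
        if self.state == "EngineUnexpectedlyDown" and now < self.penalty_until:
            return 0
        return BASE_SCORE


def try_start_engine(host, lease_holder, now):
    """Try to start the engine VM on `host`.

    `lease_holder` is the name of the host that currently holds the
    storage lease (None if nobody does). If another host holds it, the
    start fails, which this model treats exactly like a crashed VM,
    because the cause of the failure is not known to the caller.
    """
    if lease_holder is not None and lease_holder != host.name:
        host.state = "EngineUnexpectedlyDown"
        host.penalty_until = now + PENALTY_SECONDS
    else:
        host.state = "EngineUp"
    return host.state


# alma04 still runs the engine and holds the lease, but its broker is
# stopped, so alma03 sees stale data and attempts a start of its own.
alma03 = Host("alma03")
result = try_start_engine(alma03, lease_holder="alma04", now=66600)
print(result, alma03.score(now=66601))
```

In this model the failed start is recorded without its cause, which mirrors the known issue: a start blocked by the lease and a genuine VM crash both end in the same state and the same penalty.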
*** This bug has been marked as a duplicate of bug 1150087 ***