Description of problem:
When the HE VM fails to start on a host because it is already running elsewhere, QEMU returns a specific error message. The ha-broker compares this error message against a few hard-coded strings to detect whether the VM failed to start because of locked storage or for some other reason. Since qemu-kvm-ev version 2.12.0-33.1, the error message has changed and the ha-broker no longer recognizes this failure case.

Version-Release number of selected component (if applicable):
qemu-kvm-ev 10:2.12.0-33.1.el7_7.4
ovirt-hosted-engine-ha 2.3.6

Steps to Reproduce:
1. Have HE deployed on 2 hosts.
2. Set global maintenance.
3. Shut down the HE VM.
4. Once the VM is down, clear the global maintenance mode.
5. Wait for the VM to start on one of the hosts.
6. Check agent.log on the other one.

Actual results:
The agent misinterprets the reason why the VM failed to start and moves to the EngineUnexpectedlyDown state.

Expected results:
The agent correctly detects that the VM is running elsewhere and moves to the EngineForceStop state.
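The detection described above amounts to substring matching on the QEMU exit message. A minimal sketch of that idea follows; the marker strings, function name, and behavior are illustrative assumptions for this report, not the actual ovirt-hosted-engine-ha code:

```python
# Hypothetical sketch of the ha-broker's failure classification.
# The marker strings below are assumptions: older and newer qemu-kvm-ev
# builds word the image-locking error differently, which is exactly why
# hard-coded matching breaks when the message changes.
LOCKED_STORAGE_MARKERS = (
    "Failed to acquire lock",              # assumed older qemu-kvm-ev wording
    "Is another process using the image",  # assumed newer qemu-kvm-ev wording
)

def vm_running_elsewhere(exit_message: str) -> bool:
    """Return True if the QEMU exit message suggests the HE VM's shared
    storage is locked, i.e. the VM is already running on another host."""
    return any(marker in exit_message for marker in LOCKED_STORAGE_MARKERS)
```

If the broker only knows the older marker, a newer message falls through to the generic failure path, producing the EngineUnexpectedlyDown transition this bug describes.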
The engine started without any issues while I followed the exact reproduction steps from https://bugzilla.redhat.com/show_bug.cgi?id=1794089#c0. The engine initially ran on puma18; after the reproduction steps it started on the secondary HA host puma19.

MainThread::INFO::2020-04-06 21:57:47,687::states::736::ovirt_hosted_engine_ha.agent.hosted_engine.HostedEngine::(consume) VM is powering up..
MainThread::INFO::2020-04-06 21:57:47,825::hosted_engine::517::ovirt_hosted_engine_ha.agent.hosted_engine.HostedEngine::(_monitoring_loop) Current state EngineStarting (score: 3400)
MainThread::INFO::2020-04-06 21:57:57,853::states::736::ovirt_hosted_engine_ha.agent.hosted_engine.HostedEngine::(consume) VM is powering up..
MainThread::INFO::2020-04-06 21:57:58,002::hosted_engine::517::ovirt_hosted_engine_ha.agent.hosted_engine.HostedEngine::(_monitoring_loop) Current state EngineStarting (score: 3400)
MainThread::INFO::2020-04-06 21:58:08,031::states::736::ovirt_hosted_engine_ha.agent.hosted_engine.HostedEngine::(consume) VM is powering up..
MainThread::INFO::2020-04-06 21:58:08,147::hosted_engine::517::ovirt_hosted_engine_ha.agent.hosted_engine.HostedEngine::(_monitoring_loop) Current state EngineStarting (score: 3400)
MainThread::INFO::2020-04-06 21:58:18,180::state_decorators::92::ovirt_hosted_engine_ha.agent.hosted_engine.HostedEngine::(check) Timeout cleared while transitioning <class 'ovirt_hosted_engine_ha.agent.states.EngineStarting'> -> <class 'ovirt_hosted_engine_ha.agent.states.EngineUp'>
MainThread::INFO::2020-04-06 21:58:18,218::brokerlink::73::ovirt_hosted_engine_ha.lib.brokerlink.BrokerLink::(notify) Success, was notification of state_transition (EngineStarting-EngineUp) sent? ignored
MainThread::INFO::2020-04-06 21:58:18,331::hosted_engine::517::ovirt_hosted_engine_ha.agent.hosted_engine.HostedEngine::(_monitoring_loop) Current state EngineUp (score: 3400)
MainThread::INFO::2020-04-06 21:58:27,356::states::401::ovirt_hosted_engine_ha.agent.hosted_engine.HostedEngine::(consume) Engine vm running on localhost
MainThread::INFO::2020-04-06 21:58:27,474::hosted_engine::517::ovirt_hosted_engine_ha.agent.hosted_engine.HostedEngine::(_monitoring_loop) Current state EngineUp (score: 3400)

puma18 ~]# hosted-engine --vm-status

!! Cluster is in GLOBAL MAINTENANCE mode !!

--== Host puma18.scl.lab.tlv.redhat.com (id: 1) status ==--

Host ID                            : 1
Host timestamp                     : 4342
Score                              : 3400
Engine status                      : {"vm": "up", "health": "good", "detail": "Up"}
Hostname                           : puma18.scl.lab.tlv.redhat.com
Local maintenance                  : False
stopped                            : False
crc32                              : bbf43f4c
conf_on_shared_storage             : True
local_conf_timestamp               : 4342
Status up-to-date                  : True
Extra metadata (valid at timestamp):
	metadata_parse_version=1
	metadata_feature_version=1
	timestamp=4342 (Mon Apr 6 21:55:06 2020)
	host-id=1
	score=3400
	vm_conf_refresh_time=4342 (Mon Apr 6 21:55:06 2020)
	conf_on_shared_storage=True
	maintenance=False
	state=EngineUp
	stopped=False

--== Host puma19.scl.lab.tlv.redhat.com (id: 2) status ==--

Host ID                            : 2
Host timestamp                     : 11337
Score                              : 3400
Engine status                      : {"vm": "down", "health": "bad", "detail": "unknown", "reason": "vm not running on this host"}
Hostname                           : puma19.scl.lab.tlv.redhat.com
Local maintenance                  : False
stopped                            : False
crc32                              : 8909896b
conf_on_shared_storage             : True
local_conf_timestamp               : 11337
Status up-to-date                  : True
Extra metadata (valid at timestamp):
	metadata_parse_version=1
	metadata_feature_version=1
	timestamp=11337 (Mon Apr 6 21:54:57 2020)
	host-id=2
	score=3400
	vm_conf_refresh_time=11337 (Mon Apr 6 21:54:57 2020)
	conf_on_shared_storage=True
	maintenance=False
	state=EngineDown
	stopped=False

!! Cluster is in GLOBAL MAINTENANCE mode !!
puma18 ~]# hosted-engine --vm-shutdown

--== Host puma18.scl.lab.tlv.redhat.com (id: 1) status ==--

Host ID                            : 1
Host timestamp                     : 4362
Score                              : 3400
Engine status                      : {"vm": "up", "health": "bad", "detail": "Powering down", "reason": "failed liveliness check"}
Hostname                           : puma18.scl.lab.tlv.redhat.com
Local maintenance                  : False
stopped                            : False
crc32                              : 02c3a731
conf_on_shared_storage             : True
local_conf_timestamp               : 4362
Status up-to-date                  : True
Extra metadata (valid at timestamp):
	metadata_parse_version=1
	metadata_feature_version=1
	timestamp=4362 (Mon Apr 6 21:55:27 2020)
	host-id=1
	score=3400
	vm_conf_refresh_time=4362 (Mon Apr 6 21:55:27 2020)
	conf_on_shared_storage=True
	maintenance=False
	state=GlobalMaintenance
	stopped=False

--== Host puma19.scl.lab.tlv.redhat.com (id: 2) status ==--

Host ID                            : 2
Host timestamp                     : 11357
Score                              : 3400
Engine status                      : {"vm": "down", "health": "bad", "detail": "unknown", "reason": "vm not running on this host"}
Hostname                           : puma19.scl.lab.tlv.redhat.com
Local maintenance                  : False
stopped                            : False
crc32                              : e401dbf5
conf_on_shared_storage             : True
local_conf_timestamp               : 11357
Status up-to-date                  : True
Extra metadata (valid at timestamp):
	metadata_parse_version=1
	metadata_feature_version=1
	timestamp=11357 (Mon Apr 6 21:55:17 2020)
	host-id=2
	score=3400
	vm_conf_refresh_time=11357 (Mon Apr 6 21:55:17 2020)
	conf_on_shared_storage=True
	maintenance=False
	state=GlobalMaintenance
	stopped=False

!! Cluster is in GLOBAL MAINTENANCE mode !!

[root@puma18 ~]# virsh -r list --all
 Id   Name           State
-------------------------------
 -    HostedEngine   shut off

[root@puma18 ~]# virsh -r list --all
 Id   Name           State
-------------------------------
 -    HostedEngine   shut off

[root@puma18 ~]# virsh -r list --all
 Id   Name           State
-------------------------------
 -    HostedEngine   shut off

[root@puma18 ~]# hosted-engine --vm-status

!! Cluster is in GLOBAL MAINTENANCE mode !!
--== Host puma18.scl.lab.tlv.redhat.com (id: 1) status ==--

Host ID                            : 1
Host timestamp                     : 4442
Score                              : 3400
Engine status                      : {"vm": "down_unexpected", "health": "bad", "detail": "Down", "reason": "bad vm status"}
Hostname                           : puma18.scl.lab.tlv.redhat.com
Local maintenance                  : False
stopped                            : False
crc32                              : 0da71d4c
conf_on_shared_storage             : True
local_conf_timestamp               : 4442
Status up-to-date                  : True
Extra metadata (valid at timestamp):
	metadata_parse_version=1
	metadata_feature_version=1
	timestamp=4442 (Mon Apr 6 21:56:47 2020)
	host-id=1
	score=3400
	vm_conf_refresh_time=4442 (Mon Apr 6 21:56:47 2020)
	conf_on_shared_storage=True
	maintenance=False
	state=GlobalMaintenance
	stopped=False

--== Host puma19.scl.lab.tlv.redhat.com (id: 2) status ==--

Host ID                            : 2
Host timestamp                     : 11437
Score                              : 3400
Engine status                      : {"vm": "down", "health": "bad", "detail": "unknown", "reason": "vm not running on this host"}
Hostname                           : puma19.scl.lab.tlv.redhat.com
Local maintenance                  : False
stopped                            : False
crc32                              : 75cf755d
conf_on_shared_storage             : True
local_conf_timestamp               : 11437
Status up-to-date                  : True
Extra metadata (valid at timestamp):
	metadata_parse_version=1
	metadata_feature_version=1
	timestamp=11437 (Mon Apr 6 21:56:37 2020)
	host-id=2
	score=3400
	vm_conf_refresh_time=11437 (Mon Apr 6 21:56:37 2020)
	conf_on_shared_storage=True
	maintenance=False
	state=GlobalMaintenance
	stopped=False

--== Host puma18.scl.lab.tlv.redhat.com (id: 1) status ==--

Host ID                            : 1
Host timestamp                     : 4462
Score                              : 3400
Engine status                      : {"vm": "down_unexpected", "health": "bad", "detail": "Down", "reason": "bad vm status"}
Hostname                           : puma18.scl.lab.tlv.redhat.com
Local maintenance                  : False
stopped                            : False
crc32                              : 4ee64603
conf_on_shared_storage             : True
local_conf_timestamp               : 4462
Status up-to-date                  : True
Extra metadata (valid at timestamp):
	metadata_parse_version=1
	metadata_feature_version=1
	timestamp=4462 (Mon Apr 6 21:57:07 2020)
	host-id=1
	score=3400
	vm_conf_refresh_time=4462 (Mon Apr 6 21:57:07 2020)
	conf_on_shared_storage=True
	maintenance=False
	state=EngineDown
	stopped=False

--== Host puma19.scl.lab.tlv.redhat.com (id: 2) status ==--

Host ID                            : 2
Host timestamp                     : 11467
Score                              : 3400
Engine status                      : {"vm": "down", "health": "bad", "detail": "unknown", "reason": "vm not running on this host"}
Hostname                           : puma19.scl.lab.tlv.redhat.com
Local maintenance                  : False
stopped                            : False
crc32                              : 2e430e8c
conf_on_shared_storage             : True
local_conf_timestamp               : 11467
Status up-to-date                  : True
Extra metadata (valid at timestamp):
	metadata_parse_version=1
	metadata_feature_version=1
	timestamp=11467 (Mon Apr 6 21:57:07 2020)
	host-id=2
	score=3400
	vm_conf_refresh_time=11467 (Mon Apr 6 21:57:07 2020)
	conf_on_shared_storage=True
	maintenance=False
	state=EngineStart
	stopped=False

Tested on:
ovirt-hosted-engine-ha-2.4.2-1.el8ev.noarch
ovirt-hosted-engine-setup-2.4.4-1.el8ev.noarch
rhvm-appliance.x86_64 2:4.4-20200403.0.el8ev
Linux 4.18.0-193.el8.x86_64 #1 SMP Fri Mar 27 14:35:58 UTC 2020 x86_64 x86_64 x86_64 GNU/Linux
Red Hat Enterprise Linux release 8.2 (Ootpa)
This bugzilla is included in the oVirt 4.4.0 release, published on May 20th, 2020. Since the problem described in this bug report should be resolved in oVirt 4.4.0, it has been closed with a resolution of CURRENT RELEASE. If the solution does not work for you, please open a new bug report.