Description of problem:
After doing SHE migration path 3.4->3.5->3.6->4.0 and ending global maintenance, HE VM was not automatically started and hosted-engine --vm-status showed that agent was in 'state=AgentStopped'. Manually starting HE VM with hosted-engine --vm-start worked fine.

~~~
# hosted-engine --vm-status | sed 's/rhev.lab.eng.brq.redhat/example.com/'
/usr/lib/python2.7/site-packages/ovirt_hosted_engine_ha/lib/storage_backends.py:15: DeprecationWarning: vdscli uses xmlrpc. since ovirt 3.6 xmlrpc is deprecated, please use vdsm.jsonrpcvdscli
  import vdsm.vdscli


--== Host 1 status ==--

Status up-to-date                  : False
Hostname                           : 10-34-60-151.example.com.com
Host ID                            : 1
Engine status                      : unknown stale-data
Score                              : 0
stopped                            : True
Local maintenance                  : False
crc32                              : e9f3cf55
Host timestamp                     : 231104
Extra metadata (valid at timestamp):
        metadata_parse_version=1
        metadata_feature_version=1
        timestamp=231104 (Mon Aug  8 15:53:34 2016)
        host-id=1
        score=0
        maintenance=False
        state=AgentStopped
        stopped=True


--== Host 2 status ==--

Status up-to-date                  : True
Hostname                           : 10-34-60-215.example.com.com
Host ID                            : 2
Engine status                      : {"reason": "vm not running on this host", "health": "bad", "vm": "down", "detail": "unknown"}
Score                              : 0
stopped                            : False
Local maintenance                  : True
crc32                              : 2e113351
Host timestamp                     : 234997
Extra metadata (valid at timestamp):
        metadata_parse_version=1
        metadata_feature_version=1
        timestamp=234997 (Mon Aug  8 18:39:40 2016)
        host-id=2
        score=0
        maintenance=True
        state=LocalMaintenance
        stopped=False
~~~

There's a lot of "ovirt-ha-agent ovirt_hosted_engine_ha.agent.agent.Agent ERROR Error: ''items'' - trying to restart agent" in the log.

Both hosts were EL7 with 4.0 rpms.

Version-Release number of selected component (if applicable):
ovirt-hosted-engine-ha-2.0.1-1.el7ev.noarch

How reproducible:
hard to reproduce, if at all possible

Steps to Reproduce:
1. discovered as part of 3.4->3.5->3.6->4.0 SHE migration
2.
3.

Actual results:
HE VM was not started after ending global maintenance

Expected results:
HE VM should be started automatically.

Additional info:

The issue is here:

MainThread::WARNING::2016-08-08 15:51:03,712::hosted_engine::480::ovirt_hosted_engine_ha.agent.hosted_engine.HostedEngine::(start_monitoring) Unexpected error
Traceback (most recent call last):
  File "/usr/lib/python2.7/site-packages/ovirt_hosted_engine_ha/agent/hosted_engine.py", line 445, in start_monitoring
    self._initialize_storage_images()
  File "/usr/lib/python2.7/site-packages/ovirt_hosted_engine_ha/agent/hosted_engine.py", line 667, in _initialize_storage_images
    img.prepare_images()
  File "/usr/lib/python2.7/site-packages/ovirt_hosted_engine_ha/lib/image.py", line 141, in prepare_images
    for volUUID in vm_vol_uuid_list['items']:
KeyError: 'items'
MainThread::INFO::2016-08-08 15:51:05,328::hosted_engine::496::ovirt_hosted_engine_ha.agent.hosted_engine.HostedEngine::(start_monitoring) Sleeping 60 seconds
MainThread::INFO::2016-08-08 15:52:05,455::brokerlink::111::ovirt_hosted_engine_ha.lib.brokerlink.BrokerLink::(notify) Trying: notify time=1470664325.46 type=state_transition detail=GlobalMaintenance-ReinitializeFSM hostname='10-34-60-151.rhev.lab.eng.brq.redhat.com'

It seems that at some point you got an image without a volume (still not sure how), and our code failed while scanning it.
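
To illustrate the failure mode in the traceback, here is a minimal sketch of a defensive scan that tolerates a response with no 'items' key. It is not the change that shipped in any ovirt-hosted-engine-ha release. The function names (volume_uuids, prepare_volumes) and the response shape (a plain dict that may or may not carry an 'items' list) are assumptions inferred only from the `vm_vol_uuid_list['items']` line above.

```python
import logging

log = logging.getLogger("image_scan_sketch")


def volume_uuids(response):
    """Return the volume UUIDs from a response dict, or [] if 'items' is absent.

    Hypothetical helper: 'response' stands in for the dict that the
    traceback indexes as vm_vol_uuid_list['items'].
    """
    items = response.get("items")
    if items is None:
        # An image without volumes: warn instead of raising KeyError.
        log.warning("response has no 'items' key; assuming no volumes")
        return []
    return list(items)


def prepare_volumes(responses_by_image):
    """Collect (image UUID, volume UUID) pairs, skipping empty images.

    Hypothetical stand-in for the loop in prepare_images(); the real
    method talks to VDSM, which this sketch does not do.
    """
    prepared = []
    for img_uuid, response in responses_by_image.items():
        for vol_uuid in volume_uuids(response):
            prepared.append((img_uuid, vol_uuid))
    return prepared
```

With this shape, an image that reports no volumes is logged and skipped, so one bad image does not abort the whole monitoring start-up.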

OK with ovirt-hosted-engine-ha-2.0.4-1.el7ev.noarch: I can't see the issue described in #2 anymore.