Description of problem:

After upgrading the server that runs the hosted engine, I am unable to start the VM that hosts the engine. tail agent.log tells me:

MainThread::INFO::2016-01-04 12:21:04,088::brokerlink::140::ovirt_hosted_engine_ha.lib.brokerlink.BrokerLink::(start_monitor) Success, id 140700645989264
MainThread::INFO::2016-01-04 12:21:04,088::brokerlink::129::ovirt_hosted_engine_ha.lib.brokerlink.BrokerLink::(start_monitor) Starting monitor cpu-load-no-engine, options {'use_ssl': 'true', 'vm_uuid': '5a034fba-b54e-41fe-b65a-20cd069334b7', 'address': '0'}
MainThread::INFO::2016-01-04 12:21:04,091::brokerlink::140::ovirt_hosted_engine_ha.lib.brokerlink.BrokerLink::(start_monitor) Success, id 140700645906832
MainThread::INFO::2016-01-04 12:21:04,091::brokerlink::129::ovirt_hosted_engine_ha.lib.brokerlink.BrokerLink::(start_monitor) Starting monitor engine-health, options {'use_ssl': 'true', 'vm_uuid': '5a034fba-b54e-41fe-b65a-20cd069334b7', 'address': '0'}
MainThread::INFO::2016-01-04 12:21:04,092::brokerlink::140::ovirt_hosted_engine_ha.lib.brokerlink.BrokerLink::(start_monitor) Success, id 140700914488144
MainThread::INFO::2016-01-04 12:21:04,320::brokerlink::178::ovirt_hosted_engine_ha.lib.brokerlink.BrokerLink::(set_storage_domain) Success, id 140701384236688
MainThread::INFO::2016-01-04 12:21:04,320::hosted_engine::612::ovirt_hosted_engine_ha.agent.hosted_engine.HostedEngine::(_initialize_broker) Broker initialized, all submonitors started
MainThread::INFO::2016-01-04 12:21:04,348::hosted_engine::710::ovirt_hosted_engine_ha.agent.hosted_engine.HostedEngine::(_initialize_sanlock) Ensuring lease for lockspace hosted-engine, host id 1 is acquired (file: /var/run/vdsm/storage/48fb7be2-d8eb-44e4-8690-7770ccaf3766/86438929-9f4e-4873-a141-3f061b2edee2/86f47014-dd1f-43f1-84ca-f41e02f88f58)
MainThread::INFO::2016-01-04 12:21:04,386::upgrade::916::ovirt_hosted_engine_ha.lib.upgrade.StorageServer::(upgrade) Upgrading to current version
MainThread::ERROR::2016-01-04 12:21:04,592::agent::205::ovirt_hosted_engine_ha.agent.agent.Agent::(_run_agent) Error: 'unhashable type: 'dict'' - trying to restart agent

Version-Release number of selected component (if applicable):
ovirt-hosted-engine-ha agent 1.3.2.1

How reproducible:
Update the hosted engine with yum update, then update the node with yum update.

Steps to Reproduce:
1. Log in to the hosted engine and update it with yum update - the hosted engine became unreachable via the network after this
2. Log in to the node that hosts the engine and update it with yum update
3. Reboot

Actual results:
The hosted engine doesn't boot anymore and produces the log shown above.

Expected results:
The upgrade succeeds and the hosted engine starts up.

Additional info:
I just checked vdsClient list and got this:

[root@localhost ~]# vdsClient -s localhost list

5a034fba-b54e-41fe-b65a-20cd069334b7
	Status = Down
	emulatedMachine = pc
	guestDiskMapping = {}
	displaySecurePort = -1
	cpuType = Westmere
	devices = [{'device': 'console', 'specParams': {}, 'type': 'console', 'deviceId': '413084b1-841a-4b87-96a0-6bba242d6491', 'alias': 'console0'}, {'device': 'memballoon', 'specParams': {'model': 'none'}, 'type': 'balloon'}, {'device': 'scsi', 'model': 'virtio-scsi', 'type': 'controller'}, {'device': 'vnc', 'specParams': {'spiceSecureChannels': 'smain,sdisplay,sinputs,scursor,splayback,srecord,ssmartcard,susbredir', 'displayIp': '0'}, 'type': 'graphics'}, {'nicModel': 'pv', 'macAddr': '00:16:3e:48:44:e4', 'linkActive': 'true', 'network': 'ovirtmgmt', 'filter': 'vdsm-no-mac-spoofing', 'specParams': {}, 'deviceId': 'cbb23bc4-7070-4c8d-8473-aacd99faffea', 'address': {'slot': '0x03', 'bus': '0x00', 'domain': '0x0000', 'type': 'pci', 'function': '0x0'}, 'device': 'bridge', 'type': 'interface'}, {'index': '2', 'iface': 'ide', 'specParams': {}, 'readonly': 'true', 'deviceId': '3abe57e9-16f2-44d7-a978-ced738b9463e', 'address': {'bus': '1', 'controller': '0', 'type': 'drive', 'target': '0', 'unit': '0'}, 'device': 'cdrom', 'shared': 'false', 'path': '/home/tmp/centos.iso', 'type': 'disk'}, {'poolID': '00000000-0000-0000-0000-000000000000', 'reqsize': '0', 'index': '0', 'iface': 'virtio', 'apparentsize': '26843545600', 'imageID': 'c261320f-1dc0-43db-8b6c-dd49f74b8007', 'readonly': 'false', 'shared': 'exclusive', 'truesize': '4974497792', 'type': 'disk', 'domainID': '48fb7be2-d8eb-44e4-8690-7770ccaf3766', 'volumeInfo': {'domainID': '48fb7be2-d8eb-44e4-8690-7770ccaf3766', 'volType': 'path', 'leaseOffset': 0, 'volumeID': '5c579441-98e0-43fb-8e35-8c3d619e8998', 'leasePath': '/rhev/data-center/mnt/ovirt-nfs.labtest.lab:_engine/48fb7be2-d8eb-44e4-8690-7770ccaf3766/images/c261320f-1dc0-43db-8b6c-dd49f74b8007/5c579441-98e0-43fb-8e35-8c3d619e8998.lease', 'imageID': 'c261320f-1dc0-43db-8b6c-dd49f74b8007', 'path': '/rhev/data-center/mnt/ovirt-nfs.labtest.lab:_engine/48fb7be2-d8eb-44e4-8690-7770ccaf3766/images/c261320f-1dc0-43db-8b6c-dd49f74b8007/5c579441-98e0-43fb-8e35-8c3d619e8998'}, 'format': 'raw', 'deviceId': 'c261320f-1dc0-43db-8b6c-dd49f74b8007', 'address': {'slot': '0x06', 'bus': '0x00', 'domain': '0x0000', 'type': 'pci', 'function': '0x0'}, 'device': 'disk', 'path': '/var/run/vdsm/storage/48fb7be2-d8eb-44e4-8690-7770ccaf3766/c261320f-1dc0-43db-8b6c-dd49f74b8007/5c579441-98e0-43fb-8e35-8c3d619e8998', 'propagateErrors': 'off', 'optional': 'false', 'bootOrder': '1', 'volumeID': '5c579441-98e0-43fb-8e35-8c3d619e8998', 'specParams': {}, 'volumeChain': [{'domainID': '48fb7be2-d8eb-44e4-8690-7770ccaf3766', 'volType': 'path', 'leaseOffset': 0, 'volumeID': '5c579441-98e0-43fb-8e35-8c3d619e8998', 'leasePath': '/rhev/data-center/mnt/ovirt-nfs.labtest.lab:_engine/48fb7be2-d8eb-44e4-8690-7770ccaf3766/images/c261320f-1dc0-43db-8b6c-dd49f74b8007/5c579441-98e0-43fb-8e35-8c3d619e8998.lease', 'imageID': 'c261320f-1dc0-43db-8b6c-dd49f74b8007', 'path': '/rhev/data-center/mnt/ovirt-nfs.labtest.lab:_engine/48fb7be2-d8eb-44e4-8690-7770ccaf3766/images/c261320f-1dc0-43db-8b6c-dd49f74b8007/5c579441-98e0-43fb-8e35-8c3d619e8998'}]}]
	smp = 2
	vmType = kvm
	memSize = 4096
	vmName = HostedEngine
	exitMessage = Failed to acquire lock: No space left on device
	pid = 0
	displayIp = 0
	displayPort = -1
	clientIp =
	exitCode = 1
	nicModel = rtl8139,pv
	exitReason = 1
	spiceSecureChannels = smain,sdisplay,sinputs,scursor,splayback,srecord,ssmartcard,susbredir
	statusTime = 4299084070
	display = vnc
Also:

2016-01-04 16:03:01+0100 3712 [1182]: r2 cmd_acquire 2,8,18691 invalid lockspace found -1 failed 0 name 48fb7be2-d8eb-44e4-8690-7770ccaf3766
From sanlock.log:

2016-01-05 00:10:52+0100 15454 [1154]: s69 lockspace hosted-engine:1:/var/run/vdsm/storage/48fb7be2-d8eb-44e4-8690-7770ccaf3766/86438929-9f4e-4873-a141-3f061b2edee2/86f47014-dd1f-43f1-84ca-f41e02f88f58:0
2016-01-05 00:10:52+0100 15454 [3371]: verify_leader 1 wrong magic 0 /var/run/vdsm/storage/48fb7be2-d8eb-44e4-8690-7770ccaf3766/86438929-9f4e-4873-a141-3f061b2edee2/86f47014-dd1f-43f1-84ca-f41e02f88f58
2016-01-05 00:10:52+0100 15454 [3371]: leader1 delta_acquire_begin error -223 lockspace hosted-engine host_id 1
2016-01-05 00:10:52+0100 15454 [3371]: leader2 path /var/run/vdsm/storage/48fb7be2-d8eb-44e4-8690-7770ccaf3766/86438929-9f4e-4873-a141-3f061b2edee2/86f47014-dd1f-43f1-84ca-f41e02f88f58 offset 0
2016-01-05 00:10:52+0100 15454 [3371]: leader3 m 0 v 0 ss 0 nh 0 mh 0 oi 0 og 0 lv 0
2016-01-05 00:10:52+0100 15454 [3371]: leader4 sn rn ts 0 cs 0
2016-01-05 00:10:53+0100 15455 [1154]: s69 add_lockspace fail result -223
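For the record: "verify_leader 1 wrong magic 0" together with the all-zero leader3/leader4 fields means sanlock read back zeros where the delta-lease leader record should be, i.e. the lockspace metadata on disk was lost or overwritten (the -223 result is the bad-leader-magic failure the verify_leader line reports). A minimal sketch to confirm that the on-disk header really reads as zeros - mine, not from the report, assuming root access and that the leader record starts with a 32-bit magic dword in the first 512-byte sector:

import struct

# Lockspace file taken from the agent/sanlock logs above.
LEASE = ("/var/run/vdsm/storage/48fb7be2-d8eb-44e4-8690-7770ccaf3766/"
         "86438929-9f4e-4873-a141-3f061b2edee2/"
         "86f47014-dd1f-43f1-84ca-f41e02f88f58")

with open(LEASE, "rb") as f:
    sector = f.read(512)   # first sector should hold host_id 1's leader record

(magic,) = struct.unpack("<I", sector[:4])   # leading magic dword
print("magic: 0x%08x" % magic)
if magic == 0:
    print("leader record reads as zeros - lockspace metadata is gone")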
Just to be sure: did you use the same upgrade procedure as the one described at http://www.ovirt.org/Hosted_Engine_Howto#Upgrade_Hosted_Engine ?
Yes, and I managed to get things running again in the meantime. I tracked it down to a problem with the storage pool ID: I changed the storage pool ID in hosted-engine.conf and in dom_md/metadata on my Gluster volume to a new value, and the machine booted again. Then, being able to log in to the engine again, I repaired the hosted engine storage and things started working again.
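In case it helps the next person, a rough Python sketch of that manual change (mine, not from the comment above). Do this only with the agent stopped; the key names - spUUID in /etc/ovirt-hosted-engine/hosted-engine.conf and POOL_UUID in the domain's dom_md/metadata - and the mount path are assumptions to verify against your own setup, since metadata formats vary and some versions also carry a checksum line:

import re
import uuid

NEW_POOL_ID = str(uuid.uuid4())   # fresh storage pool UUID

def replace_key(path, key, value):
    """Rewrite a 'key=value' line in a flat config/metadata file."""
    with open(path) as f:
        text = f.read()
    text = re.sub(r"^%s=.*$" % re.escape(key),
                  "%s=%s" % (key, value), text, flags=re.M)
    with open(path, "w") as f:
        f.write(text)

# Hosted-engine config on the host (key name assumed to be spUUID):
replace_key("/etc/ovirt-hosted-engine/hosted-engine.conf",
            "spUUID", NEW_POOL_ID)

# Storage domain metadata on the mounted hosted-engine volume; adjust the
# mount point to your environment (key name assumed to be POOL_UUID):
replace_key("/rhev/data-center/mnt/<your-server>:_engine/"
            "48fb7be2-d8eb-44e4-8690-7770ccaf3766/dom_md/metadata",
            "POOL_UUID", NEW_POOL_ID)

The point is simply that the host config and the storage domain metadata have to agree on the pool UUID again before the agent can mount and start the VM.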
Is this a deployment over Gluster storage? I did not encounter this specific problem with NFS and iSCSI storage.
Yes, I set everything up according to http://community.redhat.com/blog/2014/10/up-and-running-with-ovirt-3-5/
Simone, this is NFS over Gluster in a hyperconverged setup. Qiong Wu, can you please upload a sos report somewhere we can look at it? (yum install sos ; sosreport)
Understood. As for comment 1: a VM was there when you tried the upgrade, and we weren't correctly parsing the output of vdscli.list, which returns a list of dictionaries, so it ended with:

Error: 'unhashable type: 'dict'' - trying to restart agent

This doesn't affect the regular flow.
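To make that failure mode concrete, a minimal sketch (mine, not the actual agent code; the response shape mirrors vdsm's xmlrpc list() call as I understand it): dicts are unhashable in Python, so feeding the per-VM dictionaries from vmList into a set, or using them as dict keys, raises exactly this TypeError, while indexing them by their vmId field works:

# Illustration of the failure class, not the real agent code.
response = {"status": {"code": 0, "message": "Done"},
            "vmList": [{"vmId": "5a034fba-b54e-41fe-b65a-20cd069334b7",
                        "status": "Down"}]}

try:
    vm_ids = set(response["vmList"])   # dicts as set members -> TypeError
except TypeError as e:
    print("Error: %s" % e)             # unhashable type: 'dict'

# Index by a hashable field instead:
vms_by_id = dict((vm["vmId"], vm) for vm in response["vmList"])
print(sorted(vms_by_id))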
Bug tickets must have version flags set prior to targeting them to a release. Please ask the maintainer to set the correct version flags, and only then set the target milestone.
Verified with the following flow:

1. On two RHEL 7.2 hosts, install the latest 3.5 packages:
   ovirt-hosted-engine-ha-1.2.10-1.el7ev.noarch.rpm
   ovirt-hosted-engine-setup-1.2.6.1-1.el7ev.noarch.rpm
2. hosted-engine --deploy from one of the hosts + install the engine on the created VM.
3. hosted-engine --deploy from the second host to add it to the engine.
4. Add a storage domain to the engine and create VMs to run on host 2.
5. Set maintenance mode to global.
6. Upgrade the engine from rhevm-3.5.8-0.1.el6ev.noarch to rhevm-3.6.3.3-0.1.el6.noarch.
7. Disable global maintenance.
8. Put the 1st host into maintenance on the engine.
9. Update the host with 3.6 repos -> stop ovirt-ha-agent -> yum update.
10. Restart vdsm and ovirt-ha-agent/broker.
11. Start the host on the engine.
12. Do step 9 on the second host without moving it to maintenance (3 VMs still running on it).
13. Restart ovirt-ha-agent on the second host.

Result: the restart crashes and the hosted engine VM goes down (it then starts on the first host). In agent.log I get (full log is attached):

MainThread::INFO::2016-02-25 15:28:37,794::hosted_engine::757::ovirt_hosted_engine_ha.agent.hosted_engine.HostedEngine::(_initialize_sanlock) Acquired lock on host id 2
MainThread::INFO::2016-02-25 15:28:37,807::upgrade::977::ovirt_hosted_engine_ha.lib.upgrade.StorageServer::(upgrade_35_36) Upgrading to current version
MainThread::INFO::2016-02-25 15:28:37,813::upgrade::831::ovirt_hosted_engine_ha.lib.upgrade.StorageServer::(_is_in_engine_maintenance) This host is connected to other storage pools: ['00000002-0002-0002-0002-00000000003c']
MainThread::ERROR::2016-02-25 15:28:37,813::upgrade::980::ovirt_hosted_engine_ha.lib.upgrade.StorageServer::(upgrade_35_36) Unable to upgrade while not in maintenance mode: please put this host into maintenance mode from the engine, and manually restart this service when ready
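The guard firing in the last two log lines is easy to follow; here is a rough reconstruction of its logic (mine, with simplified signatures - only the two log messages are taken verbatim from the log above, the rest is not the real ovirt_hosted_engine_ha code):

import logging

logging.basicConfig(level=logging.INFO)
log = logging.getLogger("upgrade.StorageServer")

def _is_in_engine_maintenance(connected_pools):
    """Treat the host as in maintenance only if it is connected to no
    engine-managed storage pools (reconstruction of the check whose
    log lines appear above)."""
    if connected_pools:
        log.info("This host is connected to other storage pools: %s",
                 connected_pools)
        return False
    return True

def upgrade_35_36(connected_pools):
    log.info("Upgrading to current version")
    if not _is_in_engine_maintenance(connected_pools):
        log.error("Unable to upgrade while not in maintenance mode: "
                  "please put this host into maintenance mode from the "
                  "engine, and manually restart this service when ready")
        return False
    # ... the actual 3.5 -> 3.6 storage migration would run here ...
    return True

# With the pool from the log above, this reproduces both messages:
upgrade_35_36(['00000002-0002-0002-0002-00000000003c'])

So instead of the original cryptic 'unhashable type' crash, the agent now fails with an explicit instruction to put the host into maintenance first.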
Created attachment 1130539 [details] agent log for host that did not move to maintenance