Bug 1295427 - hosted engine doesn't start - fails during storage server upgrade
Summary: hosted engine doesn't start - fails during storage server upgrade
Keywords:
Status: CLOSED CURRENTRELEASE
Alias: None
Product: ovirt-hosted-engine-ha
Classification: oVirt
Component: Agent
Version: 1.3.3.6
Hardware: x86_64
OS: Linux
Priority: unspecified
Severity: high
Target Milestone: ovirt-3.6.3
Target Release: 1.3.4.3
Assignee: Simone Tiraboschi
QA Contact: sefi litmanovich
URL:
Whiteboard:
Depends On:
Blocks: ovirt-hosted-engine-ha-1.3.4.3
 
Reported: 2016-01-04 13:06 UTC by Qiong Wu
Modified: 2016-03-11 07:22 UTC (History)
CC List: 9 users

Fixed In Version:
Doc Type: Bug Fix
Doc Text:
Cause: The 3.5 -> 3.6 upgrade procedure was incorrectly checking the maintenance status.
Consequence: An ambiguous error was reported: Error: 'unhashable type: 'dict''.
Fix: Correctly check the maintenance status.
Result: It now reports: Unable to upgrade while not in maintenance mode: please put this host into maintenance mode from the engine, and manually restart this service when ready.
Clone Of:
Environment:
Last Closed: 2016-03-11 07:22:43 UTC
oVirt Team: Integration
Embargoed:
rule-engine: ovirt-3.6.z+
rule-engine: exception+
rule-engine: planning_ack+
sbonazzo: devel_ack+
mavital: testing_ack+


Attachments
agent log for host that did not move to maintenance (513.24 KB, text/plain)
2016-02-25 13:50 UTC, sefi litmanovich


Links
System ID Private Priority Status Summary Last Updated
oVirt gerrit 53578 0 master MERGED upgrade: better checking maintenance mode at engine level 2016-02-19 10:12:34 UTC
oVirt gerrit 53755 0 ovirt-hosted-engine-ha-1.3 MERGED upgrade: better checking maintenance mode at engine level 2016-02-22 09:36:56 UTC

Description Qiong Wu 2016-01-04 13:06:22 UTC
Description of problem:
After upgrading the server that runs the hosted engine, I am unable to start the VM that hosts the engine.
Tailing agent.log shows:

MainThread::INFO::2016-01-04 12:21:04,088::brokerlink::140::ovirt_hosted_engine_ha.lib.brokerlink.BrokerLink::(start_monitor) Success, id 140700645989264
MainThread::INFO::2016-01-04 12:21:04,088::brokerlink::129::ovirt_hosted_engine_ha.lib.brokerlink.BrokerLink::(start_monitor) Starting monitor cpu-load-no-engine, options {'use_ssl': 'true', 'vm_uuid': '5a034fba-b54e-41fe-b65a-20cd069334b7', 'address': '0'}
MainThread::INFO::2016-01-04 12:21:04,091::brokerlink::140::ovirt_hosted_engine_ha.lib.brokerlink.BrokerLink::(start_monitor) Success, id 140700645906832
MainThread::INFO::2016-01-04 12:21:04,091::brokerlink::129::ovirt_hosted_engine_ha.lib.brokerlink.BrokerLink::(start_monitor) Starting monitor engine-health, options {'use_ssl': 'true', 'vm_uuid': '5a034fba-b54e-41fe-b65a-20cd069334b7', 'address': '0'}
MainThread::INFO::2016-01-04 12:21:04,092::brokerlink::140::ovirt_hosted_engine_ha.lib.brokerlink.BrokerLink::(start_monitor) Success, id 140700914488144
MainThread::INFO::2016-01-04 12:21:04,320::brokerlink::178::ovirt_hosted_engine_ha.lib.brokerlink.BrokerLink::(set_storage_domain) Success, id 140701384236688
MainThread::INFO::2016-01-04 12:21:04,320::hosted_engine::612::ovirt_hosted_engine_ha.agent.hosted_engine.HostedEngine::(_initialize_broker) Broker initialized, all submonitors started
MainThread::INFO::2016-01-04 12:21:04,348::hosted_engine::710::ovirt_hosted_engine_ha.agent.hosted_engine.HostedEngine::(_initialize_sanlock) Ensuring lease for lockspace hosted-engine, host id 1 is acquired (file: /var/run/vdsm/storage/48fb7be2-d8eb-44e4-8690-7770ccaf3766/86438929-9f4e-4873-a141-3f061b2edee2/86f47014-dd1f-43f1-84ca-f41e02f88f58)
MainThread::INFO::2016-01-04 12:21:04,386::upgrade::916::ovirt_hosted_engine_ha.lib.upgrade.StorageServer::(upgrade) Upgrading to current version
MainThread::ERROR::2016-01-04 12:21:04,592::agent::205::ovirt_hosted_engine_ha.agent.agent.Agent::(_run_agent) Error: 'unhashable type: 'dict'' - trying to restart agent


Version-Release number of selected component (if applicable):
ovirt-hosted-engine-ha agent 1.3.2.1

How reproducible:
Update the hosted engine with yum update, then update the node with yum update.

Steps to Reproduce:
1. Log in to the hosted engine and update it with yum update - the hosted engine became unreachable via the network after this
2. Log in to the node that hosts the engine and update it with yum update
3. Reboot

Actual results:

The hosted engine doesn't boot anymore and produces the log shown above.


Expected results:

The upgrade succeeds and the hosted engine starts up.


Additional info:

Comment 1 Qiong Wu 2016-01-04 15:20:32 UTC
I just checked vdsClient list and got this:

[root@localhost ~]# vdsClient -s localhost list

5a034fba-b54e-41fe-b65a-20cd069334b7
        Status = Down
        emulatedMachine = pc
        guestDiskMapping = {}
        displaySecurePort = -1
        cpuType = Westmere
        devices = [{'device': 'console', 'specParams': {}, 'type': 'console', 'deviceId': '413084b1-841a-4b87-96a0-6bba242d6491', 'alias': 'console0'}, {'device': 'memballoon', 'specParams': {'model': 'none'}, 'type': 'balloon'}, {'device': 'scsi', 'model': 'virtio-scsi', 'type': 'controller'}, {'device': 'vnc', 'specParams': {'spiceSecureChannels': 'smain,sdisplay,sinputs,scursor,splayback,srecord,ssmartcard,susbredir', 'displayIp': '0'}, 'type': 'graphics'}, {'nicModel': 'pv', 'macAddr': '00:16:3e:48:44:e4', 'linkActive': 'true', 'network': 'ovirtmgmt', 'filter': 'vdsm-no-mac-spoofing', 'specParams': {}, 'deviceId': 'cbb23bc4-7070-4c8d-8473-aacd99faffea', 'address': {'slot': '0x03', 'bus': '0x00', 'domain': '0x0000', 'type': 'pci', 'function': '0x0'}, 'device': 'bridge', 'type': 'interface'}, {'index': '2', 'iface': 'ide', 'specParams': {}, 'readonly': 'true', 'deviceId': '3abe57e9-16f2-44d7-a978-ced738b9463e', 'address': {'bus': '1', 'controller': '0', 'type': 'drive', 'target': '0', 'unit': '0'}, 'device': 'cdrom', 'shared': 'false', 'path': '/home/tmp/centos.iso', 'type': 'disk'}, {'poolID': '00000000-0000-0000-0000-000000000000', 'reqsize': '0', 'index': '0', 'iface': 'virtio', 'apparentsize': '26843545600', 'imageID': 'c261320f-1dc0-43db-8b6c-dd49f74b8007', 'readonly': 'false', 'shared': 'exclusive', 'truesize': '4974497792', 'type': 'disk', 'domainID': '48fb7be2-d8eb-44e4-8690-7770ccaf3766', 'volumeInfo': {'domainID': '48fb7be2-d8eb-44e4-8690-7770ccaf3766', 'volType': 'path', 'leaseOffset': 0, 'volumeID': '5c579441-98e0-43fb-8e35-8c3d619e8998', 'leasePath': '/rhev/data-center/mnt/ovirt-nfs.labtest.lab:_engine/48fb7be2-d8eb-44e4-8690-7770ccaf3766/images/c261320f-1dc0-43db-8b6c-dd49f74b8007/5c579441-98e0-43fb-8e35-8c3d619e8998.lease', 'imageID': 'c261320f-1dc0-43db-8b6c-dd49f74b8007', 'path': '/rhev/data-center/mnt/ovirt-nfs.labtest.lab:_engine/48fb7be2-d8eb-44e4-8690-7770ccaf3766/images/c261320f-1dc0-43db-8b6c-dd49f74b8007/5c579441-98e0-43fb-8e35-8c3d619e8998'}, 'format': 'raw', 'deviceId': 'c261320f-1dc0-43db-8b6c-dd49f74b8007', 'address': {'slot': '0x06', 'bus': '0x00', 'domain': '0x0000', 'type': 'pci', 'function': '0x0'}, 'device': 'disk', 'path': '/var/run/vdsm/storage/48fb7be2-d8eb-44e4-8690-7770ccaf3766/c261320f-1dc0-43db-8b6c-dd49f74b8007/5c579441-98e0-43fb-8e35-8c3d619e8998', 'propagateErrors': 'off', 'optional': 'false', 'bootOrder': '1', 'volumeID': '5c579441-98e0-43fb-8e35-8c3d619e8998', 'specParams': {}, 'volumeChain': [{'domainID': '48fb7be2-d8eb-44e4-8690-7770ccaf3766', 'volType': 'path', 'leaseOffset': 0, 'volumeID': '5c579441-98e0-43fb-8e35-8c3d619e8998', 'leasePath': '/rhev/data-center/mnt/ovirt-nfs.labtest.lab:_engine/48fb7be2-d8eb-44e4-8690-7770ccaf3766/images/c261320f-1dc0-43db-8b6c-dd49f74b8007/5c579441-98e0-43fb-8e35-8c3d619e8998.lease', 'imageID': 'c261320f-1dc0-43db-8b6c-dd49f74b8007', 'path': '/rhev/data-center/mnt/ovirt-nfs.labtest.lab:_engine/48fb7be2-d8eb-44e4-8690-7770ccaf3766/images/c261320f-1dc0-43db-8b6c-dd49f74b8007/5c579441-98e0-43fb-8e35-8c3d619e8998'}]}]
        smp = 2
        vmType = kvm
        memSize = 4096
        vmName = HostedEngine
        exitMessage = Failed to acquire lock: No space left on device
        pid = 0
        displayIp = 0
        displayPort = -1
        clientIp =
        exitCode = 1
        nicModel = rtl8139,pv
        exitReason = 1
        spiceSecureChannels = smain,sdisplay,sinputs,scursor,splayback,srecord,ssmartcard,susbredir
        statusTime = 4299084070
        display = vnc

Comment 2 Qiong Wu 2016-01-04 15:23:31 UTC
also: 2016-01-04 16:03:01+0100 3712 [1182]: r2 cmd_acquire 2,8,18691 invalid lockspace found -1 failed 0 name 48fb7be2-d8eb-44e4-8690-7770ccaf3766

Comment 3 Qiong Wu 2016-01-04 23:11:57 UTC
From sanlock.log:

2016-01-05 00:10:52+0100 15454 [1154]: s69 lockspace hosted-engine:1:/var/run/vdsm/storage/48fb7be2-d8eb-44e4-8690-7770ccaf3766/86438929-9f4e-4873-a141-3f061b2edee2/86f47014-dd1f-43f1-84ca-f41e02f88f58:0
2016-01-05 00:10:52+0100 15454 [3371]: verify_leader 1 wrong magic 0 /var/run/vdsm/storage/48fb7be2-d8eb-44e4-8690-7770ccaf3766/86438929-9f4e-4873-a141-3f061b2edee2/86f47014-dd1f-43f1-84ca-f41e02f88f58
2016-01-05 00:10:52+0100 15454 [3371]: leader1 delta_acquire_begin error -223 lockspace hosted-engine host_id 1
2016-01-05 00:10:52+0100 15454 [3371]: leader2 path /var/run/vdsm/storage/48fb7be2-d8eb-44e4-8690-7770ccaf3766/86438929-9f4e-4873-a141-3f061b2edee2/86f47014-dd1f-43f1-84ca-f41e02f88f58 offset 0
2016-01-05 00:10:52+0100 15454 [3371]: leader3 m 0 v 0 ss 0 nh 0 mh 0 oi 0 og 0 lv 0
2016-01-05 00:10:52+0100 15454 [3371]: leader4 sn  rn  ts 0 cs 0
2016-01-05 00:10:53+0100 15455 [1154]: s69 add_lockspace fail result -223

Comment 4 Artyom 2016-01-07 12:50:36 UTC
Just to be sure, are you using the same upgrade procedure described at:
http://www.ovirt.org/Hosted_Engine_Howto#Upgrade_Hosted_Engine

Comment 5 Qiong Wu 2016-01-07 12:56:39 UTC
Yes, I managed to get things running again in the meantime.
I tracked it down to problems with the storage pool ID, so I changed the storage pool ID in hosted-engine.conf and on my gluster volume in dom_md/metadata to a new value, and the machine booted again. Then, being able to log in to the engine again, I repaired the hosted engine storage and things started working again.
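For reference, a minimal, hypothetical sketch of the hosted-engine.conf part of that workaround. The file path and the spUUID key name are assumptions not stated in the comment, the dom_md/metadata edit on the gluster volume is not covered, and a backup is taken before touching anything:

# Hypothetical sketch of the workaround described in this comment: rewrite the
# storage pool ID in hosted-engine.conf. Path and key name (spUUID) are assumed.
import re
import shutil

CONF = "/etc/ovirt-hosted-engine/hosted-engine.conf"   # assumed location
NEW_SP_UUID = "00000000-0000-0000-0000-000000000000"   # example value only


def replace_sp_uuid(conf_path, new_uuid):
    shutil.copy2(conf_path, conf_path + ".bak")        # keep a backup first
    with open(conf_path) as f:
        text = f.read()
    # Replace the value of the spUUID=... line, leaving every other line intact.
    text = re.sub(r"(?m)^spUUID=.*$", "spUUID=%s" % new_uuid, text)
    with open(conf_path, "w") as f:
        f.write(text)


if __name__ == "__main__":
    replace_sp_uuid(CONF, NEW_SP_UUID)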

Comment 6 Artyom 2016-01-07 13:13:45 UTC
Is this a deployment over gluster storage?
I ask because I did not encounter this specific problem with NFS and iSCSI storage.

Comment 7 Qiong Wu 2016-01-07 16:37:59 UTC
yeah, I set everything up according to http://community.redhat.com/blog/2014/10/up-and-running-with-ovirt-3-5/

Comment 8 Sandro Bonazzola 2016-01-14 09:29:45 UTC
Simone, this is NFS over Gluster in a hyperconverged setup.

Qiong Wu, can you please upload a sos report somewhere we can look at?
(yum install sos  ; sosreport)

Comment 9 Simone Tiraboschi 2016-02-17 13:19:47 UTC
Understood:
as per comment 1, a VM was there when you tried the upgrade, and we weren't correctly parsing the output of vdscli.list, which returns a list of dictionaries, so it ended with
Error: 'unhashable type: 'dict'' - trying to restart agent

This doesn't affect the regular flow.
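A minimal, hypothetical illustration of that failure mode (not the actual agent code): if the entries returned by vdscli.list are treated as hashable items rather than as a list of dictionaries, Python raises exactly this error. The 'vmList' key and field names below are for illustration only.

# Hypothetical illustration only -- not the real ovirt-hosted-engine-ha code.
# A vdscli.list()-style response: a dict whose 'vmList' entry is a list of dicts.
response = {
    "status": {"code": 0, "message": "Done"},
    "vmList": [
        {"vmId": "5a034fba-b54e-41fe-b65a-20cd069334b7", "status": "Up"},
    ],
}

# Buggy handling: treating the list entries as if they were hashable IDs.
try:
    running = set(response["vmList"])     # each entry is a dict -> not hashable
except TypeError as exc:
    print("Error: %r" % str(exc))         # -> "unhashable type: 'dict'"

# Correct handling: index by a hashable field taken from each dictionary.
running = {vm["vmId"] for vm in response["vmList"]}
print(running)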

Comment 10 Red Hat Bugzilla Rules Engine 2016-02-18 09:53:00 UTC
Bug tickets must have version flags set prior to targeting them to a release. Please ask maintainer to set the correct version flags and only then set the target milestone.

Comment 11 sefi litmanovich 2016-02-25 13:49:21 UTC
Verified with the following flow:

1. On two RHEL 7.2 hosts, install the latest 3.5 packages with:
ovirt-hosted-engine-ha-1.2.10-1.el7ev.noarch.rpm
ovirt-hosted-engine-setup-1.2.6.1-1.el7ev.noarch.rpm
2. Run hosted-engine --deploy from one of the hosts and install the engine on the created VM.
3. Run hosted-engine --deploy from the second host to add it to the engine.
4. Add a storage domain to the engine and create VMs to run on host 2.
5. Set maintenance mode to global.
6. Upgrade the engine from rhevm-3.5.8-0.1.el6ev.noarch to rhevm-3.6.3.3-0.1.el6.noarch.
7. Disable global maintenance.
8. Put the 1st host into maintenance in the engine.
9. Update the host with 3.6 repos -> stop ovirt-ha-agent -> yum update.
10. Restart vdsm and ovirt-ha-agent/broker.
11. Start the host in the engine.
12. Do step 9 on the second host without moving it to maintenance (3 VMs still running on it).
13. Restart ovirt-ha-agent on the second host.

Result:
The restart crashes and the hosted engine VM goes down (it then starts on the first host).
In agent.log I get (full log is attached):

MainThread::INFO::2016-02-25 15:28:37,794::hosted_engine::757::ovirt_hosted_engine_ha.agent.hosted_engine.HostedEngine::(_initialize_sanlock) Acquired lock on host id 2
MainThread::INFO::2016-02-25 15:28:37,807::upgrade::977::ovirt_hosted_engine_ha.lib.upgrade.StorageServer::(upgrade_35_36) Upgrading to current version
MainThread::INFO::2016-02-25 15:28:37,813::upgrade::831::ovirt_hosted_engine_ha.lib.upgrade.StorageServer::(_is_in_engine_maintenance) This host is connected to other storage pools: ['00000002-0002-0002-0002-00000000003c']
MainThread::ERROR::2016-02-25 15:28:37,813::upgrade::980::ovirt_hosted_engine_ha.lib.upgrade.StorageServer::(upgrade_35_36) Unable to upgrade while not in maintenance mode: please put this host into maintenance mode from the engine, and manually restart this service when ready
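For context, a minimal sketch of the kind of check this fix introduces, based only on the log lines and error message above; names, signatures, and the hosted-engine pool ID are hypothetical, not the actual ovirt-hosted-engine-ha code:

# Hypothetical sketch only. The idea reflected in the log: if the host is still
# connected to any storage pool other than the hosted-engine one, it was not put
# into maintenance from the engine, so the 3.5 -> 3.6 upgrade refuses to proceed.

class UpgradeError(Exception):
    pass


def check_in_engine_maintenance(connected_pools, hosted_engine_sp_uuid, log):
    other_pools = [p for p in connected_pools if p != hosted_engine_sp_uuid]
    if other_pools:
        log("This host is connected to other storage pools: %s" % other_pools)
        raise UpgradeError(
            "Unable to upgrade while not in maintenance mode: please put this "
            "host into maintenance mode from the engine, and manually restart "
            "this service when ready"
        )


# The second host in this flow was still attached to the Default data center
# pool, so the check fails with the clear error message instead of the old
# "unhashable type: 'dict'" crash.
try:
    check_in_engine_maintenance(
        connected_pools=["00000002-0002-0002-0002-00000000003c"],
        hosted_engine_sp_uuid="00000000-0000-0000-0000-000000000000",  # assumed value
        log=print,
    )
except UpgradeError as exc:
    print("ERROR: %s" % exc)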

Comment 12 sefi litmanovich 2016-02-25 13:50:02 UTC
Created attachment 1130539 [details]
agent log for host that did not move to maintenance

