Bug 1660765
Summary: Migration fails when clicking "Put this host into local maintenance" button for the first time in the cockpit UI.

Product: [oVirt] cockpit-ovirt
Component: Hosted Engine
Version: ---
Status: CLOSED CURRENTRELEASE
Severity: medium
Priority: high
Reporter: Wei Wang <weiwang>
Assignee: Simone Tiraboschi <stirabos>
QA Contact: Wei Wang <weiwang>
CC: akrejcir, bugs, cshao, huzhao, qiyuan, sbonazzo, stirabos, weiwang, yaniwang, ycui, yturgema
Target Milestone: ovirt-4.3.6
Target Release: ---
Flags: sbonazzo: ovirt-4.3?; rule-engine: planning_ack?; sbonazzo: devel_ack+; cshao: testing_ack+
Hardware: Unspecified
OS: Unspecified
Doc Type: If docs needed, set a value
Last Closed: 2019-09-26 19:43:14 UTC
Type: Bug
Regression: ---
oVirt Team: Integration
Created attachment 1515507 [details]
operate from cockpit UI
Simone is this intended behavior?

sbonazzo, I think this is a common bug, not an RFE. Could you please check it again? Thanks!

(In reply to Sandro Bonazzola from comment #2)
> Simone is this intended behavior?

No, I tend to think it's a bug. In /var/log/messages we have:

Dec 19 12:11:14 localhost libvirtd: 2018-12-19 04:11:14.108+0000: 31096: error : qemuMigrationJobCheckStatus:1435 : operation failed: migration out job: unexpectedly failed
Dec 19 12:11:19 localhost journal: ovirt-ha-agent ovirt_hosted_engine_ha.agent.hosted_engine.HostedEngine ERROR Migration failed: {u'status': {u'message': u'Fatal error during migration', u'code': 12}, u'progress': 0}

hence the reported failure. Now let's look at vdsm.log to see what really happened:

2018-12-19 12:07:09,524+0800 INFO (jsonrpc/5) [api.virt] START migrate(params={u'tunneled': False, u'autoConverge': True, u'dstqemu': u'10.73.73.106', u'src': u'localhost', u'enableGuestEvents': True, u'dst': u'10.73.73.106', u'convergenceSchedule': {u'init': [{u'params': [u'100'], u'name': u'setDowntime'}], u'stalling': [{u'action': {u'params': [u'150'], u'name': u'setDowntime'}, u'limit': 1}, {u'action': {u'params': [u'200'], u'name': u'setDowntime'}, u'limit': 2}, {u'action': {u'params': [u'300'], u'name': u'setDowntime'}, u'limit': 3}, {u'action': {u'params': [u'400'], u'name': u'setDowntime'}, u'limit': 4}, {u'action': {u'params': [u'500'], u'name': u'setDowntime'}, u'limit': 6}, {u'action': {u'params': [u'5000'], u'name': u'setDowntime'}, u'limit': -1}, {u'action': {u'params': [], u'name': u'abort'}, u'limit': -1}]}, u'vmId': u'6c11dd49-67a1-4917-be01-2a2f2bac6942', u'abortOnError': True, u'compressed': False, u'method': u'online'}) from=::1,44282, vmId=6c11dd49-67a1-4917-be01-2a2f2bac6942 (api:46)
2018-12-19 12:07:09,526+0800 INFO (jsonrpc/5) [api.virt] FINISH migrate return={'status': {'message': 'Migration in progress', 'code': 0}, 'progress': 0} from=::1,44282, vmId=6c11dd49-67a1-4917-be01-2a2f2bac6942 (api:52)

Please notice 'from=::1,44282': the command was sent from the same host, i.e. by ovirt-ha-agent (the user triggered it from cockpit). And so the migration started:

2018-12-19 12:07:09,955+0800 INFO (migsrc/6c11dd49) [virt.vm] (vmId='6c11dd49-67a1-4917-be01-2a2f2bac6942') Creation of destination VM took: 0 seconds (migration:473)
2018-12-19 12:07:09,955+0800 INFO (migsrc/6c11dd49) [virt.vm] (vmId='6c11dd49-67a1-4917-be01-2a2f2bac6942') starting migration to qemu+tls://10.73.73.106/system with miguri tcp://10.73.73.106 (migration:502)

And ovirt-ha-agent keeps monitoring it:

2018-12-19 12:08:39,873+0800 INFO (jsonrpc/0) [api.virt] START getMigrationStatus() from=::1,44282, vmId=6c11dd49-67a1-4917-be01-2a2f2bac6942 (api:46)
2018-12-19 12:08:39,873+0800 INFO (jsonrpc/0) [api.virt] FINISH getMigrationStatus return={'status': {'message': 'Done', 'code': 0}, 'migrationStats': {'status': {'message': 'Migration in progress', 'code': 0}, 'progress': 97L}} from=::1,44282, vmId=6c11dd49-67a1-4917-be01-2a2f2bac6942 (api:52)
2018-12-19 12:08:39,988+0800 INFO (migmon/6c11dd49) [virt.vm] (vmId='6c11dd49-67a1-4917-be01-2a2f2bac6942') Migration Progress: 90 seconds elapsed, 99% of data processed, total data: 16444MB, processed data: 4616MB, remaining data: 207MB, transfer speed 52MBps, zero pages: 3197006MB, compressed: 0MB, dirty rate: 4819, memory iteration: 3 (migration:867)

Please notice 'from=::1,44282' in getMigrationStatus as well. And the migration successfully completed:

2018-12-19 12:08:48,218+0800 INFO (migsrc/6c11dd49) [virt.vm] (vmId='6c11dd49-67a1-4917-be01-2a2f2bac6942') migration took 98 seconds to complete (migration:514)
2018-12-19 12:08:48,218+0800 INFO (migsrc/6c11dd49) [virt.vm] (vmId='6c11dd49-67a1-4917-be01-2a2f2bac6942') Changed state to Down: Migration succeeded (code=4) (vm:1689)

Now the issue: at the same time we also have an active engine querying VM status. Please notice:

2018-12-19 12:08:45,791+0800 INFO (jsonrpc/3) [api.host] START getAllVmStats() from=::ffff:10.73.73.102,60254 (api:46)
2018-12-19 12:08:45,793+0800 INFO (jsonrpc/3) [api.host] FINISH getAllVmStats return={'status': {'message': 'Done', 'code': 0}, 'statsList': (suppressed)} from=::ffff:10.73.73.102,60254 (api:52)

Please notice 'from=::ffff:10.73.73.102,60254': this is the engine, running on a different machine (the engine VM itself). So maybe the engine notices the engine VM also up on the target host, or something like that, and so it sends a destroy command:

2018-12-19 12:08:48,227+0800 INFO (jsonrpc/7) [api.virt] START destroy(gracefulAttempts=1) from=::ffff:10.73.73.102,60254, vmId=6c11dd49-67a1-4917-be01-2a2f2bac6942 (api:46)

Please notice 'from=::ffff:10.73.73.102,60254' (it's the engine) and 'vmId=6c11dd49-67a1-4917-be01-2a2f2bac6942' (it's exactly the engine VM). In this case the destroy failed, since the engine VM completed the migration just in time (migration completed at 12:08:48,218+0800, destroy failed at 12:08:48,227+0800):

2018-12-19 12:08:48,227+0800 INFO (jsonrpc/7) [virt.vm] (vmId='6c11dd49-67a1-4917-be01-2a2f2bac6942') Release VM resources (vm:5279)
2018-12-19 12:08:48,227+0800 WARN (jsonrpc/7) [virt.vm] (vmId='6c11dd49-67a1-4917-be01-2a2f2bac6942') trying to set state to Powering down when already Down (vm:613)
2018-12-19 12:08:48,227+0800 INFO (jsonrpc/7) [virt.vm] (vmId='6c11dd49-67a1-4917-be01-2a2f2bac6942') Stopping connection (guestagent:442)
2018-12-19 12:08:48,227+0800 INFO (jsonrpc/7) [virt.vm] (vmId='6c11dd49-67a1-4917-be01-2a2f2bac6942') _destroyVmGraceful attempt #0 (vm:5316)
2018-12-19 12:08:48,228+0800 WARN (jsonrpc/7) [virt.vm] (vmId='6c11dd49-67a1-4917-be01-2a2f2bac6942') VM '6c11dd49-67a1-4917-be01-2a2f2bac6942' couldn't be destroyed in libvirt: Requested operation is not valid: domain is not running (vm:5337)
2018-12-19 12:08:48,229+0800 INFO (jsonrpc/7) [vdsm.api] START teardownImage(sdUUID='7b68bd24-f73e-40a6-9c4a-5db3ecdf102e', spUUID='00000000-0000-0000-0000-000000000000', imgUUID='a0544f7f-2cf9-4c67-9386-df46a750ed74', volUUID=None) from=::ffff:10.73.73.102,60254, task_id=52e9d07a-9679-4302-ac64-1beb94445bc1 (api:46)

but this is probably enough to update the VM exit reason. And indeed on the next monitoring round ovirt-ha-agent got an error:

2018-12-19 12:08:53,761+0800 INFO (jsonrpc/2) [api.virt] START getStats() from=::1,44280, vmId=6c11dd49-67a1-4917-be01-2a2f2bac6942 (api:46)
2018-12-19 12:08:53,762+0800 INFO (jsonrpc/2) [api] FINISH getStats error=Virtual machine does not exist: {'vmId': u'6c11dd49-67a1-4917-be01-2a2f2bac6942'} (api:127)
2018-12-19 12:08:53,762+0800 INFO (jsonrpc/2) [api.virt] FINISH getStats return={'status': {'message': "Virtual machine does not exist: {'vmId': u'6c11dd49-67a1-4917-be01-2a2f2bac6942'}", 'code': 1}} from=::1,44280, vmId=6c11dd49-67a1-4917-be01-2a2f2bac6942 (api:52)

and this kept repeating until:

2018-12-19 12:10:03,884+0800 INFO (jsonrpc/5) [api.virt] START getStats() from=::1,44280, vmId=6c11dd49-67a1-4917-be01-2a2f2bac6942 (api:46)
2018-12-19 12:10:03,884+0800 INFO (jsonrpc/5) [api] FINISH getStats error=Virtual machine does not exist: {'vmId': u'6c11dd49-67a1-4917-be01-2a2f2bac6942'} (api:127)
2018-12-19 12:10:03,884+0800 INFO (jsonrpc/5) [api.virt] FINISH getStats return={'status': {'message': "Virtual machine does not exist: {'vmId': u'6c11dd49-67a1-4917-be01-2a2f2bac6942'}", 'code': 1}} from=::1,44280, vmId=6c11dd49-67a1-4917-be01-2a2f2bac6942 (api:52)

Then the engine (::ffff:10.73.73.106,52742) came up and tried to migrate the engine VM back:

2018-12-19 12:10:05,225+0800 INFO (jsonrpc/0) [api.virt] START migrationCreate(params={u'username': u'None', u'acpiEnable': u'true', u'emulatedMachine': u'pc-i440fx-rhel7.5.0', u'guestFQDN': u'rhevh-hostedengine-vm-06.lab.eng.pek2.redhat.com', u'enableGuestEvents': True, u'numOfIoThreads': u'1', u'memGuaranteedSize': 1024, u'timeOffset': u'0', u'cpuType': u'Broadwell', u'smp': u'4', u'migrationDest': u'libvirt', u'guestNumaNodes': [{u'nodeIndex': 0, u'cpus':
u'0,1,2,3', u'memory': u'16348'}], u'xml': u'<domain type=\'kvm\' id=\'3\'>\n <name>HostedEngine</name>\n <uuid>6c11dd49-67a1-4917-be01-2a2f2bac6942</uuid>\n ... from=::ffff:10.73.73.106,52742, vmId=6c11dd49-67a1-4917-be01-2a2f2bac6942 (api:46) 2018-12-19 12:10:05,226+0800 INFO (jsonrpc/0) [api.virt] START create(vmParams={u'username': u'None', u'acpiEnable': u'true', u'emulatedMachine': u'pc-i440fx-rhel7.5.0', u'guestFQDN': u'rhevh-hostedengine-vm-06.lab.eng.pek2.redhat.com', u'enableGuestEvents': True, u'numOfIoThreads': u'1', u'memGuaranteedSize': 1024, u'timeOffset': u'0', u'cpuType': u'Broadwell', u'smp': u'4', u'migrationDest': u'libvirt', u'guestNumaNodes': [{u'nodeIndex': 0, u'cpus': u'0,1,2,3', u'memory': u'16348'}], u'xml': u'<domain type=\'kvm\' id=\'3\'>\n <name>HostedEngine</name>\n <uuid>6c11dd49-67a1-4917-be01-2a2f2bac6942</uuid>\n ... from=::ffff:10.73.73.106,52742, vmId=6c11dd49-67a1-4917-be01-2a2f2bac6942 (api:46) 2018-12-19 12:10:05,243+0800 INFO (jsonrpc/0) [api.virt] FINISH create return={'status': {'message': 'Done', 'code': 0}, 'vmList': {u'guestFQDN': u'rhevh-hostedengine-vm-06.lab.eng.pek2.redhat.com', u'acpiEnable': 'true', u'emulatedMachine': 'pc-i440fx-rhel7.5.0', u'enableGuestEvents': True, u'vmId': '6c11dd49-67a1-4917-be01-2a2f2bac6942', u'memGuaranteedSize': 1024, u'timeOffset': '0', u'cpuType': 'Broadwell', u'smp': '4', u'guestNumaNodes': [{'nodeIndex': 0, 'cpus': '0,1,2,3', 'memory': '16348'}], u'xml': u'<domain type=\'kvm\' id=\'3\'>\n <name>HostedEngine</name>\n <uuid>6c11dd49-67a1-4917-be01-2a2f2bac6942</uuid>\n <metadata xmlns:ns0="http://ovirt.org/vm/tune/1.0" xmlns:ovirt-vm="http://ovirt.org/vm/1.0">\n <ns0:qos/>\n <ovirt-vm:vm xmlns:ovirt-vm="http://ovirt.org/vm/1.0">\n <ovirt-vm:clusterVersion>4.2</ovirt-vm:clusterVersion>\n <ovirt-vm:destroy_on_reboot type="bool">True</ovirt-vm:destroy_on_reboot>\n <ovirt-vm:guestAgentAPIVersion type="int">3</ovirt-vm:guestAgentAPIVersion>\n 
<ovirt-vm:launchPaused>false</ovirt-vm:launchPaused>\n <ovirt-vm:memGuaranteedSize type="int">1024</ovirt-vm:memGuaranteedSize>\n <ovirt-vm:minGuaranteedMemoryMb type="int">1024</ovirt-vm:minGuaranteedMemoryMb>\n <ovirt-vm:resumeBehavior>auto_resume</ovirt-vm:resumeBehavior>\n <ovirt-vm:startTime type="float">1545190809.52</ovirt-vm:startTime>\n <ovirt-vm:device mac_address="52:54:00:34:04:b0">\n <ovirt-vm:network>ovirtmgmt</ovirt-vm:network>\n <ovirt-vm:custom>\n <ovirt-vm:queues>4</ovirt-vm:queues>\n </ovirt-vm:custom>\n <ovirt-vm:specParams/>\n <ovirt-vm:vm_custom/>\n </ovirt-vm:device>\n <ovirt-vm:device devtype="disk" name="vda">\n <ovirt-vm:domainID>7b68bd24-f73e-40a6-9c4a-5db3ecdf102e</ovirt-vm:domainID>\n <ovirt-vm:guestName>/dev/vda</ovirt-vm:guestName>\n <ovirt-vm:imageID>a0544f7f-2cf9-4c67-9386-df46a750ed74</ovirt-vm:imageID>\n <ovirt-vm:poolID>00000000-0000-0000-0000-000000000000</ovirt-vm:poolID>\n <ovirt-vm:shared>exclusive</ovirt-vm:shared>\n <ovirt-vm:volumeID>503d88b0-2253-4567-9f90-ea8987da5d4b</ovirt-vm:volumeID>\n <ovirt-vm:specParams>\n <ovirt-vm:pinToIoThread>1</ovirt-vm:pinToIoThread>\n </ovirt-vm:specParams>\n <ovirt-vm:vm_custom/>\n <ovirt-vm:volumeChain>\n <ovirt-vm:volumeChainNode>\n <ovirt-vm:domainID>7b68bd24-f73e-40a6-9c4a-5db3ecdf102e</ovirt-vm:domainID>\n <ovirt-vm:imageID>a0544f7f-2cf9-4c67-9386-df46a750ed74</ovirt-vm:imageID>\n <ovirt-vm:leaseOffset type="int">0</ovirt-vm:leaseOffset>\n <ovirt-vm:leasePath>/rhev/data-center/mnt/10.66.8.174:_home_wangwei_nfs/7b68bd24-f73e-40a6-9c4a-5db3ecdf102e/images/a0544f7f-2cf9-4c67-9386-df46a750ed74/503d88b0-2253-4567-9f90-ea8987da5d4b.lease</ovirt-vm:leasePath>\n <ovirt-vm:path>/rhev/data-center/mnt/10.66.8.174:_home_wangwei_nfs/7b68bd24-f73e-40a6-9c4a-5db3ecdf102e/images/a0544f7f-2cf9-4c67-9386-df46a750ed74/503d88b0-2253-4567-9f90-ea8987da5d4b</ovirt-vm:path>\n <ovirt-vm:volumeID>503d88b0-2253-4567-9f90-ea8987da5d4b</ovirt-vm:volumeID>\n </ovirt-vm:volumeChainNode>\n </ovirt-vm:volumeChain>\n 
</ovirt-vm:device>\n <ovirt-vm:device devtype="disk" name="hdc">\n <ovirt-vm:specParams/>\n <ovirt-vm:vm_custom/>\n </ovirt-vm:device>\n</ovirt-vm:vm>\n </metadata>\n <maxMemory slots=\'16\' unit=\'KiB\'>66961408</maxMemory>\n <memory unit=\'KiB\'>16740352</memory>\n <currentMemory unit=\'KiB\'>16740352</currentMemory>\n <vcpu placement=\'static\' current=\'4\'>64</vcpu>\n <iothreads>1</iothreads>\n <resource>\n <partition>/machine</partition>\n </resource>\n <sysinfo type=\'smbios\'>\n <system>\n <entry name=\'manufacturer\'>Red Hat</entry>\n <entry name=\'product\'>RHEV Hypervisor</entry>\n <entry name=\'version\'>7.6-8.0.el7</entry>\n <entry name=\'serial\'>30373238-3730-4e43-3737-31353042384c</entry>\n <entry name=\'uuid\'>6c11dd49-67a1-4917-be01-2a2f2bac6942</entry>\n </system>\n </sysinfo>\n <os>\n <type arch=\'x86_64\' machine=\'pc-i440fx-rhel7.5.0\'>hvm</type>\n <boot dev=\'hd\'/>\n <smbios mode=\'sysinfo\'/>\n </os>\n <features>\n <acpi/>\n <vmcoreinfo state=\'on\'/>\n </features>\n <cpu mode=\'custom\' match=\'exact\' check=\'full\'>\n <model fallback=\'forbid\'>Broadwell</model>\n <topology sockets=\'16\' cores=\'4\' threads=\'1\'/>\n <feature policy=\'require\' name=\'vme\'/>\n <feature policy=\'require\' name=\'f16c\'/>\n <feature policy=\'require\' name=\'rdrand\'/>\n <feature policy=\'require\' name=\'hypervisor\'/>\n <feature policy=\'require\' name=\'arat\'/>\n <feature policy=\'require\' name=\'xsaveopt\'/>\n <feature policy=\'require\' name=\'abm\'/>\n <numa>\n <cell id=\'0\' cpus=\'0-3\' memory=\'16740352\' unit=\'KiB\'/>\n </numa>\n </cpu>\n <clock offset=\'variable\' adjustment=\'0\' basis=\'utc\'>\n <timer name=\'rtc\' tickpolicy=\'catchup\'/>\n <timer name=\'pit\' tickpolicy=\'delay\'/>\n <timer name=\'hpet\' present=\'no\'/>\n </clock>\n <on_poweroff>destroy</on_poweroff>\n <on_reboot>destroy</on_reboot>\n <on_crash>destroy</on_crash>\n <pm>\n <suspend-to-mem enabled=\'no\'/>\n <suspend-to-disk enabled=\'no\'/>\n </pm>\n <devices>\n 
<emulator>/usr/libexec/qemu-kvm</emulator>\n <disk type=\'file\' device=\'cdrom\'>\n <driver name=\'qemu\' error_policy=\'report\'/>\n <source startupPolicy=\'optional\'/>\n <target dev=\'hdc\' bus=\'ide\'/>\n <readonly/>\n <alias name=\'ua-93ca26a0-224c-49bc-aedb-075164e7129b\'/>\n <address type=\'drive\' controller=\'0\' bus=\'1\' target=\'0\' unit=\'0\'/>\n </disk>\n <disk type=\'file\' device=\'disk\' snapshot=\'no\'>\n <driver name=\'qemu\' type=\'raw\' cache=\'none\' error_policy=\'stop\' io=\'threads\' iothread=\'1\'/>\n <source file=\'/var/run/vdsm/storage/7b68bd24-f73e-40a6-9c4a-5db3ecdf102e/a0544f7f-2cf9-4c67-9386-df46a750ed74/503d88b0-2253-4567-9f90-ea8987da5d4b\'/>\n <backingStore/>\n <target dev=\'vda\' bus=\'virtio\'/>\n <serial>a0544f7f-2cf9-4c67-9386-df46a750ed74</serial>\n <alias name=\'ua-a0544f7f-2cf9-4c67-9386-df46a750ed74\'/>\n <address type=\'pci\' domain=\'0x0000\' bus=\'0x00\' slot=\'0x07\' function=\'0x0\'/>\n </disk>\n <controller type=\'virtio-serial\' index=\'0\' ports=\'16\'>\n <alias name=\'ua-21005fec-1ab4-4bd8-b546-258b75db337d\'/>\n <address type=\'pci\' domain=\'0x0000\' bus=\'0x00\' slot=\'0x05\' function=\'0x0\'/>\n </controller>\n <controller type=\'scsi\' index=\'0\' model=\'virtio-scsi\'>\n <driver iothread=\'1\'/>\n <alias name=\'ua-2238c2d8-9078-4d1d-a35c-b8c07be921c7\'/>\n <address type=\'pci\' domain=\'0x0000\' bus=\'0x00\' slot=\'0x06\' function=\'0x0\'/>\n </controller>\n <controller type=\'ide\' index=\'0\'>\n <alias name=\'ide\'/>\n <address type=\'pci\' domain=\'0x0000\' bus=\'0x00\' slot=\'0x01\' function=\'0x1\'/>\n </controller>\n <controller type=\'usb\' index=\'0\' model=\'piix3-uhci\'>\n <alias name=\'usb\'/>\n <address type=\'pci\' domain=\'0x0000\' bus=\'0x00\' slot=\'0x01\' function=\'0x2\'/>\n </controller>\n <controller type=\'pci\' index=\'0\' model=\'pci-root\'>\n <alias name=\'pci.0\'/>\n </controller>\n <lease>\n <lockspace>7b68bd24-f73e-40a6-9c4a-5db3ecdf102e</lockspace>\n 
<key>503d88b0-2253-4567-9f90-ea8987da5d4b</key>\n <target path=\'/rhev/data-center/mnt/10.66.8.174:_home_wangwei_nfs/7b68bd24-f73e-40a6-9c4a-5db3ecdf102e/images/a0544f7f-2cf9-4c67-9386-df46a750ed74/503d88b0-2253-4567-9f90-ea8987da5d4b.lease\'/>\n </lease>\n <interface type=\'bridge\'>\n <mac address=\'52:54:00:34:04:b0\'/>\n <source bridge=\'ovirtmgmt\'/>\n <target dev=\'vnet0\'/>\n <model type=\'virtio\'/>\n <driver name=\'vhost\' queues=\'4\'/>\n <filterref filter=\'vdsm-no-mac-spoofing\'/>\n <link state=\'up\'/>\n <mtu size=\'1500\'/>\n <alias name=\'ua-1742b18c-41b7-421e-a898-a09e758587e1\'/>\n <address type=\'pci\' domain=\'0x0000\' bus=\'0x00\' slot=\'0x03\' function=\'0x0\'/>\n </interface>\n <channel type=\'unix\'>\n <source mode=\'bind\' path=\'/var/lib/libvirt/qemu/channels/6c11dd49-67a1-4917-be01-2a2f2bac6942.ovirt-guest-agent.0\'/>\n <target type=\'virtio\' name=\'ovirt-guest-agent.0\' state=\'connected\'/>\n <alias name=\'channel0\'/>\n <address type=\'virtio-serial\' controller=\'0\' bus=\'0\' port=\'1\'/>\n </channel>\n <channel type=\'unix\'>\n <source mode=\'bind\' path=\'/var/lib/libvirt/qemu/channels/6c11dd49-67a1-4917-be01-2a2f2bac6942.org.qemu.guest_agent.0\'/>\n <target type=\'virtio\' name=\'org.qemu.guest_agent.0\' state=\'connected\'/>\n <alias name=\'channel1\'/>\n <address type=\'virtio-serial\' controller=\'0\' bus=\'0\' port=\'2\'/>\n </channel>\n <channel type=\'spicevmc\'>\n <target type=\'virtio\' name=\'com.redhat.spice.0\' state=\'disconnected\'/>\n <alias name=\'channel2\'/>\n <address type=\'virtio-serial\' controller=\'0\' bus=\'0\' port=\'3\'/>\n </channel>\n <channel type=\'unix\'>\n <source mode=\'bind\' path=\'/var/lib/libvirt/qemu/channels/6c11dd49-67a1-4917-be01-2a2f2bac6942.org.ovirt.hosted-engine-setup.0\'/>\n <target type=\'virtio\' name=\'org.ovirt.hosted-engine-setup.0\' state=\'disconnected\'/>\n <alias name=\'channel3\'/>\n <address type=\'virtio-serial\' controller=\'0\' bus=\'0\' port=\'4\'/>\n </channel>\n <input 
type=\'mouse\' bus=\'ps2\'>\n <alias name=\'input0\'/>\n </input>\n <input type=\'keyboard\' bus=\'ps2\'>\n <alias name=\'input1\'/>\n </input>\n <graphics type=\'vnc\' port=\'5900\' autoport=\'yes\' listen=\'10.73.73.106\' keymap=\'en-us\' passwdValidTo=\'1970-01-01T00:00:01\'>\n <listen type=\'network\' address=\'10.73.73.106\' network=\'vdsm-ovirtmgmt\'/>\n </graphics>\n <graphics type=\'spice\' port=\'5901\' tlsPort=\'5902\' autoport=\'yes\' listen=\'10.73.73.106\' passwdValidTo=\'1970-01-01T00:00:01\'>\n <listen type=\'network\' address=\'10.73.73.106\' network=\'vdsm-ovirtmgmt\'/>\n <channel name=\'main\' mode=\'secure\'/>\n <channel name=\'display\' mode=\'secure\'/>\n <channel name=\'inputs\' mode=\'secure\'/>\n <channel name=\'cursor\' mode=\'secure\'/>\n <channel name=\'playback\' mode=\'secure\'/>\n <channel name=\'record\' mode=\'secure\'/>\n <channel name=\'smartcard\' mode=\'secure\'/>\n <channel name=\'usbredir\' mode=\'secure\'/>\n </graphics>\n <sound model=\'ich6\'>\n <alias name=\'ua-b31e3318-8378-47d0-ac8e-1de601b9e069\'/>\n <address type=\'pci\' domain=\'0x0000\' bus=\'0x00\' slot=\'0x04\' function=\'0x0\'/>\n </sound>\n <video>\n <model type=\'qxl\' ram=\'65536\' vram=\'32768\' vgamem=\'16384\' heads=\'1\' primary=\'yes\'/>\n <alias name=\'ua-2f0f0a2f-0eec-45c7-8e58-0886fb06c747\'/>\n <address type=\'pci\' domain=\'0x0000\' bus=\'0x00\' slot=\'0x02\' function=\'0x0\'/>\n </video>\n <memballoon model=\'virtio\'>\n <stats period=\'5\'/>\n <alias name=\'ua-23e814e2-af58-43c6-8413-3cc513f8842a\'/>\n <address type=\'pci\' domain=\'0x0000\' bus=\'0x00\' slot=\'0x08\' function=\'0x0\'/>\n </memballoon>\n <rng model=\'virtio\'>\n <backend model=\'random\'>/dev/urandom</backend>\n <alias name=\'ua-09ed90c4-a4c5-4f7e-98e9-3eb7677c6e92\'/>\n <address type=\'pci\' domain=\'0x0000\' bus=\'0x00\' slot=\'0x09\' function=\'0x0\'/>\n </rng>\n </devices>\n <seclabel type=\'dynamic\' model=\'selinux\' relabel=\'yes\'>\n 
<label>system_u:system_r:svirt_t:s0:c69,c782</label>\n <imagelabel>system_u:object_r:svirt_image_t:s0:c69,c782</imagelabel>\n </seclabel>\n <seclabel type=\'dynamic\' model=\'dac\' relabel=\'yes\'>\n <label>+107:+107</label>\n <imagelabel>+107:+107</imagelabel>\n </seclabel>\n</domain>\n', u'custom': {}, u'smpCoresPerSocket': '4', u'vmName': 'HostedEngine', u'username': u'None', u'maxMemSize': 65392, u'bootMenuEnable': 'false', u'numOfIoThreads': '1', u'smpThreadsPerCore': '1', u'guestDiskMapping': {}, u'arch': 'x86_64', u'guestIPs': u'10.73.73.102', u'kvmEnable': 'true', 'devices': [{u'index': 2, u'iface': u'ide', u'name': u'hdc', u'vm_custom': {}, u'format': u'raw', u'vmid': u'6c11dd49-67a1-4917-be01-2a2f2bac6942', u'diskType': u'file', u'alias': u'ua-93ca26a0-224c-49bc-aedb-075164e7129b', u'readonly': u'True', u'specParams': {}, u'address': {u'bus': u'1', u'controller': u'0', u'type': u'drive', u'target': u'0', u'unit': u'0'}, u'device': u'cdrom', u'discard': False, u'path': u'', u'propagateErrors': u'report', u'type': u'disk'}, {u'address': {u'slot': u'0x07', u'bus': u'0x00', u'domain': u'0x0000', u'type': u'pci', u'function': u'0x0'}, u'reqsize': u'0', u'serial': u'a0544f7f-2cf9-4c67-9386-df46a750ed74', u'index': 0, u'iface': u'virtio', u'apparentsize': u'62277025792', u'specParams': {u'pinToIoThread': u'1'}, u'guestName': u'/dev/vda', u'cache': u'none', u'imageID': u'a0544f7f-2cf9-4c67-9386-df46a750ed74', u'readonly': u'False', u'shared': u'exclusive', u'truesize': u'3757756416', u'type': u'disk', u'domainID': u'7b68bd24-f73e-40a6-9c4a-5db3ecdf102e', u'volumeInfo': {u'path': u'/rhev/data-center/mnt/10.66.8.174:_home_wangwei_nfs/7b68bd24-f73e-40a6-9c4a-5db3ecdf102e/images/a0544f7f-2cf9-4c67-9386-df46a750ed74/503d88b0-2253-4567-9f90-ea8987da5d4b', u'type': u'file'}, u'format': u'raw', u'poolID': u'00000000-0000-0000-0000-000000000000', u'device': u'disk', u'path': 
u'/var/run/vdsm/storage/7b68bd24-f73e-40a6-9c4a-5db3ecdf102e/a0544f7f-2cf9-4c67-9386-df46a750ed74/503d88b0-2253-4567-9f90-ea8987da5d4b', u'propagateErrors': u'off', u'name': u'vda', u'vm_custom': {}, u'vmid': u'6c11dd49-67a1-4917-be01-2a2f2bac6942', u'volumeID': u'503d88b0-2253-4567-9f90-ea8987da5d4b', u'diskType': u'file', u'alias': u'ua-a0544f7f-2cf9-4c67-9386-df46a750ed74', u'discard': False, u'volumeChain': [{u'domainID': u'7b68bd24-f73e-40a6-9c4a-5db3ecdf102e', u'leaseOffset': 0, u'volumeID': u'503d88b0-2253-4567-9f90-ea8987da5d4b', u'leasePath': u'/rhev/data-center/mnt/10.66.8.174:_home_wangwei_nfs/7b68bd24-f73e-40a6-9c4a-5db3ecdf102e/images/a0544f7f-2cf9-4c67-9386-df46a750ed74/503d88b0-2253-4567-9f90-ea8987da5d4b.lease', u'imageID': u'a0544f7f-2cf9-4c67-9386-df46a750ed74', u'path': u'/rhev/data-center/mnt/10.66.8.174:_home_wangwei_nfs/7b68bd24-f73e-40a6-9c4a-5db3ecdf102e/images/a0544f7f-2cf9-4c67-9386-df46a750ed74/503d88b0-2253-4567-9f90-ea8987da5d4b'}]}], u'status': 'Migration Destination', u'maxVCpus': '64', 'clientIp': u'', u'statusTime': '4298529960', u'maxMemSlots': 16}} from=::ffff:10.73.73.106,52742, vmId=6c11dd49-67a1-4917-be01-2a2f2bac6942 (api:52) 2018-12-19 12:10:05,243+0800 INFO (vm/6c11dd49) [virt.vm] (vmId='6c11dd49-67a1-4917-be01-2a2f2bac6942') VM wrapper has started (vm:2785) but with 2018-12-19 12:10:05,283+0800 ERROR (jsonrpc/0) [virt.vm] (vmId='6c11dd49-67a1-4917-be01-2a2f2bac6942') Alias not found for device type lease during migration at destination host (vm:5674) that got ignored, and so VDSM started to prepare the storage and so on until: 2018-12-19 12:10:49,926+0800 WARN (vm/6c11dd49) [virt.vm] (vmId='6c11dd49-67a1-4917-be01-2a2f2bac6942') Cannot find device alias for _conf:{'lease_id': '503d88b0-2253-4567-9f90-ea8987da5d4b', 'sd_id': '7b68bd24-f73e-40a6-9c4a-5db3ecdf102e', 'offset': '', 'device': 'lease', 'path': 
'/rhev/data-center/mnt/10.66.8.174:_home_wangwei_nfs/7b68bd24-f73e-40a6-9c4a-5db3ecdf102e/images/a0544f7f-2cf9-4c67-9386-df46a750ed74/503d88b0-2253-4567-9f90-ea8987da5d4b.lease', 'type': 'lease'} _deviceXML:None alias:None config:<bound method Device.config of <lease.Device sd_id=7b68bd24-f73e-40a6-9c4a-5db3ecdf102e, lease_id=503d88b0-2253-4567-9f90-ea8987da5d4b, path=/rhev/data-center/mnt/10.66.8.174:_home_wangwei_nfs/7b68bd24-f73e-40a6-9c4a-5db3ecdf102e/images/a0544f7f-2cf9-4c67-9386-df46a750ed74/503d88b0-2253-4567-9f90-ea8987da5d4b.lease, offset= at 0x7f81f02129f0>> createXmlElem:<bound method Device.createXmlElem of <lease.Device sd_id=7b68bd24-f73e-40a6-9c4a-5db3ecdf102e, lease_id=503d88b0-2253-4567-9f90-ea8987da5d4b, path=/rhev/data-center/mnt/10.66.8.174:_home_wangwei_nfs/7b68bd24-f73e-40a6-9c4a-5db3ecdf102e/images/a0544f7f-2cf9-4c67-9386-df46a750ed74/503d88b0-2253-4567-9f90-ea8987da5d4b.lease, offset= at 0x7f81f02129f0>> custom:{} device:lease deviceId:None deviceType:None from_xml_tree:<bound method type.from_xml_tree of <class 'vdsm.virt.vmdevices.lease.Device'>> getXML:<bound method Device.getXML of <lease.Device sd_id=7b68bd24-f73e-40a6-9c4a-5db3ecdf102e, lease_id=503d88b0-2253-4567-9f90-ea8987da5d4b, path=/rhev/data-center/mnt/10.66.8.174:_home_wangwei_nfs/7b68bd24-f73e-40a6-9c4a-5db3ecdf102e/images/a0544f7f-2cf9-4c67-9386-df46a750ed74/503d88b0-2253-4567-9f90-ea8987da5d4b.lease, offset= at 0x7f81f02129f0>> get_extra_xmls:<bound method Device.get_extra_xmls of <lease.Device sd_id=7b68bd24-f73e-40a6-9c4a-5db3ecdf102e, lease_id=503d88b0-2253-4567-9f90-ea8987da5d4b, path=/rhev/data-center/mnt/10.66.8.174:_home_wangwei_nfs/7b68bd24-f73e-40a6-9c4a-5db3ecdf102e/images/a0544f7f-2cf9-4c67-9386-df46a750ed74/503d88b0-2253-4567-9f90-ea8987da5d4b.lease, offset= at 0x7f81f02129f0>> get_identifying_attrs:<bound method type.get_identifying_attrs of <class 'vdsm.virt.vmdevices.lease.Device'>> get_metadata:<bound method Device.get_metadata of <lease.Device 
sd_id=7b68bd24-f73e-40a6-9c4a-5db3ecdf102e, lease_id=503d88b0-2253-4567-9f90-ea8987da5d4b, path=/rhev/data-center/mnt/10.66.8.174:_home_wangwei_nfs/7b68bd24-f73e-40a6-9c4a-5db3ecdf102e/images/a0544f7f-2cf9-4c67-9386-df46a750ed74/503d88b0-2253-4567-9f90-ea8987da5d4b.lease, offset= at 0x7f81f02129f0>> is_attached_to:<bound method Device.is_attached_to of <lease.Device sd_id=7b68bd24-f73e-40a6-9c4a-5db3ecdf102e, lease_id=503d88b0-2253-4567-9f90-ea8987da5d4b, path=/rhev/data-center/mnt/10.66.8.174:_home_wangwei_nfs/7b68bd24-f73e-40a6-9c4a-5db3ecdf102e/images/a0544f7f-2cf9-4c67-9386-df46a750ed74/503d88b0-2253-4567-9f90-ea8987da5d4b.lease, offset= at 0x7f81f02129f0>> is_hostdevice:False lease_id:503d88b0-2253-4567-9f90-ea8987da5d4b log:<vdsm.common.logutils.SimpleLogAdapter object at 0x7f81f0074110> offset: path:/rhev/data-center/mnt/10.66.8.174:_home_wangwei_nfs/7b68bd24-f73e-40a6-9c4a-5db3ecdf102e/images/a0544f7f-2cf9-4c67-9386-df46a750ed74/503d88b0-2253-4567-9f90-ea8987da5d4b.lease sd_id:7b68bd24-f73e-40a6-9c4a-5db3ecdf102e setup:<bound method Device.setup of <lease.Device sd_id=7b68bd24-f73e-40a6-9c4a-5db3ecdf102e, lease_id=503d88b0-2253-4567-9f90-ea8987da5d4b, path=/rhev/data-center/mnt/10.66.8.174:_home_wangwei_nfs/7b68bd24-f73e-40a6-9c4a-5db3ecdf102e/images/a0544f7f-2cf9-4c67-9386-df46a750ed74/503d88b0-2253-4567-9f90-ea8987da5d4b.lease, offset= at 0x7f81f02129f0>> specParams:{} teardown:<bound method Device.teardown of <lease.Device sd_id=7b68bd24-f73e-40a6-9c4a-5db3ecdf102e, lease_id=503d88b0-2253-4567-9f90-ea8987da5d4b, path=/rhev/data-center/mnt/10.66.8.174:_home_wangwei_nfs/7b68bd24-f73e-40a6-9c4a-5db3ecdf102e/images/a0544f7f-2cf9-4c67-9386-df46a750ed74/503d88b0-2253-4567-9f90-ea8987da5d4b.lease, offset= at 0x7f81f02129f0>> type:lease update_device_info:<bound method type.update_device_info of <class 'vdsm.virt.vmdevices.lease.Device'>> vmid:None (core:957) that got ignored. 
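As an aside: the caller attribution used throughout this analysis hinges on the from= field of each vdsm.log API line. Loopback (::1) means a local client such as ovirt-ha-agent; a routable address (here the engine VM's IP) identifies the engine. A minimal sketch of that classification — the regex and helper names are illustrative only, not part of vdsm:

```python
import re

# Illustrative helper (not part of vdsm): classify a vdsm.log API line by
# its "from=" source address.
LOG_RE = re.compile(
    r"^(?P<ts>\S+ \S+) INFO\s+\(\S+\) \[api\.\w+\] "
    r"(?P<phase>START|FINISH) (?P<verb>\w+).*?"
    r"from=(?P<src>\S+?),(?P<port>\d+)"
)

def classify(line):
    """Return (API verb, caller description) for a matching line, else None."""
    m = LOG_RE.search(line)
    if m is None:
        return None
    src = m.group("src")
    if src == "::1":
        caller = "local (ovirt-ha-agent)"
    else:
        caller = "remote (engine at %s)" % src.replace("::ffff:", "")
    return (m.group("verb"), caller)

line = ("2018-12-19 12:07:09,524+0800 INFO (jsonrpc/5) [api.virt] START "
        "migrate(params={}) from=::1,44282, vmId=6c11dd49 (api:46)")
print(classify(line))
```

Running this over the excerpts above separates the agent's migrate/getMigrationStatus traffic from the engine's getAllVmStats/destroy traffic.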
At the same time ovirt-ha-agent (::1,44282) is still active:

2018-12-19 12:11:10,550+0800 INFO (jsonrpc/0) [api.virt] START migrate(params={u'tunneled': False, u'autoConverge': True, u'dstqemu': u'10.73.73.106', u'src': u'localhost', u'enableGuestEvents': True, u'dst': u'10.73.73.106', u'convergenceSchedule': {u'init': [{u'params': [u'100'], u'name': u'setDowntime'}], u'stalling': [{u'action': {u'params': [u'150'], u'name': u'setDowntime'}, u'limit': 1}, {u'action': {u'params': [u'200'], u'name': u'setDowntime'}, u'limit': 2}, {u'action': {u'params': [u'300'], u'name': u'setDowntime'}, u'limit': 3}, {u'action': {u'params': [u'400'], u'name': u'setDowntime'}, u'limit': 4}, {u'action': {u'params': [u'500'], u'name': u'setDowntime'}, u'limit': 6}, {u'action': {u'params': [u'5000'], u'name': u'setDowntime'}, u'limit': -1}, {u'action': {u'params': [], u'name': u'abort'}, u'limit': -1}]}, u'vmId': u'6c11dd49-67a1-4917-be01-2a2f2bac6942', u'abortOnError': True, u'compressed': False, u'method': u'online'}) from=::1,44282, vmId=6c11dd49-67a1-4917-be01-2a2f2bac6942 (api:46)
2018-12-19 12:11:10,552+0800 INFO (jsonrpc/0) [api.virt] FINISH migrate return={'status': {'message': 'Migration in progress', 'code': 0}, 'progress': 0} from=::1,44282, vmId=6c11dd49-67a1-4917-be01-2a2f2bac6942 (api:52)
2018-12-19 12:11:10,552+0800 INFO (jsonrpc/0) [jsonrpc.JsonRpcServer] RPC call VM.migrate succeeded in 0.00 seconds (__init__:573)
2018-12-19 12:11:10,590+0800 INFO (jsonrpc/1) [jsonrpc.JsonRpcServer] RPC call Host.ping2 succeeded in 0.00 seconds (__init__:573)
2018-12-19 12:11:10,987+0800 INFO (migsrc/6c11dd49) [virt.vm] (vmId='6c11dd49-67a1-4917-be01-2a2f2bac6942') Creation of destination VM took: 0 seconds (migration:473)
2018-12-19 12:11:10,988+0800 INFO (migsrc/6c11dd49) [virt.vm] (vmId='6c11dd49-67a1-4917-be01-2a2f2bac6942') starting migration to qemu+tls://10.73.73.106/system with miguri tcp://10.73.73.106 (migration:502)
2018-12-19 12:11:12,302+0800 INFO (jsonrpc/4) [vdsm.api] START getSpmStatus(spUUID=u'febfea04-033c-11e9-9c65-5254003404b0', options=None) from=::ffff:10.73.73.102,60254, task_id=5ae98c04-1bcd-40df-8a59-ffabf23d17fa (api:46)

So it tries to migrate the engine VM to tcp://10.73.73.106 again, but this time it fails with:

2018-12-19 12:11:14,225+0800 ERROR (migsrc/6c11dd49) [virt.vm] (vmId='6c11dd49-67a1-4917-be01-2a2f2bac6942') Domain not found: no domain with matching name 'HostedEngine' (migration:290)
2018-12-19 12:11:14,230+0800 ERROR (migsrc/6c11dd49) [virt.vm] (vmId='6c11dd49-67a1-4917-be01-2a2f2bac6942') Failed to migrate (migration:455)
Traceback (most recent call last):
  File "/usr/lib/python2.7/site-packages/vdsm/virt/migration.py", line 437, in _regular_run
    self._startUnderlyingMigration(time.time())
  File "/usr/lib/python2.7/site-packages/vdsm/virt/migration.py", line 509, in _startUnderlyingMigration
    self._perform_with_conv_schedule(duri, muri)
  File "/usr/lib/python2.7/site-packages/vdsm/virt/migration.py", line 587, in _perform_with_conv_schedule
    self._perform_migration(duri, muri)
  File "/usr/lib/python2.7/site-packages/vdsm/virt/migration.py", line 529, in _perform_migration
    self._migration_flags)
  File "/usr/lib/python2.7/site-packages/vdsm/virt/virdomain.py", line 98, in f
    ret = attr(*args, **kwargs)
  File "/usr/lib/python2.7/site-packages/vdsm/common/libvirtconnection.py", line 130, in wrapper
    ret = f(*args, **kwargs)
  File "/usr/lib/python2.7/site-packages/vdsm/common/function.py", line 92, in wrapper
    return func(inst, *args, **kwargs)
  File "/usr/lib64/python2.7/site-packages/libvirt.py", line 1779, in migrateToURI3
    if ret == -1: raise libvirtError ('virDomainMigrateToURI3() failed', dom=self)
libvirtError: Domain not found: no domain with matching name 'HostedEngine'

which leads, in /var/log/messages, to:

Dec 19 12:11:14 localhost libvirtd: 2018-12-19 04:11:14.108+0000: 31096: error : qemuMigrationJobCheckStatus:1435 : operation failed: migration out job: unexpectedly failed

So:
1. ovirt-ha-agent started the migration.
2. The migration completed successfully.
3. The engine saw the engine VM on the second host and tried to kill the engine VM on the first one.
4. ovirt-ha-agent was notified that the engine VM had disappeared, so it successfully went into local maintenance (which is not reported back to the engine).
5. The engine tried to migrate the engine VM back, but that failed due to an issue with a volume lease.
6. ovirt-ha-agent then noticed a new engine VM appearing locally.
7. Since it is supposed to be in local maintenance, ovirt-ha-agent tried to migrate it away another time.
8. But ovirt-ha-agent failed to trigger the migration, since the engine VM never really came back.

agent.log confirms the behaviour described in comment 5: host 1 was in local maintenance and the VM was on host 2:

MainThread::INFO::2018-12-19 12:09:10,444::state_decorators::128::ovirt_hosted_engine_ha.agent.hosted_engine.HostedEngine::(check) Local maintenance detected
MainThread::INFO::2018-12-19 12:09:10,572::hosted_engine::613::ovirt_hosted_engine_ha.agent.hosted_engine.HostedEngine::(_stop_domain_monitor) Stopped VDSM domain monitor
MainThread::INFO::2018-12-19 12:09:10,572::hosted_engine::491::ovirt_hosted_engine_ha.agent.hosted_engine.HostedEngine::(_monitoring_loop) Current state LocalMaintenance (score: 0)
MainThread::INFO::2018-12-19 12:09:19,592::state_decorators::128::ovirt_hosted_engine_ha.agent.hosted_engine.HostedEngine::(check) Local maintenance detected
MainThread::INFO::2018-12-19 12:09:19,727::hosted_engine::613::ovirt_hosted_engine_ha.agent.hosted_engine.HostedEngine::(_stop_domain_monitor) Stopped VDSM domain monitor
MainThread::INFO::2018-12-19 12:09:19,727::hosted_engine::491::ovirt_hosted_engine_ha.agent.hosted_engine.HostedEngine::(_monitoring_loop) Current state LocalMaintenance (score: 0)

Then it detected an anomalous state (the engine trying to migrate the VM back while in maintenance?)
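The race in the steps above can be sketched as a toy timeline. This is a deliberately simplified model with hypothetical names (`Host`, `agent_migrate`, `events`), not the real ovirt-ha-agent or engine code: the point is that the agent's second migration attempt acts on a domain that no longer exists locally, which is exactly the "Domain not found" failure libvirt reports.

```python
# Toy model of the race described in steps 1-8 above. All names here are
# hypothetical; the real logic lives in ovirt-hosted-engine-ha and vdsm.

events = []

class Host:
    def __init__(self, name):
        self.name = name
        self.vm_here = False  # does the engine VM run on this host?

def agent_migrate(src, dst):
    """The HA agent tries to migrate the engine VM away from src."""
    if not src.vm_here:
        # This mirrors step 8: the domain is already gone locally, so
        # libvirt answers "Domain not found: no domain with matching
        # name 'HostedEngine'" and the agent reports a fatal error.
        events.append("migration failed: domain not found on %s" % src.name)
        return False
    src.vm_here, dst.vm_here = False, True
    events.append("migrated %s -> %s" % (src.name, dst.name))
    return True

host_a, host_b = Host("host_a"), Host("host_b")
host_a.vm_here = True

agent_migrate(host_a, host_b)  # steps 1-2: first migration succeeds
# Steps 5-6 happen in between: the engine tries to migrate the VM back,
# fails on a volume lease, and the agent briefly "sees" a local VM that
# never really starts.
agent_migrate(host_a, host_b)  # steps 7-8: retry hits "domain not found"

print(events)
```

The second call fails not because migration is broken, but because the agent's view ("a VM appeared locally, and I am in maintenance, so migrate it away") is stale by the time it acts on it.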
and so it went into ReinitializeFSM:

MainThread::INFO::2018-12-19 12:09:29,775::brokerlink::68::ovirt_hosted_engine_ha.lib.brokerlink.BrokerLink::(notify) Success, was notification of state_transition (LocalMaintenance-ReinitializeFSM) sent? sent
MainThread::INFO::2018-12-19 12:09:29,827::ovf_store::151::ovirt_hosted_engine_ha.lib.ovf.ovf_store.OVFStore::(getEngineVMOVF) OVF_STORE volume path: /var/run/vdsm/storage/7b68bd24-f73e-40a6-9c4a-5db3ecdf102e/6469ecb5-0c22-4789-be72-0b16187143a3/dfd76e75-1a6d-42fe-8531-571605c76d6b
MainThread::INFO::2018-12-19 12:09:29,888::hosted_engine::491::ovirt_hosted_engine_ha.agent.hosted_engine.HostedEngine::(_monitoring_loop) Current state ReinitializeFSM (score: 0)
MainThread::INFO::2018-12-19 12:09:39,934::brokerlink::68::ovirt_hosted_engine_ha.lib.brokerlink.BrokerLink::(notify) Success, was notification of state_transition (ReinitializeFSM-EngineDown) sent? sent
MainThread::INFO::2018-12-19 12:09:40,058::hosted_engine::491::ovirt_hosted_engine_ha.agent.hosted_engine.HostedEngine::(_monitoring_loop) Current state EngineDown (score: 3400)
MainThread::INFO::2018-12-19 12:09:40,058::hosted_engine::499::ovirt_hosted_engine_ha.agent.hosted_engine.HostedEngine::(_monitoring_loop) Best remote host 10.73.73.106 (id: 2, score: 3400)
MainThread::INFO::2018-12-19 12:09:50,077::states::486::ovirt_hosted_engine_ha.agent.hosted_engine.HostedEngine::(consume) Engine vm is running on host 10.73.73.106 (id 2)
MainThread::INFO::2018-12-19 12:09:50,208::hosted_engine::491::ovirt_hosted_engine_ha.agent.hosted_engine.HostedEngine::(_monitoring_loop) Current state EngineDown (score: 3400)
MainThread::INFO::2018-12-19 12:10:00,351::hosted_engine::491::ovirt_hosted_engine_ha.agent.hosted_engine.HostedEngine::(_monitoring_loop) Current state EngineDown (score: 3400)
MainThread::INFO::2018-12-19 12:10:10,510::hosted_engine::491::ovirt_hosted_engine_ha.agent.hosted_engine.HostedEngine::(_monitoring_loop) Current state EngineDown (score: 3400)
MainThread::INFO::2018-12-19 12:10:20,671::hosted_engine::491::ovirt_hosted_engine_ha.agent.hosted_engine.HostedEngine::(_monitoring_loop) Current state EngineDown (score: 3400)
MainThread::INFO::2018-12-19 12:10:29,838::hosted_engine::491::ovirt_hosted_engine_ha.agent.hosted_engine.HostedEngine::(_monitoring_loop) Current state EngineDown (score: 3400)
MainThread::INFO::2018-12-19 12:10:39,998::hosted_engine::491::ovirt_hosted_engine_ha.agent.hosted_engine.HostedEngine::(_monitoring_loop) Current state EngineDown (score: 3400)
MainThread::INFO::2018-12-19 12:10:50,145::hosted_engine::491::ovirt_hosted_engine_ha.agent.hosted_engine.HostedEngine::(_monitoring_loop) Current state EngineDown (score: 3400)

and then the engine VM appeared and so it moved to EngineUp:

MainThread::INFO::2018-12-19 12:11:00,164::states::473::ovirt_hosted_engine_ha.agent.hosted_engine.HostedEngine::(consume) Engine vm unexpectedly running locally, monitoring vm
MainThread::INFO::2018-12-19 12:11:00,202::brokerlink::68::ovirt_hosted_engine_ha.lib.brokerlink.BrokerLink::(notify) Success, was notification of state_transition (EngineDown-EngineStarting) sent? sent
MainThread::INFO::2018-12-19 12:11:00,330::hosted_engine::491::ovirt_hosted_engine_ha.agent.hosted_engine.HostedEngine::(_monitoring_loop) Current state EngineStarting (score: 3400)
MainThread::INFO::2018-12-19 12:11:00,339::state_decorators::92::ovirt_hosted_engine_ha.agent.hosted_engine.HostedEngine::(check) Timeout cleared while transitioning <class 'ovirt_hosted_engine_ha.agent.states.EngineStarting'> -> <class 'ovirt_hosted_engine_ha.agent.states.EngineUp'>
MainThread::INFO::2018-12-19 12:11:00,365::brokerlink::68::ovirt_hosted_engine_ha.lib.brokerlink.BrokerLink::(notify) Success, was notification of state_transition (EngineStarting-EngineUp) sent? sent
MainThread::INFO::2018-12-19 12:11:00,378::hosted_engine::491::ovirt_hosted_engine_ha.agent.hosted_engine.HostedEngine::(_monitoring_loop) Current state EngineUp (score: 3400)

but it was still in local maintenance:

MainThread::INFO::2018-12-19 12:11:10,391::state_decorators::128::ovirt_hosted_engine_ha.agent.hosted_engine.HostedEngine::(check) Local maintenance detected

and so it tries to migrate away the engine VM again:

MainThread::INFO::2018-12-19 12:11:10,424::brokerlink::68::ovirt_hosted_engine_ha.lib.brokerlink.BrokerLink::(notify) Success, was notification of state_transition (EngineUp-LocalMaintenanceMigrateVm) sent? sent
MainThread::WARNING::2018-12-19 12:11:10,542::hosted_engine::600::ovirt_hosted_engine_ha.agent.hosted_engine.HostedEngine::(_stop_domain_monitor_if_possible) The VM is running locally or we have no data, keeping the domain monitor.
MainThread::INFO::2018-12-19 12:11:10,543::states::243::ovirt_hosted_engine_ha.agent.hosted_engine.HostedEngine::(score) Score is 0 due to local maintenance mode
MainThread::INFO::2018-12-19 12:11:10,543::hosted_engine::491::ovirt_hosted_engine_ha.agent.hosted_engine.HostedEngine::(_monitoring_loop) Current state LocalMaintenanceMigrateVm (score: 0)
MainThread::INFO::2018-12-19 12:11:10,588::brokerlink::68::ovirt_hosted_engine_ha.lib.brokerlink.BrokerLink::(notify) Success, was notification of state_transition (LocalMaintenanceMigrateVm-EngineMigratingAway) sent? sent
MainThread::WARNING::2018-12-19 12:11:10,588::hosted_engine::600::ovirt_hosted_engine_ha.agent.hosted_engine.HostedEngine::(_stop_domain_monitor_if_possible) The VM is running locally or we have no data, keeping the domain monitor.
MainThread::INFO::2018-12-19 12:11:10,592::hosted_engine::491::ovirt_hosted_engine_ha.agent.hosted_engine.HostedEngine::(_monitoring_loop) Current state EngineMigratingAway (score: 3400)

but it failed since the engine VM wasn't really here:

MainThread::ERROR::2018-12-19 12:11:19,615::states::828::ovirt_hosted_engine_ha.agent.hosted_engine.HostedEngine::(consume) Migration failed: {u'status': {u'message': u'Fatal error during migration', u'code': 12}, u'progress': 0}
MainThread::INFO::2018-12-19 12:11:19,641::brokerlink::68::ovirt_hosted_engine_ha.lib.brokerlink.BrokerLink::(notify) Success, was notification of state_transition (EngineMigratingAway-ReinitializeFSM) sent? ignored
MainThread::WARNING::2018-12-19 12:11:19,772::hosted_engine::600::ovirt_hosted_engine_ha.agent.hosted_engine.HostedEngine::(_stop_domain_monitor_if_possible) The VM is running locally or we have no data, keeping the domain monitor.
MainThread::INFO::2018-12-19 12:11:19,772::hosted_engine::491::ovirt_hosted_engine_ha.agent.hosted_engine.HostedEngine::(_monitoring_loop) Current state ReinitializeFSM (score: 0)
MainThread::INFO::2018-12-19 12:11:19,774::state_decorators::128::ovirt_hosted_engine_ha.agent.hosted_engine.HostedEngine::(check) Local maintenance detected
MainThread::INFO::2018-12-19 12:11:19,803::brokerlink::68::ovirt_hosted_engine_ha.lib.brokerlink.BrokerLink::(notify) Success, was notification of state_transition (ReinitializeFSM-LocalMaintenance) sent? sent
MainThread::WARNING::2018-12-19 12:11:19,803::hosted_engine::600::ovirt_hosted_engine_ha.agent.hosted_engine.HostedEngine::(_stop_domain_monitor_if_possible) The VM is running locally or we have no data, keeping the domain monitor.
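The "Fatal error during migration" above comes from asking libvirt to migrate a domain that no longer exists on the host. A defensive pattern is to re-check domain presence immediately before triggering the migration. The sketch below is a hypothetical guard, not vdsm's actual code: `lookup` and `DomainNotFound` stand in for libvirt's `lookupByName` and the error it raises when no domain matches the name.

```python
class DomainNotFound(Exception):
    """Stand-in for the libvirt error raised when no domain matches."""

def migrate_if_present(lookup, name, migrate):
    # Re-check that the domain still exists locally right before the
    # migrate call: between deciding to migrate and actually doing it,
    # the VM may already have left the host (as happened in this bug).
    try:
        dom = lookup(name)
    except DomainNotFound:
        # Skip quietly instead of surfacing a fatal migration error.
        return "skipped: %s not running locally" % name
    return migrate(dom)

# Usage with stubs:
def missing(name):
    raise DomainNotFound(name)

print(migrate_if_present(missing, "HostedEngine", lambda d: "migrated"))
# -> skipped: HostedEngine not running locally
```

With such a check, step 8 of the sequence above would degrade into a harmless no-op instead of an ERROR in agent.log and a spurious ReinitializeFSM round-trip.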
MainThread::INFO::2018-12-19 12:11:19,803::hosted_engine::491::ovirt_hosted_engine_ha.agent.hosted_engine.HostedEngine::(_monitoring_loop) Current state LocalMaintenance (score: 0)

In the end nothing really bad happened, but the logs, and what we reported to the user in the meantime, got really polluted.

Looking at this, I noticed a few things:
- When ovirt-ha-agent is in the LocalMaintenance state, it does not check whether the HE VM is running on the host or not. So the movement from LocalMaintenance to ReinitializeFSM has to be caused by an external event from the command line, cockpit or the engine.
- The local maintenance mode was probably not canceled by the engine, because there is no log line in vdsm.log indicating it. The migration command from the engine was probably caused by balancing; there should not be any other reason why the engine would initiate a migration without a user command.
- It would be useful to have the engine logs, so we know why the engine sent the destroy command to vdsm and whether the host's HA score was propagated to the engine before it started the migration.

This issue has not been detected with a recent build, e.g. RHVH-4.3-20190124.0-RHVH-x86_64-dvd1.iso: migration is successful when clicking the "Put this host into local maintenance" button for the first time on the cockpit UI.

Moving to 4.3.2, not being identified as a blocker for 4.3.1.

Re-targeting to 4.3.6, not being identified as a blocker for 4.3.5.

Already fixed in 4.3.5, testing in 4.3.6 scope.

Tested with RHVH-4.3-20190801.2-RHVH-x86_64-dvd1.iso; the bug is fixed. Moving to "VERIFIED".

This bugzilla is included in oVirt 4.3.6 release, published on September 26th 2019. Since the problem described in this bug report should be resolved in oVirt 4.3.6 release, it has been closed with a resolution of CURRENT RELEASE. If the solution does not work for you, please open a new bug report.
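The analysis above observes that the LocalMaintenance state never inspects whether the HE VM is running locally, leaving that to external events. A hedged sketch of what such a check might look like follows; the function and state names here are hypothetical (the real FSM lives in ovirt_hosted_engine_ha.agent.states), so this illustrates the idea rather than the project's implementation.

```python
def local_maintenance_consume(vm_running_locally, external_reinit_requested):
    """Hypothetical consume() step for a LocalMaintenance state that also
    watches for a locally (re)appearing engine VM, instead of relying on
    an external event to kick the FSM."""
    if vm_running_locally:
        # An engine VM showed up while this host is in maintenance:
        # surface that explicitly so the agent can react in a controlled
        # way rather than bouncing through ReinitializeFSM.
        return "LocalMaintenanceVmUnexpected"
    if external_reinit_requested:
        # Command line / cockpit / engine asked for a reinit.
        return "ReinitializeFSM"
    return "LocalMaintenance"

print(local_maintenance_consume(True, False))
# -> LocalMaintenanceVmUnexpected
```

Such a dedicated transition would also make logs like the ones above easier to read, since the reason for leaving LocalMaintenance would be named in the state itself.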
Created attachment 1515506 [details]
operate from vm engine

Description of problem:
Migration fails when clicking the "Put this host into local maintenance" button for the first time on the cockpit UI. The host enters local maintenance successfully, but the VM fails to migrate to the other host.

Version-Release number of selected component (if applicable):
RHVH-4.2-20181212.0-RHVH-x86_64-dvd1.iso
cockpit-bridge-176-4.el7.x86_64
cockpit-storaged-176-4.el7.noarch
cockpit-176-4.el7.x86_64
cockpit-machines-ovirt-176-4.el7.noarch
cockpit-ovirt-dashboard-0.11.37-1.el7ev.noarch
cockpit-dashboard-176-4.el7.x86_64
cockpit-system-176-4.el7.noarch
cockpit-ws-176-4.el7.x86_64
ovirt-hosted-engine-setup-2.2.32-1.el7ev.noarch
ovirt-hosted-engine-ha-2.2.19-1.el7ev.noarch
rhvm-appliance-4.2-20181212.0.el7.noarch

How reproducible:
Not 100%

Steps to Reproduce:
1. Clean install RHVH-4.2-20181212.0-RHVH-x86_64-dvd1.iso on two hosts (A and B)
2. Deploy hosted engine on host A
3. Add host B as an additional host to the engine VM
4. Click the "Put this host into local maintenance" button on host A's cockpit UI
5. Check the migration status
6. Remove the host from maintenance
7. Click the button again, then check the migration status

Actual results:
After step 5, the migration fails.
After step 7, the migration is successful.

Expected results:
After step 5, the migration should be successful.

Additional info:
1. Triggering the migration from the engine VM's UI can also reproduce this issue.
2. Sometimes the host also fails to enter local maintenance (roughly 20% of the time).