Bug 1660765

Summary: Migration fails when clicking "Put this host into local maintenance" button first time on cockpit UI.

Product: [oVirt] cockpit-ovirt
Component: Hosted Engine
Version: ---
Status: CLOSED CURRENTRELEASE
Severity: medium
Priority: high
Target Milestone: ovirt-4.3.6
Target Release: ---
Hardware: Unspecified
OS: Unspecified
Reporter: Wei Wang <weiwang>
Assignee: Simone Tiraboschi <stirabos>
QA Contact: Wei Wang <weiwang>
CC: akrejcir, bugs, cshao, huzhao, qiyuan, sbonazzo, stirabos, weiwang, yaniwang, ycui, yturgema
Flags: sbonazzo: ovirt-4.3?
       rule-engine: planning_ack?
       sbonazzo: devel_ack+
       cshao: testing_ack+
oVirt Team: Integration
Type: Bug
Regression: ---
Last Closed: 2019-09-26 19:43:14 UTC

Attachments:
- operate from vm engine
- operate from cockpit UI

Description Wei Wang 2018-12-19 07:53:56 UTC
Created attachment 1515506 [details]
operate from vm engine

Description of problem:
Migration fails when clicking the "Put this host into local maintenance" button for the first time in the cockpit UI.
The host enters local maintenance successfully, but the VM fails to migrate to the other host.

Version-Release number of selected component (if applicable):
RHVH-4.2-20181212.0-RHVH-x86_64-dvd1.iso
cockpit-bridge-176-4.el7.x86_64
cockpit-storaged-176-4.el7.noarch
cockpit-176-4.el7.x86_64
cockpit-machines-ovirt-176-4.el7.noarch
cockpit-ovirt-dashboard-0.11.37-1.el7ev.noarch
cockpit-dashboard-176-4.el7.x86_64
cockpit-system-176-4.el7.noarch
cockpit-ws-176-4.el7.x86_64
ovirt-hosted-engine-setup-2.2.32-1.el7ev.noarch
ovirt-hosted-engine-ha-2.2.19-1.el7ev.noarch
rhvm-appliance-4.2-20181212.0.el7.noarch

How reproducible:
Not 100%

Steps to Reproduce:
1. Clean install RHVH-4.2-20181212.0-RHVH-x86_64-dvd1.iso on two hosts (A and B)
2. Deploy hosted engine on host A
3. Add host B as an additional host to the vm engine
4. Click "Put this host into local maintenance" button on host A cockpit UI
5. Check the migration status
6. Remove this host from maintenance
7. Click the button again, then check migration status

Actual results:
After step 5, migration fails
After step 7, migration is successful

Expected results:
After step 5, migration should be successful

Additional info:
1. Triggering the migration from the vm engine UI also reproduces this issue.
2. Sometimes the host fails to enter local maintenance (about 20% of the time).

Comment 1 Wei Wang 2018-12-19 07:54:50 UTC
Created attachment 1515507 [details]
operate from cockpit UI

Comment 2 Sandro Bonazzola 2019-01-08 09:10:12 UTC
Simone, is this intended behavior?

Comment 3 Wei Wang 2019-01-17 02:48:33 UTC
sbonazzo,
I think this is a common bug, not an RFE. Could you please check it again?
Thanks!

Comment 4 Simone Tiraboschi 2019-01-22 09:23:42 UTC
(In reply to Sandro Bonazzola from comment #2)
> Simone is this intended behavior?

No, I tend to think it's a bug.

Comment 5 Simone Tiraboschi 2019-01-22 11:36:01 UTC
In /var/log/messages we have

 Dec 19 12:11:14 localhost libvirtd: 2018-12-19 04:11:14.108+0000: 31096: error : qemuMigrationJobCheckStatus:1435 : operation failed: migration out job: unexpectedly failed
 Dec 19 12:11:19 localhost journal: ovirt-ha-agent ovirt_hosted_engine_ha.agent.hosted_engine.HostedEngine ERROR Migration failed: {u'status': {u'message': u'Fatal error during migration', u'code': 12}, u'progress': 0}

which is the reported issue.

Now let's look at vdsm.log to see what really happened:

 2018-12-19 12:07:09,524+0800 INFO  (jsonrpc/5) [api.virt] START migrate(params={u'tunneled': False, u'autoConverge': True, u'dstqemu': u'10.73.73.106', u'src': u'localhost', u'enableGuestEvents': True, u'dst': u'10.73.73.106', u'convergenceSchedule': {u'init': [{u'params': [u'100'], u'name': u'setDowntime'}], u'stalling': [{u'action': {u'params': [u'150'], u'name': u'setDowntime'}, u'limit': 1}, {u'action': {u'params': [u'200'], u'name': u'setDowntime'}, u'limit': 2}, {u'action': {u'params': [u'300'], u'name': u'setDowntime'}, u'limit': 3}, {u'action': {u'params': [u'400'], u'name': u'setDowntime'}, u'limit': 4}, {u'action': {u'params': [u'500'], u'name': u'setDowntime'}, u'limit': 6}, {u'action': {u'params': [u'5000'], u'name': u'setDowntime'}, u'limit': -1}, {u'action': {u'params': [], u'name': u'abort'}, u'limit': -1}]}, u'vmId': u'6c11dd49-67a1-4917-be01-2a2f2bac6942', u'abortOnError': True, u'compressed': False, u'method': u'online'}) from=::1,44282, vmId=6c11dd49-67a1-4917-be01-2a2f2bac6942 (api:46)
 2018-12-19 12:07:09,526+0800 INFO  (jsonrpc/5) [api.virt] FINISH migrate return={'status': {'message': 'Migration in progress', 'code': 0}, 'progress': 0} from=::1,44282, vmId=6c11dd49-67a1-4917-be01-2a2f2bac6942 (api:52)

Please notice 'from=::1,44282', which means that the command was sent from the same host, i.e. by ovirt-ha-agent (the user triggered it from cockpit).
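
Not part of the original analysis, but this origin check is easy to repeat with a minimal Python sketch; the vdsm.log path below is the usual default, and the regex simply follows the 'from=' pattern visible in the lines quoted here:

 import re

 # 'START <verb>' API lines carry a 'from=<address>,<port>' field: ::1 means the
 # call came from the same host (ovirt-ha-agent), anything else is the engine.
 FROM_RE = re.compile(r"\[api\.(?:virt|host)\] (START \w+).*?from=(\S+?),")

 with open("/var/log/vdsm/vdsm.log") as log:
     for line in log:
         m = FROM_RE.search(line)
         if not m:
             continue
         call, origin = m.groups()
         if origin == "::1":
             who = "ovirt-ha-agent (local)"
         else:
             who = "engine (%s)" % origin
         print("%s  %-24s %s" % (line[:23], who, call))

On the excerpts above this would tag migrate and getMigrationStatus as local (ovirt-ha-agent) and getAllVmStats and destroy as coming from the engine.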

And so the migration started:
 2018-12-19 12:07:09,955+0800 INFO  (migsrc/6c11dd49) [virt.vm] (vmId='6c11dd49-67a1-4917-be01-2a2f2bac6942') Creation of destination VM took: 0 seconds (migration:473)
 2018-12-19 12:07:09,955+0800 INFO  (migsrc/6c11dd49) [virt.vm] (vmId='6c11dd49-67a1-4917-be01-2a2f2bac6942') starting migration to qemu+tls://10.73.73.106/system with miguri tcp://10.73.73.106 (migration:502)

And ovirt-ha-agent keeps monitoring it: 
 2018-12-19 12:08:39,873+0800 INFO  (jsonrpc/0) [api.virt] START getMigrationStatus() from=::1,44282, vmId=6c11dd49-67a1-4917-be01-2a2f2bac6942 (api:46)
 2018-12-19 12:08:39,873+0800 INFO  (jsonrpc/0) [api.virt] FINISH getMigrationStatus return={'status': {'message': 'Done', 'code': 0}, 'migrationStats': {'status': {'message': 'Migration in progress', 'code': 0}, 'progress': 97L}} from=::1,44282, vmId=6c11dd49-67a1-4917-be01-2a2f2bac6942 (api:52)
 2018-12-19 12:08:39,988+0800 INFO  (migmon/6c11dd49) [virt.vm] (vmId='6c11dd49-67a1-4917-be01-2a2f2bac6942') Migration Progress: 90 seconds elapsed, 99% of data processed, total data: 16444MB, processed data: 4616MB, remaining data: 207MB, transfer speed 52MBps, zero pages: 3197006MB, compressed: 0MB, dirty rate: 4819, memory iteration: 3 (migration:867)

Please notice 'from=::1,44282' in getMigrationStatus as well.

And the migration successfully completed:
 2018-12-19 12:08:48,218+0800 INFO  (migsrc/6c11dd49) [virt.vm] (vmId='6c11dd49-67a1-4917-be01-2a2f2bac6942') migration took 98 seconds to complete (migration:514)
 2018-12-19 12:08:48,218+0800 INFO  (migsrc/6c11dd49) [virt.vm] (vmId='6c11dd49-67a1-4917-be01-2a2f2bac6942') Changed state to Down: Migration succeeded (code=4) (vm:1689)

Now the issue:
at the same time there is also an active engine querying VM status; please notice:
 2018-12-19 12:08:45,791+0800 INFO  (jsonrpc/3) [api.host] START getAllVmStats() from=::ffff:10.73.73.102,60254 (api:46)
 2018-12-19 12:08:45,793+0800 INFO  (jsonrpc/3) [api.host] FINISH getAllVmStats return={'status': {'message': 'Done', 'code': 0}, 'statsList': (suppressed)} from=::ffff:10.73.73.102,60254 (api:52)

Please notice 'from=::ffff:10.73.73.102,60254', which means this is the engine running on a different machine (the engine VM itself).
So maybe the engine notices the engine VM also up on the target host, or something like that, and so it sends a destroy command:

 2018-12-19 12:08:48,227+0800 INFO  (jsonrpc/7) [api.virt] START destroy(gracefulAttempts=1) from=::ffff:10.73.73.102,60254, vmId=6c11dd49-67a1-4917-be01-2a2f2bac6942 (api:46)

Please notice 'from=::ffff:10.73.73.102,60254' (it's the engine) and 'vmId=6c11dd49-67a1-4917-be01-2a2f2bac6942' (it's exactly the engine VM).

In this case the destroy failed since the engine VM completed the migration just in time (the migration completed at 12:08:48,218+0800, the destroy failed at 12:08:48,227+0800):
 2018-12-19 12:08:48,227+0800 INFO  (jsonrpc/7) [virt.vm] (vmId='6c11dd49-67a1-4917-be01-2a2f2bac6942') Release VM resources (vm:5279)
 2018-12-19 12:08:48,227+0800 WARN  (jsonrpc/7) [virt.vm] (vmId='6c11dd49-67a1-4917-be01-2a2f2bac6942') trying to set state to Powering down when already Down (vm:613)
 2018-12-19 12:08:48,227+0800 INFO  (jsonrpc/7) [virt.vm] (vmId='6c11dd49-67a1-4917-be01-2a2f2bac6942') Stopping connection (guestagent:442)
 2018-12-19 12:08:48,227+0800 INFO  (jsonrpc/7) [virt.vm] (vmId='6c11dd49-67a1-4917-be01-2a2f2bac6942') _destroyVmGraceful attempt #0 (vm:5316)
 2018-12-19 12:08:48,228+0800 WARN  (jsonrpc/7) [virt.vm] (vmId='6c11dd49-67a1-4917-be01-2a2f2bac6942') VM '6c11dd49-67a1-4917-be01-2a2f2bac6942' couldn't be destroyed in libvirt: Requested operation is not valid: domain is not running (vm:5337)
 2018-12-19 12:08:48,229+0800 INFO  (jsonrpc/7) [vdsm.api] START teardownImage(sdUUID='7b68bd24-f73e-40a6-9c4a-5db3ecdf102e', spUUID='00000000-0000-0000-0000-000000000000', imgUUID='a0544f7f-2cf9-4c67-9386-df46a750ed74', volUUID=None) from=::ffff:10.73.73.102,60254, task_id=52e9d07a-9679-4302-ac64-1beb94445bc1 (api:46)

but this is probably enough to update the VM exit reason.
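
Just to make the race window explicit, a quick check with the two timestamps quoted above (values copied from this log excerpt) shows they are about 9 ms apart:

 from datetime import datetime

 fmt = "%Y-%m-%d %H:%M:%S,%f"
 migration_done = datetime.strptime("2018-12-19 12:08:48,218", fmt)
 destroy_attempt = datetime.strptime("2018-12-19 12:08:48,227", fmt)
 # prints 9.0: the destroy attempt raced the migration completion by ~9 ms
 print((destroy_attempt - migration_done).total_seconds() * 1000)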

And indeed on the next monitoring round ovirt-ha-agent got an error:
 2018-12-19 12:08:53,761+0800 INFO  (jsonrpc/2) [api.virt] START getStats() from=::1,44280, vmId=6c11dd49-67a1-4917-be01-2a2f2bac6942 (api:46)
 2018-12-19 12:08:53,762+0800 INFO  (jsonrpc/2) [api] FINISH getStats error=Virtual machine does not exist: {'vmId': u'6c11dd49-67a1-4917-be01-2a2f2bac6942'} (api:127)
 2018-12-19 12:08:53,762+0800 INFO  (jsonrpc/2) [api.virt] FINISH getStats return={'status': {'message': "Virtual machine does not exist: {'vmId': u'6c11dd49-67a1-4917-be01-2a2f2bac6942'}", 'code': 1}} from=::1,44280, vmId=6c11dd49-67a1-4917-be01-2a2f2bac6942 (api:52)

and this repeats until
 2018-12-19 12:10:03,884+0800 INFO  (jsonrpc/5) [api.virt] START getStats() from=::1,44280, vmId=6c11dd49-67a1-4917-be01-2a2f2bac6942 (api:46)
 2018-12-19 12:10:03,884+0800 INFO  (jsonrpc/5) [api] FINISH getStats error=Virtual machine does not exist: {'vmId': u'6c11dd49-67a1-4917-be01-2a2f2bac6942'} (api:127)
 2018-12-19 12:10:03,884+0800 INFO  (jsonrpc/5) [api.virt] FINISH getStats return={'status': {'message': "Virtual machine does not exist: {'vmId': u'6c11dd49-67a1-4917-be01-2a2f2bac6942'}", 'code': 1}} from=::1,44280, vmId=6c11dd49-67a1-4917-be01-2a2f2bac6942 (api:52)

Then the engine (::ffff:10.73.73.106,52742) came back up and tried to migrate the engine VM back:

 2018-12-19 12:10:05,225+0800 INFO  (jsonrpc/0) [api.virt] START migrationCreate(params={u'username': u'None', u'acpiEnable': u'true', u'emulatedMachine': u'pc-i440fx-rhel7.5.0', u'guestFQDN': u'rhevh-hostedengine-vm-06.lab.eng.pek2.redhat.com', u'enableGuestEvents': True, u'numOfIoThreads': u'1', u'memGuaranteedSize': 1024, u'timeOffset': u'0', u'cpuType': u'Broadwell', u'smp': u'4', u'migrationDest': u'libvirt', u'guestNumaNodes': [{u'nodeIndex': 0, u'cpus': u'0,1,2,3', u'memory': u'16348'}], u'xml': u'<domain type=\'kvm\' id=\'3\'>\n  <name>HostedEngine</name>\n  <uuid>6c11dd49-67a1-4917-be01-2a2f2bac6942</uuid>\n ... from=::ffff:10.73.73.106,52742, vmId=6c11dd49-67a1-4917-be01-2a2f2bac6942 (api:46)
2018-12-19 12:10:05,226+0800 INFO  (jsonrpc/0) [api.virt] START create(vmParams={u'username': u'None', u'acpiEnable': u'true', u'emulatedMachine': u'pc-i440fx-rhel7.5.0', u'guestFQDN': u'rhevh-hostedengine-vm-06.lab.eng.pek2.redhat.com', u'enableGuestEvents': True, u'numOfIoThreads': u'1', u'memGuaranteedSize': 1024, u'timeOffset': u'0', u'cpuType': u'Broadwell', u'smp': u'4', u'migrationDest': u'libvirt', u'guestNumaNodes': [{u'nodeIndex': 0, u'cpus': u'0,1,2,3', u'memory': u'16348'}], u'xml': u'<domain type=\'kvm\' id=\'3\'>\n  <name>HostedEngine</name>\n  <uuid>6c11dd49-67a1-4917-be01-2a2f2bac6942</uuid>\n ... from=::ffff:10.73.73.106,52742, vmId=6c11dd49-67a1-4917-be01-2a2f2bac6942 (api:46)
2018-12-19 12:10:05,243+0800 INFO  (jsonrpc/0) [api.virt] FINISH create return={'status': {'message': 'Done', 'code': 0}, 'vmList': {u'guestFQDN': u'rhevh-hostedengine-vm-06.lab.eng.pek2.redhat.com', u'acpiEnable': 'true', u'emulatedMachine': 'pc-i440fx-rhel7.5.0', u'enableGuestEvents': True, u'vmId': '6c11dd49-67a1-4917-be01-2a2f2bac6942', u'memGuaranteedSize': 1024, u'timeOffset': '0', u'cpuType': 'Broadwell', u'smp': '4', u'guestNumaNodes': [{'nodeIndex': 0, 'cpus': '0,1,2,3', 'memory': '16348'}], u'xml': u'<domain type=\'kvm\' id=\'3\'>\n  <name>HostedEngine</name>\n  <uuid>6c11dd49-67a1-4917-be01-2a2f2bac6942</uuid>\n  <metadata xmlns:ns0="http://ovirt.org/vm/tune/1.0" xmlns:ovirt-vm="http://ovirt.org/vm/1.0">\n    <ns0:qos/>\n    <ovirt-vm:vm xmlns:ovirt-vm="http://ovirt.org/vm/1.0">\n    <ovirt-vm:clusterVersion>4.2</ovirt-vm:clusterVersion>\n    <ovirt-vm:destroy_on_reboot type="bool">True</ovirt-vm:destroy_on_reboot>\n    <ovirt-vm:guestAgentAPIVersion type="int">3</ovirt-vm:guestAgentAPIVersion>\n    <ovirt-vm:launchPaused>false</ovirt-vm:launchPaused>\n    <ovirt-vm:memGuaranteedSize type="int">1024</ovirt-vm:memGuaranteedSize>\n    <ovirt-vm:minGuaranteedMemoryMb type="int">1024</ovirt-vm:minGuaranteedMemoryMb>\n    <ovirt-vm:resumeBehavior>auto_resume</ovirt-vm:resumeBehavior>\n    <ovirt-vm:startTime type="float">1545190809.52</ovirt-vm:startTime>\n    <ovirt-vm:device mac_address="52:54:00:34:04:b0">\n        <ovirt-vm:network>ovirtmgmt</ovirt-vm:network>\n        <ovirt-vm:custom>\n            <ovirt-vm:queues>4</ovirt-vm:queues>\n        </ovirt-vm:custom>\n        <ovirt-vm:specParams/>\n        <ovirt-vm:vm_custom/>\n    </ovirt-vm:device>\n    <ovirt-vm:device devtype="disk" name="vda">\n        <ovirt-vm:domainID>7b68bd24-f73e-40a6-9c4a-5db3ecdf102e</ovirt-vm:domainID>\n        <ovirt-vm:guestName>/dev/vda</ovirt-vm:guestName>\n        <ovirt-vm:imageID>a0544f7f-2cf9-4c67-9386-df46a750ed74</ovirt-vm:imageID>\n        <ovirt-vm:poolID>00000000-0000-0000-0000-000000000000</ovirt-vm:poolID>\n        <ovirt-vm:shared>exclusive</ovirt-vm:shared>\n        <ovirt-vm:volumeID>503d88b0-2253-4567-9f90-ea8987da5d4b</ovirt-vm:volumeID>\n        <ovirt-vm:specParams>\n            <ovirt-vm:pinToIoThread>1</ovirt-vm:pinToIoThread>\n        </ovirt-vm:specParams>\n        <ovirt-vm:vm_custom/>\n        <ovirt-vm:volumeChain>\n            <ovirt-vm:volumeChainNode>\n                <ovirt-vm:domainID>7b68bd24-f73e-40a6-9c4a-5db3ecdf102e</ovirt-vm:domainID>\n                <ovirt-vm:imageID>a0544f7f-2cf9-4c67-9386-df46a750ed74</ovirt-vm:imageID>\n                <ovirt-vm:leaseOffset type="int">0</ovirt-vm:leaseOffset>\n                <ovirt-vm:leasePath>/rhev/data-center/mnt/10.66.8.174:_home_wangwei_nfs/7b68bd24-f73e-40a6-9c4a-5db3ecdf102e/images/a0544f7f-2cf9-4c67-9386-df46a750ed74/503d88b0-2253-4567-9f90-ea8987da5d4b.lease</ovirt-vm:leasePath>\n                <ovirt-vm:path>/rhev/data-center/mnt/10.66.8.174:_home_wangwei_nfs/7b68bd24-f73e-40a6-9c4a-5db3ecdf102e/images/a0544f7f-2cf9-4c67-9386-df46a750ed74/503d88b0-2253-4567-9f90-ea8987da5d4b</ovirt-vm:path>\n                <ovirt-vm:volumeID>503d88b0-2253-4567-9f90-ea8987da5d4b</ovirt-vm:volumeID>\n            </ovirt-vm:volumeChainNode>\n        </ovirt-vm:volumeChain>\n    </ovirt-vm:device>\n    <ovirt-vm:device devtype="disk" name="hdc">\n        <ovirt-vm:specParams/>\n        <ovirt-vm:vm_custom/>\n    </ovirt-vm:device>\n</ovirt-vm:vm>\n  </metadata>\n  <maxMemory slots=\'16\' unit=\'KiB\'>66961408</maxMemory>\n  <memory 
unit=\'KiB\'>16740352</memory>\n  <currentMemory unit=\'KiB\'>16740352</currentMemory>\n  <vcpu placement=\'static\' current=\'4\'>64</vcpu>\n  <iothreads>1</iothreads>\n  <resource>\n    <partition>/machine</partition>\n  </resource>\n  <sysinfo type=\'smbios\'>\n    <system>\n      <entry name=\'manufacturer\'>Red Hat</entry>\n      <entry name=\'product\'>RHEV Hypervisor</entry>\n      <entry name=\'version\'>7.6-8.0.el7</entry>\n      <entry name=\'serial\'>30373238-3730-4e43-3737-31353042384c</entry>\n      <entry name=\'uuid\'>6c11dd49-67a1-4917-be01-2a2f2bac6942</entry>\n    </system>\n  </sysinfo>\n  <os>\n    <type arch=\'x86_64\' machine=\'pc-i440fx-rhel7.5.0\'>hvm</type>\n    <boot dev=\'hd\'/>\n    <smbios mode=\'sysinfo\'/>\n  </os>\n  <features>\n    <acpi/>\n    <vmcoreinfo state=\'on\'/>\n  </features>\n  <cpu mode=\'custom\' match=\'exact\' check=\'full\'>\n    <model fallback=\'forbid\'>Broadwell</model>\n    <topology sockets=\'16\' cores=\'4\' threads=\'1\'/>\n    <feature policy=\'require\' name=\'vme\'/>\n    <feature policy=\'require\' name=\'f16c\'/>\n    <feature policy=\'require\' name=\'rdrand\'/>\n    <feature policy=\'require\' name=\'hypervisor\'/>\n    <feature policy=\'require\' name=\'arat\'/>\n    <feature policy=\'require\' name=\'xsaveopt\'/>\n    <feature policy=\'require\' name=\'abm\'/>\n    <numa>\n      <cell id=\'0\' cpus=\'0-3\' memory=\'16740352\' unit=\'KiB\'/>\n    </numa>\n  </cpu>\n  <clock offset=\'variable\' adjustment=\'0\' basis=\'utc\'>\n    <timer name=\'rtc\' tickpolicy=\'catchup\'/>\n    <timer name=\'pit\' tickpolicy=\'delay\'/>\n    <timer name=\'hpet\' present=\'no\'/>\n  </clock>\n  <on_poweroff>destroy</on_poweroff>\n  <on_reboot>destroy</on_reboot>\n  <on_crash>destroy</on_crash>\n  <pm>\n    <suspend-to-mem enabled=\'no\'/>\n    <suspend-to-disk enabled=\'no\'/>\n  </pm>\n  <devices>\n    <emulator>/usr/libexec/qemu-kvm</emulator>\n    <disk type=\'file\' device=\'cdrom\'>\n      <driver name=\'qemu\' error_policy=\'report\'/>\n      <source startupPolicy=\'optional\'/>\n      <target dev=\'hdc\' bus=\'ide\'/>\n      <readonly/>\n      <alias name=\'ua-93ca26a0-224c-49bc-aedb-075164e7129b\'/>\n      <address type=\'drive\' controller=\'0\' bus=\'1\' target=\'0\' unit=\'0\'/>\n    </disk>\n    <disk type=\'file\' device=\'disk\' snapshot=\'no\'>\n      <driver name=\'qemu\' type=\'raw\' cache=\'none\' error_policy=\'stop\' io=\'threads\' iothread=\'1\'/>\n      <source file=\'/var/run/vdsm/storage/7b68bd24-f73e-40a6-9c4a-5db3ecdf102e/a0544f7f-2cf9-4c67-9386-df46a750ed74/503d88b0-2253-4567-9f90-ea8987da5d4b\'/>\n      <backingStore/>\n      <target dev=\'vda\' bus=\'virtio\'/>\n      <serial>a0544f7f-2cf9-4c67-9386-df46a750ed74</serial>\n      <alias name=\'ua-a0544f7f-2cf9-4c67-9386-df46a750ed74\'/>\n      <address type=\'pci\' domain=\'0x0000\' bus=\'0x00\' slot=\'0x07\' function=\'0x0\'/>\n    </disk>\n    <controller type=\'virtio-serial\' index=\'0\' ports=\'16\'>\n      <alias name=\'ua-21005fec-1ab4-4bd8-b546-258b75db337d\'/>\n      <address type=\'pci\' domain=\'0x0000\' bus=\'0x00\' slot=\'0x05\' function=\'0x0\'/>\n    </controller>\n    <controller type=\'scsi\' index=\'0\' model=\'virtio-scsi\'>\n      <driver iothread=\'1\'/>\n      <alias name=\'ua-2238c2d8-9078-4d1d-a35c-b8c07be921c7\'/>\n      <address type=\'pci\' domain=\'0x0000\' bus=\'0x00\' slot=\'0x06\' function=\'0x0\'/>\n    </controller>\n    <controller type=\'ide\' index=\'0\'>\n      <alias name=\'ide\'/>\n      <address type=\'pci\' domain=\'0x0000\' 
bus=\'0x00\' slot=\'0x01\' function=\'0x1\'/>\n    </controller>\n    <controller type=\'usb\' index=\'0\' model=\'piix3-uhci\'>\n      <alias name=\'usb\'/>\n      <address type=\'pci\' domain=\'0x0000\' bus=\'0x00\' slot=\'0x01\' function=\'0x2\'/>\n    </controller>\n    <controller type=\'pci\' index=\'0\' model=\'pci-root\'>\n      <alias name=\'pci.0\'/>\n    </controller>\n    <lease>\n      <lockspace>7b68bd24-f73e-40a6-9c4a-5db3ecdf102e</lockspace>\n      <key>503d88b0-2253-4567-9f90-ea8987da5d4b</key>\n      <target path=\'/rhev/data-center/mnt/10.66.8.174:_home_wangwei_nfs/7b68bd24-f73e-40a6-9c4a-5db3ecdf102e/images/a0544f7f-2cf9-4c67-9386-df46a750ed74/503d88b0-2253-4567-9f90-ea8987da5d4b.lease\'/>\n    </lease>\n    <interface type=\'bridge\'>\n      <mac address=\'52:54:00:34:04:b0\'/>\n      <source bridge=\'ovirtmgmt\'/>\n      <target dev=\'vnet0\'/>\n      <model type=\'virtio\'/>\n      <driver name=\'vhost\' queues=\'4\'/>\n      <filterref filter=\'vdsm-no-mac-spoofing\'/>\n      <link state=\'up\'/>\n      <mtu size=\'1500\'/>\n      <alias name=\'ua-1742b18c-41b7-421e-a898-a09e758587e1\'/>\n      <address type=\'pci\' domain=\'0x0000\' bus=\'0x00\' slot=\'0x03\' function=\'0x0\'/>\n    </interface>\n    <channel type=\'unix\'>\n      <source mode=\'bind\' path=\'/var/lib/libvirt/qemu/channels/6c11dd49-67a1-4917-be01-2a2f2bac6942.ovirt-guest-agent.0\'/>\n      <target type=\'virtio\' name=\'ovirt-guest-agent.0\' state=\'connected\'/>\n      <alias name=\'channel0\'/>\n      <address type=\'virtio-serial\' controller=\'0\' bus=\'0\' port=\'1\'/>\n    </channel>\n    <channel type=\'unix\'>\n      <source mode=\'bind\' path=\'/var/lib/libvirt/qemu/channels/6c11dd49-67a1-4917-be01-2a2f2bac6942.org.qemu.guest_agent.0\'/>\n      <target type=\'virtio\' name=\'org.qemu.guest_agent.0\' state=\'connected\'/>\n      <alias name=\'channel1\'/>\n      <address type=\'virtio-serial\' controller=\'0\' bus=\'0\' port=\'2\'/>\n    </channel>\n    <channel type=\'spicevmc\'>\n      <target type=\'virtio\' name=\'com.redhat.spice.0\' state=\'disconnected\'/>\n      <alias name=\'channel2\'/>\n      <address type=\'virtio-serial\' controller=\'0\' bus=\'0\' port=\'3\'/>\n    </channel>\n    <channel type=\'unix\'>\n      <source mode=\'bind\' path=\'/var/lib/libvirt/qemu/channels/6c11dd49-67a1-4917-be01-2a2f2bac6942.org.ovirt.hosted-engine-setup.0\'/>\n      <target type=\'virtio\' name=\'org.ovirt.hosted-engine-setup.0\' state=\'disconnected\'/>\n      <alias name=\'channel3\'/>\n      <address type=\'virtio-serial\' controller=\'0\' bus=\'0\' port=\'4\'/>\n    </channel>\n    <input type=\'mouse\' bus=\'ps2\'>\n      <alias name=\'input0\'/>\n    </input>\n    <input type=\'keyboard\' bus=\'ps2\'>\n      <alias name=\'input1\'/>\n    </input>\n    <graphics type=\'vnc\' port=\'5900\' autoport=\'yes\' listen=\'10.73.73.106\' keymap=\'en-us\' passwdValidTo=\'1970-01-01T00:00:01\'>\n      <listen type=\'network\' address=\'10.73.73.106\' network=\'vdsm-ovirtmgmt\'/>\n    </graphics>\n    <graphics type=\'spice\' port=\'5901\' tlsPort=\'5902\' autoport=\'yes\' listen=\'10.73.73.106\' passwdValidTo=\'1970-01-01T00:00:01\'>\n      <listen type=\'network\' address=\'10.73.73.106\' network=\'vdsm-ovirtmgmt\'/>\n      <channel name=\'main\' mode=\'secure\'/>\n      <channel name=\'display\' mode=\'secure\'/>\n      <channel name=\'inputs\' mode=\'secure\'/>\n      <channel name=\'cursor\' mode=\'secure\'/>\n      <channel name=\'playback\' mode=\'secure\'/>\n      <channel name=\'record\' 
mode=\'secure\'/>\n      <channel name=\'smartcard\' mode=\'secure\'/>\n      <channel name=\'usbredir\' mode=\'secure\'/>\n    </graphics>\n    <sound model=\'ich6\'>\n      <alias name=\'ua-b31e3318-8378-47d0-ac8e-1de601b9e069\'/>\n      <address type=\'pci\' domain=\'0x0000\' bus=\'0x00\' slot=\'0x04\' function=\'0x0\'/>\n    </sound>\n    <video>\n      <model type=\'qxl\' ram=\'65536\' vram=\'32768\' vgamem=\'16384\' heads=\'1\' primary=\'yes\'/>\n      <alias name=\'ua-2f0f0a2f-0eec-45c7-8e58-0886fb06c747\'/>\n      <address type=\'pci\' domain=\'0x0000\' bus=\'0x00\' slot=\'0x02\' function=\'0x0\'/>\n    </video>\n    <memballoon model=\'virtio\'>\n      <stats period=\'5\'/>\n      <alias name=\'ua-23e814e2-af58-43c6-8413-3cc513f8842a\'/>\n      <address type=\'pci\' domain=\'0x0000\' bus=\'0x00\' slot=\'0x08\' function=\'0x0\'/>\n    </memballoon>\n    <rng model=\'virtio\'>\n      <backend model=\'random\'>/dev/urandom</backend>\n      <alias name=\'ua-09ed90c4-a4c5-4f7e-98e9-3eb7677c6e92\'/>\n      <address type=\'pci\' domain=\'0x0000\' bus=\'0x00\' slot=\'0x09\' function=\'0x0\'/>\n    </rng>\n  </devices>\n  <seclabel type=\'dynamic\' model=\'selinux\' relabel=\'yes\'>\n    <label>system_u:system_r:svirt_t:s0:c69,c782</label>\n    <imagelabel>system_u:object_r:svirt_image_t:s0:c69,c782</imagelabel>\n  </seclabel>\n  <seclabel type=\'dynamic\' model=\'dac\' relabel=\'yes\'>\n    <label>+107:+107</label>\n    <imagelabel>+107:+107</imagelabel>\n  </seclabel>\n</domain>\n', u'custom': {}, u'smpCoresPerSocket': '4', u'vmName': 'HostedEngine', u'username': u'None', u'maxMemSize': 65392, u'bootMenuEnable': 'false', u'numOfIoThreads': '1', u'smpThreadsPerCore': '1', u'guestDiskMapping': {}, u'arch': 'x86_64', u'guestIPs': u'10.73.73.102', u'kvmEnable': 'true', 'devices': [{u'index': 2, u'iface': u'ide', u'name': u'hdc', u'vm_custom': {}, u'format': u'raw', u'vmid': u'6c11dd49-67a1-4917-be01-2a2f2bac6942', u'diskType': u'file', u'alias': u'ua-93ca26a0-224c-49bc-aedb-075164e7129b', u'readonly': u'True', u'specParams': {}, u'address': {u'bus': u'1', u'controller': u'0', u'type': u'drive', u'target': u'0', u'unit': u'0'}, u'device': u'cdrom', u'discard': False, u'path': u'', u'propagateErrors': u'report', u'type': u'disk'}, {u'address': {u'slot': u'0x07', u'bus': u'0x00', u'domain': u'0x0000', u'type': u'pci', u'function': u'0x0'}, u'reqsize': u'0', u'serial': u'a0544f7f-2cf9-4c67-9386-df46a750ed74', u'index': 0, u'iface': u'virtio', u'apparentsize': u'62277025792', u'specParams': {u'pinToIoThread': u'1'}, u'guestName': u'/dev/vda', u'cache': u'none', u'imageID': u'a0544f7f-2cf9-4c67-9386-df46a750ed74', u'readonly': u'False', u'shared': u'exclusive', u'truesize': u'3757756416', u'type': u'disk', u'domainID': u'7b68bd24-f73e-40a6-9c4a-5db3ecdf102e', u'volumeInfo': {u'path': u'/rhev/data-center/mnt/10.66.8.174:_home_wangwei_nfs/7b68bd24-f73e-40a6-9c4a-5db3ecdf102e/images/a0544f7f-2cf9-4c67-9386-df46a750ed74/503d88b0-2253-4567-9f90-ea8987da5d4b', u'type': u'file'}, u'format': u'raw', u'poolID': u'00000000-0000-0000-0000-000000000000', u'device': u'disk', u'path': u'/var/run/vdsm/storage/7b68bd24-f73e-40a6-9c4a-5db3ecdf102e/a0544f7f-2cf9-4c67-9386-df46a750ed74/503d88b0-2253-4567-9f90-ea8987da5d4b', u'propagateErrors': u'off', u'name': u'vda', u'vm_custom': {}, u'vmid': u'6c11dd49-67a1-4917-be01-2a2f2bac6942', u'volumeID': u'503d88b0-2253-4567-9f90-ea8987da5d4b', u'diskType': u'file', u'alias': u'ua-a0544f7f-2cf9-4c67-9386-df46a750ed74', u'discard': False, u'volumeChain': [{u'domainID': 
u'7b68bd24-f73e-40a6-9c4a-5db3ecdf102e', u'leaseOffset': 0, u'volumeID': u'503d88b0-2253-4567-9f90-ea8987da5d4b', u'leasePath': u'/rhev/data-center/mnt/10.66.8.174:_home_wangwei_nfs/7b68bd24-f73e-40a6-9c4a-5db3ecdf102e/images/a0544f7f-2cf9-4c67-9386-df46a750ed74/503d88b0-2253-4567-9f90-ea8987da5d4b.lease', u'imageID': u'a0544f7f-2cf9-4c67-9386-df46a750ed74', u'path': u'/rhev/data-center/mnt/10.66.8.174:_home_wangwei_nfs/7b68bd24-f73e-40a6-9c4a-5db3ecdf102e/images/a0544f7f-2cf9-4c67-9386-df46a750ed74/503d88b0-2253-4567-9f90-ea8987da5d4b'}]}], u'status': 'Migration Destination', u'maxVCpus': '64', 'clientIp': u'', u'statusTime': '4298529960', u'maxMemSlots': 16}} from=::ffff:10.73.73.106,52742, vmId=6c11dd49-67a1-4917-be01-2a2f2bac6942 (api:52)
 2018-12-19 12:10:05,243+0800 INFO  (vm/6c11dd49) [virt.vm] (vmId='6c11dd49-67a1-4917-be01-2a2f2bac6942') VM wrapper has started (vm:2785)

but with
 2018-12-19 12:10:05,283+0800 ERROR (jsonrpc/0) [virt.vm] (vmId='6c11dd49-67a1-4917-be01-2a2f2bac6942') Alias not found for device type lease during migration at destination host (vm:5674)

which was ignored,
and so VDSM started preparing the storage and so on, until:

 2018-12-19 12:10:49,926+0800 WARN  (vm/6c11dd49) [virt.vm] (vmId='6c11dd49-67a1-4917-be01-2a2f2bac6942') Cannot find device alias for _conf:{'lease_id': '503d88b0-2253-4567-9f90-ea8987da5d4b', 'sd_id': '7b68bd24-f73e-40a6-9c4a-5db3ecdf102e', 'offset': '', 'device': 'lease', 'path': '/rhev/data-center/mnt/10.66.8.174:_home_wangwei_nfs/7b68bd24-f73e-40a6-9c4a-5db3ecdf102e/images/a0544f7f-2cf9-4c67-9386-df46a750ed74/503d88b0-2253-4567-9f90-ea8987da5d4b.lease', 'type': 'lease'} _deviceXML:None alias:None config:<bound method Device.config of <lease.Device sd_id=7b68bd24-f73e-40a6-9c4a-5db3ecdf102e, lease_id=503d88b0-2253-4567-9f90-ea8987da5d4b, path=/rhev/data-center/mnt/10.66.8.174:_home_wangwei_nfs/7b68bd24-f73e-40a6-9c4a-5db3ecdf102e/images/a0544f7f-2cf9-4c67-9386-df46a750ed74/503d88b0-2253-4567-9f90-ea8987da5d4b.lease, offset= at 0x7f81f02129f0>> createXmlElem:<bound method Device.createXmlElem of <lease.Device sd_id=7b68bd24-f73e-40a6-9c4a-5db3ecdf102e, lease_id=503d88b0-2253-4567-9f90-ea8987da5d4b, path=/rhev/data-center/mnt/10.66.8.174:_home_wangwei_nfs/7b68bd24-f73e-40a6-9c4a-5db3ecdf102e/images/a0544f7f-2cf9-4c67-9386-df46a750ed74/503d88b0-2253-4567-9f90-ea8987da5d4b.lease, offset= at 0x7f81f02129f0>> custom:{} device:lease deviceId:None deviceType:None from_xml_tree:<bound method type.from_xml_tree of <class 'vdsm.virt.vmdevices.lease.Device'>> getXML:<bound method Device.getXML of <lease.Device sd_id=7b68bd24-f73e-40a6-9c4a-5db3ecdf102e, lease_id=503d88b0-2253-4567-9f90-ea8987da5d4b, path=/rhev/data-center/mnt/10.66.8.174:_home_wangwei_nfs/7b68bd24-f73e-40a6-9c4a-5db3ecdf102e/images/a0544f7f-2cf9-4c67-9386-df46a750ed74/503d88b0-2253-4567-9f90-ea8987da5d4b.lease, offset= at 0x7f81f02129f0>> get_extra_xmls:<bound method Device.get_extra_xmls of <lease.Device sd_id=7b68bd24-f73e-40a6-9c4a-5db3ecdf102e, lease_id=503d88b0-2253-4567-9f90-ea8987da5d4b, path=/rhev/data-center/mnt/10.66.8.174:_home_wangwei_nfs/7b68bd24-f73e-40a6-9c4a-5db3ecdf102e/images/a0544f7f-2cf9-4c67-9386-df46a750ed74/503d88b0-2253-4567-9f90-ea8987da5d4b.lease, offset= at 0x7f81f02129f0>> get_identifying_attrs:<bound method type.get_identifying_attrs of <class 'vdsm.virt.vmdevices.lease.Device'>> get_metadata:<bound method Device.get_metadata of <lease.Device sd_id=7b68bd24-f73e-40a6-9c4a-5db3ecdf102e, lease_id=503d88b0-2253-4567-9f90-ea8987da5d4b, path=/rhev/data-center/mnt/10.66.8.174:_home_wangwei_nfs/7b68bd24-f73e-40a6-9c4a-5db3ecdf102e/images/a0544f7f-2cf9-4c67-9386-df46a750ed74/503d88b0-2253-4567-9f90-ea8987da5d4b.lease, offset= at 0x7f81f02129f0>> is_attached_to:<bound method Device.is_attached_to of <lease.Device sd_id=7b68bd24-f73e-40a6-9c4a-5db3ecdf102e, lease_id=503d88b0-2253-4567-9f90-ea8987da5d4b, path=/rhev/data-center/mnt/10.66.8.174:_home_wangwei_nfs/7b68bd24-f73e-40a6-9c4a-5db3ecdf102e/images/a0544f7f-2cf9-4c67-9386-df46a750ed74/503d88b0-2253-4567-9f90-ea8987da5d4b.lease, offset= at 0x7f81f02129f0>> is_hostdevice:False lease_id:503d88b0-2253-4567-9f90-ea8987da5d4b log:<vdsm.common.logutils.SimpleLogAdapter object at 0x7f81f0074110> offset: path:/rhev/data-center/mnt/10.66.8.174:_home_wangwei_nfs/7b68bd24-f73e-40a6-9c4a-5db3ecdf102e/images/a0544f7f-2cf9-4c67-9386-df46a750ed74/503d88b0-2253-4567-9f90-ea8987da5d4b.lease sd_id:7b68bd24-f73e-40a6-9c4a-5db3ecdf102e setup:<bound method Device.setup of <lease.Device sd_id=7b68bd24-f73e-40a6-9c4a-5db3ecdf102e, lease_id=503d88b0-2253-4567-9f90-ea8987da5d4b, 
path=/rhev/data-center/mnt/10.66.8.174:_home_wangwei_nfs/7b68bd24-f73e-40a6-9c4a-5db3ecdf102e/images/a0544f7f-2cf9-4c67-9386-df46a750ed74/503d88b0-2253-4567-9f90-ea8987da5d4b.lease, offset= at 0x7f81f02129f0>> specParams:{} teardown:<bound method Device.teardown of <lease.Device sd_id=7b68bd24-f73e-40a6-9c4a-5db3ecdf102e, lease_id=503d88b0-2253-4567-9f90-ea8987da5d4b, path=/rhev/data-center/mnt/10.66.8.174:_home_wangwei_nfs/7b68bd24-f73e-40a6-9c4a-5db3ecdf102e/images/a0544f7f-2cf9-4c67-9386-df46a750ed74/503d88b0-2253-4567-9f90-ea8987da5d4b.lease, offset= at 0x7f81f02129f0>> type:lease update_device_info:<bound method type.update_device_info of <class 'vdsm.virt.vmdevices.lease.Device'>> vmid:None (core:957)

which was also ignored.

At the same time ovirt-ha-agent (::1,44282) is still active:
 2018-12-19 12:11:10,550+0800 INFO  (jsonrpc/0) [api.virt] START migrate(params={u'tunneled': False, u'autoConverge': True, u'dstqemu': u'10.73.73.106', u'src': u'localhost', u'enableGuestEvents': True, u'dst': u'10.73.73.106', u'convergenceSchedule': {u'init': [{u'params': [u'100'], u'name': u'setDowntime'}], u'stalling': [{u'action': {u'params': [u'150'], u'name': u'setDowntime'}, u'limit': 1}, {u'action': {u'params': [u'200'], u'name': u'setDowntime'}, u'limit': 2}, {u'action': {u'params': [u'300'], u'name': u'setDowntime'}, u'limit': 3}, {u'action': {u'params': [u'400'], u'name': u'setDowntime'}, u'limit': 4}, {u'action': {u'params': [u'500'], u'name': u'setDowntime'}, u'limit': 6}, {u'action': {u'params': [u'5000'], u'name': u'setDowntime'}, u'limit': -1}, {u'action': {u'params': [], u'name': u'abort'}, u'limit': -1}]}, u'vmId': u'6c11dd49-67a1-4917-be01-2a2f2bac6942', u'abortOnError': True, u'compressed': False, u'method': u'online'}) from=::1,44282, vmId=6c11dd49-67a1-4917-be01-2a2f2bac6942 (api:46)
 2018-12-19 12:11:10,552+0800 INFO  (jsonrpc/0) [api.virt] FINISH migrate return={'status': {'message': 'Migration in progress', 'code': 0}, 'progress': 0} from=::1,44282, vmId=6c11dd49-67a1-4917-be01-2a2f2bac6942 (api:52)
 2018-12-19 12:11:10,552+0800 INFO  (jsonrpc/0) [jsonrpc.JsonRpcServer] RPC call VM.migrate succeeded in 0.00 seconds (__init__:573)
 2018-12-19 12:11:10,590+0800 INFO  (jsonrpc/1) [jsonrpc.JsonRpcServer] RPC call Host.ping2 succeeded in 0.00 seconds (__init__:573)
 2018-12-19 12:11:10,987+0800 INFO  (migsrc/6c11dd49) [virt.vm] (vmId='6c11dd49-67a1-4917-be01-2a2f2bac6942') Creation of destination VM took: 0 seconds (migration:473)
 2018-12-19 12:11:10,988+0800 INFO  (migsrc/6c11dd49) [virt.vm] (vmId='6c11dd49-67a1-4917-be01-2a2f2bac6942') starting migration to qemu+tls://10.73.73.106/system with miguri tcp://10.73.73.106 (migration:502)
 2018-12-19 12:11:12,302+0800 INFO  (jsonrpc/4) [vdsm.api] START getSpmStatus(spUUID=u'febfea04-033c-11e9-9c65-5254003404b0', options=None) from=::ffff:10.73.73.102,60254, task_id=5ae98c04-1bcd-40df-8a59-ffabf23d17fa (api:46)

and it tries to migrate the engine VM to tcp://10.73.73.106 again,

but this time it fails with:
 2018-12-19 12:11:14,225+0800 ERROR (migsrc/6c11dd49) [virt.vm] (vmId='6c11dd49-67a1-4917-be01-2a2f2bac6942') Domain not found: no domain with matching name 'HostedEngine' (migration:290)
 2018-12-19 12:11:14,230+0800 ERROR (migsrc/6c11dd49) [virt.vm] (vmId='6c11dd49-67a1-4917-be01-2a2f2bac6942') Failed to migrate (migration:455)
 Traceback (most recent call last):
   File "/usr/lib/python2.7/site-packages/vdsm/virt/migration.py", line 437, in _regular_run
     self._startUnderlyingMigration(time.time())
   File "/usr/lib/python2.7/site-packages/vdsm/virt/migration.py", line 509, in _startUnderlyingMigration
     self._perform_with_conv_schedule(duri, muri)
   File "/usr/lib/python2.7/site-packages/vdsm/virt/migration.py", line 587, in _perform_with_conv_schedule
     self._perform_migration(duri, muri)
   File "/usr/lib/python2.7/site-packages/vdsm/virt/migration.py", line 529, in _perform_migration
     self._migration_flags)
   File "/usr/lib/python2.7/site-packages/vdsm/virt/virdomain.py", line 98, in f
     ret = attr(*args, **kwargs)
   File "/usr/lib/python2.7/site-packages/vdsm/common/libvirtconnection.py", line 130, in wrapper
     ret = f(*args, **kwargs)
   File "/usr/lib/python2.7/site-packages/vdsm/common/function.py", line 92, in wrapper
     return func(inst, *args, **kwargs)
   File "/usr/lib64/python2.7/site-packages/libvirt.py", line 1779, in migrateToURI3
     if ret == -1: raise libvirtError ('virDomainMigrateToURI3() failed', dom=self)
 libvirtError: Domain not found: no domain with matching name 'HostedEngine'

which leads to the error already seen in /var/log/messages:
 Dec 19 12:11:14 localhost libvirtd: 2018-12-19 04:11:14.108+0000: 31096: error : qemuMigrationJobCheckStatus:1435 : operation failed: migration out job: unexpectedly failed


so:
1. ovirt-ha-agent started the migration
2. the migration completed successfully
3. the engine saw the engine VM on the second host and tried to kill the engine VM on the first one
4. ovirt-ha-agent was told that the engine VM had disappeared, so it successfully went into local maintenance (which is not reported back to the engine)
5. the engine tried to migrate the engine VM back, but this failed due to an issue with the volume lease
6. ovirt-ha-agent then noticed a new engine VM appearing locally
7. ovirt-ha-agent tried to migrate it away again, since it's supposed to be in local maintenance
8. but ovirt-ha-agent failed to trigger the migration since the engine VM never really came back

Comment 6 Simone Tiraboschi 2019-01-22 12:47:51 UTC
agent.log confirms the behaviour described in comment 5:

host 1 was in local maintenance and the VM was on host 2:

 MainThread::INFO::2018-12-19 12:09:10,444::state_decorators::128::ovirt_hosted_engine_ha.agent.hosted_engine.HostedEngine::(check) Local maintenance detected
 MainThread::INFO::2018-12-19 12:09:10,572::hosted_engine::613::ovirt_hosted_engine_ha.agent.hosted_engine.HostedEngine::(_stop_domain_monitor) Stopped VDSM domain monitor
 MainThread::INFO::2018-12-19 12:09:10,572::hosted_engine::491::ovirt_hosted_engine_ha.agent.hosted_engine.HostedEngine::(_monitoring_loop) Current state LocalMaintenance (score: 0)
 MainThread::INFO::2018-12-19 12:09:19,592::state_decorators::128::ovirt_hosted_engine_ha.agent.hosted_engine.HostedEngine::(check) Local maintenance detected
 MainThread::INFO::2018-12-19 12:09:19,727::hosted_engine::613::ovirt_hosted_engine_ha.agent.hosted_engine.HostedEngine::(_stop_domain_monitor) Stopped VDSM domain monitor
 MainThread::INFO::2018-12-19 12:09:19,727::hosted_engine::491::ovirt_hosted_engine_ha.agent.hosted_engine.HostedEngine::(_monitoring_loop) Current state LocalMaintenance (score: 0)

Then it detected an anomalous state (the engine trying to migrate back the VM while in maintenance?) and so it went into ReinitializeFSM:

 MainThread::INFO::2018-12-19 12:09:29,775::brokerlink::68::ovirt_hosted_engine_ha.lib.brokerlink.BrokerLink::(notify) Success, was notification of state_transition (LocalMaintenance-ReinitializeFSM) sent? sent
 MainThread::INFO::2018-12-19 12:09:29,827::ovf_store::151::ovirt_hosted_engine_ha.lib.ovf.ovf_store.OVFStore::(getEngineVMOVF) OVF_STORE volume path: /var/run/vdsm/storage/7b68bd24-f73e-40a6-9c4a-5db3ecdf102e/6469ecb5-0c22-4789-be72-0b16187143a3/dfd76e75-1a6d-42fe-8531-571605c76d6b 
 MainThread::INFO::2018-12-19 12:09:29,888::hosted_engine::491::ovirt_hosted_engine_ha.agent.hosted_engine.HostedEngine::(_monitoring_loop) Current state ReinitializeFSM (score: 0)
 MainThread::INFO::2018-12-19 12:09:39,934::brokerlink::68::ovirt_hosted_engine_ha.lib.brokerlink.BrokerLink::(notify) Success, was notification of state_transition (ReinitializeFSM-EngineDown) sent? sent
 MainThread::INFO::2018-12-19 12:09:40,058::hosted_engine::491::ovirt_hosted_engine_ha.agent.hosted_engine.HostedEngine::(_monitoring_loop) Current state EngineDown (score: 3400)
 MainThread::INFO::2018-12-19 12:09:40,058::hosted_engine::499::ovirt_hosted_engine_ha.agent.hosted_engine.HostedEngine::(_monitoring_loop) Best remote host 10.73.73.106 (id: 2, score: 3400)
 MainThread::INFO::2018-12-19 12:09:50,077::states::486::ovirt_hosted_engine_ha.agent.hosted_engine.HostedEngine::(consume) Engine vm is running on host 10.73.73.106 (id 2)
 MainThread::INFO::2018-12-19 12:09:50,208::hosted_engine::491::ovirt_hosted_engine_ha.agent.hosted_engine.HostedEngine::(_monitoring_loop) Current state EngineDown (score: 3400)
 MainThread::INFO::2018-12-19 12:10:00,351::hosted_engine::491::ovirt_hosted_engine_ha.agent.hosted_engine.HostedEngine::(_monitoring_loop) Current state EngineDown (score: 3400)
 MainThread::INFO::2018-12-19 12:10:10,510::hosted_engine::491::ovirt_hosted_engine_ha.agent.hosted_engine.HostedEngine::(_monitoring_loop) Current state EngineDown (score: 3400)
 MainThread::INFO::2018-12-19 12:10:20,671::hosted_engine::491::ovirt_hosted_engine_ha.agent.hosted_engine.HostedEngine::(_monitoring_loop) Current state EngineDown (score: 3400)
 MainThread::INFO::2018-12-19 12:10:29,838::hosted_engine::491::ovirt_hosted_engine_ha.agent.hosted_engine.HostedEngine::(_monitoring_loop) Current state EngineDown (score: 3400)
 MainThread::INFO::2018-12-19 12:10:39,998::hosted_engine::491::ovirt_hosted_engine_ha.agent.hosted_engine.HostedEngine::(_monitoring_loop) Current state EngineDown (score: 3400)
 MainThread::INFO::2018-12-19 12:10:50,145::hosted_engine::491::ovirt_hosted_engine_ha.agent.hosted_engine.HostedEngine::(_monitoring_loop) Current state EngineDown (score: 3400)

and then the engine VM appeared, so it moved to EngineUp:

 MainThread::INFO::2018-12-19 12:11:00,164::states::473::ovirt_hosted_engine_ha.agent.hosted_engine.HostedEngine::(consume) Engine vm unexpectedly running locally, monitoring vm
 MainThread::INFO::2018-12-19 12:11:00,202::brokerlink::68::ovirt_hosted_engine_ha.lib.brokerlink.BrokerLink::(notify) Success, was notification of state_transition (EngineDown-EngineStarting) sent? sent
 MainThread::INFO::2018-12-19 12:11:00,330::hosted_engine::491::ovirt_hosted_engine_ha.agent.hosted_engine.HostedEngine::(_monitoring_loop) Current state EngineStarting (score: 3400)
 MainThread::INFO::2018-12-19 12:11:00,339::state_decorators::92::ovirt_hosted_engine_ha.agent.hosted_engine.HostedEngine::(check) Timeout cleared while transitioning <class 'ovirt_hosted_engine_ha.agent.states.EngineStarting'> -> <class 'ovirt_hosted_engine_ha.agent.states.EngineUp'>
 MainThread::INFO::2018-12-19 12:11:00,365::brokerlink::68::ovirt_hosted_engine_ha.lib.brokerlink.BrokerLink::(notify) Success, was notification of state_transition (EngineStarting-EngineUp) sent? sent
 MainThread::INFO::2018-12-19 12:11:00,378::hosted_engine::491::ovirt_hosted_engine_ha.agent.hosted_engine.HostedEngine::(_monitoring_loop) Current state EngineUp (score: 3400)

but it was still in local maintenance:

 MainThread::INFO::2018-12-19 12:11:10,391::state_decorators::128::ovirt_hosted_engine_ha.agent.hosted_engine.HostedEngine::(check) Local maintenance detected

and so it tries to migrate the engine VM away again:

 MainThread::INFO::2018-12-19 12:11:10,424::brokerlink::68::ovirt_hosted_engine_ha.lib.brokerlink.BrokerLink::(notify) Success, was notification of state_transition (EngineUp-LocalMaintenanceMigrateVm) sent? sent
 MainThread::WARNING::2018-12-19 12:11:10,542::hosted_engine::600::ovirt_hosted_engine_ha.agent.hosted_engine.HostedEngine::(_stop_domain_monitor_if_possible) The VM is running locally or we have no data, keeping the domain monitor.
 MainThread::INFO::2018-12-19 12:11:10,543::states::243::ovirt_hosted_engine_ha.agent.hosted_engine.HostedEngine::(score) Score is 0 due to local maintenance mode
 MainThread::INFO::2018-12-19 12:11:10,543::hosted_engine::491::ovirt_hosted_engine_ha.agent.hosted_engine.HostedEngine::(_monitoring_loop) Current state LocalMaintenanceMigrateVm (score: 0)
 MainThread::INFO::2018-12-19 12:11:10,588::brokerlink::68::ovirt_hosted_engine_ha.lib.brokerlink.BrokerLink::(notify) Success, was notification of state_transition (LocalMaintenanceMigrateVm-EngineMigratingAway) sent? sent
 MainThread::WARNING::2018-12-19 12:11:10,588::hosted_engine::600::ovirt_hosted_engine_ha.agent.hosted_engine.HostedEngine::(_stop_domain_monitor_if_possible) The VM is running locally or we have no data, keeping the domain monitor.
 MainThread::INFO::2018-12-19 12:11:10,592::hosted_engine::491::ovirt_hosted_engine_ha.agent.hosted_engine.HostedEngine::(_monitoring_loop) Current state EngineMigratingAway (score: 3400)

but this failed since the engine VM wasn't really running here:

 MainThread::ERROR::2018-12-19 12:11:19,615::states::828::ovirt_hosted_engine_ha.agent.hosted_engine.HostedEngine::(consume) Migration failed: {u'status': {u'message': u'Fatal error during migration', u'code': 12}, u'progress': 0}
 MainThread::INFO::2018-12-19 12:11:19,641::brokerlink::68::ovirt_hosted_engine_ha.lib.brokerlink.BrokerLink::(notify) Success, was notification of state_transition (EngineMigratingAway-ReinitializeFSM) sent? ignored
 MainThread::WARNING::2018-12-19 12:11:19,772::hosted_engine::600::ovirt_hosted_engine_ha.agent.hosted_engine.HostedEngine::(_stop_domain_monitor_if_possible) The VM is running locally or we have no data, keeping the domain monitor.
 MainThread::INFO::2018-12-19 12:11:19,772::hosted_engine::491::ovirt_hosted_engine_ha.agent.hosted_engine.HostedEngine::(_monitoring_loop) Current state ReinitializeFSM (score: 0)
 MainThread::INFO::2018-12-19 12:11:19,774::state_decorators::128::ovirt_hosted_engine_ha.agent.hosted_engine.HostedEngine::(check) Local maintenance detected
 MainThread::INFO::2018-12-19 12:11:19,803::brokerlink::68::ovirt_hosted_engine_ha.lib.brokerlink.BrokerLink::(notify) Success, was notification of state_transition (ReinitializeFSM-LocalMaintenance) sent? sent
 MainThread::WARNING::2018-12-19 12:11:19,803::hosted_engine::600::ovirt_hosted_engine_ha.agent.hosted_engine.HostedEngine::(_stop_domain_monitor_if_possible) The VM is running locally or we have no data, keeping the domain monitor.
 MainThread::INFO::2018-12-19 12:11:19,803::hosted_engine::491::ovirt_hosted_engine_ha.agent.hosted_engine.HostedEngine::(_monitoring_loop) Current state LocalMaintenance (score: 0)
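
For reference, the whole chain of transitions can be extracted from agent.log with a small helper like this (a rough sketch, not from the original report; it assumes the default agent.log location and the 'state_transition (Old-New)' notification format shown above):

 import re

 # Broker notifications look like:
 #   "... was notification of state_transition (EngineUp-LocalMaintenanceMigrateVm) sent? sent"
 TRANSITION_RE = re.compile(
     r"(\d{4}-\d{2}-\d{2} \d{2}:\d{2}:\d{2},\d+).*state_transition \((\w+)-(\w+)\)")

 with open("/var/log/ovirt-hosted-engine-ha/agent.log") as log:
     for line in log:
         m = TRANSITION_RE.search(line)
         if m:
             when, old, new = m.groups()
             print("%s  %s -> %s" % (when, old, new))

On the excerpts above it would print the LocalMaintenance -> ReinitializeFSM -> EngineDown -> EngineStarting -> EngineUp -> LocalMaintenanceMigrateVm -> EngineMigratingAway -> ReinitializeFSM -> LocalMaintenance sequence walked through in this comment.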

In the end nothing really bad happened, but the logs, and what we reported to the user in the meantime, got really polluted.

Comment 7 Andrej Krejcir 2019-01-22 16:34:52 UTC
Looking at this, I noticed a few things:

When ovirt-ha-agent is in LocalMaintenance state, it does not check if the HE VM is running on the host or not. 
So the movement from LocalMaintenance to ReinitializeFSM has to be caused by an external event from the command line, cockpit or the engine.
The local maintenance mode was probably not canceled by the engine, because there is no log line in vdsm.log indicating it.

The migration command from the engine is probably caused by balancing. There should not be any other reason why the engine would initiate a migration without a user command.

It would be useful to have the engine logs, so we know why the engine sent the destroy command to VDSM and whether the host's HA score was propagated to the engine before it started the migration.

Comment 8 Wei Wang 2019-01-29 08:51:20 UTC
This issue has not been detected with recent builds, e.g. RHVH-4.3-20190124.0-RHVH-x86_64-dvd1.iso.
Migration is successful the first time the "Put this host into local maintenance" button is clicked in the cockpit UI.

Comment 9 Sandro Bonazzola 2019-02-18 07:57:59 UTC
Moving to 4.3.2, as this is not identified as a blocker for 4.3.1.

Comment 10 Sandro Bonazzola 2019-07-11 07:03:32 UTC
Re-targeting to 4.3.6, as this is not identified as a blocker for 4.3.5.

Comment 11 Sandro Bonazzola 2019-08-01 15:19:45 UTC
Already fixed in 4.3.5; testing in the 4.3.6 scope.

Comment 12 Wei Wang 2019-08-05 05:37:21 UTC
Tested with RHVH-4.3-20190801.2-RHVH-x86_64-dvd1.iso; the bug is fixed. Moving to "VERIFIED".

Comment 13 Sandro Bonazzola 2019-09-26 19:43:14 UTC
This bugzilla is included in the oVirt 4.3.6 release, published on September 26th 2019.

Since the problem described in this bug report should be resolved in the oVirt 4.3.6 release, it has been closed with a resolution of CURRENT RELEASE.

If the solution does not work for you, please open a new bug report.