Bug 1665934 - Host stuck in preparing to maintenance - operation failed: migration job: unexpectedly failed
Summary: Host stuck in preparing to maintenance - operation failed: migration job: unexpectedly failed
Keywords:
Status: CLOSED CURRENTRELEASE
Alias: None
Product: ovirt-hosted-engine-ha
Classification: oVirt
Component: Agent
Version: ---
Hardware: Unspecified
OS: Unspecified
Priority: high
Severity: medium
Target Milestone: ovirt-4.3.5
Assignee: Simone Tiraboschi
QA Contact: Nikolai Sednev
URL:
Whiteboard:
Duplicates: 1598509 1629454
Depends On:
Blocks:
 
Reported: 2019-01-14 13:49 UTC by Liran Rotenberg
Modified: 2020-02-25 09:20 UTC (History)
CC List: 10 users

Fixed In Version: ovirt-hosted-engine-setup-2.3.11
Doc Type: Bug Fix
Doc Text:
In the past, when the host running the engine VM was set into maintenance mode from the engine, the engine VM was migrated by ovirt-ha-agent as an indirect action caused by the maintenance mode. This process wasn't fully controlled by the engine and could potentially lead to side effects in specific cases.
Clone Of:
Environment:
Last Closed: 2019-07-30 14:08:16 UTC
oVirt Team: Integration
Embargoed:
sbonazzo: ovirt-4.3?
sbonazzo: planning_ack?
sbonazzo: devel_ack+
sbonazzo: testing_ack?


Attachments
libvirt logs (893.97 KB, application/x-xz)
2019-01-14 13:55 UTC, Liran Rotenberg
no flags Details


Links
System ID Private Priority Status Summary Last Updated
oVirt gerrit 99404 0 'None' MERGED Remove the support to start a migration 2021-01-25 10:44:38 UTC
oVirt gerrit 99462 0 'None' MERGED Avoid entering local mainteance mode if the engine VM is there 2021-01-25 10:44:38 UTC
oVirt gerrit 99523 0 'None' MERGED he: avoid skipping HE VM migration on maintenance 2021-01-25 10:44:38 UTC
oVirt gerrit 100299 0 'None' MERGED he: avoid skipping HE VM migration on maintenance 2021-01-25 10:44:38 UTC
oVirt gerrit 100300 0 'None' MERGED Avoid entering local mainteance mode if the engine VM is there 2021-01-25 10:44:38 UTC
oVirt gerrit 100301 0 'None' MERGED Remove the support to start a migration 2021-01-25 10:44:38 UTC

Internal Links: 1598509

Description Liran Rotenberg 2019-01-14 13:49:08 UTC
Description of problem:
A host with the HE VM is stuck in 'Preparing for Maintenance' status.
The VM doesn't actually migrate away.

In /var/log/messages
Jan 14 13:49:30 ocelot02 libvirtd: 2019-01-14 11:49:30.608+0000: 101924: error : qemuMigrationCheckJobStatus:1457 : operation failed: migration job: unexpectedly failed


The full scenario is an upgrade of RHEV from 4.1 to 4.2.
After upgrading the engine, host_mixed_1 (ocelot01) was successfully upgraded from the web UI.
I moved host_mixed_3 into maintenance to try to reproduce bug 1629454.
The HE VM and SPM are on host_mixed_2 (ocelot02). I then tried to upgrade it from the web UI as I did with host_mixed_1.

Version-Release number of selected component (if applicable):
ovirt-engine-4.2.8.2-0.1.el7ev.noarch

Source host(ocelot02):
vdsm-4.19.51-1.el7ev.x86_64
libvirt-libs-3.9.0-14.el7_5.8.x86_64
libvirt-daemon-driver-network-3.9.0-14.el7_5.8.x86_64
libvirt-daemon-driver-storage-gluster-3.9.0-14.el7_5.8.x86_64
libvirt-daemon-driver-secret-3.9.0-14.el7_5.8.x86_64
libvirt-daemon-config-nwfilter-3.9.0-14.el7_5.8.x86_64
libvirt-daemon-driver-storage-logical-3.9.0-14.el7_5.8.x86_64
libvirt-python-3.9.0-1.el7.x86_64
libvirt-daemon-driver-nwfilter-3.9.0-14.el7_5.8.x86_64
libvirt-daemon-driver-storage-iscsi-3.9.0-14.el7_5.8.x86_64
libvirt-client-3.9.0-14.el7_5.8.x86_64
libvirt-daemon-driver-qemu-3.9.0-14.el7_5.8.x86_64
libvirt-daemon-driver-storage-rbd-3.9.0-14.el7_5.8.x86_64
libvirt-daemon-driver-storage-disk-3.9.0-14.el7_5.8.x86_64
libvirt-daemon-driver-interface-3.9.0-14.el7_5.8.x86_64
libvirt-daemon-driver-storage-3.9.0-14.el7_5.8.x86_64
libvirt-daemon-driver-storage-core-3.9.0-14.el7_5.8.x86_64
libvirt-daemon-driver-storage-mpath-3.9.0-14.el7_5.8.x86_64
libvirt-lock-sanlock-3.9.0-14.el7_5.8.x86_64
libvirt-daemon-3.9.0-14.el7_5.8.x86_64
libvirt-daemon-driver-storage-scsi-3.9.0-14.el7_5.8.x86_64
libvirt-daemon-driver-nodedev-3.9.0-14.el7_5.8.x86_64
libvirt-daemon-kvm-3.9.0-14.el7_5.8.x86_64
qemu-kvm-tools-rhev-2.10.0-21.el7_5.7.x86_64
qemu-kvm-common-rhev-2.10.0-21.el7_5.7.x86_64
ipxe-roms-qemu-20170123-1.git4e85b27.el7_4.1.noarch
qemu-img-rhev-2.10.0-21.el7_5.7.x86_64
libvirt-daemon-driver-qemu-3.9.0-14.el7_5.8.x86_64
qemu-kvm-rhev-2.10.0-21.el7_5.7.x86_64

Destination host(ocelot01):
vdsm-4.20.46-1.el7ev.x86_64
libvirt-daemon-config-nwfilter-4.5.0-10.el7_6.3.x86_64
libvirt-daemon-driver-storage-iscsi-4.5.0-10.el7_6.3.x86_64
libvirt-daemon-driver-interface-4.5.0-10.el7_6.3.x86_64
libvirt-daemon-driver-lxc-4.5.0-10.el7_6.3.x86_64
libvirt-daemon-driver-storage-disk-4.5.0-10.el7_6.3.x86_64
libvirt-lock-sanlock-4.5.0-10.el7_6.3.x86_64
libvirt-daemon-driver-storage-core-4.5.0-10.el7_6.3.x86_64
libvirt-daemon-config-network-4.5.0-10.el7_6.3.x86_64
libvirt-python-4.5.0-1.el7.x86_64
libvirt-daemon-driver-nodedev-4.5.0-10.el7_6.3.x86_64
libvirt-daemon-driver-storage-gluster-4.5.0-10.el7_6.3.x86_64
libvirt-daemon-driver-storage-mpath-4.5.0-10.el7_6.3.x86_64
libvirt-daemon-driver-storage-4.5.0-10.el7_6.3.x86_64
libvirt-bash-completion-4.5.0-10.el7_6.3.x86_64
libvirt-libs-4.5.0-10.el7_6.3.x86_64
libvirt-daemon-4.5.0-10.el7_6.3.x86_64
libvirt-daemon-driver-nwfilter-4.5.0-10.el7_6.3.x86_64
libvirt-daemon-driver-storage-scsi-4.5.0-10.el7_6.3.x86_64
libvirt-daemon-kvm-4.5.0-10.el7_6.3.x86_64
libvirt-client-4.5.0-10.el7_6.3.x86_64
libvirt-daemon-driver-qemu-4.5.0-10.el7_6.3.x86_64
libvirt-daemon-driver-secret-4.5.0-10.el7_6.3.x86_64
libvirt-daemon-driver-storage-rbd-4.5.0-10.el7_6.3.x86_64
libvirt-4.5.0-10.el7_6.3.x86_64
libvirt-daemon-driver-network-4.5.0-10.el7_6.3.x86_64
libvirt-daemon-driver-storage-logical-4.5.0-10.el7_6.3.x86_64
qemu-kvm-rhev-2.12.0-18.el7_6.1.x86_64
ipxe-roms-qemu-20170123-1.git4e85b27.el7_4.1.noarch
qemu-kvm-tools-rhev-2.12.0-18.el7_6.1.x86_64
qemu-kvm-common-rhev-2.12.0-18.el7_6.1.x86_64
libvirt-daemon-driver-qemu-4.5.0-10.el7_6.3.x86_64
qemu-img-rhev-2.12.0-18.el7_6.1.x86_64

How reproducible:
I tried the upgrade procedure just once, but each time I redo it on this environment (activate the host and re-try the upgrade) it has the same result.

Steps to Reproduce:
1. Deploy SHE environment with RHEV 4.1(2 hosts).
2. Upgrade the engine to 4.2.
3. Upgrade host1 from the web UI.
4. Upgrade host2 from the web UI.

Actual results:
Upgrade of host2 failed; the host is stuck in Preparing for Maintenance due to the unsuccessful HE VM migration.

Expected results:
The HE VM should migrate to host1, host2 should move into maintenance mode, and the upgrade should succeed.

Additional info:
As a simple workaround, re-activate the host and migrate the HE VM manually (invoking the HE VM migration directly, rather than indirectly via the host maintenance change, succeeds).
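
As a minimal sketch of that manual-migration workaround, assuming the oVirt Python SDK (ovirt-engine-sdk4) is available; the engine URL, credentials, and destination host name below are illustrative placeholders, not values taken from this environment:

    # Hypothetical example, not part of this bug's logs: ask the engine to
    # migrate the HE VM explicitly instead of relying on host maintenance.
    import ovirtsdk4 as sdk
    import ovirtsdk4.types as types

    connection = sdk.Connection(
        url='https://engine.example.com/ovirt-engine/api',  # placeholder
        username='admin@internal',
        password='password',                                # placeholder
        ca_file='/etc/pki/ovirt-engine/ca.pem',
    )
    try:
        vms_service = connection.system_service().vms_service()
        he_vm = vms_service.list(search='name=HostedEngine')[0]
        vm_service = vms_service.vm_service(he_vm.id)
        # Migrate the HE VM to an explicit destination host.
        vm_service.migrate(host=types.Host(name='host_mixed_1'))
    finally:
        connection.close()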

Comment 3 Liran Rotenberg 2019-01-14 13:55:25 UTC
Created attachment 1520555 [details]
libvirt logs

Comment 4 Jiri Denemark 2019-01-14 18:22:53 UTC
Unfortunately, the important libvirt log files do not seem to have been produced at
the time you saw the error. There is no "unexpectedly failed" message in them,
and while the message you quoted from /var/log/messages was reported at
2019-01-14 11:49:30.608+0000, the log files start at 2019-01-14
13:11:??.???+0000, which is almost 1.5 hours later.

Anyway, this kind of error is usually reported if the destination QEMU dies
during migration, so /var/log/libvirt/qemu/$VM.log from the destination host
may contain some hints.

One of the hosts gives us

    2019-01-14 11:24:07.405+0000: starting up libvirt version: 3.9.0, package: 14.el7_5.8 (Red Hat, Inc. <http://bugzilla.redhat.com/bugzilla>, 2018-09-04-07:27:24, x86-019.build.eng.bos.redhat.com), qemu version: 2.10.0(qemu-kvm-rhev-2.10.0-21.el7_5.7), hostname: ocelot02.qa.lab.tlv.redhat.com
    LC_ALL=C PATH=/usr/local/sbin:/usr/local/bin:/usr/sbin:/usr/bin QEMU_AUDIO_DRV=none /usr/libexec/qemu-kvm -name guest=HostedEngine,debug-threads=on -S -object secret,id=masterKey0,format=raw,file=/var/lib/libvirt/qemu/domain-1-HostedEngine/master-key.aes -machine pc-i440fx-rhel7.3.0,accel=kvm,usb=off,dump-guest-core=off -cpu Haswell-IBRS,vme=on,f16c=on,rdrand=on,hypervisor=on,arat=on,xsaveopt=on,abm=on -m 16384 -realtime mlock=off -smp 4,maxcpus=16,sockets=16,cores=1,threads=1 -uuid 3074151e-b38f-461f-bf17-90fb36dd41cb -smbios 'type=1,manufacturer=Red Hat,product=RHEV Hypervisor,version=7.5-8.el7,serial=00000000-0000-0000-0000-AC1F6B57AE82,uuid=3074151e-b38f-461f-bf17-90fb36dd41cb' -no-user-config -nodefaults -chardev socket,id=charmonitor,path=/var/lib/libvirt/qemu/domain-1-HostedEngine/monitor.sock,server,nowait -mon chardev=charmonitor,id=monitor,mode=control -rtc base=2019-01-14T11:24:07,driftfix=slew -global kvm-pit.lost_tick_policy=delay -no-hpet -no-reboot -boot strict=on -device piix3-usb-uhci,id=usb,bus=pci.0,addr=0x1.0x2 -device virtio-scsi-pci,id=scsi0,bus=pci.0,addr=0x4 -device virtio-serial-pci,id=virtio-serial0,bus=pci.0,addr=0x5 -drive if=none,id=drive-ide0-1-0,readonly=on -device ide-cd,bus=ide.1,unit=0,drive=drive-ide0-1-0,id=ide0-1-0 -drive file=/var/run/vdsm/storage/7d394a51-c3a3-47e1-ac31-697710ef0f70/c8f4766d-5e44-4798-b0e9-000cf97faebe/0ddd524f-4616-4877-88bb-c5a65cadfb4f,format=raw,if=none,id=drive-virtio-disk0,serial=c8f4766d-5e44-4798-b0e9-000cf97faebe,cache=none,werror=stop,rerror=stop,aio=threads -device virtio-blk-pci,scsi=off,bus=pci.0,addr=0x6,drive=drive-virtio-disk0,id=virtio-disk0,bootindex=1 -netdev tap,fd=31,id=hostnet0,vhost=on,vhostfd=33 -device virtio-net-pci,netdev=hostnet0,id=net0,mac=00:1a:4a:16:10:64,bus=pci.0,addr=0x3 -chardev socket,id=charchannel0,path=/var/lib/libvirt/qemu/channels/3074151e-b38f-461f-bf17-90fb36dd41cb.com.redhat.rhevm.vdsm,server,nowait -device virtserialport,bus=virtio-serial0.0,nr=1,chardev=charchannel0,id=channel0,name=com.redhat.rhevm.vdsm -chardev socket,id=charchannel1,path=/var/lib/libvirt/qemu/channels/3074151e-b38f-461f-bf17-90fb36dd41cb.org.qemu.guest_agent.0,server,nowait -device virtserialport,bus=virtio-serial0.0,nr=2,chardev=charchannel1,id=channel1,name=org.qemu.guest_agent.0 -chardev socket,id=charchannel2,path=/var/lib/libvirt/qemu/channels/3074151e-b38f-461f-bf17-90fb36dd41cb.org.ovirt.hosted-engine-setup.0,server,nowait -device virtserialport,bus=virtio-serial0.0,nr=3,chardev=charchannel2,id=channel2,name=org.ovirt.hosted-engine-setup.0 -chardev pty,id=charconsole0 -device virtconsole,chardev=charconsole0,id=console0 -vnc 0:0,password -device VGA,id=video0,vgamem_mb=32,bus=pci.0,addr=0x2 -incoming defer -object rng-random,id=objrng0,filename=/dev/random -device virtio-rng-pci,rng=objrng0,id=rng0,bus=pci.0,addr=0x7 -msg timestamp=on
    2019-01-14T11:24:07.545834Z qemu-kvm: -chardev pty,id=charconsole0: char device redirected to /dev/pts/1 (label charconsole0)
    2019-01-14T11:24:07.548146Z qemu-kvm: -drive file=/var/run/vdsm/storage/7d394a51-c3a3-47e1-ac31-697710ef0f70/c8f4766d-5e44-4798-b0e9-000cf97faebe/0ddd524f-4616-4877-88bb-c5a65cadfb4f,format=raw,if=none,id=drive-virtio-disk0,serial=c8f4766d-5e44-4798-b0e9-000cf97faebe,cache=none,werror=stop,rerror=stop,aio=threads: 'serial' is deprecated, please use the corresponding option of '-device' instead
    2019-01-14 11:49:26.525+0000: initiating migration

We can see the QEMU process was started at 11:24:07.405 and the outgoing migration
started at 11:49:26.525+0000. So this was the source host.

Looking at /var/log/messages on the source host (ocelot02), I can see the error from this
bug description

    Jan 14 13:49:30 ocelot02 libvirtd: 2019-01-14 11:49:30.608+0000: 101924: error : qemuMigrationCheckJobStatus:1457 : operation failed: migration job: unexpectedly failed

vdsm log file contains

    2019-01-14 13:49:30,621+0200 ERROR (migsrc/3074151e) [virt.vm] (vmId='3074151e-b38f-461f-bf17-90fb36dd41cb') Domain not found: no domain with matching name 'HostedEngine' (migration:287)
    2019-01-14 13:49:30,627+0200 ERROR (migsrc/3074151e) [virt.vm] (vmId='3074151e-b38f-461f-bf17-90fb36dd41cb') Failed to migrate (migration:429)
    Traceback (most recent call last):
      File "/usr/lib/python2.7/site-packages/vdsm/virt/migration.py", line 411, in run
        self._startUnderlyingMigration(time.time())
      File "/usr/lib/python2.7/site-packages/vdsm/virt/migration.py", line 489, in _startUnderlyingMigration
        self._perform_with_downtime_thread(duri, muri)
      File "/usr/lib/python2.7/site-packages/vdsm/virt/migration.py", line 556, in _perform_with_downtime_thread
        self._perform_migration(duri, muri)
      File "/usr/lib/python2.7/site-packages/vdsm/virt/migration.py", line 529, in _perform_migration
        self._vm._dom.migrateToURI3(duri, params, flags)
      File "/usr/lib/python2.7/site-packages/vdsm/virt/virdomain.py", line 69, in f
        ret = attr(*args, **kwargs)
      File "/usr/lib/python2.7/site-packages/vdsm/libvirtconnection.py", line 123, in wrapper
        ret = f(*args, **kwargs)
      File "/usr/lib/python2.7/site-packages/vdsm/utils.py", line 1006, in wrapper
        return func(inst, *args, **kwargs)
      File "/usr/lib64/python2.7/site-packages/libvirt.py", line 1746, in migrateToURI3
        if ret == -1: raise libvirtError ('virDomainMigrateToURI3() failed', dom=self)
    libvirtError: Domain not found: no domain with matching name 'HostedEngine'
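
For reference, a minimal standalone sketch of the libvirt-python call that fails in the traceback above; this is not vdsm's actual code, and the URIs and flags are illustrative placeholders:

    # Hypothetical sketch of the migrateToURI3() call seen in the traceback.
    import libvirt

    conn = libvirt.open('qemu:///system')
    dom = conn.lookupByName('HostedEngine')

    flags = libvirt.VIR_MIGRATE_LIVE | libvirt.VIR_MIGRATE_PEER2PEER
    params = {
        # Migration transport URI; vdsm computes its own parameters.
        libvirt.VIR_MIGRATE_PARAM_URI: 'tcp://destination.example.com',
    }
    try:
        # If the destination QEMU is destroyed while this call is running,
        # libvirt aborts with errors like those quoted in this bug
        # ("migration job: unexpectedly failed" / "Domain not found").
        dom.migrateToURI3('qemu+tls://destination.example.com/system',
                          params, flags)
    except libvirt.libvirtError as e:
        print('Migration failed:', e)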

On the destination host the qemu log contains:

    2019-01-14 11:49:26.077+0000: starting up libvirt version: 4.5.0, package: 10.el7_6.3 (Red Hat, Inc. <http://bugzilla.redhat.com/bugzilla>, 2018-11-08-09:31:37, x86-020.build.eng.bos.redhat.com), qemu version: 2.12.0qemu-kvm-rhev-2.12.0-18.el7_6.1, kernel: 3.10.0-957.5.1.el7.x86_64, hostname: ocelot01.qa.lab.tlv.redhat.com
    LC_ALL=C PATH=/usr/local/sbin:/usr/local/bin:/usr/sbin:/usr/bin QEMU_AUDIO_DRV=none /usr/libexec/qemu-kvm -name guest=HostedEngine,debug-threads=on -S -object secret,id=masterKey0,format=raw,file=/var/lib/libvirt/qemu/domain-1-HostedEngine/master-key.aes -machine pc-i440fx-rhel7.3.0,accel=kvm,usb=off,dump-guest-core=off -cpu Haswell-IBRS,vme=on,f16c=on,rdrand=on,hypervisor=on,arat=on,xsaveopt=on,abm=on -m 16384 -realtime mlock=off -smp 4,maxcpus=16,sockets=16,cores=1,threads=1 -uuid 3074151e-b38f-461f-bf17-90fb36dd41cb -smbios 'type=1,manufacturer=Red Hat,product=RHEV Hypervisor,version=7.5-8.el7,serial=00000000-0000-0000-0000-AC1F6B57AE82,uuid=3074151e-b38f-461f-bf17-90fb36dd41cb' -no-user-config -nodefaults -chardev socket,id=charmonitor,fd=31,server,nowait -mon chardev=charmonitor,id=monitor,mode=control -rtc base=2019-01-14T11:49:26,driftfix=slew -global kvm-pit.lost_tick_policy=delay -no-hpet -no-reboot -boot strict=on -device piix3-usb-uhci,id=usb,bus=pci.0,addr=0x1.0x2 -device virtio-scsi-pci,id=scsi0,bus=pci.0,addr=0x4 -device virtio-serial-pci,id=virtio-serial0,bus=pci.0,addr=0x5 -drive if=none,id=drive-ide0-1-0,readonly=on -device ide-cd,bus=ide.1,unit=0,drive=drive-ide0-1-0,id=ide0-1-0 -drive file=/var/run/vdsm/storage/7d394a51-c3a3-47e1-ac31-697710ef0f70/c8f4766d-5e44-4798-b0e9-000cf97faebe/0ddd524f-4616-4877-88bb-c5a65cadfb4f,format=raw,if=none,id=drive-virtio-disk0,serial=c8f4766d-5e44-4798-b0e9-000cf97faebe,werror=stop,rerror=stop,cache=none,aio=threads -device virtio-blk-pci,scsi=off,bus=pci.0,addr=0x6,drive=drive-virtio-disk0,id=virtio-disk0,bootindex=1,write-cache=on -netdev tap,fd=33,id=hostnet0,vhost=on,vhostfd=34 -device virtio-net-pci,netdev=hostnet0,id=net0,mac=00:1a:4a:16:10:64,bus=pci.0,addr=0x3 -chardev socket,id=charchannel0,fd=35,server,nowait -device virtserialport,bus=virtio-serial0.0,nr=1,chardev=charchannel0,id=channel0,name=com.redhat.rhevm.vdsm -chardev socket,id=charchannel1,fd=36,server,nowait -device virtserialport,bus=virtio-serial0.0,nr=2,chardev=charchannel1,id=channel1,name=org.qemu.guest_agent.0 -chardev socket,id=charchannel2,fd=37,server,nowait -device virtserialport,bus=virtio-serial0.0,nr=3,chardev=charchannel2,id=channel2,name=org.ovirt.hosted-engine-setup.0 -chardev pty,id=charconsole0 -device virtconsole,chardev=charconsole0,id=console0 -vnc 0:0,password -device VGA,id=video0,vgamem_mb=32,bus=pci.0,addr=0x2 -incoming defer -object rng-random,id=objrng0,filename=/dev/random -device virtio-rng-pci,rng=objrng0,id=rng0,bus=pci.0,addr=0x7 -sandbox on,obsolete=deny,elevateprivileges=deny,spawn=deny,resourcecontrol=deny -msg timestamp=on
    2019-01-14T11:49:26.205346Z qemu-kvm: -chardev pty,id=charconsole0: char device redirected to /dev/pts/0 (label charconsole0)
    2019-01-14T11:49:26.207845Z qemu-kvm: -drive file=/var/run/vdsm/storage/7d394a51-c3a3-47e1-ac31-697710ef0f70/c8f4766d-5e44-4798-b0e9-000cf97faebe/0ddd524f-4616-4877-88bb-c5a65cadfb4f,format=raw,if=none,id=drive-virtio-disk0,serial=c8f4766d-5e44-4798-b0e9-000cf97faebe,werror=stop,rerror=stop,cache=none,aio=threads: 'serial' is deprecated, please use the corresponding option of '-device' instead
    2019-01-14T11:49:30.085678Z qemu-kvm: terminating on signal 15 from pid 16219 (<unknown process>)
    2019-01-14 11:49:30.285+0000: shutting down, reason=destroyed

/var/log/messages does not provide a lot of details either:

    Jan 14 13:49:26 ocelot01 systemd-machined: New machine qemu-1-HostedEngine.
    Jan 14 13:49:26 ocelot01 systemd: Started Virtual Machine qemu-1-HostedEngine.
    Jan 14 13:49:26 ocelot01 kvm: 1 guest now active
    Jan 14 13:49:30 ocelot01 kvm: 0 guests now active
    Jan 14 13:49:30 ocelot01 systemd-machined: Machine qemu-1-HostedEngine terminated.

No libvirt errors are shown here, but vdsm.log is interesting:

    2019-01-14 13:49:30,076+0200 INFO  (jsonrpc/5) [api.virt] START getStats() from=::1,59606, vmId=3074151e-b38f-461f-bf17-90fb36dd41cb (api:46)
    2019-01-14 13:49:30,077+0200 INFO  (jsonrpc/5) [api.virt] FINISH getStats return={'status': {'message': 'Done', 'code': 0}, 'statsList': [{'status': 'Migration Destination', 'displayInfo': [{'tlsPort': '-1', 'ipAddress': '0', 'type': u'vnc', 'port': '-1'}], 'hash': '4043262262160744328', 'acpiEnable': 'true', 'guestFQDN': '', 'vmId': '3074151e-b38f-461f-bf17-90fb36dd41cb', 'displayIp': '0', 'cpuUsage': '0.00', 'displayPort': '-1', 'session': 'Unknown', 'displaySecurePort': '-1', 'timeOffset': '0', 'memUsage': '0', 'clientIp': u'', 'elapsedTime': '11893', 'kvmEnable': u'true', 'vmType': 'kvm', 'cpuUser': '0.00', 'monitorResponse': '0', 'username': 'Unknown', 'displayType': 'vnc', 'cpuSys': '0.00', 'guestCPUCount': -1, 'appsList': (), 'guestIPs': '', 'statusTime': '4294846400', 'vmName': 'HostedEngine'}]} from=::1,59606, vmId=3074151e-b38f-461f-bf17-90fb36dd41cb (api:52)
    2019-01-14 13:49:30,077+0200 INFO  (jsonrpc/5) [jsonrpc.JsonRpcServer] RPC call VM.getStats succeeded in 0.00 seconds (__init__:573)
    2019-01-14 13:49:30,079+0200 INFO  (jsonrpc/1) [jsonrpc.JsonRpcServer] RPC call Host.ping2 succeeded in 0.00 seconds (__init__:573)
    2019-01-14 13:49:30,081+0200 INFO  (jsonrpc/7) [api.virt] START destroy(gracefulAttempts=1) from=::1,59606, vmId=3074151e-b38f-461f-bf17-90fb36dd41cb (api:46)
    2019-01-14 13:49:30,081+0200 INFO  (jsonrpc/7) [virt.vm] (vmId='3074151e-b38f-461f-bf17-90fb36dd41cb') Release VM resources (vm:5279)
    2019-01-14 13:49:30,081+0200 INFO  (jsonrpc/7) [virt.vm] (vmId='3074151e-b38f-461f-bf17-90fb36dd41cb') Stopping connection (guestagent:442)
    2019-01-14 13:49:30,081+0200 INFO  (jsonrpc/7) [vdsm.api] START teardownImage(sdUUID=u'7d394a51-c3a3-47e1-ac31-697710ef0f70', spUUID=u'00000000-0000-0000-0000-000000000000', imgUUID=u'c8f4766d-5e44-4798-b0e9-000cf97faebe', volUUID=None) from=::1,59606, task_id=80ca6bd0-7f10-4c61-a3b4-665348319091 (api:46)
    2019-01-14 13:49:30,081+0200 INFO  (jsonrpc/7) [storage.StorageDomain] Removing image rundir link u'/var/run/vdsm/storage/7d394a51-c3a3-47e1-ac31-697710ef0f70/c8f4766d-5e44-4798-b0e9-000cf97faebe' (fileSD:600)
    2019-01-14 13:49:30,081+0200 INFO  (jsonrpc/7) [vdsm.api] FINISH teardownImage return=None from=::1,59606, task_id=80ca6bd0-7f10-4c61-a3b4-665348319091 (api:52)
    2019-01-14 13:49:30,082+0200 INFO  (jsonrpc/7) [virt.vm] (vmId='3074151e-b38f-461f-bf17-90fb36dd41cb') Stopping connection (guestagent:442)
    2019-01-14 13:49:30,082+0200 INFO  (jsonrpc/7) [vdsm.api] START inappropriateDevices(thiefId='3074151e-b38f-461f-bf17-90fb36dd41cb') from=::1,59606, task_id=0a8354b5-d6b4-430c-8a7e-0fac6eb9c713 (api:46)
    2019-01-14 13:49:30,084+0200 INFO  (jsonrpc/7) [vdsm.api] FINISH inappropriateDevices return=None from=::1,59606, task_id=0a8354b5-d6b4-430c-8a7e-0fac6eb9c713 (api:52)
    2019-01-14 13:49:30,225+0200 INFO  (jsonrpc/7) [root] /usr/libexec/vdsm/hooks/after_vm_destroy/50_vhostmd: rc=0 err= (hooks:110)
    2019-01-14 13:49:30,307+0200 INFO  (vm/3074151e) [virt.vm] (vmId='3074151e-b38f-461f-bf17-90fb36dd41cb') Changed state to Down: VM destroyed during the startup (code=10) (vm:1689)
    2019-01-14 13:49:30,308+0200 INFO  (vm/3074151e) [virt.vm] (vmId='3074151e-b38f-461f-bf17-90fb36dd41cb') Stopping connection (guestagent:442)
    2019-01-14 13:49:30,309+0200 INFO  (libvirt/events) [virt.vm] (vmId='3074151e-b38f-461f-bf17-90fb36dd41cb') underlying process disconnected (vm:1063)
    2019-01-14 13:49:30,318+0200 INFO  (jsonrpc/7) [root] /usr/libexec/vdsm/hooks/after_vm_destroy/delete_vhostuserclient_hook: rc=0 err= (hooks:110)
    2019-01-14 13:49:30,319+0200 INFO  (jsonrpc/7) [virt.vm] (vmId='3074151e-b38f-461f-bf17-90fb36dd41cb') Can't undefine disconnected VM '3074151e-b38f-461f-bf17-90fb36dd41cb' (vm:2388)
    2019-01-14 13:49:30,319+0200 INFO  (jsonrpc/7) [api.virt] FINISH destroy return={'status': {'message': 'Machine destroyed', 'code': 0}} from=::1,59606, vmId=3074151e-b38f-461f-bf17-90fb36dd41cb (api:52)
    2019-01-14 13:49:30,319+0200 INFO  (jsonrpc/7) [jsonrpc.JsonRpcServer] RPC call VM.destroy succeeded in 0.23 seconds (__init__:573)

This looks like someone or something asked vdsm to destroy the domain while
it was being migrated to this host (13:49:30,081+0200 ... START destroy), and
vdsm did what it was asked to do, which caused the failed migration on the
source host.

While I can't confirm this from libvirtd.log, there seems to be no libvirt bug
involved here. The migration was aborted by a request to destroy the
destination domain.

Comment 5 Jiri Denemark 2019-01-14 18:27:10 UTC
I'm moving this bug to vdsm for further investigation.

Comment 6 Michal Skrivanek 2019-01-15 06:38:35 UTC
Likely the ovirt-hosted-engine-ha broker/agent, judging from the ::1 address.

Comment 7 Michal Skrivanek 2019-01-15 12:56:30 UTC
broker:

Thread-13::INFO::2019-01-14 13:49:18,495::engine_health::94::engine_health.EngineHealth::(action) VM not on this host
Thread-10::INFO::2019-01-14 13:49:19,075::mgmt_bridge::62::mgmt_bridge.MgmtBridge::(action) Found bridge ovirtmgmt with ports
Thread-9::INFO::2019-01-14 13:49:20,159::ping::60::ping.Ping::(action) Successfully pinged 10.35.30.254
Thread-11::INFO::2019-01-14 13:49:27,769::mem_free::51::mem_free.MemFree::(action) memFree: 119479
Thread-13::INFO::2019-01-14 13:49:28,513::engine_health::203::engine_health.EngineHealth::(_result_from_stats) VM not running on this host, status Migration Destination
Thread-10::INFO::2019-01-14 13:49:29,004::mgmt_bridge::62::mgmt_bridge.MgmtBridge::(action) Found bridge ovirtmgmt with ports
Thread-9::INFO::2019-01-14 13:49:30,318::ping::60::ping.Ping::(action) Successfully pinged 10.35.30.254
Thread-14::INFO::2019-01-14 13:49:30,345::engine_health::203::engine_health.EngineHealth::(_result_from_stats) VM not running on this host, status Down

agent:

MainThread::INFO::2019-01-14 13:49:19,254::states::497::ovirt_hosted_engine_ha.agent.hosted_engine.HostedEngine::(consume) The engine is not running, but we do not have enough data to decide which hosts are alive
MainThread::ERROR::2019-01-14 13:49:19,460::config_ovf::84::ovirt_hosted_engine_ha.agent.hosted_engine.HostedEngine.config.vm::(_get_vm_conf_content_from_ovf_store) Unable to identify the OVF_STORE volume, falling back to initial vm.conf. Please ensure you already added your first data domain for regular VMs
MainThread::INFO::2019-01-14 13:49:19,529::hosted_engine::491::ovirt_hosted_engine_ha.agent.hosted_engine.HostedEngine::(_monitoring_loop) Current state EngineDown (score: 3400)
[above lines repeated many times in a 10s interval]
MainThread::INFO::2019-01-14 13:49:29,547::states::510::ovirt_hosted_engine_ha.agent.hosted_engine.HostedEngine::(consume) Engine down and local host has best score (3400), attempting to start engine VM
MainThread::INFO::2019-01-14 13:49:29,576::brokerlink::68::ovirt_hosted_engine_ha.lib.brokerlink.BrokerLink::(notify) Success, was notification of state_transition (EngineDown-EngineStart) sent? sent
MainThread::ERROR::2019-01-14 13:49:29,823::config_ovf::84::ovirt_hosted_engine_ha.agent.hosted_engine.HostedEngine.config.vm::(_get_vm_conf_content_from_ovf_store) Unable to identify the OVF_STORE volume, falling back to initial vm.conf. Please ensure you already added your first data domain for regular VMs
MainThread::INFO::2019-01-14 13:49:29,892::hosted_engine::491::ovirt_hosted_engine_ha.agent.hosted_engine.HostedEngine::(_monitoring_loop) Current state EngineStart (score: 3400)
MainThread::ERROR::2019-01-14 13:49:30,059::config_ovf::84::ovirt_hosted_engine_ha.agent.hosted_engine.HostedEngine.config.vm::(_get_vm_conf_content_from_ovf_store) Unable to identify the OVF_STORE volume, falling back to initial vm.conf. Please ensure you already added your first data domain for regular VMs
MainThread::INFO::2019-01-14 13:49:30,072::hosted_engine::967::ovirt_hosted_engine_ha.agent.hosted_engine.HostedEngine::(_clean_vdsm_state) Ensuring VDSM state is clear for engine VM
MainThread::INFO::2019-01-14 13:49:30,078::hosted_engine::987::ovirt_hosted_engine_ha.agent.hosted_engine.HostedEngine::(_clean_vdsm_state) Cleaning state for non-running VM
MainThread::INFO::2019-01-14 13:49:31,352::hosted_engine::979::ovirt_hosted_engine_ha.agent.hosted_engine.HostedEngine::(_clean_vdsm_state) Vdsm state for VM clean

So the agent disregarded the Migration Destination case and just decided to start a new HostedEngine, which requires killing any lingering VM.
Martin, is that expected?

Comment 9 Martin Sivák 2019-01-15 13:04:22 UTC
Well, if there was an incoming migration then the source host was not reporting the status properly or the local agent was not sure the source host was actually up (it takes at least two updates after agent restart - ~20 seconds). This might be a corner case in the state machine.

Comment 10 Michal Skrivanek 2019-01-15 15:16:41 UTC
Would it make sense to defer the decision until the VM is no longer in the Migration Destination state? Migration works pretty reliably on its own, so I guess the agent could wait a bit longer.

Comment 11 Martin Sivák 2019-01-15 15:22:14 UTC
That is one option yes.

Comment 12 Michal Skrivanek 2019-01-16 08:45:57 UTC
Simone, I believe the right place to fix this is for the agent to consider the Migration Destination state as well.
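
A hypothetical sketch (not the actual ovirt-hosted-engine-ha code) of the kind of check being proposed here: before cleaning VDSM state and starting a new engine VM, the agent would first look at the reported VM status and keep waiting while it is a migration destination:

    # Hypothetical helper, not the real agent state machine.
    MIGRATION_DESTINATION = 'Migration Destination'

    def safe_to_start_engine_vm(vm_status, engine_up_elsewhere):
        """Return True only if starting the engine VM locally cannot
        interrupt an in-flight migration."""
        if vm_status == MIGRATION_DESTINATION:
            # The HE VM is being migrated to this host; destroying it now
            # would abort the migration on the source (see comment 4).
            return False
        if engine_up_elsewhere:
            return False
        return vm_status == 'Down'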

Comment 13 Sandro Bonazzola 2019-01-21 08:28:34 UTC
re-targeting to 4.3.1 since this BZ has not been proposed as blocker for 4.3.0.
If you think this bug should block 4.3.0 please re-target and set blocker flag.

Comment 14 Sandro Bonazzola 2019-02-18 07:54:50 UTC
Moving to 4.3.2 not being identified as blocker for 4.3.1.

Comment 15 Sandro Bonazzola 2019-02-20 08:49:57 UTC
*** Bug 1629454 has been marked as a duplicate of this bug. ***

Comment 16 Sandro Bonazzola 2019-02-20 08:53:44 UTC
*** Bug 1598509 has been marked as a duplicate of this bug. ***

Comment 17 Nikolai Sednev 2019-07-15 12:24:32 UTC
Works just fine on these components:
ovirt-engine-setup-4.3.5.4-0.1.el7.noarch
ovirt-hosted-engine-ha-2.3.3-1.el7ev.noarch
ovirt-hosted-engine-setup-2.3.11-1.el7ev.noarch
Linux 3.10.0-1060.el7.x86_64 #1 SMP Mon Jul 1 18:28:13 UTC 2019 x86_64 x86_64 x86_64 GNU/Linux
Red Hat Enterprise Linux Server release 7.7 Beta (Maipo)


Moving to verified; if this is still reproducible, please reopen.

Comment 18 Sandro Bonazzola 2019-07-30 14:08:16 UTC
This bug fix is included in the oVirt 4.3.5 release, published on July 30th 2019.

Since the problem described in this bug report should be resolved in the oVirt 4.3.5 release, it has been closed with a resolution of CURRENT RELEASE.

If the solution does not work for you, please open a new bug report.

