Description of problem:
There are three nodes in my cluster: zod, sulphur and tettnang. Some VMs were running on zod when I put the node into maintenance. The engine tries to migrate the VMs to another host but fails to migrate some of them.

Version-Release number of selected component (if applicable):
libgovirt-0.3.3-1.el7_2.1.x86_64
ovirt-host-deploy-1.4.1-1.el7ev.noarch
ovirt-vmconsole-1.0.0-1.el7ev.noarch
ovirt-vmconsole-host-1.0.0-1.el7ev.noarch
ovirt-setup-lib-1.0.1-1.el7ev.noarch
ovirt-hosted-engine-ha-1.3.5.1-1.el7ev.noarch
ovirt-hosted-engine-setup-1.3.4.0-1.el7ev.noarch
libvirt-client-1.2.17-13.el7_2.4.x86_64
libvirt-daemon-driver-secret-1.2.17-13.el7_2.4.x86_64
libvirt-python-1.2.17-2.el7.x86_64
libvirt-daemon-1.2.17-13.el7_2.4.x86_64
libvirt-daemon-config-nwfilter-1.2.17-13.el7_2.4.x86_64
libvirt-daemon-driver-interface-1.2.17-13.el7_2.4.x86_64
libvirt-daemon-driver-nodedev-1.2.17-13.el7_2.4.x86_64
libvirt-daemon-driver-storage-1.2.17-13.el7_2.4.x86_64
libvirt-daemon-driver-qemu-1.2.17-13.el7_2.4.x86_64
libvirt-daemon-driver-nwfilter-1.2.17-13.el7_2.4.x86_64
libvirt-lock-sanlock-1.2.17-13.el7_2.4.x86_64
libvirt-daemon-driver-network-1.2.17-13.el7_2.4.x86_64
libvirt-daemon-kvm-1.2.17-13.el7_2.4.x86_64

How reproducible:

Steps to Reproduce:
1. Configure the HC machines.
2. Bootstorm 30 VMs.
3. Put the machine zod into maintenance.

Actual results:
There were 11 VMs running on the machine that was put into maintenance, and only 5 of them migrated to another hypervisor; the rest failed to migrate. Migrating them manually did not work either. The following is the traceback I see in the vdsm logs:

Thread-3817985::ERROR::2016-04-26 17:41:26,086::migration::309::virt.vm::(run) vmId=`2050bdfa-caea-49a3-bad4-f49964b29657`::Failed to migrate
Traceback (most recent call last):
  File "/usr/share/vdsm/virt/migration.py", line 297, in run
    self._startUnderlyingMigration(time.time())
  File "/usr/share/vdsm/virt/migration.py", line 363, in _startUnderlyingMigration
    self._perform_migration(duri, muri)
  File "/usr/share/vdsm/virt/migration.py", line 402, in _perform_migration
    self._vm._dom.migrateToURI3(duri, params, flags)
  File "/usr/share/vdsm/virt/virdomain.py", line 68, in f
    ret = attr(*args, **kwargs)
  File "/usr/lib/python2.7/site-packages/vdsm/libvirtconnection.py", line 124, in wrapper
    ret = f(*args, **kwargs)
  File "/usr/lib/python2.7/site-packages/vdsm/utils.py", line 1313, in wrapper
    return func(inst, *args, **kwargs)
  File "/usr/lib64/python2.7/site-packages/libvirt.py", line 1836, in migrateToURI3
    if ret == -1: raise libvirtError ('virDomainMigrateToURI3() failed', dom=self)
libvirtError: internal error: info migration reply was missing return status

Expected results:
All the VMs should be migrated successfully.

Additional info:
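For context, the call that fails in the traceback above is vdsm's wrapper around the libvirt-python live-migration API, virDomain.migrateToURI3(). A minimal sketch of that call path follows; the connection URI, migration parameters and flags below are illustrative assumptions, not the values vdsm actually used in this setup:

import libvirt

# Connect to the source hypervisor and look up the domain being migrated.
conn = libvirt.open('qemu:///system')
dom = conn.lookupByUUIDString('2050bdfa-caea-49a3-bad4-f49964b29657')

# Assumed destination/migration URIs, for illustration only.
duri = 'qemu+tls://destination.example.com/system'
params = {libvirt.VIR_MIGRATE_PARAM_URI: 'tcp://destination.example.com'}
flags = libvirt.VIR_MIGRATE_LIVE | libvirt.VIR_MIGRATE_PEER2PEER

# This is the call that raises libvirtError("internal error: info migration
# reply was missing return status") in the report above.
dom.migrateToURI3(duri, params, flags)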
From the logs on the migration destination:

periodic/47::WARNING::2016-04-26 15:00:56,440::periodic::285::virt.vm::(__call__) vmId=`2050bdfa-caea-49a3-bad4-f49964b29657`::could not run on 2050bdfa-caea-49a3-bad4-f49964b29657: domain not connected
libvirtEventLoop::WARNING::2016-04-26 15:00:58,783::utils::140::root::(rmFile) File: /var/lib/libvirt/qemu/channels/2050bdfa-caea-49a3-bad4-f49964b29657.com.redhat.rhevm.vdsm already removed
jsonrpc.Executor/4::DEBUG::2016-04-26 15:00:58,773::__init__::503::jsonrpc.JsonRpcServer::(_serveRequest) Calling 'VM.destroy' in bridge with [u'2050bdfa-caea-49a3-bad4-f49964b29657']
jsonrpc.Executor/4::INFO::2016-04-26 15:00:58,788::API::341::vds::(destroy) vmContainerLock acquired by vm 2050bdfa-caea-49a3-bad4-f49964b29657
jsonrpc.Executor/4::DEBUG::2016-04-26 15:00:58,789::vm::3885::virt.vm::(destroy) vmId=`2050bdfa-caea-49a3-bad4-f49964b29657`::destroy Called
Thread-3732940::ERROR::2016-04-26 15:00:58,774::vm::753::virt.vm::(_startUnderlyingVm) vmId=`2050bdfa-caea-49a3-bad4-f49964b29657`::Failed to start a migration destination vm
Traceback (most recent call last):
  File "/usr/share/vdsm/virt/vm.py", line 722, in _startUnderlyingVm
    self._completeIncomingMigration()
  File "/usr/share/vdsm/virt/vm.py", line 2852, in _completeIncomingMigration
    self._incomingMigrationFinished.isSet(), usedTimeout)
  File "/usr/share/vdsm/virt/vm.py", line 2911, in _attachLibvirtDomainAfterMigration
    raise MigrationError(e.get_error_message())
MigrationError: Domain not found: no domain with matching uuid '2050bdfa-caea-49a3-bad4-f49964b29657'
Thread-3732940::INFO::2016-04-26 15:00:58,793::vm::1330::virt.vm::(setDownStatus) vmId=`2050bdfa-caea-49a3-bad4-f49964b29657`::Changed state to Down: VM failed to migrate (code=8)
Thread-3732940::DEBUG::2016-04-26 15:00:58,795::__init__::206::jsonrpc.Notification::(emit) Sending event {"params": {"2050bdfa-caea-49a3-bad4-f49964b29657": {"status": "Down", "timeOffset": "0", "exitReason": 8, "exitMessage": "VM failed to migrate", "exitCode": 1}, "notify_time": 5589338680}, "jsonrpc": "2.0", "method": "|virt|VM_status|2050bdfa-caea-49a3-bad4-f49964b29657"}

The sosreport tree output does have the image path:
/rhev/data-center/mnt/glusterSD/sulphur.lab.eng.blr.redhat.com:_vmstore/297a9b9c-4396-4b30-8bfe-976a67d49a74/images/2c6601e9-456f-4638-9ce4-d98efd97c053/86999bdc-a7bd-4d1e-9faa-a8ba7cf531f4
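The destination-side failure happens when vdsm tries to re-attach to the incoming libvirt domain by UUID after the migration is supposed to have finished. A minimal sketch of that lookup (not vdsm's actual code, just the libvirt-python call it ultimately relies on):

import libvirt

conn = libvirt.open('qemu:///system')
try:
    # After an incoming migration, vdsm looks the domain up by UUID.
    dom = conn.lookupByUUIDString('2050bdfa-caea-49a3-bad4-f49964b29657')
except libvirt.libvirtError as e:
    # If the incoming QEMU process is already gone (e.g. the source side
    # aborted the migration), this raises "Domain not found: no domain with
    # matching uuid ...", which vdsm wraps as MigrationError and then marks
    # the VM as Down, as seen in the log above.
    print(e.get_error_message())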
Martin, any ideas about this error? I'm afraid I don't have enough virt expertise to debug further.
It seems the engine actually tried to migrate the VM, so it is not a scheduling issue. I do not know enough about the underlying libvirt logic; you need someone from the virt team for that (I do scheduling and QoS).
This looks like a libvirt or QEMU bug; the versions look the same on both sides (see e.g. qemu/linux_vm.log). Moving to libvirt; feel free to push it down the stack.
The "internal error: info migration reply was missing return status" is a result of bug 1374613. Because of this bug and missing debug logs from libvirt it's impossible to diagnose the real cause of the migration failure. I'm closing this bug as a duplicate of 1374613. If the issue can be reproduced with a package with bug 1374613 fixed, please, file a new bug so that we can properly investigate the root cause. *** This bug has been marked as a duplicate of bug 1374613 ***