Bug 1330548

Summary: VMs failed to migrate when one of the nodes in the cluster is put into maintenance.
Product: Red Hat Enterprise Linux 7
Reporter: RamaKasturi <knarra>
Component: libvirt
Assignee: Jiri Denemark <jdenemar>
Status: CLOSED DUPLICATE
QA Contact: Virtualization Bugs <virt-bugs>
Severity: high
Docs Contact:
Priority: unspecified
Version: 7.2
CC: bugs, dyuan, jsuchane, msivak, pzhang, rbalakri, sabose, xuzhang, ykaul
Target Milestone: pre-dev-freeze
Target Release: ---
Hardware: Unspecified
OS: Unspecified
Whiteboard:
Fixed In Version:
Doc Type: Bug Fix
Doc Text:
Story Points: ---
Clone Of:
Environment:
Last Closed: 2016-09-12 13:49:53 UTC
Type: Bug
Regression: ---
Mount Type: ---
Documentation: ---
CRM:
Verified Versions:
Category: ---
oVirt Team: Virt
RHEL 7.3 requirements from Atomic Host:
Cloudforms Team: ---
Target Upstream Version:
Embargoed:
Bug Depends On:
Bug Blocks: 1258386

Description RamaKasturi 2016-04-26 12:34:37 UTC
Description of problem:
My cluster has three nodes: zod, sulphur and tettnang. With some VMs running on zod, I put the node zod into maintenance. The engine tries to migrate the VMs to another host but fails to migrate some of them.

Version-Release number of selected component (if applicable):
libgovirt-0.3.3-1.el7_2.1.x86_64
ovirt-host-deploy-1.4.1-1.el7ev.noarch
ovirt-vmconsole-1.0.0-1.el7ev.noarch
ovirt-vmconsole-host-1.0.0-1.el7ev.noarch
ovirt-setup-lib-1.0.1-1.el7ev.noarch
ovirt-hosted-engine-ha-1.3.5.1-1.el7ev.noarch
ovirt-hosted-engine-setup-1.3.4.0-1.el7ev.noarch
libvirt-client-1.2.17-13.el7_2.4.x86_64
libvirt-daemon-driver-secret-1.2.17-13.el7_2.4.x86_64
libvirt-python-1.2.17-2.el7.x86_64
libvirt-daemon-1.2.17-13.el7_2.4.x86_64
libvirt-daemon-config-nwfilter-1.2.17-13.el7_2.4.x86_64
libvirt-daemon-driver-interface-1.2.17-13.el7_2.4.x86_64
libvirt-daemon-driver-nodedev-1.2.17-13.el7_2.4.x86_64
libvirt-daemon-driver-storage-1.2.17-13.el7_2.4.x86_64
libvirt-daemon-driver-qemu-1.2.17-13.el7_2.4.x86_64
libvirt-daemon-driver-nwfilter-1.2.17-13.el7_2.4.x86_64
libvirt-lock-sanlock-1.2.17-13.el7_2.4.x86_64
libvirt-daemon-driver-network-1.2.17-13.el7_2.4.x86_64
libvirt-daemon-kvm-1.2.17-13.el7_2.4.x86_64


How reproducible:


Steps to Reproduce:
1. Configure the HC machines.
2. Boot-storm 30 VMs.
3. Put the machine zod into maintenance.

Actual results:
There were 11 VMs running on the machine that was put into maintenance; only 5 of them migrated to another hypervisor and the rest failed to migrate.

I tried migrating them manually and that did not work either. Following are the tracebacks I see in the vdsm logs.

Thread-3817985::ERROR::2016-04-26 17:41:26,086::migration::309::virt.vm::(run) vmId=`2050bdfa-caea-49a3-bad4-f49964b29657`::Failed to migrate
Traceback (most recent call last):
  File "/usr/share/vdsm/virt/migration.py", line 297, in run
    self._startUnderlyingMigration(time.time())
  File "/usr/share/vdsm/virt/migration.py", line 363, in _startUnderlyingMigration
    self._perform_migration(duri, muri)
  File "/usr/share/vdsm/virt/migration.py", line 402, in _perform_migration
    self._vm._dom.migrateToURI3(duri, params, flags)
  File "/usr/share/vdsm/virt/virdomain.py", line 68, in f
    ret = attr(*args, **kwargs)
  File "/usr/lib/python2.7/site-packages/vdsm/libvirtconnection.py", line 124, in wrapper
    ret = f(*args, **kwargs)
  File "/usr/lib/python2.7/site-packages/vdsm/utils.py", line 1313, in wrapper
    return func(inst, *args, **kwargs)
  File "/usr/lib64/python2.7/site-packages/libvirt.py", line 1836, in migrateToURI3
    if ret == -1: raise libvirtError ('virDomainMigrateToURI3() failed', dom=self)
libvirtError: internal error: info migration reply was missing return status
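
For reference, this is roughly the call path that fails: vdsm's _perform_migration() hands off to libvirt-python's migrateToURI3(), which raises the libvirtError shown above. Below is a minimal standalone sketch of that call, not vdsm code; only the VM UUID is taken from this report, and the destination/transport URIs for tettnang are assumptions for illustration.

import libvirt

VM_UUID = '2050bdfa-caea-49a3-bad4-f49964b29657'   # VM from this report
DST_URI = 'qemu+tls://tettnang/system'             # assumed libvirt URI of the destination host
MIG_URI = 'tcp://tettnang'                         # assumed migration transport URI

conn = libvirt.open('qemu:///system')
dom = conn.lookupByUUIDString(VM_UUID)

params = {libvirt.VIR_MIGRATE_PARAM_URI: MIG_URI}
flags = libvirt.VIR_MIGRATE_LIVE | libvirt.VIR_MIGRATE_PEER2PEER

try:
    dom.migrateToURI3(DST_URI, params, flags)
except libvirt.libvirtError as e:
    # On the affected hosts this is where
    # "internal error: info migration reply was missing return status" surfaces.
    print('migration failed: %s' % e.get_error_message())
finally:
    conn.close()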


Expected results:
All the VMs should migrate successfully.

Additional info:

Comment 2 Sahina Bose 2016-04-27 15:20:09 UTC
From the logs on migration destination:
periodic/47::WARNING::2016-04-26 15:00:56,440::periodic::285::virt.vm::(__call__) vmId=`2050bdfa-caea-49a3-bad4-f49964b29657`::could not run on 2050bdfa-caea-49a3-bad4-f49964b29657: domain not connected



libvirtEventLoop::WARNING::2016-04-26 15:00:58,783::utils::140::root::(rmFile) File: /var/lib/libvirt/qemu/channels/2050bdfa-caea-49a3-bad4-f49964b29657.com.redhat.rhevm.vdsm already removed
jsonrpc.Executor/4::DEBUG::2016-04-26 15:00:58,773::__init__::503::jsonrpc.JsonRpcServer::(_serveRequest) Calling 'VM.destroy' in bridge with [u'2050bdfa-caea-49a3-bad4-f49964b29657']
jsonrpc.Executor/4::INFO::2016-04-26 15:00:58,788::API::341::vds::(destroy) vmContainerLock acquired by vm 2050bdfa-caea-49a3-bad4-f49964b29657
jsonrpc.Executor/4::DEBUG::2016-04-26 15:00:58,789::vm::3885::virt.vm::(destroy) vmId=`2050bdfa-caea-49a3-bad4-f49964b29657`::destroy Called
Thread-3732940::ERROR::2016-04-26 15:00:58,774::vm::753::virt.vm::(_startUnderlyingVm) vmId=`2050bdfa-caea-49a3-bad4-f49964b29657`::Failed to start a migration destination vm
Traceback (most recent call last):
  File "/usr/share/vdsm/virt/vm.py", line 722, in _startUnderlyingVm
    self._completeIncomingMigration()
  File "/usr/share/vdsm/virt/vm.py", line 2852, in _completeIncomingMigration
    self._incomingMigrationFinished.isSet(), usedTimeout)
  File "/usr/share/vdsm/virt/vm.py", line 2911, in _attachLibvirtDomainAfterMigration
    raise MigrationError(e.get_error_message())
MigrationError: Domain not found: no domain with matching uuid '2050bdfa-caea-49a3-bad4-f49964b29657'
Thread-3732940::INFO::2016-04-26 15:00:58,793::vm::1330::virt.vm::(setDownStatus) vmId=`2050bdfa-caea-49a3-bad4-f49964b29657`::Changed state to Down: VM failed to migrate (code=8)
Thread-3732940::DEBUG::2016-04-26 15:00:58,795::__init__::206::jsonrpc.Notification::(emit) Sending event {"params": {"2050bdfa-caea-49a3-bad4-f49964b29657": {"status": "Down", "timeOffset": "0", "exitReason": 8, "exitMessage": "VM failed to migrate", "exitCode": 1}, "notify_time": 5589338680}, "jsonrpc": "2.0", "method": "|virt|VM_status|2050bdfa-caea-49a3-bad4-f49964b29657"}
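
For context, the destination-side failure happens one step after the migration itself: once the incoming migration is expected to have finished, vdsm re-attaches to the libvirt domain by UUID, and it is this lookup that fails with "Domain not found", i.e. no domain with that UUID exists on the destination. A minimal sketch of that lookup follows; it is not vdsm code, and only the UUID is taken from the log above.

import libvirt

VM_UUID = '2050bdfa-caea-49a3-bad4-f49964b29657'

conn = libvirt.open('qemu:///system')
try:
    dom = conn.lookupByUUIDString(VM_UUID)
    print('domain attached, state: %s' % (dom.state(),))
except libvirt.libvirtError as e:
    # This is the error vdsm wraps into MigrationError in the traceback above:
    # "Domain not found: no domain with matching uuid '2050bdfa-...'"
    print('attach failed: %s' % e.get_error_message())
finally:
    conn.close()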

The sosreport tree output does have the image path - /rhev/data-center/mnt/glusterSD/sulphur.lab.eng.blr.redhat.com:_vmstore/297a9b9c-4396-4b30-8bfe-976a67d49a74/images/2c6601e9-456f-4638-9ce4-d98efd97c053/86999bdc-a7bd-4d1e-9faa-a8ba7cf531f4

Comment 3 Sahina Bose 2016-04-28 07:23:51 UTC
Martin, any ideas about this error? I'm afraid I don't have enough virt expertise to debug further.

Comment 4 Martin Sivák 2016-04-28 10:33:26 UTC
It seems the engine actually tried to migrate the VM, so it is not a scheduling issue. I do not know enough about the underlying libvirt logic; you need someone from the virt team for that (I do scheduling and QoS).

Comment 7 Michal Skrivanek 2016-05-04 11:12:39 UTC
This looks like a libvirt or qemu bug; the versions look the same on both sides (see e.g. qemu/linux_vm.log).
Moving to libvirt; feel free to push it down the stack.

Comment 11 Jiri Denemark 2016-09-12 13:49:53 UTC
The "internal error: info migration reply was missing return status" is a result of bug 1374613. Because of this bug and missing debug logs from libvirt it's impossible to diagnose the real cause of the migration failure. I'm closing this bug as a duplicate of 1374613. If the issue can be reproduced with a package with bug 1374613 fixed, please, file a new bug so that we can properly investigate the root cause.

*** This bug has been marked as a duplicate of bug 1374613 ***