Bug 965172

Summary: "Internal Engine Error" after ongoing live migration times out.
Product: Red Hat Enterprise Linux 6 Reporter: Julio Entrena Perez <jentrena>
Component: libvirtAssignee: Martin Kletzander <mkletzan>
Status: CLOSED DUPLICATE QA Contact: Virtualization Bugs <virt-bugs>
Severity: medium Docs Contact:
Priority: medium    
Version: 6.4CC: acathrow, dallan, dyuan, hartsjc, hateya, iheim, jdenemar, jentrena, jkt, lpeer, lyarwood, michal.skrivanek, mzhan, Rhev-m-bugs, sputhenp, weizhan, yeylon, zhpeng, zpeng
Target Milestone: rc   
Target Release: ---   
Hardware: All   
OS: Linux   
Whiteboard: virt
Fixed In Version: Doc Type: Bug Fix
Doc Text:
Story Points: ---
Clone Of: Environment:
Last Closed: 2013-08-15 14:05:18 UTC Type: Bug
Regression: --- Mount Type: ---
Documentation: --- CRM:
Verified Versions: Category: ---
oVirt Team: --- RHEL 7.3 requirements from Atomic Host:
Cloudforms Team: --- Target Upstream Version:
Embargoed:

Description Julio Entrena Perez 2013-05-20 15:23:58 UTC
Description of problem:
"Internal Engine Error" is reported by destination host after an ongoing live migration times out ('migration_timeout' reached).
Webadmin portal presents user with unhelpful error message which doesn't provide any hint of what did go wrong.

Version-Release number of selected component (if applicable):
vdsm-4.10.2-1.13.el6ev
rhevm-3.1.0-53.el6ev

How reproducible:
Unknown, it happened once so far.

Steps to Reproduce:
1. Start a live migration which won't complete within the default 'migration_timeout' of 300 seconds. 
2. Wait until those 5 minutes have elapsed.

Actual results:
Migration failed due to Error: Internal Engine Error (VM: vm_hostname, Source Host: host_hostname).


Expected results:
A more descriptive error message, e.g. "Migration failed to complete within default timeout of $migration_timeout seconds".


Additional info:

- Source host:

[...]
Thread-61770::INFO::2013-05-17 12:33:12,750::libvirtvm::413::vm.Vm::(run) vmId=`d6446340-b00a-4068-8778-2227f89776fd`::Migration Progress: 292 seconds elapsed, 50% of data processed, 50% of mem processed
Thread-61770::INFO::2013-05-17 12:33:26,731::libvirtvm::413::vm.Vm::(run) vmId=`d6446340-b00a-4068-8778-2227f89776fd`::Migration Progress: 306 seconds elapsed, 51% of data processed, 51% of mem processed
Thread-61769::DEBUG::2013-05-17 12:33:26,302::libvirtvm::351::vm.Vm::(run) vmId=`d6446340-b00a-4068-8778-2227f89776fd`::setting migration downtime to 250
Thread-61770::WARNING::2013-05-17 12:33:38,616::libvirtvm::399::vm.Vm::(run) vmId=`d6446340-b00a-4068-8778-2227f89776fd`::Migration is stuck: Hasn't progressed in 307.269540071 seconds. Aborting.
Thread-61770::DEBUG::2013-05-17 12:33:42,844::libvirtvm::416::vm.Vm::(stop) vmId=`d6446340-b00a-4068-8778-2227f89776fd`::stopping migration monitor thread
libvirtEventLoop::DEBUG::2013-05-17 12:33:51,554::libvirtvm::2808::vm.Vm::(_onLibvirtLifecycleEvent) vmId=`d6446340-b00a-4068-8778-2227f89776fd`::event Resumed detail 1 opaque None
Thread-61766::DEBUG::2013-05-17 12:33:51,577::libvirtvm::357::vm.Vm::(cancel) vmId=`d6446340-b00a-4068-8778-2227f89776fd`::canceling migration downtime thread
Thread-61766::DEBUG::2013-05-17 12:33:51,579::libvirtvm::416::vm.Vm::(stop) vmId=`d6446340-b00a-4068-8778-2227f89776fd`::stopping migration monitor thread
Thread-61769::DEBUG::2013-05-17 12:33:51,579::libvirtvm::354::vm.Vm::(run) vmId=`d6446340-b00a-4068-8778-2227f89776fd`::migration downtime thread exiting
Thread-61766::ERROR::2013-05-17 12:33:51,581::vm::195::vm.Vm::(_recover) vmId=`d6446340-b00a-4068-8778-2227f89776fd`::operation aborted: migration job: canceled by client
Thread-61766::ERROR::2013-05-17 12:33:58,758::vm::283::vm.Vm::(run) vmId=`d6446340-b00a-4068-8778-2227f89776fd`::Failed to migrate
[...]
Thread-61836::DEBUG::2013-05-17 12:34:00,605::BindingXMLRPC::903::vds::(wrapper) client [2.3.4.5]::call vmGetMigrationStatus with ('d6446340-b00a-4068-8778-2227f89776fd',) {}
Thread-61836::DEBUG::2013-05-17 12:34:00,605::BindingXMLRPC::910::vds::(wrapper) return vmGetMigrationStatus with {'status': {'message': 'Migration canceled', 'code': {'status': {'message': 'Migration not in progress', 'code': 47}}}}


- Destination host:

Thread-50523::ERROR::2013-05-17 12:33:51,488::utils::416::vm.Vm::(collect) vmId=`d6446340-b00a-4068-8778-2227f89776fd`::Stats function failed: <AdvancedStatsFunction _highWrite at 0x21d5d50>
Traceback (most recent call last):
  File "/usr/lib64/python2.6/site-packages/vdsm/utils.py", line 412, in collect
    statsFunction()
  File "/usr/lib64/python2.6/site-packages/vdsm/utils.py", line 287, in __call__
    retValue = self._function(*args, **kwargs)
  File "/usr/share/vdsm/libvirtvm.py", line 109, in _highWrite
    self._vm._dom.blockInfo(vmDrive.path, 0)
  File "/usr/share/vdsm/libvirtvm.py", line 525, in f
    raise toe
TimeoutError: Timed out during operation: cannot acquire state change lock
[...]
libvirtEventLoop::DEBUG::2013-05-17 12:33:51,509::libvirtvm::2808::vm.Vm::(_onLibvirtLifecycleEvent) vmId=`d6446340-b00a-4068-8778-2227f89776fd`::event Stopped detail 5 opaque None
Thread-50523::DEBUG::2013-05-17 12:33:51,509::task::957::TaskManager.Task::(_decref) Task=`579740f0-17e8-4ac9-8605-a078bc168489`::ref 0 aborting False
libvirtEventLoop::INFO::2013-05-17 12:33:51,512::vm::713::vm.Vm::(_onQemuDeath) vmId=`d6446340-b00a-4068-8778-2227f89776fd`::underlying process disconnected
libvirtEventLoop::INFO::2013-05-17 12:33:51,516::libvirtvm::2374::vm.Vm::(releaseVm) vmId=`d6446340-b00a-4068-8778-2227f89776fd`::Release VM resources
[...]
Thread-50546::DEBUG::2013-05-17 12:33:51,733::BindingXMLRPC::903::vds::(wrapper) client [1.2.3.4]::call vmDestroy with ('d6446340-b00a-4068-8778-2227f89776fd',) {}
[...]
Thread-50523::ERROR::2013-05-17 12:33:51,683::libvirtconnection::106::vds::(wrapper) Unknown libvirterror: ecode: 1 edom: 10
Traceback (most recent call last):
  File "/usr/lib64/python2.6/site-packages/vdsm/libvirtconnection.py", line 83, in wrapper
    ret = f(*args, **kwargs)
  File "/usr/lib64/python2.6/site-packages/libvirt.py", line 1849, in getCPUStats
    if ret is None: raise libvirtError ('virDomainGetCPUStats() failed', dom=self)
libvirtError: internal error No such domain <D6>Dc@<B0>
@h<87>x"'<F8><97>v<FD>
[...]
Thread-50546::INFO::2013-05-17 12:33:51,775::API::308::vds::(destroy) vmContainerLock acquired by vm d6446340-b00a-4068-8778-2227f89776fd
[...]
Thread-50546::DEBUG::2013-05-17 12:33:52,066::libvirtvm::2425::vm.Vm::(destroy) vmId=`d6446340-b00a-4068-8778-2227f89776fd`::destroy Called
Thread-50523::ERROR::2013-05-17 12:33:51,792::utils::416::vm.Vm::(collect) vmId=`d6446340-b00a-4068-8778-2227f89776fd`::Stats function failed: <AdvancedStatsFunction _sampleCpu at 0x21d5e30>
Traceback (most recent call last):
  File "/usr/lib64/python2.6/site-packages/vdsm/utils.py", line 412, in collect
    statsFunction()
  File "/usr/lib64/python2.6/site-packages/vdsm/utils.py", line 287, in __call__
    retValue = self._function(*args, **kwargs)
  File "/usr/share/vdsm/libvirtvm.py", line 137, in _sampleCpu
    cpuStats = self._vm._dom.getCPUStats(True, 0)
  File "/usr/share/vdsm/libvirtvm.py", line 517, in f
    ret = attr(*args, **kwargs)
  File "/usr/lib64/python2.6/site-packages/vdsm/libvirtconnection.py", line 83, in wrapper
    ret = f(*args, **kwargs)
  File "/usr/lib64/python2.6/site-packages/libvirt.py", line 1849, in getCPUStats
    if ret is None: raise libvirtError ('virDomainGetCPUStats() failed', dom=self)
libvirtError: internal error No such domain <D6>Dc@<B0>
@h<87>x"'<F8><97>v<FD>
[...]
Thread-50523::ERROR::2013-05-17 12:33:52,135::libvirtconnection::106::vds::(wrapper) Unknown libvirterror: ecode: 42 edom: 10
Traceback (most recent call last):
  File "/usr/lib64/python2.6/site-packages/vdsm/libvirtconnection.py", line 83, in wrapper
    ret = f(*args, **kwargs)
  File "/usr/lib64/python2.6/site-packages/libvirt.py", line 1820, in blockStats
    if ret is None: raise libvirtError ('virDomainBlockStats() failed', dom=self)
libvirtError: Domain not found: no domain with matching uuid 'd6446340-b00a-4068-8778-2227f89776fd'
[...]
Thread-50523::ERROR::2013-05-17 12:33:52,239::libvirtconnection::106::vds::(wrapper) Unknown libvirterror: ecode: 42 edom: 10
Traceback (most recent call last):
  File "/usr/lib64/python2.6/site-packages/vdsm/libvirtconnection.py", line 83, in wrapper
    ret = f(*args, **kwargs)
  File "/usr/lib64/python2.6/site-packages/libvirt.py", line 1826, in blockStatsFlags
    if ret is None: raise libvirtError ('virDomainBlockStatsFlags() failed', dom=self)
libvirtError: Domain not found: no domain with matching uuid 'd6446340-b00a-4068-8778-2227f89776fd'
Thread-50523::ERROR::2013-05-17 12:33:52,286::libvirtconnection::106::vds::(wrapper) Unknown libvirterror: ecode: 42 edom: 10
Traceback (most recent call last):
  File "/usr/lib64/python2.6/site-packages/vdsm/libvirtconnection.py", line 83, in wrapper
    ret = f(*args, **kwargs)
  File "/usr/lib64/python2.6/site-packages/libvirt.py", line 1867, in interfaceStats
    if ret is None: raise libvirtError ('virDomainInterfaceStats() failed', dom=self)
libvirtError: Domain not found: no domain with matching uuid 'd6446340-b00a-4068-8778-2227f89776fd'
libvirtEventLoop::ERROR::2013-05-17 12:33:52,364::libvirtconnection::106::vds::(wrapper) Unknown libvirterror: ecode: 42 edom: 10
Traceback (most recent call last):
  File "/usr/lib64/python2.6/site-packages/vdsm/libvirtconnection.py", line 83, in wrapper
    ret = f(*args, **kwargs)
  File "/usr/lib64/python2.6/site-packages/libvirt.py", line 381, in XMLDesc
    if ret is None: raise libvirtError ('virDomainGetXMLDesc() failed', dom=self)
libvirtError: Domain not found: no domain with matching uuid 'd6446340-b00a-4068-8778-2227f89776fd'
libvirtEventLoop::DEBUG::2013-05-17 12:33:52,380::utils::353::vm.Vm::(stop) vmId=`d6446340-b00a-4068-8778-2227f89776fd`::Stop statistics collection
libvirtEventLoop::DEBUG::2013-05-17 12:33:52,394::vmChannels::208::vds::(unregister) Delete fileno 89 from listener.
Thread-50523::DEBUG::2013-05-17 12:33:52,394::utils::384::vm.Vm::(run) vmId=`d6446340-b00a-4068-8778-2227f89776fd`::Stats thread finished
[...]
libvirtEventLoop::ERROR::2013-05-17 12:33:52,420::libvirtconnection::106::vds::(wrapper) Unknown libvirterror: ecode: 42 edom: 10
Traceback (most recent call last):
  File "/usr/lib64/python2.6/site-packages/vdsm/libvirtconnection.py", line 83, in wrapper
    ret = f(*args, **kwargs)
  File "/usr/lib64/python2.6/site-packages/libvirt.py", line 755, in destroyFlags
    if ret == -1: raise libvirtError ('virDomainDestroyFlags() failed', dom=self)
libvirtError: Domain not found: no domain with matching uuid 'd6446340-b00a-4068-8778-2227f89776fd'
[...]
Thread-50546::DEBUG::2013-05-17 12:33:52,625::libvirtvm::2419::vm.Vm::(deleteVm) vmId=`d6446340-b00a-4068-8778-2227f89776fd`::Total desktops after destroy of d6446340-b00a-4068-8778-2227f89776fd is 75
Thread-50546::DEBUG::2013-05-17 12:33:52,632::BindingXMLRPC::910::vds::(wrapper) return vmDestroy with {'status': {'message': 'Machine destroyed', 'code': 0}}
libvirtEventLoop::ERROR::2013-05-17 12:33:52,636::libvirtconnection::106::vds::(wrapper) Unknown libvirterror: ecode: 42 edom: 10
Traceback (most recent call last):
  File "/usr/lib64/python2.6/site-packages/vdsm/libvirtconnection.py", line 83, in wrapper
    ret = f(*args, **kwargs)
  File "/usr/lib64/python2.6/site-packages/libvirt.py", line 381, in XMLDesc
    if ret is None: raise libvirtError ('virDomainGetXMLDesc() failed', dom=self)
libvirtError: Domain not found: no domain with matching uuid 'd6446340-b00a-4068-8778-2227f89776fd'
libvirtEventLoop::DEBUG::2013-05-17 12:33:52,657::vm::1027::vm.Vm::(setDownStatus) vmId=`d6446340-b00a-4068-8778-2227f89776fd`::Changed state to Down: Admin shut down
libvirtEventLoop::DEBUG::2013-05-17 12:33:52,664::utils::353::vm.Vm::(stop) vmId=`d6446340-b00a-4068-8778-2227f89776fd`::Stop statistics collection
libvirtEventLoop::ERROR::2013-05-17 12:33:52,681::libvirtconnection::106::vds::(wrapper) Unknown libvirterror: ecode: 42 edom: 10
Traceback (most recent call last):
  File "/usr/lib64/python2.6/site-packages/vdsm/libvirtconnection.py", line 83, in wrapper
    ret = f(*args, **kwargs)
  File "/usr/lib64/python2.6/site-packages/libvirt.py", line 381, in XMLDesc
    if ret is None: raise libvirtError ('virDomainGetXMLDesc() failed', dom=self)
libvirtError: Domain not found: no domain with matching uuid 'd6446340-b00a-4068-8778-2227f89776fd'


- Manager:

2013-05-17 12:34:00,521 INFO  [org.ovirt.engine.core.vdsbroker.VdsUpdateRunTimeInfo] (QuartzScheduler_Worker-18) VM vm_hostname d6446340-b00a-4068-8778-2227f89776fd moved from MigratingFrom --> Up
2013-05-17 12:34:00,521 INFO  [org.ovirt.engine.core.vdsbroker.VdsUpdateRunTimeInfo] (QuartzScheduler_Worker-18) adding VM d6446340-b00a-4068-8778-2227f89776fd to re-run list
2013-05-17 12:34:00,522 ERROR [org.ovirt.engine.core.vdsbroker.VdsUpdateRunTimeInfo] (QuartzScheduler_Worker-18) Rerun vm d6446340-b00a-4068-8778-2227f89776fd. Called from vds src_hostname
2013-05-17 12:34:00,569 INFO  [org.ovirt.engine.core.vdsbroker.vdsbroker.MigrateStatusVDSCommand] (pool-3-thread-49) START, MigrateStatusVDSCommand(HostName = src_hostname
, HostId = 1a62f776-695e-11e2-a97a-fb8bf5530f36, vmId=d6446340-b00a-4068-8778-2227f89776fd), log id: 3b3e8edd
2013-05-17 12:34:00,607 ERROR [org.ovirt.engine.core.vdsbroker.vdsbroker.VdsBrokerCommand] (pool-3-thread-49) Failed in MigrateStatusVDS method, for vds: src_hostname
; host: 1.2.3.4
2013-05-17 12:34:00,607 ERROR [org.ovirt.engine.core.vdsbroker.VDSCommandBase] (pool-3-thread-49) Command MigrateStatusVDS execution failed. Exception: ClassCastException: java.util.HashMap cannot be cast to java.lang.Integer
2013-05-17 12:34:00,607 INFO  [org.ovirt.engine.core.vdsbroker.vdsbroker.MigrateStatusVDSCommand] (pool-3-thread-49) FINISH, MigrateStatusVDSCommand, log id: 3b3e8edd
[...]
2013-05-17 12:34:00,781 WARN  [org.ovirt.engine.core.bll.MigrateVmCommand] (pool-3-thread-49) CanDoAction of action MigrateVm failed. Reasons:ACTION_TYPE_FAILED_VDS_VM_CLUSTER,VAR__ACTION__MIGRATE,VAR__TYPE__VM


Uncompressed sosreports of both hosts, the manager and the guest are also available at http://jentrena-xw8600.usersys.redhat.com/00846482/ .

Comment 3 Michal Skrivanek 2013-05-27 12:22:46 UTC
errors on destination host points to some internal libvirt problem. please take a look...

Comment 8 Martin Kletzander 2013-08-14 14:39:59 UTC
It is very hard to find the cause in these logs.  What is the name of the machine that didn't migrate?  Grepping all the errors and looking at the logs shown in this bz I can't find anything unusual on the bug.  Can you point me in a right direction?

Comment 9 Martin Kletzander 2013-08-15 14:05:18 UTC
As mentioned in comment #6, this is already tracked in bug 970645.  After discussion with Michal and going through the logs I've found no other problem beyond the mentioned one, so I'm closing it as DUPLICATE.

*** This bug has been marked as a duplicate of bug 970645 ***