Bug 1048573

Summary: Unexpected behaviour when kill qemu process on source host, while VMs migration.
Product: Red Hat Enterprise Virtualization Manager Reporter: Ilanit Stein <istein>
Component: ovirt-engineAssignee: Arik <ahadas>
Status: CLOSED ERRATA QA Contact: Ilanit Stein <istein>
Severity: medium Docs Contact:
Priority: low    
Version: 3.3.0CC: ahadas, bgraveno, fromani, iheim, istein, joallen, lpeer, mavital, michal.skrivanek, rbalakri, Rhev-m-bugs, yeylon
Target Milestone: ovirt-3.6.0-rcFlags: istein: testing_plan_complete+
Target Release: 3.6.0   
Hardware: Unspecified   
OS: Unspecified   
Whiteboard:
Fixed In Version: Doc Type: Bug Fix
Doc Text:
Enhanced audit log messages with additional information to enable users to easily resolve migration failure issues.
Story Points: ---
Clone Of: Environment:
Last Closed: 2016-03-09 20:34:53 UTC Type: Bug
Regression: --- Mount Type: ---
Documentation: --- CRM:
Verified Versions: Category: ---
oVirt Team: Virt RHEL 7.3 requirements from Atomic Host:
Cloudforms Team: --- Target Upstream Version:
Embargoed:
Attachments:
Description Flags
engine log (migration start at Jan 05, 11:05)
none
host_1 logs
none
host_2 logs none

Description Ilanit Stein 2014-01-05 10:59:52 UTC
Description of problem:
On a setup with 2 hosts, and 16 VMs, run on a host_1, 
migrate all VMs to host_2. While migration, kill -9 qemu process,
of 1 VM, shortly after migration started.
Problems found in results:
=========================
1. Inaccurate event : "Migration failed due to Error: Error creating the requested Desktop. Trying to migrate to another Host"
a. As failure is in qemu process running on source host, so Error creating th e requested Desktop is not true.
b. Try to migrate to another host, also seem irrelevant, as problem is not with destination host.

2. Beside the VM, which it's process killed, 2 more VMs, had failed the migration as well:
"Migration failed due to Error: Error creating the requested Desktop. Trying to migrate to another Host"

3. When trying to run the down VM (the one that qemu process killed),
got event error: "Failed to run VM...", and then started OK.
It is not clear why the error was given here.

Version-Release number of selected component (if applicable):
is30

How reproducible:
It didn't reproduce, in several trials I tried. Might be a matter of timing.

Expected results:
1. Only the VM, that it's process was killed should be affected.
2. Event Error messages, for the VM that it's qemu process killed, should be relevant.

Additional info:
In the logs, please see VM for which qemu process killed POOLTEST-10.
Also VMs that failed migration as well, POOLTEST-11 & POOLTEST-12.

Comment 1 Ilanit Stein 2014-01-05 11:01:55 UTC
Audit log for the VM, which his process killed (POOLTEST-10):
============================================================ 
2014-Jan-05, 11:21
VM POOLTEST-10 started on Host host19-rack06.scale.openstack.engineering.redhat.com
124336ae
oVirt
	
2014-Jan-05, 11:20
Failed to run VM POOLTEST-10 (User: admin@internal).
124336ae	
oVirt
	
2014-Jan-05, 11:06
VM POOLTEST-10 is down. Exit message: Lost connection with qemu process.
oVirt
	
2014-Jan-05, 11:06
Migration failed due to Error: Error creating the requested Desktop (VM: POOLTEST-10, Source: host19-rack06.scale.openstack.engineering.redhat.com, Destination: host20-rack06.scale.openstack.engineering.redhat.com).
21d96212
oVirt
	
2014-Jan-05, 11:06
Migration failed due to Error: Error creating the requested Desktop. Trying to migrate to another Host (VM: POOLTEST-10, Source: host19-rack06.scale.openstack.engineering.redhat.com, Destination: host20-rack06.scale.openstack.engineering.redhat.com).
21d96212
oVirt

Audit log for the 2 VM, failed migration as well (same message for VMs POOLTEST-11, POOLTEST-12):
================================================================================= 			
2014-Jan-05, 11:07	
Migration failed due to Error: Error creating the requested Desktop (VM: POOLTEST-11, Source: host19-rack06.scale.openstack.engineering.redhat.com, Destination: host20-rack06.scale.openstack.engineering.redhat.com).
46b59d1a
oVirt
	
2014-Jan-05, 11:07
Migration failed due to Error: Error creating the requested Desktop. Trying to migrate to another Host (VM: POOLTEST-11, Source: host19-rack06.scale.openstack.engineering.redhat.com, Destination: host20-rack06.scale.openstack.engineering.redhat.com).

Comment 2 Ilanit Stein 2014-01-05 11:06:49 UTC
Created attachment 845731 [details]
engine log (migration start at Jan 05, 11:05)

Comment 3 Ilanit Stein 2014-01-05 11:43:12 UTC
Created attachment 845735 [details]
host_1 logs

Time 2 hours behind rhevm. Migration in logs start @ Jan 5, 9:05

Comment 4 Ilanit Stein 2014-01-05 11:43:54 UTC
Created attachment 845736 [details]
host_2 logs

Time 2 hours behind rhevm. Migration in logs start @ Jan 5, 9:05

Comment 5 Michal Skrivanek 2014-01-22 13:47:55 UTC
Francesco, can you take a look at the logs? especially concerning the point 3.

Comment 6 Francesco Romani 2014-01-27 15:20:51 UTC
Yep, taking the bug

Comment 7 Francesco Romani 2014-01-28 10:42:19 UTC
First finding for the point 3 above.

engine.log reports a NullPointerException shortly after the vm creation command is issued

2014-01-05 11:20:46,945 INFO  [org.ovirt.engine.core.vdsbroker.vdsbroker.CreateVDSCommand] (pool-4-thread-46) [124336ae] FINISH, CreateVDSCommand, log id: a81c73e
2014-01-05 11:20:46,951 INFO  [org.ovirt.engine.core.vdsbroker.CreateVmVDSCommand] (pool-4-thread-46) [124336ae] FINISH, CreateVmVDSCommand, return: WaitForLaunch, log id: 2d75abdc
2014-01-05 11:20:46,952 INFO  [org.ovirt.engine.core.bll.RunVmCommand] (pool-4-thread-46) [124336ae] Lock freed to object EngineLock [exclusiveLocks= key: 7cbda807-f208-427a-a5a2-b3894fe62ffe value: VM
, sharedLocks= ]
2014-01-05 11:20:46,953 ERROR [org.ovirt.engine.core.bll.RunVmCommand] (pool-4-thread-46) [124336ae] Command org.ovirt.engine.core.bll.RunVmCommand throw exception: java.lang.NullPointerException
	at java.util.concurrent.ConcurrentHashMap.hash(ConcurrentHashMap.java:333) [rt.jar:1.7.0_45]
	at java.util.concurrent.ConcurrentHashMap.get(ConcurrentHashMap.java:988) [rt.jar:1.7.0_45]
	at org.ovirt.engine.core.vdsbroker.ResourceManager.GetVdsManager(ResourceManager.java:228) [vdsbroker.jar:]
	at org.ovirt.engine.core.bll.RunVmCommandBase.getMonitor(RunVmCommandBase.java:337) [bll.jar:]
	at org.ovirt.engine.core.bll.RunVmCommandBase.getBlockingQueue(RunVmCommandBase.java:327) [bll.jar:]
	at org.ovirt.engine.core.bll.RunVmCommandBase.decreasePendingVms(RunVmCommandBase.java:289) [bll.jar:]
	at org.ovirt.engine.core.bll.RunVmCommandBase.decreasePendingVms(RunVmCommandBase.java:284) [bll.jar:]
	at org.ovirt.engine.core.bll.MigrateVmCommand.reportCompleted(MigrateVmCommand.java:377) [bll.jar:]
	at org.ovirt.engine.core.bll.VDSBrokerFrontendImpl.RunAsyncVdsCommand(VDSBrokerFrontendImpl.java:53) [bll.jar:]
	at org.ovirt.engine.core.bll.RunVmCommand.createVm(RunVmCommand.java:473) [bll.jar:]
	at org.ovirt.engine.core.bll.RunVmCommand.runVm(RunVmCommand.java:234) [bll.jar:]
	at org.ovirt.engine.core.bll.RunVmCommand.executeVmCommand(RunVmCommand.java:304) [bll.jar:]
	at org.ovirt.engine.core.bll.VmCommand.executeCommand(VmCommand.java:87) [bll.jar:]
	at org.ovirt.engine.core.bll.CommandBase.executeWithoutTransaction(CommandBase.java:1134) [bll.jar:]
	at org.ovirt.engine.core.bll.CommandBase.executeActionInTransactionScope(CommandBase.java:1219) [bll.jar:]
	at org.ovirt.engine.core.bll.CommandBase.runInTransaction(CommandBase.java:1895) [bll.jar:]
	at org.ovirt.engine.core.utils.transaction.TransactionSupport.executeInSuppressed(TransactionSupport.java:174) [utils.jar:]
	at org.ovirt.engine.core.utils.transaction.TransactionSupport.executeInScope(TransactionSupport.java:116) [utils.jar:]
	at org.ovirt.engine.core.bll.CommandBase.execute(CommandBase.java:1239) [bll.jar:]
	at org.ovirt.engine.core.bll.CommandBase.executeAction(CommandBase.java:362) [bll.jar:]
	at org.ovirt.engine.core.bll.MultipleActionsRunner.executeValidatedCommand(MultipleActionsRunner.java:175) [bll.jar:]
	at org.ovirt.engine.core.bll.MultipleActionsRunner.RunCommands(MultipleActionsRunner.java:156) [bll.jar:]
	at org.ovirt.engine.core.bll.SortedMultipleActionsRunnerBase.RunCommands(SortedMultipleActionsRunnerBase.java:17) [bll.jar:]
	at org.ovirt.engine.core.bll.MultipleActionsRunner$1.run(MultipleActionsRunner.java:94) [bll.jar:]
	at org.ovirt.engine.core.utils.threadpool.ThreadPoolUtil$InternalWrapperRunnable.run(ThreadPoolUtil.java:71) [utils.jar:]
	at java.util.concurrent.Executors$RunnableAdapter.call(Executors.java:471) [rt.jar:1.7.0_45]
	at java.util.concurrent.FutureTask.run(FutureTask.java:262) [rt.jar:1.7.0_45]
	at java.util.concurrent.ThreadPoolExecutor.runWorker(ThreadPoolExecutor.java:1145) [rt.jar:1.7.0_45]
	at java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:615) [rt.jar:1.7.0_45]
	at java.lang.Thread.run(Thread.java:744) [rt.jar:1.7.0_45]


On VDSM side, the vm seems to be started OK:
libvirtEventLoop::DEBUG::2014-01-05 09:21:07,471::vm::4940::vm.Vm::(_onLibvirtLifecycleEvent) vmId=`7cbda807-f208-427a-a5a2-b3894fe62ffe`::event Started detail 0 opaque None

Comment 8 Michal Skrivanek 2014-01-28 11:31:55 UTC
thanks Francesco, this case has been recently fixed by Arik in bug 1048790

Comment 9 Francesco Romani 2014-01-28 11:35:15 UTC
Moving investigation to Point 2 above

Comment 10 Francesco Romani 2014-01-28 16:38:02 UTC
Preliminary findings about point 2:

Migration indeed failed.
host 1, source:

Thread-441::DEBUG::2014-01-05 09:05:37,843::BindingXMLRPC::984::vds::(wrapper) client [10.35.161.80]::call vmMigrate with ({'tunneled': 'false', 'src': '10.1.32.43', 'dst': '10.1.32.44:54321', 'vmId': '8200c2b7-fcc6-4348-983d-043eeb869dbf', 'abortOnError': 'true', 'method': 'online'},) {} flowID [46b59d1a]
Thread-441::DEBUG::2014-01-05 09:05:37,844::API::469::vds::(migrate) {'tunneled': 'false', 'src': '10.1.32.43', 'dst': '10.1.32.44:54321', 'vmId': '8200c2b7-fcc6-4348-983d-043eeb869dbf', 'abortOnError': 'true', 'method': 'online'}
Thread-441::DEBUG::2014-01-05 09:05:37,844::BindingXMLRPC::991::vds::(wrapper) return vmMigrate with {'status': {'message': 'Migration in progress', 'code': 0}, 'progress': 0}
Thread-442::DEBUG::2014-01-05 09:05:37,845::vm::181::vm.Vm::(_setupVdsConnection) vmId=`8200c2b7-fcc6-4348-983d-043eeb869dbf`::Destination server is: 10.1.32.44:54321
Thread-442::DEBUG::2014-01-05 09:05:37,846::vm::183::vm.Vm::(_setupVdsConnection) vmId=`8200c2b7-fcc6-4348-983d-043eeb869dbf`::Initiating connection with destination
Thread-442::DEBUG::2014-01-05 09:05:37,902::vm::233::vm.Vm::(_prepareGuest) vmId=`8200c2b7-fcc6-4348-983d-043eeb869dbf`::Migration started
Thread-442::DEBUG::2014-01-05 09:05:37,967::vm::315::vm.Vm::(run) vmId=`8200c2b7-fcc6-4348-983d-043eeb869dbf`::migration semaphore acquired

On host 2, destination:
Thread-59771::DEBUG::2014-01-05 09:05:36,944::BindingXMLRPC::984::vds::(wrapper) client [10.1.32.43]::call vmMigrationCreate with ({'username': 'Unknown', 'acpiEnable': 'true', 'emulatedMachine': 'rhel6.5.0', 'afterMigrationStatus': '', 'guestFQDN': '', 'vmId': '8200c2b7-fcc6-4348-983d-043eeb869dbf', 'memGuaranteedSize': 4096, 'spiceSslCipherSuite': 'DEFAULT', 'displaySecurePort': '5901', 'timeOffset': '0', 'cpuType': 'Penryn', 'smp': '1', 'migrationDest': 'libvirt', 'custom': {'device_4713f045-9b0a-4412-bf90-cb3c37f58f35': 'VmDevice {vmId=8200c2b7-fcc6-4348-983d-043eeb869dbf, deviceId=4713f045-9b0a-4412-bf90-cb3c37f58f35, device=ide, type=CONTROLLER, bootOrder=0, specParams={}, address={bus=0x00, domain=0x0000, type=pci, slot=0x01, function=0x1}, managed=false, plugged=true, readOnly=false, deviceAlias=ide0, customProperties={}, snapshotId=null}', 'device_4713f045-9b0a-4412-bf90-cb3c37f58f35device_eeddafff-9509-41f8-a56e-99f39a8cf0f6': 'VmDevice {vmId=8200c2b7-fcc6-4348-983d-043eeb869dbf, deviceId=eeddafff-9509-41f8-a56e-99f39a8cf0f6,  (omitted for brevity)

Thread-59771::DEBUG::2014-01-05 09:05:36,945::API::504::vds::(migrationCreate) Migration create
Thread-59771::INFO::2014-01-05 09:05:36,950::clientIF::391::vds::(createVm) vmContainerLock acquired by vm 8200c2b7-fcc6-4348-983d-043eeb869dbf
Thread-59771::DEBUG::2014-01-05 09:05:36,962::clientIF::404::vds::(createVm) Total desktops after creation of 8200c2b7-fcc6-4348-983d-043eeb869dbf is 2
Thread-59772::DEBUG::2014-01-05 09:05:36,962::vm::2115::vm.Vm::(_startUnderlyingVm) vmId=`8200c2b7-fcc6-4348-983d-043eeb869dbf`::Start
Thread-59771::DEBUG::2014-01-05 09:05:36,965::vm::5025::vm.Vm::(waitForMigrationDestinationPrepare) vmId=`8200c2b7-fcc6-4348-983d-043eeb869dbf`::migration destination: waiting for VM creation
Thread-59772::DEBUG::2014-01-05 09:05:36,967::vm::2119::vm.Vm::(_startUnderlyingVm) vmId=`8200c2b7-fcc6-4348-983d-043eeb869dbf`::_ongoingCreations acquired
Thread-59772::INFO::2014-01-05 09:05:36,969::vm::2969::vm.Vm::(_run) vmId=`8200c2b7-fcc6-4348-983d-043eeb869dbf`::VM wrapper has started
Thread-59771::DEBUG::2014-01-05 09:05:36,969::vm::5030::vm.Vm::(waitForMigrationDestinationPrepare) vmId=`8200c2b7-fcc6-4348-983d-043eeb869dbf`::migration destination: waiting 42s for path preparation

but then 
Thread-59771::DEBUG::2014-01-05 09:06:18,972::vm::5033::vm.Vm::(waitForMigrationDestinationPrepare) vmId=`8200c2b7-fcc6-4348-983d-043eeb869dbf`::Timeout while waiting for path preparation

consequently, on host1:
Thread-442::ERROR::2014-01-05 09:06:20,054::vm::239::vm.Vm::(_recover) vmId=`8200c2b7-fcc6-4348-983d-043eeb869dbf`::migration destination error: Error creating the requested VM
Thread-442::ERROR::2014-01-05 09:06:45,811::vm::338::vm.Vm::(run) vmId=`8200c2b7-fcc6-4348-983d-043eeb869dbf`::Failed to migrate (stack trace snipped)


and then, on engine:
2014-01-05 11:06:59,871 INFO  [org.ovirt.engine.core.vdsbroker.VdsUpdateRunTimeInfo] (DefaultQuartzScheduler_Worker-22) VM POOLTEST-11 8200c2b7-fcc6-4348-983d-043eeb869dbf moved from MigratingFrom --> Up
2014-01-05 11:06:59,871 INFO  [org.ovirt.engine.core.vdsbroker.VdsUpdateRunTimeInfo] (DefaultQuartzScheduler_Worker-22) Adding VM 8200c2b7-fcc6-4348-983d-043eeb869dbf to re-run list
2014-01-05 11:06:59,878 ERROR [org.ovirt.engine.core.vdsbroker.VdsUpdateRunTimeInfo] (DefaultQuartzScheduler_Worker-22) Rerun vm 31556a01-bf16-4c82-ae6c-993441440a22. Called from vds host19-rack06.scale.openstack.engineering.redhat.com
2014-01-05 11:06:59,879 ERROR [org.ovirt.engine.core.vdsbroker.VdsUpdateRunTimeInfo] (DefaultQuartzScheduler_Worker-22) Rerun vm 8200c2b7-fcc6-4348-983d-043eeb869dbf. Called from vds host19-rack06.scale.openstack.engineering.redhat.com

Comment 11 Francesco Romani 2014-01-29 16:25:01 UTC
libvirt logs start at
host1: 2014-01-05 10:40:02.181+0000
host2: 2014-01-05 09:47:28.558+0000:

so the time window 09:05 - 09:07 is missing. Can you please provide them?

Comment 12 Ilanit Stein 2014-01-30 07:56:16 UTC
I'm afraid logs are not there any more.

Comment 13 Francesco Romani 2014-03-27 16:10:52 UTC
Sorry for late update.

Turns out the timeout expired because a call to getVolumeSize, which is made in the VM creation flow, took too long to complete:

Thread-59772::DEBUG::2014-01-05 09:05:36,962::vm::2115::vm.Vm::(_startUnderlyingVm) vmId=`8200c2b7-fcc6-4348-983d-043eeb869dbf`::Start
Thread-59772::DEBUG::2014-01-05 09:05:36,967::vm::2119::vm.Vm::(_startUnderlyingVm) vmId=`8200c2b7-fcc6-4348-983d-043eeb869dbf`::_ongoingCreations acquired
Thread-59772::INFO::2014-01-05 09:05:36,969::vm::2969::vm.Vm::(_run) vmId=`8200c2b7-fcc6-4348-983d-043eeb869dbf`::VM wrapper has started
Thread-59772::WARNING::2014-01-05 09:05:36,972::vm::1902::vm.Vm::(getConfDevices) vmId=`8200c2b7-fcc6-4348-983d-043eeb869dbf`::Unknown type found, device: '{'device': 'unix', 'alias': 'channel0', 'type': 'channel', 'address': {'bus': '0', 'controller': '0', 'type': 'virtio-serial', 'port': '1'}}' found
Thread-59772::WARNING::2014-01-05 09:05:36,973::vm::1902::vm.Vm::(getConfDevices) vmId=`8200c2b7-fcc6-4348-983d-043eeb869dbf`::Unknown type found, device: '{'device': 'unix', 'alias': 'channel1', 'type': 'channel', 'address': {'bus': '0', 'controller': '0', 'type': 'virtio-serial', 'port': '2'}}' found
Thread-59772::WARNING::2014-01-05 09:05:36,973::vm::1902::vm.Vm::(getConfDevices) vmId=`8200c2b7-fcc6-4348-983d-043eeb869dbf`::Unknown type found, device: '{'device': 'spicevmc', 'alias': 'channel2', 'type': 'channel', 'address': {'bus': '0', 'controller': '0', 'type': 'virtio-serial', 'port': '3'}}' found
Thread-59772::DEBUG::2014-01-05 09:05:36,974::task::579::TaskManager.Task::(_updateState) Task=`effa6e6e-ed54-4b32-a271-f9d04cfa2f9b`::moving from state init -> state preparing
Thread-59772::INFO::2014-01-05 09:05:36,975::logUtils::44::dispatcher::(wrapper) Run and protect: getVolumeSize(sdUUID='7ac99a5a-1542-4757-b7fc-c43ac2caf8f4', spUUID='a46b1470-b901-4535-a308-4438c135d6c3', imgUUID='06f55c72-43ce-4959-931f-0a93626d0f07', volUUID='d7827c2a-4a62-41e0-bae1-ff4d10e7c466', options=None)
Thread-59772::INFO::2014-01-05 09:06:27,689::logUtils::47::dispatcher::(wrapper) Run and protect: getVolumeSize, Return response: {'truesize': '337831936', 'apparentsize': '337182720'}

the call took roughly 48s to complete while the timeout was set to 42s, and this explain the failure.

The only thing we can improve here is reporting to the UI.

Comment 14 Francesco Romani 2014-04-09 14:50:15 UTC
On VDSM side, reporting the exitReason should provide enough informatio to the engine to properly fix the error reporting.
In particular I think that http://gerrit.ovirt.org/#/c/25164/ should also fix this issue.

However, *not* moving on POST because a patch on engine side is needed as well to leverage the new information.

Comment 16 Michal Skrivanek 2014-08-28 08:09:43 UTC
reassigning to add the engine-side message

Comment 17 Arik 2015-09-16 13:49:33 UTC
Seems like it was already solved. On 3.6 master the following audit log is shown (note that on master the 'Destination' part in this message is removed):
Migration failed: Lost connection with qemu process (VM: mmm, Source: bamba).

Comment 18 Ilanit Stein 2015-10-18 09:22:50 UTC
Moving to verified, as this bug is a result of race condition.
We have the migration tests run on automation constantly, and such a problem was not seen recently.

Comment 20 errata-xmlrpc 2016-03-09 20:34:53 UTC
Since the problem described in this bug report should be
resolved in a recent advisory, it has been closed with a
resolution of ERRATA.

For information on the advisory, and where to find the updated
files, follow the link below.

If the solution does not work for you, open a new bug report.

https://rhn.redhat.com/errata/RHEA-2016-0376.html