Description of problem:
On a setup with 2 hosts and 16 VMs running on host_1, migrate all VMs to host_2. Shortly after the migration starts, kill -9 the qemu process of one VM.

Problems found in the results:
=========================
1. Inaccurate event: "Migration failed due to Error: Error creating the requested Desktop. Trying to migrate to another Host"
   a. The failure is in the qemu process running on the source host, so "Error creating the requested Desktop" is not true.
   b. "Trying to migrate to another Host" also seems irrelevant, as the problem is not with the destination host.
2. Besides the VM whose process was killed, 2 more VMs failed the migration as well:
   "Migration failed due to Error: Error creating the requested Desktop. Trying to migrate to another Host"
3. When trying to run the down VM (the one whose qemu process was killed), an error event was raised: "Failed to run VM...", and then the VM started OK. It is not clear why the error was given here.

Version-Release number of selected component (if applicable):
is30

How reproducible:
It did not reproduce in the several trials I tried. It might be a matter of timing.

Expected results:
1. Only the VM whose process was killed should be affected.
2. The event error messages for the VM whose qemu process was killed should be relevant.

Additional info:
In the logs, please see the VM whose qemu process was killed, POOLTEST-10, and also the VMs that failed migration as well, POOLTEST-11 & POOLTEST-12.
Audit log for the VM whose process was killed (POOLTEST-10):
============================================================
2014-Jan-05, 11:21  VM POOLTEST-10 started on Host host19-rack06.scale.openstack.engineering.redhat.com  124336ae  oVirt
2014-Jan-05, 11:20  Failed to run VM POOLTEST-10 (User: admin@internal).  124336ae  oVirt
2014-Jan-05, 11:06  VM POOLTEST-10 is down. Exit message: Lost connection with qemu process.  oVirt
2014-Jan-05, 11:06  Migration failed due to Error: Error creating the requested Desktop (VM: POOLTEST-10, Source: host19-rack06.scale.openstack.engineering.redhat.com, Destination: host20-rack06.scale.openstack.engineering.redhat.com).  21d96212  oVirt
2014-Jan-05, 11:06  Migration failed due to Error: Error creating the requested Desktop. Trying to migrate to another Host (VM: POOLTEST-10, Source: host19-rack06.scale.openstack.engineering.redhat.com, Destination: host20-rack06.scale.openstack.engineering.redhat.com).  21d96212  oVirt

Audit log for the 2 VMs that failed migration as well (same message for VMs POOLTEST-11, POOLTEST-12):
=================================================================================
2014-Jan-05, 11:07  Migration failed due to Error: Error creating the requested Desktop (VM: POOLTEST-11, Source: host19-rack06.scale.openstack.engineering.redhat.com, Destination: host20-rack06.scale.openstack.engineering.redhat.com).  46b59d1a  oVirt
2014-Jan-05, 11:07  Migration failed due to Error: Error creating the requested Desktop. Trying to migrate to another Host (VM: POOLTEST-11, Source: host19-rack06.scale.openstack.engineering.redhat.com, Destination: host20-rack06.scale.openstack.engineering.redhat.com).
Created attachment 845731 [details]
engine log (migration starts at Jan 05, 11:05)
Created attachment 845735 [details]
host_1 logs

The host time is 2 hours behind the engine (rhevm); in these logs the migration starts at Jan 5, 9:05.
Created attachment 845736 [details]
host_2 logs

The host time is 2 hours behind the engine (rhevm); in these logs the migration starts at Jan 5, 9:05.
Francesco, can you take a look at the logs? Especially concerning point 3.
Yep, taking the bug
First finding for point 3 above: engine.log reports a NullPointerException shortly after the VM creation command is issued.

2014-01-05 11:20:46,945 INFO  [org.ovirt.engine.core.vdsbroker.vdsbroker.CreateVDSCommand] (pool-4-thread-46) [124336ae] FINISH, CreateVDSCommand, log id: a81c73e
2014-01-05 11:20:46,951 INFO  [org.ovirt.engine.core.vdsbroker.CreateVmVDSCommand] (pool-4-thread-46) [124336ae] FINISH, CreateVmVDSCommand, return: WaitForLaunch, log id: 2d75abdc
2014-01-05 11:20:46,952 INFO  [org.ovirt.engine.core.bll.RunVmCommand] (pool-4-thread-46) [124336ae] Lock freed to object EngineLock [exclusiveLocks= key: 7cbda807-f208-427a-a5a2-b3894fe62ffe value: VM , sharedLocks= ]
2014-01-05 11:20:46,953 ERROR [org.ovirt.engine.core.bll.RunVmCommand] (pool-4-thread-46) [124336ae] Command org.ovirt.engine.core.bll.RunVmCommand throw exception: java.lang.NullPointerException
    at java.util.concurrent.ConcurrentHashMap.hash(ConcurrentHashMap.java:333) [rt.jar:1.7.0_45]
    at java.util.concurrent.ConcurrentHashMap.get(ConcurrentHashMap.java:988) [rt.jar:1.7.0_45]
    at org.ovirt.engine.core.vdsbroker.ResourceManager.GetVdsManager(ResourceManager.java:228) [vdsbroker.jar:]
    at org.ovirt.engine.core.bll.RunVmCommandBase.getMonitor(RunVmCommandBase.java:337) [bll.jar:]
    at org.ovirt.engine.core.bll.RunVmCommandBase.getBlockingQueue(RunVmCommandBase.java:327) [bll.jar:]
    at org.ovirt.engine.core.bll.RunVmCommandBase.decreasePendingVms(RunVmCommandBase.java:289) [bll.jar:]
    at org.ovirt.engine.core.bll.RunVmCommandBase.decreasePendingVms(RunVmCommandBase.java:284) [bll.jar:]
    at org.ovirt.engine.core.bll.MigrateVmCommand.reportCompleted(MigrateVmCommand.java:377) [bll.jar:]
    at org.ovirt.engine.core.bll.VDSBrokerFrontendImpl.RunAsyncVdsCommand(VDSBrokerFrontendImpl.java:53) [bll.jar:]
    at org.ovirt.engine.core.bll.RunVmCommand.createVm(RunVmCommand.java:473) [bll.jar:]
    at org.ovirt.engine.core.bll.RunVmCommand.runVm(RunVmCommand.java:234) [bll.jar:]
    at org.ovirt.engine.core.bll.RunVmCommand.executeVmCommand(RunVmCommand.java:304) [bll.jar:]
    at org.ovirt.engine.core.bll.VmCommand.executeCommand(VmCommand.java:87) [bll.jar:]
    at org.ovirt.engine.core.bll.CommandBase.executeWithoutTransaction(CommandBase.java:1134) [bll.jar:]
    at org.ovirt.engine.core.bll.CommandBase.executeActionInTransactionScope(CommandBase.java:1219) [bll.jar:]
    at org.ovirt.engine.core.bll.CommandBase.runInTransaction(CommandBase.java:1895) [bll.jar:]
    at org.ovirt.engine.core.utils.transaction.TransactionSupport.executeInSuppressed(TransactionSupport.java:174) [utils.jar:]
    at org.ovirt.engine.core.utils.transaction.TransactionSupport.executeInScope(TransactionSupport.java:116) [utils.jar:]
    at org.ovirt.engine.core.bll.CommandBase.execute(CommandBase.java:1239) [bll.jar:]
    at org.ovirt.engine.core.bll.CommandBase.executeAction(CommandBase.java:362) [bll.jar:]
    at org.ovirt.engine.core.bll.MultipleActionsRunner.executeValidatedCommand(MultipleActionsRunner.java:175) [bll.jar:]
    at org.ovirt.engine.core.bll.MultipleActionsRunner.RunCommands(MultipleActionsRunner.java:156) [bll.jar:]
    at org.ovirt.engine.core.bll.SortedMultipleActionsRunnerBase.RunCommands(SortedMultipleActionsRunnerBase.java:17) [bll.jar:]
    at org.ovirt.engine.core.bll.MultipleActionsRunner$1.run(MultipleActionsRunner.java:94) [bll.jar:]
    at org.ovirt.engine.core.utils.threadpool.ThreadPoolUtil$InternalWrapperRunnable.run(ThreadPoolUtil.java:71) [utils.jar:]
    at java.util.concurrent.Executors$RunnableAdapter.call(Executors.java:471) [rt.jar:1.7.0_45]
    at java.util.concurrent.FutureTask.run(FutureTask.java:262) [rt.jar:1.7.0_45]
    at java.util.concurrent.ThreadPoolExecutor.runWorker(ThreadPoolExecutor.java:1145) [rt.jar:1.7.0_45]
    at java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:615) [rt.jar:1.7.0_45]
    at java.lang.Thread.run(Thread.java:744) [rt.jar:1.7.0_45]

On the VDSM side, the VM seems to be started OK:

libvirtEventLoop::DEBUG::2014-01-05 09:21:07,471::vm::4940::vm.Vm::(_onLibvirtLifecycleEvent) vmId=`7cbda807-f208-427a-a5a2-b3894fe62ffe`::event Started detail 0 opaque None
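For context on the top frames of the trace: ConcurrentHashMap rejects null keys, so get(null) throws a NullPointerException from the hash step. My reading (an assumption, not confirmed by the code) is that GetVdsManager was handed a null VDS id when decreasePendingVms ran for the already-failed migration. A minimal, self-contained Java sketch of just that failure mode; the map name vdsManagers is illustrative and not the engine's actual field:

import java.util.concurrent.ConcurrentHashMap;

public class NullKeyLookup {
    // Illustrative registry keyed by host id, standing in for a per-host manager map.
    private static final ConcurrentHashMap<String, Object> vdsManagers = new ConcurrentHashMap<>();

    public static void main(String[] args) {
        vdsManagers.put("host-1", new Object());

        // A lookup with a valid key works as expected.
        System.out.println(vdsManagers.get("host-1"));

        // ConcurrentHashMap does not accept null keys: get(null) throws
        // NullPointerException while hashing the key, matching the first
        // two frames of the stack trace above.
        try {
            vdsManagers.get(null);
        } catch (NullPointerException npe) {
            System.out.println("NPE on null key, as in the engine stack trace");
        }
    }
}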
Thanks Francesco, this case was recently fixed by Arik in bug 1048790.
Moving investigation to Point 2 above
Preliminary findings about point 2: the migration indeed failed.

On host 1, the source:

Thread-441::DEBUG::2014-01-05 09:05:37,843::BindingXMLRPC::984::vds::(wrapper) client [10.35.161.80]::call vmMigrate with ({'tunneled': 'false', 'src': '10.1.32.43', 'dst': '10.1.32.44:54321', 'vmId': '8200c2b7-fcc6-4348-983d-043eeb869dbf', 'abortOnError': 'true', 'method': 'online'},) {} flowID [46b59d1a]
Thread-441::DEBUG::2014-01-05 09:05:37,844::API::469::vds::(migrate) {'tunneled': 'false', 'src': '10.1.32.43', 'dst': '10.1.32.44:54321', 'vmId': '8200c2b7-fcc6-4348-983d-043eeb869dbf', 'abortOnError': 'true', 'method': 'online'}
Thread-441::DEBUG::2014-01-05 09:05:37,844::BindingXMLRPC::991::vds::(wrapper) return vmMigrate with {'status': {'message': 'Migration in progress', 'code': 0}, 'progress': 0}
Thread-442::DEBUG::2014-01-05 09:05:37,845::vm::181::vm.Vm::(_setupVdsConnection) vmId=`8200c2b7-fcc6-4348-983d-043eeb869dbf`::Destination server is: 10.1.32.44:54321
Thread-442::DEBUG::2014-01-05 09:05:37,846::vm::183::vm.Vm::(_setupVdsConnection) vmId=`8200c2b7-fcc6-4348-983d-043eeb869dbf`::Initiating connection with destination
Thread-442::DEBUG::2014-01-05 09:05:37,902::vm::233::vm.Vm::(_prepareGuest) vmId=`8200c2b7-fcc6-4348-983d-043eeb869dbf`::Migration started
Thread-442::DEBUG::2014-01-05 09:05:37,967::vm::315::vm.Vm::(run) vmId=`8200c2b7-fcc6-4348-983d-043eeb869dbf`::migration semaphore acquired

On host 2, the destination:

Thread-59771::DEBUG::2014-01-05 09:05:36,944::BindingXMLRPC::984::vds::(wrapper) client [10.1.32.43]::call vmMigrationCreate with ({'username': 'Unknown', 'acpiEnable': 'true', 'emulatedMachine': 'rhel6.5.0', 'afterMigrationStatus': '', 'guestFQDN': '', 'vmId': '8200c2b7-fcc6-4348-983d-043eeb869dbf', 'memGuaranteedSize': 4096, 'spiceSslCipherSuite': 'DEFAULT', 'displaySecurePort': '5901', 'timeOffset': '0', 'cpuType': 'Penryn', 'smp': '1', 'migrationDest': 'libvirt', 'custom': {'device_4713f045-9b0a-4412-bf90-cb3c37f58f35': 'VmDevice {vmId=8200c2b7-fcc6-4348-983d-043eeb869dbf, deviceId=4713f045-9b0a-4412-bf90-cb3c37f58f35, device=ide, type=CONTROLLER, bootOrder=0, specParams={}, address={bus=0x00, domain=0x0000, type=pci, slot=0x01, function=0x1}, managed=false, plugged=true, readOnly=false, deviceAlias=ide0, customProperties={}, snapshotId=null}', 'device_4713f045-9b0a-4412-bf90-cb3c37f58f35device_eeddafff-9509-41f8-a56e-99f39a8cf0f6': 'VmDevice {vmId=8200c2b7-fcc6-4348-983d-043eeb869dbf, deviceId=eeddafff-9509-41f8-a56e-99f39a8cf0f6, (omitted for brevity)
Thread-59771::DEBUG::2014-01-05 09:05:36,945::API::504::vds::(migrationCreate) Migration create
Thread-59771::INFO::2014-01-05 09:05:36,950::clientIF::391::vds::(createVm) vmContainerLock acquired by vm 8200c2b7-fcc6-4348-983d-043eeb869dbf
Thread-59771::DEBUG::2014-01-05 09:05:36,962::clientIF::404::vds::(createVm) Total desktops after creation of 8200c2b7-fcc6-4348-983d-043eeb869dbf is 2
Thread-59772::DEBUG::2014-01-05 09:05:36,962::vm::2115::vm.Vm::(_startUnderlyingVm) vmId=`8200c2b7-fcc6-4348-983d-043eeb869dbf`::Start
Thread-59771::DEBUG::2014-01-05 09:05:36,965::vm::5025::vm.Vm::(waitForMigrationDestinationPrepare) vmId=`8200c2b7-fcc6-4348-983d-043eeb869dbf`::migration destination: waiting for VM creation
Thread-59772::DEBUG::2014-01-05 09:05:36,967::vm::2119::vm.Vm::(_startUnderlyingVm) vmId=`8200c2b7-fcc6-4348-983d-043eeb869dbf`::_ongoingCreations acquired
Thread-59772::INFO::2014-01-05 09:05:36,969::vm::2969::vm.Vm::(_run) vmId=`8200c2b7-fcc6-4348-983d-043eeb869dbf`::VM wrapper has started
Thread-59771::DEBUG::2014-01-05 09:05:36,969::vm::5030::vm.Vm::(waitForMigrationDestinationPrepare) vmId=`8200c2b7-fcc6-4348-983d-043eeb869dbf`::migration destination: waiting 42s for path preparation

but then:

Thread-59771::DEBUG::2014-01-05 09:06:18,972::vm::5033::vm.Vm::(waitForMigrationDestinationPrepare) vmId=`8200c2b7-fcc6-4348-983d-043eeb869dbf`::Timeout while waiting for path preparation

Consequently, on host 1:

Thread-442::ERROR::2014-01-05 09:06:20,054::vm::239::vm.Vm::(_recover) vmId=`8200c2b7-fcc6-4348-983d-043eeb869dbf`::migration destination error: Error creating the requested VM
Thread-442::ERROR::2014-01-05 09:06:45,811::vm::338::vm.Vm::(run) vmId=`8200c2b7-fcc6-4348-983d-043eeb869dbf`::Failed to migrate (stack trace snipped)

And then, on the engine:

2014-01-05 11:06:59,871 INFO  [org.ovirt.engine.core.vdsbroker.VdsUpdateRunTimeInfo] (DefaultQuartzScheduler_Worker-22) VM POOLTEST-11 8200c2b7-fcc6-4348-983d-043eeb869dbf moved from MigratingFrom --> Up
2014-01-05 11:06:59,871 INFO  [org.ovirt.engine.core.vdsbroker.VdsUpdateRunTimeInfo] (DefaultQuartzScheduler_Worker-22) Adding VM 8200c2b7-fcc6-4348-983d-043eeb869dbf to re-run list
2014-01-05 11:06:59,878 ERROR [org.ovirt.engine.core.vdsbroker.VdsUpdateRunTimeInfo] (DefaultQuartzScheduler_Worker-22) Rerun vm 31556a01-bf16-4c82-ae6c-993441440a22. Called from vds host19-rack06.scale.openstack.engineering.redhat.com
2014-01-05 11:06:59,879 ERROR [org.ovirt.engine.core.vdsbroker.VdsUpdateRunTimeInfo] (DefaultQuartzScheduler_Worker-22) Rerun vm 8200c2b7-fcc6-4348-983d-043eeb869dbf. Called from vds host19-rack06.scale.openstack.engineering.redhat.com
The libvirt logs start at 2014-01-05 10:40:02.181+0000 on host1 and at 2014-01-05 09:47:28.558+0000 on host2, so the 09:05 - 09:07 time window is missing. Can you please provide those logs?
I'm afraid the logs are not there anymore.
Sorry for the late update. It turns out that the timeout expired because a call to getVolumeSize, which is made in the VM creation flow, took too long to complete:

Thread-59772::DEBUG::2014-01-05 09:05:36,962::vm::2115::vm.Vm::(_startUnderlyingVm) vmId=`8200c2b7-fcc6-4348-983d-043eeb869dbf`::Start
Thread-59772::DEBUG::2014-01-05 09:05:36,967::vm::2119::vm.Vm::(_startUnderlyingVm) vmId=`8200c2b7-fcc6-4348-983d-043eeb869dbf`::_ongoingCreations acquired
Thread-59772::INFO::2014-01-05 09:05:36,969::vm::2969::vm.Vm::(_run) vmId=`8200c2b7-fcc6-4348-983d-043eeb869dbf`::VM wrapper has started
Thread-59772::WARNING::2014-01-05 09:05:36,972::vm::1902::vm.Vm::(getConfDevices) vmId=`8200c2b7-fcc6-4348-983d-043eeb869dbf`::Unknown type found, device: '{'device': 'unix', 'alias': 'channel0', 'type': 'channel', 'address': {'bus': '0', 'controller': '0', 'type': 'virtio-serial', 'port': '1'}}' found
Thread-59772::WARNING::2014-01-05 09:05:36,973::vm::1902::vm.Vm::(getConfDevices) vmId=`8200c2b7-fcc6-4348-983d-043eeb869dbf`::Unknown type found, device: '{'device': 'unix', 'alias': 'channel1', 'type': 'channel', 'address': {'bus': '0', 'controller': '0', 'type': 'virtio-serial', 'port': '2'}}' found
Thread-59772::WARNING::2014-01-05 09:05:36,973::vm::1902::vm.Vm::(getConfDevices) vmId=`8200c2b7-fcc6-4348-983d-043eeb869dbf`::Unknown type found, device: '{'device': 'spicevmc', 'alias': 'channel2', 'type': 'channel', 'address': {'bus': '0', 'controller': '0', 'type': 'virtio-serial', 'port': '3'}}' found
Thread-59772::DEBUG::2014-01-05 09:05:36,974::task::579::TaskManager.Task::(_updateState) Task=`effa6e6e-ed54-4b32-a271-f9d04cfa2f9b`::moving from state init -> state preparing
Thread-59772::INFO::2014-01-05 09:05:36,975::logUtils::44::dispatcher::(wrapper) Run and protect: getVolumeSize(sdUUID='7ac99a5a-1542-4757-b7fc-c43ac2caf8f4', spUUID='a46b1470-b901-4535-a308-4438c135d6c3', imgUUID='06f55c72-43ce-4959-931f-0a93626d0f07', volUUID='d7827c2a-4a62-41e0-bae1-ff4d10e7c466', options=None)
Thread-59772::INFO::2014-01-05 09:06:27,689::logUtils::47::dispatcher::(wrapper) Run and protect: getVolumeSize, Return response: {'truesize': '337831936', 'apparentsize': '337182720'}

The call took roughly 50 seconds to complete (09:05:36,975 -> 09:06:27,689) while the timeout was set to 42 seconds, which explains the failure. The only thing we can improve here is the reporting to the UI.
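To make the race concrete, here is a minimal sketch of the timing (in Java for consistency with the other snippet; the real code is VDSM Python, and the class, thread, and latch names here are illustrative only). One thread stands in for the VM creation flow blocked in a slow storage call, the other for waitForMigrationDestinationPrepare giving up after its timeout; durations are scaled down 1:100 so the demo finishes quickly.

import java.util.concurrent.CountDownLatch;
import java.util.concurrent.TimeUnit;

public class PathPreparationTimeout {
    public static void main(String[] args) throws InterruptedException {
        CountDownLatch pathsPrepared = new CountDownLatch(1);

        // Stand-in for the VM creation flow: in the logs the getVolumeSize
        // call blocked for about 50 seconds (modeled here as 500 ms).
        Thread preparation = new Thread(() -> {
            try {
                TimeUnit.MILLISECONDS.sleep(500); // slow blocking storage call
            } catch (InterruptedException e) {
                Thread.currentThread().interrupt();
                return;
            }
            pathsPrepared.countDown(); // too late, the waiter already gave up
        });
        preparation.start();

        // Stand-in for waitForMigrationDestinationPrepare, which waited at
        // most 42 seconds for path preparation (modeled here as 420 ms).
        boolean prepared = pathsPrepared.await(420, TimeUnit.MILLISECONDS);
        System.out.println(prepared
                ? "migration can proceed"
                : "Timeout while waiting for path preparation");

        preparation.join();
    }
}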
On the VDSM side, reporting the exitReason should provide enough information to the engine to properly fix the error reporting. In particular, I think that http://gerrit.ovirt.org/#/c/25164/ should also fix this issue. However, I am *not* moving this to POST because a patch on the engine side is needed as well to leverage the new information.
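For illustration only, a rough sketch of how the engine side could map a structured exit reason to a more precise audit message. The enum values, method, and non-qemu message strings below are hypothetical and are not taken from the referenced patch; only the "Lost connection with qemu process" wording matches the audit log quoted later in this bug.

// Hypothetical sketch, not actual oVirt engine code.
enum VmExitReason {
    LOST_QEMU_CONNECTION,
    MIGRATION_DESTINATION_ERROR,
    GENERIC_ERROR
}

public class ExitReasonMessages {
    static String auditMessage(VmExitReason reason) {
        switch (reason) {
            case LOST_QEMU_CONNECTION:
                // Precise message for the scenario in this bug: the qemu
                // process died on the *source* host.
                return "Migration failed: Lost connection with qemu process";
            case MIGRATION_DESTINATION_ERROR:
                // Hypothetical wording for a genuine destination-side failure.
                return "Migration failed: Error creating the requested VM on the destination";
            default:
                return "Migration failed";
        }
    }

    public static void main(String[] args) {
        System.out.println(auditMessage(VmExitReason.LOST_QEMU_CONNECTION));
    }
}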
Reassigning to add the engine-side message.
It seems like this was already solved. On 3.6 master the following audit log message is shown (note that on master the 'Destination' part of this message has been removed):

Migration failed: Lost connection with qemu process (VM: mmm, Source: bamba).
Moving to VERIFIED, as this bug is the result of a race condition. The migration tests run constantly in automation, and such a problem has not been seen recently.
Since the problem described in this bug report should be resolved in a recent advisory, it has been closed with a resolution of ERRATA. For information on the advisory, and where to find the updated files, follow the link below. If the solution does not work for you, open a new bug report. https://rhn.redhat.com/errata/RHEA-2016-0376.html