Bug 1048573
Summary: Unexpected behaviour when killing the qemu process on the source host during VM migration
Product: Red Hat Enterprise Virtualization Manager
Component: ovirt-engine
Status: CLOSED ERRATA
Severity: medium
Priority: low
Version: 3.3.0
Target Milestone: ovirt-3.6.0-rc
Target Release: 3.6.0
Hardware: Unspecified
OS: Unspecified
Reporter: Ilanit Stein <istein>
Assignee: Arik <ahadas>
QA Contact: Ilanit Stein <istein>
CC: ahadas, bgraveno, fromani, iheim, istein, joallen, lpeer, mavital, michal.skrivanek, rbalakri, Rhev-m-bugs, yeylon
Flags: istein: testing_plan_complete+
Doc Type: Bug Fix
Doc Text: Enhanced audit log messages with additional information to enable users to easily resolve migration failure issues.
Type: Bug
Last Closed: 2016-03-09 20:34:53 UTC
oVirt Team: Virt
Description
Ilanit Stein
2014-01-05 10:59:52 UTC
Audit log for the VM whose qemu process was killed (POOLTEST-10):
============================================================
2014-Jan-05, 11:21  VM POOLTEST-10 started on Host host19-rack06.scale.openstack.engineering.redhat.com  124336ae  oVirt
2014-Jan-05, 11:20  Failed to run VM POOLTEST-10 (User: admin@internal).  124336ae  oVirt
2014-Jan-05, 11:06  VM POOLTEST-10 is down. Exit message: Lost connection with qemu process.  oVirt
2014-Jan-05, 11:06  Migration failed due to Error: Error creating the requested Desktop (VM: POOLTEST-10, Source: host19-rack06.scale.openstack.engineering.redhat.com, Destination: host20-rack06.scale.openstack.engineering.redhat.com).  21d96212  oVirt
2014-Jan-05, 11:06  Migration failed due to Error: Error creating the requested Desktop. Trying to migrate to another Host (VM: POOLTEST-10, Source: host19-rack06.scale.openstack.engineering.redhat.com, Destination: host20-rack06.scale.openstack.engineering.redhat.com).  21d96212  oVirt

Audit log for the 2 VMs whose migration failed as well (same message for VMs POOLTEST-11 and POOLTEST-12):
=================================================================================
2014-Jan-05, 11:07  Migration failed due to Error: Error creating the requested Desktop (VM: POOLTEST-11, Source: host19-rack06.scale.openstack.engineering.redhat.com, Destination: host20-rack06.scale.openstack.engineering.redhat.com).  46b59d1a  oVirt
2014-Jan-05, 11:07  Migration failed due to Error: Error creating the requested Desktop. Trying to migrate to another Host (VM: POOLTEST-11, Source: host19-rack06.scale.openstack.engineering.redhat.com, Destination: host20-rack06.scale.openstack.engineering.redhat.com).

Created attachment 845731 [details]
engine log (migration start at Jan 05, 11:05)
Created attachment 845735 [details]
host_1 logs
Host time is 2 hours behind the RHEV-M. Migration starts in the logs at Jan 5, 9:05.
Created attachment 845736 [details]
host_2 logs
Host time is 2 hours behind the RHEV-M. Migration starts in the logs at Jan 5, 9:05.
Francesco, can you take a look at the logs? Especially concerning point 3.

Yep, taking the bug.

First finding for point 3 above: engine.log reports a NullPointerException shortly after the VM creation command is issued:

2014-01-05 11:20:46,945 INFO [org.ovirt.engine.core.vdsbroker.vdsbroker.CreateVDSCommand] (pool-4-thread-46) [124336ae] FINISH, CreateVDSCommand, log id: a81c73e
2014-01-05 11:20:46,951 INFO [org.ovirt.engine.core.vdsbroker.CreateVmVDSCommand] (pool-4-thread-46) [124336ae] FINISH, CreateVmVDSCommand, return: WaitForLaunch, log id: 2d75abdc
2014-01-05 11:20:46,952 INFO [org.ovirt.engine.core.bll.RunVmCommand] (pool-4-thread-46) [124336ae] Lock freed to object EngineLock [exclusiveLocks= key: 7cbda807-f208-427a-a5a2-b3894fe62ffe value: VM , sharedLocks= ]
2014-01-05 11:20:46,953 ERROR [org.ovirt.engine.core.bll.RunVmCommand] (pool-4-thread-46) [124336ae] Command org.ovirt.engine.core.bll.RunVmCommand throw exception: java.lang.NullPointerException
    at java.util.concurrent.ConcurrentHashMap.hash(ConcurrentHashMap.java:333) [rt.jar:1.7.0_45]
    at java.util.concurrent.ConcurrentHashMap.get(ConcurrentHashMap.java:988) [rt.jar:1.7.0_45]
    at org.ovirt.engine.core.vdsbroker.ResourceManager.GetVdsManager(ResourceManager.java:228) [vdsbroker.jar:]
    at org.ovirt.engine.core.bll.RunVmCommandBase.getMonitor(RunVmCommandBase.java:337) [bll.jar:]
    at org.ovirt.engine.core.bll.RunVmCommandBase.getBlockingQueue(RunVmCommandBase.java:327) [bll.jar:]
    at org.ovirt.engine.core.bll.RunVmCommandBase.decreasePendingVms(RunVmCommandBase.java:289) [bll.jar:]
    at org.ovirt.engine.core.bll.RunVmCommandBase.decreasePendingVms(RunVmCommandBase.java:284) [bll.jar:]
    at org.ovirt.engine.core.bll.MigrateVmCommand.reportCompleted(MigrateVmCommand.java:377) [bll.jar:]
    at org.ovirt.engine.core.bll.VDSBrokerFrontendImpl.RunAsyncVdsCommand(VDSBrokerFrontendImpl.java:53) [bll.jar:]
    at org.ovirt.engine.core.bll.RunVmCommand.createVm(RunVmCommand.java:473) [bll.jar:]
    at org.ovirt.engine.core.bll.RunVmCommand.runVm(RunVmCommand.java:234) [bll.jar:]
    at org.ovirt.engine.core.bll.RunVmCommand.executeVmCommand(RunVmCommand.java:304) [bll.jar:]
    at org.ovirt.engine.core.bll.VmCommand.executeCommand(VmCommand.java:87) [bll.jar:]
    at org.ovirt.engine.core.bll.CommandBase.executeWithoutTransaction(CommandBase.java:1134) [bll.jar:]
    at org.ovirt.engine.core.bll.CommandBase.executeActionInTransactionScope(CommandBase.java:1219) [bll.jar:]
    at org.ovirt.engine.core.bll.CommandBase.runInTransaction(CommandBase.java:1895) [bll.jar:]
    at org.ovirt.engine.core.utils.transaction.TransactionSupport.executeInSuppressed(TransactionSupport.java:174) [utils.jar:]
    at org.ovirt.engine.core.utils.transaction.TransactionSupport.executeInScope(TransactionSupport.java:116) [utils.jar:]
    at org.ovirt.engine.core.bll.CommandBase.execute(CommandBase.java:1239) [bll.jar:]
    at org.ovirt.engine.core.bll.CommandBase.executeAction(CommandBase.java:362) [bll.jar:]
    at org.ovirt.engine.core.bll.MultipleActionsRunner.executeValidatedCommand(MultipleActionsRunner.java:175) [bll.jar:]
    at org.ovirt.engine.core.bll.MultipleActionsRunner.RunCommands(MultipleActionsRunner.java:156) [bll.jar:]
    at org.ovirt.engine.core.bll.SortedMultipleActionsRunnerBase.RunCommands(SortedMultipleActionsRunnerBase.java:17) [bll.jar:]
    at org.ovirt.engine.core.bll.MultipleActionsRunner$1.run(MultipleActionsRunner.java:94) [bll.jar:]
    at org.ovirt.engine.core.utils.threadpool.ThreadPoolUtil$InternalWrapperRunnable.run(ThreadPoolUtil.java:71) [utils.jar:]
    at java.util.concurrent.Executors$RunnableAdapter.call(Executors.java:471) [rt.jar:1.7.0_45]
    at java.util.concurrent.FutureTask.run(FutureTask.java:262) [rt.jar:1.7.0_45]
    at java.util.concurrent.ThreadPoolExecutor.runWorker(ThreadPoolExecutor.java:1145) [rt.jar:1.7.0_45]
    at java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:615) [rt.jar:1.7.0_45]
    at java.lang.Thread.run(Thread.java:744) [rt.jar:1.7.0_45]

On the VDSM side, the VM seems to have started OK:

libvirtEventLoop::DEBUG::2014-01-05 09:21:07,471::vm::4940::vm.Vm::(_onLibvirtLifecycleEvent) vmId=`7cbda807-f208-427a-a5a2-b3894fe62ffe`::event Started detail 0 opaque None

thanks

Francesco, this case has been recently fixed by Arik in bug 1048790. Moving the investigation to point 2 above.

Preliminary findings about point 2: the migration indeed failed.

On host 1, the source:

Thread-441::DEBUG::2014-01-05 09:05:37,843::BindingXMLRPC::984::vds::(wrapper) client [10.35.161.80]::call vmMigrate with ({'tunneled': 'false', 'src': '10.1.32.43', 'dst': '10.1.32.44:54321', 'vmId': '8200c2b7-fcc6-4348-983d-043eeb869dbf', 'abortOnError': 'true', 'method': 'online'},) {} flowID [46b59d1a]
Thread-441::DEBUG::2014-01-05 09:05:37,844::API::469::vds::(migrate) {'tunneled': 'false', 'src': '10.1.32.43', 'dst': '10.1.32.44:54321', 'vmId': '8200c2b7-fcc6-4348-983d-043eeb869dbf', 'abortOnError': 'true', 'method': 'online'}
Thread-441::DEBUG::2014-01-05 09:05:37,844::BindingXMLRPC::991::vds::(wrapper) return vmMigrate with {'status': {'message': 'Migration in progress', 'code': 0}, 'progress': 0}
Thread-442::DEBUG::2014-01-05 09:05:37,845::vm::181::vm.Vm::(_setupVdsConnection) vmId=`8200c2b7-fcc6-4348-983d-043eeb869dbf`::Destination server is: 10.1.32.44:54321
Thread-442::DEBUG::2014-01-05 09:05:37,846::vm::183::vm.Vm::(_setupVdsConnection) vmId=`8200c2b7-fcc6-4348-983d-043eeb869dbf`::Initiating connection with destination
Thread-442::DEBUG::2014-01-05 09:05:37,902::vm::233::vm.Vm::(_prepareGuest) vmId=`8200c2b7-fcc6-4348-983d-043eeb869dbf`::Migration started
Thread-442::DEBUG::2014-01-05 09:05:37,967::vm::315::vm.Vm::(run) vmId=`8200c2b7-fcc6-4348-983d-043eeb869dbf`::migration semaphore acquired

On host 2, the destination:

Thread-59771::DEBUG::2014-01-05 09:05:36,944::BindingXMLRPC::984::vds::(wrapper) client [10.1.32.43]::call vmMigrationCreate with ({'username': 'Unknown', 'acpiEnable': 'true', 'emulatedMachine': 'rhel6.5.0', 'afterMigrationStatus': '', 'guestFQDN': '', 'vmId': '8200c2b7-fcc6-4348-983d-043eeb869dbf', 'memGuaranteedSize': 4096, 'spiceSslCipherSuite': 'DEFAULT', 'displaySecurePort': '5901', 'timeOffset': '0', 'cpuType': 'Penryn', 'smp': '1', 'migrationDest': 'libvirt', 'custom': {'device_4713f045-9b0a-4412-bf90-cb3c37f58f35': 'VmDevice {vmId=8200c2b7-fcc6-4348-983d-043eeb869dbf, deviceId=4713f045-9b0a-4412-bf90-cb3c37f58f35, device=ide, type=CONTROLLER, bootOrder=0, specParams={}, address={bus=0x00, domain=0x0000, type=pci, slot=0x01, function=0x1}, managed=false, plugged=true, readOnly=false, deviceAlias=ide0, customProperties={}, snapshotId=null}', 'device_4713f045-9b0a-4412-bf90-cb3c37f58f35device_eeddafff-9509-41f8-a56e-99f39a8cf0f6': 'VmDevice {vmId=8200c2b7-fcc6-4348-983d-043eeb869dbf, deviceId=eeddafff-9509-41f8-a56e-99f39a8cf0f6, (omitted for brevity)
Thread-59771::DEBUG::2014-01-05 09:05:36,945::API::504::vds::(migrationCreate) Migration create
Thread-59771::INFO::2014-01-05 09:05:36,950::clientIF::391::vds::(createVm) vmContainerLock acquired by vm 8200c2b7-fcc6-4348-983d-043eeb869dbf
Thread-59771::DEBUG::2014-01-05 09:05:36,962::clientIF::404::vds::(createVm) Total desktops after creation of 8200c2b7-fcc6-4348-983d-043eeb869dbf is 2
Thread-59772::DEBUG::2014-01-05 09:05:36,962::vm::2115::vm.Vm::(_startUnderlyingVm) vmId=`8200c2b7-fcc6-4348-983d-043eeb869dbf`::Start
Thread-59771::DEBUG::2014-01-05 09:05:36,965::vm::5025::vm.Vm::(waitForMigrationDestinationPrepare) vmId=`8200c2b7-fcc6-4348-983d-043eeb869dbf`::migration destination: waiting for VM creation
Thread-59772::DEBUG::2014-01-05 09:05:36,967::vm::2119::vm.Vm::(_startUnderlyingVm) vmId=`8200c2b7-fcc6-4348-983d-043eeb869dbf`::_ongoingCreations acquired
Thread-59772::INFO::2014-01-05 09:05:36,969::vm::2969::vm.Vm::(_run) vmId=`8200c2b7-fcc6-4348-983d-043eeb869dbf`::VM wrapper has started
Thread-59771::DEBUG::2014-01-05 09:05:36,969::vm::5030::vm.Vm::(waitForMigrationDestinationPrepare) vmId=`8200c2b7-fcc6-4348-983d-043eeb869dbf`::migration destination: waiting 42s for path preparation

but then:

Thread-59771::DEBUG::2014-01-05 09:06:18,972::vm::5033::vm.Vm::(waitForMigrationDestinationPrepare) vmId=`8200c2b7-fcc6-4348-983d-043eeb869dbf`::Timeout while waiting for path preparation

consequently, on host 1:

Thread-442::ERROR::2014-01-05 09:06:20,054::vm::239::vm.Vm::(_recover) vmId=`8200c2b7-fcc6-4348-983d-043eeb869dbf`::migration destination error: Error creating the requested VM
Thread-442::ERROR::2014-01-05 09:06:45,811::vm::338::vm.Vm::(run) vmId=`8200c2b7-fcc6-4348-983d-043eeb869dbf`::Failed to migrate (stack trace snipped)

and then, on the engine:

2014-01-05 11:06:59,871 INFO [org.ovirt.engine.core.vdsbroker.VdsUpdateRunTimeInfo] (DefaultQuartzScheduler_Worker-22) VM POOLTEST-11 8200c2b7-fcc6-4348-983d-043eeb869dbf moved from MigratingFrom --> Up
2014-01-05 11:06:59,871 INFO [org.ovirt.engine.core.vdsbroker.VdsUpdateRunTimeInfo] (DefaultQuartzScheduler_Worker-22) Adding VM 8200c2b7-fcc6-4348-983d-043eeb869dbf to re-run list
2014-01-05 11:06:59,878 ERROR [org.ovirt.engine.core.vdsbroker.VdsUpdateRunTimeInfo] (DefaultQuartzScheduler_Worker-22) Rerun vm 31556a01-bf16-4c82-ae6c-993441440a22. Called from vds host19-rack06.scale.openstack.engineering.redhat.com
2014-01-05 11:06:59,879 ERROR [org.ovirt.engine.core.vdsbroker.VdsUpdateRunTimeInfo] (DefaultQuartzScheduler_Worker-22) Rerun vm 8200c2b7-fcc6-4348-983d-043eeb869dbf. Called from vds host19-rack06.scale.openstack.engineering.redhat.com

The libvirt logs start at host1: 2014-01-05 10:40:02.181+0000 and host2: 2014-01-05 09:47:28.558+0000, so the time window 09:05 - 09:07 is missing. Can you please provide them?

I'm afraid the logs are not there any more. Sorry for the late update.
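The NullPointerException in the engine trace above originates in ConcurrentHashMap, which forbids null keys: if GetVdsManager is handed a null host id (plausible here, after the source host reference was cleared when the qemu process died), the lookup itself throws before any migration logic runs. A minimal sketch of that failure mode; the map name `vdsManagers` is illustrative, not the actual ResourceManager field:

```java
import java.util.concurrent.ConcurrentHashMap;

public class NullKeyLookup {
    // Stand-in for ResourceManager's host-id -> VdsManager map (name is hypothetical).
    static final ConcurrentHashMap<String, Object> vdsManagers = new ConcurrentHashMap<>();

    // Returns true when the lookup throws NPE, mirroring the trace above:
    // ConcurrentHashMap.get(null) fails inside hash() before any comparison.
    static boolean lookupThrowsOnNullKey(String hostId) {
        try {
            vdsManagers.get(hostId);
            return false;
        } catch (NullPointerException e) {
            return true;
        }
    }

    public static void main(String[] args) {
        vdsManagers.put("host19-rack06", new Object());
        System.out.println(lookupThrowsOnNullKey("host19-rack06")); // prints "false"
        System.out.println(lookupThrowsOnNullKey(null));            // prints "true"
    }
}
```

This is why the fix referenced in bug 1048790 matters: guarding the host id before the map lookup avoids the NPE entirely, whereas HashMap would have silently returned null here.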
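The destination-side behaviour in the logs above, waiting a bounded 42 seconds for path preparation and then failing the incoming migration, is a standard wait-with-deadline pattern. A minimal sketch under stated assumptions: the latch, class and method names are illustrative (not VDSM's actual Python code), and the timeouts are shortened for the demo:

```java
import java.util.concurrent.CountDownLatch;
import java.util.concurrent.TimeUnit;

public class PrepareTimeout {
    // Wait up to timeoutMs for path preparation; a false return corresponds to
    // the "Timeout while waiting for path preparation" log line above.
    public static boolean waitForPreparation(CountDownLatch pathReady, long timeoutMs) {
        try {
            return pathReady.await(timeoutMs, TimeUnit.MILLISECONDS);
        } catch (InterruptedException e) {
            Thread.currentThread().interrupt();
            return false;
        }
    }

    public static void main(String[] args) {
        CountDownLatch pathReady = new CountDownLatch(1);
        // Simulate slow storage (a getVolumeSize-like call): preparation
        // completes after 300 ms, but the waiter only allows 100 ms.
        new Thread(() -> {
            try {
                Thread.sleep(300);
            } catch (InterruptedException ignored) {
            }
            pathReady.countDown();
        }).start();

        if (!waitForPreparation(pathReady, 100)) {
            System.out.println("Timeout while waiting for path preparation");
        }
    }
}
```

The design point is that the preparation work itself succeeds eventually; only the waiter gives up first, which is exactly the race this bug describes.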
Turns out the timeout expired because a call to getVolumeSize, which is made in the VM creation flow, took too long to complete:

Thread-59772::DEBUG::2014-01-05 09:05:36,962::vm::2115::vm.Vm::(_startUnderlyingVm) vmId=`8200c2b7-fcc6-4348-983d-043eeb869dbf`::Start
Thread-59772::DEBUG::2014-01-05 09:05:36,967::vm::2119::vm.Vm::(_startUnderlyingVm) vmId=`8200c2b7-fcc6-4348-983d-043eeb869dbf`::_ongoingCreations acquired
Thread-59772::INFO::2014-01-05 09:05:36,969::vm::2969::vm.Vm::(_run) vmId=`8200c2b7-fcc6-4348-983d-043eeb869dbf`::VM wrapper has started
Thread-59772::WARNING::2014-01-05 09:05:36,972::vm::1902::vm.Vm::(getConfDevices) vmId=`8200c2b7-fcc6-4348-983d-043eeb869dbf`::Unknown type found, device: '{'device': 'unix', 'alias': 'channel0', 'type': 'channel', 'address': {'bus': '0', 'controller': '0', 'type': 'virtio-serial', 'port': '1'}}' found
Thread-59772::WARNING::2014-01-05 09:05:36,973::vm::1902::vm.Vm::(getConfDevices) vmId=`8200c2b7-fcc6-4348-983d-043eeb869dbf`::Unknown type found, device: '{'device': 'unix', 'alias': 'channel1', 'type': 'channel', 'address': {'bus': '0', 'controller': '0', 'type': 'virtio-serial', 'port': '2'}}' found
Thread-59772::WARNING::2014-01-05 09:05:36,973::vm::1902::vm.Vm::(getConfDevices) vmId=`8200c2b7-fcc6-4348-983d-043eeb869dbf`::Unknown type found, device: '{'device': 'spicevmc', 'alias': 'channel2', 'type': 'channel', 'address': {'bus': '0', 'controller': '0', 'type': 'virtio-serial', 'port': '3'}}' found
Thread-59772::DEBUG::2014-01-05 09:05:36,974::task::579::TaskManager.Task::(_updateState) Task=`effa6e6e-ed54-4b32-a271-f9d04cfa2f9b`::moving from state init -> state preparing
Thread-59772::INFO::2014-01-05 09:05:36,975::logUtils::44::dispatcher::(wrapper) Run and protect: getVolumeSize(sdUUID='7ac99a5a-1542-4757-b7fc-c43ac2caf8f4', spUUID='a46b1470-b901-4535-a308-4438c135d6c3', imgUUID='06f55c72-43ce-4959-931f-0a93626d0f07', volUUID='d7827c2a-4a62-41e0-bae1-ff4d10e7c466', options=None)
Thread-59772::INFO::2014-01-05 09:06:27,689::logUtils::47::dispatcher::(wrapper) Run and protect: getVolumeSize, Return response: {'truesize': '337831936', 'apparentsize': '337182720'}

The call took roughly 48s to complete while the timeout was set to 42s, which explains the failure. The only thing we can improve here is the reporting to the UI. On the VDSM side, reporting the exitReason should provide enough information for the engine to properly fix the error reporting. In particular I think that http://gerrit.ovirt.org/#/c/25164/ should also fix this issue. However, *not* moving to POST, because a patch on the engine side is needed as well to leverage the new information.

Reassigning to add the engine-side message.

Seems like it was already solved. On 3.6 master the following audit log is shown (note that on master the 'Destination' part of this message is removed): Migration failed: Lost connection with qemu process (VM: mmm, Source: bamba).

Moving to verified, as this bug is the result of a race condition. We run the migration tests in automation constantly, and such a problem has not been seen recently.

Since the problem described in this bug report should be resolved in a recent advisory, it has been closed with a resolution of ERRATA. For information on the advisory, and where to find the updated files, follow the link below. If the solution does not work for you, open a new bug report.

https://rhn.redhat.com/errata/RHEA-2016-0376.html