Created attachment 487070 (vdsm-logs)

Description of problem:
When vdsm tries to operate on a VM while the qemu process is not responding, libvirt fails with:
TimeoutError: Timed out during operation: cannot acquire state change lock.
After the operation fails, vdsm receives events from libvirt but does not change the VM status:

libvirtEventLoop::DEBUG::2011-03-23 17:03:44,521::libvirtvm::1043::vm.Vm::(_onLibvirtLifecycleEvent) vmId=`52902746-580f-4e0c-826a-0bc42ec6d970`::event Stopped detail 5 opaque None
libvirtEventLoop::INFO::2011-03-23 17:03:44,521::vm::1014::vm.Vm::(_onQemuDeath) vmId=`52902746-580f-4e0c-826a-0bc42ec6d970`::underlying process disconnected

vdsClient -s 0 list table
52902746-580f-4e0c-826a-0bc42ec6d970 31111 LIBVIRT-NFS-0-07 Powering down*

Version-Release number of selected component (if applicable):
vdsm-4.9-55.el6.x86_64
libvirt-0.8.7-13.el6.x86_64

How reproducible:
100%

Steps to Reproduce:
1. Run a VM.
2. kill -19 qemu-process and wait 60 seconds until vdsm reports the VM as Up*.
3. vdsClient -s 0 destroy vm-uuid (the operation fails with the error above).
4. kill -9 qemu-process

Actual results:
The VM stays in a non-responding state ('Powering down*') although its process is actually dead.

Additional info:
vdsm log attached.
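The manual steps above could be scripted roughly as follows. This is a hypothetical helper, not part of vdsm: it assumes the VM is already running (step 1), that the qemu PID and VM UUID are known, and that `vdsClient` is on the PATH. By default it only prints the plan instead of sending signals.

```python
import os
import signal
import subprocess
import time

# Hypothetical reproduction helper (not vdsm code). Assumes the VM is already
# running and the caller supplies the qemu PID and the VM UUID.


def reproduction_plan(qemu_pid, vm_uuid, wait_seconds=60):
    """Return the ordered actions for steps 2-4 plus a final status check."""
    return [
        ("signal", qemu_pid, signal.SIGSTOP),                   # kill -19 qemu-process
        ("sleep", wait_seconds),                                # wait until vdsm reports Up*
        ("run", ["vdsClient", "-s", "0", "destroy", vm_uuid]),  # expected to fail
        ("signal", qemu_pid, signal.SIGKILL),                   # kill -9 qemu-process
        ("run", ["vdsClient", "-s", "0", "list", "table"]),     # inspect the VM state
    ]


def execute(plan, dry_run=True):
    """Print the plan (dry run) or actually perform each action in order."""
    for action in plan:
        kind = action[0]
        if dry_run:
            print(action)
            continue
        if kind == "signal":
            os.kill(action[1], action[2])
        elif kind == "sleep":
            time.sleep(action[1])
        elif kind == "run":
            # check=False because the destroy call in step 3 is expected to fail
            subprocess.run(action[1], check=False)
```

For example, `execute(reproduction_plan(31111, "52902746-580f-4e0c-826a-0bc42ec6d970"))` prints the five actions without touching any process; `dry_run=False` should only be used on a disposable test host.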
The real reason the VM is stuck in the 'Powering Down' state is the destroy failure, and that behavior is correct: we can't switch the VM state to 'Down' before we release all its resources. If you issue the destroy command again, it will succeed. The additional problem is that we need to do the same thing (release the VM resources) when we get the onQemuDeath event.
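The rule described above can be illustrated with a toy state model. It is not vdsm code: `VmModel`, `Status`, and the list of simulated release outcomes are hypothetical names chosen for illustration. Both `destroy` and `on_qemu_death` go through the same release path, and the VM only moves to Down once a release succeeds.

```python
from enum import Enum


class Status(Enum):
    UP = "Up"
    POWERING_DOWN = "Powering down"
    DOWN = "Down"


class VmModel:
    """Toy model (not vdsm code): a VM may only become Down after its
    resources have been released successfully."""

    def __init__(self, release_results):
        # Hypothetical outcomes of successive release attempts
        # (True = resources released). Once exhausted, releases succeed.
        self._results = list(release_results)
        self.status = Status.UP

    def _release(self):
        return self._results.pop(0) if self._results else True

    def _try_go_down(self):
        if self.status is Status.DOWN:
            return  # already released; nothing left to do
        if self._release():
            self.status = Status.DOWN
        else:
            # Resources still held: stay stuck until a later attempt succeeds
            self.status = Status.POWERING_DOWN

    def destroy(self):
        self._try_go_down()

    def on_qemu_death(self):
        # The additional fix: qemu death triggers the same release path
        # instead of only logging the event.
        self._try_go_down()
```

With `VmModel([False, True])`, a failed `destroy()` leaves the status at `Status.POWERING_DOWN`, and the following `on_qemu_death()` releases the resources and moves the VM to `Status.DOWN`.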
Tested on vdsm-4.9-58.el6:

libvirtEventLoop::INFO::2011-04-10 13:17:16,363::vm::995::vm.Vm::(_onQemuDeath) vmId=`cbc17805-d25c-469c-aa76-53237d4d6df2`::underlying process disconnected
libvirtEventLoop::INFO::2011-04-10 13:17:16,440::dispatcher::94::Storage.Dispatcher.Protect::(run) Run and protect: teardownVolume, args: ( sdUUID=b9750b78-f531-4161-ac3e-fe1805297861 spUUID=cc174bff-13d3-4ff4-a5cb-8ce6d019629e imgUUID=294a7fc5-d61d-4006-aa8b-3c24ae9d762f volUUID=f1b12b58-215a-49bc-9530-41908f668770 rw=False)
libvirtEventLoop::INFO::2011-04-10 13:17:19,660::dispatcher::100::Storage.Dispatcher.Protect::(run) Run and protect: teardownVolume, Return response: {'status': {'message': 'OK' , 'code': 0}}

teardownVolume is executed after _onQemuDeath, but destroy is not called:

[root@camel-vdsb tmp]# vdsClient -s 0 list table
cbc17805-d25c-469c-aa76-53237d4d6df2 5294 LIBVIRT-TEST-001 Powering down*

The VM is still in the 'Powering down*' state.

Expected result: destroy should be called after _onQemuDeath and teardownVolume (even if destroy was called before).
Tested a patch to vm.py:

    def _onQemuDeath(self):
        self.log.info('underlying process disconnected')
        # Try to release the VM resources first; if that fails, stay
        # stuck in the 'Powering Down' state
        response = self.releaseVm()
        # A status code of 0 means the release succeeded
        if not response['status']['code']:
            if self.user_destroy:
                self.setDownStatus(NORMAL, "User shut down")
            else:
                self.setDownStatus(ERROR, "Lost connection with kvm process")

It works.
Verified on vdsm-4.9-59.el6:

libvirtEventLoop::DEBUG::2011-04-12 19:26:34,422::vm::1760::vm.Vm::(setDownStatus) vmId=`f3bbd496-3931-43a7-b967-857853fd0e71`::Changed state to Down: Lost connection with kvm process