---Problem Description---
Unable to resume, force shutdown, destroy, or stop a paused Windows 2003 VM. The following error is displayed:

error: Timed out during operation: cannot acquire state change lock

The error occurs with both the GUI and the CLI.

Steps to Reproduce
=============
For some reason, the single VM we have running on the system, a Windows 2003 VM, became paused. All attempts to resume/stop/destroy the VM have failed, even after two host reboots.

Here are snippets from virsh and messages:

[root@iopx3550a0277 ~]# virsh list
 Id Name                 State
----------------------------------
  1 iopxa0277-w3-1       paused

[root@iopx3550a0277 ~]# virsh resume iopxa0277-w3-1
error: Failed to resume domain iopxa0277-w3-1
error: Timed out during operation: cannot acquire state change lock

[root@iopx3550a0277 ~]# virsh shutdown iopxa0277-w3-1
error: Failed to shutdown domain iopxa0277-w3-1
error: Timed out during operation: cannot acquire state change lock

[root@iopx3550a0277 ~]# virsh destroy iopxa0277-w3-1
error: Failed to destroy domain iopxa0277-w3-1
error: Timed out during operation: cannot acquire state change lock

Apr 13 15:48:18 iopx3550a0277 libvirtd: 15:48:18.087: error : virLibConnError:462 : this function is not supported by the connection driver: virDomainReboot
Apr 13 15:48:35 iopx3550a0277 libvirtd: 15:48:35.792: error : virLibConnError:462 : this function is not supported by the connection driver: virDomainReboot
Apr 13 15:50:17 iopx3550a0277 libvirtd: 15:50:17.005: error : qemuDomainObjBeginJobWithDriver:409 : Timed out during operation: cannot acquire state change lock
Apr 13 15:56:23 iopx3550a0277 libvirtd: 15:56:23.005: error : qemuDomainObjBeginJobWithDriver:409 : Timed out during operation: cannot acquire state change lock
Apr 13 15:57:32 iopx3550a0277 libvirtd: 15:57:32.005: error : qemuDomainObjBeginJob:362 : Timed out during operation: cannot acquire state change lock

Machine Details :
===========
---uname output---
Linux iopx3550a0277.storage.tucson.ibm.com 2.6.18-238.el5 #1 SMP Sun Dec 19 14:22:44 EST 2010 x86_64 x86_64 x86_64 GNU/Linux

Machine Type = x3550 7978-2TZ
Platform : x86 - 64 bit
kernel version : 2.6.18-238.el5
Libvirt : libvirt-0.8.2-15.el5

Resolution :
========
This issue is discussed upstream (http://permalink.gmane.org/gmane.comp.emulators.libvirt/28718) and Stefan has provided the fix for this issue.

libvirt commit IDs :

Issue : Rebuild network filter for UML guests on updates
libvirt mainline commit ID : 38ba6e16eac14872ea3a2ce0bc6bffed6669582a

Issue : nwfilter: resolve deadlock between VM ops and filter update

Patch :
=====
Below are the backported patches that fix this issue.

[patch1] libvirt : Network filter
Rebuild network filter for UML guests on updates
libvirt mainline commit ID : 38ba6e16eac14872ea3a2ce0bc6bffed6669582a

[patch2] libvirt : nwfilter: resolve deadlock between VM ops and filter update
libvirt mainline commit ID : 4435f3c4779b8e2a63166ebe987979e921afa5e0

Verification :
=======
I have rebuilt the libvirt package and installed it on the same machine. The fix appears to work: I was able to pause, resume, stop, start, pause, stop, reboot, and start again with no issues. I consider the fix working.
Created attachment 493084 [patch1] libvirt : Network filter
Created attachment 493085 [patch2] libvirt : nwfilter: resolve deadlock between VM ops and filter update
I don't really understand how this bug report is related to the patch fixing the deadlock between the qemu and nwfilter drivers, but since you did the verification and you are satisfied with what the patch did for you, I take this bz as a request to pull that patch into RHEL-5. Moreover, the patch fixes a real deadlock, so it's good to have anyway.

As for testing this by our QE, I was able to reproduce the deadlock that the patch fixes with the following steps:

1. define a domain which references a filter
2. create a filter xml
3. run two infinite loops concurrently
   - one that defines/undefines the filter from step 2:
       virsh nwfilter-define filter.xml; virsh nwfilter-undefine <filter name>
   - and one that creates/destroys the domain from step 1:
       virsh start domain; virsh destroy domain

Without the fix, the two loops stop (quite soon in my testing) as a result of the deadlock. With the fix, both loops should continue forever.
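For QE convenience, the two concurrent loops could be driven from a single script. The following is only a minimal sketch, not the script used in the testing above: the function names (filter_loop, domain_loop, run_both), the VIRSH override variable, and the bounded iteration count are all assumptions added for illustration. Only the four virsh subcommands come from the steps themselves.

```shell
#!/bin/sh
# Sketch of the two-loop stress test. VIRSH can be overridden (for example
# with "echo") for a dry run; this variable is an assumption of the sketch.
VIRSH=${VIRSH:-virsh}

# filter_loop FILTER_XML FILTER_NAME ITERATIONS
# Repeatedly defines the filter from its XML file and undefines it by name.
filter_loop() {
    i=1
    while [ "$i" -le "$3" ]; do
        "$VIRSH" nwfilter-define "$1"
        "$VIRSH" nwfilter-undefine "$2"
        i=$((i + 1))
    done
}

# domain_loop DOMAIN ITERATIONS
# Repeatedly starts and destroys the domain that references the filter.
domain_loop() {
    i=1
    while [ "$i" -le "$2" ]; do
        "$VIRSH" start "$1"
        "$VIRSH" destroy "$1"
        i=$((i + 1))
    done
}

# run_both FILTER_XML FILTER_NAME DOMAIN ITERATIONS
# Runs both loops in the background and waits for them to finish.
# If libvirtd deadlocks, this wait never returns.
run_both() {
    filter_loop "$1" "$2" "$4" &
    fpid=$!
    domain_loop "$3" "$4" &
    dpid=$!
    wait "$fpid" "$dpid"
}
```

A call such as `run_both filter.xml disallow-arp domain 1000` would then exercise both code paths at once. Individual virsh calls may fail without a deadlock (for example, undefining a filter while the domain is using it), so the signal to look for is the loops stalling, not the error messages.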
http://post-office.corp.redhat.com/archives/rhvirt-patches/2011-April/msg00428.html
------- Comment From mf.com 2011-04-21 16:21 EDT-------
(In reply to comment #22)
> I don't really understand how is this bug report related to the patch fixing
> deadlock between qemu and nwfilter drivers but since you did the verification
> and you are satisfied with what the patch did for you, I take this bz as a
> request to pull that patch in RHEL-5. Moreover, the patch fixes a real
> deadlock, so it's good to have anyway.

Yes, that is the request...to have the fix included in RHEL-5. Thanks!
Checked with:
libvirt-0.8.2-19.el5
kernel-2.6.18-259.el5
xen-3.0.3-130.el5

[root@localhost libvirt]# virsh suspend win2k3
Domain win2k3 suspended

[root@localhost libvirt]# virsh resume win2k3
Domain win2k3 resumed

[root@localhost libvirt]# virsh shutdown win2k3
Domain win2k3 is being shutdown

[root@localhost libvirt]# virsh start win2k3
Domain win2k3 started

[root@localhost libvirt]# virsh suspend win2k3
Domain win2k3 suspended

[root@localhost libvirt]# virsh shutdown win2k3
Domain win2k3 is being shutdown

[root@localhost libvirt]# virsh reboot win2k3
Domain win2k3 is being rebooted

[root@localhost libvirt]# virsh destroy win2k3
Domain win2k3 destroyed

[root@localhost libvirt]# virsh start win2k3
Domain win2k3 started

All of the operations completed without issues, so setting the bug status to VERIFIED.
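The same sequence of lifecycle operations could be scripted so that it stops at the first failure. This is a rough sketch under assumptions: the function name lifecycle_check and the VIRSH override variable are hypothetical, and the order of operations simply mirrors the transcript above.

```shell
#!/bin/sh
# Sketch of a lifecycle check. VIRSH can be overridden for a dry run;
# that variable is an assumption of this sketch.
VIRSH=${VIRSH:-virsh}

# lifecycle_check DOMAIN
# Runs each operation in order. Prints the first failing operation and
# returns 1, or prints a success line and returns 0.
lifecycle_check() {
    for op in suspend resume shutdown start suspend shutdown reboot destroy start; do
        if ! "$VIRSH" "$op" "$1"; then
            echo "FAILED: $op $1"
            return 1
        fi
    done
    echo "all operations succeeded for $1"
}
```

Usage would be `lifecycle_check win2k3`. Since virsh shutdown only requests a guest shutdown and returns, a real run may need a pause between steps before the following command can succeed.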
Could reproduce this bug with the following components on RHEL 5.6:
libvirt-0.8.2-15.el5
kernel-2.6.18-238.el5
xen-3.0.3-120.el5

Steps:
1. Prepare the filter XML and the two test scripts.

# cat nwfilter
<filter name='disallow-arp' chain='arp'>
  <rule action='drop' direction='inout' priority='500'/>
</filter>

# cat test2.sh
for i in `seq 1000`; do
    virsh nwfilter-define nwfilter
    echo $i
    virsh nwfilter-undefine disallow-arp
done

# cat test.sh
for i in `seq 1000`; do
    virsh start rh5.6
    echo $i
    virsh destroy rh5.6
done

2. # virsh edit rh5.6
   Add "<filterref filter='allow-dhcp'/>" inside one of the "interface" nodes.

3. Run the two scripts concurrently.

Outputs:
# sh test2.sh
377
Network filter disallow-arp undefined
Network filter disallow-arp defined from nwfilter
378
error: Failed to undefine network filter disallow-arp
error: Invalid network filter: nwfilter is in use
Network filter disallow-arp defined from nwfilter
379
Network filter disallow-arp undefined
Network filter disallow-arp defined from nwfilter
380
error: Failed to undefine network filter disallow-arp
error: Invalid network filter: nwfilter is in use
Network filter disallow-arp defined from nwfilter
381
Network filter disallow-arp undefined
Network filter disallow-arp defined from nwfilter
382
error: Failed to undefine network filter disallow-arp
error: Invalid network filter: nwfilter is in use
Network filter disallow-arp defined from nwfilter
383
Network filter disallow-arp undefined
Network filter disallow-arp defined from nwfilter
384
error: Failed to undefine network filter disallow-arp
error: Invalid network filter: nwfilter is in use
Network filter disallow-arp defined from nwfilter
385
(deadlock)

# sh test.sh
Domain rh5.6 started
1
Domain rh5.6 destroyed
Domain rh5.6 started
2
Domain rh5.6 destroyed
Domain rh5.6 started
3
Domain rh5.6 destroyed
Domain rh5.6 started
4
Domain rh5.6 destroyed
Domain rh5.6 started
5
Domain rh5.6 destroyed
(deadlock)
------- Comment From markwiz.com 2011-06-08 07:03 EDT-------
Is this supposed to be fixed in RHEL 5.7?
Hello,
It is accepted for RHEL 5.7. It should be fixed in libvirt-0.8.2-19.el5.

Thank You
Joe Kachuck
------- Comment From vahegde1.ibm.com 2011-06-29 01:44 EDT-------
Hi Red Hat,

We have verified with:
kernel 2.6.18-268.el5
libvirt-0.8.2-20.el5
kvm-83-237.el5

Ran tests with a Windows 2003 VM. No issues found. We consider this problem fixed.

Thanks for your support.
An advisory has been issued which should help the problem described in this bug report. This report is therefore being closed with a resolution of ERRATA. For more information on the solution and/or where to find the updated files, please follow the link below. You may reopen this bug report if the solution does not work for you.

http://rhn.redhat.com/errata/RHSA-2011-1019.html