Bug 697749 - Deadlock between VM ops and filter update
Summary: Deadlock between VM ops and filter update
Keywords:
Status: CLOSED ERRATA
Alias: None
Product: Red Hat Enterprise Linux 5
Classification: Red Hat
Component: libvirt
Version: 5.7
Hardware: x86_64
OS: All
Priority: urgent
Severity: urgent
Target Milestone: rc
Target Release: ---
Assignee: Jiri Denemark
QA Contact: Virtualization Bugs
URL:
Whiteboard:
Depends On:
Blocks: 684940
 
Reported: 2011-04-19 06:00 UTC by IBM Bug Proxy
Modified: 2011-07-21 10:31 UTC (History)

Fixed In Version: libvirt-0.8.2-19.el5
Doc Type: Bug Fix
Doc Text:
Clone Of:
Environment:
Last Closed: 2011-07-21 10:31:42 UTC
Target Upstream Version:
Embargoed:


Attachments
[patch1] libvirt : Network filter (1.04 KB, text/plain)
2011-04-19 06:00 UTC, IBM Bug Proxy
[patch2] libvirt : nwfilter: resolve deadlock between VM ops and filter update (5.67 KB, text/plain)
2011-04-19 06:00 UTC, IBM Bug Proxy


Links
System ID Private Priority Status Summary Last Updated
IBM Linux Technology Center 71561 0 None None None Never
Red Hat Product Errata RHSA-2011:1019 0 normal SHIPPED_LIVE Moderate: libvirt security, bug fix, and enhancement update 2011-07-21 10:31:00 UTC

Description IBM Bug Proxy 2011-04-19 06:00:47 UTC
---Problem Description---

Unable to resume/force shutdown/destroy/stop a paused Windows 2003 VM. The following error is
displayed : 

error: Timed out during operation: cannot acquire state change lock

Error occurs using GUI and CLI
 
Steps to Reproduce
=============
 For some reason, the single VM we have running on the system, a Windows 2003 VM, became paused. All
attempts to resume/stop/destroy the VM have failed, even after two host reboots. Here are snippets
from virsh and messages :

[root@iopx3550a0277 ~]# virsh list
 Id Name                 State
----------------------------------
  1 iopxa0277-w3-1       paused

[root@iopx3550a0277 ~]# virsh resume iopxa0277-w3-1
error: Failed to resume domain iopxa0277-w3-1
error: Timed out during operation: cannot acquire state change lock


[root@iopx3550a0277 ~]# virsh shutdown iopxa0277-w3-1
error: Failed to shutdown domain iopxa0277-w3-1
error: Timed out during operation: cannot acquire state change lock

[root@iopx3550a0277 ~]# virsh destroy iopxa0277-w3-1
error: Failed to destroy domain iopxa0277-w3-1
error: Timed out during operation: cannot acquire state change lock


Apr 13 15:48:18 iopx3550a0277 libvirtd: 15:48:18.087: error : virLibConnError:462 : this function is not supported by the connection driver: virDomainReboot
Apr 13 15:48:35 iopx3550a0277 libvirtd: 15:48:35.792: error : virLibConnError:462 : this function is not supported by the connection driver: virDomainReboot
Apr 13 15:50:17 iopx3550a0277 libvirtd: 15:50:17.005: error : qemuDomainObjBeginJobWithDriver:409 : Timed out during operation: cannot acquire state change lock
Apr 13 15:56:23 iopx3550a0277 libvirtd: 15:56:23.005: error : qemuDomainObjBeginJobWithDriver:409 : Timed out during operation: cannot acquire state change lock
Apr 13 15:57:32 iopx3550a0277 libvirtd: 15:57:32.005: error : qemuDomainObjBeginJob:362 : Timed out during operation: cannot acquire state change lock
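To see how often libvirtd hit this timeout, the matching lines in the system log can be counted. This is only a sketch: the helper name count_lock_timeouts is made up for illustration, and the default path /var/log/messages is an assumption based on the syslog-style excerpts above.

```shell
#!/bin/sh
# Count "cannot acquire state change lock" errors in a log file.
# count_lock_timeouts is a hypothetical helper name; the default log path
# is an assumption, not something stated in this report.
count_lock_timeouts() {
    logfile=${1:-/var/log/messages}
    # grep -c prints the number of matching lines (0 if none). It exits
    # non-zero when nothing matches, so that status is masked here.
    grep -c 'cannot acquire state change lock' "$logfile" || true
}
```

For example, `count_lock_timeouts /var/log/messages` prints the number of lock timeouts recorded in that file.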
 
Machine Details :
===========
---uname output---
Linux iopx3550a0277.storage.tucson.ibm.com 2.6.18-238.el5 #1 SMP Sun Dec 19 14:22:44 EST 2010 x86_64 x86_64 x86_64 GNU/Linux
 
Machine Type = x3550 7978-2TZ 

Platform : x86, 64-bit

kernel version : 2.6.18-238.el5

Libvirt   : libvirt-0.8.2-15.el5

Resolution : 
========
      This issue is discussed upstream
(http://permalink.gmane.org/gmane.comp.emulators.libvirt/28718) and Stefan has
provided the fix for this issue. 

libvirt commit IDs : 

Issue : Rebuild network filter for UML guests on updates 
libvirt mainline Commit ID : 38ba6e16eac14872ea3a2ce0bc6bffed6669582a

Issue : nwfilter: resolve deadlock between VM ops and filter update
libvirt mainline Commit ID : 4435f3c4779b8e2a63166ebe987979e921afa5e0


Patch : 
=====
    Below are the backported patches that fix this issue.

[patch1] libvirt : Network filter 

Rebuild network filter for UML guests on updates 
libvirt mainline Commit ID : 38ba6e16eac14872ea3a2ce0bc6bffed6669582a

[patch2] libvirt : nwfilter:  resolve deadlock between VM ops and filter update

nwfilter: resolve deadlock between VM ops and filter update
libvirt mainline commit ID : 4435f3c4779b8e2a63166ebe987979e921afa5e0

Verification : 
=======
I have rebuilt the libvirt package and installed it on the same machine.

The fix appears to work. I was able to pause, resume, stop, start, pause, stop, reboot, and start
again with no issues. I would consider the fix working.

Comment 1 IBM Bug Proxy 2011-04-19 06:00:52 UTC
Created attachment 493084 [details]
[patch1] libvirt : Network filter

Comment 2 IBM Bug Proxy 2011-04-19 06:00:55 UTC
Created attachment 493085 [details]
[patch2] libvirt : nwfilter:  resolve deadlock between VM ops and filter update

Comment 3 Jiri Denemark 2011-04-21 14:02:39 UTC
I don't really understand how this bug report is related to the patch fixing the deadlock between the qemu and nwfilter drivers, but since you did the verification and you are satisfied with what the patch did for you, I take this bz as a request to pull that patch into RHEL-5. Moreover, the patch fixes a real deadlock, so it's good to have anyway.

As for testing this by our QE, I was able to reproduce the deadlock with the following steps:

1. define a domain which references a filter
2. create a filter xml
3. run two infinite loops concurrently
   - one that defines/undefines the filter from step 2:
     virsh nwfilter-define filter.xml; virsh nwfilter-undefine <filter name>
   - and one that creates/destroys the domain from step 1:
     virsh start domain; virsh destroy domain

Without the fix the two loops will stop (quite soon in my testing) as a result of the deadlock. With the fix, both loops should continue forever.
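The two loops above could be scripted roughly as follows. This is a sketch under assumptions: VIRSH, FILTER_XML, FILTER_NAME, DOMAIN and ROUNDS are illustrative variables that do not appear in this report, and the loops are bounded rather than infinite so the script ends on a fixed build. Note that virsh nwfilter-undefine expects the filter's name or UUID, not the XML file, hence the separate FILTER_NAME.

```shell
#!/bin/sh
# Sketch of the two concurrent loops from the steps above.
# All variable names and defaults below are assumptions for illustration.
VIRSH=${VIRSH:-virsh}
FILTER_XML=${FILTER_XML:-filter.xml}    # XML file from step 2
FILTER_NAME=${FILTER_NAME:-myfilter}    # name attribute inside that XML
DOMAIN=${DOMAIN:-domain}                # domain from step 1
ROUNDS=${ROUNDS:-1000}

# Repeatedly define and undefine the filter.
filter_loop() {
    i=0
    while [ "$i" -lt "$ROUNDS" ]; do
        "$VIRSH" nwfilter-define "$FILTER_XML"
        "$VIRSH" nwfilter-undefine "$FILTER_NAME"
        i=$((i + 1))
    done
}

# Repeatedly start and destroy the domain that references the filter.
domain_loop() {
    i=0
    while [ "$i" -lt "$ROUNDS" ]; do
        "$VIRSH" start "$DOMAIN"
        "$VIRSH" destroy "$DOMAIN"
        i=$((i + 1))
    done
}

# Run both loops in the background and wait for them to finish.
# On an affected build, wait would be expected never to return.
run_both() {
    filter_loop &
    domain_loop &
    wait
}
```

Calling `run_both` after setting DOMAIN, FILTER_XML and FILTER_NAME starts both loops; individual virsh errors (such as "nwfilter is in use") do not stop the loops.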

Comment 5 IBM Bug Proxy 2011-04-21 20:30:57 UTC
------- Comment From mf.com 2011-04-21 16:21 EDT-------
(In reply to comment #22)
> I don't really understand how this bug report is related to the patch fixing
> deadlock between qemu and nwfilter drivers but since you did the verification
> and you are satisfied with what the patch did for you, I take this bz as a
> request to pull that patch in RHEL-5. Moreover, the patch fixes a real
> deadlock, so it's good to have anyway.

Yes, that is the request...to have the fix included in RHEL-5. Thanks!

Comment 7 Vivian Bian 2011-05-10 03:32:21 UTC
checked with 
libvirt-0.8.2-19.el5
kernel-2.6.18-259.el5
xen-3.0.3-130.el5


[root@localhost libvirt]# virsh suspend win2k3
Domain win2k3 suspended

[root@localhost libvirt]# virsh resume win2k3
Domain win2k3 resumed

[root@localhost libvirt]# virsh shutdown win2k3
Domain win2k3 is being shutdown

[root@localhost libvirt]# virsh start win2k3
Domain win2k3 started

[root@localhost libvirt]# virsh suspend win2k3
Domain win2k3 suspended

[root@localhost libvirt]# virsh shutdown win2k3
Domain win2k3 is being shutdown

[root@localhost libvirt]# virsh reboot win2k3
Domain win2k3 is being rebooted

[root@localhost libvirt]# virsh destroy win2k3
Domain win2k3 destroyed

[root@localhost libvirt]# virsh start win2k3
Domain win2k3 started

All of the operations completed without issues.

Setting the bug status to VERIFIED.
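The same sequence of operations could be replayed from a script. This is a minimal sketch, not the procedure QE actually used: lifecycle_check and VIRSH are illustrative names. One known gap: virsh shutdown only requests a guest shutdown, so a real script would need to wait for the domain to stop before the following start; that wait is omitted here.

```shell
#!/bin/sh
# Replay a sequence of lifecycle operations, stopping at the first failure.
# VIRSH and lifecycle_check are hypothetical names used for illustration.
VIRSH=${VIRSH:-virsh}

lifecycle_check() {
    dom=$1
    # Same order as the manual run above.
    for op in suspend resume shutdown start suspend shutdown reboot destroy start; do
        if ! "$VIRSH" "$op" "$dom"; then
            echo "FAILED: $op $dom" >&2
            return 1
        fi
    done
    echo "all operations on $dom succeeded"
}
```

Usage would be `lifecycle_check win2k3`; a non-zero return indicates which operation failed on stderr.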

Comment 8 Huming Jiang 2011-06-02 07:38:55 UTC
I could reproduce this bug with the following components on RHEL 5.6:
libvirt-0.8.2-15.el5
kernel-2.6.18-238.el5
xen-3.0.3-120.el5

Steps:

1. # cat nwfilter
<filter name='disallow-arp' chain='arp'>
  <rule action='drop' direction='inout' priority='500'/>
</filter>
# cat test2.sh 
for i in `seq 1000`;
 do  
virsh nwfilter-define nwfilter;
echo $i;
virsh nwfilter-undefine disallow-arp; 
done
# cat test.sh 
for i in `seq 1000`;
 do  
virsh start rh5.6;
echo $i;
virsh destroy rh5.6;
done 

2. # virsh edit rh5.6
Add "<filterref filter='allow-dhcp'/>" inside one of the "interface" nodes.

3. Run the two scripts concurrently.
Outputs:
# sh test2.sh
377
Network filter disallow-arp undefined

Network filter disallow-arp defined from nwfilter

378
error: Failed to undefine network filter disallow-arp
error: Invalid network filter: nwfilter is in use

Network filter disallow-arp defined from nwfilter

379
Network filter disallow-arp undefined

Network filter disallow-arp defined from nwfilter

380
error: Failed to undefine network filter disallow-arp
error: Invalid network filter: nwfilter is in use

Network filter disallow-arp defined from nwfilter

381
Network filter disallow-arp undefined

Network filter disallow-arp defined from nwfilter

382
error: Failed to undefine network filter disallow-arp
error: Invalid network filter: nwfilter is in use

Network filter disallow-arp defined from nwfilter

383
Network filter disallow-arp undefined

Network filter disallow-arp defined from nwfilter

384
error: Failed to undefine network filter disallow-arp
error: Invalid network filter: nwfilter is in use

Network filter disallow-arp defined from nwfilter

385

(deadlock)
# sh test.sh 
Domain rh5.6 started

1
Domain rh5.6 destroyed

Domain rh5.6 started

2
Domain rh5.6 destroyed

Domain rh5.6 started

3
Domain rh5.6 destroyed

Domain rh5.6 started

4
Domain rh5.6 destroyed

Domain rh5.6 started

5
Domain rh5.6 destroyed
(deadlock)
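Because both scripts simply hang once the deadlock hits, wrapping each virsh call in a time limit makes the stall show up as an error instead. The helper below is a sketch: run_with_deadline is a made-up name, and it assumes GNU timeout(1) is installed, which may not be the case on older hosts.

```shell
#!/bin/sh
# Run a command under a time limit and report a suspected hang.
# run_with_deadline is a hypothetical helper; it assumes timeout(1)
# from GNU coreutils is available on the machine running the test.
run_with_deadline() {
    limit=$1    # time limit in seconds
    shift
    status=0
    timeout "$limit" "$@" || status=$?
    if [ "$status" -eq 124 ]; then
        # timeout(1) exits with 124 when the command exceeded the limit.
        echo "HANG: '$*' did not finish within ${limit}s" >&2
    fi
    return "$status"
}
```

In test.sh, for example, `virsh start rh5.6` could become `run_with_deadline 60 virsh start rh5.6 || break`, where the 60-second limit is an arbitrary choice.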

Comment 9 IBM Bug Proxy 2011-06-08 11:10:30 UTC
------- Comment From markwiz.com 2011-06-08 07:03 EDT-------
Is this supposed to be fixed in RHEL 5.7?

Comment 10 Joseph Kachuck 2011-06-08 16:20:57 UTC
Hello,
It is accepted for RHEL 5.7. It should be fixed in libvirt-0.8.2-19.el5.

Thank You
Joe Kachuck

Comment 11 IBM Bug Proxy 2011-06-29 05:51:22 UTC
------- Comment From vahegde1.ibm.com 2011-06-29 01:44 EDT-------
Hi Red Hat,

We have Verified with :

kernel 2.6.18-268.el5
libvirt-0.8.2-20.el5
kvm-83-237.el5

Ran tests with Windows 2003 VM. No issues found. We consider this problem fixed.

Thanks for your support.

Comment 12 errata-xmlrpc 2011-07-21 10:31:42 UTC
An advisory has been issued which should help the problem
described in this bug report. This report is therefore being
closed with a resolution of ERRATA. For more information
on the solution and/or where to find the updated files,
please follow the link below. You may reopen this bug report
if the solution does not work for you.

http://rhn.redhat.com/errata/RHSA-2011-1019.html

