Bug 1143780

Summary: Deadlock on nwfilter when taking same concurrent jobs
Product: Red Hat Enterprise Linux 7
Reporter: Hu Jianwei <jiahu>
Component: libvirt
Assignee: Pavel Hrdina <phrdina>
Status: CLOSED ERRATA
QA Contact: Virtualization Bugs <virt-bugs>
Severity: high
Docs Contact:
Priority: medium
Version: 7.1
CC: dyuan, honzhang, mzhan, phrdina, rbalakri
Target Milestone: rc
Target Release: ---
Hardware: x86_64
OS: Linux
Whiteboard:
Fixed In Version: libvirt-1.2.8-7.el7
Doc Type: Bug Fix
Doc Text:
Story Points: ---
Clone Of:
Environment:
Last Closed: 2015-03-05 07:44:47 UTC
Type: Bug
Regression: ---
Mount Type: ---
Documentation: ---
CRM:
Verified Versions:
Category: ---
oVirt Team: ---
RHEL 7.3 requirements from Atomic Host:
Cloudforms Team: ---
Target Upstream Version:
Embargoed:
Bug Depends On: 1169409, 1202703
Bug Blocks:
Attachments:
Description Flags
libvirt_deadlock_libvirt-1.2.8-7.el7.x86_64 none

Description Hu Jianwei 2014-09-18 01:43:40 UTC
Description
libvirtd deadlocks on nwfilter when the same jobs run concurrently: defining/undefining an nwfilter while hot-attaching/detaching that nwfilter to a domain easily triggers the deadlock.

Version:
libvirt-1.2.8-2.el7.x86_64
qemu-kvm-rhev-2.1.0-3.el7.x86_64
kernel-3.10.0-123.el7.x86_64

How reproducible:
100%

Steps to Reproduce:
1. Prepare the interface and nwfilter XML files below.
[root@ibm-x3850x5-06 ~]# cat nic.xml
<interface type='network'>
  <mac address='02:54:00:36:c6:d0'/>
  <source network='default'/>
  <target dev='jianguo'/>
  <model type='virtio'/>
  <filterref filter='clean-traffic'/>
  <address type='pci' domain='0x0000' bus='0x00' slot='0x03' function='0x0'/>
</interface>
[root@ibm-x3850x5-06 ~]# cat nic1.xml
<interface type='network'>
  <mac address='02:54:00:36:c6:d0'/>
  <source network='default'/>
  <target dev='jianguo'/>
  <model type='virtio'/>
  <address type='pci' domain='0x0000' bus='0x00' slot='0x03' function='0x0'/>
</interface>
[root@ibm-x3850x5-06 ~]# cat clean-traffic.xml
<filter name='clean-traffic' chain='root'>
  <uuid>f3d9b618-9097-4b37-86a7-e804066e7fbe</uuid>
  <filterref filter='no-mac-spoofing'/>
  <filterref filter='no-ip-spoofing'/>
  <rule action='accept' direction='out' priority='-650'>
    <mac protocolid='ipv4'/>
  </rule>
  <filterref filter='allow-incoming-ipv4'/>
  <filterref filter='no-arp-spoofing'/>
  <rule action='accept' direction='inout' priority='-500'>
    <mac protocolid='arp'/>
  </rule>
  <filterref filter='no-other-l2-traffic'/>
  <filterref filter='qemu-announce-self'/>
</filter>

2. Run the two shell scripts below in two terminals at the same time.
[root@ibm-x3850x5-06 ~]# cat update-nwfilter.sh
#! /bin/sh -

while true
do
    virsh update-device r7 nic1.xml
    virsh update-device r7 nic.xml
done

[root@ibm-x3850x5-06 ~]# cat define_undefine_nwfilter.sh
#! /bin/sh -

while true
do
    virsh nwfilter-undefine clean-traffic
    virsh nwfilter-define clean-traffic.xml
done

3. Check virsh command status
[root@ibm-x3850x5-06 ~]# time virsh list --all
^C
real        2m32.586s
user        0m0.018s
sys        0m0.013s

[root@ibm-x3850x5-06 ~]# time virsh nwfilter-list
^C

real        5m43.071s
user        0m0.020s
sys        0m0.020s
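
A hang like the one in step 3 can also be detected without interrupting the command by hand. The following is a minimal sketch, assuming GNU coreutils `timeout` is available; the function name `check_responsive` and the 10-second limit in the usage lines are illustrative choices and not part of the original report.

```shell
#!/bin/sh
# check_responsive LIMIT CMD [ARGS...]
# Runs CMD under a time limit and reports whether it answered in time.
# GNU `timeout` exits with status 124 when the limit is reached.
check_responsive() {
    limit=$1
    shift
    status=0
    timeout "$limit" "$@" >/dev/null 2>&1 || status=$?
    if [ "$status" -eq 124 ]; then
        echo "HUNG: $*"
        return 1
    fi
    echo "OK (exit $status): $*"
    return 0
}

# Example usage against the commands from step 3 (needs a libvirt host):
#   check_responsive 10 virsh list --all
#   check_responsive 10 virsh nwfilter-list
```

A command that merely fails (for example because libvirtd is not running) is reported as OK with a non-zero exit status; only a command that exceeds the limit is reported as HUNG.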


Actual results:
As the steps above show, nwfilter- and domain-related commands block and never return a response.

Other virsh commands still get a response from libvirtd.
[root@ibm-x3850x5-06 ~]# time virsh pool-list --all
 Name                 State      Autostart
-------------------------------------------
 default              active     yes      


real        0m0.025s
user        0m0.011s
sys        0m0.007s
[root@ibm-x3850x5-06 ~]# time virsh net-list --all
 Name                 State      Autostart     Persistent
----------------------------------------------------------
 default              active     yes           yes
 host-bridge          active     no            yes


real        0m0.027s
user        0m0.009s
sys        0m0.007s

Expected results:
libvirtd should respond quickly and correctly.

Additional info:

Comment 2 Pavel Hrdina 2014-11-05 14:03:38 UTC
Upstream patch posted:

https://www.redhat.com/archives/libvir-list/2014-November/msg00108.html

Comment 3 Pavel Hrdina 2014-11-13 09:47:35 UTC
Upstream commit:

commit 41127244fb90f08cf5032a5d7553f5f0390d925e
Author: Pavel Hrdina <phrdina>
Date:   Wed Nov 5 14:28:57 2014 +0100

    nwfilter: fix deadlock caused updating network device and nwfilter
    
    Commit 6e5c79a1 tried to fix deadlock between nwfilter{Define,Undefine}
    and starting of guest, but this same deadlock exists for
    updating/attaching network device to domain.
    
    The deadlock was introduced by removing global QEMU driver lock because
    nwfilter was counting on this lock and ensure that all driver locks are
    locked inside of nwfilter{Define,Undefine}.
    
    This patch extends usage of virNWFilterReadLockFilterUpdates to prevent
    the deadlock for all possible paths in QEMU driver. LXC and UML drivers
    still have global lock.
    
    Resolves: https://bugzilla.redhat.com/show_bug.cgi?id=1143780
    
    Signed-off-by: Pavel Hrdina <phrdina>
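
The reader/writer arrangement the commit message describes can be pictured with a toy model. This is only a sketch under stated assumptions: a lock file and util-linux `flock` stand in for libvirt's in-process filter-update lock, the names `LOCK_FILE`, `as_reader` and `as_writer` are hypothetical, and it is assumed that nwfilter define/undefine hold the write side of the lock that virNWFilterReadLockFilterUpdates takes for reading. It does not model the per-domain locks involved in the real deadlock.

```shell
#!/bin/sh
# Toy model only: a lock file stands in for libvirt's in-process
# filter-update lock. LOCK_FILE and both function names are hypothetical.
LOCK_FILE=${LOCK_FILE:-/tmp/nwfilter-updates.lock}

# Models a QEMU driver path (device update/attach): shared lock, so
# several such operations can run side by side.
as_reader() {
    flock -s "$LOCK_FILE" "$@"
}

# Models nwfilter define/undefine (assumed to be the writer): exclusive
# lock, so it waits for running readers and keeps new ones out meanwhile.
as_writer() {
    flock -x "$LOCK_FILE" "$@"
}

# Example usage:
#   as_reader sleep 5 &                  # pretend a device update is running
#   as_writer echo "filter redefined"    # runs only after the reader is done
#   wait
```

The point of the model is the ordering: every path that will later touch a domain takes the shared lock first, so a writer can never hold the filter lock while waiting on a domain that a reader has already locked.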

Comment 6 Hu Jianwei 2014-11-18 16:10:37 UTC
I can still reproduce it with libvirt-1.2.8-7.el7.x86_64.

[root@ibm-x3850x5-06 ~]# rpm -q libvirt qemu-kvm-rhev
libvirt-1.2.8-7.el7.x86_64
qemu-kvm-rhev-2.1.2-8.el7.x86_64

In the first terminal:
[root@ibm-x3850x5-06 ~]# sh update-nwfilter.sh 
Device updated successfully

Device updated successfully

Device updated successfully

Device updated successfully

Device updated successfully

error: Failed to update device from nic.xml
error: End of file while reading data: Input/output error
error: Failed to reconnect to the hypervisor

error: failed to connect to the hypervisor
error: no valid connection
error: Failed to connect socket to '/var/run/libvirt/libvirt-sock': Connection refused

error: failed to connect to the hypervisor
error: no valid connection
error: Failed to connect socket to '/var/run/libvirt/libvirt-sock': Connection refused

error: failed to connect to the hypervisor
error: no valid connection
error: Failed to connect socket to '/var/run/libvirt/libvirt-sock': Connection refused

error: failed to connect to the hypervisor
error: no valid connection
error: Failed to connect socket to '/var/run/libvirt/libvirt-sock': Connection refused

error: failed to connect to the hypervisor
error: no valid connection
error: Failed to connect socket to '/var/run/libvirt/libvirt-sock': Connection refused

Device updated successfully

Device updated successfully

...


Device updated successfully

Device updated successfully

^C

In the second terminal:
[root@ibm-x3850x5-06 ~]# sh define_undefine_nwfilter.sh 
Network filter clean-traffic undefined

error: Failed to define network filter from clean-traffic.xml
error: End of file while reading data: Input/output error
error: Failed to reconnect to the hypervisor

error: failed to connect to the hypervisor
error: no valid connection
error: Failed to connect socket to '/var/run/libvirt/libvirt-sock': Connection refused

error: failed to connect to the hypervisor
error: no valid connection
error: Failed to connect socket to '/var/run/libvirt/libvirt-sock': Connection refused

error: failed to connect to the hypervisor
error: no valid connection
error: Failed to connect socket to '/var/run/libvirt/libvirt-sock': Connection refused

error: failed to connect to the hypervisor
error: no valid connection
error: Failed to connect socket to '/var/run/libvirt/libvirt-sock': Connection refused

error: failed to connect to the hypervisor
error: no valid connection
error: Failed to connect socket to '/var/run/libvirt/libvirt-sock': Connection refused

Network filter clean-traffic defined from clean-traffic.xml

error: Failed to undefine network filter clean-traffic
error: Requested operation is not valid: nwfilter is in use

Network filter clean-traffic defined from clean-traffic.xml

error: Failed to undefine network filter clean-traffic
error: Requested operation is not valid: nwfilter is in use

Network filter clean-traffic defined from clean-traffic.xml

error: Failed to undefine network filter clean-traffic
error: Requested operation is not valid: nwfilter is in use
^C

[root@ibm-x3850x5-06 ~]# time virsh list --all
^C

real	2m19.530s
user	0m0.031s
sys	0m0.012s
[root@ibm-x3850x5-06 ~]#

Comment 7 Hu Jianwei 2014-11-18 16:14:27 UTC
Created attachment 958656 [details]
libvirt_deadlock_libvirt-1.2.8-7.el7.x86_64

Please check the log output.

Comment 8 Pavel Hrdina 2014-12-01 14:46:50 UTC
Ouch, I had completely forgotten about this issue. It's a different bug: removing the nwfilter from a network interface makes libvirt crash with a segfault. I'll create a bug for rhel-7.1 and fix the issue both upstream and downstream.

Comment 9 Hu Jianwei 2014-12-01 15:44:36 UTC
OK, thanks for your reply. The issue you pointed out blocks this bug, so I won't verify it yet and will wait for your new patch.

Comment 10 Hu Jianwei 2014-12-23 04:51:23 UTC
I ran those two scripts concurrently for about 2 hours and cannot reproduce it any more.

The first terminal:
[root@ibm-x3850x5-06 ~]# sh update-nwfilter.sh 
...
error: Failed to update device from nic.xml
error: operation failed: failed to add new filter rules to 'vnet0' - attempting to restore old rules

Device updated successfully

...

The second terminal:
[root@ibm-x3850x5-06 ~]# sh define_undefine_nwfilter.sh
...
Network filter clean-traffic defined from clean-traffic.xml

Network filter clean-traffic undefined

Network filter clean-traffic defined from clean-traffic.xml
...

After 2 hours, check the output of the virsh commands.
[root@ibm-x3850x5-06 ~]# time virsh list --all
 Id    Name                           State
----------------------------------------------------
 37    r7                             running


real	0m0.037s
user	0m0.025s
sys	0m0.010s

[root@ibm-x3850x5-06 ~]# time virsh nwfilter-list
 UUID                                  Name                 
------------------------------------------------------------------
 c09829de-5380-4608-a827-f6a10d300784  allow-arp           
 d339ae40-c114-446b-a6e4-d89e17e8d0a0  allow-dhcp          
 773b6909-d01f-4223-9b62-c75910a6e0ab  allow-dhcp-server   
 e112e697-2b01-4253-b8d7-d88273ad6419  allow-incoming-ipv4 
 b1d49b06-7fdd-4bd4-8438-ac4cc125d09d  allow-ipv4          
 f3d9b618-9097-4b37-86a7-e804066e7fbe  clean-traffic       
...


real	0m0.033s
user	0m0.028s
sys	0m0.004s


Move to Verified.
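
The two-hour concurrent run used for verification could also be driven from a single script instead of two terminals. This is a minimal sketch, assuming GNU coreutils `timeout`; the function name `run_both` and the durations in the usage lines are illustrative and not taken from the report.

```shell
#!/bin/sh
# run_both DURATION SCRIPT_A SCRIPT_B
# Starts both looping scripts in the background, each under `timeout`,
# and returns once both have been stopped.
run_both() {
    duration=$1
    timeout "$duration" sh "$2" &
    timeout "$duration" sh "$3" &
    wait
}

# Example usage (about two hours, script names from the original report):
#   run_both 7200 update-nwfilter.sh define_undefine_nwfilter.sh
#   timeout 10 virsh list --all || echo "virsh did not answer in time"
```

Running a bounded `virsh list --all` afterwards gives a pass/fail signal for the deadlock without anyone watching the terminals.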

Comment 12 errata-xmlrpc 2015-03-05 07:44:47 UTC
Since the problem described in this bug report should be
resolved in a recent advisory, it has been closed with a
resolution of ERRATA.

For information on the advisory, and where to find the updated
files, follow the link below.

If the solution does not work for you, open a new bug report.

https://rhn.redhat.com/errata/RHSA-2015-0323.html