Bug 584412

Summary: transmission stops when tap does not consume
Product: Red Hat Enterprise Linux 5
Reporter: Michael S. Tsirkin <mst>
Component: kernel
Assignee: Michael S. Tsirkin <mst>
Status: CLOSED ERRATA
QA Contact: Virtualization Bugs <virt-bugs>
Severity: medium
Docs Contact:
Priority: high
Version: 5.5.z
CC: cww, dhoward, jpirko, jplans, mjenner, mlessard, mwagner, tburke, wquan, yvugenfi
Target Milestone: rc
Keywords: ZStream
Target Release: 5.3.z
Hardware: All
OS: Linux
Whiteboard:
Fixed In Version:
Doc Type: Bug Fix
Doc Text:
Story Points: ---
Clone Of:
: 584428 665293 672619
Environment:
Last Closed: 2011-01-13 21:28:36 UTC
Type: ---
Regression: ---
Mount Type: ---
Documentation: ---
CRM:
Verified Versions:
Category: ---
oVirt Team: ---
RHEL 7.3 requirements from Atomic Host:
Cloudforms Team: ---
Target Upstream Version:
Embargoed:
Bug Depends On:    
Bug Blocks: 580949, 584428, 591842, 643348, 665293, 665295, 666367, 672619    

Description Michael S. Tsirkin 2010-04-21 14:28:50 UTC
Description of problem:

During MS WHQL tests we hit an assertion from the test in the form of a blue
screen. The reason for the assertion is that the packets submitted by the network
layer are not returned: under the hood the driver adds packets to the ring,
but we never get an interrupt from QEMU to indicate that those packets were
transmitted. At the moment of the blue screen the transmit ring is full.


I also observed that when this happens, the qemu process
is unkillable.

The explanation for this is as follows:
tap1 sends packets and tap2 does not consume them. As a result,
tap1 is blocked forever; in particular, it cannot be closed.
We get messages like:
unregister_netdevice: waiting for tap1 to become free
in the log.
This happens because tun/tap devices can hang on to skbs indefinitely.
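
To illustrate the stall described above, here is a toy model in Python. It is only a sketch under stated assumptions: the class ToyTxRing, the function run, and the ring size are hypothetical names and values chosen for illustration, and none of this is the actual tun/tap or virtio code. A packet occupies a ring slot until the receiving side consumes it; if the receiver never consumes, the ring fills and the sender can make no further progress.

```python
from collections import deque


class ToyTxRing:
    """Toy fixed-size transmit ring (hypothetical; not kernel or QEMU code)."""

    def __init__(self, size):
        self.size = size
        self.in_flight = deque()  # packets submitted but not yet consumed

    def submit(self, pkt):
        """Queue a packet; return False if every slot is still in use."""
        if len(self.in_flight) >= self.size:
            return False
        self.in_flight.append(pkt)
        return True

    def complete_one(self):
        """Model the receiver consuming one packet, which frees a slot."""
        if self.in_flight:
            return self.in_flight.popleft()
        return None


def run(ring, packets, receiver_consumes):
    """Try to send `packets` packets; return how many were accepted."""
    sent = 0
    for pkt in range(packets):
        if receiver_consumes:
            ring.complete_one()
        if not ring.submit(pkt):
            break  # ring is full: the sender is stuck from here on
        sent += 1
    return sent
```

With a 4-slot ring and 10 packets, a consuming receiver lets all 10 through, while a non-consuming receiver stops the sender after 4.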



Version-Release number of selected component (if applicable):
2.6.18-194

How reproducible:
always

Steps to Reproduce:
The problem is easiest to reproduce with 2 Linux
guests:

1. run 2 VMs on the same host
2. ifdown on one side, ping -b -s 1472 on the other
3. you will lock out the second VM.

  
Actual results:

all traffic from the second VM is blocked
on the host, after kill -9 on the pid of the second VM,
   the process does not die.
dmesg log shows:
  unregister_netdevice: waiting for tap1 to become free

Expected results:

traffic to other destinations should continue even if one
destination is stuck.
kill -9 on host should kill qemu and guest

dmesg should be clean
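
The "dmesg should be clean" check could be scripted. The following Python sketch scans log text for the message quoted in this report. The function name find_stuck_devices and the regular expression are assumptions made for illustration; the pattern is derived only from the log lines shown in this bug, with the "Usage count" suffix treated as optional, and other kernels may word the message differently.

```python
import re

# Pattern based on the messages quoted in this report (an assumption:
# the exact wording may vary between kernel versions).
PATTERN = re.compile(
    r"unregister_netdevice: waiting for (\S+) to become free"
    r"(?:\. Usage count = (\d+))?"
)


def find_stuck_devices(log_text):
    """Return {device: usage_count or None} for each stuck device found."""
    stuck = {}
    for line in log_text.splitlines():
        match = PATTERN.search(line)
        if match:
            dev, count = match.group(1), match.group(2)
            stuck[dev] = int(count) if count is not None else None
    return stuck
```

An empty result means no such message was found in the text passed in; the text itself could come from the output of dmesg captured on the host.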

Additional info:
Yan, please attach additional info as appropriate.

Comment 1 Michael S. Tsirkin 2010-04-21 14:35:57 UTC
Brew build with the fix:
http://brewweb.devel.redhat.com/brew/taskinfo?taskID=2376934
The bug is reported fixed on this build.

Comment 2 Yvugenfi@redhat.com 2010-04-21 14:42:13 UTC
Brew build was tested by QE team with DTM 1.5 (the tool for running WHQL tests) on Windows 7, Windows 2008 and Windows 2008 R2. 

Blue screens caused by the hung transfer did not occur during those tests.

Comment 5 Jarod Wilson 2010-05-25 21:12:36 UTC
in kernel-2.6.18-200.el5
You can download this test kernel from http://people.redhat.com/jwilson/el5

Detailed testing feedback is always welcomed.

Comment 6 Quan Wenli 2010-05-27 05:06:26 UTC
Hi Michael S. Tsirkin,

I tried to reproduce this bug with the following steps, but failed. Could you help check whether I misunderstood something?

1.Host: 2.6.18-194.el5
2.Host:
ps -ef |grep qemu
root      7681  4933 17 12:38 pts/7    00:02:05 /usr/libexec/qemu-kvm -M pc -m 2048 -smp 2 -name guest1 -no-kvm-pit-reinjection -rtc-td-hack -startdate now -drive file=/mnt/rhel5.5-32-virtio.qcow2,if=virtio,boot=on,cache=none -net nic,macaddr=00:00:12:31:4A:01,vlan=0 -net tap,scprit=/etc/ifup,vlan=0 -usb -vnc :1 -monitor stdio
root      7968  5006 13 12:45 pts/8    00:00:38 /usr/libexec/qemu-kvm -M pc -m 2048 -smp 2 -name guest2 -no-kvm-pit-reinjection -rtc-td-hack -startdate now -drive file=/mnt/rhel5.5-64-virtio.qcow2,if=virtio,boot=on,cache=none -net nic,macaddr=00:00:12:31:4A:02,vlan=0,model=virtio -net tap,scprit=/etc/ifup,vlan=0 -usb -vnc :2 -monitor stdio
3.ifdown nic on the guest1
4.ping  -b -s 1472 guest1_ip on the guest2
5.Host: kill -9 7968 (guest2); the process dies.

Comment 7 Michael S. Tsirkin 2010-07-06 15:22:04 UTC
*** Bug 586829 has been marked as a duplicate of this bug. ***

Comment 9 Quan Wenli 2010-09-13 10:42:24 UTC
Reproduced it with kernel-2.6.18-194, following the steps from bug 584428#c11.

Steps:

1. force arp in guest A to match guest B
arp -i eth0 -s <ip for guest B> <mac for guest B>
2. ping guest B, we should get back packets
e.g. with -c 1
3. ifdown guest B
4. ping guest B_ip -i 0.01 
Keep the ping running for about 4 hours or more, until guest A could receive packets from guest B.
5. kill -9 13498 (process of guest A; the process does not die)
ps -ef |grep qemu-kvm
root     13498  4152  0 Sep10 pts/1    00:02:59 [qemu-kvm] <defunct>
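
The check for a process that survives kill -9 can be sketched in Python by scanning `ps -ef` output for entries marked `<defunct>`, as in the line above. The function name find_defunct is hypothetical, and the parsing assumes the eight-column `ps -ef` layout shown in this report (UID, PID, PPID, C, STIME, TTY, TIME, CMD).

```python
def find_defunct(ps_output, name="qemu-kvm"):
    """Return PIDs of processes whose command contains `name` and <defunct>."""
    pids = []
    for line in ps_output.splitlines():
        # Split into at most 8 fields so the command keeps its spaces.
        fields = line.split(None, 7)
        if len(fields) < 8 or not fields[1].isdigit():
            continue  # header line or unexpected format
        cmd = fields[7]
        if name in cmd and "<defunct>" in cmd:
            pids.append(int(fields[1]))
    return pids
```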


dmesg log shows:
breth0: port 2(tap0) entering disabled state
unregister_netdevice: waiting for tap0 to become free. Usage count = 1
unregister_netdevice: waiting for tap0 to become free. Usage count = 1

And it PASSED in kernel-2.6.18-209.
Thanks~~

Comment 24 errata-xmlrpc 2011-01-13 21:28:36 UTC
An advisory has been issued which should help with the problem
described in this bug report. This report is therefore being
closed with a resolution of ERRATA. For more information
on the solution and/or where to find the updated files,
please follow the link below. You may reopen this bug report
if the solution does not work for you.

http://rhn.redhat.com/errata/RHSA-2011-0017.html