Bug 672619

Summary:	transmission stops when tap does not consume
Product:	Red Hat Enterprise Linux 5	Reporter:	Jarod Wilson <jarod>
Component:	kernel	Assignee:	Michael S. Tsirkin <mst>
Status:	CLOSED ERRATA	QA Contact:	Virtualization Bugs <virt-bugs>
Severity:	medium	Docs Contact:
Priority:	high
Version:	5.7	CC:	cww, dhoward, dtian, hateya, hjia, jplans, mjenner, mlessard, mst, mwagner, ndai, qcai, qzhang, tburke, wquan, yvugenfi
Target Milestone:	rc	Keywords:	ZStream
Target Release:	---
Hardware:	All
OS:	Linux
Whiteboard:
Fixed In Version:		Doc Type:	Bug Fix
Doc Text:		Story Points:	---
Clone Of:	584412	Environment:
Last Closed:	2011-07-21 10:12:19 UTC	Type:	---
Regression:	---	Mount Type:	---
Documentation:	---	CRM:
Verified Versions:		Category:	---
oVirt Team:	---	RHEL 7.3 requirements from Atomic Host:
Cloudforms Team:	---	Target Upstream Version:
Embargoed:
Bug Depends On:	584412
Bug Blocks:	580949, 584428, 591842, 643348, 665293, 665295, 666367

Description Jarod Wilson 2011-01-25 18:15:04 UTC

+++ This bug was initially created as a clone of Bug #584412 +++

Description of problem:

During MS WHQL tests we hit an assertion from the test in the form of a blue
screen. The reason for the assertion is that the packets submitted by the network
layer are not returned. Under the hood, the driver adds packets to the ring,
but we never get an interrupt from QEMU to indicate that those packets were
transmitted. At the moment of the blue screen, the transmit ring is full.


I also observed that when this happens, the qemu process
is unkillable.

The explanation for this is as follows:
tap1 sends packets, tap2 does not consume them, as a result
tap1 gets blocked forever, in particular it can not be closed.
We get messages:
unregister_netdevice: waiting for tap1 to become free
in the log.
This happens because tun/tap devices can hang on to skbs indefinitely.



Version-Release number of selected component (if applicable):
2.6.18-194

How reproducible:
always

Steps to Reproduce:
The problem is easiest to reproduce with 2 Linux
guests:

1. Run 2 VMs on the same host.
2. Run ifdown on one side and ping -b -s 1472 on the other.
3. This locks out the second VM.

  
Actual results:

All traffic from the second VM is blocked.
On the host, after kill -9 on the pid of the second VM,
   the process does not die.
dmesg log shows:
  unregister_netdevice: waiting for tap1 to become free
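
A quick way to check a host for this symptom is to scan the kernel log for the message above. The following is only a minimal sketch: the function name `stuck_netdevs` is made up for illustration, and it assumes the log lines have the format quoted in this report.

```shell
# Hypothetical helper (not part of any Red Hat tool): read kernel log text on
# stdin and print each network device that unregister_netdevice is reported
# to be waiting on, once per device.
stuck_netdevs() {
    sed -n 's/.*unregister_netdevice: waiting for \([^ ]*\) to become free.*/\1/p' |
        sort -u
}

# Example input in the format quoted in this report:
printf '%s\n' 'unregister_netdevice: waiting for tap1 to become free' | stuck_netdevs
```

On a host this could be used as `dmesg | stuck_netdevs`; no output means no such message is present in the log.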

Expected results:

traffic to other destinations should continue even if one
destination is stuck.
kill -9 on host should kill qemu and guest

dmesg should be clean

Additional info:
yan, pls attach additional info as appropriate.

--- Additional comment from mst on 2010-04-21 10:35:57 EDT ---

Brew build with the fix:
http://brewweb.devel.redhat.com/brew/taskinfo?taskID=2376934
The bug is reported fixed on this build.

--- Additional comment from yvugenfi on 2010-04-21 10:42:13 EDT ---

Brew build was tested by QE team with DTM 1.5 (the tool for running WHQL tests) on Windows 7, Windows 2008 and Windows 2008 R2. 

Blue screens resulting from the hung transfer were not experienced during those tests.

--- Additional comment from lwang on 2010-05-12 08:28:38 EDT ---

patch posted on 4/21/10 10:46 AM EDT. move to POST

--- Additional comment from jarod on 2010-05-21 16:39:41 EDT ---

Committing the following to kernel build 2.6.18-200.el5:
	- [net] tun: orphan an skb on tx (Michael S. Tsirkin) [584412]
The patch and discussion about it can be found here:
	http://patchwork.usersys.redhat.com/patch/24274/

--- Additional comment from jarod on 2010-05-25 17:12:36 EDT ---

in kernel-2.6.18-200.el5
You can download this test kernel from http://people.redhat.com/jwilson/el5

Detailed testing feedback is always welcomed.

--- Additional comment from wquan on 2010-05-27 01:06:26 EDT ---

hi, Michael S. Tsirkin

I tried to reproduce this bug with the following steps, but failed. Could you help check whether I misunderstood something?

1.Host: 2.6.18-194.el5
2.Host:
ps -ef |grep qemu
root      7681  4933 17 12:38 pts/7    00:02:05 /usr/libexec/qemu-kvm -M pc -m 2048 -smp 2 -name guest1 -no-kvm-pit-reinjection -rtc-td-hack -startdate now -drive file=/mnt/rhel5.5-32-virtio.qcow2,if=virtio,boot=on,cache=none -net nic,macaddr=00:00:12:31:4A:01,vlan=0 -net tap,scprit=/etc/ifup,vlan=0 -usb -vnc :1 -monitor stdio
root      7968  5006 13 12:45 pts/8    00:00:38 /usr/libexec/qemu-kvm -M pc -m 2048 -smp 2 -name guest2 -no-kvm-pit-reinjection -rtc-td-hack -startdate now -drive file=/mnt/rhel5.5-64-virtio.qcow2,if=virtio,boot=on,cache=none -net nic,macaddr=00:00:12:31:4A:02,vlan=0,model=virtio -net tap,scprit=/etc/ifup,vlan=0 -usb -vnc :2 -monitor stdio
3.ifdown nic on the guest1
4.ping  -b -s 1472 guest1_ip on the guest2
5.Host: kill -9 7968 (guest2) process die.

--- Additional comment from mst on 2010-07-06 11:22:04 EDT ---

*** Bug 586829 has been marked as a duplicate of this bug. ***

--- Additional comment from errata-xmlrpc on 2010-07-12 10:19:57 EDT ---

Bug report changed to ON_QA status by Errata System.
A QE request has been submitted for advisory RHBA-2010:9700-01
http://errata.devel.redhat.com/errata/show/9700

--- Additional comment from wquan on 2010-09-13 06:42:24 EDT ---

Reproduced it with kernel-2.6.18-194, following the steps from bug 584428#c11.

Steps:

1. force arp in guest A to match guest B
arp -i eth0 -s <ip for guest B> <mac for guest B>
2. ping guest B, we should get back packets
e.g. with -c 1
3. ifdown guest B
4. ping guest B_ip -i 0.01 
Keep the ping running for about 4 hours or more, until guest A can receive packets from guest B.
5. kill -9 13498 (process of guest A,process does not die)
ps -ef |grep qemu-kvm
root     13498  4152  0 Sep10 pts/1    00:02:59 [qemu-kvm] <defunct>
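
Steps 1 to 4 above, as run inside guest A, can be sketched as a small script. This is an illustration under assumptions, not the script QE used: the function names and the `DRY_RUN` switch are invented here, `eth0` is taken from step 1, and the IP and MAC of guest B must be supplied by the caller. Note that `ping -i 0.01` normally requires root.

```shell
# With DRY_RUN set, commands are printed instead of executed.
run() {
    if [ -n "${DRY_RUN:-}" ]; then
        echo "$*"
    else
        "$@"
    fi
}

# Hypothetical wrapper for the guest A side of the reproducer.
repro_guest_a() {
    ip_b="$1"
    mac_b="$2"
    run arp -i eth0 -s "$ip_b" "$mac_b"   # step 1: static ARP entry for guest B
    run ping -c 1 "$ip_b"                 # step 2: confirm guest B answers
    # step 3 happens in guest B, so wait for the operator unless dry-running
    if [ -z "${DRY_RUN:-}" ]; then
        printf 'Run ifdown in guest B, then press Enter: '
        read -r reply
    fi
    run ping -i 0.01 "$ip_b"              # step 4: keep pinging until interrupted
}
```

Calling `DRY_RUN=1 repro_guest_a <ip for guest B> <mac for guest B>` first shows the commands without touching the network. Step 5 (kill -9 of the qemu-kvm process) is done on the host and is not covered by this sketch.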


dmesg log shows:
breth0: port 2(tap0) entering disabled state
unregister_netdevice: waiting for tap0 to become free. Usage count = 1
unregister_netdevice: waiting for tap0 to become free. Usage count = 1

And it PASSED in kernel-2.6.18-209.
Thanks~~

--- Additional comment from pm-rhel on 2010-10-15 07:13:58 EDT ---

This bug has been copied as 5.5 z-stream (EUS) bug #643348 and now must be
resolved in the current update release, set blocker flag.

--- Additional comment from wquan on 2010-12-13 22:04:07 EST ---

As in comment #9 of bug #643348, I can also reproduce this bug on kernel 2.6.18-235.el5 using the steps from that comment, so I am re-assigning this bug.

--- Additional comment from bburns on 2010-12-21 09:47:01 EST ---

Michael, this is a proposed blocker and flagged for z-stream. Is a fix soon to be posted?

--- Additional comment from mst on 2010-12-21 10:07:40 EST ---

A workaround at the moment is to set the
tx queue length to 0 for the malicious guest or
kill the malicious (non consuming) guest.
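
The first workaround could look roughly like the sketch below, run on the host as root. The helper name, the `DRY_RUN` switch, and the assumption that the queue length is set on the host tap device backing the non-consuming guest are illustrative and not taken from this report; `ip link set ... txqueuelen` is the standard iproute2 syntax.

```shell
# Hypothetical helper: set the tx queue length of one tap device to 0.
# With DRY_RUN set, the command is printed instead of executed.
set_txqlen_zero() {
    dev="$1"
    if [ -n "${DRY_RUN:-}" ]; then
        echo "ip link set dev $dev txqueuelen 0"
    else
        ip link set dev "$dev" txqueuelen 0
    fi
}
```

For example, `set_txqlen_zero tap1` would apply it to tap1; the device name depends on the host setup.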

--- Additional comment from jplans on 2010-12-22 10:49:30 EST ---

After the RHEL discussions, we still need this resolved ASAP. Given the timeline ahead, we would like to propose 5.7.0 / 5.6.z instead (as 5.5.z is not approved for EUS). Thanks, Jose.

--- Additional comment from pm-rhel on 2010-12-22 11:30:53 EST ---

GSS has reviewed this bug and agreed that it should also be
included/released in one or more of the older still active and supported
releases (asynchronous Errata Advisory, Extended Update Support stream or
in an Advanced Mission Critical Long Life stream).

Blocker flag was set to ? and exception and fast flags were cleared.
This action ensures that this bugzilla will be included in the current
release and the customer who receives this patch will not see a
regression.

--- Additional comment from mst on 2010-12-22 13:10:51 EST ---

Brew build here
http://brewweb.devel.redhat.com/brew/taskinfo?taskID=2996598
It's a large change.
Can we have virt QE check this with a variety of workloads?

--- Additional comment from mst on 2010-12-23 01:42:34 EST ---

I mean more like some general stability test ideally
with non-virt users of tun like a VPN
(assuming we have such tests).

--- Additional comment from mst on 2010-12-23 02:27:16 EST ---

Patch sent.

Message-ID: <20101222211152.GA13148>

--- Additional comment from mst on 2011-01-04 12:25:21 EST ---

Created attachment 471719 [details]
[RHEL5.7/5.6.z untested PATCH] tun: introduce tun_file. bz 58441

This patch was posted:

Date: Wed, 22 Dec 2010 23:11:52 +0200
From: "Michael S. Tsirkin" <mst>
Subject: [RHEL5.7/5.6.z PATCH] tun: introduce tun_file. bz 584412
Message-ID: <20101222211152.GA13148>

--- Additional comment from mst on 2011-01-04 12:28:04 EST ---

I have attached the patch to the BZ for your convenience.
Please note it was not reviewed yet, and underwent only very
light developer testing.

--- Additional comment from dlaor on 2011-01-10 08:07:12 EST ---

Any updates with review?

--- Additional comment from mst on 2011-01-10 10:14:40 EST ---

Got ack from Herbert. The patch is large and intrusive
so review might take a while.

--- Additional comment from errata-xmlrpc on 2011-01-13 05:12:06 EST ---

Bug report changed to RELEASE_PENDING status by Errata System.
Advisory RHSA-2011:0017-38 has been changed to PUSH_READY status.
http://errata.devel.redhat.com/errata/show/9700

--- Additional comment from errata-xmlrpc on 2011-01-13 16:28:36 EST ---

An advisory has been issued which should help the problem
described in this bug report. This report is therefore being
closed with a resolution of ERRATA. For more information
on the solution and/or where to find the updated files,
please follow the link below. You may reopen this bug report
if the solution does not work for you.

http://rhn.redhat.com/errata/RHSA-2011-0017.html

--- Additional comment from mst on 2011-01-25 13:02:09 EST ---

So this got closed but we still need to fix it in 5.6.z and 5.7.
What to do?

Comment 3 RHEL Program Management 2011-02-01 16:56:34 UTC

This request was evaluated by Red Hat Product Management for inclusion in a Red
Hat Enterprise Linux maintenance release.  Product Management has requested
further review of this request by Red Hat Engineering, for potential
inclusion in a Red Hat Enterprise Linux Update release for currently deployed
products.  This request is not yet committed for inclusion in an Update
release.

Comment 10 Jarod Wilson 2011-03-03 20:34:28 UTC

in kernel-2.6.18-246.el5
You can download this test kernel (or newer) from http://people.redhat.com/jwilson/el5

Detailed testing feedback is always welcomed.

Comment 12 Quan Wenli 2011-03-10 10:23:22 UTC

Passed verification with kernel-2.6.18-246.el5 and kvm-83-227.el5 using the same steps as in comment #1.

Steps:

1. force arp in guest A to match guest B
arp -i eth0 -s <ip for guest B> <mac for guest B>
2. ping guest B, we should get back packets
e.g. with -c 1
3. ifdown guest B
4. ping guest B_ip -i 0.01 
Keep the ping running for about 4 hours or more, until guest A can receive
packets from guest B.
5. kill -9 13498 (process of guest A,process does not die)
ps -ef |grep qemu-kvm
root     13498  4152  0 Sep10 pts/1    00:02:59 [qemu-kvm] <defunct>

result:

dmesg shows no messages like:

breth0: port 2(tap0) entering disabled state
unregister_netdevice: waiting for tap0 to become free. Usage count = 1
unregister_netdevice: waiting for tap0 to become free. Usage count = 1

Comment 14 Qunfang Zhang 2011-06-13 02:16:35 UTC

According to Comment 12 and Comment 13, set the status to verified.

Comment 15 errata-xmlrpc 2011-07-21 10:12:19 UTC

An advisory has been issued which should help the problem
described in this bug report. This report is therefore being
closed with a resolution of ERRATA. For more information
on the solution and/or where to find the updated files,
please follow the link below. You may reopen this bug report
if the solution does not work for you.

http://rhn.redhat.com/errata/RHSA-2011-1065.html

Comment 16 Jiri Pirko 2012-11-21 10:21:06 UTC

*** Bug 589614 has been marked as a duplicate of this bug. ***