+++ This bug was initially created as a clone of Bug #584412 +++

Description of problem:
During MS WHQL tests we are hitting an assertion from the test in the form of a blue screen. The reason for the assertion is that the packets submitted by the network layer are not returned (under the hood the driver adds packets to the ring, but we never get an interrupt from QEMU to indicate that those packets were transmitted; at the moment of the blue screen the transmit ring is full).

I also observed that when this happens, the qemu process is unkillable. The explanation for this is as follows: tap1 sends packets, tap2 does not consume them, and as a result tap1 gets blocked forever; in particular it cannot be closed. We get these messages in the log:
unregister_netdevice: waiting for tap1 to become free

This happens because tun/tap devices can hang on to skbs indefinitely.

Version-Release number of selected component (if applicable):
2.6.18-194

How reproducible:
always

Steps to Reproduce:
The problem is easiest to reproduce with 2 Linux guests:
1. run 2 VMs on the same host
2. ifdown on the one side, ping -b -s 1472 on the other
3. you will lock out the second VM

Actual results:
All traffic from the second VM is blocked on the host. After kill -9 for the pid of the second VM, the process does not die. The dmesg log shows:
unregister_netdevice: waiting for tap1 to become free

Expected results:
Traffic to other destinations should continue even if one destination is stuck.
kill -9 on host should kill qemu and guest dmesg should be clean

Additional info:
yan, pls attach additional info as appropriate.

--- Additional comment from mst on 2010-04-21 10:35:57 EDT ---

brew build with fix
http://brewweb.devel.redhat.com/brew/taskinfo?taskID=2376934
bug is reported fixed on this build

--- Additional comment from yvugenfi on 2010-04-21 10:42:13 EDT ---

Brew build was tested by QE team with DTM 1.5 (the tool for running WHQL tests) on Windows 7, Windows 2008 and Windows 2008 R2.
Blue screens as a result of the hung transfer were not experienced during those tests.

--- Additional comment from lwang on 2010-05-12 08:28:38 EDT ---

patch posted on 4/21/10 10:46 AM EDT. move to POST

--- Additional comment from jarod on 2010-05-21 16:39:41 EDT ---

Committing the following to kernel build 2.6.18-200.el5:
- [net] tun: orphan an skb on tx (Michael S. Tsirkin) [584412]

The patch and discussion about it can be found here:
http://patchwork.usersys.redhat.com/patch/24274/

--- Additional comment from jarod on 2010-05-25 17:12:36 EDT ---

in kernel-2.6.18-200.el5

You can download this test kernel from http://people.redhat.com/jwilson/el5

Detailed testing feedback is always welcomed.

--- Additional comment from wquan on 2010-05-27 01:06:26 EDT ---

hi, Michael S. Tsirkin

I tried to reproduce this bug with the following steps, but failed. Could you help check whether I have misunderstood something?

1. Host: 2.6.18-194.el5
2. Host: ps -ef |grep qemu
root 7681 4933 17 12:38 pts/7 00:02:05 /usr/libexec/qemu-kvm -M pc -m 2048 -smp 2 -name guest1 -no-kvm-pit-reinjection -rtc-td-hack -startdate now -drive file=/mnt/rhel5.5-32-virtio.qcow2,if=virtio,boot=on,cache=none -net nic,macaddr=00:00:12:31:4A:01,vlan=0 -net tap,scprit=/etc/ifup,vlan=0 -usb -vnc :1 -monitor stdio
root 7968 5006 13 12:45 pts/8 00:00:38 /usr/libexec/qemu-kvm -M pc -m 2048 -smp 2 -name guest2 -no-kvm-pit-reinjection -rtc-td-hack -startdate now -drive file=/mnt/rhel5.5-64-virtio.qcow2,if=virtio,boot=on,cache=none -net nic,macaddr=00:00:12:31:4A:02,vlan=0,model=virtio -net tap,scprit=/etc/ifup,vlan=0 -usb -vnc :2 -monitor stdio
3. ifdown nic on guest1
4. ping -b -s 1472 guest1_ip on guest2
5. Host: kill -9 7968 (guest2)
The process dies.

--- Additional comment from mst on 2010-07-06 11:22:04 EDT ---

*** Bug 586829 has been marked as a duplicate of this bug. ***

--- Additional comment from errata-xmlrpc on 2010-07-12 10:19:57 EDT ---

Bug report changed to ON_QA status by Errata System.
A QE request has been submitted for advisory RHBA-2010:9700-01
http://errata.devel.redhat.com/errata/show/9700

--- Additional comment from wquan on 2010-09-13 06:42:24 EDT ---

Reproduced it with kernel-2.6.18-194 according to the steps from bug 584428#c11.

Steps:
1. force arp in guest A to match guest B
   arp -i eth0 -s <ip for guest B> <mac for guest B>
2. ping guest B, we should get back packets, e.g. with -c 1
3. ifdown guest B
4. ping guest B_ip -i 0.01
   keep the ping running for about 4 hours or more, until finding guest A could receive packets from guest B.
5. kill -9 13498 (process of guest A; the process does not die)
   ps -ef |grep qemu-kvm
   root 13498 4152 0 Sep10 pts/1 00:02:59 [qemu-kvm] <defunct>

dmesg log shows:
breth0: port 2(tap0) entering disabled state
unregister_netdevice: waiting for tap0 to become free. Usage count = 1
unregister_netdevice: waiting for tap0 to become free. Usage count = 1

And it PASSED in kernel-2.6.18-209.

Thanks~~

--- Additional comment from pm-rhel on 2010-10-15 07:13:58 EDT ---

This bug has been copied as 5.5 z-stream (EUS) bug #643348 and now must be resolved in the current update release, set blocker flag.

--- Additional comment from wquan on 2010-12-13 22:04:07 EST ---

As in comment #9 in bug #643348, I can also reproduce this bug on kernel 2.6.18-235.el5 using the steps from comment #9. So re-assigning this bug.

--- Additional comment from bburns on 2010-12-21 09:47:01 EST ---

Michael, this is a proposed blocker and flagged for z-stream. Is a fix soon to be posted?

--- Additional comment from mst on 2010-12-21 10:07:40 EST ---

A workaround at the moment is to set the tx queue length to 0 for the malicious guest or kill the malicious (non consuming) guest.

--- Additional comment from jplans on 2010-12-22 10:49:30 EST ---

After the RHEL discussions, we still need this resolved ASAP and, given the timeline we have ahead, we would like to propose instead 5.7.0 / 5.6.z (as 5.5.z is not approved for EUS).

Thanks,
Jose.
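
Editorial sketch of the workaround from the 2010-12-21 comment by mst (setting the tx queue length to 0 for the non-consuming guest). This is only an illustration under assumptions: the device name `tap1` and both helper names are hypothetical, the thread does not say which tool was used, and the `ip link set dev ... txqueuelen` form from iproute2 is chosen here as one common way to change the value. It has to run on the host as root.

```python
import subprocess


def txqueuelen_cmd(dev, qlen=0):
    """Build the iproute2 argv that sets the tx queue length of `dev`.

    `dev` is the host-side tap device of the non-consuming guest
    (hypothetical name, e.g. "tap1"); it is not taken from the report.
    """
    if qlen < 0:
        raise ValueError("queue length must be non-negative")
    return ["ip", "link", "set", "dev", dev, "txqueuelen", str(qlen)]


def apply_workaround(dev):
    """Run the command on the host; requires root privileges."""
    subprocess.run(txqueuelen_cmd(dev, 0), check=True)
```

Calling `apply_workaround("tap1")` would then run `ip link set dev tap1 txqueuelen 0`. Keeping the argv construction separate from the call makes it possible to inspect the command before running anything on a production host.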
--- Additional comment from pm-rhel on 2010-12-22 11:30:53 EST ---

GSS has reviewed this bug and agreed that it also should be included/released in one or more of the older still active and supported releases (asynchronous Errata Advisory, Extended Update Support stream or in an Advanced Mission Critical Long Life stream). Blocker flag was set to ? and exception and fast flags were cleared.

This action ensures that this bugzilla will be included in the current release and the customer who receives this patch will not see a regression.

--- Additional comment from mst on 2010-12-22 13:10:51 EST ---

Brew build here
http://brewweb.devel.redhat.com/brew/taskinfo?taskID=2996598

It's a large change. Can we have virt QE check this with a variety of workloads?

--- Additional comment from mst on 2010-12-23 01:42:34 EST ---

I mean more like some general stability test, ideally with non-virt users of tun like a VPN (assuming we have such tests).

--- Additional comment from mst on 2010-12-23 02:27:16 EST ---

Patch sent.
Message-ID: <20101222211152.GA13148>

--- Additional comment from mst on 2011-01-04 12:25:21 EST ---

Created attachment 471719
[RHEL5.7/5.6.z untested PATCH] tun: introduce tun_file. bz 58441

This patch was posted:
Date: Wed, 22 Dec 2010 23:11:52 +0200
From: "Michael S. Tsirkin" <mst>
Subject: [RHEL5.7/5.6.z PATCH] tun: introduce tun_file. bz 584412
Message-ID: <20101222211152.GA13148>

--- Additional comment from mst on 2011-01-04 12:28:04 EST ---

I have attached the patch to the BZ for your convenience.
Please note it has not been reviewed yet, and underwent only very light developer testing.

--- Additional comment from dlaor on 2011-01-10 08:07:12 EST ---

Any updates with review?

--- Additional comment from mst on 2011-01-10 10:14:40 EST ---

Got ack from Herbert. The patch is large and intrusive so review might take a while.
--- Additional comment from errata-xmlrpc on 2011-01-13 05:12:06 EST ---

Bug report changed to RELEASE_PENDING status by Errata System.
Advisory RHSA-2011:0017-38 has been changed to PUSH_READY status.
http://errata.devel.redhat.com/errata/show/9700

--- Additional comment from errata-xmlrpc on 2011-01-13 16:28:36 EST ---

An advisory has been issued which should help the problem described in this bug report. This report is therefore being closed with a resolution of ERRATA. For more information on the solution and/or where to find the updated files, please follow the link below. You may reopen this bug report if the solution does not work for you.

http://rhn.redhat.com/errata/RHSA-2011-0017.html

--- Additional comment from mst on 2011-01-25 13:02:09 EST ---

So this got closed but we still need to fix it in 5.6.z and 5.7.
What to do?
This request was evaluated by Red Hat Product Management for inclusion in a Red Hat Enterprise Linux maintenance release. Product Management has requested further review of this request by Red Hat Engineering, for potential inclusion in a Red Hat Enterprise Linux Update release for currently deployed products. This request is not yet committed for inclusion in an Update release.
in kernel-2.6.18-246.el5

You can download this test kernel (or newer) from http://people.redhat.com/jwilson/el5

Detailed testing feedback is always welcomed.
Passed verification with kernel-2.6.18-246.el5 & kvm-83-227.el5 using the same steps as in comment #1.

Steps:
1. force arp in guest A to match guest B
   arp -i eth0 -s <ip for guest B> <mac for guest B>
2. ping guest B, we should get back packets, e.g. with -c 1
3. ifdown guest B
4. ping guest B_ip -i 0.01
   keep the ping running for about 4 hours or more, until finding guest A could receive packets from guest B.
5. kill -9 13498 (process of guest A,process does not die)
   ps -ef |grep qemu-kvm
   root 13498 4152 0 Sep10 pts/1 00:02:59 [qemu-kvm] <defunct>

Result: dmesg shows no messages like:
breth0: port 2(tap0) entering disabled state
unregister_netdevice: waiting for tap0 to become free. Usage count = 1
unregister_netdevice: waiting for tap0 to become free. Usage count = 1
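
Editorial sketch of the dmesg check used in the verification above. It scans log text for the `unregister_netdevice: waiting for <dev> to become free` message quoted in this report and counts hits per device. The function name `find_stuck_devices` and the regular expression are illustrative assumptions, not part of any Red Hat test tooling; the "Usage count" suffix is treated as optional because the original description quotes the message without it.

```python
import re

# Matches the kernel message quoted in this report, with or without
# the trailing "Usage count = N" part.
STUCK_RE = re.compile(
    r"unregister_netdevice: waiting for (?P<dev>\S+) to become free"
    r"(?:\. Usage count = (?P<count>\d+))?"
)


def find_stuck_devices(dmesg_text):
    """Return {device name: number of matching log lines}."""
    hits = {}
    for line in dmesg_text.splitlines():
        match = STUCK_RE.search(line)
        if match:
            dev = match.group("dev")
            hits[dev] = hits.get(dev, 0) + 1
    return hits


# Sample input copied from the failing run quoted earlier in this report.
SAMPLE = """\
breth0: port 2(tap0) entering disabled state
unregister_netdevice: waiting for tap0 to become free. Usage count = 1
unregister_netdevice: waiting for tap0 to become free. Usage count = 1
"""

stuck = find_stuck_devices(SAMPLE)
```

On a host under test, the text passed in would be the captured output of `dmesg`; an empty result corresponds to the "no messages" outcome reported for kernel-2.6.18-246.el5.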
According to Comment 12 and Comment 13, set the status to verified.
An advisory has been issued which should help the problem described in this bug report. This report is therefore being closed with a resolution of ERRATA. For more information on the solution and/or where to find the updated files, please follow the link below. You may reopen this bug report if the solution does not work for you.

http://rhn.redhat.com/errata/RHSA-2011-1065.html
*** Bug 589614 has been marked as a duplicate of this bug. ***