676579 – virtio_net: missing schedule on oom


Bug 676579 - virtio_net: missing schedule on oom

Summary: virtio_net: missing schedule on oom

Keywords:
Status:	CLOSED ERRATA
Alias:	None
Product:	Red Hat Enterprise Linux 6
Classification:	Red Hat
Component:	kernel
Sub Component:
Version:	6.0
Hardware:	Unspecified
OS:	Unspecified
Priority:	urgent
Severity:	unspecified
Target Milestone:	rc
Target Release:	---
Assignee:	Michael S. Tsirkin
QA Contact:	Virtualization Bugs
Docs Contact:
URL:
Whiteboard:
Depends On:
Blocks:	684268

Reported:	2011-02-10 09:58 UTC by Michael S. Tsirkin
Modified:	2013-01-11 03:48 UTC
CC List:	7 users
Fixed In Version:	kernel-2.6.32-117.el6
Doc Type:	Bug Fix
Doc Text:	Intensive use of resources on a guest led to a failure of networking on that guest: packets could no longer be received. The failure occurred when a DMA (Direct Memory Access) ring was consumed before NAPI (New API; an interface for networking devices which makes use of interrupt mitigation techniques) was enabled, which resulted in a failure to receive the next interrupt request. The regular interrupt handler was not affected in this situation because it can process packets in place; however, the OOM (Out Of Memory) handler did not detect this situation, and networking failed. With this update, NAPI is scheduled after each napi_enable operation; thus, networking no longer fails under these circumstances.
Clone Of:
Environment:
Last Closed:	2011-05-19 12:49:11 UTC
Target Upstream Version:
Embargoed:

Links
System	ID	Private	Priority	Status	Summary	Last Updated
Red Hat Product Errata	RHSA-2011:0542	0	normal	SHIPPED_LIVE	Important: Red Hat Enterprise Linux 6.1 kernel security, bug fix and enhancement update	2011-05-19 11:58:07 UTC

Description Michael S. Tsirkin 2011-02-10 09:58:50 UTC

Description of problem:

The following was reported upstream:
http://www.spinics.net/lists/linux-virtualization/msg12361.html

Under harsh testing conditions, including low memory, the guest would
stop receiving packets. With this patch applied we no longer see any
problems in the driver while performing these tests for extended periods
of time.

The bug is that if the ring is consumed before napi is enabled,
we don't get another interrupt. The regular interrupt
handler copes with this by processing packets in place,
but the oom handler is missing this check.

Version-Release number of selected component (if applicable):

How reproducible:
Not sure. This was reported upstream, and reading the code
makes it clear that it applies to RHEL 6.0.

Steps to Reproduce:
1. stress memory so atomic allocations start failing
   (how to do this? not sure)
2. at the same time stress with large incoming packets

  
Actual results:
at some point networking will stop and won't recover
when 

Expected results:
networking keeps going, slowly

Additional info:
stress with nfs reads might trigger this?

Comment 1 RHEL Program Management 2011-02-10 10:10:36 UTC

This request was evaluated by Red Hat Product Management for inclusion
in a Red Hat Enterprise Linux maintenance release. Product Management has 
requested further review of this request by Red Hat Engineering, for potential
inclusion in a Red Hat Enterprise Linux Update release for currently deployed 
products. This request is not yet committed for inclusion in an Update release.

Comment 2 Keqin Hong 2011-02-12 10:34:14 UTC

This bug can be reproduced with the following scenario:

+---------------+         +------------------------------+
|  Netserver    |   LAN   |                              |
|---------------|---------|Netperf(*2000+ tasks)         |
|     VM (512M) |         |                              |
+---------------+         +------------------------------+

Run thousands of netperf clients in background to stress the netserver.

kernel-2.6.32-113.el6.x86_64

Comment 4 Aristeu Rozanski 2011-02-18 22:09:01 UTC

Patch(es) available on kernel-2.6.32-117.el6

Comment 7 Keqin Hong 2011-03-02 11:20:58 UTC

Reproduced on kernel-2.6.32-116.el6.x86_64, and verified on kernel-2.6.32-117.el6.x86_64. PASS.

Steps:
1) boot guest with 512M mem and virtio net.
2) run netserver inside guest.
3) on host, launch 2000 netperf clients in background to stress netserver.
4) ping guest (network is lost; the guest network must be restarted to restore it)

Updated the guest kernel to 2.6.32-117 and tested again: the network is no longer lost.

CLI:

/usr/libexec/qemu-kvm -S -M rhel6.1.0 -enable-kvm -m 512 -smp 2,sockets=2,cores=1,threads=1 -name RHEL6.1-virtio_net_test -uuid 362f0255-b6e4-2a75-9506-af9c2e5ceb5d -nodefconfig -nodefaults -chardev socket,id=monitor,path=/var/lib/libvirt/qemu/RHEL6.1-virtio_net_test.monitor,server,nowait -mon chardev=monitor,mode=control -rtc base=utc -boot c -drive file=/home/khong/RHEL6.1-virtio_net_test.img,if=none,id=drive-virtio-disk0,format=raw,cache=none -device virtio-blk-pci,bus=pci.0,addr=0x5,drive=drive-virtio-disk0,id=virtio-disk0 -netdev tap,fd=20,id=hostnet0,vhost=on,vhostfd=22 -device virtio-net-pci,netdev=hostnet0,id=net0,mac=52:54:00:1f:f2:62,bus=pci.0,addr=0x3 -chardev pty,id=serial0 -device isa-serial,chardev=serial0 -usb -device usb-tablet,id=input0 -vnc 127.0.0.1:0 -vga cirrus -device AC97,id=sound0,bus=pci.0,addr=0x4 -device virtio-balloon-pci,id=balloon0,bus=pci.0,addr=0x6

Comment 8 Keqin Hong 2011-03-02 11:23:33 UTC

script to run netperf clients:

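#!/bin/sh
# Launch 2000 netperf clients in the background against the guest's netserver.
ip=$guest_ip   # guest_ip must be set in the environment beforehand
i=0
while [ "$i" -lt 2000 ]
do
    netperf -H "$ip" -l 300 &
    i=$((i + 1))
    echo "launch Client-No.$i"
done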

Comment 10 Martin Prpič 2011-04-12 12:49:35 UTC

    Technical note added. If any revisions are required, please edit the "Technical Notes" field
    accordingly. All revisions will be proofread by the Engineering Content Services team.
    
    New Contents:
Intensive use of resources on a guest led to a failure of networking on that guest: packets could no longer be received. The failure occurred when a DMA (Direct Memory Access) ring was consumed before NAPI (New API; an interface for networking devices which makes use of interrupt mitigation techniques) was enabled, which resulted in a failure to receive the next interrupt request. The regular interrupt handler was not affected in this situation because it can process packets in place; however, the OOM (Out Of Memory) handler did not detect this situation, and networking failed. With this update, NAPI is scheduled after each napi_enable operation; thus, networking no longer fails under these circumstances.

Comment 11 errata-xmlrpc 2011-05-19 12:49:11 UTC

An advisory has been issued which should help the problem
described in this bug report. This report is therefore being
closed with a resolution of ERRATA. For more information
on the solution and/or where to find the updated files,
please follow the link below. You may reopen this bug report
if the solution does not work for you.

http://rhn.redhat.com/errata/RHSA-2011-0542.html
