Bug 495863

Summary: kernel: tun: Add packet accounting
Product: Red Hat Enterprise Linux 5 Reporter: Herbert Xu <herbert.xu>
Component: kernel Assignee: Herbert Xu <herbert.xu>
Status: CLOSED ERRATA QA Contact: Red Hat Kernel QE team <kernel-qe>
Severity: medium Docs Contact:
Priority: low    
Version: 5.4 CC: dzickus, eteo, markmc, mjw, security-response-team, syeghiay
Target Milestone: rc   
Target Release: ---   
Hardware: All   
OS: Linux   
Whiteboard:
Fixed In Version: Doc Type: Bug Fix
Doc Text:
Story Points: ---
Clone Of: Environment:
Last Closed: 2009-09-02 08:57:45 UTC Type: ---
Regression: --- Mount Type: ---
Documentation: --- CRM:
Verified Versions: Category: ---
oVirt Team: --- RHEL 7.3 requirements from Atomic Host:
Cloudforms Team: --- Target Upstream Version:
Embargoed:
Bug Depends On:    
Bug Blocks: 508861    
Attachments:
Description Flags
tun: Limit amount of queued packets per device none

Description Herbert Xu 2009-04-15 08:16:27 UTC
We need to add packet accounting to the tun driver so that virtio-net gets congestion feedback. This is necessary to prevent packet loss for protocols lacking congestion control (such as UDP) when used in a guest.

Comment 1 Herbert Xu 2009-04-15 08:18:26 UTC
Created attachment 339645
tun: Limit amount of queued packets per device

This is a backport of

commit 33dccbb050bbe35b88ca8cf1228dcf3e4d4b3554
Author: Herbert Xu <herbert.org.au>
Date:   Thu Feb 5 21:25:32 2009 -0800

    tun: Limit amount of queued packets per device
    
    Unlike a normal socket path, the tuntap device send path does
    not have any accounting.  This means that the user-space sender
    may be able to pin down arbitrary amounts of kernel memory by
    continuing to send data to an end-point that is congested.
    
    Even when this isn't an issue because of limited queueing at
    most end points, this can also be a problem because its only
    response to congestion is packet loss.  That is, when those
    local queues at the end-point fills up, the tuntap device will
    start wasting system time because it will continue to send
    data there which simply gets dropped straight away.
    
    Of course one could argue that everybody should do congestion
    control end-to-end, unfortunately there are people in this world
    still hooked on UDP, and they don't appear to be going away
    anywhere fast.  In fact, we've always helped them by performing
    accounting in our UDP code, the sole purpose of which is to
    provide congestion feedback other than through packet loss.
    
    This patch attempts to apply the same bandaid to the tuntap device.
    It creates a pseudo-socket object which is used to account our
    packets just as a normal socket does for UDP.  Of course things
    are a little complex because we're actually reinjecting traffic
    back into the stack rather than out of the stack.
    
    The stack complexities however should have been resolved by preceding
    patches.  So this one can simply start using skb_set_owner_w.
    
    For now the accounting is essentially disabled by default for
    backwards compatibility.  In particular, we set the cap to INT_MAX.
    This is so that existing applications don't get confused by the
    sudden arrival of EAGAIN errors.
    
    In future we may wish (or be forced to) do this by default.
    
    Signed-off-by: Herbert Xu <herbert.org.au>
    Signed-off-by: David S. Miller <davem>
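
The accounting described in the commit message above can be pictured with a small userspace model. This is only a sketch, not the kernel code: the names tun_acct, tun_acct_init, tun_charge and tun_uncharge are hypothetical, and the exact limit check in the kernel may differ in detail. It shows a cap that defaults to INT_MAX (so accounting is effectively off), a charge step that fails with -EAGAIN once the cap would be exceeded, and a release step standing in for the destructor that runs when a packet is freed.

```c
/* Userspace model of per-device send accounting; NOT kernel code.
 * All names here are hypothetical and exist only for illustration. */
#include <assert.h>
#include <errno.h>
#include <limits.h>

struct tun_acct {
    int sndbuf;     /* cap on bytes in flight for this device */
    int wmem_alloc; /* bytes currently charged to the device */
};

/* Default cap is INT_MAX, which leaves accounting effectively disabled. */
void tun_acct_init(struct tun_acct *a)
{
    a->sndbuf = INT_MAX;
    a->wmem_alloc = 0;
}

/* Charge one packet of the given size.
 * Returns 0 on success, -EAGAIN if the cap would be exceeded. */
int tun_charge(struct tun_acct *a, int truesize)
{
    assert(truesize > 0);
    if (truesize > a->sndbuf - a->wmem_alloc)
        return -EAGAIN;
    a->wmem_alloc += truesize;
    return 0;
}

/* Release the charge, as a packet destructor would when the packet is freed. */
void tun_uncharge(struct tun_acct *a, int truesize)
{
    assert(truesize > 0 && truesize <= a->wmem_alloc);
    a->wmem_alloc -= truesize;
}
```

With the default cap every charge succeeds, which matches the backwards-compatible behaviour described above. Once a smaller cap is set, a sender that outruns the consumer sees -EAGAIN instead of silently queueing more memory.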

commit 4cc7f68d65558f683c702d4fe3a5aac4c5227b97
Author: Herbert Xu <herbert.org.au>
Date:   Wed Feb 4 16:55:54 2009 -0800

    net: Reexport sock_alloc_send_pskb
    
    The function sock_alloc_send_pskb is completely useless if not
    exported since most of the code in it won't be used as is.  In
    fact, this code has already been duplicated in the tun driver.
    
    Now that we need accounting in the tun driver, we can in fact
    use this function as is.  So this patch marks it for export again.
    
    Signed-off-by: Herbert Xu <herbert.org.au>
    Signed-off-by: David S. Miller <davem>

commit 9a279bcbe347496799711155ed41a89bc40f79c5
Author: Herbert Xu <herbert.org.au>
Date:   Wed Feb 4 16:55:27 2009 -0800

    net: Partially allow skb destructors to be used on receive path
    
    As it currently stands, skb destructors are forbidden on the
    receive path because the protocol end-points will overwrite
    any existing destructor with their own.
    
    This is the reason why we have to call skb_orphan in the loopback
    driver before we reinject the packet back into the stack, thus
    creating a period during which loopback traffic isn't charged
    to any socket.
    
    With virtualisation, we have a similar problem in that traffic
    is reinjected into the stack without being associated with any
    socket entity, thus providing no natural congestion push-back
    for those poor folks still stuck with UDP.
    
    Now had we been consistent in telling them that UDP simply has
    no congestion feedback, I could just fob them off.  Unfortunately,
    we appear to have gone to some length in catering for this on
    the standard UDP path, with skb/socket accounting so that has
    created a very unhealthy dependency.
    
    Alas habits are difficult to break out of, so we may just have
    to allow skb destructors on the receive path.
    
    It turns out that making skb destructors useable on the receive path
    isn't as easy as it seems.  For instance, simply adding skb_orphan
    to skb_set_owner_r isn't enough.  This is because we assume all
    over the IP stack that skb->sk is an IP socket if present.
    
    The new transparent proxy code goes one step further and assumes
    that skb->sk is the receiving socket if present.
    
    Now all of this can be dealt with by adding simple checks such
    as only treating skb->sk as an IP socket if skb->sk->sk_family
    matches.  However, it turns out that for bridging at least we
    don't need to do all of this work.
    
    This is of interest because most virtualisation setups use bridging
    so we don't actually go through the IP stack on the host (with
    the exception of our old nemesis the bridge netfilter, but that's
    easily taken care of).
    
    So this patch simply adds skb_orphan to the point just before we
    enter the IP stack, but after we've gone through the bridge on the
    receive path.  It also adds an skb_orphan to the one place in
    netfilter that touches skb->sk/skb->destructor, that is, tproxy.
    
    One word of caution, because of the internal code structure, anyone
    wishing to deploy this must use skb_set_owner_w as opposed to
    skb_set_owner_r since many functions that create a new skb from
    an existing one will invoke skb_set_owner_w on the new skb.
    
    Signed-off-by: Herbert Xu <herbert.org.au>
    Signed-off-by: David S. Miller <davem>
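
The destructor and orphan mechanics that this commit message relies on can be sketched in userspace as follows. This is a simplified model under stated assumptions, not the kernel implementation: struct owner, struct pkt and the pkt_* functions are hypothetical stand-ins for the socket, the skb, skb_set_owner_w and skb_orphan. The point it illustrates is that a packet stays charged to its sender while it crosses the bridge, and that the charge is dropped at the moment the packet is orphaned just before entering the IP stack.

```c
/* Userspace model of write-side ownership and orphaning; NOT kernel code.
 * All types and function names here are hypothetical. */
#include <assert.h>
#include <stddef.h>

struct owner {
    int charged; /* bytes currently charged to this sender */
};

struct pkt {
    struct owner *sk;                 /* owning sender, or NULL */
    void (*destructor)(struct pkt *); /* runs when ownership is dropped */
    int truesize;                     /* bytes this packet accounts for */
};

/* Destructor: give the bytes back to the sender. */
void pkt_release_charge(struct pkt *p)
{
    assert(p->sk != NULL);
    p->sk->charged -= p->truesize;
}

/* Charge the packet to the sender and arm the destructor. */
void pkt_set_owner_w(struct pkt *p, struct owner *o)
{
    p->sk = o;
    p->destructor = pkt_release_charge;
    o->charged += p->truesize;
}

/* Drop ownership: run the destructor once, then clear both fields so
 * later code cannot mistake the old owner for a receiving socket. */
void pkt_orphan(struct pkt *p)
{
    if (p->destructor)
        p->destructor(p);
    p->destructor = NULL;
    p->sk = NULL;
}
```

Calling pkt_orphan a second time is harmless in this model because the destructor pointer has already been cleared.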

Comment 5 RHEL Program Management 2009-05-11 20:59:13 UTC
This request was evaluated by Red Hat Product Management for inclusion in a Red
Hat Enterprise Linux maintenance release.  Product Management has requested
further review of this request by Red Hat Engineering, for potential
inclusion in a Red Hat Enterprise Linux Update release for currently deployed
products.  This request is not yet committed for inclusion in an Update
release.

Comment 6 Don Zickus 2009-05-14 19:35:30 UTC
in kernel-2.6.18-148.el5
You can download this test kernel from http://people.redhat.com/dzickus/el5

Please do NOT transition this bugzilla state to VERIFIED until our QE team
has sent specific instructions indicating when to do so.  However feel free
to provide a comment indicating that this fix has been verified.

Comment 8 Don Zickus 2009-06-03 20:17:13 UTC
There was a problem with this patch and it is being reverted.  The next time it goes to MODIFIED, the patch will have been reverted.

Comment 9 Don Zickus 2009-06-04 16:07:22 UTC
in kernel-2.6.18-152.el5
You can download this test kernel from http://people.redhat.com/dzickus/el5

Please do NOT transition this bugzilla state to VERIFIED until our QE team
has sent specific instructions indicating when to do so.  However feel free
to provide a comment indicating that this fix has been verified.

Comment 11 Don Zickus 2009-06-05 18:51:59 UTC
Moving to POST to re-apply with additional fixes.

Comment 12 Don Zickus 2009-06-11 15:36:59 UTC
in kernel-2.6.18-153.el5
You can download this test kernel from http://people.redhat.com/dzickus/el5

Please do NOT transition this bugzilla state to VERIFIED until our QE team
has sent specific instructions indicating when to do so.  However feel free
to provide a comment indicating that this fix has been verified.

Comment 17 errata-xmlrpc 2009-09-02 08:57:45 UTC
An advisory has been issued which should help the problem
described in this bug report. This report is therefore being
closed with a resolution of ERRATA. For more information
on the solution and/or where to find the updated files,
please follow the link below. You may reopen this bug report
if the solution does not work for you.

http://rhn.redhat.com/errata/RHSA-2009-1243.html