Bug 439923

Summary: Avoid multi-page allocations in IP fragmentation
Product: Red Hat Enterprise Linux 4
Component: kernel
Version: 4.6
Hardware: All
OS: Linux
Status: CLOSED INSUFFICIENT_DATA
Severity: low
Priority: low
Target Milestone: rc
Target Release: ---
Keywords: Reopened
Reporter: Greg Marsden <greg.marsden>
Assignee: Neil Horman <nhorman>
QA Contact: Martin Jenner <mjenner>
CC: davem, okir, tgraf, vgoyal
Doc Type: Bug Fix
Last Closed: 2009-03-23 14:18:20 UTC
Attachments:
  Avoid multi-page allocations in IP fragmentation (flags: none)

Description Greg Marsden 2008-03-31 23:47:08 UTC
From: Olaf Kirch <olaf.kirch>
Subject: Avoid multi-page allocations in IP fragmentation

This patch is based on a patch by Zach Brown. The idea is to avoid
multi-order allocations in the fragment handling code, because that
will fail on heavily loaded machines where memory tends to become
rather fragmented. Original posting can be found here:

http://marc.info/?l=linux-netdev&m=114425947024500&w=2

This modified patch addresses a problem encountered on ppc - the
original patch caused skb_shared_info to become unaligned, which causes
crashes on ppc (see bug #6140918). The first iteration introduced a new
problem (described in bug #6845794).

Olaf Kirch <olaf.kirch>
---
 net/ipv4/ip_output.c |   40 +++++++++++++++++++++++++++-------------
 1 file changed, 27 insertions(+), 13 deletions(-)

Comment 1 Greg Marsden 2008-03-31 23:47:08 UTC
Created attachment 299793 [details]
Avoid multi-page allocations in IP fragmentation

Comment 2 Neil Horman 2008-04-15 16:11:12 UTC
What's the status on this patch? I've read the thread you reference above, and
it ends after Herbert suggests some changes that are never answered. I'm looking
in the upstream tree and I see that this patch hasn't been accepted. Given that
the thread above was from 2006, I'm hesitant to accept it. Can you provide me
with some sort of upstream status here?

Comment 3 Greg Marsden 2008-04-15 20:37:57 UTC
Upstream progress for this patch is stalled, but the latest version of the patch
(attached here) addresses Herbert's concerns about trailer_len.

The problem itself is fairly simple to reproduce on a system that supports large
frames.

Comment 4 David Miller 2008-04-16 11:12:44 UTC
If I am reading this patch and remembering the case correctly, it
only lets the first SKB be up to PAGE_SIZE when fragmenting.

If PAGE_SIZE is significantly less than the MTU this is going to
kill performance, because the result will be that fragments
will only be using a fraction of the MTU.

This really needs to be reinvigorated upstream if it's still an issue.
I suspect that the thing to do is to make protocols like UDP use
the sk->sk_sndmsg_page scheme like TCP, so that outgoing frames are
paged and thus not susceptible to this problem.

Actually... this is exactly what the code does.  ip_append_data()
uses the sk_sndmsg_page when the device supports scatter-gather.

So it appears that the only case that can result in multi-page allocs
is when the device does not support scatter-gather.  Nearly all devices
support that on transmit, so this issue can only affect extremely old
and primitive network devices which also support large MTUs, which is a
group exponentially approaching zero.


Comment 5 Neil Horman 2008-04-16 11:36:17 UTC
Based on davem's comments, this is a NOTABUG. 

Comment 6 Greg Marsden 2008-04-21 17:50:26 UTC
Dave is right, this is not an issue in mainline, because the mainline code
does page-at-a-time allocations if the interface supports NETIF_F_SG. However,
that's not the behavior for EL4 or EL5, which is at issue here... so there's no
upstream for this, as it's already fixed via the scatter-gather solution, but we
still have the problem with current releases.


Comment 7 Neil Horman 2008-04-21 18:59:09 UTC
Not really sure what you're talking about here. A cursory look at ip_output.c
shows identical behavior for NETIF_F_SG between RHEL5 and upstream. So I'm not
sure where you're seeing multiple order-0 allocations vs. higher-order
allocations. I'm looking upstream, and both alloc_new_skb implementations call
through sock_alloc_send_skb, which in the end just calls alloc_skb, which happily
makes higher-order allocations in both RHEL5 and upstream.

Comment 8 David Miller 2008-04-21 23:12:11 UTC
Correct, RHEL5 has the SG code just like upstream.  There is no difference.


Comment 9 Olaf Kirch 2008-04-24 06:29:42 UTC
Feel free to call me dense this morning - but to me it looks like in RHEL5,
for the first fragment we still call sock_alloc_send_skb() to alloc an skb
with essentially the MTU minus overhead. This is the code from 2.6.18-53.*:

                        datalen = length + fraggap;
                        if (datalen > mtu - fragheaderlen)
                                datalen = maxfraglen - fragheaderlen;
[...]
                        if ((flags & MSG_MORE) &&
                            !(rt->u.dst.dev->features&NETIF_F_SG))
                                alloclen = mtu;
                        else
                                alloclen = datalen + fragheaderlen;
[...]
                        if (transhdrlen) {
                                skb = sock_alloc_send_skb(sk, alloclen ...

If the MTU is a multiple of 8, then maxfraglen == mtu; and when we start
with length >= mtu we end up with alloclen = mtu. This means alloc_skb will
try to do a kmalloc(16k) on loopback, and that's what's causing trouble for
Oracle.

What am I missing?
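
The arithmetic above can be sketched in standalone C. This is an
illustration under stated assumptions, not the kernel code: the maxfraglen
formula (round the payload down to a multiple of 8, then add the header
back) is assumed from ip_append_data() of that era, fraggap is taken as 0
for the first fragment, the IP header is assumed to carry no options
(20 bytes), and the helper names max_frag_len, first_alloc_len, and
page_order are hypothetical. The headroom and skb_shared_info overhead that
alloc_skb adds on top are ignored.

```c
/* Assumed formula: largest fragment whose payload is a multiple of 8 bytes. */
static unsigned int max_frag_len(unsigned int mtu, unsigned int fragheaderlen)
{
    return ((mtu - fragheaderlen) & ~7u) + fragheaderlen;
}

/* Allocation length for the first fragment on the non-MSG_MORE path
 * quoted above, with fraggap assumed to be 0. */
static unsigned int first_alloc_len(unsigned int length, unsigned int mtu,
                                    unsigned int fragheaderlen)
{
    unsigned int datalen = length;

    if (datalen > mtu - fragheaderlen)
        datalen = max_frag_len(mtu, fragheaderlen) - fragheaderlen;
    return datalen + fragheaderlen;
}

/* Smallest order such that (page_size << order) >= bytes. */
static unsigned int page_order(unsigned int bytes, unsigned int page_size)
{
    unsigned int order = 0;
    unsigned int span = page_size;

    while (span < bytes) {
        span <<= 1;
        order++;
    }
    return order;
}
```

Under these assumptions, first_alloc_len(20000, 16436, 20) yields 16436
(16436 being an assumed loopback MTU), and page_order(16436, 4096) yields 3,
so the request no longer fits in a single 4 KiB page. With this formula,
maxfraglen equals mtu whenever mtu - fragheaderlen is a multiple of 8.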

Comment 10 Neil Horman 2008-05-27 12:16:32 UTC
That may be the case, and that seems to fit with the original comments (but not
Comment #6). If that is the case, however, it seems that Dave's arguments
in Comment #4 are applicable. We might improve memory availability if we do
multiple allocations, but the result is that we send out frames that
only take up a fraction of the interface MTU, which destroys
performance. It seems the easier thing to do in this case, if Oracle is having
allocation failures under extreme load, is to reduce the MTU of the loopback
interface so that it fits in an order-zero allocation. This solves the memory
allocation problem, and performance should be identical to what the proposed
patch provides.
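
The trade-off raised in Comment 4 and Comment 10 can be made concrete by
counting fragments. A minimal sketch, assuming each fragment carries the
largest 8-byte-aligned payload that fits under a per-fragment size limit
after a 20-byte IP header without options; frag_payload and fragment_count
are hypothetical helpers, and the 64000-byte datagram and 16436-byte MTU
are example values rather than figures from this report.

```c
/* Payload bytes one fragment can carry under a size limit (an MTU or a
 * page-sized cap). Fragment offsets are expressed in units of 8 bytes,
 * so the payload is rounded down to a multiple of 8. */
static unsigned int frag_payload(unsigned int limit, unsigned int hdrlen)
{
    if (limit <= hdrlen)
        return 0;
    return (limit - hdrlen) & ~7u;
}

/* Number of fragments needed to carry `payload` bytes (ceiling division).
 * Returns 0 if the limit leaves no room for any payload. */
static unsigned int fragment_count(unsigned int payload, unsigned int limit,
                                   unsigned int hdrlen)
{
    unsigned int per_frag = frag_payload(limit, hdrlen);

    if (per_frag == 0)
        return 0;
    return (payload + per_frag - 1) / per_frag;
}
```

Under these assumptions, a 64000-byte datagram needs 4 fragments with a
16436-byte limit and 16 fragments with a 4096-byte limit. The count is the
same whether the smaller limit comes from capping the allocation or from
lowering the MTU.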

Comment 11 Neil Horman 2009-03-23 14:18:20 UTC
Closing due to lack of response from reporter.