Bug 439923

Summary:

Avoid multi-page allocations in IP fragmentation

Product:

Red Hat Enterprise Linux 4

Reporter:

Greg Marsden <greg.marsden>

Component:

kernel

Assignee:

Neil Horman <nhorman>

Status:

CLOSED INSUFFICIENT_DATA

QA Contact:

Martin Jenner <mjenner>

Severity:

low

Priority:

low

Version:

4.6

CC:

davem, okir, tgraf, vgoyal

Keywords:

Reopened

Hardware:

All

OS:

Linux

Doc Type:

Bug Fix

Last Closed:

2009-03-23 14:18:20 UTC

Attachments:

Description	Flags
Avoid multi-page allocations in IP fragmentation	none

Description Greg Marsden 2008-03-31 23:47:08 UTC

From: Olaf Kirch <olaf.kirch>
Subject: Avoid multi-page allocations in IP fragmentation

This patch is based on a patch by Zach Brown. The idea is to avoid
multi-order allocations in the fragment handling code, because that
will fail on heavily loaded machines where memory tends to become
rather fragmented. Original posting can be found here:

http://marc.info/?l=linux-netdev&m=114425947024500&w=2

This modified patch addresses a problem encountered on ppc - the
original patch caused skb_shared_info to become unaligned, which causes
crashes on ppc (see bug #6140918). The first iteration introduced a new
problem (described in bug #6845794).

Olaf Kirch <olaf.kirch>
---
 net/ipv4/ip_output.c |   40 +++++++++++++++++++++++++++-------------
 1 file changed, 27 insertions(+), 13 deletions(-)

Comment 1 Greg Marsden 2008-03-31 23:47:08 UTC

Created attachment 299793
Avoid multi-page allocations in IP fragmentation

Comment 2 Neil Horman 2008-04-15 16:11:12 UTC

What's the status of this patch? I've read the thread you reference above, and
it ends after Herbert suggests some changes that are never answered. I'm looking
in the upstream tree and I see that this patch hasn't been accepted. Given that
the thread above is from 2006, I'm hesitant to accept it. Can you provide me
with some sort of upstream status here?

Comment 3 Greg Marsden 2008-04-15 20:37:57 UTC

Upstream progress for this patch is stalled, but the latest version of the patch
(attached here) addresses Herbert's concerns about trailer_len.

The problem itself is fairly simple to reproduce on a system that supports large
frames.

Comment 4 David Miller 2008-04-16 11:12:44 UTC

If I am reading this patch and remembering the case correctly, it
only lets the first SKB be up to PAGE_SIZE when fragmenting.

If PAGE_SIZE is significantly less than the MTU this is going to
kill performance, because the result will be that fragments
will only be using a fraction of the MTU.
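
To make the performance concern above concrete, here is a minimal C sketch
that counts how many IP fragments a datagram needs when each fragment is
capped at a given size. It is not taken from the patch: frag_payload() and
frag_count() are hypothetical helper names, the 20-byte IP header and the
example sizes are assumptions, and it simplifies by applying the cap to every
fragment rather than only to the first one.

```c
#include <assert.h>

/*
 * Payload bytes carried by one fragment of at most frag_cap bytes
 * (IP header included). Fragment payloads are rounded down to a
 * multiple of 8 bytes, as IP fragment offsets require.
 */
static unsigned int frag_payload(unsigned int frag_cap,
                                 unsigned int fragheaderlen)
{
    /* Illustrative precondition: room for at least 8 payload bytes. */
    assert(frag_cap > fragheaderlen + 7);
    return (frag_cap - fragheaderlen) & ~7u;
}

/* Number of fragments needed to carry payload_len bytes (round up). */
static unsigned int frag_count(unsigned int payload_len,
                               unsigned int frag_cap,
                               unsigned int fragheaderlen)
{
    unsigned int per_frag = frag_payload(frag_cap, fragheaderlen);

    return (payload_len + per_frag - 1) / per_frag;
}
```

Comparing frag_count(65000, 16436, 20) with frag_count(65000, 4096, 20)
shows how a page-sized cap multiplies the number of fragments on an
interface whose MTU is several pages long.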

This really needs to be reinvigorated upstream if it's still an issue.
I suspect that the thing to do is to make protocols like UDP use
the sk->sk_sndmsg_page scheme like TCP, so that outgoing frames are
paged and thus not susceptible to this problem.

Actually... this is exactly what the code does.  ip_append_data()
uses the sk_sndmsg_page when the device supports scatter-gather.

So it appears that the only case that can result in multi-page allocs
is when the device does not support scatter-gather.  Nearly all devices
support that on transmit, so this issue can only affect extremely old
and primitive network devices which also support large MTUs, which is a
group exponentially approaching zero.

Comment 5 Neil Horman 2008-04-16 11:36:17 UTC

Based on davem's comments, this is a NOTABUG.

Comment 6 Greg Marsden 2008-04-21 17:50:26 UTC

Dave is right: this is not an issue in mainline, because the mainline code
does page-at-a-time allocations if the interface supports NETIF_F_SG. However,
that's not the behavior for EL4 or EL5, which is what is at issue here. So there
is no upstream for this, as it's already fixed via the scatter-gather solution,
but we still have the problem with current releases.

Comment 7 Neil Horman 2008-04-21 18:59:09 UTC

Not really sure what you're talking about here. A cursory look at ip_output.c
shows identical behavior for NETIF_F_SG between RHEL5 and upstream, so I'm not
sure where you're seeing multiple order-0 allocations vs. higher-order
allocations. I'm looking upstream, and both alloc_new_skb implementations call
through sock_alloc_send_skb, which in the end just calls alloc_skb, which happily
makes higher-order allocations in both RHEL5 and upstream.

Comment 8 David Miller 2008-04-21 23:12:11 UTC

Correct, RHEL5 has the SG code just like upstream.  There is no difference.

Comment 9 Olaf Kirch 2008-04-24 06:29:42 UTC

Feel free to call me dense this morning - but to me it looks like in RHEL5,
for the first fragment we still call sock_alloc_send_skb() to alloc an skb
with essentially the MTU minus overhead. This is the code from 2.6.18-53.*:

                        datalen = length + fraggap;
                        if (datalen > mtu - fragheaderlen)
                                datalen = maxfraglen - fragheaderlen;
[...]
                        if ((flags & MSG_MORE) &&
                            !(rt->u.dst.dev->features&NETIF_F_SG))
                                alloclen = mtu;
                        else
                                alloclen = datalen + fragheaderlen;
[...]
                        if (transhdrlen) {
                                skb = sock_alloc_send_skb(sk, alloclen ...

If the MTU is a multiple of 8, then maxfraglen == mtu; and when we start
with length >= mtu we end up with alloclen = mtu. This means alloc_skb will
try to do a kmalloc(16k) on loopback, and that's what's causing trouble for
Oracle.

What am I missing?
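
The arithmetic above can be sketched as a small standalone C program
fragment. This is an illustration, not kernel code: the helper names are
hypothetical, the maxfraglen formula is recalled from ip_append_data() rather
than shown in the excerpt, fraggap is assumed to be 0 for the first fragment,
and alloc_order() ignores the link-layer headroom and skb_shared_info that
alloc_skb adds on top of alloclen.

```c
#include <assert.h>
#include <stddef.h>

/*
 * Largest fragment length (IP header included) whose payload is a
 * multiple of 8 bytes. Assumed to mirror the kernel's maxfraglen.
 */
static unsigned int max_frag_len(unsigned int mtu, unsigned int fragheaderlen)
{
    assert(mtu > fragheaderlen);
    return ((mtu - fragheaderlen) & ~7u) + fragheaderlen;
}

/*
 * alloclen for the first fragment, following the branches in the
 * excerpt above; fraggap is taken as 0 (assumption).
 */
static unsigned int first_alloclen(unsigned int length, unsigned int mtu,
                                   unsigned int fragheaderlen,
                                   int msg_more, int dev_has_sg)
{
    unsigned int datalen = length;

    assert(mtu > fragheaderlen);
    if (datalen > mtu - fragheaderlen)
        datalen = max_frag_len(mtu, fragheaderlen) - fragheaderlen;

    if (msg_more && !dev_has_sg)
        return mtu;
    return datalen + fragheaderlen;
}

/* Smallest order such that (page_size << order) covers bytes. */
static unsigned int alloc_order(size_t bytes, size_t page_size)
{
    unsigned int order = 0;

    assert(page_size > 0);
    while ((page_size << order) < bytes)
        order++;
    return order;
}
```

Calling first_alloclen() with a length larger than the MTU, and then passing
the result to alloc_order() with the system page size, shows whether the
first fragment alone already needs more than a single page.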

Comment 10 Neil Horman 2008-05-27 12:16:32 UTC

That may be the case, and that seems to fit with the original comments (but not
comment #6). If that is the case, however, it seems that Dave's arguments
in comment #4 are applicable. We might improve memory availability if we do
multiple allocations, but the result is that we send out frames that
only take up a fraction of the interface MTU, which destroys
performance. It seems the easier thing to do in this case, if Oracle is having
allocation failures under extreme load, is to reduce the MTU of the loopback
interface so that each allocation is order zero. This solves the memory allocation
problem, and performance should be identical to what the proposed patch provides.

Comment 11 Neil Horman 2009-03-23 14:18:20 UTC

Closing due to lack of response from reporter.