Bug 757645 - call trace "page allocation failure" when scp'ing a lot of files
Summary: call trace "page allocation failure" when scp'ing a lot of files
Keywords:
Status: CLOSED WONTFIX
Alias: None
Product: Red Hat Enterprise Linux 6
Classification: Red Hat
Component: kernel
Version: 6.2
Hardware: Unspecified
OS: Linux
Priority: low
Severity: low
Target Milestone: rc
Assignee: Larry Woodman
QA Contact: Li Wang
URL:
Whiteboard:
Depends On:
Blocks: 1159933 1172231 1269194 1270638 1359574
 
Reported: 2011-11-28 08:15 UTC by Xiaoqing Wei
Modified: 2018-12-09 16:46 UTC
CC List: 10 users

Fixed In Version:
Doc Type: If docs needed, set a value
Doc Text:
Clone Of:
Environment:
Last Closed: 2017-04-20 19:48:45 UTC
Target Upstream Version:
Flags: lwoodman: needinfo+


Attachments (Terms of Use)
sos report (1.40 MB, application/x-xz)
2011-11-28 09:08 UTC, Xiaoqing Wei
sos report (11.63 MB, application/zip)
2014-06-18 09:06 UTC, David Busby

Description Xiaoqing Wei 2011-11-28 08:15:07 UTC
Description of problem:

call trace " page allocation failure " when scp a lot of files

Version-Release number of selected component (if applicable):

kernel-2.6.32-220.el6.x86_64

How reproducible:
Only encountered it once.

Steps to Reproduce:
1. scp a lot of files from one host to another (both RHEL 6.2 x86_64).
2. A call trace appears on the SRC machine.
  
Actual results:
Call traces can be seen in dmesg.

Expected results:
scp finishes with no call trace / warning.

Additional info:
1) The SRC machine has a tg3 NIC.
2) The DEST machine does not show a call trace.
3) NOTE: sosreport attached.

4) glance of dmesg:

ssh: page allocation failure. order:3, mode:0x20
Pid: 4960, comm: ssh Tainted: P           ----------------   2.6.32-220.el6.x86_64 #1
Call Trace:
 [<ffffffff81123f0f>] ? __alloc_pages_nodemask+0x77f/0x940
 [<ffffffff8115ddc2>] ? kmem_getpages+0x62/0x170
 [<ffffffff8115e9da>] ? fallback_alloc+0x1ba/0x270
 [<ffffffff8115e42f>] ? cache_grow+0x2cf/0x320
 [<ffffffff8115e759>] ? ____cache_alloc_node+0x99/0x160
 [<ffffffff814218da>] ? __alloc_skb+0x7a/0x180
 [<ffffffff8115f61f>] ? kmem_cache_alloc_node_notrace+0x6f/0x130
 [<ffffffff8115f85b>] ? __kmalloc_node+0x7b/0x100
 [<ffffffff814218da>] ? __alloc_skb+0x7a/0x180
 [<ffffffff814229e6>] ? skb_copy+0x36/0xa0
 [<ffffffffa0468d24>] ? tg3_start_xmit+0xcd4/0x1030 [tg3]
 [<ffffffff8142c7dc>] ? dev_hard_start_xmit+0x2bc/0x3f0
 [<ffffffff81449b8a>] ? sch_direct_xmit+0x15a/0x1c0

Comment 1 Xiaoqing Wei 2011-11-28 08:18:43 UTC
Since the host network is still working, setting this bug to low / low.

Comment 3 Xiaoqing Wei 2011-11-28 09:08:15 UTC
Created attachment 537404 [details]
sos report

Comment 5 RHEL Program Management 2012-05-03 04:39:10 UTC
Since RHEL 6.3 External Beta has begun, and this bug remains
unresolved, it has been rejected as it is not proposed as
exception or blocker.

Red Hat invites you to ask your support representative to
propose this request, if appropriate and relevant, in the
next release of Red Hat Enterprise Linux.

Comment 6 Jes Sorensen 2013-02-16 14:28:30 UTC
System was out of memory and it tried to allocate more - not really a bug

Comment 7 David Busby 2014-06-18 09:06:09 UTC
Created attachment 909897 [details]
sos report

Same bug encountered when placing a system under heavy network load.

Typically this issue presents itself when backup processes stream backups over an NFS share.

I have noted in the past that the tg3 driver has issues with TSO; previously it would drop the connection entirely: https://plus.google.com/+DavidBusby/posts/A9SCpUNADSk

It seems, however, that this is no longer the case, and it instead leads to the error reports from the kernel.


e.g.

Mar  4 01:50:05 db5 kernel: sshd: page allocation failure. order:4, mode:0x20
Mar  4 01:50:05 db5 kernel: Pid: 29073, comm: sshd Tainted: P           ---------------    2.6.32-358.el6.x86_64 #1
Mar  4 01:50:05 db5 kernel: Call Trace:
Mar  4 01:50:05 db5 kernel: [<ffffffff8112c127>] ? __alloc_pages_nodemask+0x757/0x8d0
Mar  4 01:50:05 db5 kernel: [<ffffffff811669d2>] ? kmem_getpages+0x62/0x170
Mar  4 01:50:05 db5 kernel: [<ffffffff811675ea>] ? fallback_alloc+0x1ba/0x270
Mar  4 01:50:05 db5 kernel: [<ffffffff8116703f>] ? cache_grow+0x2cf/0x320
Mar  4 01:50:05 db5 kernel: [<ffffffff81167369>] ? ____cache_alloc_node+0x99/0x160
Mar  4 01:50:05 db5 kernel: [<ffffffff81168530>] ? kmem_cache_alloc_node_trace+0x90/0x200
Mar  4 01:50:05 db5 kernel: [<ffffffff8116874d>] ? __kmalloc_node+0x4d/0x60
Mar  4 01:50:05 db5 kernel: [<ffffffff8143d6ad>] ? __alloc_skb+0x6d/0x190
Mar  4 01:50:05 db5 kernel: [<ffffffff8143e7c6>] ? skb_copy+0x36/0xa0
Mar  4 01:50:05 db5 kernel: [<ffffffffa01f06cc>] ? tg3_start_xmit+0xa8c/0xd50 [tg3]
Mar  4 01:50:05 db5 kernel: [<ffffffff81448ca8>] ? dev_hard_start_xmit+0x308/0x530
Mar  4 01:50:05 db5 kernel: [<ffffffff81466fca>] ? sch_direct_xmit+0x15a/0x1c0
Mar  4 01:50:05 db5 kernel: [<ffffffff8144c9b0>] ? dev_queue_xmit+0x3b0/0x550
Mar  4 01:50:05 db5 kernel: [<ffffffffa0506ed7>] ? bond_dev_queue_xmit+0x67/0x200 [bonding]
Mar  4 01:50:05 db5 kernel: [<ffffffffa05075ab>] ? bond_start_xmit+0x53b/0x5d0 [bonding]
Mar  4 01:50:05 db5 kernel: [<ffffffff81448ca8>] ? dev_hard_start_xmit+0x308/0x530
Mar  4 01:50:05 db5 kernel: [<ffffffff81474609>] ? nf_iterate+0x69/0xb0
Mar  4 01:50:05 db5 kernel: [<ffffffff8144c805>] ? dev_queue_xmit+0x205/0x550
Mar  4 01:50:05 db5 kernel: [<ffffffff81484f40>] ? ip_finish_output+0x0/0x310
Mar  4 01:50:05 db5 kernel: [<ffffffff8148507c>] ? ip_finish_output+0x13c/0x310
Mar  4 01:50:05 db5 kernel: [<ffffffff81485308>] ? ip_output+0xb8/0xc0
Mar  4 01:50:05 db5 kernel: [<ffffffff814845cf>] ? __ip_local_out+0x9f/0xb0
Mar  4 01:50:05 db5 kernel: [<ffffffff81484605>] ? ip_local_out+0x25/0x30
Mar  4 01:50:05 db5 kernel: [<ffffffff81484ae0>] ? ip_queue_xmit+0x190/0x420
Mar  4 01:50:05 db5 kernel: [<ffffffff814997ce>] ? tcp_transmit_skb+0x3fe/0x7b0
Mar  4 01:50:05 db5 kernel: [<ffffffff8149bb8b>] ? tcp_write_xmit+0x1fb/0xa20
Mar  4 01:50:05 db5 kernel: [<ffffffff8149c3e0>] ? tcp_push_one+0x30/0x40
Mar  4 01:50:05 db5 kernel: [<ffffffff8148d13c>] ? tcp_sendmsg+0x9cc/0xa20
Mar  4 01:50:05 db5 kernel: [<ffffffff81437b9b>] ? sock_aio_write+0x19b/0x1c0
Mar  4 01:50:05 db5 kernel: [<ffffffff81180c9a>] ? do_sync_write+0xfa/0x140
Mar  4 01:50:05 db5 kernel: [<ffffffff81096c80>] ? autoremove_wake_function+0x0/0x40
Mar  4 01:50:05 db5 kernel: [<ffffffff8121baf6>] ? security_file_permission+0x16/0x20
Mar  4 01:50:05 db5 kernel: [<ffffffff81181064>] ? vfs_write+0x184/0x1a0
Mar  4 01:50:05 db5 kernel: [<ffffffff81181891>] ? sys_write+0x51/0x90
Mar  4 01:50:05 db5 kernel: [<ffffffff810dc565>] ? __audit_syscall_exit+0x265/0x290
Mar  4 01:50:05 db5 kernel: [<ffffffff8100b072>] ? system_call_fastpath+0x16/0x1b

Comment 8 John 2014-06-26 00:41:28 UTC
This is an absolute joke.

Servers and workstations are loaded with these garbage Broadcom tg3 NICs, and now you can't even use a 1GbE NIC in RHEL without disabling TSO and other nonsense?!

This bug is still present in RHEL6.5, 2.6.32-431.5.1.el6.x86_64

We see this on machines with 24 GB RAM or more, UNLOADED besides a single rsync over the 1GbE Broadcom NIC.

HOW THE HELL CAN MEMORY MANAGEMENT IN LINUX BE SO BAD THAT IT CANNOT ALLOCATE MEMORY FOR A 1Gbe NIC?

THESE DRIVERS HAVE BEEN AROUND FOR YEARS, AND THEY'RE STILL NOT WORKING?

WHAT HOPE IS THERE THAT RHEL IS GOING TO WORK WHEN I STICK 10GbE NICs IN THESE MACHINES?

Useless. Utterly useless.

Comment 9 John 2014-06-26 00:57:49 UTC
So, let me make this very clear:

When this bug happens, the network DOES sometimes show strange problems. 

Some clients seem unable to regain a working connection to a server that has hit this issue, although other clients can be unaffected.

IT IS A BUG.
And it occurs on machines with craploads of memory, under absolutely ZERO memory pressure, apart from use of memory for filesystem cache.

I'll say it again. Linux memory management is an abysmal joke.

You give a machine a bunch of memory, and Linux uses it for filesystem cache, and then it's so useless it can't free that memory fast enough when something actually wants memory.

Just pathetic.

Comment 10 Larry Woodman 2014-11-19 19:55:38 UTC
The problem here is clearly that the tg3 NIC driver is doing a 64KB atomic allocation and not properly falling back to a smaller size when that allocation fails:

Mar  4 01:50:05 db5 kernel: sshd: page allocation failure. order:4, mode:0x20
...
#define GFP_ATOMIC      (__GFP_HIGH)
#define __GFP_HIGH             0x20

The kernel can never guarantee that there will be more than one page of physically contiguous memory available, let alone 16 pages!  It's up to the driver to properly handle large memory allocation failures and back off to smaller and smaller sizes until the allocation succeeds.  This has NOTHING to do with how fast the system frees up memory.
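
A minimal sketch of the kind of back-off that implies (illustrative only, not the actual tg3 code; the function name is made up):

#include <linux/gfp.h>

/*
 * Try the large GFP_ATOMIC allocation first, then progressively smaller
 * orders instead of failing outright.  __GFP_NOWARN suppresses the
 * "page allocation failure" traces shown above.
 */
static struct page *alloc_pages_backoff(unsigned int max_order,
                                        unsigned int *got_order)
{
        unsigned int order;

        for (order = max_order; ; order--) {
                struct page *page = alloc_pages(GFP_ATOMIC | __GFP_NOWARN, order);

                if (page) {
                        *got_order = order;
                        return page;
                }
                if (order == 0)
                        return NULL;    /* not even a single free page */
        }
}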

As you can see, __alloc_skb() is coded to handle failures of the data buffer allocation.  This is especially important since the gfp_mask obviously includes __GFP_DMA, and that zone is very small regardless of how much memory your system has.

/**
 *      __alloc_skb     -       allocate a network buffer
 *      @size: size to allocate
 *      @gfp_mask: allocation mask
 *      @fclone: allocate from fclone cache instead of head cache
 *              and allocate a cloned (child) skb
 *      @node: numa node to allocate memory on
 *
 *      Allocate a new &sk_buff. The returned buffer has no headroom and a
 *      tail room of size bytes. The object has a reference count of one.
 *      The return is the buffer. On a failure the return is %NULL.
 *
 *      Buffers may only be allocated from interrupts using a @gfp_mask of
 *      %GFP_ATOMIC.
 */
struct sk_buff *__alloc_skb(unsigned int size, gfp_t gfp_mask,
                            int fclone, int node)
{
        struct kmem_cache *cache;
        struct skb_shared_info *shinfo;
        struct sk_buff *skb;
        u8 *data;

        cache = fclone ? skbuff_fclone_cache : skbuff_head_cache;

        /* Get the HEAD */
        skb = kmem_cache_alloc_node(cache, gfp_mask & ~__GFP_DMA, node);
        if (!skb)
                goto out;

        size = SKB_DATA_ALIGN(size);
        data = kmalloc_node_track_caller(size + sizeof(struct skb_shared_info),
                        gfp_mask, node);
        if (!data)
                goto nodata;
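
        /*
         * For context (roughly as in the skbuff.c of that era; quoted from
         * memory, so treat as illustrative): the failure path the snippet
         * above stops short of frees the partially allocated skb head and
         * returns NULL to the caller.
         */
out:
        return skb;
nodata:
        kmem_cache_free(cache, skb);
        skb = NULL;
        goto out;
}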


When this happens, please run "echo m > /proc/sysrq-trigger" and attach the show_mem output so I can show you exactly where the DMA memory is and why this allocation is failing.

Larry

Comment 11 KOSAKI Motohiro 2014-11-20 15:30:47 UTC
I agree with Larry.

I think this call stack means:

tg3_start_xmit
  tigon3_dma_hwbug_workaround
    skb_copy(GFP_ATOMIC)
      alloc_skb(GFP_ATOMIC)


tigon3_dma_hwbug_workaround() means your hardware has a DMA bug and the kernel must allocate an additional buffer as a workaround. Unfortunately, that allocation may fail, but I don't think this is a kernel bug or fault.

Moreover, if this allocation failure makes the network unstable, then some of your applications don't handle UDP packet loss properly. That's not the kernel's fault.
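
A simplified sketch of that path (illustrative only, not the actual tg3 source; the helper name is made up) shows why the failure is survivable: if the GFP_ATOMIC copy fails, the packet is dropped and the upper layers retransmit it.

#include <linux/errno.h>
#include <linux/gfp.h>
#include <linux/skbuff.h>

/* Illustrative sketch, not the real tg3 workaround code. */
static int dma_hwbug_copy_sketch(struct sk_buff **pskb)
{
        struct sk_buff *copy = skb_copy(*pskb, GFP_ATOMIC);

        if (!copy)
                return -ENOMEM;         /* copy failed: the caller drops the
                                         * packet and TCP retransmits it */

        dev_kfree_skb(*pskb);           /* replace the original skb with the
                                         * linear copy that avoids the DMA bug */
        *pskb = copy;
        return 0;
}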

Comment 12 Rafael Aquini 2014-11-26 14:19:17 UTC
As Larry and Kosaki have emphasized, the kernel can never guarantee that GFP_ATOMIC high-order allocation requests will always succeed, because memory fragmentation can get in the way of finding those blocks of contiguous memory, and the atomic context cannot afford to wait on the PFRA to reclaim a contiguous chunk. So it's really up to the network stack (or NIC driver) to handle the allocation failure by backing off to smaller chunks and retransmitting.

As this problem arises not as a function of how much memory is available in the system, but of how fragmented that memory is, perhaps a way to work around it would be to force memory compaction after a forced cache drop, in a scheduled job. Would you mind scheduling the following one-liner a couple of times a day, or just before the workload spikes that lead to most of the observed allocation failures?

 echo 3 > /proc/sys/vm/drop_caches && echo 1 > /proc/sys/vm/compact_memory

Alongside the data Larry requested, if we can confirm that forcing a cache drop followed by memory compaction does help reduce the observed allocation failures, we will have a good set of information from which to derive an action plan for improving the kernel behaviour in this regard.
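
For example, that procedure could be scheduled via cron (the file name and time below are purely illustrative; pick a slot shortly before the backup window):

# /etc/cron.d/compact-memory  (illustrative schedule, adjust to the local workload)
30 1 * * * root echo 3 > /proc/sys/vm/drop_caches && echo 1 > /proc/sys/vm/compact_memory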

-- Rafael

Comment 13 Larry Woodman 2015-08-19 17:44:06 UTC
What should we do with this BZ?  As we said, the tg3 driver is passing GFP_ATOMIC|GFP_DMA with a non-zero order to the allocator, and the allocation simply fails if there isn't the requested amount of physically contiguous memory in the buddy list for the DMA zone.  The DMA zone is small and is certain to become fragmented enough to cause high-order allocations to fail once the system has been up and running for a while, if some driver is using that memory, which in this case there is.

Larry

Comment 14 John 2015-08-19 20:27:46 UTC
I still can't believe this.

I need to ask several questions at this point before my head explodes:

1) Why are people saying the DMA zone is small? On x86_64, it's 4GB on any machine with more RAM than that. That's not small.

2) Why do people just happily accept DMA zone fragmentation, as if it is unavoidable? I've seen this bug on machines with 24GB+ of RAM. What excuse is there for having 4GB of DMA zone fragmented so badly that a 64 KB allocation can't be satisfied, on machines with 24+GB of RAM under no load apart from filesystem caching?

3) How is it that expensive servers from HP and Dell still use tg3 NICs with a known DMA bug, and a buggy driver that can't fall back to lower-order allocations? For Christ's sake, I think the tg3 driver has been around long enough, and been deployed in enough enterprise systems, for the driver to work by now.

4) If a driver needs big chunks of contiguous memory that are unlikely to be satisfied by the abysmally poor memory management of the Linux kernel, why doesn't it just claim a good-sized chunk when the driver is initialised, and then manage that memory itself? This is a 1GbE NIC; if it wants a bunch of memory, it can get it from anywhere. It cannot possibly need much memory from the DMA zone, so why doesn't it grab the DMA memory it needs and hold onto it?

5) How is it possible that Linux memory management is so poor that filesystem caching can cause memory fragmentation and lead to problems like this? The whole reason Linux uses pretty much all available memory for filesystem cache is that the cache can be dropped in an instant, so the memory can be reallocated for something more urgent. Are people now saying that GFP_ATOMIC does not give the kernel enough time to reclaim memory from filesystem cache, so the kernel doesn't even try to reclaim and will only use what is already free? If so, why doesn't the Linux kernel think ahead a little, and pre-emptively drop filesystem cache to de-fragment memory and ensure there are ALWAYS some larger contiguous chunks available for GFP_ATOMIC allocations from the DMA zone? When you have an x86_64 machine with 36GB of RAM, it should not be difficult to drop some filesystem cache out of the bottom 4GB to free up a bucketload of contiguous pages. At any time.

6) In the thread above, it has been made very clear that the tg3 driver should handle the allocation failure better and fall back to lower-order allocations. SO WHY IS THIS BUG STILL OPEN? WHY HAS THE TG3 DRIVER NOT BEEN FIXED?

Comment 16 John 2015-10-13 12:59:34 UTC
Registering a block against bug #1270638, which is a private bug and cannot be viewed by anyone, is NOT helpful.

Comment 17 John 2015-10-13 13:10:24 UTC
Nor was it helpful to register the earlier block against another private bug: #1172231.

Can someone please explain how it is possible for these two private bugs to be blocking this one?

There is no real reason why resolution of this bug should be prevented by any other bug report. There is no piece of software in existence whose correct functioning DEPENDS on this bug being PRESERVED.

If we fix the page allocations in the TG3 driver, or improve the ability of the kernel to reclaim some memory and pre-emptively defragment the DMA Zone, is it going to break other software? NO. It cannot conceivably break anything (unless the fix itself is buggy, but that would be another issue)

Please REMOVE THESE BLOCKS or open up the blocker bugs so we can see what the problem is. There is no way in my view that there can be any blockers for this bug. There may be duplicates, or other bugs which have the same root cause, and which you may wish to concentrate on, in which case this bug will be resolved when you fix the other bug. But that is not a valid reason to lodge those bugs as blockers for this one.

Comment 18 John 2015-10-13 13:30:19 UTC
I'll say it again:

It wasn't too long ago that most RHEL systems were running 32-bit kernels, and 4GB of RAM was pretty typical. On those systems, the DMA zone was pretty small. But they still managed to run the same sort of hardware: tg3 NICs and so on.

Now, on 64-bit kernels, the DMA zone has an entire 32-bit address space of 4GB, and people are still saying the DMA zone is small? Is this meant to be some sort of joke?

Comment 19 Larry Woodman 2015-10-13 13:53:36 UTC
John, first of all the DMA zone is the first 16MB (2^24 bytes) of physical memory, and the DMA32 zone is whatever usable RAM exists in the first 4GB of physical memory, NOT 4GB.  You can see the actual sizes written to the console via "echo m > /proc/sysrq-trigger".  However, this problem is caused by the tg3 NIC driver passing GFP_ATOMIC for a 64KB allocation.  In this case the allocator simply returns NULL if 16 pages of physically contiguous memory are not free.  It does NOT attempt to direct-reclaim or coalesce physical memory; it simply wakes up kswapd and returns NULL.  Any driver or other kernel code that uses GFP_ATOMIC must be prepared to deal with memory allocation failures and reduce the size of the allocation before retrying.  The fact that this failure is happening is NOT the fault of the allocator, but indicates that memory is fragmented.  This is typically caused by other drivers, or even the same driver, requesting lots of higher-order allocations and not promptly freeing that memory.

If you can reproduce this problem (I can't), please get me that Alt-SysRq-M output via the echo command above.

Larry

Comment 20 John 2015-10-14 11:50:04 UTC
The TG3 NIC isn't an ancient ISA card; if it can't use the full 32-bit 4GB DMA32 zone, I'll just spit the dummy and give up. Stick a fork in me, I'm done. It's not worth being a member of the human race anymore. We're all doomed.

So, I STILL say that if the kernel cannot allocate 64KB of contiguous memory from somewhere in the 4GB DMA32 zone, then something is wrong. With the kernel.

If we look at RHEL6 kernels, for example, they've removed the ability to easily limit filesystem cache (which was present in EL5 kernels), with the logic being that the kernel can manage it.

And then when memory gets fragmented all to hell, people say well what do you expect, memory gets fragmented. 

Well, I'm just saying it's not good enough for the kernel to allow memory to be fragmented to this extent by nothing more than pushing a few files out and filling the pagecache. If fragmentation is still causing problems for (poorly written) device drivers, why can't people face this simple fact, and say hey, I've got 32GB of RAM in my server, there's no benefit from using every last drop of it for filesystem cache. Let's just limit the pagecache a LITTLE to keep some contiguous memory lying around in the bottom 4GB. It can't be that hard. And even if it IS hard, I still don't see why no one will admit it's even an issue. If it IS a hard problem to solve, the first step might be to admit it's a problem and it's a hard one.

Instead, the standard response from everyone is to deny it's even a problem.

Well, this bug, and others, are proof that it is.

And the only solution that's been suggested is to drop ALL the pagecache, which is ridiculous. There should be no reason to drop it all. That IS a waste. All the kernel has to do is just drop SOME when it's getting low on contiguous memory, BEFORE the contiguous memory gets so low a driver can't get 64KB.

As for reproducing this problem, we have moved on to newer kernels, and, to be fair, they do seem to be doing better. So maybe, just maybe, the worst of these problems are resolved at last. But I'm not completely convinced yet. Maybe the problems have just been resolved by switching our heavily loaded servers to 10GbE Intel NICs, and maybe we'll install something in the future which triggers these sorts of issues again if the underlying fragmentation problem is still there, and I think it is.

In any case, it still aggravates me that the problem was brushed under the carpet for so long with these lame excuses.

Bah. <end rant>

