Bug 616600
| Summary: | SPECsfs NFS V3 workload on RHEL6 running kernels later than 2.6.32-33 fails with thousands of kernel alloc order messages from most kernel threads | | |
|---|---|---|---|
| Product: | Red Hat Enterprise Linux 6 | Reporter: | Barry Marson <bmarson> |
| Component: | kernel | Assignee: | Rik van Riel <riel> |
| Status: | CLOSED DUPLICATE | QA Contact: | Barry Marson <bmarson> |
| Severity: | high | Docs Contact: | |
| Priority: | high | | |
| Version: | 6.0 | CC: | aarcange, agospoda, bfields, dshaks, jlayton, kosaki.motohiro, linuxdev-kernel-it, lwoodman, nhorman, qcai, riel, rwheeler, steved, syeghiay |
| Target Milestone: | rc | Keywords: | Reopened |
| Target Release: | --- | | |
| Hardware: | All | | |
| OS: | Linux | | |
| Whiteboard: | | | |
| Fixed In Version: | | Doc Type: | Bug Fix |
| Doc Text: | | Story Points: | --- |
| Clone Of: | | Environment: | |
| Last Closed: | 2011-03-02 18:27:24 UTC | Type: | --- |
| Regression: | --- | Mount Type: | --- |
| Documentation: | --- | CRM: | |
| Verified Versions: | | Category: | --- |
| oVirt Team: | --- | RHEL 7.3 requirements from Atomic Host: | |
| Cloudforms Team: | --- | Target Upstream Version: | |
| Embargoed: | | | |
| Attachments: | | | |
This issue has been proposed when we are only considering blocker issues in the current Red Hat Enterprise Linux release. ** If you would still like this issue considered for the current release, ask your support representative to file as a blocker on your behalf. Otherwise ask that it be considered for the next Red Hat Enterprise Linux release. **

We need to review this before pushing it out of 6.0; it sounds like it might be a regression.

Please try with "echo never >/sys/kernel/mm/redhat_transparent_hugepage/enabled" on the latest rhel6 kernel (not an old one). If you get more errors than on -33 it could mean that memory compaction is triggering more compound-page allocation failures, but that's hard to believe... so more likely disabling THP will fix it. Maybe enabling THP makes all hugepages get used in userland, so they become harder to allocate. Also, those are generally warnings; do you really get a userland failure? You also said SLAB fixed it, but the above stack trace is with SLAB. So do you still get the errors with SLAB? Things like this: "page allocation failure. order:2, mode:0x20" are warnings and expected; there's no way to reliably allocate any page with GFP_ATOMIC, even less an order-2 page. I guess it's all expected if you only get errors like this. If you can attach all the thousands of errors I can check whether there are more relevant ones. Maybe for tg3 with jumbo-size packets we should reserve some bigger pages for the GFP_ATOMIC allocations to make them more reliable.

THP was disabled ... ie [never]. In some cases I've seen the system hang under this workload. I typically leave panic-on-oops set to 0 so I can best see what's going on, since console access is a challenge. Will be attaching a messages log of what should be a full run, which means building the various filesystems ext2/3/4, xfs. I held back on gfs2 due to this issue. Barry

Created attachment 433419 [details]
messages file for a .32-37 run doing multiple filesystems.
I can see one potential cause for this issue. Before we introduced the memory compaction code, kswapd would do higher-order reclaims if an allocation failed (see the "order" argument to wakeup_kswapd). With memory compaction, we rely on the allocator to compact memory. This obviously is not going to happen with atomic allocations. I suspect we'll have to let kswapd, or some other kernel thread, do memory compaction in order to get rid of those allocation errors. That way kswapd will get woken up once we reach the low threshold for the order of allocations that need to happen, and we can have more free areas of this order freed up by kswapd.

If THP is disabled then it's definitely a memory compaction issue. I agree with Rik's theory in comment #8.

2.6.32-34 did not change that:

* Tue Jun 15 2010 Aristeu Rozanski <arozansk> [2.6.32-34.el6]
- [net] Revert "[net] bridge: make bridge support netpoll" (Herbert Xu) [602927]
- [virt] always invalidate and flush on spte page size change (Andrea Arcangeli) [578134]
- [mm] root anon vma bugchecks (Andrea Arcangeli) [578134]
- [mm] resurrect the check in page_address_in_vma (Andrea Arcangeli) [578134]
- [mm] root anon vma use root (Andrea Arcangeli) [578134]
- [mm] avoid ksm hang (Andrea Arcangeli) [578134]
- [mm] always add new vmas at the end (Andrea Arcangeli) [578134]
- [mm] remove unnecessary lock from __vma_link (Andrea Arcangeli) [578134]
- [mm] optimize hugepage tracking for memcgroup & handle splitting (Rik van Riel) [597108]
- [mm] properly move a transparent hugepage between cgroups (Rik van Riel) [597081]
- [mm] scale statistics if the page is a transparent hugepage (Rik van Riel) [597077]
- [mm] enhance mem_cgroup_charge_statistics with a page_size argument (Rik van Riel) [597058]
- [virt] add option to disable spinlock patching on hypervisor (Gleb Natapov) [599068]
- [virt] xen: don't touch xsave in cr4 (Andrew Jones) [599069]
- [drm] Update core to current drm-linus (Adam Jackson) [589547 589792 597022]
- [mm] fix refcount bug in anon_vma code (Rik van Riel) [602739]

I just built a -36 kernel and removed all of the [mm] patches from -34 and still see the issue .... again .. THP was disabled .. Barry

Created attachment 433533 [details]
make kswapd call the memory compaction code
(I am only just now compiling a test RPM with this patch for my own systems, so the patch is currently untested).
Replacing lumpy reclaim with memory compaction seems to have
helped some of the worst cases encountered with large memory
allocations.
However, this did accidentally remove kswapd's helping out
with the defragmenting of memory. This can cause memory
allocation failures for atomic and other non-waiting allocations,
which rely on kswapd to free and defragment memory for it.
This patch makes kswapd restrict itself to free memory up
to the PAGE_ALLOC_COSTLY_ORDER watermarks and has kswapd
call the memory compaction code if higher orders need to
be freed up.
This should help higher-order non-waiting allocations, and
will hopefully also reduce the number of compaction stalls
encountered when running with transparent hugepages.
Signed-off-by: Rik van Riel <riel>
---
include/linux/compaction.h | 6 ++++++
mm/compaction.c | 2 +-
mm/vmscan.c | 24 ++++++++++++++++++------
3 files changed, 25 insertions(+), 7 deletions(-)
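One way to see whether the compaction code is actually being exercised while the workload runs is to watch the compaction counters; a sketch, assuming the RHEL6 backport exposes the upstream statistics in /proc/vmstat (counter names may differ in a given backport):

```shell
# Assumption: the kernel exports the upstream compaction counters.
grep -E '^compact_(stall|fail|success)' /proc/vmstat
# Re-run periodically while the benchmark is active to see the deltas, e.g.:
#   watch -n 5 "grep -E '^compact_' /proc/vmstat"
```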
I have been running tests on a kernel with this patch most of the evening now; it appears to be stable. The brew task is here: http://brewweb.devel.redhat.com/brew/taskinfo?taskID=2617241

Why not call memory compaction for order <= PAGE_ALLOC_COSTLY_ORDER?

Because the compaction code refuses to run for order <= PAGE_ALLOC_COSTLY_ORDER :)

But an order 2 allocation failed, and PAGE_ALLOC_COSTLY_ORDER is 3, so this won't invoke compaction for order 2 allocations. So how is it supposed to help or make any difference for order 2 GFP_ATOMIC allocations? Just because some incidental order >3 allocation is triggered in the background by something else? Those order >3 allocations aren't guaranteed in the background... And shouldn't it be throttled with compaction_deferred? Nevertheless it'll be interesting to see if it helps... so I'm not against testing it. Also I don't think it's correct that compaction refuses to run for orders 1, 2 and 3; probably we should remove that limit. Also lumpy reclaim was a noop for order 2 allocations, so it's hard to see how it's related. I refer to the failure at the top; I've yet to check the messages file, though maybe it shows bigger orders. The lockup may also be a console issue with the kernel error flooding. I think tg3 should use __GFP_NOWARN in its jumbo packet allocation and that is the real bug; it seems it's just that failing with order 2. It must have a fallback. However, if it happens more frequently we should figure out why, but I'd replace the huge dump with a rate-limited warning in tg3. There are no interesting changes from -32 to -36 (other than THP turned on and off by default in different versions), but you were running with THP disabled according to comment 11. So if -32 or -33 ran the most stable, it makes no sense that the regression starts in -36... I would suggest the regression may have started from -28 to -29, with -28 showing fewer errors and -29 showing more errors. Can you check?

Created attachment 433859 [details]
use compaction to create order >0 GFP_ATOMIC pages (should be more reliable than lumpy with prio < DEF_PRIORITY -2)
I created a test rpm with the patch in id=433859 (adjusted to apply cleanly): http://brewweb.devel.redhat.com/brew/taskinfo?taskID=2619614 This also has the freezing fix for khugepaged/ksmd and the young-bit-heuristic removal from khugepaged. Please keep testing after running "echo never >/sys/kernel/mm/redhat_transparent_hugepage/enabled" to reduce the number of variables in the equation. Once the test passes, please try again with "echo always >/sys/kernel/mm/redhat_transparent_hugepage/enabled". I still think tg3 should be more friendly, but hopefully we can make order 2 GFP_ATOMIC more reliable than it has ever been with this.

Proceeding to test the kernel in #20. This test requires that kernels before -38 be built with a patch for NFS. The work recommended by #18 will be looked into if we don't make progress. Barry

I tested the kernel in #20 and got the allocation errors a little later in the test run (at 10000 Ops/sec instead of 6000 Ops/sec). Would you like me to post the messages log? Barry

Ok, I'm wondering if maybe the all_unreclaimable logic would potentially get activated and inhibit compaction. I disabled all_unreclaimable for order > 0 allocations. I'm not too optimistic that all_unreclaimable was the problem, but you may try it out again just in case... http://brewweb.devel.redhat.com/brew/taskinfo?taskID=2623005

Created attachment 434149 [details]
same as 433859 but avoid setting all_unreclaimable for order > 0
I tried the kernel from #23 and still got some allocation failures. I will let it run and post a summary of the number and types of failures. Barry

After just doing the ext2 run, there were 2.5K allocation failures ... This is roughly in line with the kernel from #20. Barry

Hi Barry, thanks a lot for testing. What do you mean by "kernel from #20" — comment #20 or kernel version number -20 (before the memory compaction and lumpy reclaim removal)? Does this mean we're good with the last patch in comment #24 included in the build in comment #23? Note: we cannot eliminate the failures; the large pages aren't reserved pinned for order 2 allocations in the min-watermark zone, so as other allocations run they can get fragmented, and when the GFP_ATOMIC order 2 allocation is invoked from the tg3 driver it will fail once, then kswapd is activated and told to generate more order 2 pages for the next GFP_ATOMIC allocations. I could try to remove the priority check; that might speed up the generation of order 2 pages, further reducing the number of failures, but they cannot go away completely, not in rhel6 at least. I think tg3 should be less verbose about it; it's normal if order 2 allocations with GFP_ATOMIC fail sometimes, and it should get rate limited.

Sorry, meant comment #20. I think there was confusion about the build referenced in comment #23. If I was supposed to take that and patch it with the comment #24 patch, then I haven't done the correct testing. I'll grab the two bits, build it and test it now.

Well hopefully we can get rid of virtually all of them, as I haven't seen them in past RHELs or even in RHEL6 when we moved to SLAB around -19. I suppose I could run this without jumbo frames and see what happens too. I'll build the kernel and try that first. Barry

So the build from comment #23 does have the patch from comment #24, so this has been tested. Just for grins .. I'm going to try removing jumbo frames. Barry

If you remove jumbo frames the problem shall go away... I assume the kernels around -19 also used jumbo frames?!? I think some failures are definitely to be expected with tg3 using jumbo frames allocated with GFP_ATOMIC order 2; that is absolutely unavoidable. The problem is how many is normal, and by making kswapd more aggressive we can reduce the number of jumbo frame atomic alloc failures. But it's hard to imagine you really got _none_ in old kernels if they were using jumbo frames too! I'd suggest trying -19 again and making sure jumbo frames were enabled and that you got zero failures with SLAB.

I've always used jumbo frames ... and doing the tests around -19 with SLAB, there were no failures in any of the file systems. I'd like to verify that disabling jumbo frames makes it go away. Barry

OK .. so I ran all sorts of things the last few days. Using the 52.el6transhuge kernel, failures occurred as expected ... Then I took jumbo frames out of the scene and this kernel worked for ext2. I then proceeded to see if I could induce fewer errors by changing the MTU from what was 9000 down to 8000. I got the same errors. I got bold and tried reducing the MTU to under a page (ie 4000) and once again I got the same errors. I thought this would have done better; actually I thought it would have made the problem disappear as much as the default MTU of 1500 does. Late yesterday I reinstalled .32-19 and ran it with jumbo frames. Something has been lurking for a while with this workload where we panic and we don't have access to the console. One month it works, then the setup fails (and it's remote, in RDU). I disabled panic-on-oops in an attempt to get an early culprit into the messages log. In this case we got nothing but an iLO snapshot of half a panic message :( But the point of the run was that 19 out of 22 points were run and there wasn't a single allocation error. I'm re-queueing it again ... Barry

So the rerun was successful (not a single allocation failure) .. ie the -19 kernel with jumbo frames MTU=9000 for both ext2 and xfs. I picked these two since all filesystems have been giving me trouble on more recent kernels, but the best fairly recent kernel was -33, where the problem occurred much less and only with xfs. Barry

Ok, I'll try to make kswapd a lot more aggressive in calling memory compaction, then we'll see.

Created attachment 435576 [details]
same as 434149 but run compaction immediately if kswapd notices fragmentation
going to build a kernel with 435576 and all patches from bug 614427

This -55 build includes:
https://bugzilla.redhat.com/attachment.cgi?id=435580
https://bugzilla.redhat.com/attachment.cgi?id=432044
https://bugzilla.redhat.com/attachment.cgi?id=435340
https://bugzilla.redhat.com/attachment.cgi?id=432045
from bug 614427,
https://bugzilla.redhat.com/attachment.cgi?id=435576
from bug 616600, and the removal of the young bit check from khugepaged, bug 615381.
http://brewweb.devel.redhat.com/brew/taskinfo?taskID=2641692

Please test; hope this makes some difference. I think reintroducing lumpy for all high-order allocs is very risky with THP enabled by default (at most we could restrict it to kswapd, considering that THP async allocations are only given to khugepaged throttled with alloc_sleep_millisecs and never to kswapd, thanks to __GFP_NO_KSWAPD being part of GFP_TRANSHUGE). But frankly I hope this is going to work ok and that we can solve it with compaction. People on lkml are hacking around lumpy because they also noticed it made the system hang, and they're not even trying to run order 9 allocations in a flood like THP does all the time... lumpy going blind on referenced bits is a responsiveness killer and in turn a hazardous logic that we'd better avoid if we can.

For this workload having THP on makes no sense IMO. Millions of small files, average size 8KB. Will run this new kernel with THP off as always. If it works, I'll try it with it on for reference. Barry

One patch of the above list slipped; please use this build instead: http://brewweb.devel.redhat.com/brew/taskinfo?taskID=2641800 I think turning THP on could be invoking compaction more frequently, and it may actually risk _hiding_ the problem. So yes, it's a good idea to test it with THP off, to avoid hiding the problem compared to the reference kernel .32-19. But I think THP should be on even for this workload; if your workload won't take advantage of it, probably no hugepages will be allocated at all. In short, the only thing you risk wasting is a little memory in userland; the memory taken by the millions of little slab objects created by the little files won't change at all regardless of THP on or off.

Using the kernel from brew referenced in comment #39, with MTU=9000 and THP off, I got nearly 3800 allocation failed messages ... Here's their frequency:

3438 nfsd: page allocation failure. order:2, mode:0x20
102 rsyslogd: page allocation failure. order:2, mode:0x20
57 kswapd0: page allocation failure. order:2, mode:0x20
48 ksoftirqd/2: page allocation failure. order:2, mode:0x20
31 ksoftirqd/1: page allocation failure. order:2, mode:0x20
18 automount: page allocation failure. order:2, mode:0x20
17 irqbalance: page allocation failure. order:2, mode:0x20
9 flush-8:240: page allocation failure. order:2, mode:0x20
6 scsi_eh_1: page allocation failure. order:2, mode:0x20
3 flush-8:176: page allocation failure. order:2, mode:0x20
3 events/1: page allocation failure. order:2, mode:0x20
2 flush-8:80: page allocation failure. order:2, mode:0x20
2 flush-66:224: page allocation failure. order:2, mode:0x20
2 flush-66:208: page allocation failure. order:2, mode:0x20
2 ata/1: page allocation failure. order:2, mode:0x20
1 swapper: page allocation failure. order:2, mode:0x20
1 master: page allocation failure. order:2, mode:0x20
1 hald-addon-stor: page allocation failure. order:2, mode:0x20
1 flush-8:16: page allocation failure. order:2, mode:0x20
1 flush-67:112: page allocation failure. order:2, mode:0x20
1 flush-66:80: page allocation failure. order:2, mode:0x20
1 flush-65:224: page allocation failure. order:2, mode:0x20

I'll try THP on as an experiment. My question is: on the prior tests with the -52 based kernel, why do we get allocation errors with MTU=4000 and not with 1500? I expected anything comfortably under a page size would not fail. Barry
So the allocation error goes down compared to the 11000. Reading compaction code I think I need to tweak compaction, it's not made for GFP_ATOMIC and it's super tuned for direct reclaim (low watermark instead of high) and it stops immediately after there's a single page of the right order (again ideal for direct reclaim but broken for GFP_ATOMIC preparation of the high watermark). I'll fix this and we'll try again. here a new build to try, this time compaction will not stop if there is 1 free page of right order in the freelist and it'll keep going until we reach the high wmark. http://brewweb.devel.redhat.com/brew/taskinfo?taskID=2645962 Created attachment 436082 [details]
create GFP_ATOMIC high order pages with compaction
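The question of why MTU 4000 still fails can be looked at with simple buddy-order arithmetic. A sketch; the buffer sizes below are illustrative assumptions, not the exact tg3 buffer sizes:

```shell
# Map a buffer size in bytes to a buddy-allocator order (4096-byte pages).
order_for() {
    local pages=$(( ($1 + 4095) / 4096 )) o=0
    while (( (1 << o) < pages )); do o=$(( o + 1 )); done
    echo "$o"
}
order_for 1536    # standard 1500-MTU buffer: order 0 (fits in one page)
order_for 4396    # MTU 4000 plus an assumed ~400 bytes of overhead: order 1
order_for 9346    # jumbo buffer for MTU 9000: order 2
```

Even a few hundred bytes of header and shared-info overhead pushes an MTU-4000 buffer past one page, and if the driver sizes its jumbo ring buffers for the maximum jumbo frame whenever MTU > 1500 (as later comments suggest for tg3), the allocations stay at order 2 regardless of the configured MTU.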
Tried THP on for the kernel referenced in comment #39 and still had the issues. Also the error count is actually about the same. The 11K referenced was the sum of testing multiple file system types. Chatting with gospo, I found out that the tg3 network cards (half of what I have) always allocate more than order 2 if MTU > 1500 (sigh). The e1000's do it totally differently (they have some sort of DMA). Might experiment with reduced client counts to see if I can point this to one NIC driver's behavior vs the other. But first I'll try the kernel in comment #45. Is there a way to get those kernel bits without using the GUI? I'm still remote. Barry

wget --no-check-certificate "https://brewweb.devel.redhat.com/getfile?taskID=2645965&name=kernel-2.6.32-56.el6transhuge.x86_64.rpm"

The kernel from comment #45 fails the same way. Sorted histogram for the ext2 run follows ...

3427 nfsd: page allocation failure. order:2, mode:0x20
108 rsyslogd: page allocation failure. order:2, mode:0x20
65 irqbalance: page allocation failure. order:2, mode:0x20
44 ksoftirqd/2: page allocation failure. order:2, mode:0x20
41 ksoftirqd/1: page allocation failure. order:2, mode:0x20
31 kswapd0: page allocation failure. order:2, mode:0x20
15 automount: page allocation failure. order:2, mode:0x20
6 events/1: page allocation failure. order:2, mode:0x20
5 flush-65:160: page allocation failure. order:2, mode:0x20
3 migration/1: page allocation failure. order:2, mode:0x20
3 flush-66:224: page allocation failure. order:2, mode:0x20
2 swapper: page allocation failure. order:2, mode:0x20
2 jbd2/dm-0-8: page allocation failure. order:2, mode:0x20
2 flush-67:112: page allocation failure. order:2, mode:0x20
2 flush-66:64: page allocation failure. order:2, mode:0x20
2 events/2: page allocation failure. order:2, mode:0x20
1 flush-8:144: page allocation failure. order:2, mode:0x20
1 flush-67:80: page allocation failure. order:2, mode:0x20
1 flush-66:80: page allocation failure. order:2, mode:0x20
1 flush-65:208: page allocation failure. order:2, mode:0x20

Barry

Make that comment #43 ... ie http://brewweb.devel.redhat.com/brew/taskinfo?taskID=2645962 Barry

Ok, I will have to write a module to simulate this GFP_ATOMIC order 2... and hopefully see what is going on. So the workload is working fine, you just get flooded by these warnings? No more lockups? Or still?

I did some experimenting over the last couple of days and here's what was observed. Running the benchmark using half the clients, which are exclusively connected to the tg3 NICs, shows the problem. Running the benchmark using half the clients, which are exclusively connected to the e1000 NICs, shows NO problem at all. This is definitely a sensitivity to the way the tg3 driver allocates. One could argue the workload functions, but the synchronous allocation failures affect performance on the server. Every once in a while the system will lock up .. but that hasn't happened for a week. Barry

I wonder if this is related to commit d2757fc4076118e13180e91f02c3c52659be3d9d. I've got a test build here: http://brewweb.devel.redhat.com/brew/taskinfo?taskID=2671142 Barry, could you please test it out when it's complete? Thanks!

We may have a fix for this issue in our kernel tree. Barry, would you have time to test the latest RHEL6 kernel?
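The sorted per-process histograms posted in this bug can be reproduced from a messages file with a short pipeline. A sketch, assuming the syslog format shown in the traces above:

```shell
# Tally "page allocation failure" lines per reporting task from a syslog file.
count_failures() {
    grep 'page allocation failure' "$1" \
        | sed 's/.*kernel: //; s/ page allocation failure.*//' \
        | sort | uniq -c | sort -rn
}
# Usage: count_failures /var/log/messages
```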
This got merged in kernel 2.6.32-60.el6.
commit 58ee95276aa7ae00d289823d3471c0bd4c0ffb1b
Author: Andrea Arcangeli <aarcange>
Date: Fri Jul 30 16:18:20 2010 -0400
[mm] Correctly assign the number of MIGRATE_RESERVE pageblocks
Message-id: <20100730161820.GH16655>
Patchwork-id: 27259
O-Subject: [RHEL6 PATCH] Correctly assign the number of MIGRATE_RESERVE
pageblocks
Bugzilla: 614427
RH-Acked-by: Rik van Riel <riel>
RH-Acked-by: Larry Woodman <lwoodman>
RH-Acked-by: Marcelo Tosatti <mtosatti>
https://bugzilla.redhat.com/show_bug.cgi?id=614427
https://bugzilla.redhat.com/attachment.cgi?id=435340
========
Subject: mm: page-allocator: Correctly assign the number of MIGRATE_RESERVE
From: Mel Gorman <mel.ie>
The page allocator marks a maximum of 2 pageblocks per zone
MIGRATE_RESERVE. This is to implement a "wilderness preservation heuristic"
for high-order atomic allocations. Ordinarily, min free kbytes is low and
increasing it helps fragmentation control. hugeadm can set a recommended
min_free_kbytes value that should be run from an init script.
Rather than using an init script, the RHEL kernel sets the recommended
min_free_kbytes early in boot. The problem is that
when selecting pageblocks to convert to MIGRATE_RESERVE, MIGRATE_MOVABLE
blocks are preferred of which none or very few exist early in boot.
The end result is that the wrong number of MIGRATE_RESERVE blocks are set
and high-order atomic allocation success rates suffer. While guarantees
are never made for high-order atomic allocations, correct setting of
MIGRATE_RESERVE pageblocks is vital for them to succeed at all. This
patch corrects the problem by taking a second pass at setting
MIGRATE_RESERVE if enough MIGRATE_MOVABLE blocks do not exist.
Signed-off-by: Mel Gorman <mel.ie>
Signed-off-by: Andrea Arcangeli <aarcange>
Signed-off-by: Aristeu Rozanski <arozansk>
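To see whether the MIGRATE_RESERVE accounting actually changed on a running system, the per-zone pageblock type counts can be inspected. A sketch; it assumes a kernel that exposes /proc/pagetypeinfo, where the "Reserve" column counts the pageblocks backing high-order atomic allocations:

```shell
# The watermark baseline the kernel (or hugeadm) picked:
cat /proc/sys/vm/min_free_kbytes
# Per-zone counts of pageblocks by migrate type; check the Reserve column.
grep -A 4 'Number of blocks type' /proc/pagetypeinfo
```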
I've had to move forward on this issue by migrating to all Intel based e1000 NICs. I'm presently testing that so I can at least move forward. Testing is going to take a couple of days. The Broadcoms are still installed but not connected. Since the machine is remote to me, I would have to get HP's help with re-cabling. Barry

Barry, didn't you tell me this only happens when running over tg3??? Larry

Yes, this is definitely an interaction issue with the tg3 driver and this workload under heavy memory pressure. Before very recent changes, there were 4 clients: 2 talked to tg3 and 2 to e1000. Running the test using only the e1000-served clients worked fine. The tg3-served clients had all the allocation messages. Barry

*** This bug has been marked as a duplicate of bug 614427 ***

Reopening to have Barry verify whether the patch in bug 614427 fixes the issue that he was seeing.

So I got my config back where I can use the tg3 NIC cards. I tested the -64 kernel with just the ext2 file system and jumbo frames. The problem still happens. It's not clear from the bz referenced in comment #58 whether a much older kernel (-57) was what should have been tested. If so, I've obviously tested something even newer. One suggestion gospo recommended was to try setting the ring parameters. This is what I see for two NICs with ethtool -g:

e1000: Ring parameters for eth2:
Pre-set maximums: RX: 4096 RX Mini: 0 RX Jumbo: 0 TX: 4096
Current hardware settings: RX: 256 RX Mini: 0 RX Jumbo: 0 TX: 256

tg3: Ring parameters for eth3:
Pre-set maximums: RX: 511 RX Mini: 0 RX Jumbo: 255 TX: 511
Current hardware settings: RX: 200 RX Mini: 0 RX Jumbo: 100 TX: 511

Can someone give me the syntax to set the RX and RX Jumbo higher? Thanks, Barry

Relating to comment 51, my build broke; here's a new brew build: http://brewweb.devel.redhat.com/brew/taskinfo?taskID=2700455

Tried the kernel from comment #61 and it fails in the same way. Going to mess with ring buffer sizes. Barry

Hi, I've faced a similar problem, because RHEL5 and RHEL6 have different GFP_ATOMIC definitions:

RHEL5: #define GFP_ATOMIC (__GFP_HIGH | __GFP_NOWARN)
RHEL6: #define GFP_ATOMIC (__GFP_HIGH)

That is, network stress workloads often cause allocation failures, but RHEL5 doesn't display them while RHEL6 does. Can I ask why this was changed? I bet a lot of customers think this message means a regression occurred. - KOSAKI Motohiro

Ohhh, thank you for finding that! It appears that RHEL5 has a patch (linux-2.6-vm-silence-atomic-alloc-failures.patch) to silence failures from atomic allocations. This patch appears to have been in Fedora since 2.6.12 in Fedora Core 5. I am not sure why it never made it upstream, but I'll submit it right now. Upstream discussion: http://lkml.org/lkml/2010/9/21/204

This is very interesting. I find it even more interesting that these failures don't happen with the Intel e1000, just the Broadcom tg3. I wonder what Intel does so differently. Barry

Hi, I guess the e1000 doesn't have a jumbo frame feature (dunno, please contact a network expert); tg3 has. Jumbo frames need order-2 allocations, and those can fail very frequently. I guess more recent Intel network cards have a similar issue. - kosaki

*** This bug has been marked as a duplicate of bug 674147 ***
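For the earlier question about raising the tg3 ring sizes: the write counterpart of `ethtool -g` is `ethtool -G` (run as root; the hardware caps the values at the pre-set maximums, and the jumbo parameter is spelled `rx-jumbo`):

```shell
# Raise the tg3 RX rings toward the pre-set maximums shown by `ethtool -g eth3`.
ethtool -G eth3 rx 511 rx-jumbo 255
ethtool -g eth3    # confirm the new current hardware settings
```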
Description of problem:

Running the SPECsfs NFS workload on the BIGI testbed has yielded many 1000's of allocation order errors from .32-36 onward. I have tried -36, -37, -39, -42 and -48, and they have all failed in the same way. I've been trying to nail down the cause of these allocation issues for quite some time. First we saw them with SLUB, then they went away with SLAB. The last kernel that ran the most stable was the 32-33 kernel. In that case, only when testing XFS file systems did we see fewer than 100 allocation errors, and only from nfsd. The above listed kernels have shown allocation errors with these relative frequencies:

11149 nfsd:
406 kjournald:
338 rsyslogd:
126 kswapd0:
104 ksoftirqd/3:
99 irqbalance:
92 ksoftirqd/1:
84 xfslogd/3:
75 xfsbufd:
70 automount:
46 xfslogd/1:
42 bash:
26 jbd2/sdk-8:
26 abrtd:
25 hald-addon-stor:
24 xfsaild:
21 jbd2/sdq-8:
18 kthreadd:
...

My belief, looking at the patches in -34 through -36, was that the [mm] based patches were the culprit. I tried removing
- [mm] remove unnecessary lock from __vma_link (Andrea Arcangeli) [578134]
from a -36 build, but this didn't fix anything.

As a reminder, this test is being run with 4 NFS clients on 1Gb enet to a DL580 server with 8 cpus/16GB of RAM. Storage is HBA attached: 4 MSA1000s, each presenting 14 LUNs. All storage is direct attach (no switch). The benchmark creates and works on 56 file systems. File systems tested include ext2/3/4, xfs, gfs2. There are 128 nfsd threads. It's all V3 TCP/IP. I really need help with this one. Thanks, Barry

Version-Release number of selected component (if applicable): see above

How reproducible: every time. In fact from a test perspective, the 3rd ext2 test run point (45 minutes into the benchmark) is where we now consistently see the errors.

Steps to Reproduce:
1. I run the workload on the BIGI testbed
2.
3.

Actual results: A typical allocation error ...

Jul 20 14:31:45 bigi kernel: nfsd: page allocation failure.
order:2, mode:0x20
Jul 20 14:31:55 bigi kernel: Pid: 8849, comm: nfsd Tainted: G W 2.6.32-36.el6nommvmaulnk.x86_64 #1
Jul 20 14:31:55 bigi kernel: Call Trace:
Jul 20 14:31:55 bigi kernel: <IRQ> [<ffffffff8111c2ff>] __alloc_pages_nodemask+0x65f/0x7e0
Jul 20 14:31:55 bigi kernel: [<ffffffff81152be2>] kmem_getpages+0x62/0x170
Jul 20 14:31:55 bigi kernel: [<ffffffff8115391a>] fallback_alloc+0x19a/0x240
Jul 20 14:31:55 bigi kernel: [<ffffffff81153731>] ? cache_grow+0x2d1/0x320
Jul 20 14:31:55 bigi kernel: [<ffffffff811531c9>] ____cache_alloc_node+0x99/0x160
Jul 20 14:31:55 bigi kernel: [<ffffffff8140738a>] ? __alloc_skb+0x7a/0x180
Jul 20 14:31:55 bigi kernel: [<ffffffff81153c6f>] kmem_cache_alloc_node_notrace+0x6f/0x140
Jul 20 14:31:55 bigi kernel: [<ffffffff81153ebb>] __kmalloc_node+0x7b/0x100
Jul 20 14:31:55 bigi kernel: [<ffffffff8140738a>] __alloc_skb+0x7a/0x180
Jul 20 14:31:55 bigi kernel: [<ffffffff81407746>] __netdev_alloc_skb+0x36/0x60
Jul 20 14:31:55 bigi kernel: [<ffffffffa0197332>] tg3_alloc_rx_skb+0xa2/0x240 [tg3]
Jul 20 14:31:55 bigi kernel: [<ffffffffa019b61a>] tg3_poll_work+0x8da/0xd60 [tg3]
Jul 20 14:31:55 bigi kernel: [<ffffffff8109e956>] ? tick_periodic+0x36/0x90
Jul 20 14:31:55 bigi kernel: [<ffffffffa019bb04>] tg3_poll+0x64/0x210 [tg3]
Jul 20 14:31:55 bigi kernel: [<ffffffff814145a3>] net_rx_action+0x103/0x210
Jul 20 14:31:55 bigi kernel: [<ffffffff810728f7>] __do_softirq+0xb7/0x1e0
Jul 20 14:31:55 bigi kernel: [<ffffffff8101430c>] call_softirq+0x1c/0x30
Jul 20 14:31:55 bigi kernel: [<ffffffff81015f25>] do_softirq+0x65/0xa0
Jul 20 14:31:55 bigi kernel: [<ffffffff810726f5>] irq_exit+0x85/0x90
Jul 20 14:31:55 bigi kernel: [<ffffffff814ddc55>] do_IRQ+0x75/0xf0
Jul 20 14:31:55 bigi kernel: [<ffffffff81013b13>] ret_from_intr+0x0/0x11
Jul 20 14:31:55 bigi kernel: <EOI> [<ffffffff810dcea4>] ? __call_rcu+0xc4/0x160
Jul 20 14:31:55 bigi kernel: [<ffffffff810dcf75>] call_rcu_sched+0x15/0x20
Jul 20 14:31:55 bigi kernel: [<ffffffff810dcf8e>] call_rcu+0xe/0x10
Jul 20 14:31:55 bigi kernel: [<ffffffff8116b280>] __fput+0x180/0x210
Jul 20 14:31:55 bigi kernel: [<ffffffff8116b335>] fput+0x25/0x30
Jul 20 14:31:55 bigi kernel: [<ffffffffa032d97e>] nfsd_close+0xe/0x10 [nfsd]
Jul 20 14:31:55 bigi kernel: [<ffffffffa032f653>] nfsd_write+0xf3/0x100 [nfsd]
Jul 20 14:31:55 bigi kernel: [<ffffffffa03374ff>] nfsd3_proc_write+0xaf/0x140 [nfsd]
Jul 20 14:31:55 bigi kernel: [<ffffffffa03283ea>] nfsd_dispatch+0xba/0x250 [nfsd]
Jul 20 14:31:55 bigi kernel: [<ffffffffa027c9c4>] svc_process_common+0x344/0x610 [sunrpc]
Jul 20 14:31:55 bigi kernel: [<ffffffffa027cfd0>] svc_process+0x110/0x150 [sunrpc]
Jul 20 14:31:55 bigi kernel: [<ffffffffa0328ae6>] nfsd+0xd6/0x190 [nfsd]
Jul 20 14:31:55 bigi kernel: [<ffffffffa0328a10>] ? nfsd+0x0/0x190 [nfsd]
Jul 20 14:31:55 bigi kernel: [<ffffffff810904f6>] kthread+0x96/0xa0
Jul 20 14:31:55 bigi kernel: [<ffffffff8101420a>] child_rip+0xa/0x20
Jul 20 14:31:55 bigi kernel: [<ffffffff81090460>] ? kthread+0x0/0xa0
Jul 20 14:31:55 bigi kernel: [<ffffffff81014200>] ? child_rip+0x0/0x20
Jul 20 14:31:55 bigi kernel: Mem-Info:
Jul 20 14:31:55 bigi kernel: Node 0 DMA per-cpu:
Jul 20 14:31:55 bigi kernel: CPU 0: hi: 0, btch: 1 usd: 0
Jul 20 14:31:55 bigi kernel: CPU 1: hi: 0, btch: 1 usd: 0
Jul 20 14:31:55 bigi kernel: CPU 2: hi: 0, btch: 1 usd: 0
Jul 20 14:31:55 bigi kernel: CPU 3: hi: 0, btch: 1 usd: 0
Jul 20 14:31:55 bigi kernel: CPU 4: hi: 0, btch: 1 usd: 0
Jul 20 14:31:55 bigi kernel: CPU 5: hi: 0, btch: 1 usd: 0
Jul 20 14:31:55 bigi kernel: CPU 6: hi: 0, btch: 1 usd: 0
Jul 20 14:31:55 bigi kernel: CPU 7: hi: 0, btch: 1 usd: 0
Jul 20 14:31:55 bigi kernel: Node 0 DMA32 per-cpu:
Jul 20 14:31:55 bigi kernel: CPU 0: hi: 186, btch: 31 usd: 7
Jul 20 14:31:55 bigi kernel: CPU 1: hi: 186, btch: 31 usd: 162
Jul 20 14:31:55 bigi kernel: CPU 2: hi: 186, btch: 31 usd: 166
Jul 20 14:31:55 bigi kernel: CPU 3: hi: 186, btch: 31 usd: 160
Jul 20 14:31:55 bigi kernel: CPU 4: hi: 186, btch: 31 usd: 151
Jul 20 14:31:55 bigi kernel: CPU 5: hi: 186, btch: 31 usd: 175
Jul 20 14:31:55 bigi kernel: CPU 6: hi: 186, btch: 31 usd: 162
Jul 20 14:31:55 bigi kernel: CPU 7: hi: 186, btch: 31 usd: 62
Jul 20 14:31:55 bigi kernel: Node 0 Normal per-cpu:
Jul 20 14:31:55 bigi kernel: CPU 0: hi: 186, btch: 31 usd: 84
Jul 20 14:31:55 bigi kernel: CPU 1: hi: 186, btch: 31 usd: 152
Jul 20 14:31:55 bigi kernel: CPU 2: hi: 186, btch: 31 usd: 171
Jul 20 14:31:55 bigi kernel: CPU 3: hi: 186, btch: 31 usd: 111
Jul 20 14:31:55 bigi kernel: CPU 4: hi: 186, btch: 31 usd: 136
Jul 20 14:31:55 bigi kernel: CPU 5: hi: 186, btch: 31 usd: 150
Jul 20 14:31:55 bigi kernel: CPU 6: hi: 186, btch: 31 usd: 137
Jul 20 14:31:55 bigi kernel: CPU 7: hi: 186, btch: 31 usd: 108
Jul 20 14:31:55 bigi kernel: active_anon:2935 inactive_anon:1273 isolated_anon:0
Jul 20 14:31:55 bigi kernel: active_file:396310 inactive_file:2433629 isolated_file:160
Jul 20 14:31:55 bigi kernel: unevictable:0 dirty:29918 writeback:1 unstable:0
Jul 20 14:31:55 bigi kernel: free:36069 slab_reclaimable:831222 slab_unreclaimable:139432
Jul 20 14:31:55 bigi kernel: mapped:2411 shmem:116 pagetables:869 bounce:0
Jul 20 14:31:55 bigi kernel: Node 0 DMA free:15696kB min:64kB low:80kB high:96kB active_anon:0kB inactive_anon:0kB active_file:0kB inactive_file:0kB unevictable:0kB isolated(anon):0kB isolated(file):0kB present:15308kB mlocked:0kB dirty:0kB writeback:0kB mapped:0kB shmem:0kB slab_reclaimable:0kB slab_unreclaimable:0kB kernel_stack:0kB pagetables:0kB unstable:0kB bounce:0kB writeback_tmp:0kB pages_scanned:0 all_unreclaimable? yes
Jul 20 14:31:55 bigi kernel: lowmem_reserve[]: 0 3511 15631 15631
Jul 20 14:31:55 bigi kernel: Node 0 DMA32 free:65012kB min:15164kB low:18952kB high:22744kB active_anon:48kB inactive_anon:28kB active_file:326868kB inactive_file:2039604kB unevictable:0kB isolated(anon):0kB isolated(file):256kB present:3595336kB mlocked:0kB dirty:25000kB writeback:0kB mapped:0kB shmem:0kB slab_reclaimable:719372kB slab_unreclaimable:60268kB kernel_stack:0kB pagetables:32kB unstable:0kB bounce:0kB writeback_tmp:0kB pages_scanned:0 all_unreclaimable? no
Jul 20 14:31:55 bigi kernel: lowmem_reserve[]: 0 0 12120 12120
Jul 20 14:31:55 bigi kernel: Node 0 Normal free:63568kB min:52352kB low:65440kB high:78528kB active_anon:11692kB inactive_anon:5064kB active_file:1258372kB inactive_file:7694912kB unevictable:0kB isolated(anon):0kB isolated(file):384kB present:12410880kB mlocked:0kB dirty:94672kB writeback:4kB mapped:9644kB shmem:464kB slab_reclaimable:2605516kB slab_unreclaimable:497460kB kernel_stack:3224kB pagetables:3444kB unstable:0kB bounce:0kB writeback_tmp:0kB pages_scanned:34 all_unreclaimable?
no Jul 20 14:31:55 bigi kernel: lowmem_reserve[]: 0 0 0 0 Jul 20 14:31:55 bigi kernel: Node 0 DMA: 2*4kB 1*8kB 2*16kB 1*32kB 2*64kB 1*128kB 0*256kB 0*512kB 1*1024kB 1*2048kB 3*4096kB = 15696kB Jul 20 14:31:55 bigi kernel: Node 0 DMA32: 15574*4kB 14*8kB 0*16kB 1*32kB 0*64kB 1*128kB 1*256kB 1*512kB 0*1024kB 1*2048kB 0*4096kB = 65384kB Jul 20 14:31:55 bigi kernel: Node 0 Normal: 14806*4kB 0*8kB 0*16kB 0*32kB 0*64kB 0*128kB 0*256kB 0*512kB 0*1024kB 0*2048kB 1*4096kB = 63320kB Jul 20 14:31:55 bigi kernel: 2830036 total pagecache pages Jul 20 14:31:55 bigi kernel: 0 pages in swap cache Jul 20 14:31:55 bigi kernel: Swap cache stats: add 0, delete 0, find 0/0 Jul 20 14:31:55 bigi kernel: Free swap = 8441848kB Jul 20 14:31:55 bigi kernel: Total swap = 8441848kB Jul 20 14:31:55 bigi kernel: 4063231 pages RAM Jul 20 14:31:55 bigi kernel: 109330 pages reserved Jul 20 14:31:55 bigi kernel: 2826850 pages shared Jul 20 14:31:55 bigi kernel: 1027941 pages non-shared Expected results: Additional info: