Bug 616600 - SPECsfs NFS V3 workload on RHEL6 running kernels later than 2.6.32-33 fail with thousands of kernel alloc order messages from most kernel threads
Keywords:
Status: CLOSED DUPLICATE of bug 674147
Alias: None
Product: Red Hat Enterprise Linux 6
Classification: Red Hat
Component: kernel
Version: 6.0
Hardware: All
OS: Linux
Priority: high
Severity: high
Target Milestone: rc
: ---
Assignee: Rik van Riel
QA Contact: Barry Marson
URL:
Whiteboard:
Depends On:
Blocks:
 
Reported: 2010-07-20 21:35 UTC by Barry Marson
Modified: 2011-03-02 18:27 UTC (History)
CC List: 14 users

Fixed In Version:
Doc Type: Bug Fix
Doc Text:
Clone Of:
Environment:
Last Closed: 2011-03-02 18:27:24 UTC
Target Upstream Version:
Embargoed:


Attachments
messages file for a .32-37 run doing multiple filesystems. (5.44 MB, application/x-gzip)
2010-07-21 13:40 UTC, Barry Marson
make kswapd call the memory compaction code (4.78 KB, patch)
2010-07-21 22:35 UTC, Rik van Riel
use compaction to create order >0 GFP_ATOMIC pages (should be more reliable than lumpy with prio < DEF_PRIORTY -2) (2.58 KB, patch)
2010-07-23 02:25 UTC, Andrea Arcangeli
same as 433859 but avoid setting all_unreclaimable for order > 0 (3.18 KB, patch)
2010-07-24 14:28 UTC, Andrea Arcangeli
same as 434149 but run compaction immediately if kswapd notices fragmentation (3.32 KB, patch)
2010-07-30 14:56 UTC, Andrea Arcangeli
create GFP_ATOMIC high order pages with compaction (6.06 KB, patch)
2010-08-02 18:02 UTC, Andrea Arcangeli

Description Barry Marson 2010-07-20 21:35:29 UTC
Description of problem:

Running the SPECsfs NFS workload on the BIGI testbed has yielded many thousands of allocation-order errors from .32-36 onward. I have tried

-36, -37, -39, -42, and -48, and they have all failed in the same way.

I've been trying to nail down the cause of these allocation issues for quite some time. First we saw them with SLUB; then they went away with SLAB. The last kernel that ran the most stable was 2.6.32-33; with it, only when testing XFS file systems did we see fewer than 100 allocation errors, and only from nfsd. The kernels listed above have shown allocation errors with the following relative frequencies:

  11149 nfsd:
    406 kjournald:
    338 rsyslogd:
    126 kswapd0:
    104 ksoftirqd/3:
     99 irqbalance:
     92 ksoftirqd/1:
     84 xfslogd/3:
     75 xfsbufd:
     70 automount:
     46 xfslogd/1:
     42 bash:
     26 jbd2/sdk-8:
     26 abrtd:
     25 hald-addon-stor:
     24 xfsaild:
     21 jbd2/sdq-8:
     18 kthreadd:
   ...

Looking at the patches in -34 through -36, my belief was that the [mm]-based patches were the culprit. I tried removing

- [mm] remove unnecessary lock from __vma_link (Andrea Arcangeli) [578134]

from a -36 build but this didn't fix anything.

As a reminder, this test is run with four NFS clients on 1Gb Ethernet to a DL580 server with 8 CPUs/16GB of RAM. Storage is HBA-attached: four MSA1000 arrays, each presenting 14 LUNs. All storage is direct-attached (no switch). The benchmark creates and works on 56 file systems. File systems tested include ext2/3/4, xfs, and gfs2. There are 128 nfsd threads. It's all V3 over TCP/IP.

I really need help with this one.

Thanks,
Barry

Version-Release number of selected component (if applicable):
see above

How reproducible:
Every time. In fact, from a test perspective, the third ext2 test run point (45 minutes into the benchmark) is where we now consistently see the errors.

Steps to Reproduce:
1. I run the workload on the BIGI testbed
  
Actual results:

A typical allocation error ...

Jul 20 14:31:45 bigi kernel: nfsd: page allocation failure. order:2, mode:0x20
Jul 20 14:31:55 bigi kernel: Pid: 8849, comm: nfsd Tainted: G        W  2.6.32-36.el6nommvmaulnk.x86_64 #1
Jul 20 14:31:55 bigi kernel: Call Trace:
Jul 20 14:31:55 bigi kernel: <IRQ>  [<ffffffff8111c2ff>] __alloc_pages_nodemask+0x65f/0x7e0
Jul 20 14:31:55 bigi kernel: [<ffffffff81152be2>] kmem_getpages+0x62/0x170
Jul 20 14:31:55 bigi kernel: [<ffffffff8115391a>] fallback_alloc+0x19a/0x240
Jul 20 14:31:55 bigi kernel: [<ffffffff81153731>] ? cache_grow+0x2d1/0x320
Jul 20 14:31:55 bigi kernel: [<ffffffff811531c9>] ____cache_alloc_node+0x99/0x160
Jul 20 14:31:55 bigi kernel: [<ffffffff8140738a>] ? __alloc_skb+0x7a/0x180
Jul 20 14:31:55 bigi kernel: [<ffffffff81153c6f>] kmem_cache_alloc_node_notrace+0x6f/0x140
Jul 20 14:31:55 bigi kernel: [<ffffffff81153ebb>] __kmalloc_node+0x7b/0x100
Jul 20 14:31:55 bigi kernel: [<ffffffff8140738a>] __alloc_skb+0x7a/0x180
Jul 20 14:31:55 bigi kernel: [<ffffffff81407746>] __netdev_alloc_skb+0x36/0x60
Jul 20 14:31:55 bigi kernel: [<ffffffffa0197332>] tg3_alloc_rx_skb+0xa2/0x240 [tg3]
Jul 20 14:31:55 bigi kernel: [<ffffffffa019b61a>] tg3_poll_work+0x8da/0xd60 [tg3]
Jul 20 14:31:55 bigi kernel: [<ffffffff8109e956>] ? tick_periodic+0x36/0x90
Jul 20 14:31:55 bigi kernel: [<ffffffffa019bb04>] tg3_poll+0x64/0x210 [tg3]
Jul 20 14:31:55 bigi kernel: [<ffffffff814145a3>] net_rx_action+0x103/0x210
Jul 20 14:31:55 bigi kernel: [<ffffffff810728f7>] __do_softirq+0xb7/0x1e0
Jul 20 14:31:55 bigi kernel: [<ffffffff8101430c>] call_softirq+0x1c/0x30
Jul 20 14:31:55 bigi kernel: [<ffffffff81015f25>] do_softirq+0x65/0xa0
Jul 20 14:31:55 bigi kernel: [<ffffffff810726f5>] irq_exit+0x85/0x90
Jul 20 14:31:55 bigi kernel: [<ffffffff814ddc55>] do_IRQ+0x75/0xf0
Jul 20 14:31:55 bigi kernel: [<ffffffff81013b13>] ret_from_intr+0x0/0x11
Jul 20 14:31:55 bigi kernel: <EOI>  [<ffffffff810dcea4>] ? __call_rcu+0xc4/0x160
Jul 20 14:31:55 bigi kernel: [<ffffffff810dcf75>] call_rcu_sched+0x15/0x20
Jul 20 14:31:55 bigi kernel: [<ffffffff810dcf8e>] call_rcu+0xe/0x10
Jul 20 14:31:55 bigi kernel: [<ffffffff8116b280>] __fput+0x180/0x210
Jul 20 14:31:55 bigi kernel: [<ffffffff8116b335>] fput+0x25/0x30
Jul 20 14:31:55 bigi kernel: [<ffffffffa032d97e>] nfsd_close+0xe/0x10 [nfsd]
Jul 20 14:31:55 bigi kernel: [<ffffffffa032f653>] nfsd_write+0xf3/0x100 [nfsd]
Jul 20 14:31:55 bigi kernel: [<ffffffffa03374ff>] nfsd3_proc_write+0xaf/0x140 [nfsd]
Jul 20 14:31:55 bigi kernel: [<ffffffffa03283ea>] nfsd_dispatch+0xba/0x250 [nfsd]
Jul 20 14:31:55 bigi kernel: [<ffffffffa027c9c4>] svc_process_common+0x344/0x610 [sunrpc]
Jul 20 14:31:55 bigi kernel: [<ffffffffa027cfd0>] svc_process+0x110/0x150 [sunrpc]
Jul 20 14:31:55 bigi kernel: [<ffffffffa0328ae6>] nfsd+0xd6/0x190 [nfsd]
Jul 20 14:31:55 bigi kernel: [<ffffffffa0328a10>] ? nfsd+0x0/0x190 [nfsd]
Jul 20 14:31:55 bigi kernel: [<ffffffff810904f6>] kthread+0x96/0xa0
Jul 20 14:31:55 bigi kernel: [<ffffffff8101420a>] child_rip+0xa/0x20
Jul 20 14:31:55 bigi kernel: [<ffffffff81090460>] ? kthread+0x0/0xa0
Jul 20 14:31:55 bigi kernel: [<ffffffff81014200>] ? child_rip+0x0/0x20
Jul 20 14:31:55 bigi kernel: Mem-Info:
Jul 20 14:31:55 bigi kernel: Node 0 DMA per-cpu:
Jul 20 14:31:55 bigi kernel: CPU    0: hi:    0, btch:   1 usd:   0
Jul 20 14:31:55 bigi kernel: CPU    1: hi:    0, btch:   1 usd:   0
Jul 20 14:31:55 bigi kernel: CPU    2: hi:    0, btch:   1 usd:   0
Jul 20 14:31:55 bigi kernel: CPU    3: hi:    0, btch:   1 usd:   0
Jul 20 14:31:55 bigi kernel: CPU    4: hi:    0, btch:   1 usd:   0
Jul 20 14:31:55 bigi kernel: CPU    5: hi:    0, btch:   1 usd:   0
Jul 20 14:31:55 bigi kernel: CPU    6: hi:    0, btch:   1 usd:   0
Jul 20 14:31:55 bigi kernel: CPU    7: hi:    0, btch:   1 usd:   0
Jul 20 14:31:55 bigi kernel: Node 0 DMA32 per-cpu:
Jul 20 14:31:55 bigi kernel: CPU    0: hi:  186, btch:  31 usd:   7
Jul 20 14:31:55 bigi kernel: CPU    1: hi:  186, btch:  31 usd: 162
Jul 20 14:31:55 bigi kernel: CPU    2: hi:  186, btch:  31 usd: 166
Jul 20 14:31:55 bigi kernel: CPU    3: hi:  186, btch:  31 usd: 160
Jul 20 14:31:55 bigi kernel: CPU    4: hi:  186, btch:  31 usd: 151
Jul 20 14:31:55 bigi kernel: CPU    5: hi:  186, btch:  31 usd: 175
Jul 20 14:31:55 bigi kernel: CPU    6: hi:  186, btch:  31 usd: 162
Jul 20 14:31:55 bigi kernel: CPU    7: hi:  186, btch:  31 usd:  62
Jul 20 14:31:55 bigi kernel: Node 0 Normal per-cpu:
Jul 20 14:31:55 bigi kernel: CPU    0: hi:  186, btch:  31 usd:  84
Jul 20 14:31:55 bigi kernel: CPU    1: hi:  186, btch:  31 usd: 152
Jul 20 14:31:55 bigi kernel: CPU    2: hi:  186, btch:  31 usd: 171
Jul 20 14:31:55 bigi kernel: CPU    3: hi:  186, btch:  31 usd: 111
Jul 20 14:31:55 bigi kernel: CPU    4: hi:  186, btch:  31 usd: 136
Jul 20 14:31:55 bigi kernel: CPU    5: hi:  186, btch:  31 usd: 150
Jul 20 14:31:55 bigi kernel: CPU    6: hi:  186, btch:  31 usd: 137
Jul 20 14:31:55 bigi kernel: CPU    7: hi:  186, btch:  31 usd: 108
Jul 20 14:31:55 bigi kernel: active_anon:2935 inactive_anon:1273 isolated_anon:0
Jul 20 14:31:55 bigi kernel: active_file:396310 inactive_file:2433629 isolated_file:160
Jul 20 14:31:55 bigi kernel: unevictable:0 dirty:29918 writeback:1 unstable:0
Jul 20 14:31:55 bigi kernel: free:36069 slab_reclaimable:831222 slab_unreclaimable:139432
Jul 20 14:31:55 bigi kernel: mapped:2411 shmem:116 pagetables:869 bounce:0
Jul 20 14:31:55 bigi kernel: Node 0 DMA free:15696kB min:64kB low:80kB high:96kB active_anon:0kB inactive_anon:0kB active_file:0kB inactive_file:0kB unevictable:0kB isolated(anon):0
kB isolated(file):0kB present:15308kB mlocked:0kB dirty:0kB writeback:0kB mapped:0kB shmem:0kB slab_reclaimable:0kB slab_unreclaimable:0kB kernel_stack:0kB pagetables:0kB unstable:0
kB bounce:0kB writeback_tmp:0kB pages_scanned:0 all_unreclaimable? yes
Jul 20 14:31:55 bigi kernel: lowmem_reserve[]: 0 3511 15631 15631
Jul 20 14:31:55 bigi kernel: Node 0 DMA32 free:65012kB min:15164kB low:18952kB high:22744kB active_anon:48kB inactive_anon:28kB active_file:326868kB inactive_file:2039604kB unevicta
ble:0kB isolated(anon):0kB isolated(file):256kB present:3595336kB mlocked:0kB dirty:25000kB writeback:0kB mapped:0kB shmem:0kB slab_reclaimable:719372kB slab_unreclaimable:60268kB k
ernel_stack:0kB pagetables:32kB unstable:0kB bounce:0kB writeback_tmp:0kB pages_scanned:0 all_unreclaimable? no
Jul 20 14:31:55 bigi kernel: lowmem_reserve[]: 0 0 12120 12120
Jul 20 14:31:55 bigi kernel: Node 0 Normal free:63568kB min:52352kB low:65440kB high:78528kB active_anon:11692kB inactive_anon:5064kB active_file:1258372kB inactive_file:7694912kB u
nevictable:0kB isolated(anon):0kB isolated(file):384kB present:12410880kB mlocked:0kB dirty:94672kB writeback:4kB mapped:9644kB shmem:464kB slab_reclaimable:2605516kB slab_unreclaim
able:497460kB kernel_stack:3224kB pagetables:3444kB unstable:0kB bounce:0kB writeback_tmp:0kB pages_scanned:34 all_unreclaimable? no
Jul 20 14:31:55 bigi kernel: lowmem_reserve[]: 0 0 0 0
Jul 20 14:31:55 bigi kernel: Node 0 DMA: 2*4kB 1*8kB 2*16kB 1*32kB 2*64kB 1*128kB 0*256kB 0*512kB 1*1024kB 1*2048kB 3*4096kB = 15696kB
Jul 20 14:31:55 bigi kernel: Node 0 DMA32: 15574*4kB 14*8kB 0*16kB 1*32kB 0*64kB 1*128kB 1*256kB 1*512kB 0*1024kB 1*2048kB 0*4096kB = 65384kB
Jul 20 14:31:55 bigi kernel: Node 0 Normal: 14806*4kB 0*8kB 0*16kB 0*32kB 0*64kB 0*128kB 0*256kB 0*512kB 0*1024kB 0*2048kB 1*4096kB = 63320kB
Jul 20 14:31:55 bigi kernel: 2830036 total pagecache pages
Jul 20 14:31:55 bigi kernel: 0 pages in swap cache
Jul 20 14:31:55 bigi kernel: Swap cache stats: add 0, delete 0, find 0/0
Jul 20 14:31:55 bigi kernel: Free swap  = 8441848kB
Jul 20 14:31:55 bigi kernel: Total swap = 8441848kB
Jul 20 14:31:55 bigi kernel: 4063231 pages RAM
Jul 20 14:31:55 bigi kernel: 109330 pages reserved
Jul 20 14:31:55 bigi kernel: 2826850 pages shared
Jul 20 14:31:55 bigi kernel: 1027941 pages non-shared


Expected results:


Additional info:

Comment 2 RHEL Program Management 2010-07-20 22:57:32 UTC
This issue has been proposed when we are only considering blocker
issues in the current Red Hat Enterprise Linux release.

** If you would still like this issue considered for the current
release, ask your support representative to file as a blocker on
your behalf. Otherwise ask that it be considered for the next
Red Hat Enterprise Linux release. **

Comment 3 Ric Wheeler 2010-07-20 23:32:33 UTC
We need to review this before pushing it out of 6.0; it sounds like it might be a regression.

Comment 4 Andrea Arcangeli 2010-07-21 10:48:13 UTC
Please try with "echo never >/sys/kernel/mm/redhat_transparent_hugepage/enabled"

on the latest RHEL 6 kernel (not an old one).

If you get more errors than on -33, it could mean that memory compaction is triggering more compound-page allocation failures, but that's hard to believe... so, more likely, disabling THP will fix it. Maybe enabling THP causes all hugepages to be used in userland, so they become harder to allocate.

Also, those are generally warnings; do you really get a userland failure?

Also, you said SLAB fixed it, but the above stack trace is with SLAB. So you still get the errors with SLAB?

Comment 5 Andrea Arcangeli 2010-07-21 10:51:13 UTC
Things like this:

page allocation failure. order:2, mode:0x20

are warnings and expected; there's no way to reliably allocate any page with GFP_ATOMIC, even less an order-2 page. I guess it's all expected if you only get errors like this.

If you can attach all the thousands of errors, I can check whether there are more relevant ones.

Maybe for tg3 with jumbo-size packets we should reserve some bigger pages for the GFP_ATOMIC path to make it more reliable.

Comment 6 Barry Marson 2010-07-21 13:38:48 UTC
THP was disabled ... i.e., [never]. In some cases, I've seen the system hang under this workload. I typically leave panic-on-oops set to 0 so I can best see what's going on, since console access is a challenge.

Will be attaching a messages log of what should be a full run, which means building the various filesystems: ext2/3/4, xfs. I held back on gfs2 due to this issue.

Barry

Comment 7 Barry Marson 2010-07-21 13:40:21 UTC
Created attachment 433419 [details]
messages file for a .32-37 run doing multiple filesystems.

Comment 8 Rik van Riel 2010-07-21 14:18:50 UTC
I can see one potential cause for this issue.

Before we introduced the memory compaction code, kswapd would do higher order reclaims if an allocation failed (see the "order" argument to wakeup_kswapd).

With memory compaction, we rely on the allocator to compact memory.  This obviously is not going to happen with atomic allocations.

I suspect we'll have to let kswapd, or some other kernel thread, do memory compaction in order to get rid of those allocation errors.  That way kswapd will get woken up once we reach the low threshold for the order of allocations that need to happen, and we can have more free areas of this order freed up by kswapd.

Comment 9 Andrea Arcangeli 2010-07-21 14:26:01 UTC
If THP is disabled then it's definitely a memory compaction issue.

I agree with Rik's theory in comment #8.

Comment 10 Larry Woodman 2010-07-21 14:38:33 UTC
2.6.32-34 did not change that:

* Tue Jun 15 2010 Aristeu Rozanski <arozansk> [2.6.32-34.el6]
- [net] Revert "[net] bridge: make bridge support netpoll" (Herbert Xu) [602927]
- [virt] always invalidate and flush on spte page size change (Andrea Arcangeli) [578134]
- [mm] root anon vma bugchecks (Andrea Arcangeli) [578134]
- [mm] resurrect the check in page_address_in_vma (Andrea Arcangeli) [578134]
- [mm] root anon vma use root (Andrea Arcangeli) [578134]
- [mm] avoid ksm hang (Andrea Arcangeli) [578134]
- [mm] always add new vmas at the end (Andrea Arcangeli) [578134]
- [mm] remove unnecessary lock from __vma_link (Andrea Arcangeli) [578134]
- [mm] optimize hugepage tracking for memcgroup & handle splitting (Rik van Riel) [597108]
- [mm] properly move a transparent hugepage between cgroups (Rik van Riel) [597081]
- [mm] scale statistics if the page is a transparent hugepage (Rik van Riel) [597077]
- [mm] enhance mem_cgroup_charge_statistics with a page_size argument (Rik van Riel) [597058]
- [virt] add option to disable spinlock patching on hypervisor (Gleb Natapov) [599068]
- [virt] xen: don't touch xsave in cr4 (Andrew Jones) [599069]
- [drm] Update core to current drm-linus (Adam Jackson) [589547 589792 597022]
- [mm] fix refcount bug in anon_vma code (Rik van Riel) [602739]

Comment 11 Barry Marson 2010-07-21 17:23:05 UTC
I just built a -36 kernel with all of the [mm] patches from -34 removed, and I still see the issue ... again, THP was disabled.

Barry

Comment 12 Rik van Riel 2010-07-21 22:35:25 UTC
Created attachment 433533 [details]
make kswapd call the memory compaction code

(I am only just now compiling a test RPM with this patch for my own systems, so the patch is currently untested).

Replacing lumpy reclaim with memory compaction seems to have
helped some of the worst cases encountered with large memory
allocations.

However, this did accidentally remove kswapd's helping out
with the defragmenting of memory.  This can cause memory
allocation failures for atomic and other non-waiting allocations,
which rely on kswapd to free and defragment memory for it.

This patch makes kswapd restrict itself to free memory up
to the PAGE_ALLOC_COSTLY_ORDER watermarks and has kswapd
call the memory compaction code if higher orders need to
be freed up.

This should help higher-order non waiting allocations, and
will hopefully also reduce the number of compaction stalls
encountered when running with transparent hugepages.

Signed-off-by: Rik van Riel <riel>
--- 
 include/linux/compaction.h |    6 ++++++
 mm/compaction.c            |    2 +-
 mm/vmscan.c                |   24 ++++++++++++++++++------
 3 files changed, 25 insertions(+), 7 deletions(-)

Comment 13 Rik van Riel 2010-07-22 03:14:41 UTC
I have been running tests on a kernel with this patch most of the evening now; it appears to be stable.

The brew task is here:

http://brewweb.devel.redhat.com/brew/taskinfo?taskID=2617241

Comment 14 Andrea Arcangeli 2010-07-22 09:19:56 UTC
Why not call memory compaction for order <= PAGE_ALLOC_COSTLY_ORDER?

Comment 15 Rik van Riel 2010-07-22 13:17:30 UTC
Because the compaction code refuses to run for order <= PAGE_ALLOC_COSTLY_ORDER :)

Comment 16 Andrea Arcangeli 2010-07-22 17:56:22 UTC
But an order-2 allocation failed, and PAGE_ALLOC_COSTLY_ORDER is 3, so this won't invoke compaction for order-2 allocations. So how is it supposed to help or make any difference for order-2 GFP_ATOMIC allocations? Just because some incidental order >3 allocation triggers in the background from something else? Those order >3 allocations aren't guaranteed in the background...

And shouldn't it be throttled with compaction_deferred?

Comment 17 Andrea Arcangeli 2010-07-22 18:02:24 UTC
Nevertheless, it'll be interesting to see if it helps... so I'm not against testing it.

Also, I don't think it's correct that compaction refuses to run for orders 1, 2, and 3; probably we should remove that limit.

Also, lumpy reclaim was a no-op for order-2 allocations, so it's hard to see how it's related. I refer to the failure at the top. I've yet to check the messages file, though; maybe it shows a bigger order.

The lockup may also be a console issue with kernel error flooding.

I think tg3 should use __GFP_NOWARN in its jumbo-packet allocation, and that is the real bug; it seems it's just that allocation failing with order 2. It must have a fallback. However, if it happens more frequently we should figure out why, but I'd replace the huge dump with a rate-limited warning in tg3.

Comment 18 Andrea Arcangeli 2010-07-23 01:56:01 UTC
There are no interesting changes from -32 to -36 (other than THP being turned on and off by default in different versions), but you were running with THP disabled according to comment 11. So if -32 or -33 ran the most stable, it makes no sense that the regression starts in -36...

I would suggest the regression may have started between -28 and -29, with -28 showing fewer errors and -29 showing more. Can you check?

Comment 19 Andrea Arcangeli 2010-07-23 02:25:35 UTC
Created attachment 433859 [details]
use compaction to create order >0 GFP_ATOMIC pages (should be more reliable than lumpy with prio < DEF_PRIORTY -2)

Comment 20 Andrea Arcangeli 2010-07-23 02:41:25 UTC
I created a test RPM with the patch in id=433859 (adjusted to apply cleanly):

http://brewweb.devel.redhat.com/brew/taskinfo?taskID=2619614

This also has the freezing fix for khugepaged/ksmd and the young-bit-heuristic removal from khugepaged.

Please keep testing after running:

"echo never >/sys/kernel/mm/redhat_transparent_hugepage/enabled"

to reduce the number of variables in the equation. Once the test passes, please try again with "echo always >/sys/kernel/mm/redhat_transparent_hugepage/enabled".

I still think tg3 should be more friendly, but hopefully with this we can make order-2 GFP_ATOMIC allocations more reliable than they have ever been.

Comment 21 Barry Marson 2010-07-23 15:19:54 UTC
Proceeding to test the kernel in comment #20. This test requires that kernels before -38 be built with a patch for NFS. The work recommended in comment #18 will be looked into if we don't make progress.

Barry

Comment 22 Barry Marson 2010-07-23 18:38:25 UTC
I tested the kernel in comment #20 and got the allocation errors a little later in the test run (at 10000 Ops/sec instead of at 6000 Ops/sec). Would you like me to post the messages log?

Barry

Comment 23 Andrea Arcangeli 2010-07-24 14:24:48 UTC
OK, I'm wondering whether the all_unreclaimable logic could get activated and inhibit compaction. I disabled all_unreclaimable for order > 0 allocations. I'm not too optimistic that all_unreclaimable was the problem, but you may try it again just in case...

http://brewweb.devel.redhat.com/brew/taskinfo?taskID=2623005

Comment 24 Andrea Arcangeli 2010-07-24 14:28:31 UTC
Created attachment 434149 [details]
same as 433859 but avoid setting all_unreclaimable for order > 0

Comment 25 Barry Marson 2010-07-25 02:34:25 UTC
I tried the kernel from comment #23 and still got some allocation failures. I will let it run and post a summary of the number and types of failures.

Barry

Comment 26 Barry Marson 2010-07-26 14:09:04 UTC
After just doing the ext2 run, there were 2.5K allocation failures ... this is roughly in line with the kernel from #20.

Barry

Comment 27 Andrea Arcangeli 2010-07-26 14:24:03 UTC
Hi Barry,

thanks a lot for testing.

What do you mean by "kernel from #20": comment #20, or kernel version number -20 (before memory compaction and the lumpy reclaim removal)?

Does this mean we're good with the last patch (comment #24) included in the build from comment #23?

Note: we cannot eliminate the failures. The large pages aren't reserved and pinned for order-2 allocations in the min-watermark zone, so as other allocations run they can get fragmented; when the GFP_ATOMIC order-2 allocation is invoked from the tg3 driver, it will fail once, then kswapd is activated and told to generate more order-2 pages for the next GFP_ATOMIC allocations.

I could try to remove the priority check; that might further speed up the generation of order-2 pages, further reducing the number of failures, but they cannot go away completely, not in RHEL 6 at least.

I think tg3 should be less verbose about it; it's normal for order-2 GFP_ATOMIC allocations to fail sometimes, and the warning should be rate limited.

Comment 28 Barry Marson 2010-07-26 14:59:05 UTC
Sorry, meant comment #20

I think there was confusion about the build referenced in comment #23. If I was supposed to take that and patch it with the comment #24 patch, then I haven't done the correct testing. I'll grab the two bits, build it, and test it now.

Well, hopefully we can get rid of virtually all of them, as I haven't seen them in past RHELs, or even in RHEL 6 when we moved to SLAB around -19.

I suppose I could also run this without jumbo frames and see what happens. I'll build the kernel and try that first.

Barry

Comment 29 Barry Marson 2010-07-26 15:56:53 UTC
So the build from comment #23 does have the patch from comment #24, so this has been tested. Just for grins, I'm going to try removing jumbo frames.

Barry

Comment 30 Andrea Arcangeli 2010-07-26 16:12:40 UTC
If you remove jumbo frames, the problem should go away...

I assume the kernels around -19 also used jumbo frames?!? I think some failures are definitely to be expected with tg3 using jumbo frames allocated with GFP_ATOMIC at order 2; that is absolutely unavoidable. The problem is how many is normal, and by making kswapd more aggressive we can reduce the number of jumbo-frame atomic allocation failures.

But it's hard to imagine you really got _none_ in old kernels if they were using jumbo frames too! I'd suggest trying -19 again to make sure jumbo frames were enabled and that you got zero failures with SLAB.

Comment 31 Barry Marson 2010-07-26 16:35:21 UTC
I've always used jumbo frames ... and in the tests around -19 with SLAB, there were no failures in any of the file systems. I'd like to verify that disabling jumbo frames makes it go away.

Barry

Comment 32 Barry Marson 2010-07-28 12:43:37 UTC
OK ... so I ran all sorts of things over the last few days. Using the -52.el6transhuge kernel, failures occurred as expected ... Then I took jumbo frames out of the picture, and this kernel worked for ext2.

I then proceeded to see if I could induce fewer errors by changing the MTU from 9000 down to 8000. I got the same errors.

I got bold and tried reducing the MTU to under a page (i.e., 4000), and once again I got the same errors. I thought this would have done better; actually, I thought it would have made the problem disappear as completely as the default MTU of 1500 does.

Late yesterday I reinstalled .32-19 and ran it with jumbo frames. Something has been lurking for a while with this workload where we panic and don't have access to the console. One month it works, then the setup fails (and it's remote, in RDU). I disabled panic-on-oops in an attempt to get an early culprit into the messages log. In this case we got nothing but an iLO snapshot of half a panic message :(

But the point of the run was that 19 out of 22 points were run and there wasn't a single allocation error. I'm re-queueing it again ...

Barry

Comment 33 Barry Marson 2010-07-29 13:51:04 UTC
So the rerun was successful (not a single allocation failure) with the -19 kernel and jumbo frames (MTU=9000) for both ext2 and xfs. I picked these two since all filesystems have been giving me trouble on more recent kernels, but the best fairly recent kernel was -33, where the problem occurred much less often and only with xfs.

Barry

Comment 34 Andrea Arcangeli 2010-07-30 13:49:30 UTC
OK, I'll try to make kswapd a lot more aggressive about calling memory compaction, and then we'll see.

Comment 35 Andrea Arcangeli 2010-07-30 14:56:40 UTC
Created attachment 435576 [details]
same as 434149 but run compaction immediately if kswapd notices fragmentation

Comment 36 Andrea Arcangeli 2010-07-30 14:58:30 UTC
Going to build a kernel with attachment 435576 and all patches from bug 614427.

Comment 37 Andrea Arcangeli 2010-07-30 15:36:00 UTC
This -55 build includes:

https://bugzilla.redhat.com/attachment.cgi?id=435580
https://bugzilla.redhat.com/attachment.cgi?id=432044
https://bugzilla.redhat.com/attachment.cgi?id=435340
https://bugzilla.redhat.com/attachment.cgi?id=432045

from bug 614427.

https://bugzilla.redhat.com/attachment.cgi?id=435576

from bug 616600

removal of young bit check from khugepaged bug 615381


http://brewweb.devel.redhat.com/brew/taskinfo?taskID=2641692

Please test; I hope this makes some difference. I think reintroducing lumpy reclaim for all high-order allocations is very risky with THP enabled by default (at most we could restrict it to kswapd, considering that THP async allocations are only given to khugepaged, throttled with alloc_sleep_millisecs, and never to kswapd, thanks to __GFP_NO_KSWAPD being part of GFP_TRANSHUGE).

But frankly, I hope this is going to work OK and that we can solve it with compaction. People on lkml are hacking on lumpy reclaim because they also noticed it made the system hang, and they're not even trying to run order-9 allocations in a flood like THP does all the time... lumpy going blind on referenced bits is a responsiveness killer and, in turn, a hazardous logic that we had better avoid if we can.

Comment 38 Barry Marson 2010-07-30 15:49:27 UTC
For this workload, having THP on makes no sense IMO: millions of small files, average size 8KB.

Will run this new kernel with THP off as always. If it works, I'll try it with THP on for reference.

Barry

Comment 39 Andrea Arcangeli 2010-07-30 15:50:21 UTC
one patch of the above list slipped, please use this build instead:

http://brewweb.devel.redhat.com/brew/taskinfo?taskID=2641800

Comment 40 Andrea Arcangeli 2010-07-30 15:54:46 UTC
I think turning THP on could invoke compaction more frequently and may actually risk _hiding_ the problem. So yes, it's a good idea to test with THP off, to avoid hiding the problem compared to the reference kernel .32-19.

But I think THP should be on even for this workload; if your workload won't take advantage of it, probably no hugepages will be allocated at all. In short, the only thing you risk wasting is a little memory in userland; the memory taken by the millions of little slabs created by the little files won't change at all regardless of whether THP is on or off.

Comment 41 Barry Marson 2010-08-02 14:22:00 UTC
Using the kernel from the brew build referenced in comment #39, with MTU=9000 and THP off, I got nearly 3800 allocation-failure messages ... Here's their frequency:

   3438 nfsd: page allocation failure. order:2, mode:0x20
    102 rsyslogd: page allocation failure. order:2, mode:0x20
     57 kswapd0: page allocation failure. order:2, mode:0x20
     48 ksoftirqd/2: page allocation failure. order:2, mode:0x20
     31 ksoftirqd/1: page allocation failure. order:2, mode:0x20
     18 automount: page allocation failure. order:2, mode:0x20
     17 irqbalance: page allocation failure. order:2, mode:0x20
      9 flush-8:240: page allocation failure. order:2, mode:0x20
      6 scsi_eh_1: page allocation failure. order:2, mode:0x20
      3 flush-8:176: page allocation failure. order:2, mode:0x20
      3 events/1: page allocation failure. order:2, mode:0x20
      2 flush-8:80: page allocation failure. order:2, mode:0x20
      2 flush-66:224: page allocation failure. order:2, mode:0x20
      2 flush-66:208: page allocation failure. order:2, mode:0x20
      2 ata/1: page allocation failure. order:2, mode:0x20
      1 swapper: page allocation failure. order:2, mode:0x20
      1 master: page allocation failure. order:2, mode:0x20
      1 hald-addon-stor: page allocation failure. order:2, mode:0x20
      1 flush-8:16: page allocation failure. order:2, mode:0x20
      1 flush-67:112: page allocation failure. order:2, mode:0x20
      1 flush-66:80: page allocation failure. order:2, mode:0x20
      1 flush-65:224: page allocation failure. order:2, mode:0x20
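For reference, a per-task frequency table like the one above can be built from the log with standard tools; the message format is taken from the output shown here, and the inlined sample is purely illustrative:

```shell
# Sample log lines (in practice, feed in /var/log/messages or dmesg output).
log='nfsd: page allocation failure. order:2, mode:0x20
nfsd: page allocation failure. order:2, mode:0x20
kswapd0: page allocation failure. order:2, mode:0x20'

# Count failures per task: the task name is everything before the first ':'.
histogram=$(printf '%s\n' "$log" | grep 'page allocation failure' \
            | cut -d: -f1 | sort | uniq -c | sort -rn)
echo "$histogram"
```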

I'll try THP on as an experiment.

My question is: on the prior tests with the -52-based kernel, why do we get allocation errors with MTU=4000 but not with 1500? I expected anything comfortably under a page size not to fail.

Barry

Comment 42 Andrea Arcangeli 2010-08-02 17:18:09 UTC
Why 4000 is not enough must be a networking detail; likely there are more than 96 bytes of header space.

So the allocation error count goes down compared to the 11000. Reading the compaction code, I think I need to tweak it: it's not made for GFP_ATOMIC and is heavily tuned for direct reclaim (low watermark instead of high), and it stops immediately once there's a single free page of the right order (again ideal for direct reclaim, but broken for preparing the high watermark for GFP_ATOMIC). I'll fix this and we'll try again.
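As a back-of-the-envelope illustration of why MTU=4000 already spills past one page: the driver's receive buffer is roughly MTU plus per-packet overhead (link headers, skb_shared_info, alignment), rounded up to a power-of-two number of pages. The 512-byte overhead figure below is an assumption for illustration, not the exact tg3 value:

```shell
# Rough allocation order needed for a linear receive buffer of a given MTU.
# The 512-byte overhead is an assumed stand-in for headers + skb_shared_info.
order_for_mtu() {
    local overhead=512
    local bytes=$(( $1 + overhead ))
    local pages=$(( (bytes + 4095) / 4096 ))   # 4 KiB pages, rounded up
    local order=0 n=1
    while [ "$n" -lt "$pages" ]; do n=$((n * 2)); order=$((order + 1)); done
    echo "$order"
}

echo "MTU 1500 -> order $(order_for_mtu 1500)"   # fits in a single page
echo "MTU 4000 -> order $(order_for_mtu 4000)"   # overhead pushes past 4096
echo "MTU 9000 -> order $(order_for_mtu 9000)"
```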

Comment 43 Andrea Arcangeli 2010-08-02 18:01:27 UTC
Here's a new build to try; this time compaction will not stop when there is one free page of the right order in the freelist, and it will keep going until we reach the high watermark.

http://brewweb.devel.redhat.com/brew/taskinfo?taskID=2645962

Comment 44 Andrea Arcangeli 2010-08-02 18:02:20 UTC
Created attachment 436082 [details]
create GFP_ATOMIC high order pages with compaction

Comment 45 Barry Marson 2010-08-02 21:50:46 UTC
I tried THP on with the kernel referenced in comment #39 and still hit the issue. The error count is actually about the same; the 11K referenced earlier was the sum across multiple file system types.

Chatting with gospo, I found out that the tg3 network cards (half of what I have) always allocate at more than order 2 when MTU > 1500 (sigh). The e1000s do it totally differently (they have some sort of DMA scheme). I might experiment with reduced client counts to see whether I can pin this on one NIC driver's behavior versus the other.

But first I'll try the kernel in comment #45.

Is there a way to get those kernel bits without using the GUI? I'm still remote.

Barry

Comment 46 Andrea Arcangeli 2010-08-02 22:06:16 UTC
wget --no-check-certificate "https://brewweb.devel.redhat.com/getfile?taskID=2645965&name=kernel-2.6.32-56.el6transhuge.x86_64.rpm"

Comment 47 Barry Marson 2010-08-03 14:56:55 UTC
The kernel from comment #45 fails the same way. A sorted histogram for the ext2 run follows:

   3427 nfsd: page allocation failure. order:2, mode:0x20
    108 rsyslogd: page allocation failure. order:2, mode:0x20
     65 irqbalance: page allocation failure. order:2, mode:0x20
     44 ksoftirqd/2: page allocation failure. order:2, mode:0x20
     41 ksoftirqd/1: page allocation failure. order:2, mode:0x20
     31 kswapd0: page allocation failure. order:2, mode:0x20
     15 automount: page allocation failure. order:2, mode:0x20
      6 events/1: page allocation failure. order:2, mode:0x20
      5 flush-65:160: page allocation failure. order:2, mode:0x20
      3 migration/1: page allocation failure. order:2, mode:0x20
      3 flush-66:224: page allocation failure. order:2, mode:0x20
      2 swapper: page allocation failure. order:2, mode:0x20
      2 jbd2/dm-0-8: page allocation failure. order:2, mode:0x20
      2 flush-67:112: page allocation failure. order:2, mode:0x20
      2 flush-66:64: page allocation failure. order:2, mode:0x20
      2 events/2: page allocation failure. order:2, mode:0x20
      1 flush-8:144: page allocation failure. order:2, mode:0x20
      1 flush-67:80: page allocation failure. order:2, mode:0x20
      1 flush-66:80: page allocation failure. order:2, mode:0x20
      1 flush-65:208: page allocation failure. order:2, mode:0x20

Barry

Comment 48 Barry Marson 2010-08-03 14:58:13 UTC
Make that comment #43, i.e.:

http://brewweb.devel.redhat.com/brew/taskinfo?taskID=2645962

Barry

Comment 49 Andrea Arcangeli 2010-08-05 13:18:58 UTC
OK, I will have to write a module to simulate this GFP_ATOMIC order-2 allocation... and hopefully see what is going on.

So the workload works fine, you just get flooded by these warnings? No more lockups? Or do they still occur?

Comment 50 Barry Marson 2010-08-05 14:31:33 UTC
I did some experimenting over the last couple of days and here's what was observed.

Running the benchmark using half the clients, exclusively connected to the tg3 NICs, shows the problem.

Running the benchmark using half the clients, exclusively connected to the e1000 NICs, shows NO problem at all.

This is definitely a sensitivity to the way the tg3 driver allocates.

One could argue the workload still functions, but the synchronous allocation failures affect performance on the server.

Every once in a while the system will lock up, but that hasn't happened for a week.

Barry

Comment 51 Neil Horman 2010-08-10 14:56:16 UTC
I wonder if this is related to commit d2757fc4076118e13180e91f02c3c52659be3d9d.  I've got a test build here:
http://brewweb.devel.redhat.com/brew/taskinfo?taskID=2671142

Barry, could you please test it out when it's complete?  Thanks!

Comment 53 Rik van Riel 2010-08-16 19:32:08 UTC
We may have a fix for this issue in our kernel tree.  Barry, would you have time to test the latest RHEL6 kernel?

This got merged in kernel 2.6.32-60.el6.

commit 58ee95276aa7ae00d289823d3471c0bd4c0ffb1b
Author: Andrea Arcangeli <aarcange>
Date:   Fri Jul 30 16:18:20 2010 -0400

    [mm] Correctly assign the number of MIGRATE_RESERVE pageblocks
    
    Message-id: <20100730161820.GH16655>
    Patchwork-id: 27259
    O-Subject: [RHEL6 PATCH] Correctly assign the number of MIGRATE_RESERVE
        pageblocks
    Bugzilla: 614427
    RH-Acked-by: Rik van Riel <riel>
    RH-Acked-by: Larry Woodman <lwoodman>
    RH-Acked-by: Marcelo Tosatti <mtosatti>
    
    https://bugzilla.redhat.com/show_bug.cgi?id=614427
    https://bugzilla.redhat.com/attachment.cgi?id=435340
    ========
    Subject: mm: page-allocator: Correctly assign the number of MIGRATE_RESERVE pageblocks
    
    From: Mel Gorman <mel.ie>
    
    The page allocator marks a maximum of 2 pageblocks per zone
    MIGRATE_RESERVE. This is to implement a "wilderness preservation heuristic"
    for high-order atomic allocations. Ordinarily, min free kbytes is low and
    increasing it helps fragmentation control. hugeadm can set a recommended
    min_free_kbytes value that should be run from an init script.
    
    Rather than using an init script, the RHEL kernel sets the recommended
    min_free_kbytes early in boot. The problem is that
    when selecting pageblocks to convert to MIGRATE_RESERVE, MIGRATE_MOVABLE
    blocks are preferred of which none or very few exist early in boot.
    
    The end result is that the wrong number of MIGRATE_RESERVE blocks are set
    and high-order atomic allocation success rates suffer. While guarantees
    are never made for high-order atomic allocations, correct setting of
    MIGRATE_RESERVE pageblocks is vital for them to succeed at all. This
    patch corrects the problem by taking a second pass at setting
    MIGRATE_RESERVE if enough MIGRATE_MOVABLE blocks do not exist.
    
    Signed-off-by: Mel Gorman <mel.ie>
    Signed-off-by: Andrea Arcangeli <aarcange>
    Signed-off-by: Aristeu Rozanski <arozansk>
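For reference, the recommended min_free_kbytes the commit message mentions can be inspected and applied from userspace; the hugeadm option comes from libhugetlbfs, and this is a sketch of the init-script approach the message alludes to:

```shell
# Current amount of memory (in kB) the kernel tries to keep free,
# which is what protects high-order atomic allocations.
cat /proc/sys/vm/min_free_kbytes

# libhugetlbfs' hugeadm can compute and apply a recommended value;
# upstream this would typically run from an init script.
hugeadm --set-recommended-min_free_kbytes
```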

Comment 54 Barry Marson 2010-08-16 20:42:48 UTC
I've had to move forward on this issue by migrating to all Intel-based e1000 NICs. I'm presently testing that configuration so I can at least make progress. Testing is going to take a couple of days.

The Broadcoms are still installed but not connected.  Since the machine is remote to me, I would have to get HP's help with re-cabling.

Barry

Comment 55 Larry Woodman 2010-08-16 20:47:36 UTC
Barry, didn't you tell me this only happens when running over tg3?

Larry

Comment 56 Barry Marson 2010-08-16 20:54:54 UTC
Yes, this is definitely an interaction issue between the tg3 driver and this workload under heavy memory pressure. Before some very recent changes there were 4 clients: 2 talked to tg3 and 2 to e1000. Running the test using only the e1000-served clients worked fine; the tg3-served clients produced all the allocation messages.

Barry

Comment 57 Linda Wang 2010-08-16 22:11:19 UTC

*** This bug has been marked as a duplicate of bug 614427 ***

Comment 58 Linda Wang 2010-08-16 22:59:12 UTC
Reopening to have Barry verify whether the patch in bug 614427 fixes the issue he was seeing.

Comment 60 Barry Marson 2010-08-20 18:46:36 UTC
So I got my configuration back to where I can use the tg3 NICs. I tested the -64 kernel with just the ext2 file system and jumbo frames. The problem still happens.

It's not clear from the bz referenced in comment #58 whether a much older kernel (-57) was what should have been tested. If so, I've obviously tested something even newer.

One suggestion from gospo was to try raising the ring parameters. This is what I see for the two NICs with ethtool -g:

e1000:

Ring parameters for eth2:
Pre-set maximums:
RX:		4096
RX Mini:	0
RX Jumbo:	0
TX:		4096
Current hardware settings:
RX:		256
RX Mini:	0
RX Jumbo:	0
TX:		256

tg3:

Ring parameters for eth3:
Pre-set maximums:
RX:		511
RX Mini:	0
RX Jumbo:	255
TX:		511
Current hardware settings:
RX:		200
RX Mini:	0
RX Jumbo:	100
TX:		511

Can someone give me the syntax to set the RX and RX Jumbo rings higher?

Thanks,
Barry
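For what it's worth, ring sizes are set with ethtool's -G (--set-ring) option, bounded by the pre-set maximums reported above; a sketch against the tg3 interface shown:

```shell
# Raise the tg3 RX rings toward their reported maximums (511 / 255).
ethtool -G eth3 rx 511 rx-jumbo 255

# Confirm the new current hardware settings.
ethtool -g eth3
```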

Comment 61 Neil Horman 2010-08-23 15:54:30 UTC
Relating to comment 51: my build broke; here's a new Brew build:
http://brewweb.devel.redhat.com/brew/taskinfo?taskID=2700455

Comment 62 Barry Marson 2010-08-23 20:35:13 UTC
I tried the kernel from comment #61 and it fails in the same way.

Going to experiment with the ring buffer sizes next.

Barry

Comment 65 Fujitsu kernel engineers 2010-09-21 07:11:59 UTC
Hi

I've faced a similar problem, because RHEL5 and RHEL6 have different GFP_ATOMIC definitions:

RHEL5
 #define GFP_ATOMIC      (__GFP_HIGH | __GFP_NOWARN)

RHEL6
 #define GFP_ATOMIC      (__GFP_HIGH)
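Incidentally, the "mode:0x20" in the messages earlier in this bug is exactly __GFP_HIGH, i.e. RHEL6's GFP_ATOMIC without __GFP_NOWARN. A quick sketch using the 2.6.32-era flag values:

```shell
# gfp flag bits as defined in 2.6.32-era include/linux/gfp.h.
GFP_HIGH=$((0x20))     # __GFP_HIGH
GFP_NOWARN=$((0x200))  # __GFP_NOWARN: suppresses the failure warning

printf 'RHEL6 GFP_ATOMIC = 0x%x\n' "$GFP_HIGH"
printf 'RHEL5 GFP_ATOMIC = 0x%x\n' $((GFP_HIGH | GFP_NOWARN))
```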


That is, network stress workloads often cause allocation failures, but RHEL5 doesn't display them while RHEL6 does.

Can I ask why it was changed? I bet a lot of customers will think this message means a regression has occurred.



 - KOSAKI Motohiro

Comment 66 Rik van Riel 2010-09-21 14:54:14 UTC
Ohhh, thank you for finding that!

It appears that RHEL5 has a patch (linux-2.6-vm-silence-atomic-alloc-failures.patch) to silence failures from atomic allocations.  This patch appears to have been in Fedora since 2.6.12 in Fedora Core 5.

I am not sure why it never made it upstream, but I'll submit it right now.

Comment 67 Rik van Riel 2010-09-21 17:03:08 UTC
Upstream discussion: http://lkml.org/lkml/2010/9/21/204

Comment 68 Barry Marson 2010-09-22 00:52:00 UTC
This is very interesting.  I find it even more interesting that these failures don't happen with the Intel e1000, just the Broadcom tg3. I wonder what Intel does so differently.


Barry

Comment 69 Fujitsu kernel engineers 2010-09-22 06:14:03 UTC
Hi

I guess e1000 doesn't have a jumbo frame feature (dunno, please ask a network expert), while tg3 does. Jumbo frames need an order-2 allocation, and that can fail very frequently.

I guess more recent Intel network cards have a similar issue.


 - kosaki

Comment 73 Rik van Riel 2011-03-02 18:27:24 UTC

*** This bug has been marked as a duplicate of bug 674147 ***

