Bug 710528

Summary: use numa_zonelist_order=N by default
Product: Red Hat Enterprise Linux 6 Reporter: Luming Yu <luyu>
Component: kernel Assignee: Larry Woodman <lwoodman>
Status: CLOSED NOTABUG QA Contact: Red Hat Kernel QE team <kernel-qe>
Severity: unspecified Docs Contact:
Priority: unspecified    
Version: 6.1 CC: bmarson, jburke, peterm
Target Milestone: rc   
Target Release: ---   
Hardware: Unspecified   
OS: Unspecified   
Whiteboard:
Fixed In Version: Doc Type: Bug Fix
Doc Text:
Story Points: ---
Clone Of: Environment:
Last Closed: 2011-07-25 13:37:28 UTC Type: ---
Regression: --- Mount Type: ---
Documentation: --- CRM:
Verified Versions: Category: ---
oVirt Team: --- RHEL 7.3 requirements from Atomic Host:
Cloudforms Team: --- Target Upstream Version:

Description Luming Yu 2011-06-03 15:49:05 UTC
Description of problem:
The RHEL 6.1 allocator tries to preserve free pages in the DMA32 memory zone, which exists only on the numa0 node. This is a 3 GB region (accessible by 32-bit PCIe devices), and on the SNB system showing the discussed issue it has 681628 free pages = 2660 MB. These pages stay untouched until numa1 is exhausted.
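
As a quick check of the page count quoted above, the conversion can be sketched as follows. The 4 KiB page size is an assumption (the usual base page size on x86_64) and is not stated in the report; `pages_to_mib` is a hypothetical helper name.

```python
# Assumption: 4 KiB base pages, the usual size on x86_64 (not stated in the report).
PAGE_SIZE_KIB = 4


def pages_to_mib(pages, page_size_kib=PAGE_SIZE_KIB):
    """Convert a page count into MiB."""
    return pages * page_size_kib / 1024


free_dma32_pages = 681628  # free DMA32 pages quoted in the description
print(f"{pages_to_mib(free_dma32_pages):.1f} MiB")  # prints "2662.6 MiB"
```

The result is consistent with the roughly 2660 MB figure given in the description.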

By default, we have seen the HPCC’s Star Stream subtest show significant performance variations on SNB nodes. With the boot parameter numa_zonelist_order=N added, the problem disappears.

Comment 2 Larry Woodman 2011-06-03 16:38:40 UTC
This was added by design to prevent the system from incurring OOM kills and/or DMA allocation failures when the DMA32 zone becomes exhausted and the memory cannot be reclaimed.  In order to prevent this from happening, build_zonelists creates zonelists for all nodes that place the DMA32 zone from node0 after all Normal zones by default.  This means the system will use all Normal zone memory before using any DMA32 zone memory from node0.  While this is generally the desired behaviour, it will allocate memory on a remote node rather than on node0 even if it would all fit on node0.
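
The difference between the two orderings can be illustrated with a simplified model. This is only a sketch, not the kernel's build_zonelists code: the two-node layout, the zone names per node, and the `build_zonelist` helper are assumptions made for illustration, and node distance is reduced to "local node first".

```python
# Zones in fallback priority, highest zone first (simplified model).
ZONE_PRIORITY = ["Normal", "DMA32", "DMA"]

# Assumed layout: only node 0 has DMA and DMA32 zones, as described in this bug.
NODE_ZONES = {0: {"DMA", "DMA32", "Normal"}, 1: {"Normal"}}


def build_zonelist(local_node, node_zones, order):
    """Return the fallback list of (node, zone) pairs for allocations on local_node."""
    # Simplification: local node first, then remaining nodes by id.
    nodes = [local_node] + sorted(n for n in node_zones if n != local_node)
    if order == "node":
        # Node order: exhaust every zone of the nearest node before moving on.
        return [(n, z) for n in nodes for z in ZONE_PRIORITY if z in node_zones[n]]
    if order == "zone":
        # Zone order: exhaust the highest zone on all nodes before lower zones.
        return [(n, z) for z in ZONE_PRIORITY for n in nodes if z in node_zones[n]]
    raise ValueError(f"unknown order: {order!r}")


print(build_zonelist(0, NODE_ZONES, "zone"))
print(build_zonelist(0, NODE_ZONES, "node"))
```

In this model, zone order places node 1's Normal zone ahead of node 0's DMA32 zone, which matches the remote allocations described above, while node order keeps allocations on node 0 until all of its zones are used.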

However, this default behaviour can be changed to force the system to use the DMA32 zone before falling over to another node by adding numa_zonelist_order=N to the boot cmdline in /boot/grub/grub.conf.  Rather than changing anything here, I think we should create a release note that describes this scenario and tells when to use the numa_zonelist_order=N boot parameter.
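
To confirm whether a running system was actually booted with the parameter, the kernel command line can be inspected. The following is a minimal sketch: the helper names are hypothetical, and reading /proc/cmdline assumes a Linux system.

```python
def zonelist_order_from_cmdline(cmdline):
    """Return the numa_zonelist_order value from a kernel command line, or None."""
    for token in cmdline.split():
        key, sep, value = token.partition("=")
        if key == "numa_zonelist_order" and sep:
            return value
    return None


def running_zonelist_order(path="/proc/cmdline"):
    """Read the running kernel's command line (Linux only) and extract the value."""
    with open(path) as f:
        return zonelist_order_from_cmdline(f.read())


# Example with a made-up command line string:
example = "ro root=/dev/sda1 rhgb quiet numa_zonelist_order=N"
print(zonelist_order_from_cmdline(example))  # prints "N"
```

A result of None means the parameter was not given on the command line, so the default ordering applies.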

Larry Woodman

Comment 3 Larry Woodman 2011-07-25 13:37:28 UTC
This works as designed in the upstream kernel.  By default the system will exhaust all Normal zone memory before attempting to allocate and use DMA32 zone memory.  If this is not the desired effect, the system should be booted with "numa_zonelist_order=N" on the boot cmdline.  We cannot change this default behavior because the system would then be prone to OOM kills.  If necessary we can write a release note or kbase article further describing the details.

Larry Woodman