Bug 178977

Summary:	NUMA hash table lookup fails on dual opteron 252 system w/16GB of RAM
Product:	Red Hat Enterprise Linux 4	Reporter:	Jarod Wilson <jarodwilson>
Component:	kernel	Assignee:	Peter Martuccelli <peterm>
Status:	CLOSED INSUFFICIENT_DATA	QA Contact:	Brian Brock <bbrock>
Severity:	medium	Docs Contact:
Priority:	medium
Version:	4.0	CC:	jarod, jbaron
Target Milestone:	---
Target Release:	---
Hardware:	x86_64
OS:	Linux
Whiteboard:
Fixed In Version:		Doc Type:	Bug Fix
Doc Text:		Story Points:	---
Clone Of:		Environment:
Last Closed:	2007-06-15 21:11:47 UTC	Type:	---

Description Jarod Wilson 2006-01-25 21:48:27 UTC

From Bugzilla Helper:
User-Agent: Mozilla/5.0 (X11; U; Linux i686; en-US; rv:1.8) Gecko/20051130 Fedora/1.5-1.jw Firefox/1.5

Description of problem:
I'm responsible for a number of Opteron clusters at work. Some of the older ones have compute nodes with dual Opteron 248 processors and either 16 or 8 GB of RAM; the newer clusters have dual Opteron 252 processors and either 16 or 8 GB of RAM. NUMA hash table lookups (and thus memory controller setup/assignment for numactl) work fine on both the 16 and 8 GB Opteron 248 nodes, as well as on the 8 GB Opteron 252 nodes, but fail on the 16 GB Opteron 252 nodes.

At system startup, we see the following:

<6>BIOS-provided physical RAM map:
<4> BIOS-e820: 0000000000000000 - 000000000009a800 (usable)
<4> BIOS-e820: 000000000009a800 - 00000000000a0000 (reserved)
<4> BIOS-e820: 00000000000cc000 - 0000000000100000 (reserved)
<4> BIOS-e820: 0000000000100000 - 00000000fbf7c000 (usable)
<4> BIOS-e820: 00000000fbf7c000 - 00000000fbf80000 (ACPI NVS)
<4> BIOS-e820: 00000000fbf80000 - 00000000fc000000 (reserved)
<4> BIOS-e820: 00000000fec00000 - 00000000fec00400 (reserved)
<4> BIOS-e820: 00000000fee00000 - 00000000fee01000 (reserved)
<4> BIOS-e820: 00000000fff80000 - 0000000100000000 (reserved)
<4> BIOS-e820: 0000000100000000 - 0000000404000000 (usable)
<6>Scanning NUMA topology in Northbridge 24
<6>Number of nodes 2 (10010)
<6>Node 0 MemBase 0000000000000000 Limit 0000000203ffffff
<6>Node 1 MemBase 0000000204000000 Limit 0000000403ffffff
<6>node 1 shift 24 addr 204000000 conflict 0
<6>node 1 shift 25 addr 204000000 conflict 0
<6>node 1 shift 26 addr 3fc000000 conflict 0
<6>node 1 shift 27 addr 204000000 conflict 0
<6>node 1 shift 28 addr 204000000 conflict 0
<6>node 1 shift 29 addr 204000000 conflict 0
<6>node 1 shift 30 addr 204000000 conflict 0
<6>node 1 shift 31 addr 204000000 conflict 0
<6>node 1 shift 32 addr 204000000 conflict 0
<6>node 1 shift 33 addr 204000000 conflict 0
<6>node 1 shift 34 addr 204000000 conflict 0
<6>node 1 shift 35 addr 204000000 conflict 0
<6>node 1 shift 36 addr 204000000 conflict 0
<6>node 1 shift 37 addr 204000000 conflict 0
<6>node 1 shift 38 addr 204000000 conflict 0
<6>node 1 shift 39 addr 204000000 conflict 0
<6>node 1 shift 40 addr 204000000 conflict 0
<6>node 1 shift 41 addr 204000000 conflict 0
<6>node 1 shift 42 addr 204000000 conflict 0
<6>node 1 shift 43 addr 204000000 conflict 0
<6>node 1 shift 44 addr 204000000 conflict 0
<6>node 1 shift 45 addr 204000000 conflict 0
<6>node 1 shift 46 addr 204000000 conflict 0
<6>node 1 shift 47 addr 204000000 conflict 0
<3>No NUMA node hash function found. Contact maintainer
<6>No NUMA configuration found
<6>Faking a node at 0000000000000000-0000000404000000
<4>Bootmem setup node 0 0000000000000000-0000000404000000
<6>No mptable found.
<4>On node 0 totalpages: 4210688
<4> DMA zone: 4096 pages, LIFO batch:1
<4> Normal zone: 4206592 pages, LIFO batch:31
<4> HighMem zone: 0 pages, LIFO batch:1

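The totalpages figure in this log is consistent with the faked node's range divided into 4 KiB pages (the base page size on x86_64). A minimal sketch of that arithmetic follows; `pages_in_range` and `PAGE_SHIFT_4K` are illustrative names, not kernel symbols.

```c
#include <assert.h>
#include <stdint.h>

/* Assumption: 4 KiB base pages, i.e. a page shift of 12. */
#define PAGE_SHIFT_4K 12

/* Number of whole 4 KiB pages in the physical range [start, end). */
uint64_t pages_in_range(uint64_t start, uint64_t end)
{
    assert(end >= start);
    return (end - start) >> PAGE_SHIFT_4K;
}
```

For the range 0x0-0x404000000 this gives 4210688 pages, matching the log; subtracting the 4096-page DMA zone leaves the 4206592 pages reported for the Normal zone.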

Version-Release number of selected component (if applicable):
kernel-smp-2.6.9-22.0.1.EL

How reproducible:
Always

Steps to Reproduce:
1. Set up a dual Opteron 252 system with 16GB of RAM
2. Install latest smp kernel
3. Boot it up, check out your logs and the output of 'numactl --hardware'
  

Actual Results:
# numactl --hardware
available: 1 nodes (0-0)
node 0 size: 16576 MB
node 0 free: 15887 MB 

Expected Results:
# numactl --hardware
available: 2 nodes (0-1)
node 0 size: 8383 MB
node 0 free: 7876 MB
node 1 size: 8191 MB
node 1 free: 7945 MB 
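
To check many compute nodes for this symptom, the first line of the 'numactl --hardware' output can be parsed for the node count. The following is only a sketch: `parse_available_nodes` is a hypothetical helper, and it assumes the "available: N nodes" line format shown above.

```c
#include <assert.h>
#include <stdio.h>

/*
 * Return the node count from a line such as "available: 2 nodes (0-1)",
 * or -1 if the line does not have that form.
 */
int parse_available_nodes(const char *line)
{
    int count;

    assert(line != NULL);
    if (sscanf(line, "available: %d nodes", &count) == 1 && count > 0)
        return count;
    return -1;
}
```

On a dual-socket Opteron node, a result of 1 would indicate the failure described here, and 2 the expected setup.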

Additional info:

This problem also occurs on SUSE Linux Enterprise Server 9 with all kernels prior to their SP3 release, and was fixed in their SP3 kernel. I haven't tested under 2.6.9-22.0.2.EL, but I didn't see anything in the changelogs to indicate it had been addressed yet.

There's approximately a 15% degradation in compute node performance when running without the proper memory controllers set up (likely due to processes on cpu0 being assigned memory on cpu1 and vice versa, instead of sticking to memory on the local CPU's memory controller).

Comment 1 Jarod Wilson 2006-01-26 00:26:51 UTC

I'll see if I can't isolate the patch SUSE added to their SP3 kernel and slap it
on top of 2.6.9-22.0.2.EL later tonight or tomorrow.

Comment 2 Jarod Wilson 2006-01-26 22:28:18 UTC

Dead simple patch, if this is really all that's needed. Will get this applied
later today and see if the issue is resolved...

--------

From: ak
Subject: Increase NUMA node hash size
Suse-bugzilla: 106287
Patch-mainline: yes

This is needed on some systems with AMD E stepping CPUs which
have memory hoisting enabled. The memory map is not uniform
enough for the 256 entry hash table. Enlarge to 0xfff.

diff -u linux-2.6.5-hack/include/asm-x86_64/mmzone.h-o linux-2.6.5-hack/include/asm-x86_64/mmzone.h
--- linux-2.6.5-hack/include/asm-x86_64/mmzone.h-o      2004-04-04 05:38:00.000000000 +0200
+++ linux-2.6.5-hack/include/asm-x86_64/mmzone.h        2005-09-30 13:46:17.000000000 +0200
@@ -13,7 +13,7 @@
 #include <asm/smp.h>

 #define MAXNODE 8
-#define NODEMAPSIZE 0xff
+#define NODEMAPSIZE 0xfff

 /* Simple perfect hash to map physical addresses to node numbers */
 extern int memnode_shift;
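
To see why table size matters for the node layout in the initial boot log, here is a simplified model of a shift search. It is a sketch under stated assumptions, not the kernel's code: it assumes a bucket index of `addr >> shift`, requires every bucket to fit in the table and to belong to a single node, and tries shifts 24 through 47 as the log does. `struct node_range`, `find_hash_shift`, `NO_NODE` and `MAX_TABLE` are hypothetical names.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>
#include <string.h>

/* Hypothetical description of one node's physical range, [start, end). */
struct node_range {
    uint64_t start;
    uint64_t end;               /* exclusive */
};

#define NO_NODE   0xff          /* marks a bucket no node has claimed yet */
#define MAX_TABLE 4096          /* upper bound on table sizes in this sketch */

/*
 * Return the first shift in 24..47 for which every bucket (addr >> shift)
 * fits in a table of table_size entries and is claimed by only one node.
 * Return -1 if no such shift exists.
 */
int find_hash_shift(const struct node_range *nodes, int num_nodes,
                    size_t table_size)
{
    unsigned char table[MAX_TABLE];

    assert(table_size <= MAX_TABLE);
    assert(num_nodes < NO_NODE);

    for (int shift = 24; shift < 48; shift++) {
        int ok = 1;

        memset(table, NO_NODE, table_size);
        for (int i = 0; i < num_nodes && ok; i++) {
            if (nodes[i].start >= nodes[i].end)
                continue;       /* skip empty nodes */

            uint64_t first = nodes[i].start >> shift;
            uint64_t last = (nodes[i].end - 1) >> shift;

            for (uint64_t b = first; b <= last; b++) {
                /* Fail if the bucket is out of range or owned by another node. */
                if (b >= table_size ||
                    (table[b] != NO_NODE && table[b] != i)) {
                    ok = 0;
                    break;
                }
                table[b] = (unsigned char)i;
            }
        }
        if (ok)
            return shift;
    }
    return -1;
}
```

With the two node ranges from the initial boot log (0x0-0x203ffffff and 0x204000000-0x403ffffff), this model finds no usable shift for a 0xff-entry table and settles on shift 24 for a 0xfff-entry table. The node boundary at 0x204000000 is aligned only up to shift 26, and at that shift the model needs 257 buckets, just over the old limit. This says nothing about kernels whose hash function works differently.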

Comment 3 Jarod Wilson 2006-01-30 19:59:36 UTC

Well, apparently, that is NOT all that is required to fix this. I've verified
that the kernel I'm running now does have this patch implemented, but the
problem still exists. Back to the drawing board...

Comment 4 Jim Paradis 2006-01-30 22:38:59 UTC

The NUMA hash function was re-implemented in RHEL4 Update 2.  Please upgrade to
Update 2 or later and inform us if the problem persists.

Comment 5 Jarod Wilson 2006-01-30 23:03:31 UTC

The problem still exists with kernel-smp-2.6.9-22.0.2.EL, as well as with a
kernel built from the same sources w/the extra hash size patch (the
reimplemented numa hash function may explain why that patch didn't help). All
released updates have been applied to this system. I have yet to try out a U3
beta kernel though.

Comment 6 Jim Paradis 2006-01-30 23:12:23 UTC

If that is the case, then please provide a console log of an affected system
running the most recent kernel you have.  Printouts of the form:

        <6>node 1 shift 29 addr 204000000 conflict 0

were eliminated in the re-implementation of the NUMA hash function for U2.  Any
boot log that shows lines like this must be prior to U2.

Comment 7 Jarod Wilson 2006-01-30 23:25:53 UTC

I believe the initial console log was from an earlier kernel, but 'numactl
--hardware' on 2.6.9-22.0.2.EL does still show only a single memory controller.
I'll grab current console output a bit later this afternoon.

Comment 8 Jarod Wilson 2006-01-31 04:29:45 UTC

Here's the console output w/kernel-smp-2.6.9-22.0.2.EL:

Scanning NUMA topology in Northbridge 24
Number of nodes 2 (10010)
Node 0 using interleaving mode 1/0
No NUMA configuration found
Faking a node at 0000000000000000-0000000420000000
Bootmem setup node 0 0000000000000000-0000000420000000
No mptable found.
On node 0 totalpages: 4325376
  DMA zone: 4096 pages, LIFO batch:1
  Normal zone: 4321280 pages, LIFO batch:16
  HighMem zone: 0 pages, LIFO batch:1

Comment 9 Jim Paradis 2006-09-28 21:30:23 UTC

Does this problem persist with the most recent kernel?

Comment 10 Jarod Wilson 2006-09-29 03:42:37 UTC

Unfortunately, I don't have access to the hardware to test this on anymore...
Lemme see if I can ping someone back at my former employer to take a look though.

Comment 11 Red Hat Bugzilla 2007-03-18 22:37:32 UTC

User jparadis's account has been closed

Comment 12 Jarod Wilson 2007-06-15 21:11:47 UTC

No access to hardware and nobody else has reported a problem in over a year.
Closing INSUFFICIENT_DATA.