Created attachment 342659 /var/log/dmesg
I've got a new Dell R710 server that has two quad-core i7 CPUs and 24GB of memory, running an up-to-date F11RC+rawhide (kernel 2.6.29.2-126.fc11.x86_64), and it looks to me like it's not seeing all the memory:

[root@tessellate tjb]# cat /proc/meminfo
MemTotal: 3052108 kB
MemFree: 2602928 kB
Buffers: 19528 kB
Cached: 165560 kB
SwapCached: 0 kB
Active: 75148 kB
Inactive: 137076 kB
Active(anon): 27368 kB
Inactive(anon): 0 kB
Active(file): 47780 kB
Inactive(file): 137076 kB
Unevictable: 0 kB
Mlocked: 0 kB
SwapTotal: 8388600 kB
SwapFree: 8388600 kB
Dirty: 0 kB
Writeback: 0 kB
AnonPages: 27168 kB
Mapped: 17680 kB
Slab: 82164 kB
SReclaimable: 12884 kB
SUnreclaim: 69280 kB
PageTables: 5560 kB
NFS_Unstable: 0 kB
Bounce: 0 kB
WritebackTmp: 0 kB
CommitLimit: 9914652 kB
Committed_AS: 295396 kB
VmallocTotal: 34359738367 kB
VmallocUsed: 283692 kB
VmallocChunk: 34359422535 kB
HugePages_Total: 0
HugePages_Free: 0
HugePages_Rsvd: 0
HugePages_Surp: 0
Hugepagesize: 2048 kB
DirectMap4k: 6720 kB
DirectMap2M: 3129344 kB
[root@tessellate tjb]# cat /proc/version
Linux version 2.6.29.2-126.fc11.x86_64 (mockbuild.phx.redhat.com) (gcc version 4.4.0 20090427 (Red Hat 4.4.0-3) (GCC) ) #1 SMP Mon May 4 04:46:15 EDT 2009
[root@tessellate tjb]#

I'll attach the dmesg next.
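For scale: /proc/meminfo reports sizes in kB (KiB), so "MemTotal: 3052108 kB" is only about 2.91 GiB, far short of 24GB. The short Python sketch below converts MemTotal to GiB. The helper names (parse_meminfo, memtotal_gib) are hypothetical and not part of any Fedora or kernel tool; the sample is an excerpt of the output above, and on a live machine you would pass it the contents of /proc/meminfo instead.

```python
# Hypothetical helpers, not part of any Fedora or kernel tool.
KIB_PER_GIB = 1024 * 1024  # /proc/meminfo "kB" values are KiB


def parse_meminfo(text):
    """Map each /proc/meminfo key to its integer value; skip non-numeric lines."""
    fields = {}
    for line in text.splitlines():
        key, sep, rest = line.partition(":")
        parts = rest.split()
        if sep and parts and parts[0].isdigit():
            fields[key.strip()] = int(parts[0])
    return fields


def memtotal_gib(text):
    """Return MemTotal converted from KiB to GiB."""
    return parse_meminfo(text)["MemTotal"] / KIB_PER_GIB


if __name__ == "__main__":
    # Excerpt from the report above; on a live system read /proc/meminfo instead.
    sample = "MemTotal: 3052108 kB\nMemFree: 2602928 kB\nHugePages_Total: 0\n"
    print(f"{memtotal_gib(sample):.2f} GiB")  # prints "2.91 GiB"
```

Applied to the later, working kernels (MemTotal: 24489808 kB), the same conversion gives about 23.36 GiB, which is the expected figure for 24GB of installed RAM minus firmware and kernel reservations.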
Your BIOS is only providing the e801 interface for getting the information for memory below 4GB:

BIOS-provided physical RAM map:
 BIOS-e801: 0000000000000000 - 000000000009f000 (usable)
 BIOS-e801: 0000000000100000 - 00000000bf690000 (usable)

Either this is a BIOS bug or Dell is now requiring the OS to get the information via EFI instead of using the e820 interface.
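Summing the two BIOS-e801 ranges shows why the kernel ends up with roughly 3GB. The sketch below assumes the end addresses printed by this kernel are exclusive (half-open ranges); under that assumption the two ranges cover about 2.99 GiB, all of it below the 4GB line. usable_bytes is a hypothetical helper.

```python
# (start, end) pairs copied from the BIOS-e801 lines in the dmesg excerpt;
# assumption: end addresses are exclusive, as printed by this kernel.
E801_RANGES = [
    (0x0000000000000000, 0x000000000009F000),  # low memory below 640K
    (0x0000000000100000, 0x00000000BF690000),  # extended memory above 1MB
]


def usable_bytes(ranges):
    """Total size of a list of half-open [start, end) ranges."""
    return sum(end - start for start, end in ranges)


total = usable_bytes(E801_RANGES)
print(f"{total} bytes = {total / 2**30:.2f} GiB")  # 3210932224 bytes = 2.99 GiB
```

The kernel then reserves part of this for its own use, which is consistent with MemTotal (3052108 kB) being somewhat lower than the raw total.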
Can you try booting with "add_efi_memmap" on the kernel flags? Thanks, Kyle
I added add_efi_memmap to the kernel command line but it didn't seem to help.
Created attachment 342831 /var/log/messages with the EFI command-line option added
Adding some clueful folks.
Thomas, is it possible that "OS Install Mode" (or similar wording) is enabled in BIOS SETUP (F2 during POST)? That setting limits the amount of RAM that the BIOS exposes to the OS, to accommodate legacy OSes that have problems when presented with large amounts of RAM.
I just searched the BIOS and didn't see anything like that (and I've seen such settings in other BIOSes before). This is the first Dell BIOS I've seen that has a dedicated memory section. I've got it set to do node interleaving now, as that works around another bug I was seeing with virtualization (#499633), but either setting only yields 3GB of memory to Linux. (The BIOS reports the full 24GB.)
New kernel 2.6.29.3-140.fc11.x86_64 doesn't help.
I wouldn't expect it to... Can you try a 2.6.30-rc5 kernel from koji? Maybe we can backport it if we find it's fixed, otherwise we'll need to support your machine upstream anyway. http://koji.fedoraproject.org/koji/buildinfo?buildID=101770 cheers, Kyle
PANIC: early exception 0e rip 10:ffffffff010f5fbf error 0 cr2 5a10 immediately on boot of kernel-2.6.30-0.81.rc5.git1.fc12.x86_64.rpm
(In reply to comment #10)
> PANIC: early exception 0e rip 10:ffffffff010f5fbf error 0 cr2 5a10
>
> immediately on boot of kernel-2.6.30-0.81.rc5.git1.fc12.x86_64.rpm

at next_zones_zonelist+0x2b

The early exception code is supposed to dump the stack and print the address symbol, but I've never seen that work.
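The panic line itself can be decoded: on x86, vector 0e is the page fault (#PF), the "error" field is the page-fault error code, and cr2 holds the faulting address. The hypothetical decoder below uses the error-code bit meanings from the Intel/AMD architecture manuals. It reads "error 0 cr2 5a10" as a supervisor-mode read of a non-present page at 0x5a10, which looks like a dereference of a small offset from a NULL pointer; that last step is an inference, not something the trace proves.

```python
PF_VECTOR = 0x0E  # x86 exception vector 14: page fault (#PF)

# Page-fault error-code bits: bit -> (meaning if set, meaning if clear).
PF_BITS = {
    0: ("protection violation", "page not present"),  # P
    1: ("write", "read"),                              # W/R
    2: ("user mode", "supervisor mode"),               # U/S
    3: ("reserved bit set", None),                     # RSVD
    4: ("instruction fetch", None),                    # I/D
}


def decode_early_exception(vector, error, cr2):
    """Hypothetical helper: explain a 'PANIC: early exception' line."""
    if vector != PF_VECTOR:
        return f"vector {vector:#04x} (not a page fault)"
    parts = []
    for bit, (if_set, if_clear) in PF_BITS.items():
        text = if_set if error & (1 << bit) else if_clear
        if text:
            parts.append(text)
    return f"page fault at {cr2:#x}: " + ", ".join(parts)


print(decode_early_exception(0x0E, 0x0, 0x5A10))
# page fault at 0x5a10: page not present, read, supervisor mode
```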
(In reply to comment #2)
> Can you try booting with "add_efi_memmap" on the kernel flags?
>

add_efi_memmap doesn't do anything unless the system is booting in EFI mode.
I suspect this machine should be booting with EFI...
Created attachment 344217 patch to add some debugging for e820 failure
Looks like if e820 parsing fails we silently fall back to e801.
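The silent fallback can be pictured with a simplified Python model. It is not the kernel's code: choose_memory_map is a hypothetical name, and treating a map with fewer than two entries as a failure is an assumption about the sanity check. What it illustrates is that once the E820 map is rejected, all that remains is a low-memory range plus one contiguous block above 1MB sized from the E801 result, which matches the two BIOS-e801 lines in the dmesg.

```python
LOWMEM_END = 0x9F000     # end of the first BIOS-e801 line in the dmesg
HIGH_MEMORY = 0x100000   # 1MB, start of the second BIOS-e801 line


def choose_memory_map(e820_ranges, e801_ext_kb):
    """Simplified, hypothetical model of the silent E820 -> E801 fallback.

    e820_ranges: usable (start, end) ranges collected via INT 15h, AX=E820h.
    e801_ext_kb: KiB of memory above 1MB reported via INT 15h, AX=E801h.
    Assumption: a map with fewer than two entries is rejected as bogus.
    """
    if len(e820_ranges) >= 2:
        return "BIOS-e820", list(e820_ranges)
    # Fallback: low memory plus a single contiguous block starting at 1MB.
    # E801 cannot describe RAM above 4GB, so the rest of the 24GB is lost.
    return "BIOS-e801", [(0, LOWMEM_END),
                         (HIGH_MEMORY, HIGH_MEMORY + e801_ext_kb * 1024)]
```

With an empty E820 list and e801_ext_kb = 3135040 (that is, (0xbf690000 - 0x100000) / 1024), choose_memory_map([], 3135040) returns "BIOS-e801" with the ranges (0, 0x9f000) and (0x100000, 0xbf690000).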
As a test, I just booted a CentOS/RHEL 5.3 kernel and it sees all 24GB. Do you think this is something that will get fixed soon? I need to get this system up and running but could keep it at F11 for a few days if this is something being actively worked on. Otherwise there's some pressure to get it working.
Created attachment 344606 dmesg from a successful CentOS/RHEL 5.3 boot
Created attachment 344607
The dmesg missed the important parts; here's the syslog.
Can you try a Fedora 10 install disk? You'd just have to boot the network-based installer and go to the console to look at /proc/meminfo . http://download.fedora.redhat.com/pub/fedora/linux/releases/10/Fedora/x86_64/iso/Fedora-10-x86_64-netinst.iso
Created attachment 344660
Works with Fedora 10; here's the syslog.
http://koji.fedoraproject.org/koji/taskinfo?taskID=1364149 Can you try booting one of the kernels here? This will at least tell us which entry is failing and probably why, so we can fix the error in the e820 sanitizer. The lines needed should start with nr_map. (The build will probably take an hour or two to chug through.)
http://koji.fedoraproject.org/koji/taskinfo?taskID=1364338 On second thought, could you try this one, which has a few e820 patches reverted? Thanks! Kyle
[root@tessellate /]# cat /proc/version
Linux version 2.6.29.3-152.bz499396.fc11.x86_64 (mockbuild.phx.redhat.com) (gcc version 4.4.0 20090506 (Red Hat 4.4.0-4) (GCC) ) #1 SMP Tue May 19 15:23:37 EDT 2009
[root@tessellate /]# cat /proc/meminfo
MemTotal: 24489808 kB
MemFree: 24128584 kB
Buffers: 20088 kB
Cached: 55820 kB
SwapCached: 0 kB
Active: 48188 kB
Inactive: 54200 kB
Active(anon): 26728 kB
Inactive(anon): 0 kB
Active(file): 21460 kB
Inactive(file): 54200 kB
Unevictable: 0 kB
Mlocked: 0 kB
SwapTotal: 8388600 kB
SwapFree: 8388600 kB
Dirty: 72 kB
Writeback: 0 kB
AnonPages: 26616 kB
Mapped: 15412 kB
Slab: 55276 kB
SReclaimable: 11608 kB
SUnreclaim: 43668 kB
PageTables: 5008 kB
NFS_Unstable: 0 kB
Bounce: 0 kB
WritebackTmp: 0 kB
CommitLimit: 20633504 kB
Committed_AS: 303580 kB
VmallocTotal: 34359738367 kB
VmallocUsed: 333764 kB
VmallocChunk: 34359372543 kB
HugePages_Total: 0
HugePages_Free: 0
HugePages_Rsvd: 0
HugePages_Surp: 0
Hugepagesize: 2048 kB
DirectMap4k: 6756 kB
DirectMap2M: 25149440 kB
Ok. Damn. This pretty much means it's probably broken upstream in 2.6.30 too. :/ Just to clarify, this output is from the second kernel? Can you grep syslog for those nr_map lines? Thanks! Kyle
Yes, this is the second kernel, with "a few e820 patches reverted" as referenced in comment #21. Perhaps this one doesn't have the debugging enabled, as there are no nr_map messages in dmesg or syslog.
Oh, duh, I'm brainless. I guess the patches fix whatever is corrupting the e820 table. Can you fish these lines out of the scratch build I posted just before it? I'll have to do one more build to really narrow down the broken changeset. cheers, Kyle
I installed the kernel from #20 (it's named identically to the one in #21, which is making this somewhat confusing). It booted, but it shows only 3GB and doesn't print any nr_map messages.
Ok, thanks! I've tagged and am building a proper -154 which should fix this. I'll try to get this sorted out upstream as well. Thanks again for testing! kyle
Just FWIW, it's one of these patches upstream that causes the issue:

commit cd670599b7b00d9263f6f11a05c0edeb9cbedaf3
Author: H. Peter Anvin <hpa>
Date: Wed Apr 1 11:35:00 2009 -0700

    x86, setup: guard against pre-ACPI 3 e820 code not updating %ecx

    Impact: BIOS bug safety

    For pre-ACPI 3 BIOSes, pre-initialize the end of the e820 buffer just
    in case the BIOS returns an unchanged %ecx but without actually
    touching the ACPI 3 extended flags field.

    Signed-off-by: H. Peter Anvin <hpa>

commit c549e71d073a6e9a4847497344db28a784061455
Author: H. Peter Anvin <hpa>
Date: Sat Mar 28 13:53:26 2009 -0700

    x86, setup: ACPI 3, BIOS workaround for E820-probing code

    Impact: ACPI 3 spec compliance, BIOS bug workaround

    The ACPI 3 spec added another field to the E820 buffer -- which is
    backwards incompatible, since it contains a validity bit.
    Furthermore, there has been at least one report of a BIOS which
    assumes that the buffer it is pointed at is the same buffer as for
    the previous E820 call. Therefore, read the data into a temporary
    buffer and copy the standard part of it if and only if the valid bit
    is set.

    Signed-off-by: H. Peter Anvin <hpa>

commit 32ec7fd08b597586774b92ac1cd2678021ccac1b
Author: H. Peter Anvin <hpa>
Date: Sat Mar 28 13:53:26 2009 -0700

    x86, setup: preemptively save/restore edi and ebp around INT 15 E820

    Impact: BIOS bugproofing

    Since there are BIOSes known to clobber %ebx and %esi for INT 15
    E820, assume there is something out there clobbering %edi and/or %ebp
    too, and don't wait for it to fail.

    Signed-off-by: H. Peter Anvin <hpa>
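The first two commit messages describe a small buffer protocol. The Python model below is a rough sketch of it, not the real-mode setup code itself; accept_entry and the constants are hypothetical names. Each E820 call writes into a fresh temporary 24-byte buffer (base, length, type, extended attributes) whose extended-attributes field was pre-set to 1. If the BIOS reports more than 20 bytes in %ecx, the standard 20-byte part is kept only when bit 0 of that field is set.

```python
import struct

ACPI3_ENTRY = struct.Struct("<QQII")  # base, length, type, extended attributes
LEGACY_SIZE = 20                       # pre-ACPI 3 entry: base, length, type only
E820_RAM = 1


def accept_entry(bios_bytes, ecx):
    """Hypothetical model of the (since reverted) ACPI 3 E820 handling.

    bios_bytes: the bytes the BIOS actually wrote into the 24-byte buffer.
    ecx: the entry size the BIOS returned in %ecx.
    Returns (base, length, type), or None if the entry is ignored.
    """
    # Fresh temporary buffer, with the extended attributes pre-set to 1 so a
    # BIOS that reports 24 bytes without writing that field is still accepted.
    buf = bytearray(ACPI3_ENTRY.pack(0, 0, 0, 1))
    data = bytes(bios_bytes[:ACPI3_ENTRY.size])
    buf[:len(data)] = data
    base, length, etype, ext = ACPI3_ENTRY.unpack(buf)
    # Only honour the validity bit when the BIOS claims to return ACPI 3 data.
    if ecx > LEGACY_SIZE and not (ext & 1):
        return None
    return base, length, etype
```

A legacy 20-byte entry is accepted whether %ecx comes back as 20 or 24, because of the pre-set field; a 24-byte entry with bit 0 clear is discarded.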
adding hpa, as you're reverting his patches. :-)
OK, I really need to know this. This sounds like a BIOS which purports to report ACPI 3 information, which is broken. It would be helpful if someone could run the "meminfo" module from the Syslinux distribution on this system; it should report the raw memory map as reported by the BIOS.
Also, what BIOS version is/was running on the offending server? I think it is critical that we understand this.
The server is a Dell R710 running BIOS version 1.0.4. Could you elaborate on how to run the meminfo module?
I don't know if the Fedora install disks include it; if they do you can just boot to the menu, press Esc to get to a boot prompt, and then type "meminfo".
meminfo doesn't seem to be included on the Fedora 8 disc I tried. Memtest is though and it correctly shows 24GB. Does that help? I could include a screenshot if need be.
Created attachment 344871 t610.jpg
meminfo.c32 output from a T610 (albeit with a test BIOS).
From conversation with hpa: see ACPI 3.0b, section 14.1, in particular tables 14-4 and 14-5. Table 14-5 specifies bit 0 in the extended flags as "AddressRangeEnabled": "If clear, the OSPM ignores the Address Range Descriptor. This allows the BIOS to populate the E820 table with a static number of structures but only enable them as necessary." Engaging the BIOS team leads.
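The Table 14-5 rule can be illustrated with the kind of static descriptor table the spec text describes: the OS keeps only the descriptors whose AddressRangeEnabled bit is set. In the sketch below, enabled_descriptors is a hypothetical helper and the table values are made up for illustration; they are not taken from this machine.

```python
ADDRESS_RANGE_ENABLED = 1 << 0  # ACPI 3.0b Table 14-5, bit 0


def enabled_descriptors(table):
    """Drop descriptors whose AddressRangeEnabled bit is clear.

    table: iterable of (base, length, type, ext_attrs) tuples.
    """
    return [(base, length, etype)
            for base, length, etype, ext in table
            if ext & ADDRESS_RANGE_ENABLED]


# Made-up static table: two enabled RAM ranges plus one unused slot that the
# BIOS has not enabled, so the OS must ignore it.
static_table = [
    (0x00000000, 0x0009F000, 1, 1),
    (0x00100000, 0xBF590000, 1, 1),
    (0x00000000, 0x00000000, 0, 0),
]
print(enabled_descriptors(static_table))
```

The same rule is what makes a BIOS bug here costly: if a BIOS leaves bit 0 clear on descriptors that really describe RAM, an OS that follows the spec must discard them.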
Created attachment 344901 Proposed patch for upstream
I am proposing the attached patch for upstream. I would appreciate any help testing this patch to verify that it fixes the problem.
Scratch rpms are building here: http://koji.fedoraproject.org/koji/taskinfo?taskID=1367616 Could you please test this, Thomas, and let us know if you still get your full 24GB of memory with them? regards, Kyle
Created attachment 344950 meminfo.c32 from actual machine
It does seem to fix it. Any other info you want?

tessellate> cat /proc/version
Linux version 2.6.29.3-155.bz499396.fc11.x86_64 (mockbuild.phx.redhat.com) (gcc version 4.4.0 20090506 (Red Hat 4.4.0-4) (GCC) ) #1 SMP Wed May 20 21:36:06 EDT 2009
tessellate> cat /proc/meminfo
MemTotal: 24489808 kB
MemFree: 24131152 kB
Buffers: 20120 kB
Cached: 55144 kB
SwapCached: 0 kB
Active: 45192 kB
Inactive: 55248 kB
Active(anon): 25388 kB
Inactive(anon): 0 kB
Active(file): 19804 kB
Inactive(file): 55248 kB
Unevictable: 0 kB
Mlocked: 0 kB
SwapTotal: 8388600 kB
SwapFree: 8388600 kB
Dirty: 24 kB
Writeback: 0 kB
AnonPages: 25236 kB
Mapped: 14700 kB
Slab: 54532 kB
SReclaimable: 11352 kB
SUnreclaim: 43180 kB
PageTables: 4812 kB
NFS_Unstable: 0 kB
Bounce: 0 kB
WritebackTmp: 0 kB
CommitLimit: 20633504 kB
Committed_AS: 301440 kB
VmallocTotal: 34359738367 kB
VmallocUsed: 333764 kB
VmallocChunk: 34359371951 kB
HugePages_Total: 0
HugePages_Free: 0
HugePages_Rsvd: 0
HugePages_Surp: 0
Hugepagesize: 2048 kB
DirectMap4k: 6756 kB
DirectMap2M: 25149440 kB
tessellate>
hpa, http://koji.fedoraproject.org/koji/taskinfo?taskID=1370708 new scratch build is available here with your latest patch (the reversion of the ACPI 3 e820 code.) thanks! Kyle
That build seems to work for me. Do you want any output?
The patch included in the build in comment #41 (reversion of ACPI 3 e820 code) has landed in the x86-tip tree, and is expected to be in Linus's tree shortly.