Description of problem:
After upgrading to kernel-smp-2.6.11-1.14_FC3 a dual-opteron tyan 2882 system
displays several log errors like

Apr 21 13:53:58 n26 kernel: mm/memory.c:97: bad pmd ffff8100126a4808(00000035f8800a88).
Apr 21 13:53:58 n26 kernel: mm/memory.c:97: bad pmd ffff8100126a4810(0000000000000001).
Apr 21 13:53:58 n26 kernel: mm/memory.c:97: bad pmd ffff8100126a4818(00007ffffffffa32).

and finally starts to continually oops:

Apr 21 13:56:17 n26 kernel: mm/memory.c:97: bad pmd ffff81000e1a8b10(34365f3638780000).
Apr 21 13:56:17 n26 kernel: Unable to handle kernel paging request at ffff810250800000 RIP:
Apr 21 13:56:17 n26 kernel: <ffffffff80177024>{free_pages_and_swap_cache+68}
Apr 21 13:56:17 n26 kernel: PGD 8063 PUD 0
Apr 21 13:56:17 n26 kernel: Oops: 0000 [1] SMP
Apr 21 13:56:17 n26 kernel: CPU 0
Apr 21 13:56:17 n26 kernel: Modules linked in: nfs lockd md5 ipv6 parport_pc lp parport autofs4 sunrpc pcmcia yenta_socket rsrc_nonstatic pcmcia_core video button battery ac ohci_hcd e100 mii tg3 floppy dm_snapshot dm_zero dm_mirror ext3 jbd dm_mod sata_sil libata sd_mod scsi_mod
Apr 21 13:56:17 n26 kernel: Pid: 20886, comm: id Not tainted 2.6.11-1.14_FC3smp
Apr 21 13:56:17 n26 kernel: RIP: 0010:[<ffffffff80177024>] <ffffffff80177024>{free_pages_and_swap_cache+68}
Apr 21 13:56:17 n26 kernel: RSP: 0000:ffff810015ef1ce8 EFLAGS: 00010202
Apr 21 13:56:17 n26 kernel: RAX: 0000000001000000 RBX: ffff810250800000 RCX: ffff81000156a450
Apr 21 13:56:17 n26 kernel: RDX: ffff8100014064f0 RSI: 0000000000000001 RDI: 0000000000000068
Apr 21 13:56:17 n26 kernel: RBP: 0000000000000004 R08: ffff81007f107240 R09: 000000000000000f
Apr 21 13:56:17 n26 kernel: R10: 0000000000000001 R11: ffffffff8011caf0 R12: ffff810001e083a8
Apr 21 13:56:17 n26 kernel: R13: 0000000000000005 R14: 0000000000000005 R15: ffff810001e083a0
Apr 21 13:56:17 n26 kernel: FS: 00002aaaaafbd0a0(0000) GS:ffffffff804e8980(0000) knlGS:00000000557bf0a0
Apr 21 13:56:17 n26 kernel: CS: 0010 DS: 0000 ES: 0000 CR0: 000000008005003b
Apr 21 13:56:17 n26 kernel: CR2: ffff810250800000 CR3: 0000000019252000 CR4: 00000000000006e0
Apr 21 13:56:17 n26 kernel: Process id (pid: 20886, threadinfo ffff810015ef0000, task ffff81003fb9a030)
Apr 21 13:56:17 n26 kernel: Stack: 0000800000000000 ffff810001e08280 ffff81007c8049c0 ffff81007c804940
Apr 21 13:56:17 n26 kernel: ffff81007c8049b8 0000000000000001 ffff810015ef1ef8 ffffffff80172ac3
Apr 21 13:56:17 n26 kernel: 0000000000000000 ffff810001e08280
Apr 21 13:56:17 n26 kernel: Call Trace:<ffffffff80172ac3>{exit_mmap+307} <ffffffff80135ec4>{mmput+52}
Apr 21 13:56:17 n26 kernel: <ffffffff8013adb3>{do_exit+355} <ffffffff801436d5>{__dequeue_signal+485}
Apr 21 13:56:17 n26 kernel: <ffffffff8013b8ff>{do_group_exit+239} <ffffffff801457da>{get_signal_to_deliver+1514}
Apr 21 13:56:17 n26 kernel: <ffffffff8010d963>{do_signal+163} <ffffffff80201451>{__up_write+49}
Apr 21 13:56:17 n26 kernel: <ffffffff8010ebb2>{retint_signal+62}
Apr 21 13:56:17 n26 kernel:
Apr 21 13:56:17 n26 kernel: Code: 8b 03 a9 00 00 01 00 74 1b f0 0f ba 2b 00 19 c0 85 c0 75 10
Apr 21 13:56:17 n26 kernel: RIP <ffffffff80177024>{free_pages_and_swap_cache+68} RSP <ffff810015ef1ce8>
Apr 21 13:56:17 n26 kernel: CR2: ffff810250800000
Apr 21 13:56:17 n26 kernel: <1>Unable to handle kernel NULL pointer dereference at 0000000000000048 RIP:
Apr 21 13:56:17 n26 kernel: <ffffffff80136026>{mm_release+86}
Apr 21 13:56:17 n26 kernel: PGD 0
Apr 21 13:56:17 n26 kernel: Oops: 0000 [2] SMP
Apr 21 13:56:17 n26 kernel: CPU 0
Apr 21 13:56:17 n26 kernel: Modules linked in: nfs lockd md5 ipv6 parport_pc lp parport autofs4 sunrpc pcmcia yenta_socket rsrc_nonstatic pcmcia_core video button battery ac ohci_hcd e100 mii tg3 floppy dm_snapshot dm_zero dm_mirror ext3 jbd dm_mod sata_sil libata sd_mod scsi_mod
Apr 21 13:56:17 n26 kernel: Pid: 20886, comm: id Not tainted 2.6.11-1.14_FC3smp
Apr 21 13:56:17 n26 kernel: RIP: 0010:[<ffffffff80136026>] <ffffffff80136026>{mm_release+86}
Apr 21 13:56:17 n26 kernel: RSP: 0000:ffff810015ef1a78 EFLAGS: 00010206
Apr 21 13:56:17 n26 kernel: RAX: ffff81003fb9a030 RBX: ffff81003fb9a030 RCX: ffff81003fb9a030
Apr 21 13:56:17 n26 kernel: RDX: ffff81003fb9a000 RSI: 0000000000000000 RDI: 00002aaaaafbd130
Apr 21 13:56:17 n26 kernel: RBP: 0000000000000000 R08: ffffffff80529a00 R09: 0000000000000008
Apr 21 13:56:17 n26 kernel: R10: 0000000000000000 R11: ffffffff8011caf0 R12: 0000000000000000
Apr 21 13:56:17 n26 kernel: R13: 0000000000000000 R14: 0000000000000000 R15: 0000000000000000
Apr 21 13:56:17 n26 kernel: FS: 00002aaaaafbd0a0(0000) GS:ffffffff804e8980(0000) knlGS:00000000557bf0a0
Apr 21 13:56:17 n26 kernel: CS: 0010 DS: 0000 ES: 0000 CR0: 000000008005003b
Apr 21 13:56:17 n26 kernel: CR2: 0000000000000048 CR3: 0000000019252000 CR4: 00000000000006e0
Apr 21 13:56:17 n26 kernel: Process id (pid: 20886, threadinfo ffff810015ef0000, task ffff81003fb9a030)
Apr 21 13:56:17 n26 kernel: Stack: 0000000000000000 0000000000000000 ffff81003fb9a030 ffff81003fb9a030
Apr 21 13:56:17 n26 kernel: 0000000000000009 ffffffff8013a93a 0000000000000000 0000000000000001
Apr 21 13:56:17 n26 kernel: ffff81003fb9a6b0 ffff81003fb9a030
Apr 21 13:56:17 n26 kernel: Call Trace:<ffffffff8013a93a>{exit_mm+42} <ffffffff8013adb3>{do_exit+355}
Apr 21 13:56:17 n26 kernel: <ffffffff8010fe68>{oops_end+40} <ffffffff80122b52>{do_page_fault+2050}
Apr 21 13:56:17 n26 kernel: <ffffffff80138d30>{vprintk+528} <ffffffff8010f041>{error_exit+0}
Apr 21 13:56:17 n26 kernel: <ffffffff8011caf0>{flat_send_IPI_mask+0} <ffffffff80177024>{free_pages_and_swap_cache+68}
Apr 21 13:56:17 n26 kernel: <ffffffff8017705d>{free_pages_and_swap_cache+125} <ffffffff80172ac3>{exit_mmap+307}
Apr 21 13:56:17 n26 kernel: <ffffffff80135ec4>{mmput+52} <ffffffff8013adb3>{do_exit+355}
Apr 21 13:56:17 n26 kernel: <ffffffff801436d5>{__dequeue_signal+485} <ffffffff8013b8ff>{do_group_exit+239}
Apr 21 13:56:17 n26 kernel: <ffffffff801457da>{get_signal_to_deliver+1514} <ffffffff8010d963>{do_signal+163}
Apr 21 13:56:17 n26 kernel: <ffffffff80201451>{__up_write+49} <ffffffff8010ebb2>{retint_signal+62}
Apr 21 13:56:17 n26 kernel:
Apr 21 13:56:17 n26 kernel:
Apr 21 13:56:17 n26 kernel: Code: 41 8b 45 48 ff c8 7e 63 48 c7 83 e8 01 00 00 00 00 00 00 65
Apr 21 13:56:17 n26 kernel: RIP <ffffffff80136026>{mm_release+86} RSP <ffff810015ef1a78>
Apr 21 13:56:17 n26 kernel: CR2: 0000000000000048
Apr 21 13:56:17 n26 kernel: <1>Unable to handle kernel NULL pointer dereference at 0000000000000048 RIP:
[...]
Apr 21 14:24:07 n26 kernel: Oops: 0000 [3482] SMP

Version-Release number of selected component (if applicable):
kernel-smp-2.6.11-1.14_FC3

How reproducible:
Difficult; it looks like the machine needs to be under CPU and NFS load. It's
100% reproducible if I start rebuilding ATrpms over NFS (using one processor
only).

Steps to Reproduce:
1. Install the kernel
2. Stress the system over NFS?

Actual results:

Expected results:

Additional info:
The issue seems to be known on lkml, and to the fedora kernel maintainers
there, although it was hoped that it had been fixed in the fedora kernels.
Filing this here, so we have a reference point.

The system runs w/o X, and w/o any extra kernel modules, tainting etc.

Let me know what other input I can provide, or whether I should try another
kernel.
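For anyone who wants to compare their own logs against the ones in this report, here is a rough sketch for pulling the "bad pmd" entries out of syslog lines. It is only an illustration: the regular expression assumes the message format quoted above, and the helper name `bad_pmds` is made up for this example.

```python
import re

# Matches e.g. "mm/memory.c:97: bad pmd ffff8100126a4808(00000035f8800a88)."
# The line number after memory.c is left open, since it differs between kernels.
BAD_PMD = re.compile(r"mm/memory\.c:\d+: bad pmd ([0-9a-f]{16})\(([0-9a-f]{16})\)")


def bad_pmds(lines):
    """Return (entry_address, entry_value) integer pairs for each 'bad pmd' line."""
    found = []
    for line in lines:
        match = BAD_PMD.search(line)
        if match:
            found.append((int(match.group(1), 16), int(match.group(2), 16)))
    return found


# Sample lines copied from the report.
sample = [
    "Apr 21 13:53:58 n26 kernel: mm/memory.c:97: bad pmd ffff8100126a4808(00000035f8800a88).",
    "Apr 21 13:53:58 n26 kernel: mm/memory.c:97: bad pmd ffff8100126a4810(0000000000000001).",
    "Apr 21 13:53:58 n26 kernel: mm/memory.c:97: bad pmd ffff8100126a4818(00007ffffffffa32).",
]

for address, value in bad_pmds(sample):
    print(f"{address:016x} -> {value:016x}")
```

In the sample the three entries sit 8 bytes apart, i.e. they are neighbouring slots in the same table.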
argh. this one is a mystery, and I've spent quite a few days chasing it. The good news is that it is fixed in 2.6.12, but a rebase to that for FC3 is some way off, so it'd be good to get to the bottom of this before then.
I see it too:

May 25 23:31:40 sheen kernel: mm/memory.c:97: bad pmd ffff81004ab41750(000000000000000f).
May 25 23:31:40 sheen kernel: mm/memory.c:97: bad pmd ffff81004ab41758(00007ffffffff773).
May 25 23:31:40 sheen kernel: mm/memory.c:97: bad pmd ffff81004ab41770(365f363878000000).
May 25 23:31:40 sheen kernel: mm/memory.c:97: bad pmd ffff81004ab41778(0000000000000034).
<etc>

Also got:

May 24 23:37:41 sheen kernel: swap_free: Bad swap offset entry 5f363878000000
May 24 23:37:41 sheen kernel: swap_free: Bad swap file entry d800000000000034

Finally, though this probably isn't related (I enabled some BIOS MCE reporting option after a reboot), I now get lots of:

May 25 02:46:50 sheen kernel: Machine check events logged

There doesn't seem to be anything logged about what events these are exactly. No oops though.

Tyan S2885 machine with dual 2.2GHz Opteron CPUs, DDR-333, node interleave disabled in BIOS (so kernel sees it as a NUMA machine).
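A side observation on the values above, offered as a sketch rather than a diagnosis: the bogus entries do not look random. Laid out in little-endian byte order, as they would sit in memory on x86_64, the values in the ...770 and ...778 slots spell out a piece of ASCII text. The helper name `as_memory_bytes` is made up for this example.

```python
def as_memory_bytes(values):
    """Lay out 64-bit values in the order they sit in little-endian memory."""
    return b"".join(value.to_bytes(8, "little") for value in values)


# The two adjacent entries reported at ...770 and ...778.
raw = as_memory_bytes([0x365F363878000000, 0x0000000000000034])
print(raw)

# Drop the NUL padding to see the printable part.
text = raw.replace(b"\x00", b"").decode("ascii")
print(text)  # x86_64
```

The value 34365f3638780000 from the original report decodes the same way. This only shows what the bytes are, not how they ended up in the page tables.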
With the latest errata kernel, 2.6.11-1.27_FC3smp, the logs have changed. Perhaps that gives some information on what's happening?

May 27 18:31:32 n26 kernel: check-files:11239: mm/memory.c:98: bad pmd ffff81003068c7d8(00000035f8800a88).
May 27 18:31:32 n26 kernel: check-files:11239: mm/memory.c:98: bad pmd ffff81003068c7e0(0000000000000003).
[...]
May 27 18:33:49 n26 kernel: check-files:16154: mm/memory.c:98: bad pmd ffff810028ea47d8(00000035f8800a88).
May 27 18:33:49 n26 kernel: check-files:16154: <7>Losing some ticks... checking if CPU frequency changed.
May 27 18:33:49 n26 kernel: mm/memory.c:98: bad pmd ffff810028ea47e0(0000000000000003).
May 27 18:33:49 n26 kernel: check-files:16154: mm/memory.c:98: bad pmd ffff810028ea47e8(00007ffffffffa25).

check-files is a script in /usr/lib/rpm/check-files used by rpmbuild. Its contents look quite harmless (see below). Note that when the check-files script is run there is no NFS activity (%{buildroot} is on a local file system).

#!/bin/sh
#
# Gets file list on standard input and RPM_BUILD_ROOT as first parameter
# and searches for omitted files (not counting directories).
# Returns it's output on standard output.
#
# filon.pl

RPM_BUILD_ROOT=$1

if [ ! -d "$RPM_BUILD_ROOT" ] ; then
        cat > /dev/null
        exit 1
fi

[ "$TMPDIR" ] || TMPDIR=/tmp
FILES_DISK=`mktemp $TMPDIR/rpmXXXXXX`
FILES_RPM=`mktemp $TMPDIR/rpmXXXXXX`

find $RPM_BUILD_ROOT -type f | LC_ALL=C sort > $FILES_DISK
LC_ALL=C sort > $FILES_RPM

for f in `diff -d "$FILES_DISK" "$FILES_RPM" | grep "^< " | cut -c3-`; do
        echo $f | sed -e "s#^$RPM_BUILD_ROOT# #g"
done

rm -f $FILES_DISK
rm -f $FILES_RPM
I've now seen this on two different machines:

dual AMD Opteron(tm) Processor 246 with 4GB ram
dual AMD Opteron(tm) Processor 244 with 8GB ram

Something that might be of interest is that it seems to appear along with the error:

Usage: ld.so [OPTION]... EXECUTABLE-FILE [ARGS-FOR-PROGRAM...]

On one machine, I typed "reboot" yesterday, and it just gave that error (no reboot). Typing reboot again a few hours later worked as expected. Another machine gave that error four times when running the weekly makewhatis cronjob last night.

We haven't had any kernel crashes yet, but having commands randomly fail is bad enough that I'm willing to try experimental kernels on one of our machines. If there are debugging kernels you want me to try, just let me know. (I see there's a -29 build out, but someone already reproduced the error there.) Also tell me if there's other info I can provide.
Has anyone tried rolling back to the most recent 2.6.10 kernel errata for FC2? Is there any reason why 2.6.10-1.771_FC2smp won't work on FC3?
I'm seeing this with kernel 2.6.11-1.27_FC3smp. I get the

Usage: ld.so [OPTION]... EXECUTABLE-FILE [ARGS-FOR-PROGRAM...]

messages when trying to run configure for some software. A reboot fixes it sometimes, but most of the time the reboot fails when fsck is run on the / partition. This usually requires a power cycle to clear, but the bad pmd errors come back fairly quickly. Having to go to the system to clear this up is a pain for systems on the other side of campus. The systems are using Tyan 2882 motherboards with dual Opterons.
I was having this problem too with a Tyan S2885, but a patched kernel (x86_64 version of 2.6.11-1.31_FC3smp) provided by Dave Jones (http://people.redhat.com/davej/kernels/Fedora/FC3/) seems to have fixed the problem. You may want to try it.
I read about that kernel on the fedora-list and I have it running on my systems as well. So far so good...
*** Bug 159560 has been marked as a duplicate of this bug. ***
There are still 1-2 reports of this bug happening with the .31 kernel, so it seems we're really not making progress at nailing it down in the .11.x kernel. I've begun work on a 2.6.12rc backport to FC3, which is at http://people.redhat.com/davej/kernels/test/ It's had no testing at all just yet, so be very careful with it, and if anyone is brave enough to try it, I'd be *very* interested to hear from you if this bug reoccurs with it. Thanks.
I tried to be brave, but there are some selinux dependencies left:

# rpm -ihv kernel-smp-2.6.11-1.1369_FC3.x86_64.rpm
error: Failed dependencies:
        selinux-policy-targeted < 1.23.16-1 conflicts with kernel-smp-2.6.11-1.1369_FC3.x86_64

Should we try it with --nodeps anyway, or is it doomed to break due to selinux?
I think you'll get loads of avc errors if you --nodeps. I'm not sure if you can just take the FC4 policy or not. I'll check with Dan on Monday, and get something worked out.
>How reproducible:
>difficult, it looks like the machine needs to be under CPU and NFS load. It's
>100% reproducible if I start rebuilding ATrpms over NFS (using one processor
>only).

FWIW, NFS is not in the picture in my case
https://bugzilla.redhat.com/bugzilla/show_bug.cgi?id=159560

I am using RAID5 partitions on a 3ware Escalade 8506-4. It happened during compilation (koffice, with gcc-3.4.3-22) using one CPU.
Any more progress on this? I'm using the 2.6.11-1.33 kernel now, and while trying to put the web100 changes into the kernel I get the usage message for "ld.so" (like in comment #4) occasionally during the make. Restarting the make gets it further, then it happens again. I've had to restart the make 5 times so far. I don't know if this is related to the "bad pmd" issue; I'm not getting the messages in /var/log/messages any more.
An update has been released for Fedora Core 3 (kernel-2.6.12-1.1372_FC3) which may contain a fix for your problem. Please update to this new kernel, and report whether or not it fixes your problem. If you have updated to Fedora Core 4 since this bug was opened, and the problem still occurs with the latest updates for that release, please change the version field of this bug to 'fc4'. Thank you.
Can anyone still seeing this problem on the latest kernel please check that they have the latest BIOS update installed? There is an AMD erratum affecting some of their CPUs that can only be worked around with a BIOS update, which hopefully all vendors should have picked up by now. As this bug has been quiet for a few weeks, I'm going to close this soon unless someone reports that they're still seeing it with the update. Thanks.
I upgraded to 2.6.12-1.1373_FC3smp (from 2.6.11-1.35) and didn't get any 'bad pmd' messages (just under 2 hours of running). I also updated the Tyan 2885 BIOS to v2.05, and the machine check errors appear to have gone. Will observe a while longer, but it looks good so far.
Some machine check errors (no information on what these are about) since yesterday, but otherwise fine. The 'bad pmd' problem appears to be fixed.
The machine is still unstable. I was getting a lot of "Northbridge Chipkill ECC error" and "L2 cache ECC error" MCEs. After carefully reading through the AMD errata, I decided to disable chipkill and ECC scrubbing in the BIOS (there's at least one erratum related to chipkill; it's old, but I've no idea whether Tyan applied the workaround in the v2.05 BIOS update). However, with only ECC MCE reporting enabled, I still get MCEs: "Northbridge ECC error" at a rate of a couple per day, and the machine tends to lock up hard about every other week.

I captured the forced panic via NMI watchdog using netconsole, and it seems it might actually be an NMI handler bug (or a bug in code interrupted by the NMI handler) locking my machine up rather than an actual hardware lockup, as I get the following message:

>Debug: sleeping function called from invalid context at include/linux/rwsem.h:43

See the attached file for details.
Created attachment 119677
NMI, sleeping function called from invalid context
The machine check problem is likely unrelated to the issue first reported in this bug. Have you tried running memtest86 on that box?
Not recently, no. And memtest86 isn't particularly good at finding RAM problems, and I can't afford to be without this machine for several days. :(

However, if it's due to RAM that'd be weird, as it affects a bunch of about 16 different reported addresses in 2 distinct sets of ranges. (Is there a way to figure out which DIMM an ADDR reported by mcelog belongs to? E.g., what ranges of physical addresses are assigned to what banks on which CPUs? What does the 'CPU X Y' line in mcelog mean? CPU number, then bank?) And the RAM is over-spec'd: dmidecode thinks it's "400MHz" (DDR-400?), however for whatever reason (CPU model/speed) the BIOS sets it up for 166MHz (DDR-333?).

Anyway, it does /seem/ like bad RAM or CPU, but I just keep wondering: "Am I hitting another Opteron erratum?" :( (Pair of model 5 stepping 8 Opterons, C0 revision btw.)
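On the 166MHz versus DDR-333 question above: DDR memory transfers data on both edges of the clock, so the DDR-nnn rating is twice the memory clock. A quick sketch of the arithmetic follows, assuming the BIOS figure of 166MHz is a rounded 166.67MHz clock and that dmidecode's "400MHz" is the DDR-400 rating of the modules. The function name `ddr_rating` is made up for this example.

```python
def ddr_rating(memory_clock_mhz):
    """Return the DDR-nnn rating: two transfers per clock cycle."""
    return round(memory_clock_mhz * 2)


print(ddr_rating(166.67))  # 333, i.e. DDR-333
print(ddr_rating(200))     # 400, i.e. DDR-400
```

So modules rated DDR-400 running at a 166MHz clock would be running below their rating, which matches the "over-spec'd" description.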
The original bug was fixed by the errata workaround. Any other problems that may be seen, please file a separate bug. Thanks.