Description of problem:
On a busy system, ext3 oopsed in the middle of a page fault. The machine was doing a "make -j3" of the kernel, as well as mail reading and other activities.

Version-Release number of selected component (if applicable):
kernel-PAE-2.6.18-1.2689.fc6

How reproducible:
Seen once

Steps to Reproduce:
1. unsure
2.
3.

Actual results:
oops

Expected results:
no oops

Additional info:
Oops output attached. The madwifi driver was loaded, but not active; I really don't think it has any bearing on this. The fault address is 756e6547, which is ASCII "Genu", suspiciously like "GenuineIntel" from cpuid. Quite likely this is from the kernel source I was compiling at the time. Hm, looks like it might be a bad acl pointer passed to posix_acl_release().
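For anyone wanting to check the "Genu" reading of the fault address, here is a minimal sketch. It assumes the address is a 32-bit value stored little-endian, as on i386; the helper name address_to_ascii is made up for illustration.

```python
import struct


def address_to_ascii(addr: int) -> str:
    """Treat a 32-bit value as four little-endian bytes and decode them as ASCII.

    Bytes that are not valid ASCII are replaced rather than raising, since an
    arbitrary pointer value will often not be printable text.
    """
    return struct.pack("<I", addr).decode("ascii", errors="replace")


if __name__ == "__main__":
    # Fault address from the oops in this report.
    print(address_to_ascii(0x756E6547))
```

A pointer that decodes to readable text like this usually suggests that string data was written over (or read in place of) a pointer field.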
Created attachment 137343: Oops output
Also, I forced a complete check of the filesystems on reboot, and there was no damage, so this looks like a purely in-core thing.
I don't suppose by any chance it's reproducible without the madwifi stuff loaded? I noticed you posted this upstream too, so let's see if anything useful comes out of that. We don't have much of a delta against 2.6.18's ext3 right now. The biggest changes are the inode-diet patches from -mm (and now in Linus' tree for .19), and there's some work to make ext3/jbd safe for 16TB volumes. I don't think either of these is a likely candidate for bugs, but maybe Eric will spot something.
Hm, ok, this is getting interesting. Bug #207658 (which I closed CANTFIX due to the tainted kernel...) looks almost exactly the same:

Sep 22 12:22:35 localhost kernel: BUG: unable to handle kernel paging request at virtual address 756e6547 <--- look familiar?
Sep 22 12:22:35 localhost kernel: EIP is at ext3_clear_inode+0x52/0x8b [ext3]

but... it also had modules forced in, 3rd-party intel wireless stuff. Curious...
I wonder if they share the same ieee80211 code? Hm, perhaps not.
Well, at least now we know "Genu" probably didn't come from the kernel code you were compiling, but somewhere else... The other bug noted a suspension within the past hour, anything like that in your case?
Yes. I'd resumed from a suspend-to-ram not long before the oops. The machine had been up for a while, and undergone a number of suspend-resume cycles.
Hm, just for fun, google turns up 1 other person who has tried to use that "memory address":
http://lists.pld-linux.org/mailman/pipermail/pld-installer/2002-January.txt
also someone else with proprietary modules with that address on their stack:
https://www.redhat.com/archives/fedora-test-list/2003-October/msg00979.html
but those are old. There's only one "Genu*" string in the i386 kernel, in intel.c:

    static struct cpu_dev intel_cpu_dev __cpuinitdata = {
            .c_vendor = "Intel",
            .c_ident = { "GenuineIntel" },
There are none in the ath_pci driver. The cpuid instruction puts that value into %ebx when run with %eax==0, but in both this bug and 207658 it's in %edx. The nvidia one has pretty clearly just done a cpuid, and the crash is in the depths of the nvidia driver, so that's pretty clearly not it. And the Polish one omits so much detail it's hard to tell if it's comparable.
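As a rough illustration of why %ebx is the register you would expect to hold "Genu" right after a cpuid: the 12-byte vendor string comes back split across %ebx, %edx, %ecx in that order, four little-endian bytes each. The sketch below only models that packing in Python; it does not execute cpuid, and the function name is hypothetical.

```python
import struct


def vendor_string_registers(vendor: str) -> dict:
    """Split a 12-character CPU vendor string into the three 32-bit register
    values in which cpuid reports it: %ebx, then %edx, then %ecx."""
    raw = vendor.encode("ascii")
    if len(raw) != 12:
        raise ValueError("vendor string must be exactly 12 ASCII characters")
    ebx, edx, ecx = struct.unpack("<III", raw)
    return {"ebx": ebx, "edx": edx, "ecx": ecx}


if __name__ == "__main__":
    for reg, value in vendor_string_registers("GenuineIntel").items():
        print(f"%{reg} = {value:08x}")
```

Under that layout, 756e6547 belongs in %ebx, so seeing it in %edx at the time of the oops argues against a freshly executed cpuid being the direct source.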
The other crash (thinkpad) is with

[hanwen@haring root]$ ls -l /root/wireless/
totaal 200
-rw-r--r-- 1 root root 68832 aug 27 11:30 ieee80211-1.2.15.tgz
-rw-r--r-- 1 root root 57929 aug 27 11:28 ipw3945d-1.7.18.tgz
-rw-r--r-- 1 root root 61175 aug 27 11:28 ipw3945-ucode-1.13.tgz

these sources (and the .o files) don't contain the string Genu, though.
From the disassembly, looks like we died here in ext3_clear_inode():

0000af09 <ext3_clear_inode>:
...
    af5b:       f0 ff 0a                lock decl (%edx)   <--- + 0x52

which should correspond to the 2nd posix_acl_release() call I think.

And now gotta run, but hey, you found an interesting one :) Good eyes on the "Genu" thing, that'll be a good hint.
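The mapping from the oops line to the disassembly can be sketched as follows. It assumes the "symbol+0xOFFSET/0xSIZE" form seen in the "EIP is at" line and takes the symbol start (0xaf09) from the objdump listing quoted here; both helper names are made up for illustration.

```python
import re


def parse_eip_location(location: str):
    """Parse 'symbol+0xOFF/0xSIZE' into (symbol, offset, size)."""
    match = re.fullmatch(r"(\w+)\+0x([0-9a-fA-F]+)/0x([0-9a-fA-F]+)", location)
    if match is None:
        raise ValueError(f"unrecognised EIP location: {location!r}")
    symbol, offset, size = match.groups()
    return symbol, int(offset, 16), int(size, 16)


def faulting_address(symbol_start: int, offset: int) -> int:
    """Address of the faulting instruction within the object file listing."""
    return symbol_start + offset


if __name__ == "__main__":
    symbol, offset, size = parse_eip_location("ext3_clear_inode+0x52/0x8b")
    # 0xaf09 is where objdump places ext3_clear_inode in the listing quoted here.
    print(symbol, hex(faulting_address(0xAF09, offset)))
```

The result lines up with the "lock decl (%edx)" instruction at af5b, which is why %edx holding 756e6547 points at a bad pointer being decremented.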
*** Bug 207658 has been marked as a duplicate of this bug. ***
If either of you can reproduce this, obtaining a dump in some manner might be helpful.
I just got a repro; same backtrace, different address. Again with madwifi loaded, unfortunately. Kernel: kernel-2.6.18-1.2849.fc6
Created attachment 141829: Second oops with the same backtrace
BTW, the system was under some disk load, running a mercurial "hg status" on a kernel source tree, while browsing in firefox. The oops happened, and then the machine locked up shortly afterwards, forcing a reboot. Fortunately the oops got saved to syslog.
Since this has only ever been reproduced (3x now) with tainted kernels, I'm going to have to close it CANTFIX. If you ever get an oops with a clean kernel, please re-open with as many details as possible; a kernel dump would be great. Thanks, -Eric