Bug 208488 - ext3: oops in ext3_clear_inode
Status: CLOSED CANTFIX
Product: Fedora
Classification: Fedora
Component: kernel
Version: 6
Hardware: i386 Linux
Priority: medium
Severity: high
Assigned To: Eric Sandeen
QA Contact: Brian Brock
Duplicates: 207658
Reported: 2006-09-28 16:32 EDT by Jeremy Fitzhardinge
Modified: 2007-11-30 17:11 EST (History)
Last Closed: 2006-12-14 09:18:45 EST

Attachments
Oops output (2.85 KB, text/plain)
2006-09-28 16:32 EDT, Jeremy Fitzhardinge
Second oops with the same backtrace (2.21 KB, text/plain)
2006-11-21 16:22 EST, Jeremy Fitzhardinge

Description Jeremy Fitzhardinge 2006-09-28 16:32:22 EDT
Description of problem:
On a busy system, ext3 oopsed in the middle of a page fault.  The machine was
doing a "make -j3" of the kernel, as well as mail reading and other activities.

Version-Release number of selected component (if applicable):
kernel-PAE-2.6.18-1.2689.fc6

How reproducible:
Seen once

Steps to Reproduce:
1. unsure
  
Actual results:
oops

Expected results:
no oops

Additional info:
Oops output attached.  The madwifi driver was loaded, but not active; I really
don't think it has any bearing on this.  The fault address is 756e6547, which,
read as little-endian bytes, is ASCII "Genu" - suspiciously like "GenuineIntel"
from cpuid.  Quite likely this is from the kernel source I was compiling at the
time.  Hm, looks like it might be a bad acl pointer passed to
posix_acl_release().
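
The "Genu" reading of the fault address can be checked with a few lines of Python. This is only an illustrative sketch: the helper name `address_to_ascii` is made up for this example, and it assumes the 32-bit value is stored little-endian, as it is on i386.

```python
import struct

def address_to_ascii(addr: int) -> str:
    """Interpret a 32-bit value as four little-endian bytes of ASCII text."""
    raw = struct.pack("<I", addr)  # lowest-order byte comes first on i386
    return raw.decode("ascii", errors="replace")

# Fault address reported in the oops.
print(address_to_ascii(0x756E6547))
```

A pointer that decodes to readable text like this usually means the pointer field was overwritten with string data, rather than being a stale but once-valid address.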
Comment 1 Jeremy Fitzhardinge 2006-09-28 16:32:22 EDT
Created attachment 137343
Oops output
Comment 2 Jeremy Fitzhardinge 2006-09-28 16:42:56 EDT
Also, I forced a complete check of the filesystems on reboot, and there was no
damage, so this looks like a purely in-core thing.
Comment 3 Dave Jones 2006-09-28 18:05:33 EDT
I don't suppose by any chance it's reproducible without the madwifi stuff loaded?

I noticed you posted this upstream too, so let's see if anything useful comes out
of that.  We don't have much of a delta from upstream 2.6.18's ext3 right now.
The biggest changes are the inode-diet patches from -mm (and now in Linus' tree
for .19), and there's some work to make ext3/jbd safe for 16TB volumes.  I don't
think either of these is a likely candidate for bugs, but maybe Eric will spot
something.
Comment 4 Eric Sandeen 2006-09-28 18:22:14 EDT
Hm, ok, this is getting interesting.

Bug #207658 (which I closed CANTFIX due to the tainted kernel...) looks almost
exactly the same:

Sep 22 12:22:35 localhost kernel: BUG: unable to handle kernel paging request at
virtual address 756e6547 <--- look familiar?
Sep 22 12:22:35 localhost kernel: EIP is at ext3_clear_inode+0x52/0x8b [ext3]

but... it also had modules forced in, 3rd-party intel wireless stuff.

Curious...
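
When comparing oopses across reports like this, it can help to pull the symbol, offset, function size and module out of the "EIP is at" line mechanically. The sketch below is one way to do that; the regular expression and the `parse_eip` helper are assumptions written for this example, based only on the line format quoted above.

```python
import re

# Matches e.g. "EIP is at ext3_clear_inode+0x52/0x8b [ext3]".
# The trailing "[module]" part is optional (absent for built-in code).
EIP_RE = re.compile(
    r"EIP is at (?P<symbol>[\w.]+)"
    r"\+0x(?P<offset>[0-9a-fA-F]+)"
    r"/0x(?P<size>[0-9a-fA-F]+)"
    r"(?:\s+\[(?P<module>\w+)\])?"
)

def parse_eip(line: str):
    """Return symbol, offset, size and module from an oops EIP line, or None."""
    m = EIP_RE.search(line)
    if m is None:
        return None
    return {
        "symbol": m.group("symbol"),
        "offset": int(m.group("offset"), 16),
        "size": int(m.group("size"), 16),
        "module": m.group("module"),
    }

line = "Sep 22 12:22:35 localhost kernel: EIP is at ext3_clear_inode+0x52/0x8b [ext3]"
print(parse_eip(line))
```

Two oopses with the same symbol and the same offset died at the same instruction, provided both kernels were built identically.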
Comment 5 Jeremy Fitzhardinge 2006-09-28 18:35:40 EDT
I wonder if they share the same ieee80211 code?  Hm, perhaps not.
Comment 6 Eric Sandeen 2006-09-28 18:39:05 EDT
Well, at least now we know "Genu" probably didn't come from the kernel code you
were compiling, but somewhere else...

The other bug noted a suspension within the past hour, anything like that in
your case?
Comment 7 Jeremy Fitzhardinge 2006-09-28 18:43:13 EDT
Yes.  I had resumed from a suspend-to-ram not long before the oops.  The
machine had been up for a while, and undergone a number of suspend-resume cycles.
Comment 8 Eric Sandeen 2006-09-28 18:57:53 EDT
Hm, just for fun, Google turns up one other person who has tried to use that
"memory address"
http://lists.pld-linux.org/mailman/pipermail/pld-installer/2002-January.txt

also someone else with proprietary modules with that address on their stack:
https://www.redhat.com/archives/fedora-test-list/2003-October/msg00979.html

but those are old.

There's only one "Genu*" string in the i386 kernel, in intel.c:

static struct cpu_dev intel_cpu_dev __cpuinitdata = {
        .c_vendor       = "Intel",
        .c_ident        = { "GenuineIntel" },
Comment 9 Jeremy Fitzhardinge 2006-09-28 19:09:38 EDT
There are none in the ath_pci driver.

The cpuid instruction puts that value into %ebx when run with %eax==0, but in
both this bug and 207658 it's in %edx.
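
For reference, CPUID leaf 0 (%eax == 0) returns the 12-byte vendor string split across %ebx, %edx and %ecx, in that order. The sketch below models that layout in Python rather than executing the instruction; `vendor_to_registers` is a hypothetical helper for illustration only.

```python
import struct

def vendor_to_registers(vendor: str) -> dict:
    """Split a 12-character vendor string the way CPUID leaf 0 reports it."""
    data = vendor.encode("ascii")
    if len(data) != 12:
        raise ValueError("vendor string must be exactly 12 bytes")
    # Bytes 0-3 go to EBX, 4-7 to EDX, 8-11 to ECX, each little-endian.
    ebx, edx, ecx = struct.unpack("<3I", data)
    return {"ebx": ebx, "edx": edx, "ecx": ecx}

regs = vendor_to_registers("GenuineIntel")
for name in ("ebx", "edx", "ecx"):
    print(f"%{name} = {regs[name]:08x}")
```

Only the first word, "Genu", matches the fault address, and it is the word that would normally sit in %ebx.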

The nvidia one has pretty clearly just done a cpuid, and the crash is in the
depths of the nvidia driver, so that's pretty clearly not it.

And the Polish one omits so much detail it's hard to tell if it's comparable.
Comment 10 Han-Wen Nienhuys 2006-09-29 08:13:20 EDT
The other crash (thinkpad) is with
 
[hanwen@haring root]$ ls -l /root/wireless/
totaal 200
-rw-r--r-- 1 root root 68832 aug 27 11:30 ieee80211-1.2.15.tgz
-rw-r--r-- 1 root root 57929 aug 27 11:28 ipw3945d-1.7.18.tgz
-rw-r--r-- 1 root root 61175 aug 27 11:28 ipw3945-ucode-1.13.tgz

these sources (and the .o files) don't contain the string Genu, though.
Comment 11 Eric Sandeen 2006-09-29 12:02:29 EDT
From the disassembly, looks like we died here in ext3_clear_inode():

0000af09 <ext3_clear_inode>:
...
    af5b:       f0 ff 0a                lock decl (%edx)	<--- + 0x52

which should correspond to the 2nd posix_acl_release() call I think.
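
The mapping from the oops offset to that instruction, and the decoding of its operand, can be double-checked by hand. The following is a rough sketch using only the numbers quoted above; the helper names are invented for this example, and the decoding covers just the ModRM byte of this one instruction, not a general disassembler.

```python
# 32-bit register numbering used in the ModRM r/m field.
REG32 = ["eax", "ecx", "edx", "ebx", "esp", "ebp", "esi", "edi"]

def faulting_address(func_start: int, offset: int) -> int:
    """Address of the instruction named by symbol+offset in an oops."""
    return func_start + offset

def decode_modrm(byte: int):
    """Split a ModRM byte into its (mod, reg, rm) fields."""
    return (byte >> 6) & 0b11, (byte >> 3) & 0b111, byte & 0b111

# ext3_clear_inode starts at 0xaf09 in the disassembly; the oops says +0x52.
print(hex(faulting_address(0xAF09, 0x52)))

# Instruction bytes f0 ff 0a: LOCK prefix, opcode 0xff, then the ModRM byte.
insn = bytes.fromhex("f0ff0a")
mod, reg, rm = decode_modrm(insn[2])
# For opcode 0xff, reg == 1 selects DEC; mod == 0 with rm == 2 is a plain
# memory operand addressed by the register, with no displacement.
print(mod, reg, REG32[rm])
```

So the faulting access is an atomic decrement of the word that %edx points to, which fits a reference-count drop on a bogus pointer.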

And now gotta run, but hey, you found an interesting one :)  Good eyes on the
"Genu" thing, that'll be a good hint.
Comment 12 Eric Sandeen 2006-09-29 13:21:19 EDT
*** Bug 207658 has been marked as a duplicate of this bug. ***
Comment 13 Eric Sandeen 2006-09-29 14:00:58 EDT
If either of you can reproduce this, obtaining a dump in some manner might be
helpful.
Comment 14 Jeremy Fitzhardinge 2006-11-21 16:20:23 EST
I just got a repro; same backtrace, different address.  Again with madwifi
loaded, unfortunately.  Kernel: kernel-2.6.18-1.2849.fc6
Comment 15 Jeremy Fitzhardinge 2006-11-21 16:22:41 EST
Created attachment 141829
Second oops with the same backtrace
Comment 16 Jeremy Fitzhardinge 2006-11-21 16:26:35 EST
BTW, the system was under some disk load, running a mercurial "hg status" on a
kernel source tree, while browsing in firefox.  The oops happened, and then the
machine locked up shortly afterwards, forcing a reboot.  Fortunately the oops
got saved to syslog.
Comment 17 Eric Sandeen 2006-12-14 09:18:45 EST
Since this has only ever been reproduced (3x now) with tainted kernels, I'm
going to have to close it CANTFIX.  If you ever get an oops with a clean kernel,
please re-open with as many details as possible; a kernel dump would be great.

Thanks,
-Eric
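
Before re-opening with a "clean kernel" oops, it may be worth confirming that the kernel really is untainted. The sketch below decodes the bitmask the kernel exposes in /proc/sys/kernel/tainted. It lists only the first six flags, with descriptions paraphrased from the kernel's tainted-kernels documentation; the exact set of flags and their wording vary between kernel versions, so treat the table as an assumption.

```python
import os

# bit number -> (flag letter, meaning). Partial table, older flags only.
TAINT_FLAGS = {
    0: ("P", "proprietary module was loaded"),
    1: ("F", "module was force loaded"),
    2: ("S", "kernel running on an out of specification system"),
    3: ("R", "module was force unloaded"),
    4: ("M", "processor reported a machine check exception"),
    5: ("B", "bad page referenced or unexpected page flags"),
}

def decode_taint(value: int) -> list:
    """Return a description for every known taint bit set in value."""
    return [
        f"{letter}: {meaning}"
        for bit, (letter, meaning) in sorted(TAINT_FLAGS.items())
        if value & (1 << bit)
    ]

def read_taint(path: str = "/proc/sys/kernel/tainted") -> int:
    """Read the running kernel's taint bitmask (Linux only)."""
    with open(path) as f:
        return int(f.read().strip())

if __name__ == "__main__" and os.path.exists("/proc/sys/kernel/tainted"):
    print(decode_taint(read_taint()) or "not tainted")
```

A value of 0 means the kernel is not tainted; a forced module load, as in bug 207658, would set the F bit.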
