208488 – ext3: oops in ext3_clear_inode

Keywords:
Status:	CLOSED CANTFIX
Alias:	None
Product:	Fedora
Classification:	Fedora
Component:	kernel
Sub Component:
Version:	6
Hardware:	i386
OS:	Linux
Priority:	medium
Severity:	high
Target Milestone:	---
Assignee:	Eric Sandeen
QA Contact:	Brian Brock
Docs Contact:
URL:
Whiteboard:
Duplicates (1):	207658
Depends On:
Blocks:

Reported:	2006-09-28 20:32 UTC by Jeremy Fitzhardinge
Modified:	2007-11-30 22:11 UTC
CC List:	4 users
Fixed In Version:
Clone Of:
Environment:
Last Closed:	2006-12-14 14:18:45 UTC
Type:	---
Embargoed:
Dependent Products:

Attachments
Oops output (2.85 KB, text/plain) 2006-09-28 20:32 UTC, Jeremy Fitzhardinge
Second oops with the same backtrace (2.21 KB, text/plain) 2006-11-21 21:22 UTC, Jeremy Fitzhardinge

Description Jeremy Fitzhardinge 2006-09-28 20:32:22 UTC

Description of problem:
On a busy system, ext3 oopsed in the middle of a page fault.  The machine was
doing a "make -j3" of the kernel, as well as mail reading and other activities.

Version-Release number of selected component (if applicable):
kernel-PAE-2.6.18-1.2689.fc6

How reproducible:
Seen once

Steps to Reproduce:
1. unsure
Actual results:
oops

Expected results:
no oops

Additional info:
Oops output attached.  The madwifi driver was loaded, but not active; I really
don't think it has any bearing on this.  The fault address is 756e6547, which is
ASCII "Genu" - suspiciously like "GenuineIntel" from cpuid.  Quite likely this
is from the kernel source I was compiling at the time.  Hm, looks like it might
be a bad acl pointer passed to posix_acl_release().
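
A quick way to check the "Genu" reading: on little-endian i386 the least significant byte of a 32-bit value is the first byte in memory, so the fault address can be decoded one byte at a time. This is only a sketch; the helper name decode_le32_ascii is made up for illustration and is not kernel code.

```c
#include <assert.h>
#include <stdint.h>

/*
 * Decode a 32-bit value into the four characters it would spell if it
 * had been loaded from a string on a little-endian machine.
 * decode_le32_ascii is a hypothetical helper, not a kernel function.
 * `out` must have room for 4 characters plus the terminating NUL.
 */
void decode_le32_ascii(uint32_t value, char out[5])
{
    assert(out != 0);
    for (int i = 0; i < 4; i++)
        out[i] = (char)((value >> (8 * i)) & 0xff); /* low byte first */
    out[4] = '\0';
}
```

Calling it with 0x756e6547 yields the bytes 0x47, 0x65, 0x6e, 0x75, which are the characters 'G', 'e', 'n', 'u'.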

Comment 1 Jeremy Fitzhardinge 2006-09-28 20:32:22 UTC

Created attachment 137343
Oops output

Comment 2 Jeremy Fitzhardinge 2006-09-28 20:42:56 UTC

Also, I forced a complete check of the filesystems on reboot, and there was no
damage, so this looks like a purely in-core thing.

Comment 3 Dave Jones 2006-09-28 22:05:33 UTC

I don't suppose by any chance it's reproducible without the madwifi stuff loaded?

I noticed you posted this upstream too, so let's see if anything useful comes out
of that.  We don't have much of a delta from 2.6.18's ext3 right now.  The
biggest changes are the inode-diet patches from -mm (and now in Linus' tree for
.19), and there's some work to make ext3/jbd safe for 16TB volumes.  I don't
think either of these is a likely candidate for bugs, but maybe Eric will spot
something.

Comment 4 Eric Sandeen 2006-09-28 22:22:14 UTC

Hm, ok, this is getting interesting.

Bug #207658 (which I closed CANTFIX due to the tainted kernel...) looks almost
exactly the same:

Sep 22 12:22:35 localhost kernel: BUG: unable to handle kernel paging request at
virtual address 756e6547 <--- look familiar?
Sep 22 12:22:35 localhost kernel: EIP is at ext3_clear_inode+0x52/0x8b [ext3]

but... it also had modules forced in, 3rd-party intel wireless stuff.

Curious...
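
When comparing oopses like these two, the "symbol+offset/size" part of the EIP line is the piece to match: the offset of the faulting instruction within the function, and the function's total size. A rough sketch of splitting it apart follows; struct eip_location and parse_eip_location are hypothetical names, not an existing tool.

```c
#include <assert.h>
#include <stdio.h>

/* Hypothetical container for the pieces of "symbol+0xOFF/0xSIZE". */
struct eip_location {
    char symbol[64];
    unsigned int offset; /* faulting instruction, relative to symbol start */
    unsigned int size;   /* total size of the function in bytes */
};

/*
 * Parse text such as "ext3_clear_inode+0x52/0x8b [ext3]".
 * Returns 1 if all three fields were read, 0 otherwise.
 * %x accepts an optional "0x" prefix, as strtoul() does with base 16.
 */
int parse_eip_location(const char *text, struct eip_location *loc)
{
    assert(text != 0 && loc != 0);
    return sscanf(text, "%63[^+]+%x/%x",
                  loc->symbol, &loc->offset, &loc->size) == 3;
}
```

Two reports with the same symbol, offset and size on the same kernel build point at the same instruction, which is what makes the match with bug 207658 interesting.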

Comment 5 Jeremy Fitzhardinge 2006-09-28 22:35:40 UTC

I wonder if they share the same ieee80211 code?  Hm, perhaps not.

Comment 6 Eric Sandeen 2006-09-28 22:39:05 UTC

Well, at least now we know "Genu" probably didn't come from the kernel code you
were compiling, but somewhere else...

The other bug noted a suspension within the past hour, anything like that in
your case?

Comment 7 Jeremy Fitzhardinge 2006-09-28 22:43:13 UTC

Yes.  I'd resumed from a suspend-to-ram not long before the oops.  The
machine had been up for a while, and undergone a number of suspend-resume cycles.

Comment 8 Eric Sandeen 2006-09-28 22:57:53 UTC

Hm, just for fun, google turns up 1 other person who has tried to use that
"memory address"
http://lists.pld-linux.org/mailman/pipermail/pld-installer/2002-January.txt

also someone else with proprietary modules with that address on their stack:
https://www.redhat.com/archives/fedora-test-list/2003-October/msg00979.html

but those are old.

There's only one "Genu*" string in the i386 kernel, in intel.c:

static struct cpu_dev intel_cpu_dev __cpuinitdata = {
        .c_vendor       = "Intel",
        .c_ident        = { "GenuineIntel" },

Comment 9 Jeremy Fitzhardinge 2006-09-28 23:09:38 UTC

There are none in the ath_pci driver.

The cpuid instruction puts that value into %ebx when run with %eax==0, but in
both this bug and 207658 it's in %edx.

The nvidia one has pretty clearly just done a cpuid, and the crash is in the
depths of the nvidia driver, so that's pretty clearly not it.

And the Polish one omits so much detail it's hard to tell if it's comparable.
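
For reference, my understanding is that the vendor string comes back from CPUID leaf 0 (EAX = 0) split across EBX, EDX and ECX, in that order. The sketch below only models that layout with plain byte arithmetic; it does not execute cpuid, and struct vendor_regs and pack_vendor are illustrative names.

```c
#include <assert.h>
#include <stdint.h>

/* Hypothetical model of the three registers holding the vendor string. */
struct vendor_regs {
    uint32_t ebx; /* characters 0..3  */
    uint32_t edx; /* characters 4..7  */
    uint32_t ecx; /* characters 8..11 */
};

/* Assemble four characters into a 32-bit value, first character lowest. */
static uint32_t le32_from_chars(const char *p)
{
    return (uint32_t)(unsigned char)p[0]
         | (uint32_t)(unsigned char)p[1] << 8
         | (uint32_t)(unsigned char)p[2] << 16
         | (uint32_t)(unsigned char)p[3] << 24;
}

/* Pack a 12-character vendor string the way CPUID leaf 0 reports it. */
struct vendor_regs pack_vendor(const char *vendor12)
{
    struct vendor_regs r;
    assert(vendor12 != 0);
    r.ebx = le32_from_chars(vendor12);
    r.edx = le32_from_chars(vendor12 + 4);
    r.ecx = le32_from_chars(vendor12 + 8);
    return r;
}
```

With "GenuineIntel" as input, only the EBX slot holds 0x756e6547 ("Genu"); the EDX slot holds "ineI". So a "Genu" in %edx at the time of the oops is not simply a leftover cpuid result sitting in its original register.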

Comment 10 Han-Wen Nienhuys 2006-09-29 12:13:20 UTC

The other crash (thinkpad) is with
 
[hanwen@haring root]$ ls -l /root/wireless/
totaal 200
-rw-r--r-- 1 root root 68832 aug 27 11:30 ieee80211-1.2.15.tgz
-rw-r--r-- 1 root root 57929 aug 27 11:28 ipw3945d-1.7.18.tgz
-rw-r--r-- 1 root root 61175 aug 27 11:28 ipw3945-ucode-1.13.tgz

these sources (and the .o files) don't contain the string Genu, though.

Comment 11 Eric Sandeen 2006-09-29 16:02:29 UTC

From the disassembly, looks like we died here in ext3_clear_inode():

0000af09 <ext3_clear_inode>:
...
    af5b:       f0 ff 0a                lock decl (%edx)	<--- + 0x52

which should correspond to the 2nd posix_acl_release() call I think.
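
To spell out the reasoning: 0xaf5b - 0xaf09 = 0x52, which matches the "+0x52" in the oops, and a "lock decl" through a pointer is what an atomic reference-count decrement looks like on i386. Below is a userspace model of that pattern, not the kernel source. It assumes the refcount is the first member of the ACL structure (so the decrement dereferences the ACL pointer itself), which I believe holds for 2.6.18 but have not checked against this exact build. All names here (acl_model, acl_model_new, acl_model_release, symbol_offset) are hypothetical.

```c
#include <assert.h>
#include <stdatomic.h>
#include <stdlib.h>

/* Stand-in for the ACL object: refcount assumed to be at offset 0. */
struct acl_model {
    atomic_int refcount;
};

/* Allocate a model object holding `refs` references; NULL on failure. */
struct acl_model *acl_model_new(int refs)
{
    struct acl_model *acl = malloc(sizeof(*acl));
    if (acl)
        atomic_init(&acl->refcount, refs);
    return acl;
}

/*
 * Drop one reference and free the object when the last one goes away.
 * Returns 1 if the object was freed, 0 otherwise. The atomic decrement
 * is the step that would fault if `acl` were a garbage non-NULL value
 * such as 0x756e6547, because only NULL is filtered out beforehand.
 */
int acl_model_release(struct acl_model *acl)
{
    if (acl && atomic_fetch_sub(&acl->refcount, 1) == 1) {
        free(acl);
        return 1;
    }
    return 0;
}

/* Offset of an instruction within its function, as shown in an oops. */
unsigned long symbol_offset(unsigned long func_start, unsigned long insn_addr)
{
    assert(insn_addr >= func_start);
    return insn_addr - func_start;
}
```

The point of the model is that a NULL check alone cannot protect the decrement: any stale or overwritten pointer value sails past it and is dereferenced.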

And now gotta run, but hey, you found an interesting one :)  Good eyes on the
"Genu" thing, that'll be a good hint.

Comment 12 Eric Sandeen 2006-09-29 17:21:19 UTC

*** Bug 207658 has been marked as a duplicate of this bug. ***

Comment 13 Eric Sandeen 2006-09-29 18:00:58 UTC

If either of you can reproduce this, obtaining a dump in some manner might be
helpful.

Comment 14 Jeremy Fitzhardinge 2006-11-21 21:20:23 UTC

I just got a repro; same backtrace, different address.  Again with madwifi
loaded, unfortunately.  Kernel: kernel-2.6.18-1.2849.fc6

Comment 15 Jeremy Fitzhardinge 2006-11-21 21:22:41 UTC

Created attachment 141829
Second oops with the same backtrace

Comment 16 Jeremy Fitzhardinge 2006-11-21 21:26:35 UTC

BTW, the system was under some disk load, running a mercurial "hg status" on a
kernel source tree, while browsing in firefox.  The oops happened, and then the
machine locked up shortly afterwards, forcing a reboot.  Fortunately the oops
got saved to syslog.

Comment 17 Eric Sandeen 2006-12-14 14:18:45 UTC

Since this has only ever been reproduced (3x now) with tainted kernels, I'm
going to have to close it CANTFIX.  If you ever get an oops with a clean kernel,
please re-open with as many details as possible; a kernel dump would be great.

Thanks,
-Eric
