209907 – kernel Soft lockup detected on corrupted ext3 filesystem

Bug 209907 - kernel Soft lockup detected on corrupted ext3 filesystem

Summary: kernel Soft lockup detected on corrupted ext3 filesystem

Keywords:
Status:	CLOSED ERRATA
Alias:	None
Product:	Fedora
Classification:	Fedora
Component:	kernel
Sub Component:
Version:	rawhide
Hardware:	x86_64
OS:	Linux
Priority:	medium
Severity:	medium
Target Milestone:	---
Assignee:	Eric Sandeen
QA Contact:	Brian Brock
Docs Contact:
URL:
Whiteboard:
Depends On:
Blocks:	212053
TreeView+	depends on / blocked

Reported:	2006-10-07 22:30 UTC by Steve Grubb
Modified:	2007-11-30 22:11 UTC (History)
CC List:	2 users (show)
Fixed In Version:
Clone Of:
Environment:
Last Closed:	2006-11-12 22:05:52 UTC
Type:	---
Embargoed:
Dependent Products:

Attachments	(Terms of Use)

Description Steve Grubb 2006-10-07 22:30:15 UTC

Description of problem:
BUG: soft lockup detected on CPU#0!

Call Trace:
 [<ffffffff802691b5>] show_trace+0x34/0x47
 [<ffffffff802691da>] dump_stack+0x12/0x17
 [<ffffffff802b4711>] softlockup_tick+0xdb/0xf6
 [<ffffffff80293695>] update_process_times+0x42/0x68
 [<ffffffff80273d62>] smp_local_timer_interrupt+0x23/0x47
 [<ffffffff8027441e>] smp_apic_timer_interrupt+0x41/0x47
 [<ffffffff8025cb82>] apic_timer_interrupt+0x66/0x6c
DWARF2 unwinder stuck at apic_timer_interrupt+0x66/0x6c
Leftover inexact backtrace:
 <IRQ>  <EOI>  [<ffffffff88056573>] :ext3:htree_dirblock_to_tree+0xa8/0xc9
 [<ffffffff8805653d>] :ext3:htree_dirblock_to_tree+0x72/0xc9
 [<ffffffff88056610>] :ext3:ext3_htree_fill_tree+0x7c/0x1cb
 [<ffffffff8804ee97>] :ext3:ext3_readdir+0x1a9/0x4da
 [<ffffffff802257f2>] filldir+0x0/0xb7
 [<ffffffff80318dd6>] file_has_perm+0x94/0xa3
 [<ffffffff802257f2>] filldir+0x0/0xb7
 [<ffffffff802350a7>] vfs_readdir+0x77/0xa9
 [<ffffffff80238841>] sys_getdents+0x75/0xbd
 [<ffffffff8025c121>] tracesys+0x71/0xdc
 [<ffffffff8025c181>] tracesys+0xd1/0xdc

Version-Release number of selected component (if applicable):
kernel-2.6.18-1.2747.fc6

How reproducible:
Everytime < 2 minutes

Steps to Reproduce:
1. http://people.redhat.com/sgrubb/files/fsfuzzer.tar.gz
2.  build
3.  ./fsfuzz ext3

Comment 1 Steve Grubb 2006-10-07 22:50:09 UTC

Not sure if this is a recurrance of the same bug, but I can get this trace when
fuzzing the iso9660 file system:

BUG: soft lockup detected on CPU#0!

Call Trace:
 [<ffffffff802691b5>] show_trace+0x34/0x47
 [<ffffffff802691da>] dump_stack+0x12/0x17
 [<ffffffff802b4711>] softlockup_tick+0xdb/0xf6
 [<ffffffff80293695>] update_process_times+0x42/0x68
 [<ffffffff80273d62>] smp_local_timer_interrupt+0x23/0x47
 [<ffffffff8027441e>] smp_apic_timer_interrupt+0x41/0x47
 [<ffffffff8025cb82>] apic_timer_interrupt+0x66/0x6c
DWARF2 unwinder stuck at apic_timer_interrupt+0x66/0x6c
Leftover inexact backtrace:
 <IRQ>  <EOI>  [<ffffffff80215ac4>] __bitmap_empty+0x28/0x64
 [<ffffffff80222f63>] flush_tlb_others+0x96/0xbf
 [<ffffffff80272c23>] flush_tlb_mm+0xe3/0xea
 [<ffffffff8021260c>] unmap_region+0x104/0x12b
 [<ffffffff8021176a>] do_munmap+0x1fd/0x27a
 [<ffffffff802620a4>] __down_write_nested+0x34/0x96
 [<ffffffff80215b40>] sys_munmap+0x40/0x5a
 [<ffffffff8025c181>] tracesys+0xd1/0xdc

Comment 2 Jeff Moyer 2006-10-13 15:12:02 UTC

Steve says he can reproduce it in 20 seconds!  Eric, I won't have time to look
at this until next week, and this isn't my area of expertise.  If you have time,
feel free to take this bug.  =)

Comment 3 Eric Sandeen 2006-10-13 15:51:10 UTC

Eh, not sure it's my area of expertise either (based on the 2nd comment...) but
I'll investigate a bit and see what I can see.

There are some patches in -mm that might help debug this, see also bug 210390:
https://bugzilla.redhat.com/bugzilla/show_bug.cgi?id=210390#c4

Comment 4 Dave Jones 2006-10-13 16:23:12 UTC

does the box actually lock up after this? Or is this just the softlockup
detector being noisy ?

Booting with nosoftlockup as a boot option would be a good datapoint.

Comment 5 Eric Sandeen 2006-10-13 16:31:13 UTC

Hmmm I tested w/ 2.6.18-1.2726.fc6 and didn't see it... let me update & give it
another whirl.

Comment 6 Eric Sandeen 2006-10-13 17:10:25 UTC

Anything special about the machine you hit it on, Steve?  I can't hit it.

Comment 7 Steve Grubb 2006-10-13 17:21:38 UTC

The machine is a dual core processor. When I boot with the nosoftlockup and run
the test, it loops forever on this:

Oct 13 13:19:50 localhost kernel: EXT3-fs error (device loop0): ext3_readdir:
directory #2 contains a hole at offset 654184448
Oct 13 13:19:50 localhost kernel: EXT3-fs error (device loop0): ext3_readdir:
directory #2 contains a hole at offset 654188544

over and over and over. The offset is slightly different.

Comment 8 Eric Sandeen 2006-10-13 17:32:31 UTC

that loop may be indicative of a filesystem corruption that you found that ext3
is not handling.  This Is Not The Bug You Are Looking For.  :)

Since you can hit it, any chance you can run w/ the patches I pointed to in bug
210390?  (not that I've tried applying them or building them yet...)

--
When a spinlock lockup occurs, arrange for the NMI code to emit an all-cpu
backtrace, so we get to see which CPU is holding the lock, and where.
--

Comment 9 Steve Grubb 2006-10-13 18:12:21 UTC

None of those patches apply...it'll take some work. I guess there were changes
to nmi system in 2.6.19.

Comment 10 Eric Sandeen 2006-10-13 18:15:35 UTC

Ah, bummer.  I don't mind taking a crack at it, I have another bug assigned to
me that could use them.  I'll see what I can do.

Comment 11 Eric Sandeen 2006-10-13 18:56:26 UTC

Ok, I've got a patch that builds attached to bug 210390 now...

Comment 12 Jeff Moyer 2006-10-16 20:40:21 UTC

Eric,

Against which kernel does that patch apply?

[jmoyer@segfault linux-2.6.18.i686]$ patch -sp1 <
/tmp/all-cpu-backtrace-patch-take2.patch 
1 out of 4 hunks FAILED -- saving rejects to file arch/i386/kernel/nmi.c.rej
1 out of 4 hunks FAILED -- saving rejects to file arch/x86_64/kernel/nmi.c.rej
1 out of 1 hunk FAILED -- saving rejects to file include/linux/nmi.h.rej

That's from the current -devel branch.

Comment 13 Eric Sandeen 2006-10-16 20:48:45 UTC

Hm, I just built it against kernel-2.6.18-1.2784.fc6.src.rpm and it applied
cleanly & built...

("it" being https://bugzilla.redhat.com/bugzilla/attachment.cgi?id=138620 from
bug 210390)

Comment 14 Jeff Moyer 2006-10-16 20:56:51 UTC

I tried the very same patch against 2786!  I reverted to 2784 and that works. 
I'm building a test kernel now.

Comment 15 Eric Sandeen 2006-10-16 21:10:25 UTC

Steve, can you provide me with a copy of one of the ext3 images that causes this
problem?

Comment 16 Eric Sandeen 2006-10-17 20:02:10 UTC

Ok, in the original case, we had corrupted directory entries, and the code was
looping in htree_dirblock_to_tree() if we tried to do an ls of the mounted image:

        for (; de < top; de = ext3_next_entry(de)) {
                err = ext3fs_dirhash(de->name, de->name_len, hinfo);
                if ((hinfo->hash < start_hash) ||
                    ((hinfo->hash == start_hash) &&
                     (hinfo->minor_hash < start_minor_hash)))
                        continue;
                if (de->inode == 0)
                        continue;
                if ((err = ext3_htree_store_dirent(dir_file,
                                   hinfo->hash, hinfo->minor_hash, de)) != 0) {
                        brelse(bh);
                        return err;
                }
                count++;
        }

The first (corrupted) entry looked like:

de d10c3000, inode 2, rec_len 230, name_len 1

which is corrupted anyway, but we got past it, and moved on 230/0xe6 bytes to:

de d10c30e6, inode 0, rec_len 0, name_len 0

now, ext3_next_entry(de) stops moving, since the length is 0 and the loop goes
infinite.

I added a call to ext3_check_dir_entry for each de we find, and it quickly
catches this corruption:
[root@magnesium ~]# mount -o loop ext3.32.img mnt/
[root@magnesium ~]# ls mnt
ls: reading directory mnt: Invalid argument
[root@magnesium ~]# dmesg
...
EXT3-fs error (device loop0): htree_dirblock_to_tree: bad entry in directory #2:
rec_len % 4 != 0 - offset=0, inode=2, rec_len=230, name_len=1
Aborting journal on device loop0.
ext3_abort called.
EXT3-fs error (device loop0): ext3_journal_start_sb: Detected aborted journal
Remounting filesystem read-only

the ext3_readdir path was already calling ext3_check_dir_entry() later on, I'll
see if that check needs to stay or if it should just be moved up here.

Comment 17 Eric Sandeen 2006-10-17 20:51:19 UTC

Ho hum, now with that patch in place, I hit the problem you saw in Comment #7 ...

Comment 18 Eric Sandeen 2006-10-23 20:46:30 UTC

Sent to rhkernel-list on 10/23/2006

Comment 19 Dave Jones 2006-11-12 22:05:52 UTC

We have this fixed in FC5/FC6, and rawhide will pick it up when I build a
rebased 2.6.19, so I'll close this out.

Note You need to log in before you can comment on or make changes to this bug.