Description of problem: BUG: soft lockup detected on CPU#0! Call Trace: [<ffffffff802691b5>] show_trace+0x34/0x47 [<ffffffff802691da>] dump_stack+0x12/0x17 [<ffffffff802b4711>] softlockup_tick+0xdb/0xf6 [<ffffffff80293695>] update_process_times+0x42/0x68 [<ffffffff80273d62>] smp_local_timer_interrupt+0x23/0x47 [<ffffffff8027441e>] smp_apic_timer_interrupt+0x41/0x47 [<ffffffff8025cb82>] apic_timer_interrupt+0x66/0x6c DWARF2 unwinder stuck at apic_timer_interrupt+0x66/0x6c Leftover inexact backtrace: <IRQ> <EOI> [<ffffffff88056573>] :ext3:htree_dirblock_to_tree+0xa8/0xc9 [<ffffffff8805653d>] :ext3:htree_dirblock_to_tree+0x72/0xc9 [<ffffffff88056610>] :ext3:ext3_htree_fill_tree+0x7c/0x1cb [<ffffffff8804ee97>] :ext3:ext3_readdir+0x1a9/0x4da [<ffffffff802257f2>] filldir+0x0/0xb7 [<ffffffff80318dd6>] file_has_perm+0x94/0xa3 [<ffffffff802257f2>] filldir+0x0/0xb7 [<ffffffff802350a7>] vfs_readdir+0x77/0xa9 [<ffffffff80238841>] sys_getdents+0x75/0xbd [<ffffffff8025c121>] tracesys+0x71/0xdc [<ffffffff8025c181>] tracesys+0xd1/0xdc Version-Release number of selected component (if applicable): kernel-2.6.18-1.2747.fc6 How reproducible: Everytime < 2 minutes Steps to Reproduce: 1. http://people.redhat.com/sgrubb/files/fsfuzzer.tar.gz 2. build 3. ./fsfuzz ext3
Not sure if this is a recurrance of the same bug, but I can get this trace when fuzzing the iso9660 file system: BUG: soft lockup detected on CPU#0! Call Trace: [<ffffffff802691b5>] show_trace+0x34/0x47 [<ffffffff802691da>] dump_stack+0x12/0x17 [<ffffffff802b4711>] softlockup_tick+0xdb/0xf6 [<ffffffff80293695>] update_process_times+0x42/0x68 [<ffffffff80273d62>] smp_local_timer_interrupt+0x23/0x47 [<ffffffff8027441e>] smp_apic_timer_interrupt+0x41/0x47 [<ffffffff8025cb82>] apic_timer_interrupt+0x66/0x6c DWARF2 unwinder stuck at apic_timer_interrupt+0x66/0x6c Leftover inexact backtrace: <IRQ> <EOI> [<ffffffff80215ac4>] __bitmap_empty+0x28/0x64 [<ffffffff80222f63>] flush_tlb_others+0x96/0xbf [<ffffffff80272c23>] flush_tlb_mm+0xe3/0xea [<ffffffff8021260c>] unmap_region+0x104/0x12b [<ffffffff8021176a>] do_munmap+0x1fd/0x27a [<ffffffff802620a4>] __down_write_nested+0x34/0x96 [<ffffffff80215b40>] sys_munmap+0x40/0x5a [<ffffffff8025c181>] tracesys+0xd1/0xdc
Steve says he can reproduce it in 20 seconds! Eric, I won't have time to look at this until next week, and this isn't my area of expertise. If you have time, feel free to take this bug. =)
Eh, not sure it's my area of expertise either (based on the 2nd comment...) but I'll investigate a bit and see what I can see. There are some patches in -mm that might help debug this, see also bug 210390: https://bugzilla.redhat.com/bugzilla/show_bug.cgi?id=210390#c4
does the box actually lock up after this? Or is this just the softlockup detector being noisy ? Booting with nosoftlockup as a boot option would be a good datapoint.
Hmmm I tested w/ 2.6.18-1.2726.fc6 and didn't see it... let me update & give it another whirl.
Anything special about the machine you hit it on, Steve? I can't hit it.
The machine is a dual core processor. When I boot with the nosoftlockup and run the test, it loops forever on this: Oct 13 13:19:50 localhost kernel: EXT3-fs error (device loop0): ext3_readdir: directory #2 contains a hole at offset 654184448 Oct 13 13:19:50 localhost kernel: EXT3-fs error (device loop0): ext3_readdir: directory #2 contains a hole at offset 654188544 over and over and over. The offset is slightly different.
that loop may be indicative of a filesystem corruption that you found that ext3 is not handling. This Is Not The Bug You Are Looking For. :) Since you can hit it, any chance you can run w/ the patches I pointed to in bug 210390? (not that I've tried applying them or building them yet...) -- When a spinlock lockup occurs, arrange for the NMI code to emit an all-cpu backtrace, so we get to see which CPU is holding the lock, and where. --
None of those patches apply...it'll take some work. I guess there were changes to nmi system in 2.6.19.
Ah, bummer. I don't mind taking a crack at it, I have another bug assigned to me that could use them. I'll see what I can do.
Ok, I've got a patch that builds attached to bug 210390 now...
Eric, Against which kernel does that patch apply? [jmoyer@segfault linux-2.6.18.i686]$ patch -sp1 < /tmp/all-cpu-backtrace-patch-take2.patch 1 out of 4 hunks FAILED -- saving rejects to file arch/i386/kernel/nmi.c.rej 1 out of 4 hunks FAILED -- saving rejects to file arch/x86_64/kernel/nmi.c.rej 1 out of 1 hunk FAILED -- saving rejects to file include/linux/nmi.h.rej That's from the current -devel branch.
Hm, I just built it against kernel-2.6.18-1.2784.fc6.src.rpm and it applied cleanly & built... ("it" being https://bugzilla.redhat.com/bugzilla/attachment.cgi?id=138620 from bug 210390)
I tried the very same patch against 2786! I reverted to 2784 and that works. I'm building a test kernel now.
Steve, can you provide me with a copy of one of the ext3 images that causes this problem?
Ok, in the original case, we had corrupted directory entries, and the code was looping in htree_dirblock_to_tree() if we tried to do an ls of the mounted image: for (; de < top; de = ext3_next_entry(de)) { err = ext3fs_dirhash(de->name, de->name_len, hinfo); if ((hinfo->hash < start_hash) || ((hinfo->hash == start_hash) && (hinfo->minor_hash < start_minor_hash))) continue; if (de->inode == 0) continue; if ((err = ext3_htree_store_dirent(dir_file, hinfo->hash, hinfo->minor_hash, de)) != 0) { brelse(bh); return err; } count++; } The first (corrupted) entry looked like: de d10c3000, inode 2, rec_len 230, name_len 1 which is corrupted anyway, but we got past it, and moved on 230/0xe6 bytes to: de d10c30e6, inode 0, rec_len 0, name_len 0 now, ext3_next_entry(de) stops moving, since the length is 0 and the loop goes infinite. I added a call to ext3_check_dir_entry for each de we find, and it quickly catches this corruption: [root@magnesium ~]# mount -o loop ext3.32.img mnt/ [root@magnesium ~]# ls mnt ls: reading directory mnt: Invalid argument [root@magnesium ~]# dmesg ... EXT3-fs error (device loop0): htree_dirblock_to_tree: bad entry in directory #2: rec_len % 4 != 0 - offset=0, inode=2, rec_len=230, name_len=1 Aborting journal on device loop0. ext3_abort called. EXT3-fs error (device loop0): ext3_journal_start_sb: Detected aborted journal Remounting filesystem read-only the ext3_readdir path was already calling ext3_check_dir_entry() later on, I'll see if that check needs to stay or if it should just be moved up here.
Ho hum, now with that patch in place, I hit the problem you saw in Comment #7 ...
Sent to rhkernel-list on 10/23/2006
We have this fixed in FC5/FC6, and rawhide will pick it up when I build a rebased 2.6.19, so I'll close this out.