+++ This bug was initially created as a clone of Bug #213901 +++

When I was copying a > 10GB directory from an FC5 or FC6 NFS server, the kernel crashed:

Nov 3 11:04:50 gnu-4 kernel: general protection fault: 0000 [1] SMP
Nov 3 11:04:50 gnu-4 kernel: last sysfs file: /block/sda/sda1/size
Nov 3 11:04:50 gnu-4 kernel: CPU 1
Nov 3 11:04:50 gnu-4 kernel: Modules linked in: nfs fscache i915 drm nfsd exportfs lockd nfs_acl autofs4 hidp rfcomm l2cap bluetooth sunrpc dm_mirror dm_mod video sbs i2c_ec button battery asus_acpi ac ipv6 lp parport_pc parport snd_hda_intel snd_hda_codec snd_seq_dummy snd_seq_oss snd_seq_midi_event snd_seq snd_seq_device snd_pcm_oss snd_mixer_oss snd_pcm snd_timer snd soundcore ohci1394 e1000 snd_page_alloc sr_mod shpchp cdrom ieee1394 i2c_i801 sg uhci_hcd ehci_hcd floppy i2c_core intel_rng serio_raw pcspkr ext3 jbd ahci libata sd_mod scsi_mod
Nov 3 11:04:50 gnu-4 kernel: Pid: 205, comm: pdflush Not tainted 2.6.18-1.2200.fc5 #1
Nov 3 11:04:50 gnu-4 kernel: RIP: 0010:[<ffffffff880924c3>] [<ffffffff880924c3>] :ext3:walk_page_buffers+0x34/0x8b
Nov 3 11:04:50 gnu-4 kernel: RSP: 0018:ffff81012bef3b20 EFLAGS: 00010286
Nov 3 11:04:50 gnu-4 kernel: RAX: 0000000000000000 RBX: 00000000ff0f375a RCX: 0000000000001000
Nov 3 11:04:50 gnu-4 kernel: RDX: 00000000ff0f375a RSI: ff0f375aff0f375a RDI: ffff810111c968d0
Nov 3 11:04:50 gnu-4 kernel: RBP: 00000000fe1e6eb4 R08: 0000000000000000 R09: ffffffff8809251a
Nov 3 11:04:50 gnu-4 kernel: R10: ffff8100a4d77250 R11: 0000000000000060 R12: 00000000ff0f375a
Nov 3 11:04:50 gnu-4 kernel: R13: ffff810014f87f70 R14: ff0f375aff0f375a R15: 0000000000000000
Nov 3 11:04:50 gnu-4 kernel: FS: 0000000000000000(0000) GS:ffff81012bcbe9c0(0000) knlGS:0000000000000000
Nov 3 11:04:50 gnu-4 kernel: CS: 0010 DS: 0018 ES: 0018 CR0: 000000008005003b
Nov 3 11:04:50 gnu-4 kernel: CR2: 00002aaaaaac6000 CR3: 0000000111cae000 CR4: 00000000000006e0
Nov 3 11:04:50 gnu-4 kernel: Process pdflush (pid: 205, threadinfo ffff81012bef2000, task ffff81012b52b040)
Nov 3 11:04:50 gnu-4 kernel: Stack: ffffffff8809251a 0000000000001000 ffff810111c968d0 ffff81000153f180
Nov 3 11:04:50 gnu-4 kernel: ffff810111c968d0 0000000011c968d0 ffff810014f87f70 ffff81012bef3dd0
Nov 3 11:04:50 gnu-4 kernel: ffff81012af596a0 ffffffff8809550e ffff81000153f180 ffff81012bef3dd0
Nov 3 11:04:50 gnu-4 kernel: Call Trace:
Nov 3 11:04:50 gnu-4 kernel: [<ffffffff8809550e>] :ext3:ext3_ordered_writepage+0xdf/0x198
Nov 3 11:04:50 gnu-4 kernel: [<ffffffff8021c812>] mpage_writepages+0x1d0/0x395
Nov 3 11:04:50 gnu-4 kernel: [<ffffffff8025a03a>] do_writepages+0x2c/0x32
Nov 3 11:04:50 gnu-4 kernel: [<ffffffff8022ffd4>] __writeback_single_inode+0x1ac/0x326
Nov 3 11:04:50 gnu-4 kernel: [<ffffffff80220e40>] sync_sb_inodes+0x1b1/0x272
Nov 3 11:04:50 gnu-4 kernel: [<ffffffff8024fc06>] writeback_inodes+0x95/0xee
Nov 3 11:04:50 gnu-4 kernel: [<ffffffff802bd9ed>] wb_kupdate+0x9e/0x113
Nov 3 11:04:50 gnu-4 kernel: [<ffffffff802556c5>] pdflush+0x14b/0x1f6
Nov 3 11:04:50 gnu-4 kernel: [<ffffffff80232b05>] kthread+0xf6/0x12a
Nov 3 11:04:50 gnu-4 kernel: [<ffffffff8025d065>] child_rip+0xa/0x11
Nov 3 11:04:50 gnu-4 kernel: DWARF2 unwinder stuck at child_rip+0xa/0x11
Nov 3 11:04:50 gnu-4 kernel: Leftover inexact backtrace:
Nov 3 11:04:50 gnu-4 kernel: [<ffffffff8029b9f0>] keventd_create_kthread+0x0/0x66
Nov 3 11:04:50 gnu-4 kernel: [<ffffffff80232a0f>] kthread+0x0/0x12a
Nov 3 11:04:50 gnu-4 kernel: [<ffffffff8025d05b>] child_rip+0x0/0x11
Not sure if NFS is really at fault, or some other bit in the stack. Most of the trace there looks ext3-related, but that may just be where we happened to be when something else blew up. I've got a few boxes running the same kernel that transfer several gigabytes of data (lots of large video files) to and fro via nfs-atop-ext3 every day without a problem. Eric? Steve?
Whatever is at fault, it appears to be the same issue as bug 213901. Did this bug show up under 2.6.18-1.2798.fc6 (bug 213901) on the same hardware or different hardware?
hrmph, dying in ext3's walk_page_buffers... the other was in journal_commit_transaction, either way looks like it could be ext-related. How reproducible is it? Can you get a dump?
looking at the disassembly, I -think- this looks like a page with a corrupted buffer. We wind up trying to follow a bh->b_this_page where bh is not a valid address (ff0f375aff0f375a).

One other thing that bugs me: I think %rdx should contain the "from" variable passed in, which is never changed, only tested... but it contains 00000000ff0f375a (half of the above bad bh), even though the calling function only ever passes in 0. But I could still learn a thing or two about x86_64 assembly, so I could be missing something. A dump would help here, I think.
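For reference, a quick way to see why a value like ff0f375aff0f375a faults rather than just reading garbage: on x86_64, a virtual address has to be "canonical" (the upper bits must all match), and dereferencing a non-canonical address raises a general protection fault instead of a page fault, which matches the first line of the oops. The sketch below is only an illustration, not part of the kernel; it assumes 48-bit virtual addresses (4-level paging), and the name is_canonical is made up for this note.

```python
# Assumption: 48-bit virtual addressing (4-level paging) on x86_64.
VA_BITS = 48


def is_canonical(addr, va_bits=VA_BITS):
    """Return True if bits 63..(va_bits - 1) of addr are all 0 or all 1."""
    top = addr >> (va_bits - 1)                 # bits 63..47 when va_bits == 48
    all_ones = (1 << (64 - va_bits + 1)) - 1    # 17 one-bits when va_bits == 48
    return top == 0 or top == all_ones


# Register values copied from the oops above.
registers = {
    "RSI": 0xFF0F375AFF0F375A,  # the suspect bh value
    "RDI": 0xFFFF810111C968D0,  # an ordinary kernel pointer
    "RDX": 0x00000000FF0F375A,  # the odd "from" value
}

for name, value in registers.items():
    print(f"{name} {value:#018x} canonical={is_canonical(value)}")
```

Under that assumption, RSI (and R14, which holds the same value) fails the check, while RDI and RDX pass. That fits the corrupted-buffer theory but does not say where the corruption came from.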
This bug and bug 213937 are on the same hardware. Kernel 2.6.18-1.2739.el5 from RHEL 5 beta survived the same operation. My bugs may be related to http://people.redhat.com/esandeen/traces/ Have those ext3 bugs been fixed?
Yes, it could certainly be related... I didn't realize that the kernels you were testing didn't yet have that fix in them. :) The ext3 traces were from bug 209647, and the fix for that should be in RHEL5 beta as of kernel-2.6.18-1.2739.el5 and later, so if it survived when you tested it, that's good news. The fix is -not- in 2.6.18-1.2200.fc5, but should be in the next update. I'm not certain when it will show up in a released FC6 kernel, but probably the next update if not already. How reliably could you reproduce it on the older kernels?
It happens every time I copy a directory tree larger than 10GB with "cp -af ..." under an FC5/FC6 kernel.
Good to know, thanks. If it's 100% with those kernels, and 0% with the latest RHEL5 (and soon-to-be-released FC5/FC6 updates...) then it most likely is the same problem. Since you've seen it pass on RHEL5, can you test with later FC6 kernels too? 1.2798 and beyond in FC6 should have the fix too, see http://download.fedora.redhat.com/pub/fedora/linux/core/development/x86_64/os/Fedora/RPMS/ for example... Thanks, -Eric
See comment #2. 1.2798 in FC6 has a similar problem.
ah, sorry, missed that. So you can hit this 100% of the time on 2.6.18-1.2798.fc6 but 0% of the time on kernel-2.6.18-1.2739.el5? odd... they both should have the previously mentioned ext3 fix in them. this bug and Bug #213901 do have slightly different signatures, they may not be the same thing... I'll see if I can hit this here, if so maybe I'll just do a search through the kernels, first, to see when it showed up. Thanks, -Eric
I am quite certain that both 2.6.18-1.2798.fc6 and 2.6.18-1.2200.fc5 panic when I copy a big tree, while kernel-2.6.18-1.2739.el5 is OK. Looking at the kernel changelogs, it doesn't look like kernel-2.6.18-1.2798.fc6 has the fix that is in kernel-2.6.18-1.2739.el5. Does 1.2814 FC6 kernel in devel have the fix in ELF?

[root@gnu-2 yum.repos.d]# rpm -q --changelog kernel-2.6.18-1.2798.fc6 | head -10
* Mon Oct 16 2006 Dave Jones <davej>
- Silence another noisy boot-time printk. (#210810)
- Remove broken VIA quirk that prevented booting on some EPIAs (#210817)
- Fix JBD crash with 1K blocksize filesystems. (#209005)

[root@gnu-4 export]# rpm -q --changelog kernel-2.6.18-1.2739.el5 | head -10
* Thu Oct 26 2006 Don Zickus <dzickus> [2.6.18-1.2739.el5]
- SHPCHP driver doesn't work (Keiichiro Tokunaga) [210478]
- ext3/jbd panic (Eric Sandeen) [209647]
- Oops in nfs_cancel_commit_list (Jeff Layton) [210679]
- kernel Soft lockup detected on corrupted ext3 filesystem (Eric Sandeen) [212053]
- CIFS doesn't work (Steve Dickson) [211070]
the changelogs don't quite look the same, but:

- Fix JBD crash with 1K blocksize filesystems. (#209005)

and

- ext3/jbd panic (Eric Sandeen) [209647]

are actually the same issue; bug 209005 and bug 209647 are clones, one for fc6 and one for rhel5. I'm not sure what you mean by "the fix in ELF?" Oh... EL5? Yes, it does:

[esandeen@host esandeen]$ rpm -qp --changelog kernel-2.6.18-1.2814.fc6.src.rpm | grep "209005\|209647"
- Fix JBD crash with 1K blocksize filesystems. (#209005)
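Since the two changelogs use different styles for bug references, "(#209005)" on the Fedora side and "[209647]" on the RHEL side, comparing them by eye is easy to get wrong. A rough sketch of doing the comparison mechanically follows. The helper names (bug_ids, canonical) and the CLONES table are made up for illustration; the only clone pair listed is the one stated in this thread, and the changelog text is pasted from the rpm output quoted earlier.

```python
import re

# Matches "(#209005)" (Fedora style) or "[209647]" (RHEL style).
# A header like "[2.6.18-1.2739.el5]" is not all digits, so it is skipped.
BUG_REF = re.compile(r"\(#(\d+)\)|\[(\d+)\]")

# Assumption: clone pairs have to be supplied by hand; this one is from the thread.
CLONES = {209647: 209005}


def bug_ids(changelog):
    """Collect every bug number referenced in a changelog excerpt."""
    return {int(m.group(1) or m.group(2)) for m in BUG_REF.finditer(changelog)}


def canonical(ids):
    """Map cloned bug numbers onto a single representative number."""
    return {CLONES.get(bug, bug) for bug in ids}


FC6_2798 = """\
* Mon Oct 16 2006 Dave Jones <davej>
- Silence another noisy boot-time printk. (#210810)
- Remove broken VIA quirk that prevented booting on some EPIAs (#210817)
- Fix JBD crash with 1K blocksize filesystems. (#209005)
"""

EL5_2739 = """\
* Thu Oct 26 2006 Don Zickus <dzickus> [2.6.18-1.2739.el5]
- SHPCHP driver doesn't work (Keiichiro Tokunaga) [210478]
- ext3/jbd panic (Eric Sandeen) [209647]
- Oops in nfs_cancel_commit_list (Jeff Layton) [210679]
- kernel Soft lockup detected on corrupted ext3 filesystem (Eric Sandeen) [212053]
- CIFS doesn't work (Steve Dickson) [211070]
"""

shared = canonical(bug_ids(FC6_2798)) & canonical(bug_ids(EL5_2739))
print("fixes present in both excerpts:", sorted(shared))
```

This only looks at the excerpts shown (the output of head -10), so a fix listed further down either changelog would be missed; running it over the full rpm -q --changelog output would be the more complete check.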
*** Bug 213901 has been marked as a duplicate of this bug. ***
2.6.18-1.2837 from FC6 fixes the problem for me.
Good news, thanks. Out of curiosity, does the filesystem being served have a block size < page size?
No, I am using the default 4K block size on Intel64.
should be fixed in 2.6.18-1.2239.fc5 now in updates.