Bug 213937

Summary:	2.6.18-1.2200.fc5 kernel crashes when copying files over NFS
Product:	[Fedora] Fedora	Reporter:	H.J. Lu <hongjiu.lu>
Component:	kernel	Assignee:	Dave Jones <davej>
Status:	CLOSED ERRATA	QA Contact:	Brian Brock <bbrock>
Severity:	medium	Docs Contact:
Priority:	medium
Version:	5	CC:	esandeen, jarod, pfrields, steved, wtogami
Target Milestone:	---
Target Release:	---
Hardware:	x86_64
OS:	Linux
Whiteboard:
Fixed In Version:		Doc Type:	Bug Fix
Doc Text:		Story Points:	---
Clone Of:		Environment:
Last Closed:	2006-11-12 05:48:06 UTC	Type:	---

Description H.J. Lu 2006-11-03 20:51:24 UTC

+++ This bug was initially created as a clone of Bug #213901 +++

When I copied a directory larger than 10GB from an FC5 or FC6 NFS server, the
kernel crashed:

Nov  3 11:04:50 gnu-4 kernel: general protection fault: 0000 [1] SMP
Nov  3 11:04:50 gnu-4 kernel: last sysfs file: /block/sda/sda1/size
Nov  3 11:04:50 gnu-4 kernel: CPU 1
Nov  3 11:04:50 gnu-4 kernel: Modules linked in: nfs fscache i915 drm nfsd
exportfs lockd nfs_acl autofs4 hidp rfcomm l2cap bluetooth sunrpc dm_mirror
dm_mod video sbs i2c_ec button battery asus_acpi ac ipv6 lp parport_pc parport
snd_hda_intel snd_hda_codec snd_seq_dummy snd_seq_oss snd_seq_midi_event snd_seq
snd_seq_device snd_pcm_oss snd_mixer_oss snd_pcm snd_timer snd soundcore
ohci1394 e1000 snd_page_alloc sr_mod shpchp cdrom ieee1394 i2c_i801 sg uhci_hcd
ehci_hcd floppy i2c_core intel_rng serio_raw pcspkr ext3 jbd ahci libata sd_mod
scsi_mod
Nov  3 11:04:50 gnu-4 kernel: Pid: 205, comm: pdflush Not tainted
2.6.18-1.2200.fc5 #1
Nov  3 11:04:50 gnu-4 kernel: RIP: 0010:[<ffffffff880924c3>] 
[<ffffffff880924c3>] :ext3:walk_page_buffers+0x34/0x8b
Nov  3 11:04:50 gnu-4 kernel: RSP: 0018:ffff81012bef3b20  EFLAGS: 00010286
Nov  3 11:04:50 gnu-4 kernel: RAX: 0000000000000000 RBX: 00000000ff0f375a RCX:
0000000000001000
Nov  3 11:04:50 gnu-4 kernel: RDX: 00000000ff0f375a RSI: ff0f375aff0f375a RDI:
ffff810111c968d0
Nov  3 11:04:50 gnu-4 kernel: RBP: 00000000fe1e6eb4 R08: 0000000000000000 R09:
ffffffff8809251a
Nov  3 11:04:50 gnu-4 kernel: R10: ffff8100a4d77250 R11: 0000000000000060 R12:
00000000ff0f375a
Nov  3 11:04:50 gnu-4 kernel: R13: ffff810014f87f70 R14: ff0f375aff0f375a R15:
0000000000000000
Nov  3 11:04:50 gnu-4 kernel: FS:  0000000000000000(0000)
GS:ffff81012bcbe9c0(0000) knlGS:0000000000000000
Nov  3 11:04:50 gnu-4 kernel: CS:  0010 DS: 0018 ES: 0018 CR0: 000000008005003b
Nov  3 11:04:50 gnu-4 kernel: CR2: 00002aaaaaac6000 CR3: 0000000111cae000 CR4:
00000000000006e0
Nov  3 11:04:50 gnu-4 kernel: Process pdflush (pid: 205, threadinfo
ffff81012bef2000, task ffff81012b52b040)
Nov  3 11:04:50 gnu-4 kernel: Stack:  ffffffff8809251a 0000000000001000
ffff810111c968d0 ffff81000153f180
Nov  3 11:04:50 gnu-4 kernel:  ffff810111c968d0 0000000011c968d0
ffff810014f87f70 ffff81012bef3dd0
Nov  3 11:04:50 gnu-4 kernel:  ffff81012af596a0 ffffffff8809550e
ffff81000153f180 ffff81012bef3dd0
Nov  3 11:04:50 gnu-4 kernel: Call Trace:
Nov  3 11:04:50 gnu-4 kernel:  [<ffffffff8809550e>]
:ext3:ext3_ordered_writepage+0xdf/0x198
Nov  3 11:04:50 gnu-4 kernel:  [<ffffffff8021c812>] mpage_writepages+0x1d0/0x395
Nov  3 11:04:50 gnu-4 kernel:  [<ffffffff8025a03a>] do_writepages+0x2c/0x32
Nov  3 11:04:50 gnu-4 kernel:  [<ffffffff8022ffd4>]
__writeback_single_inode+0x1ac/0x326
Nov  3 11:04:50 gnu-4 kernel:  [<ffffffff80220e40>] sync_sb_inodes+0x1b1/0x272
Nov  3 11:04:50 gnu-4 kernel:  [<ffffffff8024fc06>] writeback_inodes+0x95/0xee
Nov  3 11:04:50 gnu-4 kernel:  [<ffffffff802bd9ed>] wb_kupdate+0x9e/0x113
Nov  3 11:04:50 gnu-4 kernel:  [<ffffffff802556c5>] pdflush+0x14b/0x1f6
Nov  3 11:04:50 gnu-4 kernel:  [<ffffffff80232b05>] kthread+0xf6/0x12a
Nov  3 11:04:50 gnu-4 kernel:  [<ffffffff8025d065>] child_rip+0xa/0x11
Nov  3 11:04:50 gnu-4 kernel: DWARF2 unwinder stuck at child_rip+0xa/0x11
Nov  3 11:04:50 gnu-4 kernel: Leftover inexact backtrace:
Nov  3 11:04:50 gnu-4 kernel:  [<ffffffff8029b9f0>] keventd_create_kthread+0x0/0x66
Nov  3 11:04:50 gnu-4 kernel:  [<ffffffff80232a0f>] kthread+0x0/0x12a
Nov  3 11:04:50 gnu-4 kernel:  [<ffffffff8025d05b>] child_rip+0x0/0x11

Comment 1 Jarod Wilson 2006-11-05 01:38:55 UTC

Not sure if NFS is really at fault, or some other bit in the stack. Most of the
trace there looks ext3-related, but that may just be where we happened to be
when something else blew up. I've got a few boxes running the same kernel that
transfer several gigabytes of data (lots of large video files) to and fro via
nfs-atop-ext3 every day without a problem. Eric? Steve?

Comment 2 Jarod Wilson 2006-11-05 01:47:36 UTC

Whatever is at fault, it appears to be the same issue as bug 213901. Did this
bug show up under 2.6.18-1.2798.fc6 (bug 213901) on the same hardware or
different hardware?

Comment 3 Eric Sandeen 2006-11-05 02:02:08 UTC

hrmph, dying in ext3's walk_page_buffers... the other was in
journal_commit_transaction, either way looks like it could be ext-related.

How reproducible is it?

Can you get a dump?

Comment 4 Eric Sandeen 2006-11-05 04:41:10 UTC

looking at the disassembly, I -think- this looks like a page with a corrupted
buffer.  We wind up trying to follow a bh->b_this_page where bh is not a valid
address (ff0f375aff0f375a)
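
The failure mode described above can be illustrated with a toy model. This is a simplified Python sketch, not the kernel source: the class and function names (BufferHead, make_ring, walk_ring) are made up for illustration, and the only thing taken from the report is the idea that a page's buffers are chained through b_this_page and that the chain contained a value that is not a pointer.

```python
class BufferHead:
    """Toy stand-in for a buffer head; only the ring link is modeled."""

    def __init__(self, name):
        self.name = name
        self.b_this_page = None  # next buffer belonging to the same page


def make_ring(count):
    """Build `count` buffers linked into a circular list, as for one page."""
    heads = [BufferHead(f"bh{i}") for i in range(count)]
    for i, bh in enumerate(heads):
        bh.b_this_page = heads[(i + 1) % count]
    return heads


def walk_ring(head, blocksize, start, end, fn):
    """Visit every buffer whose byte range overlaps [start, end).

    Returns the names of the buffers that `fn` was applied to.
    """
    visited = []
    bh = head
    block_start = 0
    # Stop once we are back at the head after at least one step.
    while bh is not head or block_start == 0:
        nxt = bh.b_this_page  # a corrupted link is followed here
        block_end = block_start + blocksize
        if block_end > start and block_start < end:
            fn(bh)
            visited.append(bh.name)
        block_start = block_end
        bh = nxt
    return visited
```

If one link in the ring is overwritten with a plain integer such as 0xff0f375aff0f375a, the next iteration tries to read b_this_page from something that is not a buffer and the walk blows up, which is the Python analogue of the fault in the trace.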


One other thing that bugs me: I think %rdx should contain the "from" variable
passed in, which is never changed, only tested. Yet it contains
00000000ff0f375a (half of the bad bh above), while the calling function only
ever passes in 0.  But I could still learn a thing or two about x86_64 assembly,
so I could be missing something.  A dump would help here, I think.
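
A quick way to sanity-check the register values from the oops is sketched below. It assumes 48-bit virtual addressing, where bits 63 through 47 of a pointer must all be equal; dereferencing an address that breaks this rule raises a general protection fault rather than a page fault, which would be consistent with the "general protection fault" line in the trace. The helper names are hypothetical.

```python
def is_canonical(addr, va_bits=48):
    """Return True if `addr` is a canonical 64-bit address.

    Assumes `va_bits` of implemented virtual address space (48 here),
    so bits 63..(va_bits - 1) must be all zeros or all ones.
    """
    top = addr >> (va_bits - 1)
    all_ones = (1 << (64 - va_bits + 1)) - 1
    return top == 0 or top == all_ones


def halves_repeat(addr):
    """Return True if the upper and lower 32 bits of `addr` are identical."""
    return (addr >> 32) == (addr & 0xFFFFFFFF)


# Values copied from the register dump in the description.
RSI = 0xFF0F375AFF0F375A  # the bad "bh"
RDI = 0xFFFF810111C968D0  # looks like an ordinary kernel pointer
RDX = 0x00000000FF0F375A  # the unexpected "from" value
```

With these definitions, is_canonical(RSI) is False while is_canonical(RDI) is True, and halves_repeat(RSI) is True with each half equal to RDX. That matches the observation that the same 32-bit pattern shows up in both places, though it does not say where the pattern came from.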

Comment 5 H.J. Lu 2006-11-05 19:06:01 UTC

This bug and bug 213901 are on the same hardware. Kernel 2.6.18-1.2739.el5 from
RHEL 5 beta survived the same operation. My bugs may be related to

http://people.redhat.com/esandeen/traces/

Have those ext3 bugs been fixed?

Comment 6 Eric Sandeen 2006-11-06 02:58:30 UTC

Yes, it could certainly be related... I didn't realize that the kernels you were
testing didn't yet have that fix in them. :)

The ext3 traces were from bug 209647, and the fix for that should be in RHEL5
beta as of kernel-2.6.18-1.2739.el5 and later, so if it survived when you tested
it, that's good news.  The fix is -not- in 2.6.18-1.2200.fc5, but should be in
the next update.  I'm not certain when it will show up in a released FC6 kernel,
but probably in the next update, if it is not there already.

How reliably could you reproduce it on the older kernels?

Comment 7 H.J. Lu 2006-11-06 04:20:00 UTC

It happens every time I copy a directory tree of more than 10GB with
"cp -af ..." under the FC5/FC6 kernels.

Comment 8 Eric Sandeen 2006-11-06 04:27:35 UTC

Good to know, thanks.  If it's 100% with those kernels, and 0% with the latest
RHEL5 (and soon-to-be-released FC5/FC6 updates...) then it most likely is the
same problem.  Since you've seen it pass on RHEL5, can you test with later FC6
kernels too?  1.2798 and beyond in FC6 should have the fix too, see
http://download.fedora.redhat.com/pub/fedora/linux/core/development/x86_64/os/Fedora/RPMS/
for example...

Thanks,
-Eric

Comment 9 H.J. Lu 2006-11-06 04:46:14 UTC

See comment #2. 1.2798 in FC6 has a similar problem.

Comment 10 Eric Sandeen 2006-11-06 05:07:57 UTC

ah, sorry, missed that.  So you can hit this 100% of the time on
2.6.18-1.2798.fc6 but 0% of the time on kernel-2.6.18-1.2739.el5?  odd...  they
both should have the previously mentioned ext3 fix in them.  this bug and Bug
#213901 do have slightly different signatures, they may not be the same thing...

I'll see if I can hit this here, if so maybe I'll just do a search through the
kernels, first, to see when it showed up.

Thanks,
-Eric
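
A search through kernel builds like the one mentioned above is usually done as a bisection. The sketch below is only an illustration: it assumes the builds are ordered oldest to newest and that every build after the first crashing one also crashes, and the is_bad callback stands in for "install this kernel, run the copy, see whether it panics". Neither the function name nor the callback comes from this report.

```python
def first_bad(builds, is_bad):
    """Return the earliest build for which is_bad(build) is true.

    `builds` must be ordered oldest to newest, with all good builds
    before all bad ones. Returns None if no build is bad.
    """
    lo, hi = 0, len(builds)
    while lo < hi:
        mid = (lo + hi) // 2
        if is_bad(builds[mid]):
            hi = mid          # first bad build is at mid or earlier
        else:
            lo = mid + 1      # first bad build is after mid
    return builds[lo] if lo < len(builds) else None
```

Each step halves the remaining range, so even a long list of builds needs only a handful of reproduce-the-crash runs. The same routine finds the first fixed build if the predicate is inverted.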

Comment 11 H.J. Lu 2006-11-06 17:15:28 UTC

I am quite certain that both 2.6.18-1.2798.fc6 and 2.6.18-1.2200.fc5
panic when I copy a big tree, while kernel-2.6.18-1.2739.el5 is OK. Looking
at the kernel changelogs, it doesn't look like kernel-2.6.18-1.2798.fc6 has
the fix that is in kernel-2.6.18-1.2739.el5. Does the 1.2814 FC6 kernel in
devel have the fix in ELF?

[root@gnu-2 yum.repos.d]# rpm -q --changelog kernel-2.6.18-1.2798.fc6 | head -10
* Mon Oct 16 2006 Dave Jones <davej> 
- Silence another noisy boot-time printk. (#210810)
- Remove broken VIA quirk that prevented booting on some EPIAs (#210817)
- Fix JBD crash with 1K blocksize filesystems. (#209005)

[root@gnu-4 export]# rpm -q --changelog kernel-2.6.18-1.2739.el5 | head -10
* Thu Oct 26 2006 Don Zickus <dzickus> [2.6.18-1.2739.el5] 
- SHPCHP driver doesn't work (Keiichiro Tokunaga) [210478] 
- ext3/jbd panic (Eric Sandeen) [209647] 
- Oops in nfs_cancel_commit_list (Jeff Layton) [210679] 
- kernel Soft lockup detected on corrupted ext3 filesystem (Eric Sandeen) [212053] 
- CIFS doesn't work (Steve Dickson) [211070]

Comment 12 Eric Sandeen 2006-11-06 17:53:18 UTC

the changelogs don't quite look the same, but:

- Fix JBD crash with 1K blocksize filesystems. (#209005)
and
- ext3/jbd panic (Eric Sandeen) [209647] 

are actually the same issue, bug 209005 and bug 209647 are clones, one for fc6
and one for rhel5.

I'm not sure what you mean by "the fix in ELF?"  Oh... EL5?  Yes, it does:

[esandeen@host esandeen]$ rpm -qp --changelog kernel-2.6.18-1.2814.fc6.src.rpm |
grep "209005\|209647"
- Fix JBD crash with 1K blocksize filesystems. (#209005)
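
The same check can be scripted when several changelogs need comparing. This is a minimal sketch that works on changelog text already captured from "rpm -q --changelog"; the function name is hypothetical, and it only recognizes the two bug-number styles seen in this thread, "(#NNNNNN)" and "[NNNNNN]".

```python
import re


def entries_mentioning(changelog, bug_ids):
    """Return changelog lines that reference any of the given bug numbers.

    Recognizes numbers written as "#209005" or "[209647]".
    """
    wanted = {str(b) for b in bug_ids}
    hits = []
    for line in changelog.splitlines():
        found = set(re.findall(r"(?:#|\[)(\d+)", line))
        if found & wanted:
            hits.append(line.strip())
    return hits


# Changelog excerpts quoted earlier in this report.
FC6_2798 = """\
- Silence another noisy boot-time printk. (#210810)
- Remove broken VIA quirk that prevented booting on some EPIAs (#210817)
- Fix JBD crash with 1K blocksize filesystems. (#209005)
"""

EL5_2739 = """\
- SHPCHP driver doesn't work (Keiichiro Tokunaga) [210478]
- ext3/jbd panic (Eric Sandeen) [209647]
- Oops in nfs_cancel_commit_list (Jeff Layton) [210679]
"""
```

Searching for both clone numbers at once, as in entries_mentioning(FC6_2798, [209005, 209647]), avoids missing a fix just because the Fedora and RHEL changelogs cite different bug IDs for it.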

Comment 13 H.J. Lu 2006-11-07 16:51:49 UTC

*** Bug 213901 has been marked as a duplicate of this bug. ***

Comment 14 H.J. Lu 2006-11-07 16:53:50 UTC

2.6.18-1.2837 from FC6 fixes the problem for me.

Comment 15 Eric Sandeen 2006-11-07 19:01:49 UTC

Good news, thanks.  Out of curiosity, does the filesystem being served have a
block size < page size?

Comment 16 H.J. Lu 2006-11-07 19:11:27 UTC

No, I am using the default 4K block size on Intel64.
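
The question matters because the number of buffers attached to each page depends on the ratio of page size to block size. A rough sketch of that arithmetic follows; the helper names are hypothetical, and local_sizes simply reports what os.statvfs and os.sysconf return on the machine where it runs.

```python
import os


def buffers_per_page(block_size, page_size):
    """Number of filesystem blocks (and thus buffers) covering one page."""
    if block_size <= 0 or block_size > page_size or page_size % block_size:
        raise ValueError("block size must evenly divide the page size")
    return page_size // block_size


def local_sizes(path="/"):
    """Return (filesystem block size, system page size) for `path`."""
    block = os.statvfs(path).f_bsize
    page = os.sysconf("SC_PAGE_SIZE")
    return block, page
```

With the 4K blocks reported here and the 4K pages used on x86_64, buffers_per_page(4096, 4096) is 1, so each page carries a single buffer. A 1K-block filesystem, as named in the "Fix JBD crash with 1K blocksize filesystems" changelog entry, would give 4 buffers per page.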

Comment 17 Dave Jones 2006-11-12 05:48:06 UTC

Should be fixed in 2.6.18-1.2239.fc5, now in updates.