Bug 121978 - xfs + lvm2 + nfsv4 might be causing a stack overflow
Summary: xfs + lvm2 + nfsv4 might be causing a stack overflow
Keywords:
Status: CLOSED ERRATA
Alias: None
Product: Fedora
Classification: Fedora
Component: kernel
Version: rawhide
Hardware: i386
OS: Linux
medium
high
Target Milestone: ---
Assignee: Arjan van de Ven
QA Contact: Brian Brock
URL:
Whiteboard:
Depends On:
Blocks:
 
Reported: 2004-04-29 14:14 UTC by Carl-Johan Kjellander
Modified: 2007-11-30 22:10 UTC

Fixed In Version:
Doc Type: Bug Fix
Doc Text:
Clone Of:
Environment:
Last Closed: 2004-08-20 06:24:54 UTC
Type: ---
Embargoed:


Attachments
oops picture #1 (568.59 KB, image/jpeg)
2004-04-29 14:16 UTC, Carl-Johan Kjellander
oops picture #2 (553.29 KB, image/jpeg)
2004-04-29 14:16 UTC, Carl-Johan Kjellander
oops picture #3 (563.36 KB, image/jpeg)
2004-04-29 14:17 UTC, Carl-Johan Kjellander
oops picture #4 (530.71 KB, image/jpeg)
2004-04-29 14:18 UTC, Carl-Johan Kjellander
oops number 2 pic #1 (553.86 KB, text/plain)
2004-04-29 14:25 UTC, Carl-Johan Kjellander
oops number 2 pic #2 (553.25 KB, text/plain)
2004-04-29 14:26 UTC, Carl-Johan Kjellander
oops number 2 pic #1, right content type (553.86 KB, image/jpeg)
2004-04-29 14:27 UTC, Carl-Johan Kjellander
oops number 2 pic #2, right content type (553.25 KB, image/jpeg)
2004-04-29 14:27 UTC, Carl-Johan Kjellander
oops number 2 pic #3 (544.42 KB, image/jpeg)
2004-04-29 14:28 UTC, Carl-Johan Kjellander
oops number 2 pic #4 (537.13 KB, image/jpeg)
2004-04-29 14:29 UTC, Carl-Johan Kjellander
tcpdump of the nfsv4 traffic that caused the crash (135.61 KB, application/octet-stream)
2004-05-06 15:09 UTC, Carl-Johan Kjellander
reduce stack usage of nfsd4_proc_compound (7.99 KB, patch)
2004-05-07 16:13 UTC, J. Bruce Fields

Description Carl-Johan Kjellander 2004-04-29 14:14:35 UTC
Description of problem:
Got an oops today with kernel-2.6.5-1.339.i686 in an interrupt
handler, so there was no sync and nothing in the logs. I'm attaching
pretty pictures of the oops.

It seems to be an infinite loop in do_page_fault:

do_page_fault+0x266/0x43e
do_page_fault+0x0/0x43e
rw_vm+0x1c/0x425
rw_vm+0x1c/0x425
rw_vm+0x1c/0x425
rw_vm+0x1c/0x425
get_user_size+0x30/057
rw_vm+0x1c/0x425
__is_prefetch+0x1a7/0x295
rw_vm+0x1c/0x425
do_page_fault+0x266/0x43e
do_page_fault+0x0/0x43e
rw_vm+0x1c/0x425
rw_vm+0x1c/0x425
rw_vm+0x1c/0x425
rw_vm+0x1c/0x425
get_user_size+0x30/057
rw_vm+0x1c/0x425
__is_prefetch+0x1a7/0x295
rw_vm+0x1c/0x425
do_page_fault+0x266/0x43e
do_page_fault+0x0/0x43e
rw_vm+0x1c/0x425
rw_vm+0x1c/0x425
rw_vm+0x1c/0x425
rw_vm+0x1c/0x425
get_user_size+0x30/057
rw_vm+0x1c/0x425
__is_prefetch+0x1a7/0x295
rw_vm+0x1c/0x425
...and so on
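
One way to check whether the listing above really repeats is to look for the shortest block length at which the symbol sequence matches itself. The following is only a sketch: the helper name `smallest_period` is made up for illustration, and the symbol list is transcribed by hand from the trace above with the offsets dropped. A repeating block in an oops backtrace may point to nested page faults piling up on the stack rather than a literal loop, but the sketch cannot tell the two apart.

```python
def smallest_period(frames):
    """Return the shortest block length p such that frames[i] == frames[i + p]
    wherever both indices exist, or None if no such repetition is found."""
    n = len(frames)
    for p in range(1, n // 2 + 1):
        if all(frames[i] == frames[i + p] for i in range(n - p)):
            return p
    return None


# Symbols copied from the oops above (offsets dropped). The trace showed
# this block three times before "...and so on".
block = [
    "do_page_fault", "do_page_fault",
    "rw_vm", "rw_vm", "rw_vm", "rw_vm",
    "get_user_size", "rw_vm", "__is_prefetch", "rw_vm",
]
frames = block * 3

period = smallest_period(frames)
print("repeating block length:", period)
```

The same helper can be pointed at any list of symbols pulled out of a longer capture, for example one taken over a serial console.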

Code: 8b 40 68 c7 44 24 04 60 d8 32 02 85 c0 0f 44 44 24 04 31 c9 
                                                         maybe e9

Version-Release number of selected component (if applicable):
kernel-2.6.5-1.339.i686

kernel /vmlinuz-2.6.5-1.339 ro root=LABEL=/2 rhgb quiet vga=3847 selinux=0

How reproducible:
Only seen once.
  
Additional info:
Running on an Athlon
Disks are 4 SATA-drives on a 3ware hwraid:
00:08.0 RAID bus controller: 3ware Inc 3ware 7000-series ATA-RAID (rev 01)
Disks are all lvm2
Filesystem xfs on top of lvm2
(And running NFSv4 but there was no NFSv4-activity at the time of the
crash.)

Comment 1 Carl-Johan Kjellander 2004-04-29 14:16:00 UTC
Created attachment 99777 [details]
oops picture #1

Comment 2 Carl-Johan Kjellander 2004-04-29 14:16:53 UTC
Created attachment 99778 [details]
oops picture #2

Comment 3 Carl-Johan Kjellander 2004-04-29 14:17:33 UTC
Created attachment 99780 [details]
oops picture #3

Comment 4 Carl-Johan Kjellander 2004-04-29 14:18:10 UTC
Created attachment 99781 [details]
oops picture #4

Comment 5 Carl-Johan Kjellander 2004-04-29 14:23:45 UTC
Actually, I may have seen this crash yesterday. It's also in 
rw_vm and do_page_fault.

I'm attaching more images from that crash.

Comment 6 Carl-Johan Kjellander 2004-04-29 14:25:35 UTC
Created attachment 99782 [details]
oops number 2 pic #1

Comment 7 Carl-Johan Kjellander 2004-04-29 14:26:09 UTC
Created attachment 99783 [details]
oops number 2 pic #2

Comment 8 Carl-Johan Kjellander 2004-04-29 14:27:02 UTC
Created attachment 99784 [details]
oops number 2 pic #1, right content type

Comment 9 Carl-Johan Kjellander 2004-04-29 14:27:40 UTC
Created attachment 99785 [details]
oops number 2 pic #2, right content type

Comment 10 Carl-Johan Kjellander 2004-04-29 14:28:28 UTC
Created attachment 99786 [details]
oops number 2 pic #3

Comment 11 Carl-Johan Kjellander 2004-04-29 14:29:09 UTC
Created attachment 99787 [details]
oops number 2 pic #4

Comment 12 Carl-Johan Kjellander 2004-05-06 14:56:08 UTC
I can reproduce this bug easily on kernel-2.6.5-1.349.

And I got a null-modem cable so I don't have to take pictures of the monitor.

Here's the first crash of the day. It was a bit annoying because this
was scrolling very fast across the screen and didn't seem to want to
stop at all. So multiply this stack data by 1000 or something:

 =======================
 [<0211935a>] do_page_fault+0x26e/0x446
 [<021cc839>] __delay+0x9/0xa
 [<0211b539>] __wake_up_common+0x33/0x57
 [<0211b5e7>] __wake_up+0x8a/0xee
 [<021190ec>] do_page_fault+0x0/0x446
 [<021068d6>] show_trace+0x3e/0x97
 [<021069c1>] dump_stack+0x11/0x13
 [<02107f99>] do_IRQ+0x45/0x303
 [<022256ac>] __make_request+0x612/0x61c
 [<0222582c>] generic_make_request+0x176/0x186
 [<02164a11>] bio_clone+0xd/0x7e
 [<4288e275>] __map_bio+0x34/0xa7 [dm_mod]
 [<4288e468>] __clone_and_map+0xc0/0x2c3 [dm_mod]
 [<4288e702>] __split_bio+0x97/0xfc [dm_mod]
 [<4288e7f3>] dm_request+0x8c/0x9f [dm_mod]
 [<0222582c>] generic_make_request+0x176/0x186
 [<0211cc73>] autoremove_wake_function+0x0/0x28
 [<022258e0>] submit_bio+0xa4/0xac
 [<02164b18>] __bio_add_page+0x62/0x100
 [<02164bd3>] bio_add_page+0x1d/0x21
 [<4298725e>] _pagebuf_ioapply+0x1f2/0x24b [xfs]
 [<42987398>] pagebuf_iorequest+0xe1/0x118 [xfs]
 [<0211b4fa>] default_wake_function+0x0/0xc
 [<0211b4fa>] default_wake_function+0x0/0xc
 [<4298697a>] pagebuf_associate_memory+0x126/0x160 [xfs]
 [<4296c0ef>] xlog_bdstrat_cb+0x16/0x45 [xfs]
 [<4296cb8a>] xlog_sync+0x1dd/0x396 [xfs]
 [<4296faa2>] xlog_state_sync+0x2cc/0x52a [xfs]
 [<42931a27>] xfs_alloc_ag_vextent_near+0x843/0x8e3 [xfs]
 [<42931a27>] xfs_alloc_ag_vextent_near+0x843/0x8e3 [xfs]
 [<4296b15f>] xfs_log_force+0x30/0x3b [xfs]
 [<42933551>] xfs_alloc_search_busy+0x16a/0x1dc [xfs]
 [<4293104f>] xfs_alloc_ag_vextent+0x9b/0xd1 [xfs]
 [<42932f90>] xfs_alloc_vextent+0x2bf/0x3e1 [xfs]
 [<42940662>] xfs_bmap_alloc+0x12cd/0x159c [xfs]
 [<42976ee5>] xfs_mod_incore_sb+0x8e/0xed [xfs]
 [<4294926f>] xfs_bmbt_get_state+0xa/0x18 [xfs]
 [<42941fab>] xfs_bmap_do_search_extents+0x2c8/0x2ec [xfs]
 [<429438a5>] xfs_bmapi+0x771/0x104d [xfs]
 [<4294926f>] xfs_bmbt_get_state+0xa/0x18 [xfs]
 [<42941fab>] xfs_bmap_do_search_extents+0x2c8/0x2ec [xfs]
 [<42968fed>] xfs_iomap_write_allocate+0x244/0x3a0 [xfs]
 [<429684db>] xfs_iomap+0x23b/0x3eb [xfs]
 [<429685f0>] xfs_iomap+0x350/0x3eb [xfs]
 [<4298bc8d>] xfs_bmap+0x1a/0x1e [xfs]
 [<42984735>] xfs_map_blocks+0x61/0x186 [xfs]
 [<42985244>] xfs_page_state_convert+0x201/0x3e6 [xfs]
 [<021cadbf>] radix_tree_gang_lookup_tag+0x3d/0x57
 [<0213d4e5>] find_get_pages_tag+0x93/0x129
 [<4298590e>] linvfs_writepage+0x8f/0xc0 [xfs]
 [<021877ed>] mpage_writepages+0x142/0x271
 [<4298587f>] linvfs_writepage+0x0/0xc0 [xfs]
 [<428bfcc0>] export_encode_fh+0x0/0x160 [exportfs]
 [<0213ca0b>] __filemap_fdatawrite+0x46/0x4e
 [<021b9cf5>] file_alloc_security+0x26/0x77
 [<02160337>] open_private_file+0x99/0xb3
 [<4b6e690f>] nfsd_sync+0x62/0x9b [nfsd]
 [<4b6e715f>] nfsd_commit+0x7f/0x9c [nfsd]
 [<0217827e>] dput+0x18/0x4de
 [<4b6f0ebb>] nfsd4_proc_compound+0x204/0x13f0 [nfsd]
 [<021b97b3>] avc_has_perm_noaudit+0x257/0x488
 [<021b9a23>] avc_has_perm+0x3f/0x49
 [<022a6951>] dst_output+0x0/0x1c
 [<0228ae52>] dev_queue_xmit+0x1f8/0x55d
 [<022a6a7f>] ip_finish_output2+0x112/0x15f
 [<0215ca9e>] put_user_size+0x1c/0x2d
 [<02288764>] memcpy_toiovec+0x27/0x49
 [<022aaa39>] cleanup_rbuf+0xb3/0xd5
 [<022ab2f8>] tcp_recvmsg+0x5fb/0x636
 [<4b6f67b2>] nfs4svc_decode_compoundargs+0x0/0x90 [nfsd]
 [<4b6e3a20>] nfsd_dispatch+0xbf/0x163 [nfsd]
 [<4b6809c9>] svc_process+0x323/0x562 [sunrpc]
 [<4b6e3683>] nfsd+0x3ae/0x68c [nfsd]
 [<4b6e32d5>] nfsd+0x0/0x68c [nfsd]
 [<021041d9>] kernel_thread_helper+0x5/0xb
 =======================
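
To see which subsystems a trace like the one above passes through, the frames can be grouped by the module name in square brackets. This is a minimal sketch, not a general oops parser: the regular expression only covers the `[<addr>] symbol+0xoff/0xsize [module]` layout seen here, the label `vmlinux` for frames without a module tag is my own placeholder, and the sample is a handful of lines copied from the trace. Keep in mind that the kernel prints every text-looking address it finds on the stack, so not every line is necessarily a live caller.

```python
import re
from collections import Counter

# Matches lines such as:
#   [<4296cb8a>] xlog_sync+0x1dd/0x396 [xfs]
FRAME_RE = re.compile(
    r"\[<(?P<addr>[0-9a-f]+)>\]\s+"
    r"(?P<sym>[\w.]+)\+0x(?P<off>[0-9a-f]+)/0x(?P<size>[0-9a-f]+)"
    r"(?:\s+\[(?P<mod>\w+)\])?"
)


def parse_frames(text):
    """Return a list of (symbol, module) pairs found in a trace dump."""
    frames = []
    for line in text.splitlines():
        m = FRAME_RE.search(line)
        if m:
            # "vmlinux" is a placeholder label for frames with no [module] tag.
            frames.append((m.group("sym"), m.group("mod") or "vmlinux"))
    return frames


# A few lines copied from the trace above.
SAMPLE = """\
 [<0222582c>] generic_make_request+0x176/0x186
 [<4288e7f3>] dm_request+0x8c/0x9f [dm_mod]
 [<4298725e>] _pagebuf_ioapply+0x1f2/0x24b [xfs]
 [<4296cb8a>] xlog_sync+0x1dd/0x396 [xfs]
 [<4b6e715f>] nfsd_commit+0x7f/0x9c [nfsd]
 [<4b6f0ebb>] nfsd4_proc_compound+0x204/0x13f0 [nfsd]
 [<4b6809c9>] svc_process+0x323/0x562 [sunrpc]
"""

frames = parse_frames(SAMPLE)
per_module = Counter(mod for _, mod in frames)
for mod, count in per_module.most_common():
    print(mod, count)
```

Run over the full trace, the grouping makes the nesting visible: sunrpc and nfsd at the bottom, then xfs, then dm_mod and the block layer, with the page fault on top.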



Comment 13 Carl-Johan Kjellander 2004-05-06 15:04:38 UTC
And the second crash with 349. As last time, this was repeatedly
scrolling across the screen.

I'm also attaching a tcpdump of all the traffic back and forth
between the client and server leading to this second crash,
as J. Bruce Fields wanted.

 =======================
 [<021068d6>] show_trace+0x3e/0x97
 [<021069aa>] show_stack+0x7b/0x81
 [<02106a9c>] show_registers+0xd9/0x172
 [<02106c84>] die+0xea/0x1b1
 [<021193e3>] do_page_fault+0x2f7/0x446
 [<021068d6>] show_trace+0x3e/0x97
 [<02207e59>] complement_pos+0xf/0x128
 [<021cc839>] __delay+0x9/0xa
 [<0221ddad>] serial8250_console_write+0x176/0x1bc
 [<0221dc37>] serial8250_console_write+0x0/0x1bc
 [<0221dc37>] serial8250_console_write+0x0/0x1bc
 [<0211fa19>] __call_console_drivers+0x36/0x42
 [<0211fb33>] call_console_drivers+0xbe/0xe3
 [<021190ec>] do_page_fault+0x0/0x446
 [<021068d6>] show_trace+0x3e/0x97
 [<021069aa>] show_stack+0x7b/0x81
 [<02106a9c>] show_registers+0xd9/0x172
 [<02106c84>] die+0xea/0x1b1
 [<021193e3>] do_page_fault+0x2f7/0x446
 [<021068d6>] show_trace+0x3e/0x97
 [<02207e59>] complement_pos+0xf/0x128
 [<021cc839>] __delay+0x9/0xa
 [<0221ddad>] serial8250_console_write+0x176/0x1bc
 [<0221dc37>] serial8250_console_write+0x0/0x1bc
 [<0221dc37>] serial8250_console_write+0x0/0x1bc
 [<0211fa19>] __call_console_drivers+0x36/0x42
 [<0211fb33>] call_console_drivers+0xbe/0xe3
 [<021190ec>] do_page_fault+0x0/0x446
 [<021068d6>] show_trace+0x3e/0x97
 [<021069aa>] show_stack+0x7b/0x81
 [<02106a9c>] show_registers+0xd9/0x172
 [<02106c84>] die+0xea/0x1b1
 [<021193e3>] do_page_fault+0x2f7/0x446
 [<021068d6>] show_trace+0x3e/0x97
 [<02207e59>] complement_pos+0xf/0x128
 [<021cc839>] __delay+0x9/0xa
 [<0221ddad>] serial8250_console_write+0x176/0x1bc
 [<0221dc37>] serial8250_console_write+0x0/0x1bc
 [<0221dc37>] serial8250_console_write+0x0/0x1bc
 [<0211fa19>] __call_console_drivers+0x36/0x42
 [<0211fb33>] call_console_drivers+0xbe/0xe3
 [<021190ec>] do_page_fault+0x0/0x446
 [<021068d6>] show_trace+0x3e/0x97
 [<021069aa>] show_stack+0x7b/0x81
 [<02106a9c>] show_registers+0xd9/0x172
 [<02106c84>] die+0xea/0x1b1
 [<021193e3>] do_page_fault+0x2f7/0x446
 [<021068d6>] show_trace+0x3e/0x97
 [<02207e59>] complement_pos+0xf/0x128
 [<021cc839>] __delay+0x9/0xa
 [<0221ddad>] serial8250_console_write+0x176/0x1bc
 [<0221dc37>] serial8250_console_write+0x0/0x1bc
 [<0221dc37>] serial8250_console_write+0x0/0x1bc
 [<0211fa19>] __call_console_drivers+0x36/0x42
 [<0211fb33>] call_console_drivers+0xbe/0xe3
 [<021190ec>] do_page_fault+0x0/0x446
 [<021068d6>] show_trace+0x3e/0x97
 [<021069aa>] show_stack+0x7b/0x81
 [<02106a9c>] show_registers+0xd9/0x172
 [<02106c84>] die+0xea/0x1b1
 [<021193e3>] do_page_fault+0x2f7/0x446
 [<021068d6>] show_trace+0x3e/0x97
 [<02207e59>] complement_pos+0xf/0x128
 [<021cc839>] __delay+0x9/0xa
 [<0221ddad>] serial8250_console_write+0x176/0x1bc
 [<0221dc37>] serial8250_console_write+0x0/0x1bc
 [<0221dc37>] serial8250_console_write+0x0/0x1bc
 [<0211fa19>] __call_console_drivers+0x36/0x42
 [<0211fb33>] call_console_drivers+0xbe/0xe3
 [<021190ec>] do_page_fault+0x0/0x446
 [<021068d6>] show_trace+0x3e/0x97
 [<021069aa>] show_stack+0x7b/0x81
 [<02106a9c>] show_registers+0xd9/0x172
 [<02106c84>] die+0xea/0x1b1
 [<021193e3>] do_page_fault+0x2f7/0x446
 [<021068d6>] show_trace+0x3e/0x97
 [<02207e59>] complement_pos+0xf/0x128
 [<021cc839>] __delay+0x9/0xa
 [<0221ddad>] serial8250_console_write+0x176/0x1bc
 [<0221dc37>] serial8250_console_write+0x0/0x1bc
 [<0221dc37>] serial8250_console_write+0x0/0x1bc
 [<0211fa19>] __call_console_drivers+0x36/0x42
 [<0211fb33>] call_console_drivers+0xbe/0xe3
 [<021190ec>] do_page_fault+0x0/0x446
 [<021068d6>] show_trace+0x3e/0x97
 [<021069aa>] show_stack+0x7b/0x81
 [<02106a9c>] show_registers+0xd9/0x172
 [<02106c84>] die+0xea/0x1b1
 [<021193e3>] do_page_fault+0x2f7/0x446
 [<021068d6>] show_trace+0x3e/0x97
 [<02207e59>] complement_pos+0xf/0x128
 [<021cc839>] __delay+0x9/0xa
 [<0221ddad>] serial8250_console_write+0x176/0x1bc
 [<0221dc37>] serial8250_console_write+0x0/0x1bc
 [<0221dc37>] serial8250_console_write+0x0/0x1bc
 [<0211fa19>] __call_console_drivers+0x36/0x42
 [<0211fb33>] call_console_drivers+0xbe/0xe3
 [<021190ec>] do_page_fault+0x0/0x446
 [<021068d6>] show_trace+0x3e/0x97
 [<021069aa>] show_stack+0x7b/0x81
 [<02106a9c>] show_registers+0xd9/0x172
 [<02106c84>] die+0xea/0x1b1
 [<021193e3>] do_page_fault+0x2f7/0x446
 [<021068d6>] show_trace+0x3e/0x97
 [<02207e59>] complement_pos+0xf/0x128
 [<021cc839>] __delay+0x9/0xa
 [<0221ddad>] serial8250_console_write+0x176/0x1bc
 [<0221dc37>] serial8250_console_write+0x0/0x1bc
 [<0221dc37>] serial8250_console_write+0x0/0x1bc
 [<0211fa19>] __call_console_drivers+0x36/0x42
 [<0211fb33>] call_console_drivers+0xbe/0xe3
 [<021190ec>] do_page_fault+0x0/0x446
 [<021068d6>] show_trace+0x3e/0x97
 [<021069aa>] show_stack+0x7b/0x81
 [<02106a9c>] show_registers+0xd9/0x172
 [<02106c84>] die+0xea/0x1b1
 [<021193e3>] do_page_fault+0x2f7/0x446
 [<021068d6>] show_trace+0x3e/0x97
 [<02207e59>] complement_pos+0xf/0x128
 [<021cc839>] __delay+0x9/0xa
 [<0221ddad>] serial8250_console_write+0x176/0x1bc
 [<0221dc37>] serial8250_console_write+0x0/0x1bc
 [<0221dc37>] serial8250_console_write+0x0/0x1bc
 [<0211fa19>] __call_console_drivers+0x36/0x42
 [<0211fb33>] call_console_drivers+0xbe/0xe3
 [<021190ec>] do_page_fault+0x0/0x446
 [<021068d6>] show_trace+0x3e/0x97
 [<021069aa>] show_stack+0x7b/0x81
 [<02106a9c>] show_registers+0xd9/0x172
 [<02106c84>] die+0xea/0x1b1
 [<021193e3>] do_page_fault+0x2f7/0x446
 [<021068d6>] show_trace+0x3e/0x97
 [<02207e59>] complement_pos+0xf/0x128
 [<021cc839>] __delay+0x9/0xa
 [<0221ddad>] serial8250_console_write+0x176/0x1bc
 [<0221dc37>] serial8250_console_write+0x0/0x1bc
 [<0221dc37>] serial8250_console_write+0x0/0x1bc
 [<0211fa19>] __call_console_drivers+0x36/0x42
 [<0211fb33>] call_console_drivers+0xbe/0xe3
 [<021190ec>] do_page_fault+0x0/0x446
 [<021068d6>] show_trace+0x3e/0x97
 [<021069aa>] show_stack+0x7b/0x81
 [<02106a9c>] show_registers+0xd9/0x172
 [<02106c84>] die+0xea/0x1b1
 [<021193e3>] do_page_fault+0x2f7/0x446
 [<0215c19c>] rw_vm+0x1c/0x425
 =======================


Comment 14 Carl-Johan Kjellander 2004-05-06 15:09:40 UTC
Created attachment 100044 [details]
tcpdump of the nfsv4 traffic that caused the crash

Comment 15 J. Bruce Fields 2004-05-07 16:13:30 UTC
Created attachment 100082 [details]
reduce stack usage of nfsd4_proc_compound

Comment 16 J. Bruce Fields 2004-05-07 16:20:30 UTC
Urp, never used bugzilla before, so apologies in advance for screwing
anything up.  Every nfsv4 request goes through nfsd4_proc_compound,
which is currently using about 1k of stack.  That is due to two large
local variables and one colossal switch statement which calls some
inlines; the stack usage of the switch is the *sum* of the cases, not
the maximum.  The above attached patch may not be the right solution,
but if Carl-Johan Kjellander can verify that it fixes this crash, then
that'll at least be confirmation that this is a stack usage problem,
and then I'll work on getting a better patch into 2.6.
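
The "sum, not maximum" point can be illustrated with a toy model. This is only a sketch under loud assumptions: the per-case byte counts below are invented for illustration and are not measured from nfsd, and the model ignores return addresses, saved registers and alignment. It assumes a compiler that gives every inlined case its own stack slots, so the dispatcher's frame holds all of them at once, versus separate out-of-line functions where only one callee frame is live at a time.

```python
# HYPOTHETICAL local-variable sizes in bytes per switch case. These are
# NOT taken from the kernel source; they only illustrate the arithmetic.
CASE_LOCALS = {"open": 200, "read": 120, "write": 160, "commit": 80}
DISPATCH_LOCALS = 64  # hypothetical locals of the dispatching function


def frame_if_inlined(case_locals, own):
    """Every case inlined with no stack-slot sharing: the sizes add up."""
    return own + sum(case_locals.values())


def peak_if_out_of_line(case_locals, own):
    """Each case is a real call: at most one callee frame is live at once."""
    return own + max(case_locals.values())


inlined = frame_if_inlined(CASE_LOCALS, DISPATCH_LOCALS)
out_of_line = peak_if_out_of_line(CASE_LOCALS, DISPATCH_LOCALS)
print("inlined frame:    ", inlined, "bytes")
print("out-of-line peak: ", out_of_line, "bytes")
```

In the inlined model the whole amount stays reserved for as long as the dispatcher is on the stack, including while the call chain descends into the filesystem and block layers.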

Comment 17 Dave Jones 2004-06-15 00:59:29 UTC
Did this get fixed in the final/errata kernels?


Comment 18 J. Bruce Fields 2004-06-16 18:41:37 UTC
Dave Jones asked:
> did this get fixed in the final/errata kernels ?

I'm not testing redhat kernels, but I only submitted the patch above
to Neil just now (sorry for the delay).  So it'll be at least
2.6.8-rc1 till it shows up in Linus's kernel.

Comment 19 J. Bruce Fields 2004-06-16 18:44:07 UTC
> I'm not testing redhat kernels, but I only submitted the patch above
> to Neil just now (sorry for the delay).  So it'll be at least
> 2.6.8-rc1 till it shows up in Linus's kernel.

(Though it would be nice to know whether this actually solves the
original submitter's problem, which I haven't reproduced.  I'm just
assuming it's a stack overflow.  The patch above reduces the stack
usage of nfsd4_proc_compound from 1020 bytes to 72.)
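
To put the two figures quoted above in proportion, the saving can be compared with the size of a kernel stack. The report does not say how large the stack was on the affected kernel, so the 4 KiB and 8 KiB values below are assumptions, as is the helper name `share`; only the 1020 and 72 byte figures come from the comment.

```python
BEFORE = 1020  # bytes, figure quoted in the comment above
AFTER = 72     # bytes, figure quoted in the comment above

# ASSUMPTION: candidate per-thread kernel stack sizes. The bug report
# does not state which size the affected kernel was built with.
STACK_SIZES = (4096, 8192)


def share(nbytes, stack_size):
    """Fraction of the stack taken up by nbytes."""
    return nbytes / stack_size


saved = BEFORE - AFTER
print("bytes saved:", saved)
for size in STACK_SIZES:
    print(f"{size}-byte stack: before {share(BEFORE, size):.1%}, "
          f"after {share(AFTER, size):.1%}, freed {share(saved, size):.1%}")
```

On the smaller assumed stack, a single function giving back close to a quarter of the total would plausibly be enough to make a deep NFS, XFS and device-mapper call chain fit again, which is consistent with the stack overflow theory but does not prove it.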


Comment 20 Carl-Johan Kjellander 2004-08-19 22:28:12 UTC
Dave, it seems to be fixed in the latest kernel.

I'm not able to reproduce the crashes anymore.

# uname -r
2.6.7-1.494.2.2


