Bug 531493 (NFSserver-crash) - NFS{3,4} export of ext4 can hang kernel (unless mounted -nodelalloc)
Keywords:
Status: CLOSED WONTFIX
Alias: NFSserver-crash
Product: Fedora
Classification: Fedora
Component: kernel
Version: 11
Hardware: All
OS: Linux
Priority: low
Severity: urgent
Target Milestone: ---
Assignee: Eric Sandeen
QA Contact: Fedora Extras Quality Assurance
URL:
Whiteboard:
Depends On:
Blocks:
Reported: 2009-10-28 15:40 UTC by Bert DeKnuydt
Modified: 2010-06-28 15:20 UTC

Fixed In Version:
Doc Type: Bug Fix
Doc Text:
Clone Of:
Environment:
Last Closed: 2010-06-28 15:20:22 UTC
Type: ---
Embargoed:



Description Bert DeKnuydt 2009-10-28 15:40:51 UTC
Description of problem:

One can crash an NFS server simply by creating a sufficiently
large file on an exported filesystem.

Version-Release number of selected component (if applicable):

Fedora 11 patched till Oct 28th 2009.

kernels 2.6.30.5-43.fc11 and beyond: crash
kernels 2.6.29.6-213.fc11 and earlier: no problems
 
How reproducible:

Always

Steps to Reproduce:
1. Mount a filesystem over NFS (mount cv01:/ /mnt)
2. dd if=/dev/zero of=/mnt/dump.dmp (or anything else that generates a sufficiently large file)
3. Wait; the server usually hangs after 200MB to 20GB has been written (long before the FS fills up)
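
A bounded variant of step 2 can make the reproduction easier to report. The following is only a sketch: the function name, the chunking, and the size cap are assumptions added here, not part of the original report, which runs a single unbounded dd. It prints a progress line after each chunk, so the last line seen on the client gives a rough idea of how much had been written when the server stopped responding.

```shell
#!/bin/sh
# Sketch only: write zeros to TARGET in fixed-size chunks and print a
# progress line after each chunk. The chunking and the size cap are
# assumptions added here; the original step runs a single unbounded dd.
#
# Usage: write_in_chunks TARGET CHUNK_MB MAX_CHUNKS
write_in_chunks() {
    target=$1
    chunk_mb=$2
    max_chunks=$3
    i=0

    # Start from an empty file.
    : > "$target" || return 1

    while [ "$i" -lt "$max_chunks" ]; do
        # Append CHUNK_MB blocks of 1 MiB each.
        dd if=/dev/zero bs=1048576 count="$chunk_mb" 2>/dev/null >> "$target" || return 1
        i=$((i + 1))
        echo "$((i * chunk_mb)) MiB written"
    done
}
```

For example, `write_in_chunks /mnt/dump.dmp 100 200` would write up to 20000 MiB to the NFS mount from step 1 in 100 MiB chunks. Because of client-side caching, the last reported size is only approximate.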
  
Actual results:

The NFS server hangs: the machine still pings, but is completely frozen.
There is absolutely nothing in the logfiles. After a reboot, the boot is clean, the FS recovers cleanly, and the file is there with more or less the expected size.

Expected results:

No crash.  

Additional info:

FS exported on NFS server is ext4 in LVM; network is ipv4 over GBit;

Things tested:

local mount over NFS on the server: crashes the server too
mounting ext4 on the server with or without barriers makes no difference
mounting on the client over NFS3 or NFS4 makes no difference (NFS3 seems to crash faster)
exporting with 'sync' on the server seems to delay (but not avoid) the crash
kernel-2.6.30.9-94.fc11 from Koji has the problem too

Comment 1 Bert DeKnuydt 2009-10-28 16:30:43 UTC
Tested up to 2.6.31-0.94.rc4.fc12 from Koji.  It has the problem too.

Tested mounting with quota switched off: makes no difference.

Comment 2 Bert DeKnuydt 2009-10-29 15:23:19 UTC
Tested on Debian kernel 2.6.30-2-686, version 2.6.30-8: not affected

Tested on Fedora 11, but with vanilla 2.6.30.9: Crashed too.

Comment 3 Bert DeKnuydt 2009-10-30 15:37:41 UTC
Happens only if underlying FS is ext4.

If the ext4 filesystem is mounted with '-o nodelalloc', the problem disappears.
That is a sufficient (temporary) workaround for me.
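
To confirm that the workaround is actually in effect on the server, one could check the mount table for the option. This is a minimal sketch: the function name, the device path, and the /export mount point are hypothetical, and it assumes ext4 lists nodelalloc in /proc/mounts when delayed allocation is disabled.

```shell
# Hypothetical examples of applying the workaround (names are placeholders):
#   mount -o nodelalloc /dev/mapper/vg0-export /export
#   /etc/fstab:  /dev/mapper/vg0-export  /export  ext4  defaults,nodelalloc  0 2
#
# Return 0 if MOUNTPOINT is listed as ext4 with the nodelalloc option.
# The second argument defaults to /proc/mounts; it is a parameter only so
# the check can be tried against a sample file.
# Usage: has_nodelalloc MOUNTPOINT [MOUNTS_FILE]
has_nodelalloc() {
    mountpoint=$1
    mounts_file=${2:-/proc/mounts}
    awk -v mp="$mountpoint" '
        $2 == mp && $3 == "ext4" {
            # Field 4 is the comma-separated option list.
            n = split($4, opts, ",")
            for (i = 1; i <= n; i++)
                if (opts[i] == "nodelalloc") found = 1
        }
        END { exit(found ? 0 : 1) }
    ' "$mounts_file"
}
```

Usage: `has_nodelalloc /export && echo "delalloc is disabled on /export"`.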

Comment 4 Eric Sandeen 2009-11-05 03:07:00 UTC
I'll try to reproduce, but when things hang, doing:

# echo w > /proc/sysrq-trigger
# dmesg > dmesg.out

will give us traces of all the stuck tasks.

Comment 5 Bert DeKnuydt 2009-11-05 14:44:17 UTC
Neither remote logins nor the console respond, so I cannot easily save the
result to a file. (If really needed, I can attach a serial console.)

So I used Alt-SysRq-w instead.

That gives something like this (typed over from the screen):

<Alt-SysRq-w>
hald-addon-storage
automount
nscd
nfsd4
nfsd
master


After some time (minutes) of apparent idleness, this trace appears:
(again, typed over, so ignore the typos)

spin_unlock_bh
rpc_execute
rpc_execute
rpc_run_task
nfs_write_rpcsetup
lookup_tag
ext4_get_blocks_wrap
mpage_da_map_blocks
mpage_da_write_page
write_cache_pages
mpage_da_writepage
ext4_da_writepages
da_write_pages
writeback_single_inode
dm_any_congested
generic_sb_inodes
writeback_inodes
background_writeout
pdflush
background_writeout
pdflush
kthread
kthread
kthread_helper

I tried it several times: the 10 top function names change each time.

Anything else I can do?

Comment 6 Bert DeKnuydt 2009-11-09 11:50:47 UTC
2.6.30.9-96.fc11 hangs too, after just a couple of GB written.

Comment 7 Eric Sandeen 2009-11-13 21:44:19 UTC
Ok, trying to reproduce this now; sorry for the delay, juggling a few bugs lately.

-Eric

Comment 8 Bert DeKnuydt 2009-12-02 12:24:51 UTC
Cannot reproduce with 2.6.30.9-99.fc11.  So maybe this was related to

  * Mon Nov 16 2009 Eric Sandeen <sandeen> 2.6.30.9-97
  - Fix ext4 preallocation-related corruption (#513221)

2.6.31.6-145.fc12 is still affected though.

Comment 9 Eric Sandeen 2009-12-02 16:44:30 UTC
The patch for f11 came from 2.6.31, so I doubt that's the fix, if the fc12 kernel still has the problem.  Also, nfs shouldn't be doing any preallocation AFAIK.

I haven't yet been able to reproduce this but will keep trying ...

-Eric

Comment 10 Alan Brown 2009-12-07 16:21:20 UTC
This also seems to manifest on RHEL5.4 GFS

Comment 11 Bert DeKnuydt 2009-12-10 16:14:54 UTC
2.6.31.6-162.fc12.i686.PAE (Fedora 12) is affected.

Comment 12 Bert DeKnuydt 2009-12-11 09:01:12 UTC
2.6.31.6-166.fc12.i686 (Fedora 12) is affected.

Comment 13 Eric Sandeen 2009-12-15 15:51:13 UTC
Alan, GFS is likely a completely different bug; please escalate the RHEL5.4 issue through your support contacts.

Bert, thanks for the updates; I don't expect that incremental fc12 kernels -will- fix it, because the root cause has not yet been identified and fixed.

If only I could reproduce it ... If there is any possible way to get sysrq-t output off the box when it's stuck, in unedited format, that would be helpful.

Thanks,
-eric
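
One possible way to get unedited SysRq output off the box, besides the serial console mentioned in comment 5, is the kernel's netconsole module, which sends kernel messages over UDP to another host. The sketch below only builds the module parameter; the function name, IP addresses, and interface name are placeholders, and whether netconsole still transmits once this particular hang has set in is untested here.

```shell
# Build the netconsole= module parameter in the form given by the kernel's
# netconsole documentation:
#   netconsole=[src-port]@[src-ip]/[dev],[tgt-port]@<tgt-ip>/[tgt-mac]
# Ports 6665 (source) and 6666 (target) are that document's defaults.
# An empty target MAC leaves the choice to the kernel default.
# Usage: netconsole_param SRC_IP DEV TGT_IP [TGT_MAC]
netconsole_param() {
    src_ip=$1
    dev=$2
    tgt_ip=$3
    tgt_mac=${4:-}
    printf 'netconsole=6665@%s/%s,6666@%s/%s\n' \
        "$src_ip" "$dev" "$tgt_ip" "$tgt_mac"
}
```

On the NFS server, before starting the write test, this could be used as `modprobe netconsole "$(netconsole_param 192.0.2.10 eth0 192.0.2.20)"`, with a second machine at the target address listening on UDP port 6666 (for example with netcat; the exact flags differ between netcat variants). The console log level may need raising first, for example with `dmesg -n 8`, so that the Alt-SysRq-t output is sent to the console at all.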

Comment 14 Bug Zapper 2010-04-28 10:59:54 UTC
This message is a reminder that Fedora 11 is nearing its end of life.
Approximately 30 (thirty) days from now Fedora will stop maintaining
and issuing updates for Fedora 11.  It is Fedora's policy to close all
bug reports from releases that are no longer maintained.  At that time
this bug will be closed as WONTFIX if it remains open with a Fedora 
'version' of '11'.

Package Maintainer: If you wish for this bug to remain open because you
plan to fix it in a currently maintained version, simply change the 'version' 
to a later Fedora version prior to Fedora 11's end of life.

Bug Reporter: Thank you for reporting this issue and we are sorry that 
we may not be able to fix it before Fedora 11 is end of life.  If you 
would still like to see this bug fixed and are able to reproduce it 
against a later version of Fedora please change the 'version' of this 
bug to the applicable version.  If you are unable to change the version, 
please add a comment here and someone will do it for you.

Although we aim to fix as many bugs as possible during every release's 
lifetime, sometimes those efforts are overtaken by events.  Often a 
more recent Fedora release includes newer upstream software that fixes 
bugs or makes them obsolete.

The process we are following is described here: 
http://fedoraproject.org/wiki/BugZappers/HouseKeeping

Comment 15 Bug Zapper 2010-06-28 15:20:22 UTC
Fedora 11 changed to end-of-life (EOL) status on 2010-06-25. Fedora 11 is 
no longer maintained, which means that it will not receive any further 
security or bug fix updates. As a result we are closing this bug.

If you can reproduce this bug against a currently maintained version of 
Fedora please feel free to reopen this bug against that version.

Thank you for reporting this bug and we are sorry it could not be fixed.

