Red Hat Bugzilla – Bug 605884
NFSv4 hangs on large deletes
Last modified: 2010-12-03 08:47:46 EST
Description of problem:
An NFSv4 mount will hang, seemingly on both client & server, during moderate read/write loads when a massive delete is initiated.
Version-Release number of selected component (if applicable):
client: kernel-PAE-126.96.36.199-99.fc12.i686 nfs-utils-1.2.1-5.fc12.i686
server: kernel-PAE-188.8.131.52-115.fc12.i686 nfs-utils-1.2.1-5.fc12.i686
How reproducible:
Nearly always, in some form or other, for many Fedora releases going back years. Something always goes wrong when doing massive deletes over NFS. This latest result is a bit different from usual.
Steps to Reproduce:
1. Have a moderately (mid-low) loaded NFSv4 client/server doing reads and writes
2. rm some large files (200GB+), and perhaps some large directories of small files, over NFS from the client (a rough reproduction sketch follows)
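A minimal reproduction sketch, assuming an NFSv4 mount at /mnt/nfs (the path, sizes, and file names are illustrative, not my exact setup):

  # Step 1: generate moderate background read/write load on the mount
  dd if=/dev/zero of=/mnt/nfs/load.bin bs=1M count=4096 &
  # Step 2: start the massive delete that triggers the hang
  rm -rf /mnt/nfs/big-dir /mnt/nfs/huge-200G.img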
Actual results:
All apps on the client that use the NFS server hang after 30-90 secs. rm hangs. The client logs:
Jun 18 23:10:18 pog kernel: nfs: server 192.168.100.2 not responding, still trying
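"not responding, still trying" is the hard-mount wording (hard is the default), so client processes block until the server answers again. The effective mount options can be confirmed on the client with:

  # Show per-mount NFS options (hard/soft, timeo, retrans, vers)
  nfsstat -m
  grep nfs4 /proc/mounts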
Server is more interesting now that I'm at F12:
Jun 18 23:12:56 piles kernel: "echo 0 > /proc/sys/kernel/hung_task_timeout_secs" disables this message.
Jun 18 23:12:56 piles kernel: nfsd D 00050fc6 0 1989 2 0x00000000
Jun 18 23:12:56 piles kernel: f39e5e90 00000046 48a28aed 00050fc6 00002010 00000000 f382426c 00000000
Jun 18 23:12:56 piles kernel: c0a81354 c0a85e60 f382426c c0a85e60 c0a85e60 f32acd80 f39e5e70 c0590243
Jun 18 23:12:56 piles kernel: 00000000 00000000 00050fc6 f3823fc0 c0589f63 e484ce9c 00000013 e484ce9c
Jun 18 23:12:56 piles kernel: Call Trace:
Jun 18 23:12:56 piles kernel: [<c0590243>] ? selinux_inode_permission+0x34/0x3c
Jun 18 23:12:56 piles kernel: [<c0589f63>] ? security_inode_permission+0x1f/0x25
Jun 18 23:12:56 piles kernel: [<c07a571e>] __mutex_lock_common+0xdc/0x12b
Jun 18 23:12:56 piles kernel: [<c07a5784>] __mutex_lock_slowpath+0x17/0x1a
Jun 18 23:12:56 piles kernel: [<c07a5873>] ? mutex_lock+0x30/0x3e
Jun 18 23:12:56 piles kernel: [<c07a5873>] mutex_lock+0x30/0x3e
Jun 18 23:12:56 piles kernel: [<f98ccf10>] ? fh_verify+0x488/0x4dc [nfsd]
Jun 18 23:12:56 piles kernel: [<f98cd784>] fh_lock_nested+0x6b/0xdb [nfsd]
Jun 18 23:12:56 piles kernel: [<f98cd9e5>] nfsd_unlink+0x60/0x149 [nfsd]
Jun 18 23:12:56 piles kernel: [<f98d7265>] nfsd4_remove+0x34/0x68 [nfsd]
Jun 18 23:12:56 piles kernel: [<f98d7231>] ? nfsd4_remove+0x0/0x68 [nfsd]
Jun 18 23:12:56 piles kernel: [<f98d6ef0>] nfsd4_proc_compound+0x1de/0x362 [nfsd]
Jun 18 23:12:56 piles kernel: [<f98ca312>] nfsd_dispatch+0xd6/0x1a2 [nfsd]
Jun 18 23:12:56 piles kernel: [<f97f650c>] svc_process+0x3ba/0x5a7 [sunrpc]
Jun 18 23:12:56 piles kernel: [<f98ca7cb>] nfsd+0xdb/0x11a [nfsd]
Jun 18 23:12:56 piles kernel: [<f98ca6f0>] ? nfsd+0x0/0x11a [nfsd]
Jun 18 23:12:56 piles kernel: [<c045b461>] kthread+0x64/0x69
Jun 18 23:12:56 piles kernel: [<c045b3fd>] ? kthread+0x0/0x69
Jun 18 23:12:56 piles kernel: [<c0409cc7>] kernel_thread_helper+0x7/0x10
(many more traces exactly like the above followed, each with a different pid)
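These traces come from the kernel's hung-task watchdog, which reports tasks stuck in uninterruptible sleep (state D) past a timeout, 120 seconds by default. As the first log line suggests, the watchdog can be inspected or silenced:

  # Read the current hung-task timeout (seconds)
  sysctl kernel.hung_task_timeout_secs
  # Silence the warnings entirely, per the hint in the log above
  echo 0 > /proc/sys/kernel/hung_task_timeout_secs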
I then ran "service nfs restart", and also restarted nfslock, rpcbind, and rpcidmapd.
After restarts I get:
Jun 18 23:15:45 piles rpcbind: rpcbind terminating on signal. Restart with "rpcbind -w"
Jun 18 23:17:33 piles kernel: rpc-srv/tcp: nfsd: got error -32 when sending 56 bytes - shutting down socket
Jun 18 23:17:33 piles kernel: nfsd: last server has exited, flushing export cache
All NFS & related processes are dead at that point. After running the service starts again, they come back up OK and the client magically recovers where it left off about 2-5 minutes later. I had ^C'd the rm so it wouldn't screw things up again.
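For reference, the restart/recovery sequence described above amounts to roughly this, using the F12 SysV service names:

  # First attempt: restart everything (after this the nfsd threads died)
  service nfs restart
  service nfslock restart
  service rpcbind restart
  service rpcidmapd restart
  # Second attempt: plain starts brought it all back up
  service rpcbind start
  service rpcidmapd start
  service nfslock start
  service nfs start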
Expected results:
NFS should be as reliable as a local FS for simple file operations under normal load. Hangs and process deaths should not occur.
Additional info:
My NFS export is a 5TB ext3 FS that is 90% full, so a 200GB delete will always take some time, locally or over NFS (yes, next time I will use XFS, but for now I'm stuck with ext3). Something may be timing out, but these situations need to be handled gracefully rather than crashing the server's nfsd.
The delete I was attempting was of a few 10-30 GB dirs of smaller files along with a couple of 10-250GB files. It appears to have hung between deleting a 250GB file and a 12GB one. The 250GB file appears to have been deleted successfully.
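The server trace above shows nfsd blocked in mutex_lock under nfsd_unlink, which is consistent with one nfsd thread holding the lock through a very long ext3 truncate while the others queue behind it. A possible client-side workaround, purely a sketch with hypothetical paths and step size, is to shrink huge files in steps before unlinking them, so no single server-side truncate runs for minutes:

  # Sketch: shrink a huge file in 10 GiB steps before removing it,
  # keeping each server-side truncate short. Path is an example only.
  f=/mnt/nfs/huge-250G.img
  step=$((10 * 1024 * 1024 * 1024))
  size=$(stat -c %s "$f")
  while [ "$size" -gt "$step" ]; do
      size=$((size - step))
      truncate -s "$size" "$f"   # frees ~10 GiB of blocks per call
  done
  rm -f "$f"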
I have selinux = permissive on the server and disabled on the client. I am soon disabling it on the server too. No sense adding an extra variable to the equation.
This isn't my first issue with NFS. See bug 486264, which is a more client-oriented hang (this one appears more server-oriented).
This message is a reminder that Fedora 12 is nearing its end of life.
Approximately 30 (thirty) days from now Fedora will stop maintaining
and issuing updates for Fedora 12. It is Fedora's policy to close all
bug reports from releases that are no longer maintained. At that time
this bug will be closed as WONTFIX if it remains open with a Fedora
'version' of '12'.
Package Maintainer: If you wish for this bug to remain open because you
plan to fix it in a currently maintained version, simply change the 'version'
to a later Fedora version prior to Fedora 12's end of life.
Bug Reporter: Thank you for reporting this issue and we are sorry that
we may not be able to fix it before Fedora 12 is end of life. If you
would still like to see this bug fixed and are able to reproduce it
against a later version of Fedora please change the 'version' of this
bug to the applicable version. If you are unable to change the version,
please add a comment here and someone will do it for you.
Although we aim to fix as many bugs as possible during every release's
lifetime, sometimes those efforts are overtaken by events. Often a
more recent Fedora release includes newer upstream software that fixes
bugs or makes them obsolete.
The process we are following is described here:
Fedora 12 changed to end-of-life (EOL) status on 2010-12-02. Fedora 12 is
no longer maintained, which means that it will not receive any further
security or bug fix updates. As a result we are closing this bug.
If you can reproduce this bug against a currently maintained version of
Fedora please feel free to reopen this bug against that version.
Thank you for reporting this bug and we are sorry it could not be fixed.