605884 – NFSv4 hangs on large deletes

Summary: NFSv4 hangs on large deletes

Keywords:
Status:	CLOSED WONTFIX
Alias:	None
Product:	Fedora
Classification:	Fedora
Component:	kernel
Sub Component:
Version:	12
Hardware:	i686
OS:	Linux
Priority:	low
Severity:	low
Target Milestone:	---
Assignee:	Kernel Maintainer List
QA Contact:	Fedora Extras Quality Assurance
Docs Contact:
URL:
Whiteboard:
Depends On:
Blocks:

Reported:	2010-06-19 05:20 UTC by Trevor Cordes
Modified:	2010-12-03 13:47 UTC
CC List:	7 users
Fixed In Version:
Clone Of:
Environment:
Last Closed:	2010-12-03 13:47:46 UTC
Type:	---
Embargoed:
Dependent Products:

Description Trevor Cordes 2010-06-19 05:20:16 UTC

Description of problem:
NFSv4 mount will hang, seemingly on both client & server, during moderate read/write loads when a massive delete is initiated.

Version-Release number of selected component (if applicable):
client: kernel-PAE-2.6.32.11-99.fc12.i686  nfs-utils-1.2.1-5.fc12.i686
server: kernel-PAE-2.6.32.12-115.fc12.i686 nfs-utils-1.2.1-5.fc12.i686

How reproducible:
Nearly always, in some form or other, across many Fedora releases going back years.  Something always goes wrong when doing massive deletes over NFS.  This latest result is a bit different from usual.

Steps to Reproduce:
1. Have a moderately (mid-low) loaded NFSv4 client/server doing R/W's
2. rm some large files (200GB+) and maybe some large dirs of small files over NFS from the client
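
A minimal sketch of step 2 for anyone trying to reproduce this, assuming Python 3 on the client.  The helper names (build_tree, timed_delete) and the sizes are hypothetical and not from the original report.  It uses a temporary directory so it runs anywhere; for a real test, point it at a directory on the NFSv4 mount and use sizes comparable to those described here.

```python
import os
import shutil
import tempfile
import time


def build_tree(root, n_small=1000, small_size=4096, big_size=1 << 30):
    """Create one directory of small files plus one large file under root."""
    small_dir = os.path.join(root, "small")
    os.makedirs(small_dir, exist_ok=True)
    payload = b"x" * small_size
    for i in range(n_small):
        with open(os.path.join(small_dir, f"f{i:06d}"), "wb") as fh:
            fh.write(payload)
    big_path = os.path.join(root, "big.bin")
    with open(big_path, "wb") as fh:
        # truncate() only sets the length and typically yields a sparse file.
        # A faithful reproduction needs real data blocks, so write actual
        # data here when testing against a real export.
        fh.truncate(big_size)
    return small_dir, big_path


def timed_delete(small_dir, big_path):
    """Delete the large file, then the directory; return seconds taken by each."""
    t0 = time.monotonic()
    os.unlink(big_path)
    t1 = time.monotonic()
    shutil.rmtree(small_dir)
    t2 = time.monotonic()
    return t1 - t0, t2 - t1


if __name__ == "__main__":
    # Assumption: replace this with a directory on the NFSv4 mount.
    root = tempfile.mkdtemp(prefix="nfs-delete-test-")
    small_dir, big_path = build_tree(root, n_small=100, big_size=1 << 20)
    print(timed_delete(small_dir, big_path))
    shutil.rmtree(root)
```

The two timings are reported separately because the hang described here appeared between file deletions, so it helps to know which kind of delete stalls.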
  
Actual results:
All apps using the NFS server on the client hang after 30-90 secs.  rm hangs.  Client says:
Jun 18 23:10:18 pog kernel: nfs: server 192.168.100.2 not responding, still trying

Server is more interesting now that I'm at F12:
Jun 18 23:12:56 piles kernel: "echo 0 > /proc/sys/kernel/hung_task_timeout_secs" disables this message.
Jun 18 23:12:56 piles kernel: nfsd          D 00050fc6     0  1989	2 0x00000000
Jun 18 23:12:56 piles kernel: f39e5e90 00000046 48a28aed 00050fc6 00002010 00000000 f382426c 00000000
Jun 18 23:12:56 piles kernel: c0a81354 c0a85e60 f382426c c0a85e60 c0a85e60 f32acd80 f39e5e70 c0590243
Jun 18 23:12:56 piles kernel: 00000000 00000000 00050fc6 f3823fc0 c0589f63 e484ce9c 00000013 e484ce9c
Jun 18 23:12:56 piles kernel: Call Trace:
Jun 18 23:12:56 piles kernel: [<c0590243>] ? selinux_inode_permission+0x34/0x3c
Jun 18 23:12:56 piles kernel: [<c0589f63>] ? security_inode_permission+0x1f/0x25
Jun 18 23:12:56 piles kernel: [<c07a571e>] __mutex_lock_common+0xdc/0x12b
Jun 18 23:12:56 piles kernel: [<c07a5784>] __mutex_lock_slowpath+0x17/0x1a
Jun 18 23:12:56 piles kernel: [<c07a5873>] ? mutex_lock+0x30/0x3e
Jun 18 23:12:56 piles kernel: [<c07a5873>] mutex_lock+0x30/0x3e
Jun 18 23:12:56 piles kernel: [<f98ccf10>] ? fh_verify+0x488/0x4dc [nfsd]
Jun 18 23:12:56 piles kernel: [<f98cd784>] fh_lock_nested+0x6b/0xdb [nfsd]
Jun 18 23:12:56 piles kernel: [<f98cd9e5>] nfsd_unlink+0x60/0x149 [nfsd]
Jun 18 23:12:56 piles kernel: [<f98d7265>] nfsd4_remove+0x34/0x68 [nfsd]
Jun 18 23:12:56 piles kernel: [<f98d7231>] ? nfsd4_remove+0x0/0x68 [nfsd]
Jun 18 23:12:56 piles kernel: [<f98d6ef0>] nfsd4_proc_compound+0x1de/0x362 [nfsd]
Jun 18 23:12:56 piles kernel: [<f98ca312>] nfsd_dispatch+0xd6/0x1a2 [nfsd]
Jun 18 23:12:56 piles kernel: [<f97f650c>] svc_process+0x3ba/0x5a7 [sunrpc]
Jun 18 23:12:56 piles kernel: [<f98ca7cb>] nfsd+0xdb/0x11a [nfsd]
Jun 18 23:12:56 piles kernel: [<f98ca6f0>] ? nfsd+0x0/0x11a [nfsd]
Jun 18 23:12:56 piles kernel: [<c045b461>] kthread+0x64/0x69
Jun 18 23:12:56 piles kernel: [<c045b3fd>] ? kthread+0x0/0x69
Jun 18 23:12:56 piles kernel: [<c0409cc7>] kernel_thread_helper+0x7/0x10
(many more traces exactly like the one above followed, each with a different PID)
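
With many near-identical traces in the log, it can help to pull out just the function names.  The following is a rough sketch, assuming Python 3 and the syslog line format shown above; the regular expression and the helper names are my own and not part of any kernel tooling.  Frames prefixed with "?" are ones the kernel found on the stack but could not confirm as part of the call chain, so the sketch lets you filter them out.

```python
import re

# Matches lines such as:
#   "... kernel: [<f98cd9e5>] nfsd_unlink+0x60/0x149 [nfsd]"
# An optional "?" before the symbol marks an unreliable frame.
FRAME_RE = re.compile(
    r"\[<(?P<addr>[0-9a-f]+)>\]\s+(?P<unreliable>\?\s+)?"
    r"(?P<func>[\w.]+)\+0x[0-9a-f]+/0x[0-9a-f]+"
    r"(?:\s+\[(?P<module>\w+)\])?"
)


def parse_frames(lines):
    """Return (function, module, reliable) tuples, one per stack-frame line."""
    frames = []
    for line in lines:
        m = FRAME_RE.search(line)
        if m:
            reliable = m.group("unreliable") is None
            frames.append((m.group("func"), m.group("module"), reliable))
    return frames


def reliable_functions(lines):
    """Return only the function names of frames not marked with '?'."""
    return [func for func, _module, reliable in parse_frames(lines) if reliable]


if __name__ == "__main__":
    sample = [
        "Jun 18 23:12:56 piles kernel: [<c07a5873>] ? mutex_lock+0x30/0x3e",
        "Jun 18 23:12:56 piles kernel: [<c07a5873>] mutex_lock+0x30/0x3e",
        "Jun 18 23:12:56 piles kernel: [<f98cd784>] fh_lock_nested+0x6b/0xdb [nfsd]",
        "Jun 18 23:12:56 piles kernel: [<f98cd9e5>] nfsd_unlink+0x60/0x149 [nfsd]",
    ]
    print(reliable_functions(sample))
```

Applied to the trace above, the reliable frames show the nfsd thread waiting in mutex_lock, reached via fh_lock_nested from nfsd_unlink.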

I then ran "service nfs restart", and also restarted nfslock, rpcbind, and rpcidmapd.

After restarts I get:
Jun 18 23:15:45 piles rpcbind: rpcbind terminating on signal. Restart with "rpcbind -w"
Jun 18 23:17:33 piles kernel: rpc-srv/tcp: nfsd: got error -32 when sending 56 bytes - shutting down socket
Jun 18 23:17:33 piles kernel: nfsd: last server has exited, flushing export cache

All NFS and related processes are dead at that point.  After starting the services again, they come back up OK and the client magically recovers where it left off about 2-5 minutes later.  I had ^C'd the rm so it wouldn't trigger the hang again.

Expected results:
NFS should be as reliable as a local FS for simple file operations under normal load.  Hangs and process deaths should not occur.

Additional info:
My NFS export is a 5TB ext3 FS which is 90% full, so a 200GB delete will always take a bit of time, local or over NFS (yes, next time I will use XFS, but for now I'm stuck with ext3).  Something may be timing out, but these situations need to be handled gracefully and must not crash the server's nfsd.
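
Since a single delete of a very large file is slow on ext3, one possible workaround is to shrink the file in steps before removing it, so that each individual operation is short.  This was not tried in this report and whether it avoids the hang is untested; the sketch below assumes Python 3, and the helper name shrink_and_unlink and the step size are hypothetical.

```python
import os
import tempfile


def shrink_and_unlink(path, step=1 << 30):
    """Truncate path in step-sized decrements, then unlink it.

    Returns the number of truncate calls made.
    """
    if step <= 0:
        raise ValueError("step must be positive")
    size = os.path.getsize(path)
    calls = 0
    while size > 0:
        size = max(0, size - step)
        os.truncate(path, size)  # each call frees at most `step` bytes
        calls += 1
    os.unlink(path)  # the file is empty by now, so this should be quick
    return calls


if __name__ == "__main__":
    # Small demonstration on a temporary file; on a real system `path`
    # would be the large file on the NFS mount.
    fd, path = tempfile.mkstemp()
    with os.fdopen(fd, "wb") as fh:
        fh.write(b"\0" * 2500)
    print(shrink_and_unlink(path, step=1000))
```

A smaller step means more round trips to the server but less work per request; the default of 1 GiB here is an arbitrary starting point, not a tuned value.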

The delete I was attempting was of a few 10-30 GB dirs of smaller files along with a couple of 10-250GB files.  It appears to have hung between deleting a 250GB file and a 12GB one.  The 250GB file appears to have been deleted successfully.

I have selinux = permissive on the server and disabled on the client.  I am soon disabling it on the server too.  No sense adding an extra variable to the equation.

This isn't my first problem with NFS.  See bug 486264, which is a more client-oriented hang (this one appears more server-oriented).

Comment 1 Bug Zapper 2010-11-03 13:03:44 UTC

This message is a reminder that Fedora 12 is nearing its end of life.
Approximately 30 (thirty) days from now Fedora will stop maintaining
and issuing updates for Fedora 12.  It is Fedora's policy to close all
bug reports from releases that are no longer maintained.  At that time
this bug will be closed as WONTFIX if it remains open with a Fedora 
'version' of '12'.

Package Maintainer: If you wish for this bug to remain open because you
plan to fix it in a currently maintained version, simply change the 'version' 
to a later Fedora version prior to Fedora 12's end of life.

Bug Reporter: Thank you for reporting this issue and we are sorry that 
we may not be able to fix it before Fedora 12 is end of life.  If you 
would still like to see this bug fixed and are able to reproduce it 
against a later version of Fedora please change the 'version' of this 
bug to the applicable version.  If you are unable to change the version, 
please add a comment here and someone will do it for you.

Although we aim to fix as many bugs as possible during every release's 
lifetime, sometimes those efforts are overtaken by events.  Often a 
more recent Fedora release includes newer upstream software that fixes 
bugs or makes them obsolete.

The process we are following is described here: 
http://fedoraproject.org/wiki/BugZappers/HouseKeeping

Comment 2 Bug Zapper 2010-12-03 13:47:46 UTC

Fedora 12 changed to end-of-life (EOL) status on 2010-12-02. Fedora 12 is 
no longer maintained, which means that it will not receive any further 
security or bug fix updates. As a result we are closing this bug.

If you can reproduce this bug against a currently maintained version of 
Fedora please feel free to reopen this bug against that version.

Thank you for reporting this bug and we are sorry it could not be fixed.
