Red Hat Bugzilla – Bug 402581
Deadlock while performing nfs operations.
Last modified: 2011-05-24 05:57:48 EDT
A user is running stress tests on ppc64 machines running on kernel 2.6.9-65 and
they are facing deadlock situations. It is possible to login to the machine via
ssh and interact with it, but even trivial commands (e.g. ls /) hang in an
Using Sysrq T, a deadlock situation was detected which appears to be caused
because of the changes introduced for patch from bz 228292. Pasting the user's
Ok, looking over the latest stack dumps, I find the following two:
events/1 D 0000000000000000 8320 19 1 480 20 18 (L-TLB)
rpciod D 0000000000000000 8592 3128 1 3129 3093 (L-TLB)
So what is happening is that nfs4_close_state has scheduled
nfs4_close_state_work. That is now being worked on, but had to sleep and wait
for rpciod. The problem is that in the meantime, rpciod started to deal with
shutting down a transport, requiring flushing it's workqueue. Since the
workqueue is sleeping on rpciod, the workqueue will never finish flushing, and
thus will never wake up rpciod, which means rpciod will never complete the work
nfs4_close_state_work was waiting for...
nfs4_close_state_work was introduced by the fix for bz 228292, so I think we
need to revisit that bug fix. Perhaps one solution would be to create a
separate work queue for either nfs or sunrpc (or both).
Thanks. I think we may end up having to use a different workqueue for
nfs4_close_state_work. Not sure if we'll require an entirely separate one, but
the kevents workqueue might not be suitable...
I presume this is happening in a race between nfs4 closes while unmounting a
different filesystem. Do they have a way to reproduce this at will?
Now that I've looked over this a bit, I think I see how this deadlocked. They
had a race while doing a close on a nfs4 file while at the same time tearing
down an rpc_client.
The rpc client was either a one-shot client or just one that was dead, but it
could even have been completely unrelated to the nfs4 connection. Either way,
rpciod called flush_scheduled_work() while a nfsv4 close was in progress and
this deadlock occurred.
The easiest fix is to create a new workqueue and queue nfs4_close_state_work to
that workqueue. We could also queue it to an existing workqueue that won't be
subject to the same deadlock, though I don't know of any appropriate candidates
We could also consider backing out the patch that added nfs4_close_state_work in
the first place and doing something closer to what upstream did to fix it. That
would be a significant change, however.
A reproducer for this would also be nice. I ran quite a bit of stress testing
with this patch and never hit this deadlock, so if they can offer a way to
reproduce it more readily that would be helpful.
Created attachment 271811 [details]
patch -- add new workqueue for handling nfs4 scheduled work
In some of my earlier passes in bug 228292, I had a dedicated workqueue for
handling scheduled work. It worked well, but at the time I didn't see the need
for a dedicated workqueue. That need seems to now be evident.
This patch adds that back and changes nfs4_close_state to queue its work to
that workqueue. I've done some testing with fsstress and it seems to work as
expected. Having a way to reproduce the customer's deadlock (or having them
test it) would be ideal.
I'll see about adding this patch to my test kernel in the near future.
I'm building some test kernels with this patch now and should hopefully have
them up on my people page a little later today. Once I do, it would be nice if
they could test them and see if they can still reproduce this deadlock
(presuming that it is easily reproducible).
Once I have them up, I'll reset this to NEEDINFO...
I've built some kernels with the above patch and put them on my people page:
...if the customer is willing and able, can they test it someplace non-critical
and let me know if the problem is still reproducible?
If this does fix it, I'm still not sure how acceptable this patch will be, but
knowing whether it fixes it is a good way to sanity check the diagnosis.
Thanks for the analysis, Frank.
Actually...it's not reclaiming locks, it's reclaiming open state. The concept is
similar though. Something happened to make the state of open files get out of
sync between the client and server, and they fell into state recovery mode. A
server crash or reboot looks likely here. It seems like the state reclaimer
thread only gets spawned when allocating a new NFS client.
A long list of open states doesn't necessarily mean that the list was corrupt.
IIRC, each open file will have an open_state. If the client happened to have a
large number of open files then the whole list could be valid. Do we know what
actually made it crash here?
This error is interesting:
nfs4_reclaim_open_state: unhandled error -116. Zeroing state
I think -116 is -ESTALE, so perhaps it's trying to reclaim a bunch of stale NFS
file handles? Maybe the server rebooted and the fsid of the exported filesystem
Either way, I think this bug isn't likely to be related to the original
deadlock. It should probably get a new BZ.
If you open one, let me know the number and I'll plan to grab it.
This request was evaluated by Red Hat Product Management for inclusion in a Red
Hat Enterprise Linux maintenance release. Product Management has requested
further review of this request by Red Hat Engineering, for potential
inclusion in a Red Hat Enterprise Linux Update release for currently deployed
products. This request is not yet committed for inclusion in an Update
Committed in 68.27.EL . RPMS are available at http://people.redhat.com/vgoyal/rhel4/
An advisory has been issued which should help the problem
described in this bug report. This report is therefore being
closed with a resolution of ERRATA. For more information
on therefore solution and/or where to find the updated files,
please follow the link below. You may reopen this bug report
if the solution does not work for you.
(In reply to comment #9)
> This error is interesting:
> nfs4_reclaim_open_state: unhandled error -116. Zeroing state
> I think -116 is -ESTALE, so perhaps it's trying to reclaim a bunch of stale NFS
> file handles? Maybe the server rebooted and the fsid of the exported filesystem
> Either way, I think this bug isn't likely to be related to the original
> deadlock. It should probably get a new BZ.
Jeff, have you opened an entry for this one? We are seeing it on RHEL5 with kernel-2.6.18-194.32.1.el5.
Another potentially related:
BUG: soft lockup - CPU#15 stuck for 10s! [184.108.40.206-r:30868]
Modules linked in: md5 autofs4 hidp nfs fscache nfs_acl rfcomm l2cap bluetooth rpcsec_gss_krb5 auth_rpcgss testmgr_cipher testmgr aead crypto_blkcipher crypto_algapi des lockd sunrpc ip_conntrack_netbios_ns iptable_filter ipt_MASQUERADE iptable_nat ip_nat ip_conntrack nfnetlink ip_tables ip6t_REJECT xt_tcpudp ip6table_filter ip6_tables x_tables cpufreq_ondemand acpi_cpufreq freq_table rdma_ucm(U) ib_sdp(U) rdma_cm(U) iw_cm(U) ib_addr(U) ib_ipoib(U) ipoib_helper(U) ib_cm(U) ib_sa(U) ipv6 xfrm_nalgo crypto_api ib_uverbs(U) ib_umad(U) iw_cxgb3(U) cxgb3(U) mlx4_ib(U) ib_mthca(U) ib_mad(U) ib_core(U) dm_mirror dm_multipath scsi_dh video hwmon backlight sbs i2c_ec button battery asus_acpi acpi_memhotplug ac parport_pc lp parport nvidia(PU) joydev shpchp sg mlx4_core(U) igb i2c_i801 8021q i2c_core serio_raw pcspkr dm_raid45 dm_message dm_region_hash dm_log dm_mod dm_mem_cache ahci libata sd_mod scsi_mod ext3 jbd uhci_hcd ohci_hcd ehci_hcd
Pid: 30868, comm: 220.127.116.11-r Tainted: P 2.6.18-164.11.1.el5 #1
RIP: 0010:[<ffffffff89165575>] [<ffffffff89165575>] :nfs:nfs4_reclaim_open_state+0x135/0x150
RSP: 0018:ffff81013a985e90 EFLAGS: 00000246
RAX: ffff810083c76bc0 RBX: ffff81011d7b9340 RCX: ffffffff80309c28
RDX: ffff8101664a10c8 RSI: ffff8101664a10c0 RDI: ffffffff89182f50
RBP: 0000000000000007 R08: ffffffff80309c28 R09: ffffffffffffff10
R10: ffff81013a985b30 R11: 0000000000000280 R12: ffff8102bfc27dc0
R13: 00000002676f2d3d R14: ffff81013a985b30 R15: ffff8103106fe5c0
FS: 0000000000000000(0000) GS:ffff81033fcc10c0(0000) knlGS:0000000000000000
CS: 0010 DS: 0018 ES: 0018 CR0: 000000008005003b
CR2: 00002aaaab1e4000 CR3: 0000000000201000 CR4: 00000000000006e0
This particular issue seems to be closer to bz 526888 which should have been fixed in your version of the kernel. There have been other significant changes since 2.6.18-164.11.1.el5. Could you please check with a later version of the kernel. If you still see the problem with that issue, please open a case with Red Hat Support.