Bug 702509

Summary: apparent deadlock/hang after nfs4_reclaim_open_state messages
Product: Red Hat Enterprise Linux 5 Reporter: Kevan Carstensen <kacarstensen>
Component: kernelAssignee: nfs-maint
Status: CLOSED NOTABUG QA Contact: Red Hat Kernel QE team <kernel-qe>
Severity: urgent Docs Contact:
Priority: medium    
Version: 5.6CC: bfields, dhowells, harshula, jlayton, rdassen, rwheeler, sprabhu, steved
Target Milestone: rc   
Target Release: ---   
Hardware: x86_64   
OS: Linux   
Whiteboard:
Fixed In Version: Doc Type: Bug Fix
Doc Text:
Story Points: ---
Clone Of: Environment:
Last Closed: 2012-01-18 19:50:37 UTC Type: ---
Regression: --- Mount Type: ---
Documentation: --- CRM:
Verified Versions: Category: ---
oVirt Team: --- RHEL 7.3 requirements from Atomic Host:
Cloudforms Team: --- Target Upstream Version:

Description Kevan Carstensen 2011-05-05 22:13:51 UTC
Description of problem:

The kernel logs something like the following:

May  3 16:53:39 gamboge kernel: nfs4_reclaim_open_state: unhandled error -116. Zeroing state
May  3 16:53:39 gamboge kernel: nfs4_reclaim_open_state: unhandled error -116. Zeroing state
May  3 16:53:39 gamboge kernel: nfs4_reclaim_open_state: unhandled error -10026. Zeroing state
May  3 16:53:39 gamboge kernel: nfs4_reclaim_open_state: unhandled error -10026. Zeroing state
May  3 16:53:39 gamboge kernel: nfs4_reclaim_open_state: unhandled error -116. Zeroing state
May  3 16:53:39 gamboge kernel: nfs4_reclaim_open_state: unhandled error -116. Zeroing state
May  3 16:53:39 gamboge kernel: nfs4_reclaim_open_state: unhandled error -116. Zeroing state
May  3 16:53:39 gamboge kernel: nfs4_reclaim_open_state: unhandled error -116. Zeroing state
May  3 16:53:39 gamboge kernel: nfs4_reclaim_open_state: unhandled error -116. Zeroing state
May  3 16:53:39 gamboge kernel: nfs4_reclaim_open_state: unhandled error -116. Zeroing state
May  3 16:53:39 gamboge kernel: nfs4_reclaim_open_state: unhandled error -116. Zeroing state
May  3 16:53:39 gamboge kernel: nfs4_reclaim_open_state: unhandled error -116. Zeroing state
May  3 16:53:39 gamboge kernel: nfs4_reclaim_open_state: unhandled error -116. Zeroing state
May  3 16:53:39 gamboge kernel: nfs4_reclaim_open_state: unhandled error -10026. Zeroing state
May  3 16:53:39 gamboge kernel: nfs4_reclaim_open_state: unhandled error -10026. Zeroing state
May  3 16:53:40 gamboge kernel: nfs4_reclaim_open_state: unhandled error -10026. Zeroing state
May  3 16:53:40 gamboge kernel: nfs4_reclaim_open_state: unhandled error -10026. Zeroing state
May  3 16:53:40 gamboge kernel: nfs4_reclaim_open_state: unhandled error -10026. Zeroing state
May  3 16:53:40 gamboge kernel: nfs4_reclaim_open_state: unhandled error -10026. Zeroing state
May  3 16:53:40 gamboge kernel: nfs4_reclaim_open_state: unhandled error -10026. Zeroing state
May  3 16:53:40 gamboge kernel: nfs4_reclaim_open_state: unhandled error -10026. Zeroing state
May  3 16:53:40 gamboge kernel: nfs4_reclaim_open_state: unhandled error -10026. Zeroing state
May  3 16:53:40 gamboge kernel: nfs4_reclaim_open_state: unhandled error -10026. Zeroing state
May  3 16:53:40 gamboge kernel: nfs4_reclaim_open_state: unhandled error -10026. Zeroing state
May  3 16:53:40 gamboge kernel: nfs4_reclaim_open_state: unhandled error -10026. Zeroing state
May  3 16:53:40 gamboge kernel: nfs4_reclaim_open_state: unhandled error -10026. Zeroing state
May  3 16:53:40 gamboge kernel: nfs4_reclaim_open_state: unhandled error -10026. Zeroing state
May  3 16:53:40 gamboge kernel: nfs4_reclaim_open_state: unhandled error -10026. Zeroing state
May  3 16:53:40 gamboge kernel: nfs4_reclaim_open_state: unhandled error -10026. Zeroing state
May  3 16:53:40 gamboge kernel: nfs4_reclaim_open_state: unhandled error -10026. Zeroing state
May  3 16:53:40 gamboge kernel: nfs4_reclaim_open_state: unhandled error -10026. Zeroing state
May  3 16:53:40 gamboge kernel: nfs4_reclaim_open_state: unhandled error -10026. Zeroing state
May  3 16:53:40 gamboge kernel: nfs4_reclaim_open_state: unhandled error -10026. Zeroing state
May  3 16:53:40 gamboge kernel: nfs4_reclaim_open_state: unhandled error -10026. Zeroing state
May  3 16:53:40 gamboge kernel: nfs4_reclaim_open_state: unhandled error -10026. Zeroing state
May  3 16:53:40 gamboge kernel: nfs4_reclaim_open_state: unhandled error -10026. Zeroing state

and then (apparently; in both cases the failure was detected by our watchdog and reset before we could investigate it) the system becomes unresponsive, eventually being reset by the hardware watchdog after about a minute. This has happened twice on the machine in question. We recently upgraded a bunch of packages pn the system, among them the kernel. We were running kernel-2.6.18-194.26.1.el5.x86_64 before. We're running kernel-2.6.18-238.5.1.el5.x86_64 now. We have only ever seen this issue with the newer kernel.

The error output above is from the first event. The second event follows the same general pattern; a mixture of codes -116 and -10026, followed by a long stream of code -10026.

If you know of any way to work around this problem while it is being fixed, please let us know.

Version-Release number of selected component (if applicable):

kernel-2.6.18-238.5.1.el5.x86_64

How reproducible:

We don't know how to reproduce the problem. It has occurred twice on the same system, and on none of the other systems that we have running this kernel.

Steps to Reproduce:
1.
2.
3.
  
Actual results:


Expected results:


Additional info:

Comment 1 J. Bruce Fields 2011-05-06 16:28:11 UTC
Those errors are ESTALE and NFS4ERR_BAD_SEQID, for what it's worth.

Unlike RHEL5, upstream and RHEL6 handle ESTALE in nfs4_reclaim_open_state; In wonder whether we need something like this?:

diff --git a/fs/nfs/nfs4state.c b/fs/nfs/nfs4state.c
index f664bb7..6fbfec7 100644
--- a/fs/nfs/nfs4state.c
+++ b/fs/nfs/nfs4state.c
@@ -733,6 +733,7 @@ static int nfs4_reclaim_open_state(struct nfs4_state_recovery
                        case -ENOENT:
                        case -NFS4ERR_RECLAIM_BAD:
                        case -NFS4ERR_RECLAIM_CONFLICT:
+                       case -ESTALE:
                                /*
                                 * Open state on this file cannot be recovered
                                 * All we can do is revert to using the zero stat

Comment 2 J. Bruce Fields 2011-05-06 17:49:53 UTC
(Whoops, jlayton points out all that does is skip a printk!)

Comment 3 J. Bruce Fields 2011-05-07 00:20:16 UTC
If it was easy it might help to try kernels between -194 and -238 to see if we can pin down when the regression happened.

Also, when you say "The kernel logs something *like* the following"....  Is that a literal excerpt, is there anything interesting before it, and how does it continue?  (In particular, do those final "unhandled error -10026" warnings continue until logging fails completely, or are there only a few more?)

Finally, if you could contact support they could help with the debugging.

Comment 4 Kevan Carstensen 2011-05-07 01:13:13 UTC
That's a literal excerpt. There's nothing interesting before it, and the last statement in the excerpt is the last log message before the logging fails completely. Sorry for the ambiguity on that.

We've opened a support ticket; thanks for the suggestion.