Hide Forgot
Description of problem: The kernel logs something like the following: May 3 16:53:39 gamboge kernel: nfs4_reclaim_open_state: unhandled error -116. Zeroing state May 3 16:53:39 gamboge kernel: nfs4_reclaim_open_state: unhandled error -116. Zeroing state May 3 16:53:39 gamboge kernel: nfs4_reclaim_open_state: unhandled error -10026. Zeroing state May 3 16:53:39 gamboge kernel: nfs4_reclaim_open_state: unhandled error -10026. Zeroing state May 3 16:53:39 gamboge kernel: nfs4_reclaim_open_state: unhandled error -116. Zeroing state May 3 16:53:39 gamboge kernel: nfs4_reclaim_open_state: unhandled error -116. Zeroing state May 3 16:53:39 gamboge kernel: nfs4_reclaim_open_state: unhandled error -116. Zeroing state May 3 16:53:39 gamboge kernel: nfs4_reclaim_open_state: unhandled error -116. Zeroing state May 3 16:53:39 gamboge kernel: nfs4_reclaim_open_state: unhandled error -116. Zeroing state May 3 16:53:39 gamboge kernel: nfs4_reclaim_open_state: unhandled error -116. Zeroing state May 3 16:53:39 gamboge kernel: nfs4_reclaim_open_state: unhandled error -116. Zeroing state May 3 16:53:39 gamboge kernel: nfs4_reclaim_open_state: unhandled error -116. Zeroing state May 3 16:53:39 gamboge kernel: nfs4_reclaim_open_state: unhandled error -116. Zeroing state May 3 16:53:39 gamboge kernel: nfs4_reclaim_open_state: unhandled error -10026. Zeroing state May 3 16:53:39 gamboge kernel: nfs4_reclaim_open_state: unhandled error -10026. Zeroing state May 3 16:53:40 gamboge kernel: nfs4_reclaim_open_state: unhandled error -10026. Zeroing state May 3 16:53:40 gamboge kernel: nfs4_reclaim_open_state: unhandled error -10026. Zeroing state May 3 16:53:40 gamboge kernel: nfs4_reclaim_open_state: unhandled error -10026. Zeroing state May 3 16:53:40 gamboge kernel: nfs4_reclaim_open_state: unhandled error -10026. Zeroing state May 3 16:53:40 gamboge kernel: nfs4_reclaim_open_state: unhandled error -10026. Zeroing state May 3 16:53:40 gamboge kernel: nfs4_reclaim_open_state: unhandled error -10026. Zeroing state May 3 16:53:40 gamboge kernel: nfs4_reclaim_open_state: unhandled error -10026. Zeroing state May 3 16:53:40 gamboge kernel: nfs4_reclaim_open_state: unhandled error -10026. Zeroing state May 3 16:53:40 gamboge kernel: nfs4_reclaim_open_state: unhandled error -10026. Zeroing state May 3 16:53:40 gamboge kernel: nfs4_reclaim_open_state: unhandled error -10026. Zeroing state May 3 16:53:40 gamboge kernel: nfs4_reclaim_open_state: unhandled error -10026. Zeroing state May 3 16:53:40 gamboge kernel: nfs4_reclaim_open_state: unhandled error -10026. Zeroing state May 3 16:53:40 gamboge kernel: nfs4_reclaim_open_state: unhandled error -10026. Zeroing state May 3 16:53:40 gamboge kernel: nfs4_reclaim_open_state: unhandled error -10026. Zeroing state May 3 16:53:40 gamboge kernel: nfs4_reclaim_open_state: unhandled error -10026. Zeroing state May 3 16:53:40 gamboge kernel: nfs4_reclaim_open_state: unhandled error -10026. Zeroing state May 3 16:53:40 gamboge kernel: nfs4_reclaim_open_state: unhandled error -10026. Zeroing state May 3 16:53:40 gamboge kernel: nfs4_reclaim_open_state: unhandled error -10026. Zeroing state May 3 16:53:40 gamboge kernel: nfs4_reclaim_open_state: unhandled error -10026. Zeroing state May 3 16:53:40 gamboge kernel: nfs4_reclaim_open_state: unhandled error -10026. Zeroing state May 3 16:53:40 gamboge kernel: nfs4_reclaim_open_state: unhandled error -10026. Zeroing state and then (apparently; in both cases the failure was detected by our watchdog and reset before we could investigate it) the system becomes unresponsive, eventually being reset by the hardware watchdog after about a minute. This has happened twice on the machine in question. We recently upgraded a bunch of packages pn the system, among them the kernel. We were running kernel-2.6.18-194.26.1.el5.x86_64 before. We're running kernel-2.6.18-238.5.1.el5.x86_64 now. We have only ever seen this issue with the newer kernel. The error output above is from the first event. The second event follows the same general pattern; a mixture of codes -116 and -10026, followed by a long stream of code -10026. If you know of any way to work around this problem while it is being fixed, please let us know. Version-Release number of selected component (if applicable): kernel-2.6.18-238.5.1.el5.x86_64 How reproducible: We don't know how to reproduce the problem. It has occurred twice on the same system, and on none of the other systems that we have running this kernel. Steps to Reproduce: 1. 2. 3. Actual results: Expected results: Additional info:
Those errors are ESTALE and NFS4ERR_BAD_SEQID, for what it's worth. Unlike RHEL5, upstream and RHEL6 handle ESTALE in nfs4_reclaim_open_state; In wonder whether we need something like this?: diff --git a/fs/nfs/nfs4state.c b/fs/nfs/nfs4state.c index f664bb7..6fbfec7 100644 --- a/fs/nfs/nfs4state.c +++ b/fs/nfs/nfs4state.c @@ -733,6 +733,7 @@ static int nfs4_reclaim_open_state(struct nfs4_state_recovery case -ENOENT: case -NFS4ERR_RECLAIM_BAD: case -NFS4ERR_RECLAIM_CONFLICT: + case -ESTALE: /* * Open state on this file cannot be recovered * All we can do is revert to using the zero stat
(Whoops, jlayton points out all that does is skip a printk!)
If it was easy it might help to try kernels between -194 and -238 to see if we can pin down when the regression happened. Also, when you say "The kernel logs something *like* the following".... Is that a literal excerpt, is there anything interesting before it, and how does it continue? (In particular, do those final "unhandled error -10026" warnings continue until logging fails completely, or are there only a few more?) Finally, if you could contact support they could help with the debugging.
That's a literal excerpt. There's nothing interesting before it, and the last statement in the excerpt is the last log message before the logging fails completely. Sorry for the ambiguity on that. We've opened a support ticket; thanks for the suggestion.