
Bug 575817

Summary: nfsv4 hangs -- client/server deadlock between commit and delegation return
Product: Red Hat Enterprise Linux 5
Reporter: Jeff Layton <jlayton>
Component: kernel
Assignee: Jeff Layton <jlayton>
Status: CLOSED ERRATA
QA Contact: yanfu,wang <yanwang>
Severity: medium
Docs Contact:
Priority: medium
Version: 5.4
CC: bfields, jlayton, marcus, robh, rwheeler, simone, steved, yanwang
Target Milestone: rc
Target Release: ---
Hardware: All
OS: Linux
Whiteboard:
Fixed In Version:
Doc Type: Bug Fix
Doc Text:
Story Points: ---
Clone Of: 551028
Environment:
Last Closed: 2011-01-13 21:20:40 UTC
Type: ---
Regression: ---
Mount Type: ---
Documentation: ---
CRM:
Verified Versions:
Category: ---
oVirt Team: ---
RHEL 7.3 requirements from Atomic Host:
Cloudforms Team: ---
Target Upstream Version:
Embargoed:

Description Jeff Layton 2010-03-22 13:48:50 UTC
+++ This bug was initially created as a clone of Bug #551028 +++

Description of problem:

We have nfsv4+krb5 server and client both running 2.6.18-164.9.1.el5 kernels.  We are seeing periodic 1-2 minute hangs of the client systems with the following being logged at the time of the hang:

CLIENT:  Dec 28 09:28:30 kernel: decode_op_hdr: reply buffer overflowed in line 2121.
SERVER:  Dec 28 09:27:28 kernel: nfs4_cb: server N.N.N.N not responding, timed out

  Notes: 1)  N.N.N.N is the IP address of the client
         2)  The timestamps show the client error being logged 1 minute after the server error.

My observations have been that the hang is typically 2 minutes long with the following approximate chronology:

t=0min:  The hang starts
t=1min:  The server complains with the nfs4_cb error above
t=2min:  The client complains with the decode_op_hdr error and then recovers.

After 1-2 minutes, things recover and operate normally again.


How reproducible:

The problem is pretty variable, but I can usually reproduce it immediately after a system reboot; it is hard to reproduce after it happens the first time.  I don't have a feel for whether it only happens once after a reboot or, perhaps, once for each user.  But once a user sees this, they may not see it again (or only rarely) as long as they stay logged in.  And it almost always happens when doing something with firefox, which makes me suspicious of something related to file locking (perhaps with byte range locking???).  That's just a WAG since we've seen issues related to byte range locking when using cifs-mounted homedirs that only affected firefox.  But it does seem that there is something firefox is doing that triggers the hangs.


Steps to Reproduce:
1. reboot
2. login via the gui/gnome on a client system using nfsv4+krb5
3. start firefox.

  
Actual results:

The system will likely hang for 1-2 minutes shortly after starting firefox.

Expected results:

No hangs.


Additional info:

I should note that we have been seeing this problem ever since we moved from nfsv3 to nfsv4+krb5 last summer, so the problem existed as far back as the last of the RHEL 5.3 kernels and continues through the newly released 2.6.18-164.9.1 kernel.

--- Additional comment from jlayton on 2010-01-05 08:08:56 EST ---

Is there any firewalling between the client and server here?

--- Additional comment from robh.edu on 2010-01-05 08:27:50 EST ---

No, neither client nor server runs iptables (or any other type of firewall) and both are on the same subnet.

--- Additional comment from jlayton on 2010-01-05 08:46:07 EST ---

Ok...

The decode_op_hdr message means that there was a problem decoding the message from the server. That probably means that the server is sending a malformed reply here. I sort of suspect that it might be empty...

What might be helpful is to get a wire capture of one of these events so we can see what's happening there.

--- Additional comment from robh.edu on 2010-01-05 09:47:46 EST ---

Created an attachment (id=381769)
tcpdump output

The attached is the result of running "tcpdump -s0 -w nfsv4.pcap host 129.79.247.1" on the client (129.79.246.2) during a login and hanging firefox.  The ip address of the server is 129.79.247.1.

The corresponding error messages and timestamps were:

server: 09:06:29 129.79.247.1 kernel: nfs4_cb: server 129.79.246.2 not responding, timed out 
client: 09:07:31 129.79.246.2 kernel: decode_op_hdr: reply buffer overflowed in line 2121.

I believe the start time of firefox was about 9:05:36 and it hung very shortly thereafter.  It came back to life right about 9:07:30.

--- Additional comment from jlayton on 2010-01-07 10:44:18 EST ---

Around frame 6422, the server starts sending back NFS4ERR_DELAY on COMMIT calls. This is probably the source of the stalls. I suspect that the server needs to recall a delegation, but there's a problem talking to the callback channel.

In fact, the renew replies in frames 6581 and 6584 return an error of NFS4ERR_CB_PATH_DOWN.

I see the server sending callbacks to the socket on port 37337 at the time of the NFS4ERR_DELAY errors, but the client never seems to do a DELEGRETURN until much later. I suspect that means that the callback is malformed or something...

Unfortunately, wireshark is really bad at decoding these callbacks. I might need to fix that before we can diagnose this.

--- Additional comment from jlayton on 2010-02-10 15:29:15 EST ---

As luck would have it, someone added support to wireshark for this a week or so ago:

https://bugs.wireshark.org/bugzilla/show_bug.cgi?id=4446

...I applied the patch and built a new wireshark, and now I can decode callbacks.

Good news/bad news:

The good news is that now I can reasonably look at the callback data

The bad news is that I don't see any real problems with it

...all of the callback traffic looks fine. The client gets the callbacks in frames 6423 and 6449 and ACKs them. There is then a long delay (with no retransmissions AFAICT), and then the client eventually replies to the calls in frames 6593 and 6594.

The interesting bit is that all of that happens a few seconds *before* that client side message got logged. Are the clocks all synchronized here?

--- Additional comment from robh.edu on 2010-02-12 16:34:56 EST ---

We are running ntpd on all of our systems and really don't think the clocks should be out of sync.  I just checked the time on the ~200 RHEL systems we manage and they were all within a few milliseconds of the time servers, so there is definitely no rampant clock skew.  I'm curious whether you are asking because you wonder if the hangs might be related to a clock sync problem, or just whether the logging might be out of sync with the actual events?

--- Additional comment from jlayton on 2010-02-12 16:44:33 EST ---

The latter...

I'm just trying to ascertain whether the timestamps in the logs and the captures should all line up. It sounds like they should...

--- Additional comment from simone on 2010-02-26 06:01:34 EST ---

I'm seeing this on a 2.6.18-190.el5 kernel that I installed to fix https://bugzilla.redhat.com/show_bug.cgi?id=529227
What happens is that the machine occasionally still hangs (differently from 529227), and after a reboot this message gets logged.
Is there any kind of test I can do to help?

--- Additional comment from jlayton on 2010-03-16 13:20:49 EDT ---

Ok, looking at the capture a little deeper, it looks like there may be something of a chicken and egg problem here...

6421  41.223408 129.79.246.2 -> 129.79.247.1 NFS V4 COMP Call <EMPTY> PUTFH;COMMIT COMMIT;GETATTR GETATTR
6422  41.223633 129.79.247.1 -> 129.79.246.2 NFS V4 COMP Reply (Call In 6421) <EMPTY> PUTFH;COMMIT COMMIT(10008)

...prior to the above COMMIT call, there was a large flush of WRITEs. The WRITEs were all ok, but the COMMIT returns NFS4ERR_DELAY. The filehandle for the COMMIT is 0x8bd720c1.

6423  41.223659 129.79.247.1 -> 129.79.246.2 NFS V1 CB_COMPOUND Call <EMPTY> CB_RECALL

...just after the server returns NFS4ERR_DELAY, it recalls the delegation for filehandle 0x8bd720c1 (the same one the client is trying to do the COMMIT call on).

6426  56.223098 129.79.246.2 -> 129.79.247.1 NFS V4 COMP Call <EMPTY> PUTFH;COMMIT COMMIT;GETATTR GETATTR
6427  56.223317 129.79.247.1 -> 129.79.246.2 NFS V4 COMP Reply (Call In 6426) <EMPTY> PUTFH;COMMIT COMMIT(10008)

...of course, before the client can return the delegation it needs to finish writing back the dirty data and that entails finishing the COMMIT.

So we seem to have a problem -- the server won't satisfy the COMMIT call that the client needs, probably until the delegation is returned. The client can't return the delegation until the COMMIT succeeds. This back and forth continues for a while.

6580 135.035487 129.79.246.2 -> 129.79.247.1 NFS V4 COMP Call <EMPTY> RENEW RENEW
6581 135.035679 129.79.247.1 -> 129.79.246.2 NFS V4 COMP Reply (Call In 6580) <EMPTY> RENEW RENEW(10048)

...eventually, the client tries to RENEW its clientid, but since the server isn't getting a response it returns an NFS4ERR_CB_PATH_DOWN. The client then reattempts the COMMIT after that, and it succeeds.

6589 146.221255 129.79.246.2 -> 129.79.247.1 NFS V4 COMP Call <EMPTY> PUTFH;COMMIT COMMIT;GETATTR GETATTR
6590 146.221488 129.79.247.1 -> 129.79.246.2 NFS V4 COMP Reply (Call In 6589) <EMPTY> PUTFH;COMMIT COMMIT;GETATTR GETATTR

...since the COMMIT succeeds, the client tries to return the delegation:

6593 146.221669 129.79.246.2 -> 129.79.247.1 NFS V1 CB_COMPOUND Reply (Call In 6423) <EMPTY> CB_RECALL
6594 146.221697 129.79.246.2 -> 129.79.247.1 NFS V1 CB_COMPOUND Reply (Call In 6449) <EMPTY> CB_RECALL
6595 146.221775 129.79.246.2 -> 129.79.247.1 NFS V4 COMP Call <EMPTY> PUTFH;DELEGRETURN DELEGRETURN;GETATTR GETATTR
6600 146.222037 129.79.247.1 -> 129.79.246.2 NFS V4 COMP Reply (Call In 6595) <EMPTY> PUTFH;DELEGRETURN DELEGRETURN;GETATTR GETATTR
6601 146.222050 129.79.247.1 -> 129.79.246.2 NFS V4 COMP Reply (Call In 6598) <EMPTY> PUTFH;DELEGRETURN DELEGRETURN(10025)

...but the server returns NFS4ERR_BAD_STATEID, I think since the RENEW wasn't successful.

As best I can tell, the problem here is likely the server's fault. It should be allowing the COMMIT to proceed, but maybe there was something wrong with the COMMIT call? I'll need to look more closely at the wire traffic and the NFSv4 RFC.

--- Additional comment from jlayton on 2010-03-16 13:23:03 EDT ---

Actually, correction...the first DELEGRETURN succeeds. The second gets the error because the first one has already caused the stateid to be invalidated.

--- Additional comment from jlayton on 2010-03-16 15:08:40 EDT ---

Delegations are handled via leases, and a recall occurs when the lease is broken. nfsd_commit, however, does an nfsd_open, and that causes a break_lease call. There's not much that can be done there though -- a COMMIT call on the wire doesn't provide any open stateid info, just a filehandle. So the server has no way to know whether it corresponds to an fsync() call on an open file descriptor on the client or to some random person on the box running /bin/sync.

From what I can tell too, upstream code is not significantly different in this regard, so I expect that upstream kernels probably have a similar issue.
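
To put that in code terms, the path looks roughly like this. This is a heavily simplified sketch of the 2.6.18-era nfsd as I read it, not a verbatim excerpt; the exact signatures and error handling differ in the real source:

-----------[snip]-----------
/* simplified sketch, not verbatim RHEL5 source */
__be32 nfsd_commit(struct svc_rqst *rqstp, struct svc_fh *fhp,
                   loff_t offset, unsigned long count)
{
        struct file *file;
        __be32 err;

        /* COMMIT carries only a filehandle, so nfsd opens the file itself */
        err = nfsd_open(rqstp, fhp, S_IFREG, MAY_WRITE, &file);
        if (err)
                return err;
        /* ... fsync the file to push the data to stable storage ... */
        nfsd_close(file);
        return err;
}

__be32 nfsd_open(struct svc_rqst *rqstp, struct svc_fh *fhp, int type,
                 int access, struct file **filp)
{
        /* ... filehandle verification and permission checks ... */

        /*
         * The problematic part: opening for write breaks any outstanding
         * lease, which recalls the delegation.  While the recall is
         * pending, the error that bubbles up here is reported to the v4
         * client as NFS4ERR_DELAY (10008) -- exactly what the capture
         * shows on the COMMIT replies.
         */
        host_err = break_lease(inode, O_NONBLOCK |
                               ((access & MAY_WRITE) ? FMODE_WRITE : 0));
        if (host_err)
                goto out_nfserr;

        /* ... dentry_open() and the rest ... */
}
-----------[snip]-----------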

--- Additional comment from simone on 2010-03-17 05:59:15 EDT ---

I am seeing this regularly (multiple times a day) in the following setup:

Solaris 10 NFS server
Centos 5.4 Linux client 
relevant /etc/fstab line:
server:/storage nfs4 bg,intr,nosuid,actimeo=2,rsize=8192,wsize=8192

The low actimeo is because the server changes metadata that the clients must see asap.

--- Additional comment from jlayton on 2010-03-17 13:22:45 EDT ---

Cc'ing Bruce to see if he has any thoughts.

Bruce, any thoughts on the analysis in comment #10? It seems like the client is doing the right thing here and is trying to issue writeback before returning the delegation. Shouldn't the server be allowing that to continue?

--- Additional comment from jlayton on 2010-03-17 13:24:43 EDT ---

(In reply to comment #13)
> I am seeing this regularly (multiple times a day) in the following setup:
> 
> Solaris 10 NFS server
> Centos 5.4 Linux client 
> relevant /etc/fstab line:
> server:/storage nfs4 bg,intr,nosuid,actimeo=2,rsize=8192,wsize=8192
> 
> the low actimeo it's because the server changes metadata that the clients must
> see asap.    

I think you need to proceed carefully here and not assume that this problem is the same as the one you're seeing. You'll need to do some analysis to determine that.

If you're not comfortable doing that on your own, then you may want to open a support case with RH support and have them help you with it.

--- Additional comment from bfields on 2010-03-17 14:22:56 EDT ---

In reply to #14: The code looks like it is vulnerable to the deadlock you describe.

Coincidentally, this came up recently: http://marc.info/?l=linux-nfs&m=126480953224756&w=2

We could use a variant of nfsd_open that omits the lease_break().

Or we might just skip the open entirely and pass NULL for the file descriptor.

--- Additional comment from jlayton on 2010-03-17 15:01:51 EDT ---

The open for the COMMIT case doesn't seem to buy much. From vfs_fsync_range:

-----------[snip]-----------
        if (file) {
                mapping = file->f_mapping;
                fop = file->f_op;
        } else {
                mapping = dentry->d_inode->i_mapping;
                fop = dentry->d_inode->i_fop;
        }
-----------[snip]-----------

RFC3530 is definitely not clear on whether you need to recall a delegation on COMMIT. In most cases, if someone is issuing a COMMIT then they are likely the holder of the delegation. There are probably scenarios where that's not the case, but in those situations would it be harmful to service a commit while someone else holds the delegation? It doesn't seem like it...

--- Additional comment from bfields on 2010-03-17 15:16:26 EDT ---

Agreed.

There may have been some other reason for wanting a file descriptor there that I'm forgetting, but if so hopefully posting a patch will be enough to jog someone's memory....

--- Additional comment from marcus.br on 2010-03-17 17:57:03 EDT ---

Hello guys,

I discovered one way to reproduce "decode_op_hdr: reply buffer overflowed in line 2121", using the LTP test case ltp-full-20100228/testcases/network/nfs/nfslock01.

Steps:
copy nfslock01 programs to /nfs
cd /nfs
mkdir test
/nfs/nfs_flock_dgen /nfs/test/nfs_flock_data 63 16384 0
cp /nfs/test/nfs_flock_data /nfs/test/nfs_flock_idata
/nfs/nfs_flock_frk /nfs/nfs_flock /nfs/test/nfs_flock_idata
CTRL+C
rm -rf test <- This does not work, but it's another lock problem
Kill connection from nfs server, in my case EMC Celerra.
rm -rf test <- Now works
mkdir test

Best regards

--- Additional comment from simone on 2010-03-18 06:59:06 EDT ---

(In reply to comment #15)
> I think you need to proceed carefully here and not assume that this problem is
> the same as the one you're seeing. You'll need to do some analysis to determine
> that

you're right, my main problem is https://bugzilla.redhat.com/show_bug.cgi?id=526888 but I am also seeing this. I think they are related. 



(In reply to comment #19)
> I discovered one way to reproduce "decode_op_hdr: reply buffer overflowed in
> line 2121". Using LTP test case
> ltp-full-20100228/testcases/network/nfs/nfslock01.
> 
> Steps:

Where should the "buffer overflowed" happen?

--- Additional comment from marcus.br on 2010-03-18 09:37:56 EDT ---

(In reply to comment #20)
> (In reply to comment #19)
> > I discovered one way to reproduce "decode_op_hdr: reply buffer overflowed in
> > line 2121". Using LTP test case
> > ltp-full-20100228/testcases/network/nfs/nfslock01.
> > 
> > Steps:
> 
> Where should the "buffer overflowed" happen?    

Last step. "mkdir test".

Best regards

--- Additional comment from jlayton on 2010-03-18 15:42:56 EDT ---

Ok, I think we're potentially looking at multiple problems here.

I think we now understand the problem in the capture that Rob provided. It's a deadlock of sorts, and it will likely be fixed by a server patch that makes the server not open the file before calling fsync and doing the commit. I've pushed a patch upstream to fix that and am awaiting feedback on it.

Backporting that patch to RHEL5 is a little more work, but should be doable.

The other problem that causes the decode_op_hdr message to pop seems like something else entirely. There's no correlation between the message popping and COMMIT calls in the capture as far as I can tell.

I think we'll have to do some investigation to track that down. What we'll probably have to do is roll a patch to instrument the kernel such that it gives us more info about what's causing that message to pop.

--- Additional comment from simone on 2010-03-19 05:00:41 EDT ---

Jeff: I'm willing to test that instrumented patch.

--- Additional comment from jlayton on 2010-03-19 08:13:26 EDT ---

Created an attachment (id=401230)
patch -- nfsd shouldn't break lease when servicing a COMMIT

I think this patch should help Rob's issue. It's similar in concept to a patch that I've just pushed upstream. It's a server side patch and doesn't affect client behavior at all.

Rob, would you be able to test this out someplace non-critical and let me know if it helps?
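
One way to get there (whether or not it is exactly what the attached patch does) is to teach nfsd_open() to skip the lease break when the caller asks for it, and have nfsd_commit() ask for that. The flag name and value below are illustrative only; see the attachment for the actual change:

-----------[snip]-----------
/* hypothetical sketch of the approach; the attached patch may differ */
#define MAY_NOT_BREAK_LEASE	0x2000		/* illustrative flag value */

/* in nfsd_open(): only break the lease when the caller permits it */
	if (!(access & MAY_NOT_BREAK_LEASE)) {
		host_err = break_lease(inode, O_NONBLOCK |
				((access & MAY_WRITE) ? FMODE_WRITE : 0));
		if (host_err)
			goto out_nfserr;
	}

/* in nfsd_commit(): COMMIT must not recall the client's own delegation */
	err = nfsd_open(rqstp, fhp, S_IFREG,
			MAY_WRITE | MAY_NOT_BREAK_LEASE, &file);
-----------[snip]-----------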

--- Additional comment from jlayton on 2010-03-19 09:26:58 EDT ---

Created an attachment (id=401245)
client side debug patch

We'll probably need to eventually split this bug into two. I think the problem with the decode_op_hdr messages is probably separate from the COMMIT deadlock issue. For now though, we'll keep the bugs together until we know more.

This patch changes the spot where we seem to see these messages pop. It makes it dump a bit more info so that we know what operation it expected to decode at the time, and does a dump_stack() too.

There's no ratelimiting in this patch, so you may get quite a bit of ringbuffer spam. The ideal thing would be to run whatever testcase you have that causes these messages to pop. Once it does, attach the log messages that this patch generates to the bug. It would also be helpful to sniff traffic at the same time so that we can try to correlate these events with actual traffic on the wire.
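
In rough terms, the change turns the generic "reply buffer overflowed" printk in decode_op_hdr() into something along these lines (simplified; see the attachment for the real thing):

-----------[snip]-----------
/* sketch of the debug change, not the actual attachment */
static int decode_op_hdr(struct xdr_stream *xdr, enum nfs_opnum4 expected)
{
	__be32 *p;

	p = xdr_inline_decode(xdr, 8);
	if (unlikely(!p)) {
		/* say which op we expected and where we were called from */
		printk(KERN_WARNING
		       "decode_op_hdr: reply buffer overflowed, expected = %d\n",
		       expected);
		dump_stack();
		return -EIO;
	}
	/* ... normal opnum/status decoding continues as before ... */
}
-----------[snip]-----------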

--- Additional comment from jlayton on 2010-03-19 09:27:43 EDT ---

Simone, if you can reproduce this problem on a kernel with the patch in comment #25, it would be helpful.

--- Additional comment from marcus.br on 2010-03-19 13:43:00 EDT ---

(In reply to comment #25)

Jeff, I have tested your patch with my testcase, look:

decode_op_hdr: reply buffer overflowed, expected = 9
Call Trace:
 [<ffffffff8844088a>] :nfs:decode_op_hdr+0x2c/0x61
 [<ffffffff884419f6>] :nfs:decode_getfattr+0x2b/0xc5b
 [<ffffffff88442690>] :nfs:nfs4_xdr_dec_delegreturn+0x0/0x69
 [<ffffffff8837dca0>] :sunrpc:rpc_async_schedule+0x0/0x5
 [<ffffffff884426ee>] :nfs:nfs4_xdr_dec_delegreturn+0x5e/0x69
 [<ffffffff88378915>] :sunrpc:call_decode+0x4bf/0x511
 [<ffffffff8837dae4>] :sunrpc:__rpc_execute+0x92/0x24e
 [<ffffffff8004dc37>] run_workqueue+0x94/0xe4
 [<ffffffff8004a472>] worker_thread+0x0/0x122
 [<ffffffff800a198c>] keventd_create_kthread+0x0/0xc4
 [<ffffffff8004a562>] worker_thread+0xf0/0x122
 [<ffffffff8008e16d>] default_wake_function+0x0/0xe
 [<ffffffff800a198c>] keventd_create_kthread+0x0/0xc4
 [<ffffffff800a198c>] keventd_create_kthread+0x0/0xc4
 [<ffffffff80032bdc>] kthread+0xfe/0x132
 [<ffffffff8005efb1>] child_rip+0xa/0x11
 [<ffffffff800a198c>] keventd_create_kthread+0x0/0xc4
 [<ffffffff80032ade>] kthread+0x0/0x132
 [<ffffffff8005efa7>] child_rip+0x0/0x11

decode_op_hdr: reply buffer overflowed, expected = 9
Call Trace:
 [<ffffffff8844088a>] :nfs:decode_op_hdr+0x2c/0x61
 [<ffffffff884419f6>] :nfs:decode_getfattr+0x2b/0xc5b
 [<ffffffff88442690>] :nfs:nfs4_xdr_dec_delegreturn+0x0/0x69
 [<ffffffff8837dca0>] :sunrpc:rpc_async_schedule+0x0/0x5
 [<ffffffff884426ee>] :nfs:nfs4_xdr_dec_delegreturn+0x5e/0x69
 [<ffffffff88378915>] :sunrpc:call_decode+0x4bf/0x511
 [<ffffffff8837dae4>] :sunrpc:__rpc_execute+0x92/0x24e
 [<ffffffff8004dc37>] run_workqueue+0x94/0xe4
 [<ffffffff8004a472>] worker_thread+0x0/0x122
 [<ffffffff800a198c>] keventd_create_kthread+0x0/0xc4
 [<ffffffff8004a562>] worker_thread+0xf0/0x122
 [<ffffffff8008e16d>] default_wake_function+0x0/0xe
 [<ffffffff800a198c>] keventd_create_kthread+0x0/0xc4
 [<ffffffff800a198c>] keventd_create_kthread+0x0/0xc4
 [<ffffffff80032bdc>] kthread+0xfe/0x132
 [<ffffffff8005efb1>] child_rip+0xa/0x11
 [<ffffffff800a198c>] keventd_create_kthread+0x0/0xc4
 [<ffffffff80032ade>] kthread+0x0/0x132
 [<ffffffff8005efa7>] child_rip+0x0/0x1

Best regards

--- Additional comment from jlayton on 2010-03-19 13:54:41 EDT ---

Ok, so in both cases, the issue was in response to a GETATTR call:

    OP_GETATTR = 9

...but apparently the packet was malformed. Did you happen to get a wire capture while reproducing this? It would be helpful to know what the response from the server looked like. What we'll probably need to do is roll up a more extensive debug patch that makes it print out the xid in the RPC packet when this problem occurs. That might also help track down the packets that are causing this message to pop.

One thing I've noticed in the capture provided by Rob is some strange RPC replies coming from the server. Those seem to be for program 0 and wireshark has a hard time decoding them. In most cases, I don't see any request that matches the XID, but I wonder whether occasionally those strange packets happen to match XID's that are in flight.

--- Additional comment from marcus.br on 2010-03-19 14:17:03 EDT ---

Created an attachment (id=401309)
tcpdump

Jeff, here is the tcpdump of this problem.

Regards

--- Additional comment from jlayton on 2010-03-19 14:59:37 EDT ---

Created an attachment (id=401320)
patch -- don't try to decode a GETATTR if DELEGRETURN failed

Thanks, Marcus.

I think I see the problem. nfs4_xdr_dec_delegreturn has this:

-------------[snip]----------------
        status = decode_putfh(&xdr);
        if (status != 0)
                goto out;
        status = decode_delegreturn(&xdr);
        decode_getfattr(&xdr, res->fattr, res->server,
                        !RPC_IS_ASYNC(rqstp->rq_task));
-------------[snip]----------------

...if the DELEGRETURN returns an error (as the capture shows) then there won't be a GETATTR call in the response to decode, and the printk will pop. This is an upstream bug too, but it looks like the printk doesn't actually pop there.

In truth, I think this is really a cosmetic fix. The error from decode_getfattr is ignored so the only real effect of the bug is printk spam.

Regardless, we might as well fix it. Can you test the patch and let me know if it silences the printk's?
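
In other words, the fix amounts to checking the DELEGRETURN status before trying to decode the GETATTR, roughly like this (simplified; the attachment has the real diff):

-------------[snip]----------------
        /* sketch of the concept; the attached patch may differ in detail */
        status = decode_putfh(&xdr);
        if (status != 0)
                goto out;
        status = decode_delegreturn(&xdr);
        if (status != 0)
                goto out;   /* no GETATTR follows an error, so don't decode one */
        decode_getfattr(&xdr, res->fattr, res->server,
                        !RPC_IS_ASYNC(rqstp->rq_task));
out:
-------------[snip]----------------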

--- Additional comment from jlayton on 2010-03-19 15:06:18 EDT ---

From Marcus' capture, it looks like there were some issues with locking that triggered some NFS4ERR_DELAY returns from the server. I suspect that that is really what you'll be interested in fixing. That's a much different problem from what we see in Rob's capture however, so I suggest opening a new ticket if you're still having trouble with NFSv4 after applying that patch.

--- Additional comment from marcus.br on 2010-03-20 23:19:12 EDT ---

(In reply to comment #31)
> From Marcus' capture, it looks like there were some issues with locking that
> caused triggered some NFS4ERR_DELAY returns from the server. I suspect that
> that is really what you'll be interested in fixing. That's a much different
> problem from what we see in Rob's capture however, so I suggest opening a new
> ticket if you're still having trouble with NFSv4 after applying that patch.    

You are right, Jeff. The printk problems are gone, and it's important not to confuse this with another problem. The locking issues are now reported in BZ575505. Thanks

--- Additional comment from jlayton on 2010-03-21 07:19:52 EDT ---

Thanks Marcus. I've sent a patch upstream since this bug still exists there. The printk is now a dprintk there, but it's still something that ought to be fixed.

Comment 1 Jeff Layton 2010-03-22 13:50:49 UTC
Splitting this bug off from the initial bug to track the problem that Rob reported -- the COMMIT / delegation recall deadlock. The original bug will be used to track the issue with the delegreturn decoder ignoring DELEGRETURN errors and trying to decode the GETATTR piece of the compound reply.

Comment 2 Jeff Layton 2010-03-22 13:51:42 UTC
Still waiting for feedback from Rob on the patch to fix this problem, so setting this to NEEDINFO.

Comment 3 Jeff Layton 2010-03-26 11:40:22 UTC
I've put the patch here:

    https://bugzilla.redhat.com/attachment.cgi?id=401320&action=diff

...into my test kernels:

    http://people.redhat.com/jlayton/

Please test them on a non-critical NFS server and report back as to whether they fix the problem.

Comment 5 RHEL Program Management 2010-06-29 14:29:31 UTC
This request was evaluated by Red Hat Product Management for inclusion in a Red
Hat Enterprise Linux maintenance release.  Product Management has requested
further review of this request by Red Hat Engineering, for potential
inclusion in a Red Hat Enterprise Linux Update release for currently deployed
products.  This request is not yet committed for inclusion in an Update
release.

Comment 7 Jarod Wilson 2010-07-12 15:45:18 UTC
in kernel-2.6.18-206.el5
You can download this test kernel from http://people.redhat.com/jwilson/el5

Detailed testing feedback is always welcomed.

Comment 9 yanfu,wang 2010-10-28 09:09:34 UTC
Hard to reproduce; did a code review instead and confirmed that the applied patch is sane.

Comment 11 errata-xmlrpc 2011-01-13 21:20:40 UTC
An advisory has been issued which should help the problem
described in this bug report. This report is therefore being
closed with a resolution of ERRATA. For more information
on the solution and/or where to find the updated files,
please follow the link below. You may reopen this bug report
if the solution does not work for you.

http://rhn.redhat.com/errata/RHSA-2011-0017.html