|Summary:||nfsv3 on krb5i printk's many RPC request reserved XXX but used YYY|
|Product:||Red Hat Enterprise Linux 5||Reporter:||Jeff Layton <jlayton>|
|Component:||kernel||Assignee:||Jeff Layton <jlayton>|
|Status:||CLOSED ERRATA||QA Contact:||Martin Jenner <mjenner>|
|Version:||5.0||CC:||k.georgiou, staubach, steved|
|Fixed In Version:||RHBA-2007-0959||Doc Type:||Bug Fix|
|Doc Text:||Story Points:||---|
|Last Closed:||2007-11-07 19:48:36 UTC||Type:||---|
|oVirt Team:||---||RHEL 7.3 requirements from Atomic Host:|
|Bug Depends On:|
Description Jeff Layton 2007-05-02 10:42:50 UTC
+++ This bug was initially created as a clone of Bug #237374 +++ +++ This bug was initially created as a clone of Bug #227726 +++ When running the cthon04 test suite on NFSv3 w/ sec=krb5i, we get a lot of messages that look like: RPC request reserved 172 but used 216 RPC request reserved 164 but used 208 RPC request reserved 172 but used 216 RPC request reserved 164 but used 208 RPC request reserved 196 but used 216 ...and eventually get an oops in svcauth_gss_release. Here's one such oops, though some other have crashed in kernel BUG at net/sunrpc/auth_gss/svcauth_gss.c:1001! invalid operand: 0000 [#1] SMP Modules linked in: nfsd exportfs lockd nfs_acl md5 ipv6 autofs4 rpcsec_gss_krb5 auth_rpcgss des sunrpc dm_mirror dm_mod xennet ext3 jbd xenblk sd_mod scsi_mod CPU: 0 EIP: 0061:[<e0af1b61>] Not tainted VLI EFLAGS: 00010202 (2.6.9-44.ELxenU) EIP is at svcauth_gss_release+0x130/0x301 [auth_rpcgss] eax: 00000198 ebx: d3d30044 ecx: de205f0c edx: 00000008 esi: c159cc64 edi: 00000070 ebp: 00000048 esp: de205f14 ds: 007b es: 007b ss: 0068 Process nfsd (pid: 1840, threadinfo=de205000 task=de091360) Stack: 00000010 ffffffea dae85aa0 00000001 c159cc00 d3d30048 00000068 d3d300b0 00000008 c016e2e3 d58012ac 00000000 00000070 00000070 e0c05af9 d7b799c0 e0c2df40 e0af6144 00000000 d3d30048 c159cc64 e0b5fe2b c159cc00 e0c2e404 Call Trace: [<c016e2e3>] dput+0x38/0x1ab [<e0c05af9>] fh_put+0x177/0x188 [nfsd] [<e0b5fe2b>] svc_authorise+0x1a/0x4f [sunrpc] [<e0b5da36>] svc_process+0x4b7/0x6fd [sunrpc] [<e0c03469>] nfsd+0x1ce/0x33e [nfsd] [<e0c0329b>] nfsd+0x0/0x33e [nfsd] [<c0105341>] kernel_thread_helper+0x5/0xb Code: 07 00 83 c4 10 85 c0 74 08 0f 0b e5 03 8d 28 af e0 83 7e 18 00 75 23 8b 56 0c 8d 82 90 01 00 00 3d ff 0f 00 00 77 13 85 d2 74 08 <0f> 0b e9 03 8d 28 af e0 89 f3 e9 85 00 00 00 83 7e 08 00 8d 5e ...though we had another that looked like this: Unable to handle kernel NULL pointer dereference at virtual address 00000004 printing eip: c0233406 1de91000 -> *pde = 00000000:0e6c9001 1de95000 -> *pme = 00000000:00000000 Oops: 0002 [#1] SMP Modules linked in: nfsd exportfs lockd nfs_acl md5 ipv6 autofs4 rpcsec_gss_krb5 auth_rpcgss des sunrpc dm_mirror dm_mod xennet ext3 jbd xenblk sd_mod scsi_mod CPU: 0 EIP: 0061:[<c0233406>] Not tainted VLI EFLAGS: 00010246 (2.6.9-44.ELxenU) EIP is at do_tcp_sendpages+0x307/0x9e7 eax: dbc81240 ebx: dc139e40 ecx: 00000000 edx: 00000000 esi: 0000002c edi: dbc81458 ebp: 00000002 esp: ddf80e7c ds: 007b es: 007b ss: 0068 Process nfsd (pid: 1876, threadinfo=ddf80000 task=df3bd1a0) Stack: 00000000 0000002c 00000000 00000000 00000000 000005a8 00000bb8 00000000 dbc81240 0000002c 00000000 c0233b32 dbc81240 ddf80ed8 00000000 0000002c 00000000 00000000 00000000 0000002c dae94c80 e0b5e236 dae94c80 00000000 Call Trace: [<c0233b32>] tcp_sendpage+0x4c/0x5f [<e0b5e236>] svc_sendto+0x1b7/0x208 [sunrpc] [<e0b5ff7c>] auth_domain_put+0xe/0x11 [sunrpc] [<e0af1d19>] svcauth_gss_release+0x2e8/0x301 [auth_rpcgss] [<e0b5ef61>] svc_tcp_sendto+0x63/0xa9 [sunrpc] [<e0b5f5e2>] svc_send+0xbd/0xf4 [sunrpc] [<e0b5da44>] svc_process+0x4c5/0x6fd [sunrpc] [<e0c03469>] nfsd+0x1ce/0x33e [nfsd] [<e0c0329b>] nfsd+0x0/0x33e [nfsd] [<c0105341>] kernel_thread_helper+0x5/0xb Code: ff 83 c4 0c 85 c0 0f 84 5a 02 00 00 83 3c 24 00 74 15 8b 93 ac 00 00 00 0f b7 44 ea 0e 8d 04 06 66 89 44 ea 0e eb 33 8b 54 24 0c <f0> ff 42 04 8b 83 ac 00 00 00 8b 4c 24 0c 8d 04 e8 89 48 10 8d I think they're related, but it's hard to be certain. My first read of this makes me think that we aren't reserving a large enough buffer for the response when using krb5i with nfsv3, with some operations. -- Additional comment from firstname.lastname@example.org on 2007-02-07 20:15 EST -- This problem also seems to exist on recent upstream kernels so it's probably also a problem on FC-6/RHEL-5. -- Additional comment from email@example.com on 2007-04-18 13:49 EST -- Finally got a chance to start looking at this again. Mounted a fc7 rawhide server (2.6.20-2925.5.fc7xen) on a RHEL4 client: host:/export on /mnt/krb5i type nfs (rw,sec=krb5i,addr=xxx.yyy.xxx.yyy) When I do simple operations in the filesystem (ls, etc), I get messages like this: RPC request reserved 164 but used 208 RPC request reserved 164 but used 208 RPC request reserved 172 but used 216 RPC request reserved 164 but used 208 RPC request reserved 172 but used 216 RPC request reserved 164 but used 208 These all seem to be 44 bytes more than they should be. Not yet sure of the significance of 44 bytes here. -- Additional comment from firstname.lastname@example.org on 2007-04-19 14:35 EST -- Had a look on the wire at what's happening at the time when these messages pop up: This message: RPC request reserved 164 but used 208 ...seems to correspond to a GETATTR call. On the wire, the GETATTR packets are 274 bytes total, and the RPC+NFS payload is 208 bytes. This one: RPC request reserved 172 but used 216 ...comes from an ACCESS call. On the wire, this is 282 bytes total, and the RPC+NFS payload is 216 bytes. So the kernel seems to be consistently underestimating the actual rq_reserved size that we need when using krb5i with nfs3. This code is pretty complex though, so still tracking down the actual cause... -- Additional comment from email@example.com on 2007-04-20 07:11 EST -- In comparing the same ACCESS packet in a krb5 vs. a krb5i mount, the size difference is actually in the NFS section of the packet (as decoded by wireshark). The RPC sections of both are the exact same size (68 bytes). The main differences seem to be that the krb5i packet has an extra 8 bytes at the beginning of the NFS section: 4 bytes for GSS Data Length 4 bytes for GSS Sequence Number This part seems to be accounted for in the iov_len passed to svc_reserve. The part that's not accounted for is the 44 byte (!) GSS Checksum at the very end of the packet. I think that's the likely problem here... -- Additional comment from firstname.lastname@example.org on 2007-04-20 09:02 EST -- At least one issue seems to be that after writing out the first part of the response header, svc_process() calls this: /* un-reserve some of the out-queue now that we have a * better idea of reply size */ if (procp->pc_xdrressize) svc_reserve(rqstp, procp->pc_xdrressize<<2); ...but, this does not take into account the size of the checksum at the end of the response. This is probably also a problem for krb5p. We don't seem to see this issue with NFSv4, but my guess is that's because the "reserved" size for a compound packet is likely large enough to encompass this in most cases. I did a quick hack to add 44 bytes to the end of the second argument in the call above, and that got rid of most of these errors. I still see some, so apparently there's another related issue someplace... -- Additional comment from email@example.com on 2007-04-20 09:20 EST -- The other related issue is the svc_reserve calls in nfsd/nfs3proc.c and nfsd/nfsproc.c, when I add 44 bytes to the "space" arg in those calls, then the errors go away completely. My thinking at this point is to add a wrapper of some sort for svc_reserve that adds an appropriate amount for the checksum in this situation, and call that wrapper instead of svc_reserve in these 3 places. -- Additional comment from firstname.lastname@example.org on 2007-04-21 10:08 EST -- Posted a patch upstream for this. Awaiting feedback on it now... -- Additional comment from email@example.com on 2007-04-23 07:23 EST -- Created an attachment (id=153272) proposed patch This patch is what I sent upstream. Andrew Morton picked it up for -mm. Awaiting feedback at this point. -- Additional comment from firstname.lastname@example.org on 2007-04-25 16:08 EST -- This request was evaluated by Red Hat Product Management for inclusion in a Red Hat Enterprise Linux maintenance release. Product Management has requested further review of this request by Red Hat Engineering, for potential inclusion in a Red Hat Enterprise Linux Update release for currently deployed products. This request is not yet committed for inclusion in an Update release. -- Additional comment from email@example.com on 2007-05-01 19:26 EST -- Actually, Neil Brown pointed out that the panics shown here are probably not related to the printk's showing up after all. Apparently these are soft reservations and if we overflow them, it shouldn't be a big deal. He has posted a patch that should address these panics. I'm having trouble reproducing the original panic, however. I think they may be arch-specific somehow, since I could generate them fairly easily before on i686, but have yet to see one on x86_64. Working on getting an i686 test rig that I can use to hopefully verify this...
Comment 1 Jeff Layton 2007-05-02 10:44:37 UTC
To reiterate, Neil Brown pointed out that the "RPC reserved XXX but used YYY" errors are likely unrelated to the panics seen under these conditions. So I'm opening this bug to address these messages and will leave bug 237374 to address the oopses.
Comment 2 Jeff Layton 2007-05-02 10:58:23 UTC
Created attachment 153929 [details] patch -- backport of upstream patch . The only difference is the move of #define RPC_MAX_AUTH_SIZE to msg_prot.h. This was done upstream as part of a different patch, but this patch will require it.
Comment 3 RHEL Product and Program Management 2007-05-02 11:03:56 UTC
This request was evaluated by Red Hat Product Management for inclusion in a Red Hat Enterprise Linux maintenance release. Product Management has requested further review of this request by Red Hat Engineering, for potential inclusion in a Red Hat Enterprise Linux Update release for currently deployed products. This request is not yet committed for inclusion in an Update release.
Comment 5 RHEL Product and Program Management 2007-05-08 13:41:57 UTC
This request was evaluated by Red Hat Kernel Team for inclusion in a Red Hat Enterprise Linux maintenance release, and has moved to bugzilla status POST.
Comment 6 Jeff Layton 2007-05-25 11:50:23 UTC
*** Bug 241341 has been marked as a duplicate of this bug. ***
Comment 7 Steve Dickson 2007-05-25 12:07:34 UTC
*** Bug 241341 has been marked as a duplicate of this bug. ***
Comment 8 Don Zickus 2007-06-12 19:00:46 UTC
in 2.6.18-24.el5 You can download this test kernel from http://people.redhat.com/dzickus/el5
Comment 11 errata-xmlrpc 2007-11-07 19:48:36 UTC
An advisory has been issued which should help the problem described in this bug report. This report is therefore being closed with a resolution of ERRATA. For more information on the solution and/or where to find the updated files, please follow the link below. You may reopen this bug report if the solution does not work for you. http://rhn.redhat.com/errata/RHBA-2007-0959.html