Created attachment 733285 [details]
Packet capture of failed NFS FSINFO RPC

I'm trying to get Oracle's DNFS working against gluster's internal NFS and I've run into a snag. After Oracle mounts the exported NFS filesystem, the FSINFO call fails. Let's look with wireshark:

»Remote Procedure Call, Type:Call XID:0x47349477
  Program: MOUNT (100005)
Mount Service
  [Program Version: 3]
  [V3 Procedure: MNT (1)]
  Path: /gv0/fleming3/db0/ALTUS_data

»Remote Procedure Call, Type:Reply XID:0x47349477
  Reply State: accepted (0)
Mount Service
  [Program Version: 3]
  [V3 Procedure: MNT (1)]
  Status: OK (0)
  fhandle
    length: 34
    [hash (CRC-32): 0x10650fe6]
    [Name: 192.168.10.1:/gv0/fleming3/db0/ALTUS_data]
    filehandle: 3a4f20117b487f884f169490a0349afacf71331965f573144e93b289a395148edfe5

»Remote Procedure Call, Type:Call XID:0x47349478
  Program: NFS (100003)
  Program Version: 3
  Procedure: FSINFO (19)
Network File System, FSINFO Call DH:0x10650fe6
  [Program Version: 3]
  [V3 Procedure: FSINFO (19)]
  object
    length: 34
    [hash (CRC-32): 0x10650fe6]
    [Name: 192.168.10.1:/gv0/fleming3/db0/ALTUS_data]
    filehandle: 3a4f20117b487f884f169490a0349afacf71331965f573144e93b289a395148edfe5

»Remote Procedure Call, Type:Reply XID:0x47349478
  Reply State: accepted (0)
  Accept State: procedure can't decode params (4)

ARGH. Not sure what's going on here - wireshark is perfectly happy to decode those params.
If I do a regular filesystem mount from Linux, the result is:

»Remote Procedure Call, Type:Call XID:0x266eda62
  Program: MOUNT (100005)
Mount Service
  [Program Version: 3]
  [V3 Procedure: MNT (1)]
  Path: /gv0/fleming1/db0/ALTUS_data

»Remote Procedure Call, Type:Reply XID:0x266eda62
  Reply State: accepted (0)
Mount Service
  [Program Version: 3]
  [V3 Procedure: MNT (1)]
  Status: OK (0)
  fhandle
    length: 34
    [hash (CRC-32): 0xb2ae682f]
    [Name: 192.168.10.1:/gv0/fleming1/db0/ALTUS_data]
    filehandle: 3a4f20117b487f884f169490a0349afacf71e8bd0e2198c34cb88a0231dbeb071035

»Remote Procedure Call, Type:Call XID:0x68b3c756
  Program: NFS (100003)
  Procedure: FSINFO (19)
Network File System, FSINFO Call DH:0xb2ae682f
  [Program Version: 3]
  [V3 Procedure: FSINFO (19)]
  object
    length: 34
    [hash (CRC-32): 0xb2ae682f]
    [Name: 192.168.10.1:/gv0/fleming1/db0/ALTUS_data]
    filehandle: 3a4f20117b487f884f169490a0349afacf71e8bd0e2198c34cb88a0231dbeb071035

»Remote Procedure Call, Type:Reply XID:0x68b3c756
  Reply State: accepted (0)
Network File System, FSINFO Reply
  [Program Version: 3]
  [V3 Procedure: FSINFO (19)]
  Status: NFS3_OK (0)
  obj_attributes Directory mode:0755 uid:500 gid:1000
  rtmax: 65536
  rtpref: 65536
  rtmult: 4096
  wtmax: 65536
  wtpref: 65536
  wtmult: 4096
  dtpref: 65536
  maxfilesize: 1125899906842624
  time delta: 1.000000000 seconds
  Properties: 0x0000001b

So for some reason, gluster is happy with Linux's request but not Oracle's. All I get out of gluster is:

[2013-04-08 12:54:32.206312] E [nfs3.c:4741:nfs3svc_fsinfo] 0-nfs-nfsv3: Error decoding arguments

I've attached abridged packet captures and text explanations of the packets (thanks to wireshark). Can someone please look at this and determine whether gluster's parsing of the RPC call is to blame, or whether it's Oracle?

This is the same setup on which I reported the NFS race condition bug. It does have that patch applied.

Details: http://lists.gnu.org/archive/html/gluster-devel/2013-04/msg00014.html
Created attachment 733286 [details] Good FSINFO RPC from Linux
Created attachment 733287 [details] Text summary of failed FSINFO RPC
Created attachment 733288 [details] Text summary of successful FSINFO RPC
Niels de Vos <ndevos> points out in http://lists.gnu.org/archive/html/gluster-devel/2013-04/msg00050.html:

«
XDR (http://tools.ietf.org/html/rfc4506, the encoding used for the RPC protocol) uses 'blocks' for alignment. A fhandle byte array that is 34 bytes long needs to be (34 / 4 + 1)*4 = 36 bytes in size. The 'length' given in the structure tells the consumer to ignore the two trailing bytes.

The NFSv3 specification (http://tools.ietf.org/html/rfc1813#page-21) defines the nfs_fh3 as an opaque (not bytes) structure.

My guess is that this (untested) change would fix it, can you try that?
»

It didn't :)

It looks like Niels has identified the problem, but it still needs to be fixed.
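To make the padding arithmetic in the quote concrete, here is a minimal C sketch. The helper names are made up for illustration; they are not gluster or libc functions. One caveat on the quoted formula: (34 / 4 + 1)*4 gives the right answer for 34, but with integer division it would over-pad a length that is already a multiple of 4 (32 would become 36). The usual round-up form below avoids that.

```c
#include <stdint.h>

/* XDR encodes everything in 4-byte units (RFC 4506). */
#define XDR_UNIT 4u

/*
 * Round a byte count up to the next multiple of XDR_UNIT.
 * Hypothetical helper for illustration only. The caller is expected to
 * bound len first (NFSv3 file handles are at most 64 bytes), so the
 * addition cannot overflow in practice.
 */
static uint32_t xdr_padded_len(uint32_t len)
{
    return (len + (XDR_UNIT - 1u)) & ~(XDR_UNIT - 1u);
}

/*
 * Bytes that a variable-length opaque<> occupies on the wire:
 * one 4-byte length word followed by the padded data.
 */
static uint32_t xdr_opaque_wire_size(uint32_t len)
{
    return XDR_UNIT + xdr_padded_len(len);
}
```

For the 34-byte handle in the captures this gives 36 bytes of data plus the 4-byte length word, so the handle should take 40 bytes on the wire.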
New proposal sent to Michael with gluster-devel@ on CC:

bool_t
xdr_nfs_fh3 (XDR *xdrs, nfs_fh3 *objp)
{
        uint32_t size;

        /* the length word comes first ... */
        if (!xdr_int (xdrs, &size))
                return FALSE;

        /* ... followed by the opaque file handle data */
        if (!xdr_opaque (xdrs, (u_int *)&objp->data.data_val, size))
                return FALSE;

        return TRUE;
}
Created attachment 735043 [details]
Proposed patch for testing

23:51 < ndevos> Supermathie: ah, I've thought of the error in my suggestion - that function is used to encode and decode
23:52 < ndevos> which means, that the size parameter must be set correctly - the .data_len attribute contains the size when encoding, and should be overwritten when decoding
23:53 < ndevos> KERBOOM happens when an idea is only half looked at :-/

Maybe something like the attached patch works better? It should encode/decode both the length and the fhandle value. Compile tested only.
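The point about one function serving both directions can be sketched in plain C. This is not the attached patch and it does not use the real XDR library; `struct fh`, `enum codec_op` and `fh_codec` are hypothetical names, and the buffer handling is hand-rolled so the example stands on its own. The length comes from the struct when encoding and is overwritten from the wire when decoding.

```c
#include <stddef.h>
#include <stdint.h>
#include <string.h>

#define FH_MAX 64u   /* NFS3_FHSIZE in RFC 1813 */

enum codec_op { CODEC_ENCODE, CODEC_DECODE };

/* Hypothetical in-memory file handle, loosely modelled on nfs_fh3. */
struct fh {
    uint32_t len;
    uint8_t  val[FH_MAX];
};

/*
 * Encode fh into buf, or decode buf into fh, depending on op.
 * Returns the number of bytes produced or consumed, or 0 on error.
 */
static size_t fh_codec(enum codec_op op, struct fh *fh,
                       uint8_t *buf, size_t cap)
{
    uint32_t len, padded;

    if (op == CODEC_ENCODE) {
        len = fh->len;                 /* size comes from the struct */
    } else {
        if (cap < 4)
            return 0;
        /* size comes from the wire, big-endian */
        len = ((uint32_t)buf[0] << 24) | ((uint32_t)buf[1] << 16) |
              ((uint32_t)buf[2] << 8)  |  (uint32_t)buf[3];
    }

    if (len > FH_MAX)
        return 0;
    padded = (len + 3u) & ~3u;         /* round up to a 4-byte boundary */
    if (cap < 4u + (size_t)padded)
        return 0;                      /* buffer too small or truncated */

    if (op == CODEC_ENCODE) {
        buf[0] = (uint8_t)(len >> 24);
        buf[1] = (uint8_t)(len >> 16);
        buf[2] = (uint8_t)(len >> 8);
        buf[3] = (uint8_t)len;
        memcpy(buf + 4, fh->val, len);
        memset(buf + 4 + len, 0, padded - len);   /* zero the padding */
    } else {
        memcpy(fh->val, buf + 4, len);
        fh->len = len;                 /* overwrite with the decoded size */
    }
    return 4u + (size_t)padded;
}
```

With a 34-byte handle, encoding produces 40 bytes, and decoding a buffer that stops after 38 bytes (length word plus unpadded data) is rejected instead of being read past its end.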
Created attachment 735301 [details]
Updated patch

This patch does not break the Linux NFS client. I wonder if this makes it possible to use the Oracle DNFS client.
What happens when gluster accepts the bad RPC in the FSINFO handler is that things continue on, but the same badly aligned XDR keeps coming in with later calls and causes the glusterfs NFS daemon to crash. Test cases need to be added to gluster so that it handles this situation more robustly.

Regarding Oracle, I'm able to work around the problem by expanding the size of the file handle so that its length happens to be congruent to 0 mod 4 bytes:

https://github.com/Supermathie/glusterfs/commit/95880cf71375cb4b04a1b645598c7570c5087de7

I'm morally opposed to submitting this for inclusion in Gluster however - Oracle needs to fix their code!

I'm inclined to leave this bug open as a request for better robustness in the handling of bad XDR encoding in incoming RPCs - they shouldn't be crashing Gluster's NFS.
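As a rough idea of what such a robustness check could look like, here is a small C sketch that classifies a raw FSINFO argument buffer before anything tries to decode it. It relies on FSINFO3args consisting of a single nfs_fh3 (RFC 1813). The function and enum names are hypothetical, none of this is gluster code, and the 38-byte case in the usage note is only an assumed example of a missing pad, not something taken from the capture.

```c
#include <stddef.h>
#include <stdint.h>

#define NFS3_FHSIZE 64u   /* maximum NFSv3 file handle size, RFC 1813 */

/* Hypothetical result codes, so a caller can log why decoding failed. */
enum args_status {
    ARGS_OK = 0,
    ARGS_TOO_SHORT,       /* length word or handle data is cut off */
    ARGS_FH_TOO_LONG,     /* declared length exceeds NFS3_FHSIZE */
    ARGS_BAD_PADDING,     /* data is present but the XDR pad is missing */
    ARGS_TRAILING_DATA    /* extra bytes after the padded handle */
};

/*
 * Check that buf[0..n) is exactly one XDR opaque<NFS3_FHSIZE>:
 * a big-endian 4-byte length, the data, and zero to three pad bytes.
 * Only sizes are checked; the pad bytes themselves are not inspected.
 */
static enum args_status check_fh_args(const uint8_t *buf, size_t n)
{
    uint32_t len, padded;

    if (n < 4)
        return ARGS_TOO_SHORT;
    len = ((uint32_t)buf[0] << 24) | ((uint32_t)buf[1] << 16) |
          ((uint32_t)buf[2] << 8)  |  (uint32_t)buf[3];
    if (len > NFS3_FHSIZE)
        return ARGS_FH_TOO_LONG;
    if (n < 4u + (size_t)len)
        return ARGS_TOO_SHORT;

    padded = (len + 3u) & ~3u;
    if (n < 4u + (size_t)padded)
        return ARGS_BAD_PADDING;
    if (n > 4u + (size_t)padded)
        return ARGS_TRAILING_DATA;
    return ARGS_OK;
}
```

A 34-byte handle passes when the buffer is 40 bytes long and is reported as ARGS_BAD_PADDING when it is only 38. Returning a distinct status for each case would let the server answer with a decode error and log the reason, rather than continuing with a half-parsed request.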
REVIEW: http://review.gluster.org/4918 (Expand gluster's NFS FD header to 4 bytes) posted (#2) for review on master by Anand Avati (avati)
COMMIT: http://review.gluster.org/4918 committed in master by Anand Avati (avati)
------
commit 39a1eaf38d64f66dfa74c6843dc9266f40dd4645
Author: Michael Brown <michael>
Date:   Tue Apr 30 11:34:57 2013 -0400

    Expand gluster's NFS FD header to 4 bytes

    * https://bugzilla.redhat.com/show_bug.cgi?id=950121
    * Oracle's DNFS does not properly XDR encoding on NFS FDs that are not
      congruent to 0mod4 bytes long
    * This patch is a workaround to support Oracle's buggy code

    Change-Id: Ic621e2cd679a86aa9a06ed9ca684925e1e0ec43f
    BUG: 950121
    Signed-off-by: Michael Brown <michael>
    Reviewed-on: http://review.gluster.org/4918
    Tested-by: Gluster Build System <jenkins.com>
    Reviewed-by: Anand Avati <avati>
This bug is getting closed because a release has been made available that should address the reported issue. In case the problem is still not fixed with glusterfs-3.5.0, please reopen this bug report.

glusterfs-3.5.0 has been announced on the Gluster Developers mailinglist [1], packages for several distributions should become available in the near future. Keep an eye on the Gluster Users mailinglist [2] and the update infrastructure for your distribution.

[1] http://thread.gmane.org/gmane.comp.file-systems.gluster.devel/6137
[2] http://thread.gmane.org/gmane.comp.file-systems.gluster.user