Description of problem:
When a user with a large number of group memberships (>100) attempts to access a mounted gluster volume, they get the error:

    cannot open directory .: Transport endpoint is not connected

The mnt-gluster.log shows:

    [2013-08-21 20:37:18.486289] W [xdr-rpcclnt.c:79:rpc_request_to_xdr] 0-rpc: failed to encode call msg
    [2013-08-21 20:37:18.486302] E [rpc-clnt.c:1251:rpc_clnt_record_build_record] 0-gv0-client-0: Failed to build record header
    [2013-08-21 20:37:18.486310] W [rpc-clnt.c:1311:rpc_clnt_record] 0-gv0-client-0: cannot build rpc-record
    [2013-08-21 20:37:18.486317] W [rpc-clnt.c:1452:rpc_clnt_submit] 0-gv0-client-0: cannot build rpc-record
    [2013-08-21 20:37:18.486327] W [client-rpc-fops.c:2624:client3_3_lookup_cbk] 0-gv0-client-0: remote operation failed: Transport endpoint is not connected. Path: /Distributions (00000000-0000-0000-0000-000000000000)

I attached gdb to the glusterfs process and was able to get it to break into rpc-clnt in rpc_clnt_record_build_record. The problem happens in the call to xdr_sizeof (around line 1232, I believe). glibc has a limit of 400 bytes for auth data when encoding this specific RPC message (in glibc, see sunrpc/rpc/auth.h). As a result, xdr_sizeof returns 0 for users with large numbers of group memberships.

This may be an issue that should be fixed in glibc, but since I'm rather a novice with this code I thought I'd file it as a bug here first. This is an issue for us when integrating gluster with Samba using Active Directory authentication, where users sometimes belong to over 100 groups.

Version-Release number of selected component (if applicable):

How reproducible:
I can reproduce it easily with our user directory, but that relies on Active Directory. I imagine it would also be reproducible by creating >100 groups on a single system, adding a user to all of them, and then trying to access a mounted gluster volume, although I haven't had time to try that yet.
Steps to Reproduce (I believe the following would reproduce the issue; I have not had time to try it yet):
1. Create >100 groups
2. Add a user to all these groups
3. As that user, attempt to access a mounted gluster volume

Actual results:
    cannot open directory .: Transport endpoint is not connected
when trying to access a gluster mount.

Expected results:
Be able to access the gluster mount.

Additional info:
I'm currently building glibc with a larger MAX_AUTH_BYTES value in sunrpc/rpc/auth.h to see if the issue resolves; I'll post here once it finishes building. As I mentioned, this may be an issue to fix in glibc rather than gluster.
Rebuilding glibc with MAX_AUTH_BYTES set to 1024 (may be excessive) allows users to access a mounted gluster volume.
REVIEW: http://review.gluster.org/5695 (rpc: fix typo which refers glibc macro) posted (#1) for review on master by Anand Avati (avati)
Hah! Turns out to be a long-standing bug. This was a harmless typo initially, when we used RPCSVC_MAX_AUTH_DATA everywhere else and that value was also 400. When we replaced RPCSVC_MAX_AUTH_DATA with GF_MAX_AUTH_DATA as 2048, this location was left out (sed does not detect typos!), and the harmless typo became harmful :-) Please test the patch http://review.gluster.org/5695 and vote on it. Thanks.
COMMIT: http://review.gluster.org/5695 committed in master by Vijay Bellur (vbellur)
------
commit d64df6a92c2492812ef7c23cc133f5d7a113ec42
Author: Anand Avati <avati>
Date:   Thu Aug 22 14:14:22 2013 -0700

    rpc: fix typo which refers glibc macro

    A typo which read MAX_AUTH_BYTES instead of GF_MAX_AUTH_BYTES was
    picking the value 400 instead of the larger 2048. This causes
    failures when number of aux group ids is a large number.

    Change-Id: Idb8d59aee2690fd53e24c2e09f58a16fe387ef27
    BUG: 1000131
    Signed-off-by: Anand Avati <avati>
    Reviewed-on: http://review.gluster.org/5695
    Tested-by: Gluster Build System <jenkins.com>
    Reviewed-by: Amar Tumballi <amarts>
    Reviewed-by: Vijay Bellur <vbellur>
You folks rock! I didn't expect such a fast response on this issue. Unfortunately the patch doesn't fix the problem. (The fix you made looked like it was also wrong, but not this specific issue.)

I'm a novice with the gluster code base, but if I'm reading things right, in rpc/rpc-lib/src/rpc-clnt.c you make the following call on line 1232:

    xdr_size = xdr_sizeof ((xdrproc_t)xdr_callmsg, &request);

xdr_callmsg is defined in glibc in sunrpc/rpc_cmsg.c (a grep of gluster didn't find any redefinition). It looks like xdr_sizeof builds up a list of operations to perform and then calls the passed-in function (xdr_callmsg). The source for that in turn does:

    if (cmsg->rm_call.cb_cred.oa_length > MAX_AUTH_BYTES) {
            return (FALSE);
    }
    if (cmsg->rm_call.cb_verf.oa_length > MAX_AUTH_BYTES) {
            return (FALSE);
    }

This causes xdr_sizeof to return 0. And there goes my problem. So even though you use GF_MAX_AUTH_BYTES in gluster code, a glibc function gets called that uses MAX_AUTH_BYTES.

I may totally be reading this wrong, since I just started looking at this code for the first time the day before last, but that is where I believe the problem lies.
Sorry, I misspoke. The fix you made was RIGHT (not wrong); that code was originally incorrect too, but the patch does not fix my issue.
REVIEW: http://review.gluster.org/5854 (rpc: fix typo which refers glibc macro) posted (#1) for review on release-3.4 by Anand Avati (avati)
REVIEW: http://review.gluster.org/5854 (rpc: fix typo which refers glibc macro) posted (#2) for review on release-3.4 by Anand Avati (avati)
REVIEW: http://review.gluster.org/5854 (rpc: fix typo which refers glibc macro) posted (#3) for review on release-3.4 by Anand Avati (avati)
COMMIT: http://review.gluster.org/5854 committed in release-3.4 by Vijay Bellur (vbellur)
------
commit f43a223ad1e53041f46b351aa260203ea0685613
Author: Anand Avati <avati>
Date:   Thu Aug 22 14:14:22 2013 -0700

    rpc: fix typo which refers glibc macro

    A typo which read MAX_AUTH_BYTES instead of GF_MAX_AUTH_BYTES was
    picking the value 400 instead of the larger 2048. This causes
    failures when number of aux group ids is a large number.

    Change-Id: Idb8d59aee2690fd53e24c2e09f58a16fe387ef27
    BUG: 1000131
    Signed-off-by: Anand Avati <avati>
    Reviewed-on: http://review.gluster.org/5854
    Tested-by: Gluster Build System <jenkins.com>
    Reviewed-by: Vijay Bellur <vbellur>
This bug is being closed because a release has been made available that should address the reported issue. If the problem is still not fixed with glusterfs-3.4.3, please reopen this bug report.

glusterfs-3.4.3 has been announced on the Gluster Developers mailing list [1]; packages for several distributions should already be, or will soon become, available. Keep an eye on the Gluster Users mailing list [2] and the update infrastructure for your distribution.

The fix for this bug is likely to be included in all future GlusterFS releases, i.e. releases > 3.4.3. Likewise, the recent glusterfs-3.5.0 release [3] is likely to contain the fix. You can verify this by reading the comments in this bug report and checking for comments mentioning "committed in release-3.5".

[1] http://thread.gmane.org/gmane.comp.file-systems.gluster.devel/5978
[2] http://news.gmane.org/gmane.comp.file-systems.gluster.user
[3] http://thread.gmane.org/gmane.comp.file-systems.gluster.devel/6137