Description of problem:

We have been intermittently hitting 30-minute timeouts on getxattr("system.posix_acl_access") calls on directories in an NFS v3 mount (with the acl option) of a 10-node, 40-brick gluster 3.4.0 volume. For example:

$ strace -tt -T getfacl d6h_take1
...
18:43:57.929225 lstat("d6h_take1", {st_mode=S_IFDIR|0755, st_size=7024, ...}) = 0 <0.257107>
18:43:58.186461 getxattr("d6h_take1", "system.posix_acl_access", 0x7fffdf2b9f50, 132) = -1 ENODATA (No data available) <1806.296893>
19:14:04.483556 stat("d6h_take1", {st_mode=S_IFDIR|0755, st_size=7024, ...}) = 0 <0.642362>
19:14:05.126025 getxattr("d6h_take1", "system.posix_acl_default", 0x7fffdf2b9f50, 132) = -1 ENODATA (No data available) <0.000024>
19:14:05.126114 stat("d6h_take1", {st_mode=S_IFDIR|0755, st_size=7024, ...}) = 0 <0.000010>
...

Load on the servers was moderate, and all other I/O operations were snappy. While the above call was hanging, getfacl worked nearly instantaneously on that directory on all bricks. When it finally hit the 30-minute timeout, gluster logged it in nfs.log:

[2014-03-31 23:14:04.481154] E [rpc-clnt.c:207:call_bail] 0-holyscratch-client-36: bailing out frame type(GlusterFS 3.3) op(GETXATTR(18)) xid = 0x8168809x sent = 2014-03-31 22:43:58.442411. timeout = 1800
[2014-03-31 23:14:04.481233] W [client-rpc-fops.c:1112:client3_3_getxattr_cbk] 0-holyscratch-client-36: remote operation failed: Transport endpoint is not connected. Path: <gfid:b116fb01-b13d-448a-90d0-a8693a98698b> (b116fb01-b13d-448a-90d0-a8693a98698b). Key: (null)

Subsequent getfacls on the client (on the same directory) returned quickly. Other than the above, we didn't see anything directly related in the nfs or brick logs, or anything out of sorts with the gluster services.
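For anyone wanting to measure this without strace, here is a minimal sketch of timing a single ACL-xattr lookup. It is Linux-only (Python's os.getxattr), and the function name is made up for illustration; ENODATA is the normal "no ACL set" answer, so the pathology is purely the latency:

```python
import errno
import os
import time

def probe_acl_xattr(path, attr="system.posix_acl_access"):
    """Time one getxattr() call and return (seconds, errno or None).

    ENODATA just means "no ACL is set" and is expected; the problem in
    this report is the call blocking for ~1806 s before gluster bails
    out the frame at the 1800 s RPC timeout.
    """
    start = time.monotonic()
    try:
        os.getxattr(path, attr)
        err = None
    except OSError as e:
        err = e.errno
    return time.monotonic() - start, err
```

During the hang above, a call like probe_acl_xattr("d6h_take1") would have come back after roughly 1806 seconds with errno.ENODATA; on a healthy mount it returns in microseconds.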
Specifically, searching the 3-minute window around when the original call took place (note: brick logs are in UTC) brings up only this one warning, over half a minute earlier, which I'm assuming is unrelated:

[root@adminhost ~]# pdsh -w holyscratch[01-10] '(zcat /var/log/glusterfs/bricks/*.gz; cat /var/log/glusterfs/bricks/*.log /var/log/glusterfs/bricks/*.log-????????) | grep "2014-03-31 22:4[345]" || true'
holyscratch03: [2014-03-31 22:43:14.840549] W [server-resolve.c:419:resolve_anonfd_simple] 0-server: inode for the gfid (1a72c107-cf53-458a-81d4-62baa809a0b2) is not found. anonymous fd creation failed

Likewise at the 30-minute timeout:

[root@adminhost ~]# pdsh -w holyscratch[01-10] '(zcat /var/log/glusterfs/bricks/*.gz; cat /var/log/glusterfs/bricks/*.log /var/log/glusterfs/bricks/*.log-????????) | grep "2014-03-31 23:1[345]" || true'
holyscratch02: [2014-03-31 23:15:11.998911] W [server-resolve.c:419:resolve_anonfd_simple] 0-server: inode for the gfid (e51e5b92-555a-4e23-9363-c16e59b8503b) is not found. anonymous fd creation failed

A couple of other errors raise eyebrows, but these are for different directories (neighbors of the example above) and at different times:

holyscratch07: /var/log/glusterfs/nfs.log:[2014-03-31 19:30:47.794454] I [dht-layout.c:630:dht_layout_normalize] 0-holyscratch-dht: found anomalies in /some_lab/some_user/d9_take2_BGI/Diffreg. holes=1 overlaps=0 (repeated 8x in 12 mins)
holyscratch08: /var/log/glusterfs/bricks/holyscratch08_03-brick.log:[2014-03-31 00:57:51.973565] E [posix-helpers.c:696:posix_handle_pair] 0-holyscratch-posix: /holyscratch08_03/brick/some_lab/some_user/d9_take2_BGI/cuffdiffRN.txt: key:system.posix_acl_access error:Invalid argument
holyscratch08: /var/log/glusterfs/bricks/holyscratch08_03-brick.log:[2014-03-31 01:18:12.345818] E [posix-helpers.c:696:posix_handle_pair] 0-holyscratch-posix: /holyscratch08_03/brick/some_lab/some_user/d9_take2_BGI/cuffdiffRN.txt: key:system.posix_acl_access error:Invalid argument
holyscratch05: /var/log/glusterfs/bricks/holyscratch05_04-brick.log:[2014-03-31 21:16:37.057674] E [posix-helpers.c:696:posix_handle_pair] 0-holyscratch-posix: /holyscratch05_04/brick/some_lab/some_user/d9_take2_BGI/Diffreg/cuffdiffRN.txt: key:system.posix_acl_access error:Invalid argument

See also the gluster-users thread: http://www.gluster.org/pipermail/gluster-users/2014-April/039861.html

Version-Release number of selected component (if applicable): 3.4.0

How reproducible: intermittent (and also complicated by NFS caching, which is otherwise a boon)

Steps to Reproduce:
1. getfacl or ls -l a directory; the issue may or may not present itself

Actual results: 30 minutes to complete

Expected results: return ENODATA faster

Additional info: gluster-users thread: http://www.gluster.org/pipermail/gluster-users/2014-April/039861.html
Would it be possible to upload the gluster logs from the machine that acts as the NFS server, and from the machines where "d6h_take1" is present? When a 30-minute timeout is observed, it would also be good to check syslog and dmesg on the servers that own the file to see if the kernel reports anything unusual.
Created attachment 886262 [details]
logs from the time the reported issue happened

holyscratch06 was the NFS server being used in the example
Sure, I've attached the logs. Redacted a bit, but I think all the essential stuff is there. All gluster servers are NFS servers, too, but the given example happened on a client connected to holyscratch06. "d6h_take1" is a directory, so it's present on all bricks.

I see a couple of interesting things in syslog that I did not notice before. These happened about a minute before my traced test, and may or may not have been my preliminary poking at the same paths:

holyscratch04/syslog:Mar 31 18:42:47 holyscratch04 GlusterFS[59447]: [2014-03-31 22:42:47.465724] C [rpc-clnt.c:460:rpc_clnt_fill_request_info] 0-holyscratch-client-35: cannot lookup the saved frame corresponding to xid (2934269)
holyscratch05/syslog:Mar 31 18:42:47 holyscratch05 GlusterFS[55388]: [2014-03-31 22:42:47.455307] C [rpc-clnt.c:460:rpc_clnt_fill_request_info] 0-holyscratch-client-35: cannot lookup the saved frame corresponding to xid (715115)
holyscratch08/syslog:Mar 31 18:42:47 holyscratch08 GlusterFS[58355]: [2014-03-31 22:42:47.456727] C [rpc-clnt.c:460:rpc_clnt_fill_request_info] 0-holyscratch-client-35: cannot lookup the saved frame corresponding to xid (2049402)

p.s. "holy" stands for Holyoke, MA, in case you're wondering. The filesystem has not been deified :)
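For what it's worth, the time-window correlation done earlier with pdsh and hand-built grep patterns like "2014-03-31 22:4[345]" can be sketched in a few lines of Python. This is just an illustrative helper (the function name and default window are made up); it matches the bracketed UTC timestamps gluster writes:

```python
import re
from datetime import datetime, timedelta

# Gluster log lines carry a bracketed UTC timestamp, e.g.
# "[2014-03-31 22:42:47.465724] C [rpc-clnt.c:460:...] ..."
STAMP = re.compile(r"\[(\d{4}-\d{2}-\d{2} \d{2}:\d{2}:\d{2})\.\d+\]")

def lines_near(lines, center, window_minutes=3):
    """Return the log lines whose timestamp falls within +/- window_minutes
    of `center` (a naive datetime in the same UTC frame as the logs)."""
    lo = center - timedelta(minutes=window_minutes)
    hi = center + timedelta(minutes=window_minutes)
    hits = []
    for line in lines:
        m = STAMP.search(line)
        if m:
            ts = datetime.strptime(m.group(1), "%Y-%m-%d %H:%M:%S")
            if lo <= ts <= hi:
                hits.append(line)
    return hits
```

Feeding it the concatenated brick logs with center set to the getxattr send time (2014-03-31 22:43:58 UTC) reproduces the same narrow window searched above.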
GlusterFS 3.7.0 has been released (http://www.gluster.org/pipermail/gluster-users/2015-May/021901.html), and the Gluster project maintains N-2 supported releases. The last two releases before 3.7 are still maintained; at the moment these are 3.6 and 3.5. This bug has been filed against the 3.4 release and will not get fixed in a 3.4 version any more. Please verify whether newer versions are affected by the reported problem. If that is the case, update the bug with a note, and update the version if you can. In case updating the version is not possible, leave a comment in this bug report with the version you tested, and set the "Need additional information the selected bugs from" field below the comment box to "bugs". If there is no response by the end of the month, this bug will get closed automatically.
GlusterFS 3.4.x has reached end-of-life. If this bug still exists in a later release, please reopen it and change the version, or open a new bug.