Bug 1086303 - 30-min posix_acl_access timeouts
Summary: 30-min posix_acl_access timeouts
Keywords:
Status: CLOSED EOL
Alias: None
Product: GlusterFS
Classification: Community
Component: unclassified
Version: 3.4.0
Hardware: x86_64
OS: Linux
Priority: unspecified
Severity: unspecified
Target Milestone: ---
Assignee: Vijay Bellur
QA Contact:
URL:
Whiteboard:
Depends On:
Blocks:
 
Reported: 2014-04-10 15:03 UTC by John Brunelle
Modified: 2015-10-07 12:11 UTC
CC List: 2 users

Fixed In Version:
Doc Type: Bug Fix
Doc Text:
Clone Of:
Environment:
Last Closed: 2015-10-07 12:11:20 UTC
Regression: ---
Mount Type: ---
Documentation: ---
CRM:
Verified Versions:
Embargoed:


Attachments
logs from the time of the reported issue happened (28.97 KB, application/x-bzip)
2014-04-14 20:34 UTC, John Brunelle

Description John Brunelle 2014-04-10 15:03:44 UTC
Description of problem:

We have been intermittently hitting 30-minute timeouts on getxattr calls for system.posix_acl_access on directories in an NFS v3 mount (with the acl option) of a 10-node, 40-brick gluster 3.4.0 volume.  For example:

$ strace -tt -T getfacl d6h_take1
...
18:43:57.929225 lstat("d6h_take1", {st_mode=S_IFDIR|0755, st_size=7024, ...}) = 0 <0.257107>
18:43:58.186461 getxattr("d6h_take1", "system.posix_acl_access", 0x7fffdf2b9f50, 132) = -1 ENODATA (No data available) <1806.296893>
19:14:04.483556 stat("d6h_take1", {st_mode=S_IFDIR|0755, st_size=7024, ...}) = 0 <0.642362>
19:14:05.126025 getxattr("d6h_take1", "system.posix_acl_default", 0x7fffdf2b9f50, 132) = -1 ENODATA (No data available) <0.000024>
19:14:05.126114 stat("d6h_take1", {st_mode=S_IFDIR|0755, st_size=7024, ...}) = 0 <0.000010>
...
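
For reference, the slow call can be isolated from getfacl's other syscalls; a minimal sketch using the same directory as above (getfattr issues the same getxattr request, though how the attribute is reported back depends on the client kernel):

$ strace -tt -T -e trace=getxattr getfacl d6h_take1
$ getfattr -n system.posix_acl_access d6h_take1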

Load on the servers was moderate, but all other I/O operations were snappy.  While the above call was hanging, getfacl on that directory completed nearly instantaneously on all bricks.  When the call finally hit the 30-minute timeout, gluster logged it in nfs.log:

[2014-03-31 23:14:04.481154] E [rpc-clnt.c:207:call_bail] 0-holyscratch-client-36: bailing out frame type(GlusterFS 3.3) op(GETXATTR(18)) xid = 0x8168809x sent = 2014-03-31 22:43:58.442411. timeout = 1800
[2014-03-31 23:14:04.481233] W [client-rpc-fops.c:1112:client3_3_getxattr_cbk] 0-holyscratch-client-36: remote operation failed: Transport endpoint is not connected. Path: <gfid:b116fb01-b13d-448a-90d0-a8693a98698b> (b116fb01-b13d-448a-90d0-a8693a98698b). Key: (null)

Subsequent getfacls on the client (on the same directory) returned quickly.
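
The "timeout = 1800" in the call_bail message matches GlusterFS's default RPC frame timeout (the network.frame-timeout volume option).  Lowering it would only make the failure surface sooner rather than fix whatever keeps the brick from answering; a sketch, assuming the option is exposed in this 3.4 deployment:

$ gluster volume set help | grep -A 2 network.frame-timeout   # show the 1800-second default
$ gluster volume set holyscratch network.frame-timeout 600    # example: bail out after 10 minutes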

Other than the above, we didn't see anything directly related in the nfs or brick logs, or anything out of sorts with the gluster services.  Specifically, searching the 3-minute window around the time the original call was made (note that the brick logs are in UTC) brings up only this one warning, logged over half a minute earlier and presumably unrelated:

[root@adminhost ~]# pdsh -w holyscratch[01-10] '(zcat /var/log/glusterfs/bricks/*.gz; cat /var/log/glusterfs/bricks/*.log /var/log/glusterfs/bricks/*.log-????????) | grep "2014-03-31 22:4[345]" || true'
holyscratch03: [2014-03-31 22:43:14.840549] W [server-resolve.c:419:resolve_anonfd_simple] 0-server: inode for the gfid (1a72c107-cf53-458a-81d4-62baa809a0b2) is not found. anonymous fd creation failed

Likewise at the 30 minute timeout:

[root@adminhost ~]# pdsh -w holyscratch[01-10] '(zcat /var/log/glusterfs/bricks/*.gz; cat /var/log/glusterfs/bricks/*.log /var/log/glusterfs/bricks/*.log-????????) | grep "2014-03-31 23:1[345]" || true'
holyscratch02: [2014-03-31 23:15:11.998911] W [server-resolve.c:419:resolve_anonfd_simple] 0-server: inode for the gfid (e51e5b92-555a-4e23-9363-c16e59b8503b) is not found. anonymous fd creation failed

A couple of other errors raise eyebrows, but these involve different directories (neighbors of the example above) and occurred at different times:

holyscratch07: /var/log/glusterfs/nfs.log:[2014-03-31 19:30:47.794454] I [dht-layout.c:630:dht_layout_normalize] 0-holyscratch-dht: found anomalies in /some_lab/some_user/d9_take2_BGI/Diffreg. holes=1 overlaps=0
(repeated 8x in 12 mins)

holyscratch08: /var/log/glusterfs/bricks/holyscratch08_03-brick.log:[2014-03-31 00:57:51.973565] E [posix-helpers.c:696:posix_handle_pair] 0-holyscratch-posix: /holyscratch08_03/brick/some_lab/some_user/d9_take2_BGI/cuffdiffRN.txt: key:system.posix_acl_access error:Invalid argument
holyscratch08: /var/log/glusterfs/bricks/holyscratch08_03-brick.log:[2014-03-31 01:18:12.345818] E [posix-helpers.c:696:posix_handle_pair] 0-holyscratch-posix: /holyscratch08_03/brick/some_lab/some_user/d9_take2_BGI/cuffdiffRN.txt: key:system.posix_acl_access error:Invalid argument
holyscratch05: /var/log/glusterfs/bricks/holyscratch05_04-brick.log:[2014-03-31 21:16:37.057674] E [posix-helpers.c:696:posix_handle_pair] 0-holyscratch-posix: /holyscratch05_04/brick/some_lab/some_user/d9_take2_BGI/Diffreg/cuffdiffRN.txt: key:system.posix_acl_access error:Invalid argument
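
Those posix_handle_pair errors appear to mean a brick-side setxattr of system.posix_acl_access failed with EINVAL, so it may be worth comparing the raw ACL xattr for that file directly on the brick backends.  A hedged sketch from the admin host (the brick glob is an assumption based on the brick paths in the logs; errors from bricks that don't hold the file are filtered out):

$ pdsh -w holyscratch[01-10] 'for b in /holyscratch*_*/brick; do getfattr --absolute-names -n system.posix_acl_access -e hex $b/some_lab/some_user/d9_take2_BGI/cuffdiffRN.txt 2>&1; done' | grep -v "No such file"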

See also the gluster-users thread: http://www.gluster.org/pipermail/gluster-users/2014-April/039861.html


Version-Release number of selected component (if applicable):

3.4.0


How reproducible:

intermittent (and also complicated by NFS caching, which is otherwise a boon)


Steps to Reproduce:
1. Run getfacl or ls -l on a directory; the issue may or may not present itself (a polling sketch follows below)
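
Because the issue is intermittent, something like the following polling sketch (hypothetical mount path and threshold; NFS attribute caching may mask repeat hits) can record occurrences unattended:

# watch for slow getfacl calls on one directory; the path and 5s threshold are placeholders
DIR=/path/on/nfs/mount/d6h_take1
while true; do
    start=$(date +%s)
    getfacl "$DIR" > /dev/null 2>&1
    took=$(( $(date +%s) - start ))
    [ "$took" -gt 5 ] && echo "$(date -u '+%F %T') getfacl took ${took}s on $DIR"
    sleep 60
done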


Actual results:

30 minutes to complete


Expected results:

The call should return ENODATA promptly instead of hanging until the 30-minute RPC timeout.


Additional info:

gluster-users thread: http://www.gluster.org/pipermail/gluster-users/2014-April/039861.html

Comment 1 Vijay Bellur 2014-04-12 03:33:42 UTC
Would it be possible to upload the gluster logs from the machine that acts as the nfs server and the ones where file "d6h_take1" is present? 

When a 30 minute timeout is observed, it would be good to check syslog & dmesg on the servers that own the file to see if anything unusual is reported by the kernel.
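
A hedged one-liner for gathering that across the servers (hostnames from this report; the syslog path and the local-time grep window are assumptions that need adjusting to the distro and the incident time):

$ pdsh -w holyscratch[01-10] 'dmesg | tail -n 100; grep "Mar 31 18:4" /var/log/messages /var/log/syslog 2>/dev/null'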

Comment 2 John Brunelle 2014-04-14 20:34:16 UTC
Created attachment 886262 [details]
logs from the time of the reported issue happened

holyscratch06 was the NFS server being used in the example

Comment 3 John Brunelle 2014-04-14 20:39:50 UTC
Sure, I've attached the logs.  Redacted a bit, but I think all the essential stuff is there.

All gluster servers are NFS servers, too, but the given example happened on a client connected to holyscratch06.  "d6h_take1" is a directory, so present on all bricks.

I see a couple interesting things from syslog I did not notice before.  These happened about a minute before my traced test, and may or may not have been my preliminary poking on the same paths:

holyscratch04/syslog:Mar 31 18:42:47 holyscratch04 GlusterFS[59447]: [2014-03-31 22:42:47.465724] C [rpc-clnt.c:460:rpc_clnt_fill_request_info] 0-holyscratch-client-35: cannot lookup the saved frame corresponding to xid (2934269)
holyscratch05/syslog:Mar 31 18:42:47 holyscratch05 GlusterFS[55388]: [2014-03-31 22:42:47.455307] C [rpc-clnt.c:460:rpc_clnt_fill_request_info] 0-holyscratch-client-35: cannot lookup the saved frame corresponding to xid (715115)
holyscratch08/syslog:Mar 31 18:42:47 holyscratch08 GlusterFS[58355]: [2014-03-31 22:42:47.456727] C [rpc-clnt.c:460:rpc_clnt_fill_request_info] 0-holyscratch-client-35: cannot lookup the saved frame corresponding to xid (2049402)

p.s. "holy" stands for Holyoke, MA in case you're wondering.  The filesystem has not been deified :)

Comment 4 Niels de Vos 2015-05-17 21:57:34 UTC
GlusterFS 3.7.0 has been released (http://www.gluster.org/pipermail/gluster-users/2015-May/021901.html), and the Gluster project maintains N-2 supported releases. The last two releases before 3.7 are still maintained, at the moment these are 3.6 and 3.5.

This bug has been filed against the 3.4 release, and will not get fixed in a 3.4 version any more. Please verify whether newer versions are affected by the reported problem. If that is the case, update the bug with a note, and update the version if you can. If updating the version is not possible, leave a comment in this bug report with the version you tested, and set the "Need additional information" field below the comment box to "bugs".

If there is no response by the end of the month, this bug will get automatically closed.

Comment 5 Kaleb KEITHLEY 2015-10-07 12:11:20 UTC
GlusterFS 3.4.x has reached end-of-life.

If this bug still exists in a later release, please reopen this bug and change the version, or open a new bug.

