Bug 1481931 - [Scale] : I/O errors on multiple gNFS mounts with "Stale file handle" during rebalance of an erasure coded volume.
Summary: [Scale] : I/O errors on multiple gNFS mounts with "Stale file handle" during rebalance of an erasure coded volume.
Keywords:
Status: CLOSED CURRENTRELEASE
Alias: None
Product: GlusterFS
Classification: Community
Component: snapshot
Version: 3.12
Hardware: x86_64
OS: Linux
Priority: unspecified
Severity: high
Target Milestone: ---
Assignee: Mohammed Rafi KC
QA Contact:
URL:
Whiteboard:
Depends On: 1480591
Blocks:
 
Reported: 2017-08-16 05:28 UTC by Mohammed Rafi KC
Modified: 2017-10-26 14:36 UTC (History)
CC List: 19 users

Fixed In Version: glusterfs-3.12.0
Doc Type: If docs needed, set a value
Doc Text:
Clone Of: 1480591
Environment:
Last Closed: 2017-09-05 17:39:07 UTC
Regression: ---
Mount Type: ---
Documentation: ---
CRM:
Verified Versions:
Embargoed:



Description Mohammed Rafi KC 2017-08-16 05:28:43 UTC
+++ This bug was initially created as a clone of Bug #1480591 +++

+++ This bug was initially created as a clone of Bug #1478716 +++

Description of problem:
-----------------------

Expanded my volume from 10*(4+2) to 12*(4+2) with linux untars and lookups from  6 gNFS mounts.

Triggered rebalance.


I/O errored out on 3/3 clients with ESTALE :


On gqac007(Mounted via gqas008.sbu.lab.eng.bos.redhat.com) :

<snip>

tar: linux-4.13-rc3/arch/arm64/boot/dts: Cannot stat: Stale file handle
tar: linux-4.13-rc3/arch/arm64/boot: Cannot stat: Stale file handle
tar: linux-4.13-rc3/arch/arm64: Cannot stat: Stale file handle
tar: linux-4.13-rc3/arch: Cannot stat: Stale file handle
tar: linux-4.13-rc3: Cannot stat: Stale file handle
tar: Exiting with failure status due to previous errors
</snip>



On gqac024(mounted via gqas013.sbu.lab.eng.bos.redhat.com) :

<snip>

tar: linux-4.13-rc3/arch/arm64/boot: Cannot stat: Stale file handle
tar: linux-4.13-rc3/arch/arm64: Cannot stat: Stale file handle
tar: linux-4.13-rc3/arch: Cannot stat: Stale file handle
tar: linux-4.13-rc3: Cannot stat: Stale file handle
tar: Exiting with failure status due to previous errors

<snip>

On gqac008 (Mounted via gqas007.sbu.lab.eng.bos.redhat.com) :

<snip>

tar: linux-4.13-rc3/arch/arm64/boot: Cannot stat: Stale file handle
tar: linux-4.13-rc3/arch/arm64: Cannot stat: Stale file handle
tar: linux-4.13-rc3/arch: Cannot stat: Stale file handle
tar: linux-4.13-rc3: Cannot stat: Stale file handle
tar: Exiting with failure status due to previous errors
</snip>

Version-Release number of selected component (if applicable):
-------------------------------------------------------------

3.8.4-38

How reproducible:
-----------------

1/1


Actual results:
---------------

Application was affected. I/O errored out with ESTALE.


Expected results:
-----------------

Successful IO completion.

Additional info:

Volume Name: butcher
Type: Distributed-Disperse
Volume ID: dbd4e36f-28de-445e-a5cd-c78d0a467fb6
Status: Started
Snapshot Count: 0
Number of Bricks: 12 x (4 + 2) = 72
Transport-type: tcp
Bricks:
Brick1: gqas013.sbu.lab.eng.bos.redhat.com:/bricks1/brick
Brick2: gqas005.sbu.lab.eng.bos.redhat.com:/bricks1/brick
Brick3: gqas006.sbu.lab.eng.bos.redhat.com:/bricks1/brick
Brick4: gqas008.sbu.lab.eng.bos.redhat.com:/bricks1/brick
Brick5: gqas003.sbu.lab.eng.bos.redhat.com:/bricks1/brick
Brick6: gqas007.sbu.lab.eng.bos.redhat.com:/bricks1/brick
Brick7: gqas013.sbu.lab.eng.bos.redhat.com:/bricks4/brick
Brick8: gqas005.sbu.lab.eng.bos.redhat.com:/bricks4/brick
Brick9: gqas006.sbu.lab.eng.bos.redhat.com:/bricks4/brick
Brick10: gqas008.sbu.lab.eng.bos.redhat.com:/bricks4/brick
Brick11: gqas003.sbu.lab.eng.bos.redhat.com:/bricks4/brick
Brick12: gqas007.sbu.lab.eng.bos.redhat.com:/bricks4/brick
Brick13: gqas013.sbu.lab.eng.bos.redhat.com:/bricks5/brick
Brick14: gqas005.sbu.lab.eng.bos.redhat.com:/bricks5/brick
Brick15: gqas006.sbu.lab.eng.bos.redhat.com:/bricks5/brick
Brick16: gqas008.sbu.lab.eng.bos.redhat.com:/bricks5/brick
Brick17: gqas003.sbu.lab.eng.bos.redhat.com:/bricks5/brick
Brick18: gqas007.sbu.lab.eng.bos.redhat.com:/bricks5/brick
Brick19: gqas013.sbu.lab.eng.bos.redhat.com:/bricks6/brick
Brick20: gqas005.sbu.lab.eng.bos.redhat.com:/bricks6/brick
Brick21: gqas006.sbu.lab.eng.bos.redhat.com:/bricks6/brick
Brick22: gqas008.sbu.lab.eng.bos.redhat.com:/bricks6/brick
Brick23: gqas003.sbu.lab.eng.bos.redhat.com:/bricks6/brick
Brick24: gqas007.sbu.lab.eng.bos.redhat.com:/bricks6/brick
Brick25: gqas013.sbu.lab.eng.bos.redhat.com:/bricks7/brick
Brick26: gqas005.sbu.lab.eng.bos.redhat.com:/bricks7/brick
Brick27: gqas006.sbu.lab.eng.bos.redhat.com:/bricks7/brick
Brick28: gqas008.sbu.lab.eng.bos.redhat.com:/bricks7/brick
Brick29: gqas003.sbu.lab.eng.bos.redhat.com:/bricks7/brick
Brick30: gqas007.sbu.lab.eng.bos.redhat.com:/bricks7/brick
Brick31: gqas013.sbu.lab.eng.bos.redhat.com:/bricks8/brick
Brick32: gqas005.sbu.lab.eng.bos.redhat.com:/bricks8/brick
Brick33: gqas006.sbu.lab.eng.bos.redhat.com:/bricks8/brick
Brick34: gqas008.sbu.lab.eng.bos.redhat.com:/bricks8/brick
Brick35: gqas003.sbu.lab.eng.bos.redhat.com:/bricks8/brick
Brick36: gqas007.sbu.lab.eng.bos.redhat.com:/bricks8/brick
Brick37: gqas013.sbu.lab.eng.bos.redhat.com:/bricks9/brick
Brick38: gqas005.sbu.lab.eng.bos.redhat.com:/bricks9/brick
Brick39: gqas006.sbu.lab.eng.bos.redhat.com:/bricks9/brick
Brick40: gqas008.sbu.lab.eng.bos.redhat.com:/bricks9/brick
Brick41: gqas003.sbu.lab.eng.bos.redhat.com:/bricks9/brick
Brick42: gqas007.sbu.lab.eng.bos.redhat.com:/bricks9/brick
Brick43: gqas013.sbu.lab.eng.bos.redhat.com:/bricks10/brick
Brick44: gqas005.sbu.lab.eng.bos.redhat.com:/bricks10/brick
Brick45: gqas006.sbu.lab.eng.bos.redhat.com:/bricks10/brick
Brick46: gqas008.sbu.lab.eng.bos.redhat.com:/bricks10/brick
Brick47: gqas003.sbu.lab.eng.bos.redhat.com:/bricks10/brick
Brick48: gqas007.sbu.lab.eng.bos.redhat.com:/bricks10/brick
Brick49: gqas013.sbu.lab.eng.bos.redhat.com:/bricks2/brick
Brick50: gqas005.sbu.lab.eng.bos.redhat.com:/bricks2/brick
Brick51: gqas006.sbu.lab.eng.bos.redhat.com:/bricks2/brick
Brick52: gqas008.sbu.lab.eng.bos.redhat.com:/bricks2/brick
Brick53: gqas003.sbu.lab.eng.bos.redhat.com:/bricks2/brick
Brick54: gqas007.sbu.lab.eng.bos.redhat.com:/bricks2/brick
Brick55: gqas013.sbu.lab.eng.bos.redhat.com:/bricks3/brick
Brick56: gqas005.sbu.lab.eng.bos.redhat.com:/bricks3/brick
Brick57: gqas006.sbu.lab.eng.bos.redhat.com:/bricks3/brick
Brick58: gqas008.sbu.lab.eng.bos.redhat.com:/bricks3/brick
Brick59: gqas003.sbu.lab.eng.bos.redhat.com:/bricks3/brick
Brick60: gqas007.sbu.lab.eng.bos.redhat.com:/bricks3/brick
Brick61: gqas013.sbu.lab.eng.bos.redhat.com:/bricks11/brick
Brick62: gqas005.sbu.lab.eng.bos.redhat.com:/bricks11/brick
Brick63: gqas006.sbu.lab.eng.bos.redhat.com:/bricks11/brick
Brick64: gqas008.sbu.lab.eng.bos.redhat.com:/bricks11/brick
Brick65: gqas003.sbu.lab.eng.bos.redhat.com:/bricks11/brick
Brick66: gqas007.sbu.lab.eng.bos.redhat.com:/bricks11/brick
Brick67: gqas013.sbu.lab.eng.bos.redhat.com:/bricks12/brick
Brick68: gqas005.sbu.lab.eng.bos.redhat.com:/bricks12/brick
Brick69: gqas006.sbu.lab.eng.bos.redhat.com:/bricks12/brick
Brick70: gqas008.sbu.lab.eng.bos.redhat.com:/bricks12/brick
Brick71: gqas003.sbu.lab.eng.bos.redhat.com:/bricks12/brick
Brick72: gqas007.sbu.lab.eng.bos.redhat.com:/bricks12/brick
Options Reconfigured:
features.quota-deem-statfs: on
features.inode-quota: on
features.quota: on
features.uss: enable
client.event-threads: 4
server.event-threads: 4
network.inode-lru-limit: 50000
performance.md-cache-timeout: 600
performance.cache-invalidation: on
performance.stat-prefetch: on
features.cache-invalidation-timeout: 600
features.cache-invalidation: on
transport.address-family: inet
nfs.disable: off
[root@gqas013 ~]#

--- Additional comment from Red Hat Bugzilla Rules Engine on 2017-08-06 09:26:26 EDT ---

This bug is automatically being proposed for the current release of Red Hat Gluster Storage 3 under active development, by setting the release flag 'rhgs-3.3.0' to '?'.

If this bug should be proposed for a different release, please manually change the proposed release flag.

--- Additional comment from Ambarish on 2017-08-06 09:32:17 EDT ---

Sosreports (better to check here than on the systems, since I collected these almost as soon as the I/O error occurred; I will not reimage until a developer takes a look):



http://rhsqe-repo.lab.eng.blr.redhat.com/sosreports/1478716

--- Additional comment from Ambarish on 2017-08-06 09:36:15 EDT ---

At around the same time as the I/O error, I see reply submission and setxattr failures in the logs; not sure if they are related:

bricks8-brick.log:[2017-08-06 06:29:49.033474] E [rpcsvc.c:1333:rpcsvc_submit_generic] 0-rpc-service: failed to submit message (XID: 0xb0bfb, Program: GlusterFS 3.3, ProgVers: 330, Proc: 13) to rpc-transport (tcp.butcher-server)
bricks8-brick.log:[2017-08-06 06:29:49.033550] E [server.c:203:server_submit_reply] (-->/usr/lib64/glusterfs/3.8.4/xlator/debug/io-stats.so(+0x13808) [0x7fd3e2004808] -->/usr/lib64/glusterfs/3.8.4/xlator/protocol/server.so(+0x1f892) [0x7fd3e1baf892] -->/usr/lib64/glusterfs/3.8.4/xlator/protocol/server.so(+0x9406) [0x7fd3e1b99406] ) 0-: Reply submission failed
bricks8-brick.log:[2017-08-06 06:29:49.200438] E [rpcsvc.c:1333:rpcsvc_submit_generic] 0-rpc-service: failed to submit message (XID: 0xb2403, Program: GlusterFS 3.3, ProgVers: 330, Proc: 4) to rpc-transport (tcp.butcher-server)
bricks8-brick.log:[2017-08-06 06:29:49.200479] E [server.c:203:server_submit_reply] (-->/usr/lib64/glusterfs/3.8.4/xlator/debug/io-stats.so(+0x15c82) [0x7fd3e2006c82] -->/usr/lib64/glusterfs/3.8.4/xlator/protocol/server.so(+0x1c7e2) [0x7fd3e1bac7e2] -->/usr/lib64/glusterfs/3.8.4/xlator/protocol/server.so(+0x9406) [0x7fd3e1b99406] ) 0-: Reply submission failed
bricks8-brick.log:[2017-08-06 06:31:52.403175] E [MSGID: 113001] [posix.c:5645:_posix_handle_xattr_keyvalue_pair] 0-butcher-posix: setxattr failed on /bricks8/brick/.glusterfs/b4/2b/b42b9e62-08f4-4227-b4ef-f3f2b0162603 while doing xattrop: key=trusted.glusterfs.quota.432b1d19-a74f-46fc-8aa5-02f489acb3b1.contri.1 [No such file or directory]
[root@gqas005 bricks]# 



And a steady stream of ESTALE errors in nfs.log on those three servers:

<snip>

[2017-08-06 06:33:03.317135] E [MSGID: 112069] [nfs3.c:1692:nfs3_access_resume] 0-nfs-nfsv3: Stale file handle: (192.168.79.146:961) butcher : 6bff2f62-5a30-4dd3-81f4-d60b8e9d3daa
[2017-08-06 06:33:03.344081] E [MSGID: 112198] [nfs3-helpers.c:3691:nfs3_fh_resolve_inode_lookup_cbk] 0-nfs-nfsv3: Lookup failed: /d1/dir4: Stale file handle [Stale file handle]
[2017-08-06 06:33:03.344138] E [MSGID: 112069] [nfs3.c:1692:nfs3_access_resume] 0-nfs-nfsv3: Stale file handle: (192.168.79.146:961) butcher : 6bff2f62-5a30-4dd3-81f4-d60b8e9d3daa
[2017-08-06 06:33:03.350853] E [MSGID: 112198] [nfs3-helpers.c:3691:nfs3_fh_resolve_inode_lookup_cbk] 0-nfs-nfsv3: Lookup failed: /d1/dir4: Stale file handle [Stale file handle]
[2017-08-06 06:33:03.350900] E [MSGID: 112069] [nfs3.c:1692:nfs3_access_resume] 0-nfs-nfsv3: Stale file handle: (192.168.79.146:961) butcher : 6bff2f62-5a30-4dd3-81f4-d60b8e9d3daa


<snip>





[root@gqas008 glusterfs]# cat nfs.log |grep "Stale" | wc -l
126925

[root@gqas008 glusterfs]# cat nfs.log |grep "Stale" | wc -l
126925
[root@gqas008 glusterfs]#

[root@gqas007 glusterfs]# cat nfs.log |grep "Stale" | wc -l
120680
[root@gqas007 glusterfs]#

--- Additional comment from Atin Mukherjee on 2017-08-07 04:02:18 EDT ---

Soumya - could you take a look at this bug?

--- Additional comment from Soumya Koduri on 2017-08-07 05:19:23 EDT ---

From an initial look at the logs (gqas008/nfs.log) -

[2017-08-06 06:30:07.378122] E [MSGID: 112198] [nfs3-helpers.c:3691:nfs3_fh_resolve_inode_lookup_cbk] 0-nfs-nfsv3: Lookup failed: /d4/dir4: Stale file handle [Stale file handle]
[2017-08-06 06:30:07.378151] E [MSGID: 112069] [nfs3.c:1692:nfs3_access_resume] 0-nfs-nfsv3: Stale file handle: (192.168.79.149:958) butcher : aaf9d85f-8b09-40ec-a6a5-272e66efc2f9
[2017-08-06 06:30:07.378167] W [MSGID: 112199] [nfs3-helpers.c:3414:nfs3_log_common_res] 0-nfs-nfsv3: /d4/dir4 => (XID: ea9433ed, ACCESS: NFS: 70(Invalid file handle), POSIX: 14(Bad address)) [Invalid argument]
[2017-08-06 06:30:07.378618] E [MSGID: 112198] [nfs3-helpers.c:3691:nfs3_fh_resolve_inode_lookup_cbk] 0-nfs-nfsv3: Lookup failed: /d4/dir4: Stale file handle [Stale file handle]
[2017-08-06 06:30:07.378659] E [MSGID: 112069] [nfs3.c:1692:nfs3_access_resume] 0-nfs-nfsv3: Stale file handle: (192.168.79.149:958) butcher : aaf9d85f-8b09-40ec-a6a5-272e66efc2f9
[2017-08-06 06:30:07.378684] W [MSGID: 112199] [nfs3-helpers.c:3414:nfs3_log_common_res] 0-nfs-nfsv3: /d4/dir4 => (XID: eb9433ed, ACCESS: NFS: 70(Invalid file handle), POSIX: 14(Bad address)) [Invalid argument]
[2017-08-06 06:30:07.379205] E [MSGID: 112198] [nfs3-helpers.c:3691:nfs3_fh_resolve_inode_lookup_cbk] 0-nfs-nfsv3: Lookup failed: /d4/dir4: Stale file handle [Stale file handle]
[2017-08-06 06:30:07.379252] E [MSGID: 112069] [nfs3.c:1692:nfs3_access_resume] 0-nfs-nfsv3: Stale file handle: (192.168.79.149:958) butcher : aaf9d85f-8b09-40ec-a6a5-272e66efc2f9
[2017-08-06 06:30:07.379271] W [MSGID: 112199] [nfs3-helpers.c:3414:nfs3_log_common_res] 0-nfs-nfsv3: /d4/dir4 => (XID: ec9433ed, ACCESS: NFS: 70(Invalid file handle), POSIX: 14(Bad address)) [Invalid argument]
[2017-08-06 06:30:07.379764] E [MSGID: 112198] [nfs3-helpers.c:3691:nfs3_fh_resolve_inode_lookup_cbk] 0-nfs-nfsv3: Lookup failed: /d4/dir4: Stale file handle [Stale file handle]
[2017-08-06 06:30:07.379801] E [MSGID: 112069] [nfs3.c:1692:nfs3_access_resume] 0-nfs-nfsv3: Stale file handle: (192.168.79.149:958) butcher : aaf9d85f-8b09-40ec-a6a5-272e66efc2f9
[2017-08-06 06:30:07.379820] W [MSGID: 112199] [nfs3-helpers.c:3414:nfs3_log_common_res] 0-nfs-nfsv3: /d4/dir4 => (XID: ed9433ed, ACCESS: NFS: 70(Invalid file handle), POSIX: 14(Bad address)) [Invalid argument]
[2017-08-06 06:30:07.380342] E [MSGID: 112198] [nfs3-helpers.c:3691:nfs3_fh_resolve_inode_lookup_cbk] 0-nfs-nfsv3: Lookup failed: /d4/dir4: Stale file handle [Stale file handle]
[2017-08-06 06:30:07.380378] E [MSGID: 112069] [nfs3.c:1692:nfs3_access_resume] 0-nfs-nfsv3: Stale file handle: (192.168.79.149:958) butcher : aaf9d85f-8b09-40ec-a6a5-272e66efc2f9
[2017-08-06 06:30:07.380396] W [MSGID: 112199] [nfs3-helpers.c:3414:nfs3_log_common_res] 0-nfs-nfsv3: /d4/dir4 => (XID: ee9433ed, ACCESS: NFS: 70(Invalid file handle), POSIX: 14(Bad address)) [Invalid argument]
[2017-08-06 06:30:07.380892] E [MSGID: 112198] [nfs3-helpers.c:3691:nfs3_fh_resolve_inode_lookup_cbk] 0-nfs-nfsv3: Lookup failed: /d4/dir4: Stale file handle [Stale file handle]
[2017-08-06 06:30:07.380928] E [MSGID: 112069] [nfs3.c:1692:nfs3_access_resume] 0-nfs-nfsv3: Stale file handle: (192.168.79.149:958) butcher : aaf9d85f-8b09-40ec-a6a5-272e66efc2f9
[2017-08-06 06:30:07.380954] W [MSGID: 112199] [nfs3-helpers.c:3414:nfs3_log_common_res] 0-nfs-nfsv3: /d4/dir4 => (XID: ef9433ed, ACCESS: NFS: 70(Invalid file handle), POSIX: 14(Bad address)) [Invalid argument]
[2017-08-06 06:30:07.381463] E [MSGID: 112198] [nfs3-helpers.c:3691:nfs3_fh_resolve_inode_lookup_cbk] 0-nfs-nfsv3: Lookup failed: /d4/dir4: Stale file handle [Stale file handle]
[2017-08-06 06:30:07.381499] E [MSGID: 112069] [nfs3.c:1692:nfs3_access_resume] 0-nfs-nfsv3: Stale file handle: (192.168.79.149:958) butcher : aaf9d85f-8b09-40ec-a6a5-272e66efc2f9
[2017-08-06 06:30:07.381516] W [MSGID: 112199] [nfs3-helpers.c:3414:nfs3_log_common_res] 0-nfs-nfsv3: /d4/dir4 => (XID: f09433ed, ACCESS: NFS: 70(Invalid file handle), POSIX: 14(Bad address)) [Invalid argument]
[2017-08-06 06:30:07.381953] E [MSGID: 112198] [nfs3-helpers.c:3691:nfs3_fh_resolve_inode_lookup_cbk] 0-nfs-nfsv3: Lookup failed: /d4/dir4: Stale file handle [Stale file handle]
[2017-08-06 06:30:07.381983] E [MSGID: 112069] [nfs3.c:1692:nfs3_access_resume] 0-nfs-nfsv3: Stale file handle: (192.168.79.149:958) butcher : aaf9d85f-8b09-40ec-a6a5-272e66efc2f9

There are LOOKUP failures on the same directory /d4/dir4. However, IIRC, whenever the brick server returns an error, the client xlator also logs it.

int
client3_3_lookup_cbk (struct rpc_req *req, struct iovec *iov, int count,
                      void *myframe)
{
        clnt_local_t    *local      = NULL;
        call_frame_t    *frame      = NULL;
        int              ret        = 0;
        gfs3_lookup_rsp  rsp        = {0,};
        struct iatt      stbuf      = {0,};
        struct iatt      postparent = {0,};

...
...
out:
        /* Restore the correct op_errno to rsp.op_errno */
        rsp.op_errno = op_errno;
        if (rsp.op_ret == -1) {
                /* any error other than ENOENT */
                if (!(local->loc.name && rsp.op_errno == ENOENT) &&
                    !(rsp.op_errno == ESTALE))
                        gf_msg (this->name, GF_LOG_WARNING, rsp.op_errno,
                                PC_MSG_REMOTE_OP_FAILED, "remote operation "
                                "failed. Path: %s (%s)",
                                local->loc.path,
                                loc_gfid_utoa (&local->loc));
                else
                        gf_msg_trace (this->name, 0, "not found on remote "
                                      "node");

        }


But since there are no errors logged from client3_3_lookup_cbk(), this error may have been generated by the lower xlators.

@Ambarish,

Will it be possible to collect a packet trace on the server during the interval when this error is seen?


@Nithya/Susant,
Since this error is seen only while performing the rebalance operation, do you suspect any code path/issue that could result in ESTALE for lookup? Thanks!

--- Additional comment from Nithya Balachandran on 2017-08-08 01:31:45 EDT ---

One possibility is that the file in question has been migrated. However, from the messages it looks like the ESTALE errors were returned for directories.

In the code snippet above:
                if (!(local->loc.name && rsp.op_errno == ENOENT) &&
                    !(rsp.op_errno == ESTALE))

It looks like ESTALE errors are also not logged. A better way might be to check the brick logs to see whether any ESTALE errors are logged there.
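
To make the effect of that condition concrete, below is a minimal standalone C sketch (not GlusterFS code; the would_log() helper is invented purely for illustration) that evaluates the quoted condition for a few (name, errno) combinations and shows that ESTALE replies stay silent either way:

/* Standalone illustration of the logging condition quoted above:
 * a warning is emitted only when
 *   !(local->loc.name && rsp.op_errno == ENOENT) && !(rsp.op_errno == ESTALE)
 * holds, so ESTALE lookup replies never show up in the client log at the
 * default log level.  Build with: cc -o would_log would_log.c */
#include <errno.h>
#include <stdio.h>

static int would_log (int has_name, int op_errno)
{
        return !(has_name && op_errno == ENOENT) && !(op_errno == ESTALE);
}

int main (void)
{
        printf ("named lookup,    ENOENT: %s\n", would_log (1, ENOENT) ? "logged" : "silent");
        printf ("named lookup,    ESTALE: %s\n", would_log (1, ESTALE) ? "logged" : "silent");
        printf ("nameless lookup, ESTALE: %s\n", would_log (0, ESTALE) ? "logged" : "silent");
        printf ("named lookup,    EIO:    %s\n", would_log (1, EIO)    ? "logged" : "silent");
        return 0;
}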

--- Additional comment from Soumya Koduri on 2017-08-08 02:04:46 EDT ---

(In reply to Nithya Balachandran from comment #6)
> One possibility is if the file in question has been migrated. However from
> the messages it looks like the ESTALE errors were returned for directories.
> 
> In the code snippet above:
>                 if (!(local->loc.name && rsp.op_errno == ENOENT) &&
>                     !(rsp.op_errno == ESTALE))
> 
> It looks like ESTALE errs are also not logged. A better way might be to
> check the brick logs to see if there are any ESTALE errors logged there.

Oops..Thanks Nithya. I do not see I/O errors in the brick logs.

@Ambarish,

As mentioned in the comment above, please collect a packet trace on the gNFS server nodes for a small interval (<1 min) while the errors are observed on the mount point.

Also, if possible, please attach gdb to the gNFS process and collect a backtrace:

gdb -p <pid_gnfs>
b nfs3_fh_resolve_inode_lookup_cbk
c

When the breakpoint is hit, please collect the backtrace (bt) and paste it here. Thanks!

--- Additional comment from Ambarish on 2017-08-08 12:26:18 EDT ---

Easily reproducible on add-brick over EC+gNFS.

Soumya,

I see errors coming from the same functions again in nfs.log :

[2017-08-08 15:34:36.057012] E [MSGID: 112198] [nfs3-helpers.c:3691:nfs3_fh_resolve_inode_lookup_cbk] 0-nfs-nfsv3: Lookup failed: /d3/dir1/dir2: Stale file handle [Stale file handle]
[2017-08-08 15:34:36.057059] E [MSGID: 112069] [nfs3.c:1692:nfs3_access_resume] 0-nfs-nfsv3: Stale file handle: (192.168.79.148:1016) butcher : a1061403-ed90-4fa0-ab37-ce6c8ae347ba


I couldn't fetch the backtrace though, because the NFS server gets restarted on add-brick.

tcpdumps are copied here :

http://rhsqe-repo.lab.eng.blr.redhat.com/sosreports/1478716/Aug08/

Also,
gqac005 mounted from gqas006
gqac007 from gqas008
gqac008 from gqas007

--- Additional comment from Ambarish on 2017-08-08 23:21:35 EDT ---

OK, I have the backtrace too now (this time the issue was hit seconds after add-brick/rebalance started), so I had some time to attach gdb to the newly restarted NFS process with its new PID:

Breakpoint 1 at 0x7fd2c39933e0: file nfs3-helpers.c, line 3675.
(gdb) c
Continuing.
[Switching to Thread 0x7fd2c109a700 (LWP 6850)]

Breakpoint 1, nfs3_fh_resolve_inode_lookup_cbk (
    frame=0x7fd2c45515b0, cookie=0x7fd2c40db430,
    this=0x7fd2c40dcf20, op_ret=-1, op_errno=116,
    inode=0x7fd2a4003b00, buf=0x7fd2c1099920, xattr=0x0,
    postparent=0x7fd2c1099990) at nfs3-helpers.c:3675
3675    {
(gdb) bt
#0  nfs3_fh_resolve_inode_lookup_cbk (
    frame=0x7fd2c45515b0, cookie=0x7fd2c40db430,
    this=0x7fd2c40dcf20, op_ret=-1, op_errno=116,
    inode=0x7fd2a4003b00, buf=0x7fd2c1099920, xattr=0x0,
    postparent=0x7fd2c1099990) at nfs3-helpers.c:3675
#1  0x00007fd2c39621f0 in nfs_fop_lookup_cbk (
    frame=0x7fd2c45515b0, cookie=0x7fd2c40db430,
    this=<optimized out>, op_ret=<optimized out>,
    op_errno=<optimized out>, inode=<optimized out>,
    buf=0x7fd2c1099920, xattr=0x0,
    postparent=0x7fd2c1099990) at nfs-fops.c:420
#2  0x00007fd2c3bd82a9 in io_stats_lookup_cbk (
    frame=frame@entry=0x7fd2c4552d60,
    cookie=<optimized out>, this=<optimized out>,
    op_ret=op_ret@entry=-1, op_errno=op_errno@entry=116,
    inode=inode@entry=0x7fd2a4003b00,
    buf=buf@entry=0x7fd2c1099920, xdata=xdata@entry=0x0,
    postparent=postparent@entry=0x7fd2c1099990)
    at io-stats.c:2190
#3  0x00007fd2c3df42e6 in gf_svc_lookup_cbk (
    frame=frame@entry=0x7fd2c4552920,
    cookie=<optimized out>, this=<optimized out>,
    op_ret=-1, op_errno=116,
    inode=inode@entry=0x7fd2a4003b00,
    buf=buf@entry=0x7fd2c1099920, xdata=0x0,
    postparent=postparent@entry=0x7fd2c1099990)
    at snapview-client.c:328
#4  0x00007fd2c87772c7 in client3_3_lookup_cbk (
    req=<optimized out>, iov=<optimized out>,
    count=<optimized out>, myframe=0x7fd2c432e930)
    at client-rpc-fops.c:2947
---Type <return> to continue, or q <return> to quit---
#5  0x00007fd2d5c31840 in rpc_clnt_handle_reply (
    clnt=clnt@entry=0x7fd2c4105030,
    pollin=pollin@entry=0x7fd2a8065870) at rpc-clnt.c:794
#6  0x00007fd2d5c31b27 in rpc_clnt_notify (
    trans=<optimized out>, mydata=0x7fd2c4105060,
    event=<optimized out>, data=0x7fd2a8065870)
    at rpc-clnt.c:987
#7  0x00007fd2d5c2d9e3 in rpc_transport_notify (
    this=this@entry=0x7fd2c41051d0,
    event=event@entry=RPC_TRANSPORT_MSG_RECEIVED,
    data=data@entry=0x7fd2a8065870)
    at rpc-transport.c:538
#8  0x00007fd2caa5c3d6 in socket_event_poll_in (
    this=this@entry=0x7fd2c41051d0,
    notify_handled=<optimized out>) at socket.c:2306
#9  0x00007fd2caa5e97c in socket_event_handler (fd=110,
    idx=99, gen=1, data=0x7fd2c41051d0, poll_in=1,
    poll_out=0, poll_err=0) at socket.c:2458
#10 0x00007fd2d5ec31b6 in event_dispatch_epoll_handler (
    event=0x7fd2c1099e80, event_pool=0x55617a6db800)
    at event-epoll.c:572
#11 event_dispatch_epoll_worker (data=0x7fd2c4104a00)
    at event-epoll.c:648
#12 0x00007fd2d4cc7e25 in start_thread (
    arg=0x7fd2c109a700) at pthread_create.c:308
#13 0x00007fd2d459434d in clone ()
    at ../sysdeps/unix/sysv/linux/x86_64/clone.S:113
(gdb)
(gdb)
(gdb)
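
For anyone cross-checking the numeric op_errno=116 in the frames above, the following tiny standalone C program (illustrative only, not part of the collected data) confirms that 116 is ESTALE, i.e. "Stale file handle", on Linux:

/* Map the op_errno=116 seen in the backtrace to its symbolic name/message.
 * Build with: cc -o errno116 errno116.c */
#include <errno.h>
#include <stdio.h>
#include <string.h>

int main (void)
{
        printf ("ESTALE = %d -> %s\n", ESTALE, strerror (ESTALE));
        printf ("116         -> %s\n", strerror (116));
        return 0;
}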

--- Additional comment from Raghavendra G on 2017-08-09 03:12:05 EDT ---

(In reply to Soumya Koduri from comment #7)
> (In reply to Nithya Balachandran from comment #6)
> > One possibility is if the file in question has been migrated. However from
> > the messages it looks like the ESTALE errors were returned for directories.
> > 
> > In the code snippet above:
> >                 if (!(local->loc.name && rsp.op_errno == ENOENT) &&
> >                     !(rsp.op_errno == ESTALE))
> > 
> > It looks like ESTALE errs are also not logged. A better way might be to
> > check the brick logs to see if there are any ESTALE errors logged there.
> 
> Oops..Thanks Nithya. I do not see I/O errors in the brick logs.
> 

Bricks, too, log ESTALE/ENOENT errors at debug level. Was the log level set appropriately to capture these errors? These look like errors from storage/posix; there are some errors logged by storage/posix.

--- Additional comment from Atin Mukherjee on 2017-08-09 05:26:37 EDT ---

Is this a regression? Has the same scenario been tested in rhgs-3.2.0 bits?

--- Additional comment from Soumya Koduri on 2017-08-09 10:24:34 EDT ---

Many thanks to Ambarish, Rafi and Susant.

The update from today's debugging on Ambarish's setup: we observed ESTALE errors being sent for nameless (gfid-based) LOOKUPs by the snapview server.

But no snapshots have been created, nor is the client doing any operations on the .snaps folder. So, in theory, these LOOKUPs should first have been performed on the regular graph ("write-behind/dht/ec/protocol_client"), and the fop should be sent to the snapview server only if dht returns ENOENT.

We also observed that the gfid on which the lookup was sent to the snapview server belongs to a directory and is present on the backend. So, going by that theory, the LOOKUP may have failed on the regular graph path, and maybe on the bricks too. Unfortunately, there are no logs in nfs.log or on the bricks to confirm that.

So we have requested Ambarish to set the log level to DEBUG just before triggering add-brick, which may help with further debugging.

--- Additional comment from Ambarish on 2017-08-09 10:58:03 EDT ---

A couple of other pointers:

> I could not repro this when I disabled features.uss (add brick + rebalance for a while)

> sosreports for Attempt 2 : http://rhsqe-repo.lab.eng.blr.redhat.com/sosreports/1478716/Aug09/

> Generated Core for Attempt 2 : 


> sosreports with log level bumped up to Debug - http://rhsqe-repo.lab.eng.blr.redhat.com/sosreports/1478716/Aug09_Debug/

gqas008 and gqas013 are the servers from which the errored clients mounted.

--- Additional comment from Ambarish on 2017-08-09 10:58:32 EDT ---

(In reply to Ambarish from comment #13)
> Couple other pointers :
> 
> > I could not repro this when I disabled features.uss (add brick + rebalance for a while)
> 
> > sosreports for Attempt 2 : http://rhsqe-repo.lab.eng.blr.redhat.com/sosreports/1478716/Aug09/
> 
> > Generated Core for Attempt 2 : 

http://rhsqe-repo.lab.eng.blr.redhat.com/sosreports/1478716/Aug09/

> > sosreports with log level bumped up to Debug - http://rhsqe-repo.lab.eng.blr.redhat.com/sosreports/1478716/Aug09_Debug/
> 
> gqas008 and gqas013 are the servers from where the errored clients mounted
> from.

--- Additional comment from Atin Mukherjee on 2017-08-09 11:30:55 EDT ---

(In reply to Atin Mukherjee from comment #11)
> Is this a regression? Has the same scenario been tested in rhgs-3.2.0 bits?

Setting back the needinfo.

--- Additional comment from Soumya Koduri on 2017-08-10 03:38:52 EDT ---

Thanks Ambarish.

So this issue arises only when uss is enabled. Request Susant to take a look at the debug logs (provided in comment#14).

Meanwhile I was re-looking at the pkt trace provided in the comment#8 ->
http://rhsqe-repo.lab.eng.blr.redhat.com/sosreports/1478716/Aug08/gqas006/

The pkt trace is a bit large, but I think all the LOOKUP calls on "dir2" had op_ret=0 (success) in the responses from the regular bricks, and failed with ESTALE only in the response from the snapd daemon (which was using port 49162).

Below are a few filters which I used to analyze the pkt trace:
(tcp.dstport != 49162) && (glusterfs.bname == "dir2")
(glusterfs) && (glusterfs.proc == 27) 
((glusterfs) && (glusterfs.proc == 27)) && (gluster.op_errno != 0)

Since I couldn't find any LOOKUP failures (for 'dir2') sent from the brick servers, I suspect the issue is in gf_svc_lookup: there could be cases in which it forwards the LOOKUP to the virtual namespace instead of first trying the regular graph.

Request Rafi to comment. 

From code inspection so far, the checks in place may not be sufficient, for example:

static int32_t
gf_svc_lookup (call_frame_t *frame, xlator_t *this, loc_t *loc, dict_t *xdata)
{
       
...
...
 ret = svc_inode_ctx_get (this, loc->inode, &inode_type);
        if (!__is_root_gfid (loc->gfid)) {
                if (loc->parent) {
                        parent = inode_ref (loc->parent);
                        ret = svc_inode_ctx_get (this, loc->parent,
                                                 &parent_type);
                } else {
                        parent = inode_parent (loc->inode, loc->pargfid, NULL);
                        if (parent)
                                ret = svc_inode_ctx_get (this, parent,
                                                         &parent_type);

>>> We do not seem to be checking for 'ret' value here. What if parent_type is not yet set in inode_ctx?
                }
        }

...
...
        if (!loc->name) {
                if (gf_uuid_is_null (loc->inode->gfid)) {
                        subvolume = FIRST_CHILD (this);
                        local->subvolume = subvolume;
                        wind = _gf_true;
                        goto out;
                } else {
                        if (inode_type >= 0)
                                subvolume = svc_get_subvolume (this,
                                                               inode_type);
                        else
                                subvolume = FIRST_CHILD (this);
                        local->subvolume = subvolume;
                        wind = _gf_true;
                        goto out;
                }
        }

        if (strcmp (loc->name, priv->path)) {
                if (parent_type == NORMAL_INODE) {
                        subvolume = FIRST_CHILD (this);
                        local->subvolume = subvolume;
                } else {
>>> Do we need to check if parent_type is actually set to VIRTUAL_INODE?

                        subvolume = SECOND_CHILD (this);
                        local->subvolume = subvolume;
                }
        } else {
                subvolume = SECOND_CHILD (this);
                local->subvolume = subvolume;
                if (parent_type == NORMAL_INODE) {
                        /* Indication of whether the lookup is happening on the
                           entry point or not, to the snapview-server.
                        */
                        SVC_ENTRY_POINT_SET (this, xdata, op_ret, op_errno,
                                             new_xdata, priv, ret, out);
                }
        }

--- Additional comment from Soumya Koduri on 2017-08-10 05:03:37 EDT ---

From nfs.log provided in comment#14 -
http://rhsqe-repo.lab.eng.blr.redhat.com/sosreports/1478716/Aug09_Debug/


Around the time when nfs xlator logged ESTALE errors- 

[2017-08-09 14:32:22.109572] D [MSGID: 0] [client-rpc-fops.c:2948:client3_3_lookup_cbk] 0-stack-trace: stack-address: 0x7f88d007dd00, butcher-snapd-client returned -1 error: Stale file handle [Stale file handle]
[2017-08-09 14:32:22.109605] D [snapview-client.c:289:gf_svc_lookup_cbk] 0-butcher-snapview-client: Lookup failed on snapview graph with error Stale file handle

>>>> If the lookup had been sent to the regular graph first, we should see this error message recorded as well: "Lookup failed on normal graph with error... ".

>>>> But that does not seem to be the case.


[2017-08-09 14:32:22.109634] D [MSGID: 0] [snapview-client.c:329:gf_svc_lookup_cbk] 0-stack-trace: stack-address: 0x7f88d007dd00, butcher-snapview-client returned -1 error: Stale file handle [Stale file handle]
[2017-08-09 14:32:22.109664] D [MSGID: 0] [io-stats.c:2191:io_stats_lookup_cbk] 0-stack-trace: stack-address: 0x7f88d007dd00, butcher returned -1 error: Stale file handle [Stale file handle]
[2017-08-09 14:32:22.109685] E [MSGID: 112198] [nfs3-helpers.c:3691:nfs3_fh_resolve_inode_lookup_cbk] 0-nfs-nfsv3: Lookup failed: /d4/dir1: Stale file handle [Stale file handle]
[2017-08-09 14:32:22.109719] E [MSGID: 112069] [nfs3.c:1692:nfs3_access_resume] 0-nfs-nfsv3: Stale file handle: (192.168.79.149:992) butcher : 8694b0d6-8fba-40e3-83b9-c45d0f1ca515
[2017-08-09 14:32:22.109747] W [MSGID: 112199] [nfs3-helpers.c:3414:nfs3_log_common_res] 0-nfs-nfsv3: /d4/dir1 => (XID: 6be86, ACCESS: NFS: 70(Invalid file handle), POSIX: 14(Bad address))
[2017-08-09 14:32:22.110010] D [MSGID: 0] [nfs3-helpers.c:1646:nfs3_log_common_call] 0-nfs-nfsv3: XID: 106be86, ACCESS: args: FH: exportid 1ff111d1-32be-4590-9b9b-b1b756e29ba9, gfid 8694b0d6-8fba-40e3-83b9-c45d0f1ca515, mountid e4706d5b-0000-0000-0000-000000000000

--- Additional comment from Soumya Koduri on 2017-08-10 05:26 EDT ---



--- Additional comment from Soumya Koduri on 2017-08-10 05:27:19 EDT ---

@susant/rafi,

Attached is the snippet of debug logs (nfs.log) from the time the final graph was loaded (add-brick), for your reference.

--- Additional comment from Mohammed Rafi KC on 2017-08-10 06:25:04 EDT ---

(In reply to Soumya Koduri from comment #16)
> Thanks Ambarish.
> 
> So this issue arises only when uss is enabled. Request Susant to take a look
> at the debug logs (provided in comment#14).
> 
> Meanwhile I was re-looking at the pkt trace provided in the comment#8 ->
> http://rhsqe-repo.lab.eng.blr.redhat.com/sosreports/1478716/Aug08/gqas006/
> 
> The pkt trace is a bit large but I think all the LOOKUP calls on "dir2" had
> op_ret=0 (success) in the response from regular bricks but failed with
> ESTALE when sent from snapd daemon (which was using port: 49162).
> 
> Below are few filters which I used to analyze the pkt trace
> (tcp.dstport != 49162) && (glusterfs.bname == "dir2")
> (glusterfs) && (glusterfs.proc == 27) 
> ((glusterfs) && (glusterfs.proc == 27)) && (gluster.op_errno != 0)
> 
> Since I couldn't find any LOOKUP failures (for 'dir2') sent from the brick
> servers, I am suspecting if the issue is in gf_svc_lookup if there could be
> cases in which it could have forwarded the LOOKUP to virtual namespace
> instead of first trying on regular graph.
> 
> Request Rafi to comment. 
> 
> From code inspection so far, maybe the checks in place are not sufficient,
> like
> 
> static int32_t
> gf_svc_lookup (call_frame_t *frame, xlator_t *this, loc_t *loc, dict_t
> *xdata)
> {
>        
> ...
> ...
>  ret = svc_inode_ctx_get (this, loc->inode, &inode_type);
>         if (!__is_root_gfid (loc->gfid)) {
>                 if (loc->parent) {
>                         parent = inode_ref (loc->parent);
>                         ret = svc_inode_ctx_get (this, loc->parent,
>                                                  &parent_type);
>                 } else {
>                         parent = inode_parent (loc->inode, loc->pargfid,
> NULL);
>                         if (parent)
>                                 ret = svc_inode_ctx_get (this, parent,
>                                                          &parent_type);'
> 
> >>> We do not seem to be checking for 'ret' value here. What if parent_type is not yet set in inode_ctx?

What could have happened here is that we assumed the parent would have been resolved before doing a lookup on the inode. I understand that nfs-ganesha can link an inode without having a parent; is that the case with gNFS as well?


Still, with a fresh lookup this might fail.


>                 }
>         }
> 
> ...
> ...
>         if (!loc->name) {
>                 if (gf_uuid_is_null (loc->inode->gfid)) {
>                         subvolume = FIRST_CHILD (this);
>                         local->subvolume = subvolume;
>                         wind = _gf_true;
>                         goto out;
>                 } else {
>                         if (inode_type >= 0)
>                                 subvolume = svc_get_subvolume (this,
>                                                                inode_type);
>                         else
>                                 subvolume = FIRST_CHILD (this);
>                         local->subvolume = subvolume;
>                         wind = _gf_true;
>                         goto out;
>                 }
>         }
> 
>         if (strcmp (loc->name, priv->path)) {
>                 if (parent_type == NORMAL_INODE) {
>                         subvolume = FIRST_CHILD (this);
>                         local->subvolume = subvolume;
>                 } else {
> >>> Do we need to check if parent_type is actually set to VIRTUAL_INODE?
> 
>                         subvolume = SECOND_CHILD (this);
>                         local->subvolume = subvolume;
>                 }
>         } else {
>                 subvolume = SECOND_CHILD (this);
>                 local->subvolume = subvolume;
>                 if (parent_type == NORMAL_INODE) {
>                         /* Indication of whether the lookup is happening on
> the
>                            entry point or not, to the snapview-server.
>                         */
>                         SVC_ENTRY_POINT_SET (this, xdata, op_ret, op_errno,
>                                              new_xdata, priv, ret, out);
>                 }
>         }


Your suggestion is absolutely valid, considering the above-mentioned point. But I was thinking that gNFS won't have loc->name filled. I'm not sure to what extent that is true, because loc->name may be filled during resolve or in any other lookup from the nfs xlator.

--- Additional comment from Soumya Koduri on 2017-08-10 08:06:07 EDT ---


> > 
> > From code inspection so far, maybe the checks in place are not sufficient,
> > like
> > 
> > static int32_t
> > gf_svc_lookup (call_frame_t *frame, xlator_t *this, loc_t *loc, dict_t
> > *xdata)
> > {
> >        
> > ...
> > ...
> >  ret = svc_inode_ctx_get (this, loc->inode, &inode_type);
> >         if (!__is_root_gfid (loc->gfid)) {
> >                 if (loc->parent) {
> >                         parent = inode_ref (loc->parent);
> >                         ret = svc_inode_ctx_get (this, loc->parent,
> >                                                  &parent_type);
> >                 } else {
> >                         parent = inode_parent (loc->inode, loc->pargfid,
> > NULL);
> >                         if (parent)
> >                                 ret = svc_inode_ctx_get (this, parent,
> >                                                          &parent_type);'
> > 
> > >>> We do not seem to be checking for 'ret' value here. What if parent_type is not yet set in inode_ctx?
> 
> What could have happened here is that we might have assumed that a parent
> might have resolved before doing lookup on the inode. I understand that
> nfs-ganesh can link an inode with out having a parent, is this same case
> with gNFS also ?
> 

Yes, I think for nameless lookups, as in ganesha, loc->parent may not be set or linked by gNFS either. Again, I am not sure.

What if the inode is linked by the underlying layers and svc hasn't set the ctx value?

> >         if (!loc->name) {
> >                 if (gf_uuid_is_null (loc->inode->gfid)) {
> >                         subvolume = FIRST_CHILD (this);
> >                         local->subvolume = subvolume;
> >                         wind = _gf_true;
> >                         goto out;
> >                 } else {
> >                         if (inode_type >= 0)
> >                                 subvolume = svc_get_subvolume (this,
> >                                                                inode_type);
> >                         else
> >                                 subvolume = FIRST_CHILD (this);
> >                         local->subvolume = subvolume;
> >                         wind = _gf_true;
> >                         goto out;
> >                 }

Here the decision seems to be taken based on inode_type, which is again read from the inode_ctx; that is not valid if loc->inode is not a linked inode.

> >         }
> > 
> >         if (strcmp (loc->name, priv->path)) {
> >                 if (parent_type == NORMAL_INODE) {
> >                         subvolume = FIRST_CHILD (this);
> >                         local->subvolume = subvolume;
> >                 } else {
> > >>> Do we need to check if parent_type is actually set to VIRTUAL_INODE?
> > 
> >                         subvolume = SECOND_CHILD (this);
> >                         local->subvolume = subvolume;
> >                 }
> >         } else {
> >                 subvolume = SECOND_CHILD (this);
> >                 local->subvolume = subvolume;
> >                 if (parent_type == NORMAL_INODE) {
> >                         /* Indication of whether the lookup is happening on
> > the
> >                            entry point or not, to the snapview-server.
> >                         */
> >                         SVC_ENTRY_POINT_SET (this, xdata, op_ret, op_errno,
> >                                              new_xdata, priv, ret, out);
> >                 }
> >         }
> 
> 
> Your suggestion is absolutely valid, considering the above mentioned point.
> But I was thinking that gNFS won't have loc->name filled. I'm not sure how
> much extend this statement is true, because during the resolve or from any
> other lookup from nfs xlator may have loc->name filled.

--- Additional comment from Soumya Koduri on 2017-08-10 08:07:07 EDT ---

(In reply to Atin Mukherjee from comment #15)
> (In reply to Atin Mukherjee from comment #11)
> > Is this a regression? Has the same scenario been tested in rhgs-3.2.0 bits?
> 
> Setting back the needinfo.

This code path doesn't seem to have changed in rhgs 3.3.0. Adding back needinfo set by Atin.

--- Additional comment from Ambarish on 2017-08-10 08:15:53 EDT ---

(In reply to Atin Mukherjee from comment #15)
> (In reply to Atin Mukherjee from comment #11)
> > Is this a regression? Has the same scenario been tested in rhgs-3.2.0 bits?
> 
> Setting back the needinfo.



The test was run on 3.2 (same config), but I am not sure how easily the bug can be hit on 3.2 bits.

--- Additional comment from Red Hat Bugzilla Rules Engine on 2017-08-11 07:35:38 EDT ---

Since this bug has been approved for the RHGS 3.3.0 release of Red Hat Gluster Storage 3, through release flag 'rhgs-3.3.0+', and through the Internal Whiteboard entry of '3.3.0', the Target Release is being automatically set to 'RHGS 3.3.0'


--- Additional comment from Worker Ant on 2017-08-11 08:56:08 EDT ---

REVIEW: https://review.gluster.org/18028 (uss/svc: Send the first lookup to the normal graph) posted (#1) for review on master by mohammed rafi  kc (rkavunga)

--- Additional comment from Worker Ant on 2017-08-14 03:30:24 EDT ---

COMMIT: https://review.gluster.org/18028 committed in master by mohammed rafi  kc (rkavunga) 
------
commit d396d358d4f0cfe87693179cfd13eb2a84ce62c2
Author: Mohammed Rafi KC <rkavunga>
Date:   Fri Aug 11 17:50:52 2017 +0530

    uss/svc: Send the first lookup to the normal graph
    
    When parent inode is not looked up before, the inode type
    won't be set. In this scenario we should send the lookup
    to the normal graph first.
    
    Change-Id: I5d3e53f5c1c33a17be32204f3114749d9b1e6db8
    BUG: 1480591
    Signed-off-by: Mohammed Rafi KC <rkavunga>
    Reviewed-on: https://review.gluster.org/18028
    Smoke: Gluster Build System <jenkins.org>
    Reviewed-by: Raghavendra Bhat <raghavendra>
    CentOS-regression: Gluster Build System <jenkins.org>
    Reviewed-by: soumya k <skoduri>
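
To illustrate the idea in the commit message, here is a small self-contained C sketch (a toy model with invented enum values and function names; not the actual snapview-client code or the patch): when the parent inode's type was never recorded, treating "not NORMAL_INODE" as virtual winds the lookup to the snapview graph (which replies ESTALE), whereas defaulting an unknown parent to the normal graph sends the first lookup to the regular bricks:

/* Toy model of the subvolume-selection decision discussed in this bug.
 * TYPE_UNSET stands for "inode ctx never populated" (fresh, nameless lookup). */
#include <stdio.h>

enum inode_type { TYPE_UNSET = -1, NORMAL_INODE = 0, VIRTUAL_INODE = 1 };
enum subvol     { NORMAL_GRAPH, SNAPVIEW_GRAPH };

/* Pre-fix behaviour: anything other than an explicitly normal parent goes to
 * the snapview graph, including a parent whose type was never set. */
static enum subvol pick_subvol_old (enum inode_type parent_type)
{
        return (parent_type == NORMAL_INODE) ? NORMAL_GRAPH : SNAPVIEW_GRAPH;
}

/* Post-fix idea: only an explicitly virtual parent goes to the snapview
 * graph; an unknown/unset parent defaults to the normal graph. */
static enum subvol pick_subvol_new (enum inode_type parent_type)
{
        return (parent_type == VIRTUAL_INODE) ? SNAPVIEW_GRAPH : NORMAL_GRAPH;
}

int main (void)
{
        enum inode_type fresh = TYPE_UNSET; /* parent never looked up before */

        printf ("old: %s\n", pick_subvol_old (fresh) == SNAPVIEW_GRAPH ?
                "lookup wound to snapview graph -> ESTALE" :
                "lookup wound to normal graph");
        printf ("new: %s\n", pick_subvol_new (fresh) == NORMAL_GRAPH ?
                "lookup wound to normal graph" :
                "lookup wound to snapview graph");
        return 0;
}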

Comment 1 Worker Ant 2017-08-16 05:29:17 UTC
REVIEW: https://review.gluster.org/18042 (uss/svc: Send the first lookup to the normal graph) posted (#1) for review on release-3.12 by mohammed rafi  kc (rkavunga)

Comment 2 Worker Ant 2017-08-21 14:29:28 UTC
COMMIT: https://review.gluster.org/18042 committed in release-3.12 by Shyamsundar Ranganathan (srangana) 
------
commit 93d56da29c28a2e0161bfb8a87bd2f72c4ef4db1
Author: Mohammed Rafi KC <rkavunga>
Date:   Fri Aug 11 17:50:52 2017 +0530

    uss/svc: Send the first lookup to the normal graph
    
    When parent inode is not looked up before, the inode type
    won't be set. In this scenario we should send the lookup
    to the normal graph first.
    
    Backport of>
    >Change-Id: I5d3e53f5c1c33a17be32204f3114749d9b1e6db8
    >BUG: 1480591
    >Signed-off-by: Mohammed Rafi KC <rkavunga>
    >Reviewed-on: https://review.gluster.org/18028
    >Smoke: Gluster Build System <jenkins.org>
    >Reviewed-by: Raghavendra Bhat <raghavendra>
    >CentOS-regression: Gluster Build System <jenkins.org>
    >Reviewed-by: soumya k <skoduri>
    
    (cherry picked from commit d396d358d4f0cfe87693179cfd13eb2a84ce62c2)
    
    Change-Id: I5d3e53f5c1c33a17be32204f3114749d9b1e6db8
    BUG: 1481931
    Signed-off-by: Mohammed Rafi KC <rkavunga>
    Reviewed-on: https://review.gluster.org/18042
    Smoke: Gluster Build System <jenkins.org>
    Reviewed-by: Amar Tumballi <amarts>
    CentOS-regression: Gluster Build System <jenkins.org>
    Reviewed-by: Shyamsundar Ranganathan <srangana>

Comment 3 Shyamsundar 2017-09-05 17:39:07 UTC
This bug is getting closed because a release has been made available that should address the reported issue. In case the problem is still not fixed with glusterfs-3.12.0, please open a new bug report.

glusterfs-3.12.0 has been announced on the Gluster mailing lists [1]; packages for several distributions should become available in the near future. Keep an eye on the Gluster Users mailing list [2] and the update infrastructure for your distribution.

[1] http://lists.gluster.org/pipermail/announce/2017-September/000082.html
[2] https://www.gluster.org/pipermail/gluster-users/

