Bug 1478716 - [Scale] : I/O errors on multiple gNFS mounts with "Stale file handle" during rebalance of an erasure coded volume.
Summary: [Scale] : I/O errors on multiple gNFS mounts with "Stale file handle" during...
Keywords:
Status: CLOSED ERRATA
Alias: None
Product: Red Hat Gluster Storage
Classification: Red Hat Storage
Component: gluster-nfs
Version: rhgs-3.3
Hardware: x86_64
OS: Linux
Priority: unspecified
Severity: high
Target Milestone: ---
Target Release: RHGS 3.3.0
Assignee: Mohammed Rafi KC
QA Contact: Ambarish
URL:
Whiteboard:
Depends On: 1480591
Blocks: 1417151
 
Reported: 2017-08-06 13:26 UTC by Ambarish
Modified: 2017-09-21 05:04 UTC (History)
CC: 17 users

Fixed In Version: glusterfs-3.8.4-41
Doc Type: If docs needed, set a value
Doc Text:
Clone Of:
Cloned to: 1480591
Environment:
Last Closed: 2017-09-21 05:04:21 UTC
Embargoed:


Attachments (Terms of Use)
nfs_final_graph (3.88 MB, text/plain)
2017-08-10 09:26 UTC, Soumya Koduri


Links
System ID Private Priority Status Summary Last Updated
Red Hat Product Errata RHBA-2017:2774 0 normal SHIPPED_LIVE glusterfs bug fix and enhancement update 2017-09-21 08:16:29 UTC

Description Ambarish 2017-08-06 13:26:24 UTC
Description of problem:
-----------------------

Expanded my volume from 10 x (4+2) to 12 x (4+2) while running Linux kernel untars and lookups from 6 gNFS mounts.

Triggered rebalance.


I/O errored out on 3/3 clients with ESTALE:


On gqac007 (mounted via gqas008.sbu.lab.eng.bos.redhat.com):

<snip>

tar: linux-4.13-rc3/arch/arm64/boot/dts: Cannot stat: Stale file handle
tar: linux-4.13-rc3/arch/arm64/boot: Cannot stat: Stale file handle
tar: linux-4.13-rc3/arch/arm64: Cannot stat: Stale file handle
tar: linux-4.13-rc3/arch: Cannot stat: Stale file handle
tar: linux-4.13-rc3: Cannot stat: Stale file handle
tar: Exiting with failure status due to previous errors
</snip>



On gqac024 (mounted via gqas013.sbu.lab.eng.bos.redhat.com):

<snip>

tar: linux-4.13-rc3/arch/arm64/boot: Cannot stat: Stale file handle
tar: linux-4.13-rc3/arch/arm64: Cannot stat: Stale file handle
tar: linux-4.13-rc3/arch: Cannot stat: Stale file handle
tar: linux-4.13-rc3: Cannot stat: Stale file handle
tar: Exiting with failure status due to previous errors

</snip>

On gqac008 (mounted via gqas007.sbu.lab.eng.bos.redhat.com):

<snip>

tar: linux-4.13-rc3/arch/arm64/boot: Cannot stat: Stale file handle
tar: linux-4.13-rc3/arch/arm64: Cannot stat: Stale file handle
tar: linux-4.13-rc3/arch: Cannot stat: Stale file handle
tar: linux-4.13-rc3: Cannot stat: Stale file handle
tar: Exiting with failure status due to previous errors
</snip>

Version-Release number of selected component (if applicable):
-------------------------------------------------------------

3.8.4-38

How reproducible:
-----------------

1/1


Actual results:
---------------

The application was affected; I/O errored out with ESTALE.


Expected results:
-----------------

Successful IO completion.

Additional info:

Volume Name: butcher
Type: Distributed-Disperse
Volume ID: dbd4e36f-28de-445e-a5cd-c78d0a467fb6
Status: Started
Snapshot Count: 0
Number of Bricks: 12 x (4 + 2) = 72
Transport-type: tcp
Bricks:
Brick1: gqas013.sbu.lab.eng.bos.redhat.com:/bricks1/brick
Brick2: gqas005.sbu.lab.eng.bos.redhat.com:/bricks1/brick
Brick3: gqas006.sbu.lab.eng.bos.redhat.com:/bricks1/brick
Brick4: gqas008.sbu.lab.eng.bos.redhat.com:/bricks1/brick
Brick5: gqas003.sbu.lab.eng.bos.redhat.com:/bricks1/brick
Brick6: gqas007.sbu.lab.eng.bos.redhat.com:/bricks1/brick
Brick7: gqas013.sbu.lab.eng.bos.redhat.com:/bricks4/brick
Brick8: gqas005.sbu.lab.eng.bos.redhat.com:/bricks4/brick
Brick9: gqas006.sbu.lab.eng.bos.redhat.com:/bricks4/brick
Brick10: gqas008.sbu.lab.eng.bos.redhat.com:/bricks4/brick
Brick11: gqas003.sbu.lab.eng.bos.redhat.com:/bricks4/brick
Brick12: gqas007.sbu.lab.eng.bos.redhat.com:/bricks4/brick
Brick13: gqas013.sbu.lab.eng.bos.redhat.com:/bricks5/brick
Brick14: gqas005.sbu.lab.eng.bos.redhat.com:/bricks5/brick
Brick15: gqas006.sbu.lab.eng.bos.redhat.com:/bricks5/brick
Brick16: gqas008.sbu.lab.eng.bos.redhat.com:/bricks5/brick
Brick17: gqas003.sbu.lab.eng.bos.redhat.com:/bricks5/brick
Brick18: gqas007.sbu.lab.eng.bos.redhat.com:/bricks5/brick
Brick19: gqas013.sbu.lab.eng.bos.redhat.com:/bricks6/brick
Brick20: gqas005.sbu.lab.eng.bos.redhat.com:/bricks6/brick
Brick21: gqas006.sbu.lab.eng.bos.redhat.com:/bricks6/brick
Brick22: gqas008.sbu.lab.eng.bos.redhat.com:/bricks6/brick
Brick23: gqas003.sbu.lab.eng.bos.redhat.com:/bricks6/brick
Brick24: gqas007.sbu.lab.eng.bos.redhat.com:/bricks6/brick
Brick25: gqas013.sbu.lab.eng.bos.redhat.com:/bricks7/brick
Brick26: gqas005.sbu.lab.eng.bos.redhat.com:/bricks7/brick
Brick27: gqas006.sbu.lab.eng.bos.redhat.com:/bricks7/brick
Brick28: gqas008.sbu.lab.eng.bos.redhat.com:/bricks7/brick
Brick29: gqas003.sbu.lab.eng.bos.redhat.com:/bricks7/brick
Brick30: gqas007.sbu.lab.eng.bos.redhat.com:/bricks7/brick
Brick31: gqas013.sbu.lab.eng.bos.redhat.com:/bricks8/brick
Brick32: gqas005.sbu.lab.eng.bos.redhat.com:/bricks8/brick
Brick33: gqas006.sbu.lab.eng.bos.redhat.com:/bricks8/brick
Brick34: gqas008.sbu.lab.eng.bos.redhat.com:/bricks8/brick
Brick35: gqas003.sbu.lab.eng.bos.redhat.com:/bricks8/brick
Brick36: gqas007.sbu.lab.eng.bos.redhat.com:/bricks8/brick
Brick37: gqas013.sbu.lab.eng.bos.redhat.com:/bricks9/brick
Brick38: gqas005.sbu.lab.eng.bos.redhat.com:/bricks9/brick
Brick39: gqas006.sbu.lab.eng.bos.redhat.com:/bricks9/brick
Brick40: gqas008.sbu.lab.eng.bos.redhat.com:/bricks9/brick
Brick41: gqas003.sbu.lab.eng.bos.redhat.com:/bricks9/brick
Brick42: gqas007.sbu.lab.eng.bos.redhat.com:/bricks9/brick
Brick43: gqas013.sbu.lab.eng.bos.redhat.com:/bricks10/brick
Brick44: gqas005.sbu.lab.eng.bos.redhat.com:/bricks10/brick
Brick45: gqas006.sbu.lab.eng.bos.redhat.com:/bricks10/brick
Brick46: gqas008.sbu.lab.eng.bos.redhat.com:/bricks10/brick
Brick47: gqas003.sbu.lab.eng.bos.redhat.com:/bricks10/brick
Brick48: gqas007.sbu.lab.eng.bos.redhat.com:/bricks10/brick
Brick49: gqas013.sbu.lab.eng.bos.redhat.com:/bricks2/brick
Brick50: gqas005.sbu.lab.eng.bos.redhat.com:/bricks2/brick
Brick51: gqas006.sbu.lab.eng.bos.redhat.com:/bricks2/brick
Brick52: gqas008.sbu.lab.eng.bos.redhat.com:/bricks2/brick
Brick53: gqas003.sbu.lab.eng.bos.redhat.com:/bricks2/brick
Brick54: gqas007.sbu.lab.eng.bos.redhat.com:/bricks2/brick
Brick55: gqas013.sbu.lab.eng.bos.redhat.com:/bricks3/brick
Brick56: gqas005.sbu.lab.eng.bos.redhat.com:/bricks3/brick
Brick57: gqas006.sbu.lab.eng.bos.redhat.com:/bricks3/brick
Brick58: gqas008.sbu.lab.eng.bos.redhat.com:/bricks3/brick
Brick59: gqas003.sbu.lab.eng.bos.redhat.com:/bricks3/brick
Brick60: gqas007.sbu.lab.eng.bos.redhat.com:/bricks3/brick
Brick61: gqas013.sbu.lab.eng.bos.redhat.com:/bricks11/brick
Brick62: gqas005.sbu.lab.eng.bos.redhat.com:/bricks11/brick
Brick63: gqas006.sbu.lab.eng.bos.redhat.com:/bricks11/brick
Brick64: gqas008.sbu.lab.eng.bos.redhat.com:/bricks11/brick
Brick65: gqas003.sbu.lab.eng.bos.redhat.com:/bricks11/brick
Brick66: gqas007.sbu.lab.eng.bos.redhat.com:/bricks11/brick
Brick67: gqas013.sbu.lab.eng.bos.redhat.com:/bricks12/brick
Brick68: gqas005.sbu.lab.eng.bos.redhat.com:/bricks12/brick
Brick69: gqas006.sbu.lab.eng.bos.redhat.com:/bricks12/brick
Brick70: gqas008.sbu.lab.eng.bos.redhat.com:/bricks12/brick
Brick71: gqas003.sbu.lab.eng.bos.redhat.com:/bricks12/brick
Brick72: gqas007.sbu.lab.eng.bos.redhat.com:/bricks12/brick
Options Reconfigured:
features.quota-deem-statfs: on
features.inode-quota: on
features.quota: on
features.uss: enable
client.event-threads: 4
server.event-threads: 4
network.inode-lru-limit: 50000
performance.md-cache-timeout: 600
performance.cache-invalidation: on
performance.stat-prefetch: on
features.cache-invalidation-timeout: 600
features.cache-invalidation: on
transport.address-family: inet
nfs.disable: off
[root@gqas013 ~]#

Comment 3 Ambarish 2017-08-06 13:36:15 UTC
At around the same time as the I/O error, I see reply submission and setxattr failures in the logs; not sure if they're related:

bricks8-brick.log:[2017-08-06 06:29:49.033474] E [rpcsvc.c:1333:rpcsvc_submit_generic] 0-rpc-service: failed to submit message (XID: 0xb0bfb, Program: GlusterFS 3.3, ProgVers: 330, Proc: 13) to rpc-transport (tcp.butcher-server)
bricks8-brick.log:[2017-08-06 06:29:49.033550] E [server.c:203:server_submit_reply] (-->/usr/lib64/glusterfs/3.8.4/xlator/debug/io-stats.so(+0x13808) [0x7fd3e2004808] -->/usr/lib64/glusterfs/3.8.4/xlator/protocol/server.so(+0x1f892) [0x7fd3e1baf892] -->/usr/lib64/glusterfs/3.8.4/xlator/protocol/server.so(+0x9406) [0x7fd3e1b99406] ) 0-: Reply submission failed
bricks8-brick.log:[2017-08-06 06:29:49.200438] E [rpcsvc.c:1333:rpcsvc_submit_generic] 0-rpc-service: failed to submit message (XID: 0xb2403, Program: GlusterFS 3.3, ProgVers: 330, Proc: 4) to rpc-transport (tcp.butcher-server)
bricks8-brick.log:[2017-08-06 06:29:49.200479] E [server.c:203:server_submit_reply] (-->/usr/lib64/glusterfs/3.8.4/xlator/debug/io-stats.so(+0x15c82) [0x7fd3e2006c82] -->/usr/lib64/glusterfs/3.8.4/xlator/protocol/server.so(+0x1c7e2) [0x7fd3e1bac7e2] -->/usr/lib64/glusterfs/3.8.4/xlator/protocol/server.so(+0x9406) [0x7fd3e1b99406] ) 0-: Reply submission failed
bricks8-brick.log:[2017-08-06 06:31:52.403175] E [MSGID: 113001] [posix.c:5645:_posix_handle_xattr_keyvalue_pair] 0-butcher-posix: setxattr failed on /bricks8/brick/.glusterfs/b4/2b/b42b9e62-08f4-4227-b4ef-f3f2b0162603 while doing xattrop: key=trusted.glusterfs.quota.432b1d19-a74f-46fc-8aa5-02f489acb3b1.contri.1 [No such file or directory]
[root@gqas005 bricks]# 



And a steady stream of ESTALEs in nfs.log on those 3 servers:

<snip>

[2017-08-06 06:33:03.317135] E [MSGID: 112069] [nfs3.c:1692:nfs3_access_resume] 0-nfs-nfsv3: Stale file handle: (192.168.79.146:961) butcher : 6bff2f62-5a30-4dd3-81f4-d60b8e9d3daa
[2017-08-06 06:33:03.344081] E [MSGID: 112198] [nfs3-helpers.c:3691:nfs3_fh_resolve_inode_lookup_cbk] 0-nfs-nfsv3: Lookup failed: /d1/dir4: Stale file handle [Stale file handle]
[2017-08-06 06:33:03.344138] E [MSGID: 112069] [nfs3.c:1692:nfs3_access_resume] 0-nfs-nfsv3: Stale file handle: (192.168.79.146:961) butcher : 6bff2f62-5a30-4dd3-81f4-d60b8e9d3daa
[2017-08-06 06:33:03.350853] E [MSGID: 112198] [nfs3-helpers.c:3691:nfs3_fh_resolve_inode_lookup_cbk] 0-nfs-nfsv3: Lookup failed: /d1/dir4: Stale file handle [Stale file handle]
[2017-08-06 06:33:03.350900] E [MSGID: 112069] [nfs3.c:1692:nfs3_access_resume] 0-nfs-nfsv3: Stale file handle: (192.168.79.146:961) butcher : 6bff2f62-5a30-4dd3-81f4-d60b8e9d3daa


</snip>





[root@gqas008 glusterfs]# cat nfs.log |grep "Stale" | wc -l
126925

[root@gqas008 glusterfs]#

[root@gqas007 glusterfs]# cat nfs.log |grep "Stale" | wc -l
120680
[root@gqas007 glusterfs]#

Comment 5 Soumya Koduri 2017-08-07 09:19:23 UTC
From an initial look at the logs gqas008/nfs.log- 

[2017-08-06 06:30:07.378122] E [MSGID: 112198] [nfs3-helpers.c:3691:nfs3_fh_resolve_inode_lookup_cbk] 0-nfs-nfsv3: Lookup failed: /d4/dir4: Stale file handle [Stale file handle]
[2017-08-06 06:30:07.378151] E [MSGID: 112069] [nfs3.c:1692:nfs3_access_resume] 0-nfs-nfsv3: Stale file handle: (192.168.79.149:958) butcher : aaf9d85f-8b09-40ec-a6a5-272e66efc2f9
[2017-08-06 06:30:07.378167] W [MSGID: 112199] [nfs3-helpers.c:3414:nfs3_log_common_res] 0-nfs-nfsv3: /d4/dir4 => (XID: ea9433ed, ACCESS: NFS: 70(Invalid file handle), POSIX: 14(Bad address)) [Invalid argument]
[2017-08-06 06:30:07.378618] E [MSGID: 112198] [nfs3-helpers.c:3691:nfs3_fh_resolve_inode_lookup_cbk] 0-nfs-nfsv3: Lookup failed: /d4/dir4: Stale file handle [Stale file handle]
[2017-08-06 06:30:07.378659] E [MSGID: 112069] [nfs3.c:1692:nfs3_access_resume] 0-nfs-nfsv3: Stale file handle: (192.168.79.149:958) butcher : aaf9d85f-8b09-40ec-a6a5-272e66efc2f9
[2017-08-06 06:30:07.378684] W [MSGID: 112199] [nfs3-helpers.c:3414:nfs3_log_common_res] 0-nfs-nfsv3: /d4/dir4 => (XID: eb9433ed, ACCESS: NFS: 70(Invalid file handle), POSIX: 14(Bad address)) [Invalid argument]
[2017-08-06 06:30:07.379205] E [MSGID: 112198] [nfs3-helpers.c:3691:nfs3_fh_resolve_inode_lookup_cbk] 0-nfs-nfsv3: Lookup failed: /d4/dir4: Stale file handle [Stale file handle]
[2017-08-06 06:30:07.379252] E [MSGID: 112069] [nfs3.c:1692:nfs3_access_resume] 0-nfs-nfsv3: Stale file handle: (192.168.79.149:958) butcher : aaf9d85f-8b09-40ec-a6a5-272e66efc2f9
[2017-08-06 06:30:07.379271] W [MSGID: 112199] [nfs3-helpers.c:3414:nfs3_log_common_res] 0-nfs-nfsv3: /d4/dir4 => (XID: ec9433ed, ACCESS: NFS: 70(Invalid file handle), POSIX: 14(Bad address)) [Invalid argument]
[2017-08-06 06:30:07.379764] E [MSGID: 112198] [nfs3-helpers.c:3691:nfs3_fh_resolve_inode_lookup_cbk] 0-nfs-nfsv3: Lookup failed: /d4/dir4: Stale file handle [Stale file handle]
[2017-08-06 06:30:07.379801] E [MSGID: 112069] [nfs3.c:1692:nfs3_access_resume] 0-nfs-nfsv3: Stale file handle: (192.168.79.149:958) butcher : aaf9d85f-8b09-40ec-a6a5-272e66efc2f9
[2017-08-06 06:30:07.379820] W [MSGID: 112199] [nfs3-helpers.c:3414:nfs3_log_common_res] 0-nfs-nfsv3: /d4/dir4 => (XID: ed9433ed, ACCESS: NFS: 70(Invalid file handle), POSIX: 14(Bad address)) [Invalid argument]
[2017-08-06 06:30:07.380342] E [MSGID: 112198] [nfs3-helpers.c:3691:nfs3_fh_resolve_inode_lookup_cbk] 0-nfs-nfsv3: Lookup failed: /d4/dir4: Stale file handle [Stale file handle]
[2017-08-06 06:30:07.380378] E [MSGID: 112069] [nfs3.c:1692:nfs3_access_resume] 0-nfs-nfsv3: Stale file handle: (192.168.79.149:958) butcher : aaf9d85f-8b09-40ec-a6a5-272e66efc2f9
[2017-08-06 06:30:07.380396] W [MSGID: 112199] [nfs3-helpers.c:3414:nfs3_log_common_res] 0-nfs-nfsv3: /d4/dir4 => (XID: ee9433ed, ACCESS: NFS: 70(Invalid file handle), POSIX: 14(Bad address)) [Invalid argument]
[2017-08-06 06:30:07.380892] E [MSGID: 112198] [nfs3-helpers.c:3691:nfs3_fh_resolve_inode_lookup_cbk] 0-nfs-nfsv3: Lookup failed: /d4/dir4: Stale file handle [Stale file handle]
[2017-08-06 06:30:07.380928] E [MSGID: 112069] [nfs3.c:1692:nfs3_access_resume] 0-nfs-nfsv3: Stale file handle: (192.168.79.149:958) butcher : aaf9d85f-8b09-40ec-a6a5-272e66efc2f9
[2017-08-06 06:30:07.380954] W [MSGID: 112199] [nfs3-helpers.c:3414:nfs3_log_common_res] 0-nfs-nfsv3: /d4/dir4 => (XID: ef9433ed, ACCESS: NFS: 70(Invalid file handle), POSIX: 14(Bad address)) [Invalid argument]
[2017-08-06 06:30:07.381463] E [MSGID: 112198] [nfs3-helpers.c:3691:nfs3_fh_resolve_inode_lookup_cbk] 0-nfs-nfsv3: Lookup failed: /d4/dir4: Stale file handle [Stale file handle]
[2017-08-06 06:30:07.381499] E [MSGID: 112069] [nfs3.c:1692:nfs3_access_resume] 0-nfs-nfsv3: Stale file handle: (192.168.79.149:958) butcher : aaf9d85f-8b09-40ec-a6a5-272e66efc2f9
[2017-08-06 06:30:07.381516] W [MSGID: 112199] [nfs3-helpers.c:3414:nfs3_log_common_res] 0-nfs-nfsv3: /d4/dir4 => (XID: f09433ed, ACCESS: NFS: 70(Invalid file handle), POSIX: 14(Bad address)) [Invalid argument]
[2017-08-06 06:30:07.381953] E [MSGID: 112198] [nfs3-helpers.c:3691:nfs3_fh_resolve_inode_lookup_cbk] 0-nfs-nfsv3: Lookup failed: /d4/dir4: Stale file handle [Stale file handle]
[2017-08-06 06:30:07.381983] E [MSGID: 112069] [nfs3.c:1692:nfs3_access_resume] 0-nfs-nfsv3: Stale file handle: (192.168.79.149:958) butcher : aaf9d85f-8b09-40ec-a6a5-272e66efc2f9

There are LOOKUP failures on the same directory /d4/dir4. However, IIRC, whenever an error is sent by the brick server, the client xlator also logs it.

int
client3_3_lookup_cbk (struct rpc_req *req, struct iovec *iov, int count,
                      void *myframe)
{
        clnt_local_t    *local      = NULL;
        call_frame_t    *frame      = NULL;
        int              ret        = 0;
        gfs3_lookup_rsp  rsp        = {0,};
        struct iatt      stbuf      = {0,};
        struct iatt      postparent = {0,};

...
...
out:
        /* Restore the correct op_errno to rsp.op_errno */
        rsp.op_errno = op_errno;
        if (rsp.op_ret == -1) {
                /* any error other than ENOENT */
                if (!(local->loc.name && rsp.op_errno == ENOENT) &&
                    !(rsp.op_errno == ESTALE))
                        gf_msg (this->name, GF_LOG_WARNING, rsp.op_errno,
                                PC_MSG_REMOTE_OP_FAILED, "remote operation "
                                "failed. Path: %s (%s)",
                                local->loc.path,
                                loc_gfid_utoa (&local->loc));
                else
                        gf_msg_trace (this->name, 0, "not found on remote "
                                      "node");

        }


But since there are no errors logged from "client3_3_lookup_cbk()", this error may have been generated by the lower xlators.

@Ambarish,

Would it be possible to collect a packet trace during the interval in which this error is seen on the server?


@Nithya/Susant,
Since this error is seen only while performing the rebalance operation, do you suspect any code path/issue which could result in ESTALE for lookup? Thanks!

Comment 6 Nithya Balachandran 2017-08-08 05:31:45 UTC
One possibility is that the file in question has been migrated. However, from the messages it looks like the ESTALE errors were returned for directories.

In the code snippet above:
                if (!(local->loc.name && rsp.op_errno == ENOENT) &&
                    !(rsp.op_errno == ESTALE))

It looks like ESTALE errors are also not logged. A better approach might be to check the brick logs to see if any ESTALE errors are logged there.

Comment 7 Soumya Koduri 2017-08-08 06:04:46 UTC
(In reply to Nithya Balachandran from comment #6)
> One possibility is if the file in question has been migrated. However from
> the messages it looks like the ESTALE errors were returned for directories.
> 
> In the code snippet above:
>                 if (!(local->loc.name && rsp.op_errno == ENOENT) &&
>                     !(rsp.op_errno == ESTALE))
> 
> It looks like ESTALE errs are also not logged. A better way might be to
> check the brick logs to see if there are any ESTALE errors logged there.

Oops.. Thanks, Nithya. I do not see I/O errors in the brick logs.

@Ambarish,

As mentioned in the above comment, please collect a packet trace on the gNFS server nodes for a small interval (<1 min) while the errors are observed on the mount point.

Also, if possible, please attach gdb to the gNFS process and collect a backtrace:

gdb -p <pid_gnfs>
b nfs3_fh_resolve_inode_lookup_cbk
c

When the breakpoint is hit, please collect the backtrace (bt) and paste it here. Thanks!

Comment 10 Raghavendra G 2017-08-09 07:12:05 UTC
(In reply to Soumya Koduri from comment #7)
> (In reply to Nithya Balachandran from comment #6)
> > One possibility is if the file in question has been migrated. However from
> > the messages it looks like the ESTALE errors were returned for directories.
> > 
> > In the code snippet above:
> >                 if (!(local->loc.name && rsp.op_errno == ENOENT) &&
> >                     !(rsp.op_errno == ESTALE))
> > 
> > It looks like ESTALE errs are also not logged. A better way might be to
> > check the brick logs to see if there are any ESTALE errors logged there.
> 
> Oops..Thanks Nithya. I do not see I/O errors in the brick logs.
> 

Even the bricks log ESTALE/ENOENT errors only at debug level. Was the log level set appropriately to capture these errors? These look like errors from storage/posix, and there are some errors logged by storage/posix.

Comment 12 Soumya Koduri 2017-08-09 14:24:34 UTC
Many thanks to Ambarish, Rafi and Susant.

The update we have from today's debugging on Ambarish's setup: we observed ESTALE errors being sent for nameless (gfid-based) LOOKUPs by the snapview server.

But no snapshots have been created, nor is the client doing any operations on the .snaps folder. So, in theory, these LOOKUPs should first have been performed on the regular graph (write-behind/dht/ec/protocol_client), and only if dht returns ENOENT should the fop be sent to the snapview server.

We also observed that the gfid on which the lookup was sent to the snapview server belongs to a directory and is present at the backend. So, going by theory, the LOOKUP may have failed on the regular graph path and maybe on the bricks too. Unfortunately, there are no logs in nfs.log or on the bricks to confirm that.

So we have requested Ambarish to set the log level to DEBUG just before triggering add-brick, which may help us in further debugging.

Comment 16 Soumya Koduri 2017-08-10 07:38:52 UTC
Thanks Ambarish.

So this issue arises only when uss is enabled. Requesting Susant to take a look at the debug logs (provided in comment#14).

Meanwhile, I was re-examining the packet trace provided in comment#8 ->
http://rhsqe-repo.lab.eng.blr.redhat.com/sosreports/1478716/Aug08/gqas006/

The packet trace is a bit large, but I think all the LOOKUP calls on "dir2" had op_ret=0 (success) in the responses from the regular bricks, and failed with ESTALE only in the responses from the snapd daemon (which was using port 49162).

Below are a few filters which I used to analyze the packet trace:
(tcp.dstport != 49162) && (glusterfs.bname == "dir2")
(glusterfs) && (glusterfs.proc == 27) 
((glusterfs) && (glusterfs.proc == 27)) && (gluster.op_errno != 0)

Since I couldn't find any LOOKUP failures (for 'dir2') sent from the brick servers, I suspect the issue is in gf_svc_lookup: there could be cases in which it forwarded the LOOKUP to the virtual namespace instead of first trying the regular graph.

Request Rafi to comment. 

From code inspection so far, the checks in place may not be sufficient. For example:

static int32_t
gf_svc_lookup (call_frame_t *frame, xlator_t *this, loc_t *loc, dict_t *xdata)
{
       
...
...
 ret = svc_inode_ctx_get (this, loc->inode, &inode_type);
        if (!__is_root_gfid (loc->gfid)) {
                if (loc->parent) {
                        parent = inode_ref (loc->parent);
                        ret = svc_inode_ctx_get (this, loc->parent,
                                                 &parent_type);
                } else {
                        parent = inode_parent (loc->inode, loc->pargfid, NULL);
                        if (parent)
                                ret = svc_inode_ctx_get (this, parent,
                                                         &parent_type);

>>> We do not seem to be checking for 'ret' value here. What if parent_type is not yet set in inode_ctx?
                }
        }

...
...
        if (!loc->name) {
                if (gf_uuid_is_null (loc->inode->gfid)) {
                        subvolume = FIRST_CHILD (this);
                        local->subvolume = subvolume;
                        wind = _gf_true;
                        goto out;
                } else {
                        if (inode_type >= 0)
                                subvolume = svc_get_subvolume (this,
                                                               inode_type);
                        else
                                subvolume = FIRST_CHILD (this);
                        local->subvolume = subvolume;
                        wind = _gf_true;
                        goto out;
                }
        }

        if (strcmp (loc->name, priv->path)) {
                if (parent_type == NORMAL_INODE) {
                        subvolume = FIRST_CHILD (this);
                        local->subvolume = subvolume;
                } else {
>>> Do we need to check if parent_type is actually set to VIRTUAL_INODE?

                        subvolume = SECOND_CHILD (this);
                        local->subvolume = subvolume;
                }
        } else {
                subvolume = SECOND_CHILD (this);
                local->subvolume = subvolume;
                if (parent_type == NORMAL_INODE) {
                        /* Indication of whether the lookup is happening on the
                           entry point or not, to the snapview-server.
                        */
                        SVC_ENTRY_POINT_SET (this, xdata, op_ret, op_errno,
                                             new_xdata, priv, ret, out);
                }
        }
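
For illustration only, a minimal sketch of the kind of defensive fallback the annotations above are hinting at (hypothetical, not the actual upstream patch): whenever the ctx of the parent cannot be read, default to the normal graph instead of winding the LOOKUP to snapview-server.

        /* Hypothetical sketch -- not the shipped fix. If the parent ctx
         * could not be read (ret < 0), assume the normal graph rather
         * than guessing that the inode is virtual. */
        ret = svc_inode_ctx_get (this, loc->parent, &parent_type);
        if (ret < 0)
                parent_type = NORMAL_INODE;

        ...

        if (strcmp (loc->name, priv->path)) {
                if (parent_type == VIRTUAL_INODE)
                        subvolume = SECOND_CHILD (this); /* snapview-server */
                else
                        subvolume = FIRST_CHILD (this);  /* regular graph */
                local->subvolume = subvolume;
        }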

Comment 17 Soumya Koduri 2017-08-10 09:03:37 UTC
From nfs.log provided in comment#14 -
http://rhsqe-repo.lab.eng.blr.redhat.com/sosreports/1478716/Aug09_Debug/


Around the time when nfs xlator logged ESTALE errors- 

[2017-08-09 14:32:22.109572] D [MSGID: 0] [client-rpc-fops.c:2948:client3_3_lookup_cbk] 0-stack-trace: stack-address: 0x7f88d007dd00, butcher-snapd-client returned -1 error: Stale file handle [Stale file handle]
[2017-08-09 14:32:22.109605] D [snapview-client.c:289:gf_svc_lookup_cbk] 0-butcher-snapview-client: Lookup failed on snapview graph with error Stale file handle

>>>> If the lookup had been sent to the regular graph first, we should see this error message recorded as well: "Lookup failed on normal graph with error... ".

>>>> But that doesn't seem to be the case


[2017-08-09 14:32:22.109634] D [MSGID: 0] [snapview-client.c:329:gf_svc_lookup_cbk] 0-stack-trace: stack-address: 0x7f88d007dd00, butcher-snapview-client returned -1 error: Stale file handle [Stale file handle]
[2017-08-09 14:32:22.109664] D [MSGID: 0] [io-stats.c:2191:io_stats_lookup_cbk] 0-stack-trace: stack-address: 0x7f88d007dd00, butcher returned -1 error: Stale file handle [Stale file handle]
[2017-08-09 14:32:22.109685] E [MSGID: 112198] [nfs3-helpers.c:3691:nfs3_fh_resolve_inode_lookup_cbk] 0-nfs-nfsv3: Lookup failed: /d4/dir1: Stale file handle [Stale file handle]
[2017-08-09 14:32:22.109719] E [MSGID: 112069] [nfs3.c:1692:nfs3_access_resume] 0-nfs-nfsv3: Stale file handle: (192.168.79.149:992) butcher : 8694b0d6-8fba-40e3-83b9-c45d0f1ca515
[2017-08-09 14:32:22.109747] W [MSGID: 112199] [nfs3-helpers.c:3414:nfs3_log_common_res] 0-nfs-nfsv3: /d4/dir1 => (XID: 6be86, ACCESS: NFS: 70(Invalid file handle), POSIX: 14(Bad address))
[2017-08-09 14:32:22.110010] D [MSGID: 0] [nfs3-helpers.c:1646:nfs3_log_common_call] 0-nfs-nfsv3: XID: 106be86, ACCESS: args: FH: exportid 1ff111d1-32be-4590-9b9b-b1b756e29ba9, gfid 8694b0d6-8fba-40e3-83b9-c45d0f1ca515, mountid e4706d5b-0000-0000-0000-000000000000

Comment 18 Soumya Koduri 2017-08-10 09:26:01 UTC
Created attachment 1311641 [details]
nfs_final_graph

Comment 19 Soumya Koduri 2017-08-10 09:27:19 UTC
@susant/rafi,

Attached the snippet of debug logs (nfs.log) from the time final graph was loaded (add-brick) for your reference.

Comment 20 Mohammed Rafi KC 2017-08-10 10:25:04 UTC
(In reply to Soumya Koduri from comment #16)
> Thanks Ambarish.
> 
> So this issue arises only when uss is enabled. Request Susant to take a look
> at the debug logs (provided in comment#14).
> 
> Meanwhile I was re-looking at the pkt trace provided in the comment#8 ->
> http://rhsqe-repo.lab.eng.blr.redhat.com/sosreports/1478716/Aug08/gqas006/
> 
> The pkt trace is a bit large but I think all the LOOKUP calls on "dir2" had
> op_ret=0 (success) in the response from regular bricks but failed with
> ESTALE when sent from snapd daemon (which was using port: 49162).
> 
> Below are few filters which I used to analyze the pkt trace
> (tcp.dstport != 49162) && (glusterfs.bname == "dir2")
> (glusterfs) && (glusterfs.proc == 27) 
> ((glusterfs) && (glusterfs.proc == 27)) && (gluster.op_errno != 0)
> 
> Since I couldn't find any LOOKUP failures (for 'dir2') sent from the brick
> servers, I am suspecting if the issue is in gf_svc_lookup if there could be
> cases in which it could have forwarded the LOOKUP to virtual namespace
> instead of first trying on regular graph.
> 
> Request Rafi to comment. 
> 
> From code inspection so far, maybe the checks in place are not sufficient,
> like
> 
> static int32_t
> gf_svc_lookup (call_frame_t *frame, xlator_t *this, loc_t *loc, dict_t
> *xdata)
> {
>        
> ...
> ...
>  ret = svc_inode_ctx_get (this, loc->inode, &inode_type);
>         if (!__is_root_gfid (loc->gfid)) {
>                 if (loc->parent) {
>                         parent = inode_ref (loc->parent);
>                         ret = svc_inode_ctx_get (this, loc->parent,
>                                                  &parent_type);
>                 } else {
>                         parent = inode_parent (loc->inode, loc->pargfid,
> NULL);
>                         if (parent)
>                                 ret = svc_inode_ctx_get (this, parent,
>                                                          &parent_type);'
> 
> >>> We do not seem to be checking for 'ret' value here. What if parent_type is not yet set in inode_ctx?

What could have happened here is that we assumed the parent would already have been resolved before doing a lookup on the inode. I understand that nfs-ganesha can link an inode without having a parent; is this the same case with gNFS also?


Still, with a fresh lookup this might fail.


>                 }
>         }
> 
> ...
> ...
>         if (!loc->name) {
>                 if (gf_uuid_is_null (loc->inode->gfid)) {
>                         subvolume = FIRST_CHILD (this);
>                         local->subvolume = subvolume;
>                         wind = _gf_true;
>                         goto out;
>                 } else {
>                         if (inode_type >= 0)
>                                 subvolume = svc_get_subvolume (this,
>                                                                inode_type);
>                         else
>                                 subvolume = FIRST_CHILD (this);
>                         local->subvolume = subvolume;
>                         wind = _gf_true;
>                         goto out;
>                 }
>         }
> 
>         if (strcmp (loc->name, priv->path)) {
>                 if (parent_type == NORMAL_INODE) {
>                         subvolume = FIRST_CHILD (this);
>                         local->subvolume = subvolume;
>                 } else {
> >>> Do we need to check if parent_type is actually set to VIRTUAL_INODE?
> 
>                         subvolume = SECOND_CHILD (this);
>                         local->subvolume = subvolume;
>                 }
>         } else {
>                 subvolume = SECOND_CHILD (this);
>                 local->subvolume = subvolume;
>                 if (parent_type == NORMAL_INODE) {
>                         /* Indication of whether the lookup is happening on
> the
>                            entry point or not, to the snapview-server.
>                         */
>                         SVC_ENTRY_POINT_SET (this, xdata, op_ret, op_errno,
>                                              new_xdata, priv, ret, out);
>                 }
>         }


Your suggestion is absolutely valid, considering the above-mentioned point. But I was thinking that gNFS won't have loc->name filled. I'm not sure to what extent that statement is true, because during resolve, or in any other lookup from the nfs xlator, loc->name may be filled.

Comment 21 Soumya Koduri 2017-08-10 12:06:07 UTC
> > 
> > From code inspection so far, maybe the checks in place are not sufficient,
> > like
> > 
> > static int32_t
> > gf_svc_lookup (call_frame_t *frame, xlator_t *this, loc_t *loc, dict_t
> > *xdata)
> > {
> >        
> > ...
> > ...
> >  ret = svc_inode_ctx_get (this, loc->inode, &inode_type);
> >         if (!__is_root_gfid (loc->gfid)) {
> >                 if (loc->parent) {
> >                         parent = inode_ref (loc->parent);
> >                         ret = svc_inode_ctx_get (this, loc->parent,
> >                                                  &parent_type);
> >                 } else {
> >                         parent = inode_parent (loc->inode, loc->pargfid,
> > NULL);
> >                         if (parent)
> >                                 ret = svc_inode_ctx_get (this, parent,
> >                                                          &parent_type);'
> > 
> > >>> We do not seem to be checking for 'ret' value here. What if parent_type is not yet set in inode_ctx?
> 
> What could have happened here is that we might have assumed that a parent
> might have resolved before doing lookup on the inode. I understand that
> nfs-ganesh can link an inode with out having a parent, is this same case
> with gNFS also ?
> 

Yes, I think for nameless lookups, as in ganesha, loc->parent may not be set or linked by gNFS either. Again, I am not sure.

What if the inode is linked by the underlying layers and svc hasn't set the ctx value?

> >         if (!loc->name) {
> >                 if (gf_uuid_is_null (loc->inode->gfid)) {
> >                         subvolume = FIRST_CHILD (this);
> >                         local->subvolume = subvolume;
> >                         wind = _gf_true;
> >                         goto out;
> >                 } else {
> >                         if (inode_type >= 0)
> >                                 subvolume = svc_get_subvolume (this,
> >                                                                inode_type);
> >                         else
> >                                 subvolume = FIRST_CHILD (this);
> >                         local->subvolume = subvolume;
> >                         wind = _gf_true;
> >                         goto out;
> >                 }

Here the decision seems to be made based on inode_type, which again is read from the inode ctx and is not valid if loc->inode is not a linked inode.

> >         }
> > 
> >         if (strcmp (loc->name, priv->path)) {
> >                 if (parent_type == NORMAL_INODE) {
> >                         subvolume = FIRST_CHILD (this);
> >                         local->subvolume = subvolume;
> >                 } else {
> > >>> Do we need to check if parent_type is actually set to VIRTUAL_INODE?
> > 
> >                         subvolume = SECOND_CHILD (this);
> >                         local->subvolume = subvolume;
> >                 }
> >         } else {
> >                 subvolume = SECOND_CHILD (this);
> >                 local->subvolume = subvolume;
> >                 if (parent_type == NORMAL_INODE) {
> >                         /* Indication of whether the lookup is happening on
> > the
> >                            entry point or not, to the snapview-server.
> >                         */
> >                         SVC_ENTRY_POINT_SET (this, xdata, op_ret, op_errno,
> >                                              new_xdata, priv, ret, out);
> >                 }
> >         }
> 
> 
> Your suggestion is absolutely valid, considering the above mentioned point.
> But I was thinking that gNFS won't have loc->name filled. I'm not sure how
> much extend this statement is true, because during the resolve or from any
> other lookup from nfs xlator may have loc->name filled.

Comment 26 Atin Mukherjee 2017-08-11 13:07:12 UTC
upstream patch : https://review.gluster.org/18028

Comment 28 Mohammed Rafi KC 2017-08-14 12:07:26 UTC
Thanks to Soumya for the RCA,

When restarting an nfs process, the parent inode was not linked in the inode table while resolving an entry. So the resolve ends up sending a lookup without its parent, though the path and name are filled in by the inode_path call. There was one combination in svc that didn't handle this case.
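
To spell the combination out (an illustrative sketch only, based on the code quoted in comment 16; not the literal fix that was merged):

        /* After an NFS restart, resolve can produce a loc where
         * loc->name is filled (via inode_path) but loc->parent is not
         * linked in the inode table, so svc_inode_ctx_get() on the
         * parent fails and parent_type is never set. gf_svc_lookup()
         * then reaches the strcmp (loc->name, priv->path) branch and
         * can wind the LOOKUP to snapview-server, which replies with
         * ESTALE. Handling that combination, e.g. by defaulting to the
         * normal graph when the parent ctx is unavailable, avoids it:
         */
        if (ret < 0 && loc->name)
                parent_type = NORMAL_INODE;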

Comment 31 Mohammed Rafi KC 2017-08-16 05:31:10 UTC
3.12 release branch patch id : https://review.gluster.org/#/c/18042

Comment 33 Ambarish 2017-08-18 09:34:53 UTC
Cannot reproduce the reported problem on 3.8.4-41 on multiple retries.

Moving to Verified.

Comment 35 errata-xmlrpc 2017-09-21 05:04:21 UTC
Since the problem described in this bug report should be
resolved in a recent advisory, it has been closed with a
resolution of ERRATA.

For information on the advisory, and where to find the updated
files, follow the link below.

If the solution does not work for you, open a new bug report.

https://access.redhat.com/errata/RHBA-2017:2774

