Description of problem:
-----------------------
Expanded my volume from 10*(4+2) to 12*(4+2) with Linux kernel untars and lookups running from 6 gNFS mounts. Triggered rebalance. I/O errored out on 3/3 clients with ESTALE:

On gqac007 (mounted via gqas008.sbu.lab.eng.bos.redhat.com):

<snip>
tar: linux-4.13-rc3/arch/arm64/boot/dts: Cannot stat: Stale file handle
tar: linux-4.13-rc3/arch/arm64/boot: Cannot stat: Stale file handle
tar: linux-4.13-rc3/arch/arm64: Cannot stat: Stale file handle
tar: linux-4.13-rc3/arch: Cannot stat: Stale file handle
tar: linux-4.13-rc3: Cannot stat: Stale file handle
tar: Exiting with failure status due to previous errors
</snip>

On gqac024 (mounted via gqas013.sbu.lab.eng.bos.redhat.com):

<snip>
tar: linux-4.13-rc3/arch/arm64/boot: Cannot stat: Stale file handle
tar: linux-4.13-rc3/arch/arm64: Cannot stat: Stale file handle
tar: linux-4.13-rc3/arch: Cannot stat: Stale file handle
tar: linux-4.13-rc3: Cannot stat: Stale file handle
tar: Exiting with failure status due to previous errors
</snip>

On gqac008 (mounted via gqas007.sbu.lab.eng.bos.redhat.com):

<snip>
tar: linux-4.13-rc3/arch/arm64/boot: Cannot stat: Stale file handle
tar: linux-4.13-rc3/arch/arm64: Cannot stat: Stale file handle
tar: linux-4.13-rc3/arch: Cannot stat: Stale file handle
tar: linux-4.13-rc3: Cannot stat: Stale file handle
tar: Exiting with failure status due to previous errors
</snip>

Version-Release number of selected component (if applicable):
-------------------------------------------------------------
3.8.4-38

How reproducible:
-----------------
1/1

Actual results:
---------------
Application I/O was affected; it errored out with ESTALE.

Expected results:
-----------------
Successful I/O completion.
Additional info:
----------------
Volume Name: butcher
Type: Distributed-Disperse
Volume ID: dbd4e36f-28de-445e-a5cd-c78d0a467fb6
Status: Started
Snapshot Count: 0
Number of Bricks: 12 x (4 + 2) = 72
Transport-type: tcp
Bricks:
Brick1: gqas013.sbu.lab.eng.bos.redhat.com:/bricks1/brick
Brick2: gqas005.sbu.lab.eng.bos.redhat.com:/bricks1/brick
Brick3: gqas006.sbu.lab.eng.bos.redhat.com:/bricks1/brick
Brick4: gqas008.sbu.lab.eng.bos.redhat.com:/bricks1/brick
Brick5: gqas003.sbu.lab.eng.bos.redhat.com:/bricks1/brick
Brick6: gqas007.sbu.lab.eng.bos.redhat.com:/bricks1/brick
Brick7: gqas013.sbu.lab.eng.bos.redhat.com:/bricks4/brick
Brick8: gqas005.sbu.lab.eng.bos.redhat.com:/bricks4/brick
Brick9: gqas006.sbu.lab.eng.bos.redhat.com:/bricks4/brick
Brick10: gqas008.sbu.lab.eng.bos.redhat.com:/bricks4/brick
Brick11: gqas003.sbu.lab.eng.bos.redhat.com:/bricks4/brick
Brick12: gqas007.sbu.lab.eng.bos.redhat.com:/bricks4/brick
Brick13: gqas013.sbu.lab.eng.bos.redhat.com:/bricks5/brick
Brick14: gqas005.sbu.lab.eng.bos.redhat.com:/bricks5/brick
Brick15: gqas006.sbu.lab.eng.bos.redhat.com:/bricks5/brick
Brick16: gqas008.sbu.lab.eng.bos.redhat.com:/bricks5/brick
Brick17: gqas003.sbu.lab.eng.bos.redhat.com:/bricks5/brick
Brick18: gqas007.sbu.lab.eng.bos.redhat.com:/bricks5/brick
Brick19: gqas013.sbu.lab.eng.bos.redhat.com:/bricks6/brick
Brick20: gqas005.sbu.lab.eng.bos.redhat.com:/bricks6/brick
Brick21: gqas006.sbu.lab.eng.bos.redhat.com:/bricks6/brick
Brick22: gqas008.sbu.lab.eng.bos.redhat.com:/bricks6/brick
Brick23: gqas003.sbu.lab.eng.bos.redhat.com:/bricks6/brick
Brick24: gqas007.sbu.lab.eng.bos.redhat.com:/bricks6/brick
Brick25: gqas013.sbu.lab.eng.bos.redhat.com:/bricks7/brick
Brick26: gqas005.sbu.lab.eng.bos.redhat.com:/bricks7/brick
Brick27: gqas006.sbu.lab.eng.bos.redhat.com:/bricks7/brick
Brick28: gqas008.sbu.lab.eng.bos.redhat.com:/bricks7/brick
Brick29: gqas003.sbu.lab.eng.bos.redhat.com:/bricks7/brick
Brick30: gqas007.sbu.lab.eng.bos.redhat.com:/bricks7/brick
Brick31: gqas013.sbu.lab.eng.bos.redhat.com:/bricks8/brick
Brick32: gqas005.sbu.lab.eng.bos.redhat.com:/bricks8/brick
Brick33: gqas006.sbu.lab.eng.bos.redhat.com:/bricks8/brick
Brick34: gqas008.sbu.lab.eng.bos.redhat.com:/bricks8/brick
Brick35: gqas003.sbu.lab.eng.bos.redhat.com:/bricks8/brick
Brick36: gqas007.sbu.lab.eng.bos.redhat.com:/bricks8/brick
Brick37: gqas013.sbu.lab.eng.bos.redhat.com:/bricks9/brick
Brick38: gqas005.sbu.lab.eng.bos.redhat.com:/bricks9/brick
Brick39: gqas006.sbu.lab.eng.bos.redhat.com:/bricks9/brick
Brick40: gqas008.sbu.lab.eng.bos.redhat.com:/bricks9/brick
Brick41: gqas003.sbu.lab.eng.bos.redhat.com:/bricks9/brick
Brick42: gqas007.sbu.lab.eng.bos.redhat.com:/bricks9/brick
Brick43: gqas013.sbu.lab.eng.bos.redhat.com:/bricks10/brick
Brick44: gqas005.sbu.lab.eng.bos.redhat.com:/bricks10/brick
Brick45: gqas006.sbu.lab.eng.bos.redhat.com:/bricks10/brick
Brick46: gqas008.sbu.lab.eng.bos.redhat.com:/bricks10/brick
Brick47: gqas003.sbu.lab.eng.bos.redhat.com:/bricks10/brick
Brick48: gqas007.sbu.lab.eng.bos.redhat.com:/bricks10/brick
Brick49: gqas013.sbu.lab.eng.bos.redhat.com:/bricks2/brick
Brick50: gqas005.sbu.lab.eng.bos.redhat.com:/bricks2/brick
Brick51: gqas006.sbu.lab.eng.bos.redhat.com:/bricks2/brick
Brick52: gqas008.sbu.lab.eng.bos.redhat.com:/bricks2/brick
Brick53: gqas003.sbu.lab.eng.bos.redhat.com:/bricks2/brick
Brick54: gqas007.sbu.lab.eng.bos.redhat.com:/bricks2/brick
Brick55: gqas013.sbu.lab.eng.bos.redhat.com:/bricks3/brick
Brick56: gqas005.sbu.lab.eng.bos.redhat.com:/bricks3/brick
Brick57: gqas006.sbu.lab.eng.bos.redhat.com:/bricks3/brick
Brick58: gqas008.sbu.lab.eng.bos.redhat.com:/bricks3/brick
Brick59: gqas003.sbu.lab.eng.bos.redhat.com:/bricks3/brick
Brick60: gqas007.sbu.lab.eng.bos.redhat.com:/bricks3/brick
Brick61: gqas013.sbu.lab.eng.bos.redhat.com:/bricks11/brick
Brick62: gqas005.sbu.lab.eng.bos.redhat.com:/bricks11/brick
Brick63: gqas006.sbu.lab.eng.bos.redhat.com:/bricks11/brick
Brick64: gqas008.sbu.lab.eng.bos.redhat.com:/bricks11/brick
Brick65: gqas003.sbu.lab.eng.bos.redhat.com:/bricks11/brick
Brick66: gqas007.sbu.lab.eng.bos.redhat.com:/bricks11/brick
Brick67: gqas013.sbu.lab.eng.bos.redhat.com:/bricks12/brick
Brick68: gqas005.sbu.lab.eng.bos.redhat.com:/bricks12/brick
Brick69: gqas006.sbu.lab.eng.bos.redhat.com:/bricks12/brick
Brick70: gqas008.sbu.lab.eng.bos.redhat.com:/bricks12/brick
Brick71: gqas003.sbu.lab.eng.bos.redhat.com:/bricks12/brick
Brick72: gqas007.sbu.lab.eng.bos.redhat.com:/bricks12/brick
Options Reconfigured:
features.quota-deem-statfs: on
features.inode-quota: on
features.quota: on
features.uss: enable
client.event-threads: 4
server.event-threads: 4
network.inode-lru-limit: 50000
performance.md-cache-timeout: 600
performance.cache-invalidation: on
performance.stat-prefetch: on
features.cache-invalidation-timeout: 600
features.cache-invalidation: on
transport.address-family: inet
nfs.disable: off
[root@gqas013 ~]#
At around the same time as the I/O error, I see reply submission and setxattr failures in the logs; not sure if it's related:

bricks8-brick.log:[2017-08-06 06:29:49.033474] E [rpcsvc.c:1333:rpcsvc_submit_generic] 0-rpc-service: failed to submit message (XID: 0xb0bfb, Program: GlusterFS 3.3, ProgVers: 330, Proc: 13) to rpc-transport (tcp.butcher-server)
bricks8-brick.log:[2017-08-06 06:29:49.033550] E [server.c:203:server_submit_reply] (-->/usr/lib64/glusterfs/3.8.4/xlator/debug/io-stats.so(+0x13808) [0x7fd3e2004808] -->/usr/lib64/glusterfs/3.8.4/xlator/protocol/server.so(+0x1f892) [0x7fd3e1baf892] -->/usr/lib64/glusterfs/3.8.4/xlator/protocol/server.so(+0x9406) [0x7fd3e1b99406] ) 0-: Reply submission failed
bricks8-brick.log:[2017-08-06 06:29:49.200438] E [rpcsvc.c:1333:rpcsvc_submit_generic] 0-rpc-service: failed to submit message (XID: 0xb2403, Program: GlusterFS 3.3, ProgVers: 330, Proc: 4) to rpc-transport (tcp.butcher-server)
bricks8-brick.log:[2017-08-06 06:29:49.200479] E [server.c:203:server_submit_reply] (-->/usr/lib64/glusterfs/3.8.4/xlator/debug/io-stats.so(+0x15c82) [0x7fd3e2006c82] -->/usr/lib64/glusterfs/3.8.4/xlator/protocol/server.so(+0x1c7e2) [0x7fd3e1bac7e2] -->/usr/lib64/glusterfs/3.8.4/xlator/protocol/server.so(+0x9406) [0x7fd3e1b99406] ) 0-: Reply submission failed
bricks8-brick.log:[2017-08-06 06:31:52.403175] E [MSGID: 113001] [posix.c:5645:_posix_handle_xattr_keyvalue_pair] 0-butcher-posix: setxattr failed on /bricks8/brick/.glusterfs/b4/2b/b42b9e62-08f4-4227-b4ef-f3f2b0162603 while doing xattrop: key=trusted.glusterfs.quota.432b1d19-a74f-46fc-8aa5-02f489acb3b1.contri.1 [No such file or directory]
[root@gqas005 bricks]#

And a steady stream of ESTALEs in nfs.log on those 3 servers:

<snip>
[2017-08-06 06:33:03.317135] E [MSGID: 112069] [nfs3.c:1692:nfs3_access_resume] 0-nfs-nfsv3: Stale file handle: (192.168.79.146:961) butcher : 6bff2f62-5a30-4dd3-81f4-d60b8e9d3daa
[2017-08-06 06:33:03.344081] E [MSGID: 112198] [nfs3-helpers.c:3691:nfs3_fh_resolve_inode_lookup_cbk] 0-nfs-nfsv3: Lookup failed: /d1/dir4: Stale file handle [Stale file handle]
[2017-08-06 06:33:03.344138] E [MSGID: 112069] [nfs3.c:1692:nfs3_access_resume] 0-nfs-nfsv3: Stale file handle: (192.168.79.146:961) butcher : 6bff2f62-5a30-4dd3-81f4-d60b8e9d3daa
[2017-08-06 06:33:03.350853] E [MSGID: 112198] [nfs3-helpers.c:3691:nfs3_fh_resolve_inode_lookup_cbk] 0-nfs-nfsv3: Lookup failed: /d1/dir4: Stale file handle [Stale file handle]
[2017-08-06 06:33:03.350900] E [MSGID: 112069] [nfs3.c:1692:nfs3_access_resume] 0-nfs-nfsv3: Stale file handle: (192.168.79.146:961) butcher : 6bff2f62-5a30-4dd3-81f4-d60b8e9d3daa
</snip>

[root@gqas008 glusterfs]# cat nfs.log | grep "Stale" | wc -l
126925
[root@gqas008 glusterfs]#
[root@gqas007 glusterfs]# cat nfs.log | grep "Stale" | wc -l
120680
[root@gqas007 glusterfs]#
From an initial look at the logs, gqas008/nfs.log:

[2017-08-06 06:30:07.378122] E [MSGID: 112198] [nfs3-helpers.c:3691:nfs3_fh_resolve_inode_lookup_cbk] 0-nfs-nfsv3: Lookup failed: /d4/dir4: Stale file handle [Stale file handle]
[2017-08-06 06:30:07.378151] E [MSGID: 112069] [nfs3.c:1692:nfs3_access_resume] 0-nfs-nfsv3: Stale file handle: (192.168.79.149:958) butcher : aaf9d85f-8b09-40ec-a6a5-272e66efc2f9
[2017-08-06 06:30:07.378167] W [MSGID: 112199] [nfs3-helpers.c:3414:nfs3_log_common_res] 0-nfs-nfsv3: /d4/dir4 => (XID: ea9433ed, ACCESS: NFS: 70(Invalid file handle), POSIX: 14(Bad address)) [Invalid argument]
[2017-08-06 06:30:07.378618] E [MSGID: 112198] [nfs3-helpers.c:3691:nfs3_fh_resolve_inode_lookup_cbk] 0-nfs-nfsv3: Lookup failed: /d4/dir4: Stale file handle [Stale file handle]
[2017-08-06 06:30:07.378659] E [MSGID: 112069] [nfs3.c:1692:nfs3_access_resume] 0-nfs-nfsv3: Stale file handle: (192.168.79.149:958) butcher : aaf9d85f-8b09-40ec-a6a5-272e66efc2f9
[2017-08-06 06:30:07.378684] W [MSGID: 112199] [nfs3-helpers.c:3414:nfs3_log_common_res] 0-nfs-nfsv3: /d4/dir4 => (XID: eb9433ed, ACCESS: NFS: 70(Invalid file handle), POSIX: 14(Bad address)) [Invalid argument]
[2017-08-06 06:30:07.379205] E [MSGID: 112198] [nfs3-helpers.c:3691:nfs3_fh_resolve_inode_lookup_cbk] 0-nfs-nfsv3: Lookup failed: /d4/dir4: Stale file handle [Stale file handle]
[2017-08-06 06:30:07.379252] E [MSGID: 112069] [nfs3.c:1692:nfs3_access_resume] 0-nfs-nfsv3: Stale file handle: (192.168.79.149:958) butcher : aaf9d85f-8b09-40ec-a6a5-272e66efc2f9
[2017-08-06 06:30:07.379271] W [MSGID: 112199] [nfs3-helpers.c:3414:nfs3_log_common_res] 0-nfs-nfsv3: /d4/dir4 => (XID: ec9433ed, ACCESS: NFS: 70(Invalid file handle), POSIX: 14(Bad address)) [Invalid argument]
[2017-08-06 06:30:07.379764] E [MSGID: 112198] [nfs3-helpers.c:3691:nfs3_fh_resolve_inode_lookup_cbk] 0-nfs-nfsv3: Lookup failed: /d4/dir4: Stale file handle [Stale file handle]
[2017-08-06 06:30:07.379801] E [MSGID: 112069] [nfs3.c:1692:nfs3_access_resume] 0-nfs-nfsv3: Stale file handle: (192.168.79.149:958) butcher : aaf9d85f-8b09-40ec-a6a5-272e66efc2f9
[2017-08-06 06:30:07.379820] W [MSGID: 112199] [nfs3-helpers.c:3414:nfs3_log_common_res] 0-nfs-nfsv3: /d4/dir4 => (XID: ed9433ed, ACCESS: NFS: 70(Invalid file handle), POSIX: 14(Bad address)) [Invalid argument]
[2017-08-06 06:30:07.380342] E [MSGID: 112198] [nfs3-helpers.c:3691:nfs3_fh_resolve_inode_lookup_cbk] 0-nfs-nfsv3: Lookup failed: /d4/dir4: Stale file handle [Stale file handle]
[2017-08-06 06:30:07.380378] E [MSGID: 112069] [nfs3.c:1692:nfs3_access_resume] 0-nfs-nfsv3: Stale file handle: (192.168.79.149:958) butcher : aaf9d85f-8b09-40ec-a6a5-272e66efc2f9
[2017-08-06 06:30:07.380396] W [MSGID: 112199] [nfs3-helpers.c:3414:nfs3_log_common_res] 0-nfs-nfsv3: /d4/dir4 => (XID: ee9433ed, ACCESS: NFS: 70(Invalid file handle), POSIX: 14(Bad address)) [Invalid argument]
[2017-08-06 06:30:07.380892] E [MSGID: 112198] [nfs3-helpers.c:3691:nfs3_fh_resolve_inode_lookup_cbk] 0-nfs-nfsv3: Lookup failed: /d4/dir4: Stale file handle [Stale file handle]
[2017-08-06 06:30:07.380928] E [MSGID: 112069] [nfs3.c:1692:nfs3_access_resume] 0-nfs-nfsv3: Stale file handle: (192.168.79.149:958) butcher : aaf9d85f-8b09-40ec-a6a5-272e66efc2f9
[2017-08-06 06:30:07.380954] W [MSGID: 112199] [nfs3-helpers.c:3414:nfs3_log_common_res] 0-nfs-nfsv3: /d4/dir4 => (XID: ef9433ed, ACCESS: NFS: 70(Invalid file handle), POSIX: 14(Bad address)) [Invalid argument]
[2017-08-06 06:30:07.381463] E [MSGID: 112198] [nfs3-helpers.c:3691:nfs3_fh_resolve_inode_lookup_cbk] 0-nfs-nfsv3: Lookup failed: /d4/dir4: Stale file handle [Stale file handle]
[2017-08-06 06:30:07.381499] E [MSGID: 112069] [nfs3.c:1692:nfs3_access_resume] 0-nfs-nfsv3: Stale file handle: (192.168.79.149:958) butcher : aaf9d85f-8b09-40ec-a6a5-272e66efc2f9
[2017-08-06 06:30:07.381516] W [MSGID: 112199] [nfs3-helpers.c:3414:nfs3_log_common_res] 0-nfs-nfsv3: /d4/dir4 => (XID: f09433ed, ACCESS: NFS: 70(Invalid file handle), POSIX: 14(Bad address)) [Invalid argument]
[2017-08-06 06:30:07.381953] E [MSGID: 112198] [nfs3-helpers.c:3691:nfs3_fh_resolve_inode_lookup_cbk] 0-nfs-nfsv3: Lookup failed: /d4/dir4: Stale file handle [Stale file handle]
[2017-08-06 06:30:07.381983] E [MSGID: 112069] [nfs3.c:1692:nfs3_access_resume] 0-nfs-nfsv3: Stale file handle: (192.168.79.149:958) butcher : aaf9d85f-8b09-40ec-a6a5-272e66efc2f9

There are LOOKUP failures on the same directory, /d4/dir4. However, IIRC, whenever the brick server sends an error, the client xlator also logs it:

int
client3_3_lookup_cbk (struct rpc_req *req, struct iovec *iov, int count,
                      void *myframe)
{
        clnt_local_t    *local      = NULL;
        call_frame_t    *frame      = NULL;
        int              ret        = 0;
        gfs3_lookup_rsp  rsp        = {0,};
        struct iatt      stbuf      = {0,};
        struct iatt      postparent = {0,};
        ...
        ...
out:
        /* Restore the correct op_errno to rsp.op_errno */
        rsp.op_errno = op_errno;
        if (rsp.op_ret == -1) {
                /* any error other than ENOENT */
                if (!(local->loc.name && rsp.op_errno == ENOENT) &&
                    !(rsp.op_errno == ESTALE))
                        gf_msg (this->name, GF_LOG_WARNING, rsp.op_errno,
                                PC_MSG_REMOTE_OP_FAILED,
                                "remote operation failed. Path: %s (%s)",
                                local->loc.path,
                                loc_gfid_utoa (&local->loc));
                else
                        gf_msg_trace (this->name, 0,
                                      "not found on remote node");
        }

But since there are no errors logged from client3_3_lookup_cbk(), this error may have been generated by the lower xlators.

@Ambarish, will it be possible to collect a pkt trace during the interval this error is seen on the server?

@Nithya/Susant, since this error is seen only while performing the rebalance operation, do you suspect any code-path/issue which could result in ESTALE for lookup?

Thanks!
One possibility is that the file in question has been migrated. However, from the messages it looks like the ESTALE errors were returned for directories.

In the code snippet above:

        if (!(local->loc.name && rsp.op_errno == ENOENT) &&
            !(rsp.op_errno == ESTALE))

it looks like ESTALE errors are also not logged. A better way might be to check the brick logs to see if there are any ESTALE errors logged there.
(In reply to Nithya Balachandran from comment #6)
> One possibility is if the file in question has been migrated. However from
> the messages it looks like the ESTALE errors were returned for directories.
> 
> In the code snippet above:
>         if (!(local->loc.name && rsp.op_errno == ENOENT) &&
>             !(rsp.op_errno == ESTALE))
> 
> It looks like ESTALE errs are also not logged. A better way might be to
> check the brick logs to see if there are any ESTALE errors logged there.

Oops... Thanks, Nithya. I do not see I/O errors in the brick logs.

@Ambarish, as noted in the above comment, please collect a packet trace on the gNFS server nodes for a short interval (<1 min) while the errors are observed on the mount point. Also, if possible, please attach gdb to the gNFS process and collect a backtrace:

    gdb -p <pid_gnfs>
    b nfs3_fh_resolve_inode_lookup_cbk
    c

When the breakpoint is hit, please collect the bt and paste the same. Thanks!
(In reply to Soumya Koduri from comment #7)
> (In reply to Nithya Balachandran from comment #6)
> > One possibility is if the file in question has been migrated. However from
> > the messages it looks like the ESTALE errors were returned for directories.
> > 
> > In the code snippet above:
> >         if (!(local->loc.name && rsp.op_errno == ENOENT) &&
> >             !(rsp.op_errno == ESTALE))
> > 
> > It looks like ESTALE errs are also not logged. A better way might be to
> > check the brick logs to see if there are any ESTALE errors logged there.
> 
> Oops..Thanks Nithya. I do not see I/O errors in the brick logs.

The bricks, too, log ESTALE/ENOENT errors only at debug level. Was the log level set appropriately to capture these errors? These look like errors from storage/posix; there are some errors logged by storage/posix.
Many thanks to Ambarish, Rafi and Susant.

The update we have from today's debugging on Ambarish's setup: we observed ESTALE errors being sent for nameless (gfid-based) LOOKUPs by the snapview server. But there are no snapshots created, nor is the client doing any operations on the .snaps folder. So, in theory, these LOOKUPs should first have been performed on the regular graph, which is "write-behind/dht/ec/protocol_client", and only if dht returns ENOENT should the fop be sent to the snapview server.

We also observed that the gfid on which the lookup was sent to the snapview server is that of a directory and is present at the backend. So, going by theory, the LOOKUP may have failed on the regular graph path, and maybe on the bricks too. Unfortunately, there are no logs in nfs.log or on the bricks to confirm that. So we have requested Ambarish to turn the log level up to DEBUG just before triggering add-brick, which may help us in further debugging.
Thanks Ambarish.

So this issue arises only when uss is enabled. Request Susant to take a look at the debug logs (provided in comment#14).

Meanwhile I was re-looking at the pkt trace provided in comment#8 ->
http://rhsqe-repo.lab.eng.blr.redhat.com/sosreports/1478716/Aug08/gqas006/

The pkt trace is a bit large, but I think all the LOOKUP calls on "dir2" had op_ret=0 (success) in the responses from the regular bricks but failed with ESTALE when sent from the snapd daemon (which was using port 49162).

Below are a few filters which I used to analyze the pkt trace:

    (tcp.dstport != 49162) && (glusterfs.bname == "dir2")
    (glusterfs) && (glusterfs.proc == 27)
    ((glusterfs) && (glusterfs.proc == 27)) && (gluster.op_errno != 0)

Since I couldn't find any LOOKUP failures (for 'dir2') sent from the brick servers, I suspect the issue is in gf_svc_lookup: there could be cases in which it forwarded the LOOKUP to the virtual namespace instead of first trying the regular graph.

Request Rafi to comment.

From code inspection so far, maybe the checks in place are not sufficient, like:

static int32_t
gf_svc_lookup (call_frame_t *frame, xlator_t *this, loc_t *loc, dict_t *xdata)
{
        ...
        ...
        ret = svc_inode_ctx_get (this, loc->inode, &inode_type);
        if (!__is_root_gfid (loc->gfid)) {
                if (loc->parent) {
                        parent = inode_ref (loc->parent);
                        ret = svc_inode_ctx_get (this, loc->parent,
                                                 &parent_type);
                } else {
                        parent = inode_parent (loc->inode, loc->pargfid,
                                               NULL);
                        if (parent)
                                ret = svc_inode_ctx_get (this, parent,
                                                         &parent_type);

>>> We do not seem to be checking the 'ret' value here. What if parent_type
>>> is not yet set in the inode ctx?

                }
        }
        ...
        ...
        if (!loc->name) {
                if (gf_uuid_is_null (loc->inode->gfid)) {
                        subvolume = FIRST_CHILD (this);
                        local->subvolume = subvolume;
                        wind = _gf_true;
                        goto out;
                } else {
                        if (inode_type >= 0)
                                subvolume = svc_get_subvolume (this,
                                                               inode_type);
                        else
                                subvolume = FIRST_CHILD (this);
                        local->subvolume = subvolume;
                        wind = _gf_true;
                        goto out;
                }
        }

        if (strcmp (loc->name, priv->path)) {
                if (parent_type == NORMAL_INODE) {
                        subvolume = FIRST_CHILD (this);
                        local->subvolume = subvolume;
                } else {

>>> Do we need to check if parent_type is actually set to VIRTUAL_INODE?

                        subvolume = SECOND_CHILD (this);
                        local->subvolume = subvolume;
                }
        } else {
                subvolume = SECOND_CHILD (this);
                local->subvolume = subvolume;
                if (parent_type == NORMAL_INODE) {
                        /* Indication of whether the lookup is happening on
                           the entry point or not, to the snapview-server. */
                        SVC_ENTRY_POINT_SET (this, xdata, op_ret, op_errno,
                                             new_xdata, priv, ret, out);
                }
        }
From the nfs.log provided in comment#14 -
http://rhsqe-repo.lab.eng.blr.redhat.com/sosreports/1478716/Aug09_Debug/

Around the time the nfs xlator logged ESTALE errors:

[2017-08-09 14:32:22.109572] D [MSGID: 0] [client-rpc-fops.c:2948:client3_3_lookup_cbk] 0-stack-trace: stack-address: 0x7f88d007dd00, butcher-snapd-client returned -1 error: Stale file handle [Stale file handle]
[2017-08-09 14:32:22.109605] D [snapview-client.c:289:gf_svc_lookup_cbk] 0-butcher-snapview-client: Lookup failed on snapview graph with error Stale file handle

>>>> If the lookup had been sent to the regular graph first, we should see the message "Lookup failed on normal graph with error ..." recorded as well.
>>>> But that doesn't seem to be the case.

[2017-08-09 14:32:22.109634] D [MSGID: 0] [snapview-client.c:329:gf_svc_lookup_cbk] 0-stack-trace: stack-address: 0x7f88d007dd00, butcher-snapview-client returned -1 error: Stale file handle [Stale file handle]
[2017-08-09 14:32:22.109664] D [MSGID: 0] [io-stats.c:2191:io_stats_lookup_cbk] 0-stack-trace: stack-address: 0x7f88d007dd00, butcher returned -1 error: Stale file handle [Stale file handle]
[2017-08-09 14:32:22.109685] E [MSGID: 112198] [nfs3-helpers.c:3691:nfs3_fh_resolve_inode_lookup_cbk] 0-nfs-nfsv3: Lookup failed: /d4/dir1: Stale file handle [Stale file handle]
[2017-08-09 14:32:22.109719] E [MSGID: 112069] [nfs3.c:1692:nfs3_access_resume] 0-nfs-nfsv3: Stale file handle: (192.168.79.149:992) butcher : 8694b0d6-8fba-40e3-83b9-c45d0f1ca515
[2017-08-09 14:32:22.109747] W [MSGID: 112199] [nfs3-helpers.c:3414:nfs3_log_common_res] 0-nfs-nfsv3: /d4/dir1 => (XID: 6be86, ACCESS: NFS: 70(Invalid file handle), POSIX: 14(Bad address))
[2017-08-09 14:32:22.110010] D [MSGID: 0] [nfs3-helpers.c:1646:nfs3_log_common_call] 0-nfs-nfsv3: XID: 106be86, ACCESS: args: FH: exportid 1ff111d1-32be-4590-9b9b-b1b756e29ba9, gfid 8694b0d6-8fba-40e3-83b9-c45d0f1ca515, mountid e4706d5b-0000-0000-0000-000000000000
Created attachment 1311641 [details]
nfs_final_graph
@Susant/Rafi, attached is a snippet of the debug logs (nfs.log) from the time the final graph was loaded (add-brick), for your reference.
(In reply to Soumya Koduri from comment #16)
> Thanks Ambarish.
> 
> So this issue arises only when uss is enabled. Request Susant to take a look
> at the debug logs (provided in comment#14).
> 
> Meanwhile I was re-looking at the pkt trace provided in the comment#8 ->
> http://rhsqe-repo.lab.eng.blr.redhat.com/sosreports/1478716/Aug08/gqas006/
> 
> The pkt trace is a bit large but I think all the LOOKUP calls on "dir2" had
> op_ret=0 (success) in the response from regular bricks but failed with
> ESTALE when sent from snapd daemon (which was using port: 49162).
> 
> Below are few filters which I used to analyze the pkt trace
>     (tcp.dstport != 49162) && (glusterfs.bname == "dir2")
>     (glusterfs) && (glusterfs.proc == 27)
>     ((glusterfs) && (glusterfs.proc == 27)) && (gluster.op_errno != 0)
> 
> Since I couldn't find any LOOKUP failures (for 'dir2') sent from the brick
> servers, I am suspecting if the issue is in gf_svc_lookup if there could be
> cases in which it could have forwarded the LOOKUP to virtual namespace
> instead of first trying on regular graph.
> 
> Request Rafi to comment.
> 
> From code inspection so far, maybe the checks in place are not sufficient,
> like
> 
> static int32_t
> gf_svc_lookup (call_frame_t *frame, xlator_t *this, loc_t *loc,
>                dict_t *xdata)
> {
>         ...
>         ...
>         ret = svc_inode_ctx_get (this, loc->inode, &inode_type);
>         if (!__is_root_gfid (loc->gfid)) {
>                 if (loc->parent) {
>                         parent = inode_ref (loc->parent);
>                         ret = svc_inode_ctx_get (this, loc->parent,
>                                                  &parent_type);
>                 } else {
>                         parent = inode_parent (loc->inode, loc->pargfid,
>                                                NULL);
>                         if (parent)
>                                 ret = svc_inode_ctx_get (this, parent,
>                                                          &parent_type);
> 
> >>> We do not seem to be checking for 'ret' value here. What if parent_type is not yet set in inode_ctx?

What could have happened here is that we might have assumed that a parent would have been resolved before doing a lookup on the inode.
I understand that nfs-ganesha can link an inode without having a parent; is the same the case with gNFS also? Still, with a fresh lookup this might fail.

>                 }
>         }
> 
>         ...
>         ...
>         if (!loc->name) {
>                 if (gf_uuid_is_null (loc->inode->gfid)) {
>                         subvolume = FIRST_CHILD (this);
>                         local->subvolume = subvolume;
>                         wind = _gf_true;
>                         goto out;
>                 } else {
>                         if (inode_type >= 0)
>                                 subvolume = svc_get_subvolume (this,
>                                                                inode_type);
>                         else
>                                 subvolume = FIRST_CHILD (this);
>                         local->subvolume = subvolume;
>                         wind = _gf_true;
>                         goto out;
>                 }
>         }
> 
>         if (strcmp (loc->name, priv->path)) {
>                 if (parent_type == NORMAL_INODE) {
>                         subvolume = FIRST_CHILD (this);
>                         local->subvolume = subvolume;
>                 } else {
> 
> >>> Do we need to check if parent_type is actually set to VIRTUAL_INODE?
> 
>                         subvolume = SECOND_CHILD (this);
>                         local->subvolume = subvolume;
>                 }
>         } else {
>                 subvolume = SECOND_CHILD (this);
>                 local->subvolume = subvolume;
>                 if (parent_type == NORMAL_INODE) {
>                         /* Indication of whether the lookup is happening on
>                            the entry point or not, to the snapview-server. */
>                         SVC_ENTRY_POINT_SET (this, xdata, op_ret, op_errno,
>                                              new_xdata, priv, ret, out);
>                 }
>         }

Your suggestion is absolutely valid, considering the above-mentioned point. But I was thinking that gNFS won't have loc->name filled. I'm not sure to what extent that statement is true, because during the resolve, or in any other lookup from the nfs xlator, loc->name may be filled.
> > From code inspection so far, maybe the checks in place are not sufficient,
> > like
> > 
> > static int32_t
> > gf_svc_lookup (call_frame_t *frame, xlator_t *this, loc_t *loc,
> >                dict_t *xdata)
> > {
> >         ...
> >         ...
> >         ret = svc_inode_ctx_get (this, loc->inode, &inode_type);
> >         if (!__is_root_gfid (loc->gfid)) {
> >                 if (loc->parent) {
> >                         parent = inode_ref (loc->parent);
> >                         ret = svc_inode_ctx_get (this, loc->parent,
> >                                                  &parent_type);
> >                 } else {
> >                         parent = inode_parent (loc->inode, loc->pargfid,
> >                                                NULL);
> >                         if (parent)
> >                                 ret = svc_inode_ctx_get (this, parent,
> >                                                          &parent_type);
> > 
> > >>> We do not seem to be checking for 'ret' value here. What if parent_type is not yet set in inode_ctx?
> 
> What could have happened here is that we might have assumed that a parent
> might have resolved before doing lookup on the inode. I understand that
> nfs-ganesh can link an inode with out having a parent, is this same case
> with gNFS also ?

Yes, I think for nameless lookups, as in ganesha, loc->parent may not be set or linked by gNFS either. Again, I am not sure. What if the inode is linked by the underlying layers and svc hasn't set the ctx value?

> >         if (!loc->name) {
> >                 if (gf_uuid_is_null (loc->inode->gfid)) {
> >                         subvolume = FIRST_CHILD (this);
> >                         local->subvolume = subvolume;
> >                         wind = _gf_true;
> >                         goto out;
> >                 } else {
> >                         if (inode_type >= 0)
> >                                 subvolume = svc_get_subvolume (this,
> >                                                                inode_type);
> >                         else
> >                                 subvolume = FIRST_CHILD (this);
> >                         local->subvolume = subvolume;
> >                         wind = _gf_true;
> >                         goto out;
> >                 }

Here the decision seems to be taken based on inode_type, which is again read from the inode ctx, and which is not valid if loc->inode is not a linked inode.

> >         }
> > 
> >         if (strcmp (loc->name, priv->path)) {
> >                 if (parent_type == NORMAL_INODE) {
> >                         subvolume = FIRST_CHILD (this);
> >                         local->subvolume = subvolume;
> >                 } else {
> 
> >>> Do we need to check if parent_type is actually set to VIRTUAL_INODE?
> 
> >                         subvolume = SECOND_CHILD (this);
> >                         local->subvolume = subvolume;
> >                 }
> >         } else {
> >                 subvolume = SECOND_CHILD (this);
> >                 local->subvolume = subvolume;
> >                 if (parent_type == NORMAL_INODE) {
> >                         /* Indication of whether the lookup is happening on
> >                            the entry point or not, to the snapview-server. */
> >                         SVC_ENTRY_POINT_SET (this, xdata, op_ret, op_errno,
> >                                              new_xdata, priv, ret, out);
> >                 }
> >         }
> 
> Your suggestion is absolutely valid, considering the above mentioned point.
> But I was thinking that gNFS won't have loc->name filled. I'm not sure how
> much extend this statement is true, because during the resolve or from any
> other lookup from nfs xlator may have loc->name filled.
upstream patch : https://review.gluster.org/18028
Thanks to Soumya for the RCA. When restarting an NFS process, the parent inode was not linked into the inode table while resolving an entry. Resolving the entry therefore results in sending a lookup without its parent, but with the path and name filled in by the inode_path call. There was one such combination that svc didn't handle.
3.12 release branch patch id : https://review.gluster.org/#/c/18042
Cannot reproduce the reported problem on 3.8.4-41 on multiple retries. Moving to Verified.
Since the problem described in this bug report should be resolved in a recent advisory, it has been closed with a resolution of ERRATA. For information on the advisory, and where to find the updated files, follow the link below. If the solution does not work for you, open a new bug report. https://access.redhat.com/errata/RHBA-2017:2774