1480591 – [Scale] : I/O errors on multiple gNFS mounts with "Stale file handle" during rebalance of an erasure coded volume.

Bug 1480591 - [Scale] : I/O errors on multiple gNFS mounts with "Stale file handle" during rebalance of an erasure coded volume.

Summary: [Scale] : I/O errors on multiple gNFS mounts with "Stale file handle" during...

Keywords:
Status:	CLOSED CURRENTRELEASE
Alias:	None
Product:	GlusterFS
Classification:	Community
Component:	snapshot
Sub Component:
Version:	mainline
Hardware:	x86_64
OS:	Linux
Priority:	unspecified
Severity:	high
Target Milestone:	---
Assignee:	Mohammed Rafi KC
QA Contact:
Docs Contact:
URL:
Whiteboard:
Depends On:
Blocks:	1478716 1481931
TreeView+	depends on / blocked

Reported:	2017-08-11 12:42 UTC by Mohammed Rafi KC
Modified:	2017-12-08 17:38 UTC (History)
CC List:	18 users (show)
Fixed In Version:	glusterfs-3.13.0
Clone Of:	1478716
Clones:	1481931 (view as bug list)
Environment:
Last Closed:	2017-12-08 17:38:25 UTC
Regression:	---
Mount Type:	---
Documentation:	---
CRM:
Verified Versions:
Embargoed:
Dependent Products:

Attachments	(Terms of Use)

Comment 1 Mohammed Rafi KC 2017-08-11 12:47:26 UTC

Expanded my volume from 10*(4+2) to 12*(4+2) with linux untars and lookups from  6 gNFS mounts.

Triggered rebalance.


I/O errored out on 3/3 clients with ESTALE :


On gqac007(Mounted via gqas008.sbu.lab.eng.bos.redhat.com) :

<snip>

tar: linux-4.13-rc3/arch/arm64/boot/dts: Cannot stat: Stale file handle
tar: linux-4.13-rc3/arch/arm64/boot: Cannot stat: Stale file handle
tar: linux-4.13-rc3/arch/arm64: Cannot stat: Stale file handle
tar: linux-4.13-rc3/arch: Cannot stat: Stale file handle
tar: linux-4.13-rc3: Cannot stat: Stale file handle
tar: Exiting with failure status due to previous errors
</snip>



On gqac024(mounted via gqas013.sbu.lab.eng.bos.redhat.com) :

<snip>

tar: linux-4.13-rc3/arch/arm64/boot: Cannot stat: Stale file handle
tar: linux-4.13-rc3/arch/arm64: Cannot stat: Stale file handle
tar: linux-4.13-rc3/arch: Cannot stat: Stale file handle
tar: linux-4.13-rc3: Cannot stat: Stale file handle
tar: Exiting with failure status due to previous errors

<snip>

On gqac008 (Mounted via gqas007.sbu.lab.eng.bos.redhat.com) :

<snip>

tar: linux-4.13-rc3/arch/arm64/boot: Cannot stat: Stale file handle
tar: linux-4.13-rc3/arch/arm64: Cannot stat: Stale file handle
tar: linux-4.13-rc3/arch: Cannot stat: Stale file handle
tar: linux-4.13-rc3: Cannot stat: Stale file handle
tar: Exiting with failure status due to previous errors
</snip>

Version-Release number of selected component (if applicable):
-------------------------------------------------------------

3.8.4-38

How reproducible:
-----------------

1/1


Actual results:
---------------

Application got affected.IO errored out with ESTALE.


Expected results:
-----------------

Successful IO completion.

Additional info:

Volume Name: butcher
Type: Distributed-Disperse
Volume ID: dbd4e36f-28de-445e-a5cd-c78d0a467fb6
Status: Started
Snapshot Count: 0
Number of Bricks: 12 x (4 + 2) = 72
Transport-type: tcp
Bricks:
Brick1: gqas013.sbu.lab.eng.bos.redhat.com:/bricks1/brick
Brick2: gqas005.sbu.lab.eng.bos.redhat.com:/bricks1/brick
Brick3: gqas006.sbu.lab.eng.bos.redhat.com:/bricks1/brick
Brick4: gqas008.sbu.lab.eng.bos.redhat.com:/bricks1/brick
Brick5: gqas003.sbu.lab.eng.bos.redhat.com:/bricks1/brick
Brick6: gqas007.sbu.lab.eng.bos.redhat.com:/bricks1/brick
Brick7: gqas013.sbu.lab.eng.bos.redhat.com:/bricks4/brick
Brick8: gqas005.sbu.lab.eng.bos.redhat.com:/bricks4/brick
Brick9: gqas006.sbu.lab.eng.bos.redhat.com:/bricks4/brick
Brick10: gqas008.sbu.lab.eng.bos.redhat.com:/bricks4/brick
Brick11: gqas003.sbu.lab.eng.bos.redhat.com:/bricks4/brick
Brick12: gqas007.sbu.lab.eng.bos.redhat.com:/bricks4/brick
Brick13: gqas013.sbu.lab.eng.bos.redhat.com:/bricks5/brick
Brick14: gqas005.sbu.lab.eng.bos.redhat.com:/bricks5/brick
Brick15: gqas006.sbu.lab.eng.bos.redhat.com:/bricks5/brick
Brick16: gqas008.sbu.lab.eng.bos.redhat.com:/bricks5/brick
Brick17: gqas003.sbu.lab.eng.bos.redhat.com:/bricks5/brick
Brick18: gqas007.sbu.lab.eng.bos.redhat.com:/bricks5/brick
Brick19: gqas013.sbu.lab.eng.bos.redhat.com:/bricks6/brick
Brick20: gqas005.sbu.lab.eng.bos.redhat.com:/bricks6/brick
Brick21: gqas006.sbu.lab.eng.bos.redhat.com:/bricks6/brick
Brick22: gqas008.sbu.lab.eng.bos.redhat.com:/bricks6/brick
Brick23: gqas003.sbu.lab.eng.bos.redhat.com:/bricks6/brick
Brick24: gqas007.sbu.lab.eng.bos.redhat.com:/bricks6/brick
Brick25: gqas013.sbu.lab.eng.bos.redhat.com:/bricks7/brick
Brick26: gqas005.sbu.lab.eng.bos.redhat.com:/bricks7/brick
Brick27: gqas006.sbu.lab.eng.bos.redhat.com:/bricks7/brick
Brick28: gqas008.sbu.lab.eng.bos.redhat.com:/bricks7/brick
Brick29: gqas003.sbu.lab.eng.bos.redhat.com:/bricks7/brick
Brick30: gqas007.sbu.lab.eng.bos.redhat.com:/bricks7/brick
Brick31: gqas013.sbu.lab.eng.bos.redhat.com:/bricks8/brick
Brick32: gqas005.sbu.lab.eng.bos.redhat.com:/bricks8/brick
Brick33: gqas006.sbu.lab.eng.bos.redhat.com:/bricks8/brick
Brick34: gqas008.sbu.lab.eng.bos.redhat.com:/bricks8/brick
Brick35: gqas003.sbu.lab.eng.bos.redhat.com:/bricks8/brick
Brick36: gqas007.sbu.lab.eng.bos.redhat.com:/bricks8/brick
Brick37: gqas013.sbu.lab.eng.bos.redhat.com:/bricks9/brick
Brick38: gqas005.sbu.lab.eng.bos.redhat.com:/bricks9/brick
Brick39: gqas006.sbu.lab.eng.bos.redhat.com:/bricks9/brick
Brick40: gqas008.sbu.lab.eng.bos.redhat.com:/bricks9/brick
Brick41: gqas003.sbu.lab.eng.bos.redhat.com:/bricks9/brick
Brick42: gqas007.sbu.lab.eng.bos.redhat.com:/bricks9/brick
Brick43: gqas013.sbu.lab.eng.bos.redhat.com:/bricks10/brick
Brick44: gqas005.sbu.lab.eng.bos.redhat.com:/bricks10/brick
Brick45: gqas006.sbu.lab.eng.bos.redhat.com:/bricks10/brick
Brick46: gqas008.sbu.lab.eng.bos.redhat.com:/bricks10/brick
Brick47: gqas003.sbu.lab.eng.bos.redhat.com:/bricks10/brick
Brick48: gqas007.sbu.lab.eng.bos.redhat.com:/bricks10/brick
Brick49: gqas013.sbu.lab.eng.bos.redhat.com:/bricks2/brick
Brick50: gqas005.sbu.lab.eng.bos.redhat.com:/bricks2/brick
Brick51: gqas006.sbu.lab.eng.bos.redhat.com:/bricks2/brick
Brick52: gqas008.sbu.lab.eng.bos.redhat.com:/bricks2/brick
Brick53: gqas003.sbu.lab.eng.bos.redhat.com:/bricks2/brick
Brick54: gqas007.sbu.lab.eng.bos.redhat.com:/bricks2/brick
Brick55: gqas013.sbu.lab.eng.bos.redhat.com:/bricks3/brick
Brick56: gqas005.sbu.lab.eng.bos.redhat.com:/bricks3/brick
Brick57: gqas006.sbu.lab.eng.bos.redhat.com:/bricks3/brick
Brick58: gqas008.sbu.lab.eng.bos.redhat.com:/bricks3/brick
Brick59: gqas003.sbu.lab.eng.bos.redhat.com:/bricks3/brick
Brick60: gqas007.sbu.lab.eng.bos.redhat.com:/bricks3/brick
Brick61: gqas013.sbu.lab.eng.bos.redhat.com:/bricks11/brick
Brick62: gqas005.sbu.lab.eng.bos.redhat.com:/bricks11/brick
Brick63: gqas006.sbu.lab.eng.bos.redhat.com:/bricks11/brick
Brick64: gqas008.sbu.lab.eng.bos.redhat.com:/bricks11/brick
Brick65: gqas003.sbu.lab.eng.bos.redhat.com:/bricks11/brick
Brick66: gqas007.sbu.lab.eng.bos.redhat.com:/bricks11/brick
Brick67: gqas013.sbu.lab.eng.bos.redhat.com:/bricks12/brick
Brick68: gqas005.sbu.lab.eng.bos.redhat.com:/bricks12/brick
Brick69: gqas006.sbu.lab.eng.bos.redhat.com:/bricks12/brick
Brick70: gqas008.sbu.lab.eng.bos.redhat.com:/bricks12/brick
Brick71: gqas003.sbu.lab.eng.bos.redhat.com:/bricks12/brick
Brick72: gqas007.sbu.lab.eng.bos.redhat.com:/bricks12/brick
Options Reconfigured:
features.quota-deem-statfs: on
features.inode-quota: on
features.quota: on
features.uss: enable
client.event-threads: 4
server.event-threads: 4
network.inode-lru-limit: 50000
performance.md-cache-timeout: 600
performance.cache-invalidation: on
performance.stat-prefetch: on
features.cache-invalidation-timeout: 600
features.cache-invalidation: on
transport.address-family: inet
nfs.disable: off


Below are few filters which I used to analyze the pkt trace
(tcp.dstport != 49162) && (glusterfs.bname == "dir2")
(glusterfs) && (glusterfs.proc == 27) 
((glusterfs) && (glusterfs.proc == 27)) && (gluster.op_errno != 0)

Since I couldn't find any LOOKUP failures (for 'dir2') sent from the brick servers, I am suspecting if the issue is in gf_svc_lookup if there could be cases in which it could have forwarded the LOOKUP to virtual namespace instead of first trying on regular graph.

Request Rafi to comment. 

From code inspection so far, maybe the checks in place are not sufficient, like

static int32_t
gf_svc_lookup (call_frame_t *frame, xlator_t *this, loc_t *loc, dict_t *xdata)
{
       
...
...
 ret = svc_inode_ctx_get (this, loc->inode, &inode_type);
        if (!__is_root_gfid (loc->gfid)) {
                if (loc->parent) {
                        parent = inode_ref (loc->parent);
                        ret = svc_inode_ctx_get (this, loc->parent,
                                                 &parent_type);
                } else {
                        parent = inode_parent (loc->inode, loc->pargfid, NULL);
                        if (parent)
                                ret = svc_inode_ctx_get (this, parent,
                                                         &parent_type);'

>>> We do not seem to be checking for 'ret' value here. What if parent_type is not yet set in inode_ctx?
                }
        }

...
...
        if (!loc->name) {
                if (gf_uuid_is_null (loc->inode->gfid)) {
                        subvolume = FIRST_CHILD (this);
                        local->subvolume = subvolume;
                        wind = _gf_true;
                        goto out;
                } else {
                        if (inode_type >= 0)
                                subvolume = svc_get_subvolume (this,
                                                               inode_type);
                        else
                                subvolume = FIRST_CHILD (this);
                        local->subvolume = subvolume;
                        wind = _gf_true;
                        goto out;
                }
        }

        if (strcmp (loc->name, priv->path)) {
                if (parent_type == NORMAL_INODE) {
                        subvolume = FIRST_CHILD (this);
                        local->subvolume = subvolume;
                } else {
>>> Do we need to check if parent_type is actually set to VIRTUAL_INODE?

                        subvolume = SECOND_CHILD (this);
                        local->subvolume = subvolume;
                }
        } else {
                subvolume = SECOND_CHILD (this);
                local->subvolume = subvolume;
                if (parent_type == NORMAL_INODE) {
                        /* Indication of whether the lookup is happening on the
                           entry point or not, to the snapview-server.
                        */
                        SVC_ENTRY_POINT_SET (this, xdata, op_ret, op_errno,
                                             new_xdata, priv, ret, out);
                }
        }

--- Additional comment from Soumya Koduri on 2017-08-10 05:03:37 EDT ---

From nfs.log provided in comment#14 -
http://rhsqe-repo.lab.eng.blr.redhat.com/sosreports/1478716/Aug09_Debug/


Around the time when nfs xlator logged ESTALE errors- 

[2017-08-09 14:32:22.109572] D [MSGID: 0] [client-rpc-fops.c:2948:client3_3_lookup_cbk] 0-stack-trace: stack-address: 0x7f88d007dd00, butcher-snapd-client returned -1 error: Stale file handle [Stale file handle]
[2017-08-09 14:32:22.109605] D [snapview-client.c:289:gf_svc_lookup_cbk] 0-butcher-snapview-client: Lookup failed on snapview graph with error Stale file handle

>>>> If the lookup was sent to regular graph first, we should this error message recorded as well -"Lookup failed on normal graph with error... ".

>>>> But that doesn't seem to be the case


[2017-08-09 14:32:22.109634] D [MSGID: 0] [snapview-client.c:329:gf_svc_lookup_cbk] 0-stack-trace: stack-address: 0x7f88d007dd00, butcher-snapview-client returned -1 error: Stale file handle [Stale file handle]
[2017-08-09 14:32:22.109664] D [MSGID: 0] [io-stats.c:2191:io_stats_lookup_cbk] 0-stack-trace: stack-address: 0x7f88d007dd00, butcher returned -1 error: Stale file handle [Stale file handle]
[2017-08-09 14:32:22.109685] E [MSGID: 112198] [nfs3-helpers.c:3691:nfs3_fh_resolve_inode_lookup_cbk] 0-nfs-nfsv3: Lookup failed: /d4/dir1: Stale file handle [Stale file handle]
[2017-08-09 14:32:22.109719] E [MSGID: 112069] [nfs3.c:1692:nfs3_access_resume] 0-nfs-nfsv3: Stale file handle: (192.168.79.149:992) butcher : 8694b0d6-8fba-40e3-83b9-c45d0f1ca515
[2017-08-09 14:32:22.109747] W [MSGID: 112199] [nfs3-helpers.c:3414:nfs3_log_common_res] 0-nfs-nfsv3: /d4/dir1 => (XID: 6be86, ACCESS: NFS: 70(Invalid file handle), POSIX: 14(Bad address))
[2017-08-09 14:32:22.110010] D [MSGID: 0] [nfs3-helpers.c:1646:nfs3_log_common_call] 0-nfs-nfsv3: XID: 106be86, ACCESS: args: FH: exportid 1ff111d1-32be-4590-9b9b-b1b756e29ba9, gfid 8694b0d6-8fba-40e3-83b9-c45d0f1ca515, mountid e4706d5b-0000-0000-0000-000000000000

Comment 2 Worker Ant 2017-08-11 12:56:08 UTC

REVIEW: https://review.gluster.org/18028 (uss/svc: Send the first lookup to the normal graph) posted (#1) for review on master by mohammed rafi  kc (rkavunga)

Comment 3 Worker Ant 2017-08-14 07:30:24 UTC

COMMIT: https://review.gluster.org/18028 committed in master by mohammed rafi  kc (rkavunga) 
------
commit d396d358d4f0cfe87693179cfd13eb2a84ce62c2
Author: Mohammed Rafi KC <rkavunga>
Date:   Fri Aug 11 17:50:52 2017 +0530

    uss/svc: Send the first lookup to the normal graph
    
    When parent inode is not looked up before, the inode type
    won't be set. In this scenario we should send the lookup
    to the normal graph first.
    
    Change-Id: I5d3e53f5c1c33a17be32204f3114749d9b1e6db8
    BUG: 1480591
    Signed-off-by: Mohammed Rafi KC <rkavunga>
    Reviewed-on: https://review.gluster.org/18028
    Smoke: Gluster Build System <jenkins.org>
    Reviewed-by: Raghavendra Bhat <raghavendra>
    CentOS-regression: Gluster Build System <jenkins.org>
    Reviewed-by: soumya k <skoduri>

Comment 4 Shyamsundar 2017-12-08 17:38:25 UTC

This bug is getting closed because a release has been made available that should address the reported issue. In case the problem is still not fixed with glusterfs-3.13.0, please open a new bug report.

glusterfs-3.13.0 has been announced on the Gluster mailinglists [1], packages for several distributions should become available in the near future. Keep an eye on the Gluster Users mailinglist [2] and the update infrastructure for your distribution.

[1] http://lists.gluster.org/pipermail/announce/2017-December/000087.html
[2] https://www.gluster.org/pipermail/gluster-users/

Note You need to log in before you can comment on or make changes to this bug.