Bug 1438411 - [Ganesha + EC] : Input/Output Error while creating LOTS of smallfiles
Summary: [Ganesha + EC] : Input/Output Error while creating LOTS of smallfiles
Keywords:
Status: CLOSED CURRENTRELEASE
Alias: None
Product: GlusterFS
Classification: Community
Component: distribute
Version: mainline
Hardware: x86_64
OS: Linux
Priority: unspecified
Severity: high
Target Milestone: ---
Assignee: Pranith Kumar K
QA Contact:
URL:
Whiteboard:
Depends On: 1415038
Blocks: 1438423 1438424
Reported: 2017-04-03 11:43 UTC by Pranith Kumar K
Modified: 2017-05-30 18:48 UTC (History)
CC List: 12 users

Fixed In Version: glusterfs-3.11.0
Doc Type: If docs needed, set a value
Doc Text:
Clone Of: 1415038
Clones: 1438423 1438424
Environment:
Last Closed: 2017-05-30 18:48:52 UTC
Regression: ---
Mount Type: ---
Documentation: ---
CRM:
Verified Versions:
Embargoed:



Comment 1 Pranith Kumar K 2017-04-03 11:45:52 UTC
    Problem:
    local->loc.gfid in dht_lookup_directory() is the null gfid for a fresh lookup.
    dht_lookup_dir_cbk() updates local->loc.gfid while, in another thread,
    dht_lookup_directory() is still winding lookup calls to subvolumes, so EC can
    see a partially written gfid.
    
    On a 12x(4+2) volume, EC received a loc whose gfid had its last 12 bytes matching
    the gfid of the directory and its first 4 bytes all zeros. EC fails the lookup
    with EINVAL, which in turn makes NFS fail the lookup with EIO.
    
    snip from gdb:
    $37 = (dht_local_t *) 0x7fde5de5b3cc
    (gdb) p /x $37->loc.gfid
    $39 = {0x3b, 0x82, 0x10, 0x5e, 0x40, 0x65, 0x43, 0x14, 0xa0, 0xc6, 0x8, 0xf5,
    0x6c, 0x2c, 0xb8, 0x56}
    (gdb) fr 7
    state=<optimized out>) at ec-generic.c:837
    837                     ec_lookup_rebuild(fop->xl->private, fop, cbk);
    (gdb) p /x fop->loc[0].gfid
    $40 = {0x0, 0x0, 0x0, 0x0, 0x0, 0x0, 0x43, 0x14, 0xa0, 0xc6, 0x8, 0xf5, 0x6c,
    0x2c, 0xb8, 0x56}
    
    snip from log:
    [2017-01-29 03:22:30.132328] W [MSGID: 122019] [ec-helpers.c:354:ec_loc_gfid_check] 0-butcher-disperse-4: Mismatching GFID's in loc
    [2017-01-29 03:22:30.132709] W [MSGID: 112199] [nfs3-helpers.c:3515:nfs3_log_newfh_res] 0-nfs-nfsv3: /linux-4.9.5/Documentation => (XID: b27b9474, MKDIR: NFS: 5(I/O error), POSIX: 5(Input/output error)), FH: exportid 00000000-0000-0000-0000-000000000000, gfid 00000000-0000-0000-0000-000000000000, mountid 00000000-0000-0000-0000-000000000000 [Invalid argument]
    
    Fix:
    Update local->loc.gfid only in the last call of the callback, so that it is
    never written while lookups are still being wound.
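
The last-call idea can be sketched in isolation. The following is only an illustration under stated assumptions, not the committed patch: `struct lookup_local`, `lookup_local_init()` and `lookup_cbk()` are hypothetical stand-ins for the relevant parts of dht_local_t and dht_lookup_dir_cbk(), and C11 atomics replace whatever locking the real code uses. Each reply stores the gfid in a private field, and only the reply that brings the outstanding-call count to zero copies it into the field that is passed to subvolumes.

```c
#include <stdatomic.h>
#include <stdbool.h>
#include <string.h>

#define GFID_LEN 16

/* Directory gfid taken from the gdb output in this comment. */
static const unsigned char dir_gfid[GFID_LEN] = {
    0x3b, 0x82, 0x10, 0x5e, 0x40, 0x65, 0x43, 0x14,
    0xa0, 0xc6, 0x08, 0xf5, 0x6c, 0x2c, 0xb8, 0x56};

/* Hypothetical stand-in for the relevant parts of dht_local_t. */
struct lookup_local {
    atomic_int call_cnt;              /* replies still outstanding */
    atomic_bool gfid_seen;            /* has any reply recorded a gfid? */
    unsigned char gfid[GFID_LEN];     /* private copy, never handed to subvolumes */
    unsigned char loc_gfid[GFID_LEN]; /* stands in for local->loc.gfid */
};

static bool gfid_is_null(const unsigned char *g)
{
    static const unsigned char zero[GFID_LEN];
    return memcmp(g, zero, GFID_LEN) == 0;
}

static void lookup_local_init(struct lookup_local *l, int subvol_count)
{
    atomic_init(&l->call_cnt, subvol_count);
    atomic_init(&l->gfid_seen, false);
    memset(l->gfid, 0, GFID_LEN);
    memset(l->loc_gfid, 0, GFID_LEN); /* null gfid for a fresh lookup */
}

/* Called once per subvolume reply; returns true on the last call. */
static bool lookup_cbk(struct lookup_local *l, const unsigned char *reply_gfid)
{
    /* Only the first reply writes the private copy. */
    if (!atomic_exchange(&l->gfid_seen, true))
        memcpy(l->gfid, reply_gfid, GFID_LEN);

    /* fetch_sub returns the previous value, so 1 means this is the last call. */
    if (atomic_fetch_sub(&l->call_cnt, 1) != 1)
        return false;

    /* Every wind has been issued and answered: publish the gfid. */
    memcpy(l->loc_gfid, l->gfid, GFID_LEN);
    return true;
}
```

In this model `loc_gfid` stays all zeros until the final reply, so a thread that is still winding lookups can only ever read the null gfid, never a half-copied one.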

Comment 2 Pranith Kumar K 2017-04-03 11:47:15 UTC
(In reply to Pranith Kumar K from comment #1)
>     We saw in 12x(4+2) volume, ec is receiving an loc where the gfid has last 12 bytes matching

Sorry, just last 10 bytes not 12 bytes.
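
The byte counts can be checked directly against the two gfids printed by gdb in comment 1. The helper names below are made up for this check; only the byte values come from the gdb output.

```c
#include <stddef.h>

#define GFID_LEN 16

/* Byte values copied from the gdb output in comment 1. */
static const unsigned char dht_gfid[GFID_LEN] = {
    0x3b, 0x82, 0x10, 0x5e, 0x40, 0x65, 0x43, 0x14,
    0xa0, 0xc6, 0x08, 0xf5, 0x6c, 0x2c, 0xb8, 0x56};
static const unsigned char ec_gfid[GFID_LEN] = {
    0x00, 0x00, 0x00, 0x00, 0x00, 0x00, 0x43, 0x14,
    0xa0, 0xc6, 0x08, 0xf5, 0x6c, 0x2c, 0xb8, 0x56};

/* Number of zero bytes at the start of a gfid. */
static size_t leading_zero_bytes(const unsigned char *g)
{
    size_t n = 0;
    while (n < GFID_LEN && g[n] == 0)
        n++;
    return n;
}

/* Number of bytes at the end of a and b that are identical. */
static size_t trailing_matching_bytes(const unsigned char *a,
                                      const unsigned char *b)
{
    size_t n = 0;
    while (n < GFID_LEN && a[GFID_LEN - 1 - n] == b[GFID_LEN - 1 - n])
        n++;
    return n;
}
```

With these inputs, leading_zero_bytes(ec_gfid) evaluates to 6 and trailing_matching_bytes(dht_gfid, ec_gfid) to 10, which together account for all 16 bytes of the gfid.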


Comment 3 Worker Ant 2017-04-03 11:48:28 UTC
REVIEW: https://review.gluster.org/16986 (cluster/dht: Modify local->loc.gfid in thread safe manner) posted (#1) for review on master by Pranith Kumar Karampuri (pkarampu)

Comment 4 Worker Ant 2017-04-03 11:53:54 UTC
REVIEW: https://review.gluster.org/16986 (cluster/dht: Modify local->loc.gfid in thread safe manner) posted (#2) for review on master by Pranith Kumar Karampuri (pkarampu)

Comment 5 Worker Ant 2017-04-04 12:44:56 UTC
COMMIT: https://review.gluster.org/16986 committed in master by Pranith Kumar Karampuri (pkarampu) 
------
commit b75fa35694af916e0923f10e4f9491c364a4ba79
Author: Pranith Kumar K <pkarampu>
Date:   Thu Mar 30 14:58:38 2017 +0530

    cluster/dht: Modify local->loc.gfid in thread safe manner
    
    Problem:
    local->loc.gfid in dht_lookup_directory() is the null gfid for a fresh lookup.
    dht_lookup_dir_cbk() updates local->loc.gfid while, in another thread,
    dht_lookup_directory() is still winding lookup calls to subvolumes, so EC can
    see a partially written gfid.
    
    On a 12x(4+2) volume, EC received a loc whose gfid had its last 10 bytes matching
    the gfid of the directory and its first 6 bytes all zeros. EC fails the lookup
    with EINVAL, which in turn makes NFS fail the lookup with EIO.
    
    snip from gdb:
    $37 = (dht_local_t *) 0x7fde5de5b3cc
    (gdb) p /x $37->loc.gfid
    $39 = {0x3b, 0x82, 0x10, 0x5e, 0x40, 0x65, 0x43, 0x14, 0xa0, 0xc6, 0x8, 0xf5,
    0x6c, 0x2c, 0xb8, 0x56}
    (gdb) fr 7
    state=<optimized out>) at ec-generic.c:837
    837	                ec_lookup_rebuild(fop->xl->private, fop, cbk);
    (gdb) p /x fop->loc[0].gfid
    $40 = {0x0, 0x0, 0x0, 0x0, 0x0, 0x0, 0x43, 0x14, 0xa0, 0xc6, 0x8, 0xf5, 0x6c,
    0x2c, 0xb8, 0x56}
    
    snip from log:
    [2017-01-29 03:22:30.132328] W [MSGID: 122019] [ec-helpers.c:354:ec_loc_gfid_check] 0-butcher-disperse-4: Mismatching GFID's in loc
    [2017-01-29 03:22:30.132709] W [MSGID: 112199] [nfs3-helpers.c:3515:nfs3_log_newfh_res] 0-nfs-nfsv3: /linux-4.9.5/Documentation => (XID: b27b9474, MKDIR: NFS: 5(I/O error), POSIX: 5(Input/output error)), FH: exportid 00000000-0000-0000-0000-000000000000, gfid 00000000-0000-0000-0000-000000000000, mountid 00000000-0000-0000-0000-000000000000 [Invalid argument]
    
    Fix:
    Update local->loc.gfid only in the last call of the callback, so that it is
    never written while lookups are still being wound.
    
    BUG: 1438411
    Change-Id: Ifcb7e911568c1f1f83123da6ff0cf742b91800a0
    Signed-off-by: Pranith Kumar K <pkarampu>
    Reviewed-on: https://review.gluster.org/16986
    Reviewed-by: Raghavendra G <rgowdapp>
    NetBSD-regression: NetBSD Build System <jenkins.org>
    CentOS-regression: Gluster Build System <jenkins.org>
    Smoke: Gluster Build System <jenkins.org>
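
Why a half-written gfid is rejected while the null gfid of a fresh lookup is not can be illustrated with a simplified consistency check. This is a sketch under assumptions, not the code in ec-helpers.c: `check_loc_gfid()` is a hypothetical function, and the rule it applies (a null gfid in the loc is accepted, a non-null one must equal the known gfid, anything else is EINVAL) is assumed from the "Mismatching GFID's in loc" warning and the EINVAL described above.

```c
#include <errno.h>
#include <stdbool.h>
#include <string.h>

#define GFID_LEN 16

/* Byte values copied from the gdb output in the commit message. */
static const unsigned char full_gfid[GFID_LEN] = {
    0x3b, 0x82, 0x10, 0x5e, 0x40, 0x65, 0x43, 0x14,
    0xa0, 0xc6, 0x08, 0xf5, 0x6c, 0x2c, 0xb8, 0x56};
static const unsigned char partial_gfid[GFID_LEN] = {
    0x00, 0x00, 0x00, 0x00, 0x00, 0x00, 0x43, 0x14,
    0xa0, 0xc6, 0x08, 0xf5, 0x6c, 0x2c, 0xb8, 0x56};
static const unsigned char null_gfid[GFID_LEN] = {0};

static bool gfid_is_null(const unsigned char *g)
{
    return memcmp(g, null_gfid, GFID_LEN) == 0;
}

/*
 * Hypothetical check: returns 0 when loc_gfid does not contradict
 * known_gfid, and -EINVAL when both are set but differ.
 */
static int check_loc_gfid(const unsigned char *loc_gfid,
                          const unsigned char *known_gfid)
{
    if (gfid_is_null(loc_gfid))
        return 0;               /* fresh lookup: nothing to contradict */
    if (memcmp(loc_gfid, known_gfid, GFID_LEN) == 0)
        return 0;               /* fully written gfid: consistent */
    return -EINVAL;             /* e.g. a partially copied gfid */
}
```

Under this rule both states that a race-free lookup can produce (all zeros, or the complete gfid) pass, and only the torn value seen in gdb fails.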

Comment 6 Shyamsundar 2017-05-30 18:48:52 UTC
This bug is getting closed because a release has been made available that should address the reported issue. In case the problem is still not fixed with glusterfs-3.11.0, please open a new bug report.

glusterfs-3.11.0 has been announced on the Gluster mailing lists [1]; packages for several distributions should become available in the near future. Keep an eye on the Gluster Users mailing list [2] and the update infrastructure for your distribution.

[1] http://lists.gluster.org/pipermail/announce/2017-May/000073.html
[2] https://www.gluster.org/pipermail/gluster-users/

