Bug 1323040 - Inconsistent directory structure on dht subvols caused by parent layouts going stale during entry create operations because of fix-layout
Keywords:
Status: CLOSED CURRENTRELEASE
Alias: None
Product: GlusterFS
Classification: Community
Component: distribute
Version: mainline
Hardware: Unspecified
OS: Unspecified
Priority: unspecified
Severity: unspecified
Target Milestone: ---
Assignee: Raghavendra G
QA Contact:
URL:
Whiteboard:
Depends On:
Blocks: 1323042 1329062
 
Reported: 2016-04-01 04:35 UTC by Raghavendra G
Modified: 2016-06-16 14:02 UTC (History)
3 users

Fixed In Version: glusterfs-3.8rc2
Doc Type: Bug Fix
Doc Text:
Clone Of:
: 1323042 1329062 (view as bug list)
Environment:
Last Closed: 2016-06-16 14:02:34 UTC
Regression: ---
Mount Type: ---
Documentation: ---
CRM:
Verified Versions:
Embargoed:



Description Raghavendra G 2016-04-01 04:35:40 UTC
Description of problem:
After rebalance changes the layouts of directories on disk, a client's in-memory layout becomes stale. If higher layers do not drive a fresh lookup on the directory, dht goes ahead and uses the stale layout. This has serious consequences during entry operations, which rely on the layout to determine the hashed subvol. Some of the manifestations of this problem we've seen are:

1. A directory having different gfids on different subvolumes (resulting from parallel mkdirs of the same path from different clients, some having an up-to-date layout and some a stale one).
2. A file whose data file is present on different subvols with different gfids (resulting from parallel creates of the same file from different clients, some having an up-to-date layout and some a stale one).
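
The manifestations above follow from two clients routing the same name through different layouts. A minimal simulation (a hypothetical sketch: the modulo hash, the ranges, and the subvol names are illustrative assumptions, not dht's actual hash or layout format):

```python
# Hypothetical simulation (illustrative only, not GlusterFS code) of how a
# stale in-memory layout misroutes an entry operation after a fix-layout
# has changed the on-disk hash ranges of the parent directory.
import hashlib

def hashed_subvol(name, layout):
    # Pick the subvol whose hash range covers the name's hash value.
    h = int(hashlib.md5(name.encode()).hexdigest(), 16) % 100
    return next(sv for sv, (lo, hi) in layout.items() if lo <= h <= hi)

# Parent layout cached by a client before rebalance: two subvols.
stale_layout = {"subvol-0": (0, 49), "subvol-1": (50, 99)}
# On-disk layout after add-brick + fix-layout: ranges shrink for subvol-2.
fresh_layout = {"subvol-0": (0, 32), "subvol-1": (33, 65), "subvol-2": (66, 99)}

# Find a name that the two layouts route to different subvols.
name = next(n for n in (f"dir-{i}" for i in range(1000))
            if hashed_subvol(n, stale_layout) != hashed_subvol(n, fresh_layout))

# Two clients doing mkdir(name) in parallel now create the directory on
# different subvols, each assigning its own gfid: same path, different gfids.
print(name, hashed_subvol(name, stale_layout), hashed_subvol(name, fresh_layout))
```
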

Version-Release number of selected component (if applicable):


How reproducible:
Quite consistently

Steps to Reproduce:

Set up a dist-rep volume, maybe 6x2.
1. Create a data set with a large number of directories: fairly deep, with several dirs at each level.
2. Add several bricks.
3. From multiple NFS clients, run the same script to create multiple dirs inside the ones already created. We want different clients to try creating the same dirs, so only one should succeed.
4. While the script is running, start a rebalance.
 
The issue we want to test is a mkdir race during rebalance, when different clients have different in-memory layouts for the parent dirs.

Actual results:

Same dir has different gfids on different subvols.

To detect the issue once the test is complete, use the following steps.

1. Have a fuse mount with use-readdirp=no and disable attribute/entry caching.

[root@unused ~]# mount -t glusterfs -o entry-timeout=0,attribute-timeout=0,use-readdirp=no localhost:/dist /mnt/glusterfs
[root@unused ~]# ps ax | grep -i readdirp
30801 ?        Ssl    0:00 /usr/local/sbin/glusterfs --use-readdirp=no --attribute-timeout=0 --entry-timeout=0 --volfile-server=localhost --volfile-id=/dist /mnt/glusterfs

2. Turn off md-cache/stat-prefetch.

[root@unused ~]# gluster volume set dist performance.stat-prefetch off
volume set: success

3. Now crawl the entire glusterfs mount.
  [root@unused ~]# find /mnt/glusterfs > /dev/null

4. Search the mount log for MSGID: 109009.

[root@unused ~]# grep "MSGID: 109009" /var/log/glusterfs/mnt-glusterfs.log 

[2016-03-30 06:00:18.762188] W [MSGID: 109009] [dht-common.c:571:dht_lookup_dir_cbk] 0-dist-dht: /dir: gfid different on dist-client-9. gfid local = cd4adbd2-823b-4feb-82eb-b0011d71cfec, gfid subvol = cafedbd2-823b-4feb-82eb-b0011d71babe
[2016-03-30 06:00:22.596947] W [MSGID: 109009] [dht-common.c:571:dht_lookup_dir_cbk] 0-dist-dht: /dir: gfid different on dist-client-9. gfid local = cd4adbd2-823b-4feb-82eb-b0011d71cfec, gfid subvol = cafedbd2-823b-4feb-82eb-b0011d71babe

Expected results:
1. At most one of the mkdirs issued on the same path from multiple clients should succeed.
2. No directory should have different gfid on different subvols.

Additional info:

Comment 1 Vijay Bellur 2016-04-01 09:55:50 UTC
REVIEW: http://review.gluster.org/13885 (cluster/distribute: detect stale layouts in entry fops) posted (#1) for review on master by Raghavendra G (rgowdapp)

Comment 2 Vijay Bellur 2016-04-01 09:59:56 UTC
REVIEW: http://review.gluster.org/13885 (cluster/distribute: detect stale layouts in entry fops) posted (#2) for review on master by Raghavendra G (rgowdapp)

Comment 3 Vijay Bellur 2016-04-02 05:03:15 UTC
REVIEW: http://review.gluster.org/13885 (cluster/distribute: detect stale layouts in entry fops) posted (#3) for review on master by Raghavendra G (rgowdapp)

Comment 4 Vijay Bellur 2016-04-06 07:12:02 UTC
REVIEW: http://review.gluster.org/13885 (cluster/distribute: detect stale layouts in entry fops) posted (#4) for review on master by Raghavendra G (rgowdapp)

Comment 5 Vijay Bellur 2016-04-06 07:15:05 UTC
REVIEW: http://review.gluster.org/13885 (cluster/distribute: detect stale layouts in entry fops) posted (#5) for review on master by Raghavendra G (rgowdapp)

Comment 6 Vijay Bellur 2016-04-13 05:13:47 UTC
REVIEW: http://review.gluster.org/13885 (cluster/distribute: detect stale layouts in entry fops) posted (#6) for review on master by Raghavendra G (rgowdapp)

Comment 7 Vijay Bellur 2016-04-13 05:15:16 UTC
REVIEW: http://review.gluster.org/13885 (cluster/distribute: detect stale layouts in entry fops) posted (#7) for review on master by Raghavendra G (rgowdapp)

Comment 8 Vijay Bellur 2016-04-13 05:26:57 UTC
REVIEW: http://review.gluster.org/13885 (cluster/distribute: detect stale layouts in entry fops) posted (#8) for review on master by Raghavendra G (rgowdapp)

Comment 9 Vijay Bellur 2016-04-14 06:17:12 UTC
REVIEW: http://review.gluster.org/13885 (cluster/distribute: detect stale layouts in entry fops) posted (#9) for review on master by Raghavendra G (rgowdapp)

Comment 10 Vijay Bellur 2016-04-15 04:40:55 UTC
REVIEW: http://review.gluster.org/13885 (cluster/distribute: detect stale layouts in entry fops) posted (#10) for review on master by Raghavendra G (rgowdapp)

Comment 11 Vijay Bellur 2016-04-19 05:21:47 UTC
REVIEW: http://review.gluster.org/13885 (cluster/distribute: detect stale layouts in entry fops) posted (#11) for review on master by Raghavendra G (rgowdapp)

Comment 12 Vijay Bellur 2016-04-19 06:34:26 UTC
REVIEW: http://review.gluster.org/13885 (cluster/distribute: detect stale layouts in entry fops) posted (#12) for review on master by Raghavendra G (rgowdapp)

Comment 13 Vijay Bellur 2016-04-20 12:59:04 UTC
REVIEW: http://review.gluster.org/13885 (cluster/distribute: detect stale layouts in entry fops) posted (#13) for review on master by Raghavendra G (rgowdapp)

Comment 14 Vijay Bellur 2016-04-21 03:58:34 UTC
REVIEW: http://review.gluster.org/13885 (cluster/distribute: detect stale layouts in entry fops) posted (#14) for review on master by Raghavendra G (rgowdapp)

Comment 15 Vijay Bellur 2016-04-21 08:42:07 UTC
REVIEW: http://review.gluster.org/13885 (cluster/distribute: detect stale layouts in entry fops) posted (#15) for review on master by Raghavendra G (rgowdapp)

Comment 16 Vijay Bellur 2016-04-22 17:28:58 UTC
COMMIT: http://review.gluster.org/13885 committed in master by Raghavendra G (rgowdapp) 
------
commit 823bda0f28cba1b0632d99a22cdecaee16c6db56
Author: Raghavendra G <rgowdapp>
Date:   Fri Apr 1 15:16:23 2016 +0530

    cluster/distribute: detect stale layouts in entry fops
    
    dht_mkdir ()
    {
          first-hashed-subvol = hashed-subvol for "bname" in in-memory
                                layout of "parent";
          inodelk (SETLKW, parent, "LAYOUT_HEAL_DOMAIN", "can be any
                   subvol, but we choose first-hashed-subvol randomly");
          {
    begin:
                hashed-subvol = hashed-subvol for "bname" in in-memory
                                layout of "parent";
                hash-range = extract hash-range from layout of "parent";
    
                ret = mkdir (parent/bname, hashed-subvol, hash-range);
                if (ret == "hash-value doesn't fall into layout stored on
                           the brick (this error is returned by posix-mkdir)")
                {
                    refresh_parent_layout ();
                    goto begin;
                }
    
          }
          inodelk (UNLCK, parent, "LAYOUT_HEAL_DOMAIN",
                   "first-hashed-subvol");
    
          proceed with other parts of dht_mkdir;
    }
    
    posix_mkdir (parent/bname, client-hash-range)
    {
    
           disk-hash-range = getxattr (parent, "dht-layout-key");
           if (disk-hash-range != client-hash-range) {
                  fail-with-error ("hash-value doesn't fall into layout
                                    stored on the brick");
                  return 0;
           }
    
           continue-with-posix-mkdir;
    }
    
    Similar changes need to be done for dentry operations like create,
    symlink, link, unlink, rmdir, rename. These will be addressed in
    subsequent patches. This patch addresses only mkdir codepath.
    
    This change breaks stripe tests, as on some striped subvols dht layout
    xattrs are not set for some reason. This results in failure of
    mkdir. Since striped volumes are always created with dht, some tests
    associated with stripe also fail. So, I am making the following test
    changes (since stripe is out of maintenance):
    * modify ./tests/basic/rpc-coverage.t to not use striped volumes
    * mark all (2) tests in tests/bugs/stripe/ as bad tests
    
    Change-Id: Idd1ae879f24a48303dc743c1bb4d91f89a629e25
    BUG: 1323040
    Signed-off-by: Raghavendra G <rgowdapp>
    Reviewed-on: http://review.gluster.org/13885
    Smoke: Gluster Build System <jenkins.com>
    CentOS-regression: Gluster Build System <jenkins.com>
    NetBSD-regression: NetBSD Build System <jenkins.org>
    Reviewed-by: N Balachandran <nbalacha>
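
The protocol in the committed pseudocode can be sketched as a small simulation (hypothetical Python; the class and method names such as Brick, Client, and StaleLayout are assumptions standing in for posix and dht, and the inodelk taken by the real code is omitted):

```python
# Hypothetical simulation of the stale-layout check from the commit message
# (assumed names; not GlusterFS code). The brick rejects a mkdir whose hash
# range disagrees with the on-disk layout, and the client refreshes its
# parent layout and retries, mirroring the 'goto begin' loop above.

class StaleLayout(Exception):
    """Stands in for the 'hash-value doesn't fall into layout' error."""

class Brick:
    def __init__(self, disk_range):
        self.disk_range = disk_range   # layout stored in the dht xattr on disk
        self.dirs = set()

    def mkdir(self, name, client_range):
        # posix_mkdir: compare the client's view with the on-disk layout.
        if client_range != self.disk_range:
            raise StaleLayout()
        self.dirs.add(name)

class Client:
    def __init__(self, brick, cached_range):
        self.brick = brick
        self.cached_range = cached_range   # in-memory layout of the parent

    def mkdir(self, name):
        # dht_mkdir: the real code holds an inodelk in LAYOUT_HEAL_DOMAIN
        # around this loop; locking is omitted in this sketch.
        while True:
            try:
                self.brick.mkdir(name, self.cached_range)
                return
            except StaleLayout:
                # refresh_parent_layout(); goto begin
                self.cached_range = self.brick.disk_range

brick = Brick(disk_range=(0, 32))            # on-disk layout after fix-layout
stale = Client(brick, cached_range=(0, 49))  # client that missed the change
stale.mkdir("newdir")  # first attempt is rejected; retry after refresh succeeds
assert "newdir" in brick.dirs
```

The key design point mirrored here is that the brick, not the client, is the authority on the layout: a stale client cannot create an entry until its view matches the on-disk xattr.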

Comment 17 Vijay Bellur 2016-04-25 11:49:27 UTC
REVIEW: http://review.gluster.org/14062 (storage/posix: change the conflicting msg-id) posted (#1) for review on master by Raghavendra G (rgowdapp)

Comment 18 Vijay Bellur 2016-04-25 11:53:39 UTC
REVIEW: http://review.gluster.org/14062 (storage/posix: change the conflicting msg-id) posted (#2) for review on master by Raghavendra G (rgowdapp)

Comment 19 Vijay Bellur 2016-04-26 08:24:15 UTC
COMMIT: http://review.gluster.org/14062 committed in master by Raghavendra G (rgowdapp) 
------
commit 007dce0c7093a8534dd23340a38a8ce3cf3cd048
Author: Raghavendra G <rgowdapp>
Date:   Mon Apr 25 17:16:42 2016 +0530

    storage/posix: change the conflicting msg-id
    
    Change-Id: I11b2ffb73b2358380771921548fa2c51da6ad93f
    BUG: 1323040
    Signed-off-by: Raghavendra G <rgowdapp>
    Reviewed-on: http://review.gluster.org/14062
    Reviewed-by: N Balachandran <nbalacha>
    Smoke: Gluster Build System <jenkins.com>
    NetBSD-regression: NetBSD Build System <jenkins.org>
    Reviewed-by: Poornima G <pgurusid>
    CentOS-regression: Gluster Build System <jenkins.com>

Comment 20 Vijay Bellur 2016-05-13 11:10:09 UTC
REVIEW: http://review.gluster.org/14333 (dht: detect stale layout in unlink fop) posted (#1) for review on master by Sakshi Bansal

Comment 21 Niels de Vos 2016-06-16 14:02:34 UTC
This bug is getting closed because a release has been made available that should address the reported issue. In case the problem is still not fixed with glusterfs-3.8.0, please open a new bug report.

glusterfs-3.8.0 has been announced on the Gluster mailing lists [1], and packages for several distributions should become available in the near future. Keep an eye on the Gluster Users mailing list [2] and the update infrastructure for your distribution.

[1] http://blog.gluster.org/2016/06/glusterfs-3-8-released/
[2] http://thread.gmane.org/gmane.comp.file-systems.gluster.user

