Bug 1323042 - Inconsistent directory structure on dht subvols caused by parent layouts going stale during entry create operations because of fix-layout
Summary: Inconsistent directory structure on dht subvols caused by parent layouts going stale during entry create operations because of fix-layout
Keywords:
Status: CLOSED ERRATA
Alias: None
Product: Red Hat Gluster Storage
Classification: Red Hat Storage
Component: distribute
Version: unspecified
Hardware: Unspecified
OS: Unspecified
Priority: unspecified
Severity: unspecified
Target Milestone: ---
Target Release: RHGS 3.1.3
Assignee: Raghavendra G
QA Contact: krishnaram Karthick
URL:
Whiteboard:
Depends On: 1323040 1329062
Blocks: 1311387 1311817
 
Reported: 2016-04-01 04:44 UTC by Raghavendra G
Modified: 2016-06-23 05:15 UTC
CC List: 7 users

Fixed In Version: glusterfs-3.7.9-3
Doc Type: Bug Fix
Doc Text:
The hashed subvolume of the Distributed Hash Table acts as an arbitrator for parallel directory entry operations like mkdir. This hashed subvolume is derived from the layout of the parent directory. If the directory layout on bricks changed, and fix layout ran before the layout was synced to clients, client and server layouts differed. This also meant that the client and server hashed subvolumes differed, affecting file and directory GFID consistency and causing various directory operations to fail. Clients are now prompted to refresh their layouts when their hash range does not match that of the server. This prevents issues with stale directory layout between server and client.
Clone Of: 1323040
Environment:
Last Closed: 2016-06-23 05:15:12 UTC
Embargoed:
rgowdapp: needinfo+




Links:
System ID: Red Hat Product Errata RHBA-2016:1240
Private: 0
Priority: normal
Status: SHIPPED_LIVE
Summary: Red Hat Gluster Storage 3.1 Update 3
Last Updated: 2016-06-23 08:51:28 UTC

Description Raghavendra G 2016-04-01 04:44:34 UTC
+++ This bug was initially created as a clone of Bug #1323040 +++

Description of problem:
After rebalance changes the layouts of directories on-disk, the client's in-memory layout becomes stale. If a lookup is not sent on the directory (driven by higher layers), dht goes ahead and uses the stale layout. This has serious consequences during entry operations, which rely on the layout to determine the hashed subvol. Some of the manifestations of this problem we've seen are listed below; a simplified sketch of how the hashed subvol is derived from the parent layout follows the list.

1. A directory having different gfids on different subvols (resulting from parallel mkdirs of the same path from different clients, some having up-to-date layouts and some having stale layouts).
2. A file whose data file is present on different subvols with different gfids (resulting from parallel creates of the same file from different clients, some having up-to-date layouts and some having stale layouts).
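
For clarity, here is a minimal, self-contained sketch of how the hashed subvol for an entry is derived from the parent directory's layout, and why two clients with different in-memory layouts can pick different subvols for the same name. The hash function, types and range values below are illustrative assumptions, not the actual glusterfs code; real dht uses its own name hash and the per-subvol ranges stored in the trusted.glusterfs.dht xattr.

/* Illustrative sketch (not actual glusterfs code) of how dht derives
 * the hashed subvol for an entry from the parent directory's layout. */
#include <stdint.h>
#include <stdio.h>

struct range { uint32_t start; uint32_t stop; const char *subvol; };

/* toy 32-bit hash standing in for dht's real name hash */
static uint32_t toy_hash(const char *name)
{
        uint32_t h = 5381;
        while (*name)
                h = h * 33 + (unsigned char)*name++;
        return h;
}

/* hashed subvol = the subvol whose [start, stop] range covers hash(name) */
static const char *hashed_subvol(const struct range *layout, int n,
                                 const char *name)
{
        uint32_t h = toy_hash(name);
        for (int i = 0; i < n; i++)
                if (h >= layout[i].start && h <= layout[i].stop)
                        return layout[i].subvol;
        return "no subvol (hole in layout)";
}

int main(void)
{
        /* client A still holds the parent layout from before add-brick +
         * fix-layout: two subvols split the hash space */
        const struct range stale[] = {
                { 0x00000000, 0x7fffffff, "dist-client-0" },
                { 0x80000000, 0xffffffff, "dist-client-1" },
        };
        /* client B looked the parent up after fix-layout re-dealt the
         * ranges across three subvols */
        const struct range fresh[] = {
                { 0x00000000, 0x55555555, "dist-client-2" },
                { 0x55555556, 0xaaaaaaaa, "dist-client-0" },
                { 0xaaaaaaab, 0xffffffff, "dist-client-1" },
        };
        const char *name = "dir";

        /* the two clients disagree on the hashed subvol for the same name,
         * so parallel mkdirs create "dir" (with different gfids) on
         * different subvols */
        printf("stale layout picks %s\n", hashed_subvol(stale, 2, name));
        printf("fresh layout picks %s\n", hashed_subvol(fresh, 3, name));
        return 0;
}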

Version-Release number of selected component (if applicable):


How reproducible:
Quite consistently

Steps to Reproduce:

Set up a dist-rep volume, e.g. 6x2.
1. Create a data set with a large number of directories - fairly deep, with several dirs at each level.
2. Add several bricks.
3. From multiple NFS clients, run the same script to create multiple dirs inside the ones already created. We want different clients to try creating the same dirs, so only one should succeed.
4. While the script is running, start a rebalance.
 
The issue we want to test is a mkdir issue during rebalance, when different clients have different in-memory layouts for the parent dirs.

Actual results:

Same dir has different gfids on different subvols.

To find the issue, use the following steps once your test is complete.

1. Have a fuse mount with use-readdirp=no and disable attribute/entry caching.

[root@unused ~]# mount -t glusterfs -o entry-timeout=0,attribute-timeout=0,use-readdirp=no localhost:/dist /mnt/glusterfs
[root@unused ~]# ps ax | grep -i readdirp
30801 ?        Ssl    0:00 /usr/local/sbin/glusterfs --use-readdirp=no --attribute-timeout=0 --entry-timeout=0 --volfile-server=localhost --volfile-id=/dist /mnt/glusterfs

2. turn off md-cache/stat-prefetch 

[root@unused ~]# gluster volume set dist performance.stat-prefetch off
volume set: success

3. Now do a crawl of the entire glusterfs.
  [root@unused ~]# find /mnt/glusterfs > /dev/null

4. Look in the mount log for MSGID: 109009; a simplified sketch of the check behind this warning follows the example log lines below.

[root@unused ~]# grep "MSGID: 109009" /var/log/glusterfs/mnt-glusterfs.log 

[2016-03-30 06:00:18.762188] W [MSGID: 109009] [dht-common.c:571:dht_lookup_dir_cbk] 0-dist-dht: /dir: gfid different on dist-client-9. gfid local = cd4adbd2-823b-4feb-82eb-b0011d71cfec, gfid subvol = cafedbd2-823b-4feb-82eb-b0011d71babe
[2016-03-30 06:00:22.596947] W [MSGID: 109009] [dht-common.c:571:dht_lookup_dir_cbk] 0-dist-dht: /dir: gfid different on dist-client-9. gfid local = cd4adbd2-823b-4feb-82eb-b0011d71cfec, gfid subvol = cafedbd2-823b-4feb-82eb-b0011d71babe
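
The warning above comes from dht's directory lookup noticing that the gfid it already holds for the directory differs from the gfid a subvol returned. Below is a minimal, self-contained sketch of that check; it is a simplified assumption, not the actual dht_lookup_dir_cbk code, and the names and gfid values are illustrative.

/* Simplified sketch of the gfid comparison behind MSGID 109009
 * (not actual glusterfs code). */
#include <stdint.h>
#include <stdio.h>
#include <string.h>

typedef uint8_t gfid_t[16];

static void check_subvol_gfid(const char *path, const char *subvol,
                              const gfid_t local, const gfid_t from_subvol)
{
        /* the directory is expected to have one gfid everywhere; a
         * mismatch between the cached (local) gfid and what a subvol
         * reports produces the "gfid different on <subvol>" warning */
        if (memcmp(local, from_subvol, sizeof(gfid_t)) != 0)
                fprintf(stderr,
                        "W [MSGID: 109009] %s: gfid different on %s\n",
                        path, subvol);
}

int main(void)
{
        gfid_t local       = { 0xcd, 0x4a, 0xdb, 0xd2 };  /* rest zero */
        gfid_t from_subvol = { 0xca, 0xfe, 0xdb, 0xd2 };

        check_subvol_gfid("/dir", "dist-client-9", local, from_subvol);
        return 0;
}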

Expected results:
1. At most one of the mkdirs issued on the same path from multiple clients should succeed.
2. No directory should have different gfids on different subvols.

Additional info:

Comment 5 krishnaram Karthick 2016-05-19 13:24:23 UTC
Verified the bug in build - glusterfs-3.7.9-5

The following steps were used to verify the fix; a rough sketch of the behaviour they exercise follows the list.

1) Create a pure distribute volume
2) Fuse mount the volume on a client, say client-1
3) Attach gdb to the mount process and set a breakpoint on 'dht_mkdir'
4) Create a directory from client-1
5) Print the value of hashed_subvol->name. Note down the value
6) Add more bricks
7) Run fix-layout
8) Fuse mount the volume on a new client, say client-2
9) Attach gdb to the mount process and set a breakpoint on 'dht_mkdir_hashed_cbk'
10) Print the value of hashed_subvol->name. Ensure this value is not the same as the value in step 5
11) Allow both processes to continue
12) Check that the directory is created and that its gfid is the same on all the subvols
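
For reference, here is a rough, self-contained sketch of the behaviour these steps exercise, based on the doc text above (clients are prompted to refresh their layouts when their hash range does not match that of the server). It is an assumption-level illustration, not the actual dht_mkdir code path; the names and range values are made up.

/* Illustrative sketch (not actual glusterfs code) of detecting a stale
 * client layout during an entry create and refreshing before retrying. */
#include <stdint.h>
#include <stdio.h>

struct hash_range { uint32_t start; uint32_t stop; };

/* "server" view: on-disk hash range owned by the hashed subvol after
 * fix-layout has run */
static const struct hash_range on_disk = { 0x00000000, 0x55555555 };

/* server-side check: reject an entry create whose client-supplied hash
 * range no longer matches the on-disk layout */
static int server_mkdir(const struct hash_range *client_range)
{
        if (client_range->start == on_disk.start &&
            client_range->stop == on_disk.stop)
                return 0;   /* ranges agree, proceed with the mkdir */
        return -1;          /* stale layout: tell the client to refresh */
}

int main(void)
{
        /* client still caches the pre-fix-layout range for the parent */
        struct hash_range cached = { 0x00000000, 0x7fffffff };

        if (server_mkdir(&cached) != 0) {
                /* mismatch: refresh the parent layout (re-read the dht
                 * xattr) and retry against the new hashed subvol */
                cached = on_disk;
                printf("layout was stale, refreshed and retrying\n");
        }
        printf("mkdir %s\n",
               server_mkdir(&cached) == 0 ? "succeeded" : "failed");
        return 0;
}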

Comment 6 krishnaram Karthick 2016-05-19 15:15:49 UTC
Additionally, the test mentioned in the steps to reproduce was also tried. The issue is no longer seen. Marking the bug as verified.

Comment 8 Raghavendra G 2016-06-10 04:25:17 UTC
Laura,

Doc text is fine.

regards,
Raghavendra

Comment 10 errata-xmlrpc 2016-06-23 05:15:12 UTC
Since the problem described in this bug report should be
resolved in a recent advisory, it has been closed with a
resolution of ERRATA.

For information on the advisory, and where to find the updated
files, follow the link below.

If the solution does not work for you, open a new bug report.

https://access.redhat.com/errata/RHBA-2016:1240

