Bug 838784

Summary: DHT: readdirp goes into an infinite loop with ext4

Product: [Community] GlusterFS
Component: distribute
Version: pre-release
Hardware: Unspecified
OS: Unspecified
Status: CLOSED CURRENTRELEASE
Severity: high
Priority: high
Reporter: shishir gowda <sgowda>
Assignee: shishir gowda <sgowda>
CC: aavati, adechiaro, admin, bugs+redhat, christian.wittwer, fdewaley, filip.pytloun, glusterbugs, gluster-bugs, iwienand, jbyers, jdarcy, joe, johnmark, mailbox, ndevos, nsathyan, nux, olivier.lelain, pierre.francois, primeroznl, sia, social, vikumar, wenfeng.liu, wls
Fixed In Version: glusterfs-3.4.0
Doc Type: Bug Fix
Type: Bug
Last Closed: 2013-07-24 17:19:39 UTC
Bug Blocks: 918917

Attachments: Test to find out if a system is affected

Description shishir gowda 2012-07-10 05:50:00 UTC
Description of problem:
GlusterFS is broken against upstream ext4 (since kernel commit
d1f5273e9adb40724a85272f248f210dc4ce919a, i.e. v3.3-rc2-2-gd1f5273). Ext4 now
returns 64-bit d_off values in struct dirent, and when DHT applies
dht_itransform() to those values, the result overflows UINT64_MAX for almost
every file. The truncated readdir cookie is then used for the following
readdir, and ext4 seeks back to offset 0 for unknown cookies. The listing
therefore never terminates: an infinite readdir loop.
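
To make the failure mode concrete, here is a minimal standalone sketch (not GlusterFS source; the subvolume count, brick id, and sample d_off are invented for illustration) of how a multiply-based transform wraps around on ext4's large hash-based offsets:

/* overflow_demo.c: why d_off_brick * N + brick_id wraps for ext4 */
#include <stdint.h>
#include <stdio.h>

int main(void)
{
    uint64_t d_off_brick = 0x4aab9dcd348b97ffULL; /* 63-bit ext4-style d_off */
    uint64_t n_subvols   = 4;                     /* assumed DHT subvolume count */
    uint64_t brick_id    = 2;

    uint64_t d_off_global = d_off_brick * n_subvols + brick_id; /* wraps past UINT64_MAX */
    uint64_t decoded      = d_off_global / n_subvols;           /* no longer the original */

    printf("brick d_off:   %llu\n", (unsigned long long)d_off_brick);
    printf("global d_off:  %llu (wrapped)\n", (unsigned long long)d_off_global);
    printf("decoded d_off: %llu (mismatch -> ext4 seeks back to 0)\n",
           (unsigned long long)decoded);
    return 0;
}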


Comment 1 Anthony DeChiaro 2012-08-08 15:04:17 UTC
I'm seeing this behavior on 2.6.32-279.el6.x86_64, default CentOS 6.3 install (CentOS release 6.3 (Final)).

I created a distributed-replicate volume, then tried an add-brick, which triggered this behavior.  Details (including log files) are available here if needed: http://fpaste.org/ymq6/

A sample test program that checks for this was kindly suggested by the folks in #gluster (http://pastie.org/4412258).  My results indicate the ext4 issue is definitely present on this kernel:

[root@njtest115 ~]# ./testext4 .
.viminfo: 336304820918345855 (336304820918345855)
testext4: 772176040014408130 (772176040014408130)
.: 1204328800623893744 (1204328800623893744)
.bash_history: 1486019236749960094 (1486019236749960094)

Whereas on an XFS filesystem the values look fine:

[root@njtest115 ~]# ./testext4 /srv/gluster/brick1/
.: 4 (4)
..: 6 (6)
.glusterfs: 9 (9)
tmp.tWzGg1lJCG: 13 (13)
tmp.3hQLDjVnHe: 17 (17)
tmp.G6MO4oTGxA: 21 (21)

Comment 2 shishir gowda 2012-08-08 15:12:10 UTC
*** Bug 846723 has been marked as a duplicate of this bug. ***

Comment 3 Louis Zuckerman 2012-08-08 15:13:34 UTC
Created attachment 603057 [details]
Test to find out if a system is affected

This program was written by Avati in response to the first report of this problem in #gluster IRC.

Compile & run it on a glusterfs server: ./a.out <brick-dir>

It will report "d_off" values for all items in <brick-dir> (not recursively).

If those values are 32-bit, the system is not affected by this bug.  If those values are 64-bit, then the system is affected.

Please feel free to comment on this bug with your distribution release version & kernel version so that others can find out what systems are known to be affected.
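
For reference, a minimal sketch of such a test follows (an approximation, not necessarily byte-for-byte identical to attachment 603057; the second printed value here is the telldir() position):

/* testdoff.c: print d_off for every entry in a directory */
#define _GNU_SOURCE
#include <dirent.h>
#include <stdio.h>

int main(int argc, char *argv[])
{
    if (argc != 2) {
        fprintf(stderr, "usage: %s <brick-dir>\n", argv[0]);
        return 1;
    }

    DIR *dir = opendir(argv[1]);
    if (!dir) {
        perror("opendir");
        return 1;
    }

    struct dirent *entry;
    while ((entry = readdir(dir)) != NULL)
        /* values that do not fit in 32 bits indicate an affected system */
        printf("%s: %lld (%ld)\n", entry->d_name,
               (long long)entry->d_off, telldir(dir));

    closedir(dir);
    return 0;
}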

Comment 4 Louis Zuckerman 2012-08-08 15:19:04 UTC
user NuxRo on #gluster IRC also reports having this problem on a CentOS system with a 2.6 kernel.

The reported OS/kernel version is "centos 6.3, 2.6.32-279.2.1.el6.x86_64"

Comment 5 Joe Julian 2012-08-15 06:40:46 UTC
The ext4 change was backported into the el6 kernel in:
* Tue Apr 24 2012 Jarod Wilson <jarod> [2.6.32-268.el6]
- [fs] ext4: return 32/64-bit dir name hash according to usage type (J. Bruce Fields) [813070]

So as a workaround, anyone on a newer kernel can roll back to 2.6.32-267.el6 or earlier until the GlusterFS fix is released.

Comment 6 Niels de Vos 2012-08-16 15:34:35 UTC
Patches have been posted and are currently pending review:
- http://review.gluster.com/3679
- http://review.gluster.com/3756

Comment 7 Christian Wittwer 2012-10-12 10:56:08 UTC
I'm having trouble interpreting the results correctly.

root@foo:~# ./test /data/brick0/instances/_base/
496e4306f1de9b647fc0a6021b35fd5428e88c70_10: 495888329 (495888329)
..: 616848399 (616848399)
.: 723397576 (723397576)
fc9ab12b1551f9c358b5acd65e8fcbe197c89f6c: 752979281 (752979281)
ephemeral_0_40_None: 995810121 (995810121)
496e4306f1de9b647fc0a6021b35fd5428e88c70: 2147483647 (2147483647)

Does this output mean that our kernel already has that patch?
We use kernel 3.2.0-26-generic on Ubuntu 12.04.

Comment 8 Louis Zuckerman 2012-10-12 13:13:24 UTC
Christian,

That output means that your kernel does *not* have the change to ext4, so it is safe to use glusterfs with ext4 bricks.  However, XFS is generally recommended over ext4 for glusterfs, so you may still want to use XFS even though ext4 would work on your system.

Comment 9 JMW 2012-12-20 15:38:16 UTC
Any changes to this? What is the course of action we are taking? Do we just tell users not to use certain kernels? What do we recommend?

Comment 10 Jeff Darcy 2013-03-12 14:56:37 UTC
FWIW, I refreshed the patch and ran it on a Fedora 17 machine that does seem to have the relevant version of ext4.  Listing a 10K-file directory does seem to work using our native protocol, though the d_off values are different from what I see locally.  Using NFS we do still get into the infinite loop.

Comment 11 shishir gowda 2013-03-12 15:56:46 UTC
Hi Jeff,

The current fix is only for FUSE clients.  We save the overflow bits (from the transformation) and the previously returned offset in the fd_ctx. On a subsequent readdir call with that previous offset, we use the saved overflow bits to re-transform it.

For NFS we cannot use the fd_ctx, as it uses anonymous fds. The plan was to pass the overflow and offset in the cookie of the readdir response; our understanding was that the cookie would be sent back in the subsequent readdir call, which would let us re-transform. But we see NFS clients listing duplicate entries.
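
(Conceptually, the per-fd state looks something like the sketch below; these are assumed names for illustration, not the actual GlusterFS fd_ctx layout.)

/* illustrative only */
#include <stdint.h>

struct dht_fd_ctx {
        uint64_t last_off;       /* d_off returned to the client last time */
        uint64_t overflow_bits;  /* high bits dropped by the transform     */
};

/* On the next readdir(fd, off): if off == ctx->last_off, OR the saved
 * overflow bits back into the offset before forwarding it to the brick. */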

Comment 12 Jeff Darcy 2013-03-13 03:16:17 UTC
As far as I can tell, the NFS client doesn't return anything useful between calls except for the readdir cookie, which is no bigger than that used by the local FS.  There was some talk at one point of using the RPC verifier, but I really don't see how that is supposed to work.  The only approach that seems viable at this point is to have the client maintain an external mapping of inode plus user cookie to subvolume plus local-FS cookie.  The question then becomes how to populate such maps, and subsequently depopulate (i.e. garbage-collect) them when they're no longer likely to be as useful as the space they consume.  *sigh*  It's going to be yet another ugly but necessary bit of code festering in our source tree forever just because of other people's lameness.  Thanks, NFS and ext4 developers.

Comment 13 Jeff Darcy 2013-03-16 03:31:45 UTC
After lengthy discussion and lengthier coding, I've created a patch which uses the mapping approach described above.

http://review.gluster.org/#change,4675

Specifically, this uses a fixed-size cache with approximate LRU to map from GFID and user-visible d_off to a subvolume and local-FS d_off.  It seems to work around the problem in some basic tests, so it's a start, but a lot more work and discussion will be necessary before we'll be ready for prime time.
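
In outline, the mapping can be pictured as below (a sketch under assumed names and sizes; the actual cache in the patch may differ):

/* fixed-size cache, approximate LRU:
 * (directory GFID, user-visible d_off) -> (subvolume, local-FS d_off) */
#include <stdint.h>
#include <string.h>

#define DOFF_CACHE_SIZE 1024

typedef struct {
        uint8_t  gfid[16];   /* directory identity                   */
        uint64_t user_off;   /* d_off we handed to the client        */
        int      subvol;     /* subvolume to resume the walk on      */
        uint64_t local_off;  /* local filesystem d_off to seekdir to */
        uint64_t stamp;      /* last-use tick, for approximate LRU   */
} doff_ent_t;

static doff_ent_t cache[DOFF_CACHE_SIZE];
static uint64_t   tick;

/* each key hashes to two candidate slots */
static void slots_for(uint64_t off, doff_ent_t **a, doff_ent_t **b)
{
        *a = &cache[off % DOFF_CACHE_SIZE];
        *b = &cache[(off * 2654435761ULL) % DOFF_CACHE_SIZE];
}

/* NULL on a miss: the caller must fall back, e.g. restart the walk */
doff_ent_t *doff_lookup(const uint8_t gfid[16], uint64_t off)
{
        doff_ent_t *a, *b;

        slots_for(off, &a, &b);
        if (a->user_off == off && !memcmp(a->gfid, gfid, 16)) {
                a->stamp = ++tick;
                return a;
        }
        if (b->user_off == off && !memcmp(b->gfid, gfid, 16)) {
                b->stamp = ++tick;
                return b;
        }
        return NULL;
}

void doff_insert(const uint8_t gfid[16], uint64_t off,
                 int subvol, uint64_t local_off)
{
        doff_ent_t *a, *b, *e;

        slots_for(off, &a, &b);
        e = (a->stamp <= b->stamp) ? a : b;   /* evict the older slot: ~LRU */
        memcpy(e->gfid, gfid, 16);
        e->user_off  = off;
        e->subvol    = subvol;
        e->local_off = local_off;
        e->stamp     = ++tick;
}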

Comment 14 Niels de Vos 2013-03-18 16:03:29 UTC
On RHEL (and clones), ext3 and ext4 only recently started to use the full 64 bits of d_off in struct dirent. This caused several issues for NFS clients that were not prepared to deal with that. Disabling the dir_index feature of the filesystem prevents ext3/4 from using 64-bit d_off values.

This workaround may help in this case too:

# tune2fs -O ^dir_index /dev/brick_device
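
To check whether dir_index is currently enabled on a brick before toggling it:

# tune2fs -l /dev/brick_device | grep -i features

(dir_index will be listed among the filesystem features if it is active.)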

Comment 15 Vijay Bellur 2013-03-25 14:28:27 UTC
REVIEW: http://review.gluster.org/4675 (dht: fix readdir traversal across subvols, even for ext4) posted (#3) for review on master by Jeff Darcy (jdarcy)

Comment 16 Vijay Bellur 2013-03-28 22:17:37 UTC
REVIEW: http://review.gluster.org/4711 (dht: improve transform/detransform of d_off (and be ext4 safe)) posted (#4) for review on master by Anand Avati (avati)

Comment 17 Vijay Bellur 2013-03-29 03:18:50 UTC
REVIEW: http://review.gluster.org/4711 (dht: improve transform/detransform of d_off (and be ext4 safe)) posted (#5) for review on master by Jeff Darcy (jdarcy)

Comment 18 Vijay Bellur 2013-04-01 00:16:00 UTC
REVIEW: http://review.gluster.org/4711 (dht: improve transform/detransform of d_off (and be ext4 safe)) posted (#6) for review on master by Anand Avati (avati)

Comment 19 Vijay Bellur 2013-04-01 20:13:24 UTC
COMMIT: http://review.gluster.org/4711 committed in master by Anand Avati (avati) 
------
commit e0616e9314c8323dc59fca7cad6972f08d72b936
Author: Anand Avati <avati>
Date:   Sun Mar 17 03:32:44 2013 -0700

    dht: improve transform/detransform of d_off (and be ext4 safe)
    
    The scheme to encode brick d_off and brick id into global d_off has
    two approaches. Since both brick d_off and global d_off are both 64-bit
    wide, we need to be careful about how the brick id is encoded.
    
    Filesystems like XFS always give a d_off which fits within 32bits. So
    we have another 32bits (actually 31, in this scheme, as seen ahead) to
    encode the brick id - which is typically plenty.
    
    Filesystems like recent EXT4 utilize up to 63 of the low bits in d_off,
    as the d_off is calculated based on a hash function value. This leaves
    us no "unused" bits to encode the brick id.
    
    However, both these filesystems (EXT4 more importantly) are "tolerant" in
    terms of the accuracy of the value presented back in seekdir(), i.e., a
    seekdir(val) actually seeks to the entry which has the "closest" true
    offset.
    
    This "two-prong" scheme exploits this behavior - which seems to be the
    best middle ground amongst various approaches and has all the advantages
    of the old approach:
    
    - Works against XFS and EXT4, the two most common filesystems out there.
      (which wasn't an "advantage" of the old approach as it is broken against
       EXT4)
    
    - Probably works against most of the others as well. The ones which would
      NOT work are those which return HUGE d_offs _and_ are NOT tolerant to
      seekdir() to "closest" true offset.
    
    - Nothing to "remember in memory" or evict "old entries".
    
    - Works fine across NFS server reboots and also NFS head failover.
    
    - Tolerant to seekdir() to arbitrary locations.
    
    Algorithm:
    
    Each d_off can be encoded in either of the two schemes. There is no
    requirement to encode all d_offs of a directory or a reply-set in
    the same scheme.
    
    The topmost bit of the 64 bits is used to specify the "type" of encoding
    of this particular d_off. If the topmost bit (bit-63) is 1, it indicates
    that the encoding scheme holds a HUGE d_off. If the topmost bit is 0,
    it indicates that the "small" d_off encoding scheme is used.
    
    The goal of the "small" d_off encoding is to stay as dense as possible
    towards the lower bits even in the global d_off.
    
    The goal of the HUGE d_off encoding is to stay as accurate (close) as
    possible to the "true" d_off after a round of encoding and decoding.
    
    If DHT has N subvolumes, we need ROOF(Log2(N)) "bits" to encode the brick
    ID (call it "n").
    
    SMALL d_off
    ===========
    
    Encoding
    --------
        If the top n + 1 bits are free in a brick offset, then we leave the
    top bit as 0 and set the remaining bits based on the old formula:
    
       hi_mask = 0xffffffffffffffff
    
       hi_mask = ~(hi_mask >> (n + 1))
    
       if ((hi_mask & d_off_brick) != 0)
           do_large_d_off_encoding ()
    
       d_off_global = (d_off_brick * N) + brick_id
    
    Decoding
    --------
        If the top bit in the global offset is 0, it indicates that this
    is the encoding formula used. So decoding such a global offset will
    be like the old formula:
    
       if ((d_off_global & 0x8000000000000000) != 0)
          do_large_d_off_decoding()
    
       d_off_brick = (d_off_global / N)
    
       brick_id = d_off_global % N
    
    HUGE d_off
    ==========
    
    Encoding
    --------
       If the top n + 1 bits are NOT free in a given brick offset, then we
    set the top bit as 1 in the global offset. The low n bits are replaced
    by brick_id.
    
        low_mask = 0xffffffffffffffff << n   // where n is ROOF(Log2(N))
    
        d_off_global = (0x8000000000000000 | d_off_brick & low_mask) + brick_id
    
        if (d_off_global == 0xffffffffffffffff)
            discard_entry();
    
    Decoding
    --------
        If the top bit in the global offset is set 1, it indicates that
    the encoding formula used is above. So decoding would look like:
    
        hi_mask = (0xffffffffffffffff << n)
        low_mask = ~(hi_mask)
    
        d_off_brick = (global_d_off & hi_mask & 0x7fffffffffffffff)
    
        brick_id = global_d_off & low_mask
    
        If "losing" the low n bits in this decoding of d_off_brick looks
    "scary", we need to realize that till recently EXT4 used to only
    return what can now be expressed as (d_off_global >> 32). The extra
    31 bits of hash added by EXT recently, only decreases the probability
    of a collision, and not eliminate it completely, anyways. In a way,
    the "lost" n bits are made up by decreasing the probability of
    collision by sharding the files into N bricks / EXT directories
        -- call it "hash hedging", if you will :-)
    
    Thanks-to: Zach Brown <zab>
    Change-Id: Ieba9a7071829d51860b7c131982f12e0136b9855
    BUG: 838784
    Signed-off-by: Anand Avati <avati>
    Reviewed-on: http://review.gluster.org/4711
    Reviewed-by: Jeff Darcy <jdarcy>
    Tested-by: Gluster Build System <jenkins.com>
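
For readers following the algorithm, here is a self-contained sketch of the two-prong scheme described in the commit message (illustrative names only; this is not the actual dht_itransform()/dht_deitransform() code, and N, brick ids, and sample offsets are invented):

/* two_prong.c: SMALL/HUGE d_off encoding as described above */
#include <assert.h>
#include <stdint.h>
#include <stdio.h>

#define TOP_BIT 0x8000000000000000ULL

static int bits_for(uint64_t n_subvols)        /* ROOF(Log2(N)) */
{
    int n = 0;
    while ((1ULL << n) < n_subvols)
        n++;
    return n;
}

/* returns 0 on success, -1 if the entry must be discarded */
static int doff_encode(uint64_t d_off_brick, uint64_t N,
                       uint64_t brick_id, uint64_t *d_off_global)
{
    int      n       = bits_for(N);
    uint64_t hi_mask = ~(~0ULL >> (n + 1));

    if ((d_off_brick & hi_mask) == 0) {
        /* SMALL: top n+1 bits free, dense old-style encoding */
        *d_off_global = d_off_brick * N + brick_id;
    } else {
        /* HUGE: top bit set, low n bits replaced by brick_id */
        uint64_t low_mask = ~0ULL << n;
        *d_off_global = (TOP_BIT | (d_off_brick & low_mask)) + brick_id;
        if (*d_off_global == ~0ULL)
            return -1;                  /* reserved value: discard */
    }
    return 0;
}

static void doff_decode(uint64_t d_off_global, uint64_t N,
                        uint64_t *d_off_brick, uint64_t *brick_id)
{
    int n = bits_for(N);

    if (d_off_global & TOP_BIT) {       /* HUGE */
        uint64_t hi_mask = ~0ULL << n;
        *d_off_brick = d_off_global & hi_mask & ~TOP_BIT;
        *brick_id    = d_off_global & ~hi_mask;
    } else {                            /* SMALL */
        *d_off_brick = d_off_global / N;
        *brick_id    = d_off_global % N;
    }
}

int main(void)
{
    uint64_t g, off, id;

    /* XFS-style small offset: round-trips exactly */
    assert(doff_encode(13, 4, 2, &g) == 0);
    doff_decode(g, 4, &off, &id);
    printf("small: 13 -> %llu -> %llu (brick %llu)\n",
           (unsigned long long)g, (unsigned long long)off,
           (unsigned long long)id);

    /* ext4-style huge offset: decoded value is close, not exact;
     * ext4's tolerant seekdir() absorbs the lost low n bits */
    assert(doff_encode(0x4aab9dcd348b97ffULL, 4, 2, &g) == 0);
    doff_decode(g, 4, &off, &id);
    printf("huge:  %llx -> %llx -> %llx (brick %llu)\n",
           0x4aab9dcd348b97ffULL, (unsigned long long)g,
           (unsigned long long)off, (unsigned long long)id);
    return 0;
}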

Comment 20 Anand Avati 2013-04-10 05:56:27 UTC
REVIEW: http://review.gluster.org/4799 (dht: improve transform/detransform of d_off (and be ext4 safe)) posted (#1) for review on release-3.4 by Shishir Gowda (sgowda)

Comment 21 Anand Avati 2013-04-11 17:41:11 UTC
COMMIT: http://review.gluster.org/4799 committed in release-3.4 by Anand Avati (avati) 
------
commit 432ce7e60fb58d3cbb019ab3159b7d393afaaed6
Author: Niels de Vos <ndevos>
Date:   Wed Apr 10 17:51:37 2013 +0200

    build: really disable fusermount if you say so
    
    There is no logic in configure.ac that provides a $disable_fusermount
    variable. So, use the $enable_fusermount variable instead.
    
    Follow-up-for: http://review.gluster.org/4773
    Change-Id: I81cdbd0045409d0036438d542ca6dc1934f784e4
    BUG: 948205
    Signed-off-by: Niels de Vos <ndevos>
    Reviewed-on: http://review.gluster.org/4803
    Reviewed-by: Kaleb KEITHLEY <kkeithle>
    Reviewed-by: Sachidananda Urs <sacchi>
    Tested-by: Gluster Build System <jenkins.com>

Comment 22 Anand Avati 2013-04-12 07:38:08 UTC
The previous comment quotes an unrelated patch because of a bug in the change-merged hook script in Gerrit. The hook script has been fixed now.

Comment 23 Anand Avati 2013-04-14 06:32:43 UTC
REVIEW: http://review.gluster.org/4822 (dht: improve transform/detransform of d_off (and be ext4 safe)) posted (#1) for review on release-3.3 by Vijay Bellur (vbellur)

Comment 24 Anand Avati 2013-04-16 17:36:26 UTC
COMMIT: http://review.gluster.org/4822 committed in release-3.3 by Vijay Bellur (vbellur) 
------
commit 490b791f44135db72cba1d8df9b40a66b457bff2
Author: shishir gowda <sgowda>
Date:   Wed Apr 10 10:42:25 2013 +0530

    dht: improve transform/detransform of d_off (and be ext4 safe)
    
    Backporting Avati's fix http://review.gluster.org/4711
    
    [Commit message body otherwise identical to the master commit quoted in
    comment 19.]
    
    Change-Id: I9551c581c3f3d4c9e719764881036d554f60c557
    Thanks-to: Zach Brown <zab>
    BUG: 838784
    Signed-off-by: shishir gowda <sgowda>
    Reviewed-on: http://review.gluster.org/4799
    Reviewed-by: Amar Tumballi <amarts>
    Reviewed-by: Jeff Darcy <jdarcy>
    Tested-by: Gluster Build System <jenkins.com>
    Reviewed-on: http://review.gluster.org/4822

Comment 25 Ian Wienand 2013-04-25 23:59:59 UTC
*** Bug 952975 has been marked as a duplicate of this bug. ***

Comment 27 Jeff Byers 2015-02-03 22:08:20 UTC
For days now I've been fighting a problem in 3.6.2 that seems
related to this one.

What I've been seeing: in a Volume Type="Distributed-
Replicate" with Number of Bricks="2 x 2 = 4" across two Gluster
nodes, file counts from find, 'ls', etc. differed depending on
which GlusterFS cluster node I accessed the volume through.
This occurred with both FUSE and NFS mounts, and I would
consistently see the same file, with the same stat data and
inode, appear multiple times via one path and only once via
the other. The numbers returned were stable, but they were
different for each "path"!

I believe this is also the cause of problems seen when
deleting a directory tree from Windows Explorer, where it
complained about being unable to delete files that were
already deleted, and/or about the directory not being empty,
so I would have to retry.

Of all the things I've tried to work around this, the *only*
one that helped was, on all of the segments, disabling the
ext4 file-system option 'dir_index', running 'e4fsck -D', and
then remounting. With this, both the FUSE and NFS mounts,
connecting through either node, now yield exactly the same
number of files.
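
For anyone else trying this, the sequence was roughly as follows (device
and mount-point paths are placeholders; the brick has to be offline
while its directories are rebuilt):

# umount /bricks/brick1                      # take the brick offline
# tune2fs -O ^dir_index /dev/brick_device    # stop using hashed directories
# e4fsck -fD /dev/brick_device               # -D rebuilds/optimizes directories
# mount /dev/brick_device /bricks/brick1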

My understanding though was that with this fix, it was not
necessary to disable 'dir_index', and that disabling ext4
'dir_index' was not a real fix, more of a statistical work-
around, and that it may have some performance impact on large
directories.

Is the GlusterFS fix discussed above effective in GlusterFS
3.6.2? Should I still be seeing problems with duplicate and
missing files on directory reads?