Description of problem:

GlusterFS on top of upstream ext4 is broken since ext4 commit d1f5273e9adb40724a85272f248f210dc4ce919a (i.e. v3.3-rc2-2-gd1f5273). Ext4 now returns 64-bit d_off values in struct dirent, and when DHT applies dht_itransform() to those values, the result overflows 64 bits (wraps past UINT64_MAX) for almost every file. The truncated readdir cookies then get used for the following readdir, and ext4 seeks back to offset 0 for unknown cookies. This ends up in an infinite readdir loop.
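For illustration, a minimal sketch of the failing arithmetic (the subvolume count, brick id, and 63-bit offset below are made up; dht_itransform() itself is not reproduced here):

    #include <stdint.h>
    #include <stdio.h>

    int main(void)
    {
        uint64_t d_off_ext4 = 0x7FCB4DA4E9C5EFFFULL; /* hypothetical 63-bit ext4 hash offset */
        uint64_t nsubvols   = 4;                     /* hypothetical DHT subvolume count */
        uint64_t brick_id   = 2;

        /* old-style transform: global = local * N + brick_id */
        uint64_t d_off_global = d_off_ext4 * nsubvols + brick_id;

        /* the multiplication wraps modulo 2^64; decoding no longer recovers
         * the original (offset, brick) pair, so readdir resumes at the wrong
         * place (ext4 falls back to offset 0 for unknown cookies) */
        printf("local : %#018llx\n", (unsigned long long)d_off_ext4);
        printf("global: %#018llx\n", (unsigned long long)d_off_global);
        printf("decoded back: %#018llx (%s)\n",
               (unsigned long long)(d_off_global / nsubvols),
               d_off_global / nsubvols == d_off_ext4 ? "intact" : "corrupted");
        return 0;
    }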
I'm seeing this behavior on 2.6.32-279.el6.x86_64, default CentOS 6.3 install (CentOS release 6.3 (Final)). Created a distributed-replicate volume then tried doing an add-brick, which resulted in this behavior. Details are available (including log files) here if needed: http://fpaste.org/ymq6/

A sample test program was nicely suggested by the guys in #gluster which tests for this (http://pastie.org/4412258). My results indicate this ext4 issue is definitely present on this kernel:

[root@njtest115 ~]# ./testext4 .
.viminfo: 336304820918345855 (336304820918345855)
testext4: 772176040014408130 (772176040014408130)
.: 1204328800623893744 (1204328800623893744)
.bash_history: 1486019236749960094 (1486019236749960094)

Whereas an XFS filesystem seems to indicate it works OK:

[root@njtest115 ~]# ./testext4 /srv/gluster/brick1/
.: 4 (4)
..: 6 (6)
.glusterfs: 9 (9)
tmp.tWzGg1lJCG: 13 (13)
tmp.3hQLDjVnHe: 17 (17)
tmp.G6MO4oTGxA: 21 (21)
*** Bug 846723 has been marked as a duplicate of this bug. ***
Created attachment 603057 [details]
Test to find out if a system is affected

This program was written by Avati in response to the first report of this problem in #gluster IRC. Compile and run it on a glusterfs server:

./a.out <brick-dir>

It will report the d_off values for all items in <brick-dir> (not recursively). If those values are 32-bit, the system is not affected by this bug. If those values are 64-bit, then the system is affected. Please feel free to comment on this bug with your distribution release version and kernel version so that others can find out which systems are known to be affected.
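The attachment isn't reproduced here, but a minimal equivalent sketch (a reconstruction, not the attached source) looks like this:

    #define _GNU_SOURCE
    #include <dirent.h>
    #include <inttypes.h>
    #include <stdint.h>
    #include <stdio.h>

    int main(int argc, char *argv[])
    {
        if (argc != 2) {
            fprintf(stderr, "usage: %s <brick-dir>\n", argv[0]);
            return 1;
        }

        DIR *dir = opendir(argv[1]);
        if (!dir) {
            perror("opendir");
            return 1;
        }

        struct dirent *ent;
        while ((ent = readdir(dir)) != NULL) {
            /* a d_off above UINT32_MAX means the filesystem hands out
             * 64-bit directory offsets, i.e. the system is affected */
            printf("%s: %" PRIu64 " (%s)\n", ent->d_name,
                   (uint64_t)ent->d_off,
                   (uint64_t)ent->d_off > UINT32_MAX ? "64-bit" : "32-bit");
        }

        closedir(dir);
        return 0;
    }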
User NuxRo on #gluster IRC also reports having this problem on a CentOS system with a 2.6 kernel. The reported OS/kernel version is "centos 6.3, 2.6.32-279.2.1.el6.x86_64".
This issue was backported into the el6 kernel on:

* Tue Apr 24 2012 Jarod Wilson <jarod> [2.6.32-268.el6]
- [fs] ext4: return 32/64-bit dir name hash according to usage type (J. Bruce Fields) [813070]

So, as a workaround, anyone with a newer kernel can roll back to 2.6.32-267.el6 until the fix is released.
Patches have been posted and are currently pending review:
- http://review.gluster.com/3679
- http://review.gluster.com/3756
I have trouble interpreting the results correctly.

root@foo:~# ./test /data/brick0/instances/_base/
496e4306f1de9b647fc0a6021b35fd5428e88c70_10: 495888329 (495888329)
..: 616848399 (616848399)
.: 723397576 (723397576)
fc9ab12b1551f9c358b5acd65e8fcbe197c89f6c: 752979281 (752979281)
ephemeral_0_40_None: 995810121 (995810121)
496e4306f1de9b647fc0a6021b35fd5428e88c70: 2147483647 (2147483647)

Does this output mean that our kernel already has that patch? We use kernel 3.2.0-26-generic on Ubuntu 12.04.
Christian,

That output means that your kernel does *not* have the change to ext4, so it is safe to use glusterfs with ext4 bricks. However, XFS is generally recommended over ext4 for glusterfs, so you may still want to use XFS even though ext4 would work on your system.
Any changes to this? What is the course of action we are taking? Do we just tell users not to use certain kernels? What do we recommend?
FWIW, I refreshed the patch and ran it on a Fedora 17 machine that does seem to have the relevant version of ext4. Listing a 10K-file directory does seem to work using our native protocol, though the d_off values are different than what I see locally. Using NFS we do still get into the infinite loop.
Hi Jeff,

The current fix is only for FUSE clients. We save the overflow bits (from the transformation) and the previously returned offset in the fd_ctx. On a subsequent readdir call with that previous offset, we use the saved overflow bits to re-transform. In NFS, we cannot use the fd_ctx, as it uses anonymous fds. The plan was to pass the overflow bits and offset in the cookie of the readdir response. Our understanding was that the cookie would be returned in a subsequent readdir call, which would let us re-transform. But we notice NFS clients listing duplicate entries.
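A rough sketch of what that fd_ctx approach amounts to (struct and function names here are illustrative, not the actual dht code):

    #include <stdint.h>

    /* Illustrative only: the per-fd state the fuse-side fix keeps around. */
    struct dht_dir_ctx {
        uint64_t prev_off;       /* the last d_off returned to the client */
        uint64_t overflow_bits;  /* high bits lost during transformation  */
    };

    /* If the client resumes readdir with exactly the offset we returned
     * last time, the saved bits can be OR-ed back in; anonymous NFS fds
     * carry no such context between calls, which is why this fails for NFS. */
    static uint64_t
    dht_retransform(const struct dht_dir_ctx *ctx, uint64_t off,
                    uint64_t nsubvols)
    {
        uint64_t local = off / nsubvols;   /* old-style decode */

        if (off == ctx->prev_off)
            local |= ctx->overflow_bits;   /* restore the dropped bits */

        return local;
    }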
As far as I can tell, the NFS client doesn't return anything useful between calls except for the readdir cookie, which is no bigger than that used by the local FS. There was some talk at one point of using the RPC verifier, but I really don't see how that is supposed to work. The only approach that seems viable at this point is to have the client maintain an external mapping of inode plus user cookie to subvolume plus local-FS cookie. The question then becomes how to populate such maps, and subsequently depopulate (i.e. garbage-collect) them when they're no longer likely to be as useful as the space they consume. *sigh* It's going to be yet another ugly but necessary bit of code festering in our source tree forever just because of other people's lameness. Thanks, NFS and ext4 developers.
After lengthy discussion and lengthier coding, I've created a patch which uses the mapping approach described above. http://review.gluster.org/#change,4675 Specifically, this uses a fixed-size cache with approximate LRU to map from GFID and user-visible d_off to a subvolume and local-FS d_off. It seems to work around the problem in some basic tests, so it's a start, but a lot more work and discussion will be necessary before we'll be ready for prime time.
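A sketch of the general shape of such a cache (illustrative names and sizes only; the real code is at the review link above):

    #include <stdint.h>
    #include <string.h>

    /* Maps (directory GFID, user-visible d_off) to the subvolume and
     * local-FS d_off needed to resume the walk, fixed size with
     * approximate LRU eviction. */
    #define DOFF_CACHE_SLOTS 1024
    #define PROBE_WINDOW     8

    struct doff_map {
        uint8_t  gfid[16];     /* which directory                    */
        uint64_t user_doff;    /* cookie handed to the client        */
        int      subvol;       /* brick the walk had reached         */
        uint64_t local_doff;   /* that brick filesystem's own offset */
        uint64_t stamp;        /* coarse timestamp for LRU           */
    };

    static struct doff_map cache[DOFF_CACHE_SLOTS];
    static uint64_t clock_tick;

    static void doff_cache_put(const uint8_t gfid[16], uint64_t user_doff,
                               int subvol, uint64_t local_doff)
    {
        unsigned base = (unsigned)(user_doff % DOFF_CACHE_SLOTS), victim = base;

        /* approximate LRU: evict the stalest entry in a small probe window */
        for (unsigned i = 1; i < PROBE_WINDOW; i++) {
            unsigned s = (base + i) % DOFF_CACHE_SLOTS;
            if (cache[s].stamp < cache[victim].stamp)
                victim = s;
        }
        memcpy(cache[victim].gfid, gfid, 16);
        cache[victim].user_doff  = user_doff;
        cache[victim].subvol     = subvol;
        cache[victim].local_doff = local_doff;
        cache[victim].stamp      = ++clock_tick;
    }

    static struct doff_map *doff_cache_get(const uint8_t gfid[16],
                                           uint64_t user_doff)
    {
        unsigned base = (unsigned)(user_doff % DOFF_CACHE_SLOTS);

        for (unsigned i = 0; i < PROBE_WINDOW; i++) {
            struct doff_map *m = &cache[(base + i) % DOFF_CACHE_SLOTS];
            if (m->user_doff == user_doff && !memcmp(m->gfid, gfid, 16)) {
                m->stamp = ++clock_tick;  /* touch on hit */
                return m;
            }
        }
        return NULL;  /* miss: fall back to rewinding the directory */
    }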
On RHEL (and its clones), ext3 and ext4 only recently started to use the whole 64 bits of d_off in struct dirent. This caused several issues for NFS clients that were not prepared to deal with that. Disabling the dir_index feature of the filesystem prevents ext3/4 from using 64 bits in d_off. This workaround may help in this case too:

# tune2fs -O ^dir_index /dev/brick_device
REVIEW: http://review.gluster.org/4675 (dht: fix readdir traversal across subvols, even for ext4) posted (#3) for review on master by Jeff Darcy (jdarcy)
REVIEW: http://review.gluster.org/4711 (dht: improve transform/detransform of d_off (and be ext4 safe)) posted (#4) for review on master by Anand Avati (avati)
REVIEW: http://review.gluster.org/4711 (dht: improve transform/detransform of d_off (and be ext4 safe)) posted (#5) for review on master by Jeff Darcy (jdarcy)
REVIEW: http://review.gluster.org/4711 (dht: improve transform/detransform of d_off (and be ext4 safe)) posted (#6) for review on master by Anand Avati (avati)
COMMIT: http://review.gluster.org/4711 committed in master by Anand Avati (avati)
------
commit e0616e9314c8323dc59fca7cad6972f08d72b936
Author: Anand Avati <avati>
Date: Sun Mar 17 03:32:44 2013 -0700

    dht: improve transform/detransform of d_off (and be ext4 safe)

    The scheme to encode brick d_off and brick id into a global d_off has
    two approaches. Since brick d_off and global d_off are both 64 bits
    wide, we need to be careful about how the brick id is encoded.

    Filesystems like XFS always give a d_off which fits within 32 bits. So
    we have another 32 bits (actually 31, in this scheme, as seen ahead) to
    encode the brick id - which is typically plenty.

    Filesystems like recent EXT4 utilize up to the 63 low bits in d_off, as
    the d_off is calculated from a hash function value. This leaves us no
    "unused" bits to encode the brick id. However, both these filesystems
    (EXT4 more importantly) are "tolerant" about the accuracy of the value
    presented back in seekdir(), i.e. a seekdir(val) actually seeks to the
    entry which has the "closest" true offset.

    This "two-prong" scheme exploits that behavior - which seems to be the
    best middle ground amongst various approaches - and has all the
    advantages of the old approach:

    - Works against XFS and EXT4, the two most common filesystems out
      there (which wasn't an "advantage" of the old approach, as it is
      broken against EXT4).
    - Probably works against most of the others as well. The ones which
      would NOT work are those which return HUGE d_offs _and_ are NOT
      tolerant to seekdir() to the "closest" true offset.
    - Nothing to "remember in memory" or evict as "old entries".
    - Works fine across NFS server reboots and also NFS head failover.
    - Tolerant to seekdir() to arbitrary locations.

    Algorithm:

    Each d_off can be encoded in either of the two schemes. There is no
    requirement to encode all d_offs of a directory or a reply-set in the
    same scheme. The topmost bit of the 64 bits is used to specify the
    "type" of encoding of this particular d_off. If the topmost bit
    (bit 63) is 1, it indicates that the encoding holds a HUGE d_off. If
    the topmost bit is 0, it indicates that the "small" d_off encoding is
    used.

    The goal of the "small" d_off encoding is to stay as dense as possible
    towards the lower bits even in the global d_off. The goal of the HUGE
    d_off encoding is to stay as accurate (close) as possible to the "true"
    d_off after a round of encoding and decoding.

    If DHT has N subvolumes, we need ROOF(Log2(N)) bits to encode the
    brick id (call it "n").

    SMALL d_off
    ===========

    Encoding
    --------

    If the top n + 1 bits are free in a brick offset, then we leave the
    top bit as 0 and set the remaining bits based on the old formula:

      hi_mask = 0xffffffffffffffff
      hi_mask = ~(hi_mask >> (n + 1))

      if ((hi_mask & d_off_brick) != 0)
              do_large_d_off_encoding ()

      d_off_global = (d_off_brick * N) + brick_id

    Decoding
    --------

    If the top bit in the global offset is 0, it indicates that this is
    the encoding formula used. So decoding such a global offset follows
    the old formula:

      if ((d_off_global & 0x8000000000000000) != 0)
              do_large_d_off_decoding ()

      brick_id = d_off_global % N
      d_off_brick = d_off_global / N

    HUGE d_off
    ==========

    Encoding
    --------

    If the top n + 1 bits are NOT free in a given brick offset, then we
    set the top bit to 1 in the global offset. The low n bits are replaced
    by the brick id:

      hi_mask = 0xffffffffffffffff << n   // where n is ROOF(Log2(N))

      d_off_global = (0x8000000000000000 | (d_off_brick & hi_mask)) + brick_id

      if (d_off_global == 0xffffffffffffffff)
              discard_entry ();

    Decoding
    --------

    If the top bit in the global offset is 1, it indicates that the
    encoding formula above was used. So decoding looks like:

      hi_mask = 0xffffffffffffffff << n
      low_mask = ~hi_mask

      d_off_brick = d_off_global & hi_mask & 0x7fffffffffffffff
      brick_id = d_off_global & low_mask

    If "losing" the low n bits in this decoding of d_off_brick looks
    "scary", we need to realize that until recently EXT4 used to return
    only what can now be expressed as (d_off_global >> 32). The extra 31
    bits of hash added by EXT4 recently only decrease the probability of a
    collision, and do not eliminate it completely anyway. In a way, the
    "lost" n bits are made up for by decreasing the probability of
    collision by sharding the files into N bricks / EXT directories --
    call it "hash hedging", if you will :-)

    Thanks-to: Zach Brown <zab>
    Change-Id: Ieba9a7071829d51860b7c131982f12e0136b9855
    BUG: 838784
    Signed-off-by: Anand Avati <avati>
    Reviewed-on: http://review.gluster.org/4711
    Reviewed-by: Jeff Darcy <jdarcy>
    Tested-by: Gluster Build System <jenkins.com>
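For concreteness, here is a compilable sketch of the two-prong scheme exactly as described in the commit message above (the helper names and example values are mine, not the dht source):

    #include <stdint.h>
    #include <stdio.h>

    #define TOP_BIT 0x8000000000000000ULL

    /* n = ROOF(Log2(N)): bits needed to hold a brick id */
    static int bits_for(uint64_t nsubvols)
    {
        int n = 0;
        while ((1ULL << n) < nsubvols)
            n++;
        return n;
    }

    /* Returns 0 on success, -1 if the entry must be discarded. */
    static int doff_encode(uint64_t brick_off, uint64_t brick_id,
                           uint64_t nsubvols, uint64_t *global_off)
    {
        int      n       = bits_for(nsubvols);
        uint64_t hi_mask = ~(UINT64_MAX >> (n + 1));  /* top n+1 bits */

        if (!(brick_off & hi_mask)) {
            /* SMALL: top bit stays 0, dense multiply/add encoding */
            *global_off = brick_off * nsubvols + brick_id;
        } else {
            /* HUGE: keep the high bits of the true offset, put the
             * brick id into the low n bits, set the type bit */
            *global_off = (TOP_BIT | (brick_off & (UINT64_MAX << n))) + brick_id;
            if (*global_off == UINT64_MAX)
                return -1;  /* would collide with the EOF cookie */
        }
        return 0;
    }

    static void doff_decode(uint64_t global_off, uint64_t nsubvols,
                            uint64_t *brick_off, uint64_t *brick_id)
    {
        int n = bits_for(nsubvols);

        if (!(global_off & TOP_BIT)) {          /* SMALL */
            *brick_id  = global_off % nsubvols;
            *brick_off = global_off / nsubvols;
        } else {                                /* HUGE */
            /* the low n bits of the true offset are lost; ext4 seeks
             * to the closest entry, so the walk still resumes correctly */
            *brick_off = global_off & (UINT64_MAX << n) & ~TOP_BIT;
            *brick_id  = global_off & ~(UINT64_MAX << n);
        }
    }

    int main(void)
    {
        uint64_t g, off, id;

        doff_encode(21, 2, 4, &g);                     /* XFS-style small d_off */
        doff_decode(g, 4, &off, &id);
        printf("small: global=%llu -> off=%llu brick=%llu\n",
               (unsigned long long)g, (unsigned long long)off,
               (unsigned long long)id);

        doff_encode(0x7FCB4DA4E9C5EFFFULL, 2, 4, &g);  /* ext4-style huge d_off */
        doff_decode(g, 4, &off, &id);
        printf("huge:  global=%#llx -> off=%#llx brick=%llu\n",
               (unsigned long long)g, (unsigned long long)off,
               (unsigned long long)id);
        return 0;
    }

Note that in the HUGE path the decoded offset is at worst a few entries early, which the tolerant seekdir() behavior described above absorbs.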
REVIEW: http://review.gluster.org/4799 (dht: improve transform/detransform of d_off (and be ext4 safe)) posted (#1) for review on release-3.4 by Shishir Gowda (sgowda)
COMMIT: http://review.gluster.org/4799 committed in release-3.4 by Anand Avati (avati)
------
commit 432ce7e60fb58d3cbb019ab3159b7d393afaaed6
Author: Niels de Vos <ndevos>
Date: Wed Apr 10 17:51:37 2013 +0200

    build: really disable fusermount if you say so

    There is no logic in configure.ac that provides a $disable_fusermount
    variable. So, use the $enable_fusermount variable instead.

    Follow-up-for: http://review.gluster.org/4773
    Change-Id: I81cdbd0045409d0036438d542ca6dc1934f784e4
    BUG: 948205
    Signed-off-by: Niels de Vos <ndevos>
    Reviewed-on: http://review.gluster.org/4803
    Reviewed-by: Kaleb KEITHLEY <kkeithle>
    Reviewed-by: Sachidananda Urs <sacchi>
    Tested-by: Gluster Build System <jenkins.com>
The previous comment about an unrelated patch is because of a bug in the change-merged hook script in Gerrit. Fixed the hook script now.
REVIEW: http://review.gluster.org/4822 (dht: improve transform/detransform of d_off (and be ext4 safe)) posted (#1) for review on release-3.3 by Vijay Bellur (vbellur)
COMMIT: http://review.gluster.org/4822 committed in release-3.3 by Vijay Bellur (vbellur)
------
commit 490b791f44135db72cba1d8df9b40a66b457bff2
Author: shishir gowda <sgowda>
Date: Wed Apr 10 10:42:25 2013 +0530

    dht: improve transform/detransform of d_off (and be ext4 safe)

    Backporting Avati's fix http://review.gluster.org/4711. (The commit
    message body is identical to the master commit quoted above.)

    Change-Id: I9551c581c3f3d4c9e719764881036d554f60c557
    Thanks-to: Zach Brown <zab>
    BUG: 838784
    Signed-off-by: shishir gowda <sgowda>
    Reviewed-on: http://review.gluster.org/4799
    Reviewed-by: Amar Tumballi <amarts>
    Reviewed-by: Jeff Darcy <jdarcy>
    Tested-by: Gluster Build System <jenkins.com>
    Reviewed-on: http://review.gluster.org/4822
*** Bug 952975 has been marked as a duplicate of this bug. ***
I've been fighting a problem that seems to be related to this one in 3.6.2 for days now. In a volume with Type="Distributed-Replicate" and Number of Bricks="2 x 2 = 4" using two Gluster nodes, file counts from find, ls, etc. would differ depending on which GlusterFS cluster node I was accessing through. This occurred with both FUSE and NFS mounts, and I would consistently see the same file, with the same stat data and inode, appear multiple times via one path and only once via the other. The numbers returned were stable, but they were different for each "path"!

I believe this is also the cause of the problems seen when deleting a directory tree from Windows Explorer, where it complained about being unable to delete files that were already deleted, and/or the directory not being empty, so I had to retry.

Of all the things I've tried to work around this, the *only* thing that has helped was to disable the ext4 filesystem option 'dir_index' on all of the segments, run 'e4fsck -D', then remount, etc. With this, both the FUSE and NFS mounts, connecting to either node, now yield exactly the same number of files.

My understanding, though, was that with this fix it should not be necessary to disable 'dir_index', and that disabling ext4 'dir_index' is not a real fix, more of a statistical workaround, and that it may have some performance impact on large directories. Is the GlusterFS fix discussed above effective in GlusterFS 3.6.2? Should I be seeing problems with duplicate files and missing files on directory reads?