Description of problem:

When a process attempts to access certain files or directories through the GlusterFS native FUSE client, that process hangs: it cannot be interrupted, and killing it with signal 9 (SIGKILL) may leave it in <defunct> state and make it a direct child of the 'init' process. This "zombie" process then keeps the target file or directory open, so the corresponding mount point cannot be unmounted because it is "busy"; the only way to restore everything back to normal is to reboot the client.

The hang happens, for example (and specifically), when running a 'find' command over the entire GlusterFS file system in order to trigger the canonical self-healing on a troublesome volume: the 'find' command hangs, and if a new 'find' command is issued, it too hangs in the exact same spot. This indicates that this is not some random timing or deadlock problem; there really is something "special" about certain target files and directories that causes the problem. If all copies of a troublesome file or directory are manually removed from every GlusterFS brick, the 'find' command advances past that point, only to hang later at some other troublesome file or directory.

The 'ps axl' status of a hung 'find' process iterating through GlusterFS volume '/mnt/volume' looks like this (with irrelevant processes removed from the list):

F   UID   PID  PPID PRI  NI    VSZ  RSS WCHAN  STAT TTY        TIME COMMAND
4     0  3771     1  20   0 120272 9180 wait_a S    ?          0:18 find /mnt/volume

The corresponding 'lsof -p 3771' output is (with the actual hanging directory path replaced with a/b/c/d/e for clarity):

COMMAND  PID USER  FD  TYPE DEVICE SIZE/OFF NODE NAME
find    3771 root cwd   DIR   0,18    98304 18446744065793965768 /mnt/volume/a/b/c/d/e
find    3771 root rtd   DIR    9,0     4096 2 /
find    3771 root txt   REG    9,0   234512 58195979 /bin/find
find    3771 root mem   REG    9,0 99158752 8652401 /usr/lib/locale/locale-archive
find    3771 root mem   REG    9,0    19536 39321619 /lib64/libdl-2.12.so
find    3771 root mem   REG    9,0   142424 39321637 /lib64/libpthread-2.12.so
find    3771 root mem   REG    9,0  1832712 39321613 /lib64/libc-2.12.so
find    3771 root mem   REG    9,0   122008 39321674 /lib64/libselinux.so.1
find    3771 root mem   REG    9,0   595816 39321621 /lib64/libm-2.12.so
find    3771 root mem   REG    9,0    43840 39321641 /lib64/librt-2.12.so
find    3771 root mem   REG    9,0   148504 39321922 /lib64/ld-2.12.so
find    3771 root  0u   CHR  136,1      0t0 4 /dev/pts/1 (deleted)
find    3771 root  1w   REG    9,0 30834688 46661667 /root/find-2012030403.log
find    3771 root  2u   CHR  136,1      0t0 4 /dev/pts/1 (deleted)
find    3771 root  3r   DIR    9,0     4096 46661633 /root
find    3771 root  4r   DIR    9,0     4096 46661633 /root

Access to the parent directory of the hung directory works normally, but any attempt to access files or directories inside the troublesome directory hangs as well. For example, the following 'ls' commands produce the following results:

[root@hostname ~]# ls /mnt/volume/a/b/c/d
e
(terminates normally, with expected output)

[root@hostname ~]# ls /mnt/volume/a/b/c/d/e
(hangs, no output, can be killed with 'kill -9')

[root@hostname ~]# ls /mnt/volume/a/b/c/d/e/f
(hangs, no output, cannot be killed with 'kill -9')

The corresponding 'ps axl' output for the hung processes looks like this:

26292 pts/3    D+     0:00 ls /mnt/volume/a/b/c/d/e/f
26524 pts/4    D+     0:00 ls /mnt/volume/a/b/c/d/e

After sending signal 9 (SIGKILL) to both processes:

kill -9 26292 26524

the command that was accessing 'e' dies normally, but the one that was accessing 'f' does not.
The corresponding 'ps axl' output is:

26292 pts/3    D+     0:00 ls /mnt/volume/a/b/c/d/e/f

There are also a number of warnings and errors in the corresponding GlusterFS volume logs on the client that was accessing the volume:

/var/log/glusterfs/mnt-volume.log:[2012-03-04 22:50:52.336764] E [afr-self-heal-common.c:1054:afr_sh_common_lookup_resp_handler] 7-volume-replicate-3: path /a/b/c/d/e on subvolume volume-client-6 => -1 (No such file or directory)
/var/log/glusterfs/mnt-volume.log:[2012-03-04 22:50:52.336961] E [afr-self-heal-common.c:1054:afr_sh_common_lookup_resp_handler] 7-volume-replicate-1: path /a/b/c/d/e on subvolume volume-client-2 => -1 (No such file or directory)
/var/log/glusterfs/mnt-volume.log:[2012-03-04 22:50:52.337715] E [afr-self-heal-common.c:1054:afr_sh_common_lookup_resp_handler] 7-volume-replicate-17: path /a/b/c/d/e on subvolume volume-client-34 => -1 (No such file or directory)
/var/log/glusterfs/mnt-volume.log:[2012-03-04 22:50:52.339047] E [afr-self-heal-common.c:1054:afr_sh_common_lookup_resp_handler] 7-volume-replicate-20: path /a/b/c/d/e on subvolume volume-client-40 => -1 (No such file or directory)
/var/log/glusterfs/mnt-volume.log:[2012-03-04 22:50:52.342469] E [afr-self-heal-common.c:1054:afr_sh_common_lookup_resp_handler] 7-volume-replicate-0: path /a/b/c/d/e on subvolume volume-client-0 => -1 (No such file or directory)
/var/log/glusterfs/mnt-volume.log:[2012-03-04 22:50:52.343970] E [afr-self-heal-common.c:1054:afr_sh_common_lookup_resp_handler] 7-volume-replicate-14: path /a/b/c/d/e on subvolume volume-client-28 => -1 (No such file or directory)
/var/log/glusterfs/mnt-volume.log:[2012-03-04 22:50:52.346431] E [afr-self-heal-common.c:1054:afr_sh_common_lookup_resp_handler] 7-volume-replicate-2: path /a/b/c/d/e on subvolume volume-client-4 => -1 (No such file or directory)
/var/log/glusterfs/mnt-volume.log:[2012-03-04 22:50:52.356663] E [afr-self-heal-common.c:1054:afr_sh_common_lookup_resp_handler] 7-volume-replicate-23: path /a/b/c/d/e on subvolume volume-client-46 => -1 (No such file or directory)
/var/log/glusterfs/mnt-volume.log:[2012-03-04 22:50:52.373125] E [afr-self-heal-common.c:1054:afr_sh_common_lookup_resp_handler] 7-volume-replicate-11: path /a/b/c/d/e on subvolume volume-client-22 => -1 (No such file or directory)
/var/log/glusterfs/mnt-volume.log:[2012-03-04 22:50:52.399428] E [afr-self-heal-common.c:1054:afr_sh_common_lookup_resp_handler] 7-volume-replicate-16: path /a/b/c/d/e on subvolume volume-client-33 => -1 (No such file or directory)
/var/log/glusterfs/mnt-volume.log:[2012-03-04 22:50:52.399532] E [afr-self-heal-common.c:1054:afr_sh_common_lookup_resp_handler] 7-volume-replicate-19: path /a/b/c/d/e on subvolume volume-client-39 => -1 (No such file or directory)
/var/log/glusterfs/mnt-volume.log:[2012-03-04 22:50:52.399753] E [afr-self-heal-common.c:1054:afr_sh_common_lookup_resp_handler] 7-volume-replicate-22: path /a/b/c/d/e on subvolume volume-client-45 => -1 (No such file or directory)
/var/log/glusterfs/mnt-volume.log:[2012-03-04 22:50:52.407050] E [afr-self-heal-common.c:1054:afr_sh_common_lookup_resp_handler] 7-volume-replicate-13: path /a/b/c/d/e on subvolume volume-client-27 => -1 (No such file or directory)

Here is the GlusterFS volume info (with confidential site/volume/brick names altered and most of the bricks removed):

[root@hostname ~]# gluster volume info volume

Volume Name: volume
Type: Distributed-Replicate
Status: Started
Number of Bricks: 24 x 2 = 48
Transport-type: tcp
Bricks:
Brick1: r1s8.cluster.site.com:/mnt/data2/gfs/volume
Brick2: r1s9.cluster.site.com:/mnt/data2/gfs/volume
:
Brick47: r1s4.cluster.site.com:/mnt/data9/gfs/volume
Brick48: r1s5.cluster.site.com:/mnt/data9/gfs/volume
Options Reconfigured:
diagnostics.client-log-level: WARNING
diagnostics.brick-log-level: WARNING
performance.quick-read: off

Please note that 'performance.quick-read' has been turned off in an attempt to avoid the hanging issues described in:

https://bugzilla.redhat.com/show_bug.cgi?id=764743

When we inspect the troublesome files and directories through the bricks' native file systems, we do not see anything that distinguishes them from nearby directories and files that work correctly. Here is the 'stat' output from two of the bricks (the directory is present on all 48 bricks, which we assume is normal):

File: `/mnt/data8/gfs/volume/a/b/c/d/e'
Size: 4096   Blocks: 8   IO Block: 4096   directory
Device: 821h/2081d   Inode: 63833568   Links: 8
Access: (0775/drwxrwxr-x)  Uid: (   91/  tomcat)   Gid: (   91/  tomcat)
Access: 2012-03-05 10:28:32.516782891 +0200
Modify: 2012-03-04 22:50:53.618780064 +0200
Change: 2012-03-04 22:50:53.618780064 +0200

File: `/mnt/data9/gfs/volume/a/b/c/d/e'
Size: 4096   Blocks: 16   IO Block: 4096   directory
Device: 831h/2097d   Inode: 16778400   Links: 8
Access: (0775/drwxrwxr-x)  Uid: (   91/  tomcat)   Gid: (   91/  tomcat)
Access: 2012-03-05 10:28:32.516782891 +0200
Modify: 2012-03-04 22:50:52.000000000 +0200
Change: 2012-03-04 23:53:51.838991487 +0200

This particular case involves a troublesome directory, but we have experienced exactly similar situations with individual troublesome files, too.

We are running the GlusterFS volume on servers with the 'CentOS release 6.2' operating system, kernel '2.6.32-220.4.2.el6.x86_64', an 'Intel(R) Core(TM) i7-2600K CPU @ 3.40GHz' CPU, and 16 GB of memory. The GlusterFS version is 3.2.5, release 2.el6, for architecture 'x86_64'. GlusterFS has been installed from Red Hat RPMs. The corresponding 'glusterfs-core-3.2.5-2.el6.x86_64' RPM info header is:

Name        : glusterfs-core                Relocations: (not relocatable)
Version     : 3.2.5                         Vendor: Red Hat, Inc.
Release     : 2.el6                         Build Date: Tue 15 Nov 2011 03:43:32 PM EET
Install Date: Fri 27 Jan 2012 05:13:50 PM EET      Build Host: x86-004.build.bos.redhat.com
Group       : System Environment/Libraries  Source RPM: glusterfs-3.2.5-2.el6.src.rpm
Size        : 7146188                       License: GPLv3+
Signature   : (none)
Packager    : Red Hat, Inc. <http://bugzilla.redhat.com/bugzilla>
URL         : http://www.gluster.org/docs/index.php/GlusterFS

Before filing this bug report we carefully researched earlier bug reports, and found at least the following reports that seem to describe situations similar to what we are now facing:

http://gluster.org/pipermail/gluster-users/2011-May/007580.html
https://bugzilla.redhat.com/show_bug.cgi?id=764743

How reproducible:

The problem occurs every time we try to 'self-heal' (or 'rebalance') our volume, which was damaged earlier in a massive hardware failure. The operation always hangs when it hits the first "troublesome" file or directory.

Steps to Reproduce:

We do not know, of course, exactly how the hardware failure that broke the GlusterFS volume did what it did, or how to reproduce similar damage. We are, however, willing to assist you in any way we can by running any tests and providing any diagnostic information you wish on our damaged GlusterFS volume.
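For reference, the sweep we use to trigger self-heal is the usual full-volume crawl from a client mount point, essentially along these lines (a minimal sketch; our actual run simply redirects the output of a plain 'find /mnt/volume' to a log file, and the exact flags may differ from the canonical form):

# Crawl the whole volume from a client mount so that every entry is looked up
# (and therefore self-healed); the mount point path is our own.
find /mnt/volume -noleaf -print0 | xargs --null stat > /dev/null

Every troublesome entry this crawl touches hangs it in the way described above.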
Actual results:

'find' and any other command that tries to access a specific file or directory in GlusterFS hangs, and the corresponding operation, and anything depending on it, then fails.

Expected results:

Commands should not hang, but simply access and read the file or directory and any related metadata.

Additional info:

These problems appeared after the corresponding cluster suffered a massive hardware failure, which broke several disks and also introduced a 'split-brain' scenario. The faulty hardware was replaced, and the GFID problems caused by the 'split-brain' situation were fixed with the manual procedure described in:

http://gluster.org/pipermail/gluster-users/2011-July/008215.html
https://github.com/vikasgorur/gfid

After these issues were addressed, all attempts to self-heal the GlusterFS volume have failed because the corresponding 'find' command hangs. All alternative commands that iterate the volume directory tree, such as 'ls -R' or 'tree', hang too.

It is probably worth mentioning that the GlusterFS rebalance operation fails too: in the 'fix-layout' phase the operation starts normally, and the counter reported by the rebalance 'status' command grows steadily for a while, but then the index value stops growing and the rebalance never completes.

There are currently about 300,000 directories and 1,000,000 files in the affected volume, and about 0.02% of the files and directories seem to hang any command that tries to access them. The rest of the files and directories work normally. All access is done with the GlusterFS native FUSE client: NFS is not involved in any way.

The affected GlusterFS volume was constructed recently, using the latest stable GlusterFS 3.2.5 version. In other words, the volume has never been upgraded from an older version. The volume currently uses only 1% of its maximum capacity: we could recover from the current situation by simply copying all data away from the bricks, clearing the whole volume, and then copying the data back. However, we are unwilling to do this, because we are also evaluating GlusterFS, and we need to be sure that GlusterFS volumes can be properly recovered without a 'start-everything-from-the-beginning' approach.
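For completeness, the gist of the manual GFID fix referenced above, as we understood and applied it, was roughly the following. This is only a sketch with an illustrative brick path; the linked posts describe the exact steps:

# Run directly on the bricks as root: dump the GFID of the affected path and
# compare the values across the bricks of the replica pair.
getfattr -n trusted.gfid -e hex /mnt/data8/gfs/volume/a/b/c/d/e

# On copies whose GFID disagrees with the rest of the replica set, remove the
# xattr so that it can be reassigned on the next lookup from a client.
setfattr -x trusted.gfid /mnt/data8/gfs/volume/a/b/c/d/e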
Thanks for the detailed report on the issue. A possible reason why this happens is GFID mismatches. We have seen a similar issue in our testing too, and debugging it is in progress. Will update you soon on the situation. -Amar
Addressing this post 3.3.0.
Rami Hänninen, I realize this is a very late request, but I was just wondering if you could provide the 'getfattr -d -m . -e hex' and 'stat' output for both the file and its parent directory on the bricks where the file is present. We know of hangs when files have missing xattrs on the backends (https://bugzilla.redhat.com/show_bug.cgi?id=798874, https://bugzilla.redhat.com/show_bug.cgi?id=765587), and I would like to verify whether this bug is related to those. Pranith.
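Concretely, something along these lines, run directly on each brick that stores a copy (the paths below are placeholders for your actual brick path and file name), would give us what we need:

# Dump all xattrs of the troublesome entry and of its parent directory.
getfattr -d -m . -e hex /mnt/dataN/gfs/volume/a/b/c/d/e/FILE
getfattr -d -m . -e hex /mnt/dataN/gfs/volume/a/b/c/d/e

# And the corresponding stat output for both.
stat /mnt/dataN/gfs/volume/a/b/c/d/e/FILE
stat /mnt/dataN/gfs/volume/a/b/c/d/e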
Closing as requested information has not been provided.