GlusterFS replica mount hangs forever on "toxic" file healing.

During lab fail-over testing, where one node of a two-node cluster with 2-way replica volumes is failed during client I/O, a situation occurred where a number of files were left in a state where any attempt to access them (either for healing, or client access) hangs the client process forever, until the glusterfs mount is forcibly unmounted.

In this case, the replica healing process was hung, but regular clients experience the hang too:

SC-99 # ps -elf | grep find
0 S root 15726 15725 0 90 9 - 2622 wait_a 17:06 ? 00:00:00 /usr/bin/find /nas_healing/failover -ignore_readdir_race -noleaf -print0

The glusterfs process on both nodes has the file 'shelbyWRC2791.rvf' opened for a long time, but I am not sure if that was the last one that had worked, or if it is the one that hung:

SC-99 # lsof | grep gluster
glusterfs 6221 root 12u REG 8,144 42962 141 /exports/nas-segment-0003/failover/SDS4-14/shelbyWRC2791.rvf
SC-99 # ps -elf | grep 6221
5 S root 6221 1 0 84 9 - 38771 - 16:57 ? 00:00:00 glusterfsd --xlator-option failover-server.listen-port=24015 -s localhost --volfile-id failover.192.168.5.99.exports-nas-segment-0003-failover -p /etc/glusterd/vols/failover/run/192.168.5.99-exports-nas-segment-0003-failover.pid -S /tmp/99ee99735e05a4371c71ab1c9884f10b.socket --brick-name /exports/nas-segment-0003/failover --brick-port 24015 -l /usr/local/var/log/glusterfs/bricks/exports-nas-segment-0003-failover.log

SC-100# lsof | grep gluster
glusterfs 32033 root 12u REG 8,128 22941 93 /exports/nas-segment-0004/failover/SDS4-14/shelbyWRC2791.rvf
SC-100# ps -elf | grep 32033
5 S root 32033 1 0 84 9 - 37219 - 16:57 ? 00:00:00 glusterfsd --xlator-option failover-server.listen-port=24012 -s localhost --volfile-id failover.192.168.5.100.exports-nas-segment-0004-failover -p /etc/glusterd/vols/failover/run/192.168.5.100-exports-nas-segment-0004-failover.pid -S /tmp/f9808a6b9d52ab7ef4290dfa73eb355e.socket --brick-name /exports/nas-segment-0004/failover --brick-port 24012 -l /usr/local/var/log/glusterfs/bricks/exports-nas-segment-0004-failover.log

The 'shelbyWRC2791.rvf' trusted.afr attributes are all zeros on both nodes, so I also included details for what looks to be the next file to be processed, 'SDS4WRC2752.rvf', which has non-zero trusted.afr attributes below.

This problem is very bad because, unlike 'split-brain' file situations, where only a single file cannot be accessed until manual recovery actions are taken, any client that trips on one of the problematic files that need healing gets hung forever.

This issue seems similar to:

Bug 799856 - Native client hangs when accessing certain files or directories
https://bugzilla.redhat.com/show_bug.cgi?id=799856

but all of the gfids I've checked are identical for the files.

I have tarred one of the problematic directories from each brick, saved the extended attributes, and can reproduce the problem with these files on a different system. If it would be helpful, I could upload these files.

Unfortunately, I cannot know exactly what happened during the failure, other than that the clients were writing files while one of the replica nodes went down, was down for a while, and was then brought back up, and healing started.

Details follow.
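As an aside before the details: one hypothetical way to enumerate heal candidates like 'SDS4WRC2752.rvf' on a brick is to sweep it for non-zero trusted.afr changelogs. The 'heal_candidates' helper below is my own sketch, not part of GlusterFS; the brick path is taken from this report, and it must run as root to read trusted.* xattrs.

```shell
# Hypothetical sweep: list files on a brick whose trusted.afr changelog
# is non-zero, i.e. candidates for self-heal. Requires root, since only
# root can read trusted.* extended attributes.
heal_candidates() {
    find "$1" -type f -print0 2>/dev/null |
        xargs -0 -r getfattr --absolute-names -d -m '^trusted\.afr\.' -e hex 2>/dev/null |
        awk '/^# file:/ { f = $3 }
             /=0x/ && !/=0x0+$/ { print f, $0 }'
}

heal_candidates /exports/nas-segment-0003/failover
```

On the bricks above this would print SDS4WRC2752.rvf on SC-99 (its trusted.afr.failover-client-1 is non-zero) but not shelbyWRC2791.rvf, whose changelogs are all zeros on both nodes.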
*******************************************************************************
Logfile: /usr/local/var/log/glusterfs/nas_healing-failover.log

Given volfile:
+------------------------------------------------------------------------------+
  1: volume failover-client-0
  2:     type protocol/client
  3:     option remote-host 192.168.5.99
  4:     option remote-subvolume /exports/nas-segment-0003/failover
  5:     option transport-type tcp
  6: end-volume
  7:
  8: volume failover-client-1
  9:     type protocol/client
 10:     option remote-host 192.168.5.100
 11:     option remote-subvolume /exports/nas-segment-0004/failover
 12:     option transport-type tcp
 13: end-volume
 14:
 15: volume failover-replicate-0
 16:     type cluster/replicate
 17:     subvolumes failover-client-0 failover-client-1
 18: end-volume
 19:
 20: volume failover-io-cache
 21:     type performance/io-cache
 22:     subvolumes failover-replicate-0
 23: end-volume
 24:
 25: volume failover-quick-read
 26:     type performance/quick-read
 27:     subvolumes failover-io-cache
 28: end-volume
 29:
 30: volume failover
 31:     type debug/io-stats
 32:     option latency-measurement off
 33:     option count-fop-hits off
 34:     subvolumes failover-quick-read
 35: end-volume
+------------------------------------------------------------------------------+
[2012-04-04 17:06:03.385704] I [rpc-clnt.c:1536:rpc_clnt_reconfig] 0-failover-client-0: changing port to 24015 (from 0)
[2012-04-04 17:06:03.385873] I [rpc-clnt.c:1536:rpc_clnt_reconfig] 0-failover-client-1: changing port to 24012 (from 0)
[2012-04-04 17:06:07.389913] I [client-handshake.c:1090:select_server_supported_programs] 0-failover-client-0: Using Program GlusterFS 3.2.5, Num (1298437), Version (310)
[2012-04-04 17:06:07.394519] I [client-handshake.c:913:client_setvolume_cbk] 0-failover-client-0: Connected to 192.168.5.99:24015, attached to remote volume '/exports/nas-segment-0003/failover'.
[2012-04-04 17:06:07.394555] I [afr-common.c:3141:afr_notify] 0-failover-replicate-0: Subvolume 'failover-client-0' came back up; going online.
[2012-04-04 17:06:07.396291] I [client-handshake.c:1090:select_server_supported_programs] 0-failover-client-1: Using Program GlusterFS 3.2.5, Num (1298437), Version (310)
[2012-04-04 17:06:07.400967] I [client-handshake.c:913:client_setvolume_cbk] 0-failover-client-1: Connected to 192.168.5.100:24012, attached to remote volume '/exports/nas-segment-0004/failover'.
[2012-04-04 17:06:07.408633] I [fuse-bridge.c:3339:fuse_graph_setup] 0-fuse: switched to graph 0
[2012-04-04 17:06:07.408870] I [fuse-bridge.c:2927:fuse_init] 0-glusterfs-fuse: FUSE inited with protocol versions: glusterfs 7.13 kernel 7.10
[2012-04-04 17:06:07.409626] I [afr-common.c:1520:afr_set_root_inode_on_first_lookup] 0-failover-replicate-0: added root inode
[2012-04-04 17:06:07.646182] I [afr-common.c:861:afr_detect_self_heal_by_iatt] 0-failover-replicate-0: permissions differ for /SDS4-14/gt500WRC2797.rvf
[2012-04-04 17:06:07.646299] I [afr-common.c:1038:afr_launch_self_heal] 0-failover-replicate-0: background meta-data self-heal triggered. path: /SDS4-14/gt500WRC2797.rvf
[2012-04-04 17:06:07.650518] I [afr-self-heal-common.c:2077:afr_self_heal_completion_cbk] 0-failover-replicate-0: background meta-data self-heal completed on /SDS4-14/gt500WRC2797.rvf
[2012-04-04 17:06:07.660088] I [afr-common.c:861:afr_detect_self_heal_by_iatt] 0-failover-replicate-0: permissions differ for /SDS4-14/SDS4WRC2757.rvf
[2012-04-04 17:06:07.660172] I [afr-common.c:875:afr_detect_self_heal_by_iatt] 0-failover-replicate-0: size differs for /SDS4-14/SDS4WRC2757.rvf
[2012-04-04 17:06:07.660206] I [afr-common.c:1038:afr_launch_self_heal] 0-failover-replicate-0: background meta-data data self-heal triggered. path: /SDS4-14/SDS4WRC2757.rvf
[2012-04-04 17:06:07.667600] I [afr-common.c:861:afr_detect_self_heal_by_iatt] 0-failover-replicate-0: permissions differ for /SDS4-14/SDS4WRC2786.rvf
[2012-04-04 17:06:07.667687] I [afr-common.c:875:afr_detect_self_heal_by_iatt] 0-failover-replicate-0: size differs for /SDS4-14/SDS4WRC2786.rvf
[2012-04-04 17:06:07.667726] I [afr-common.c:1038:afr_launch_self_heal] 0-failover-replicate-0: background meta-data data self-heal triggered. path: /SDS4-14/SDS4WRC2786.rvf
[2012-04-04 17:06:07.669339] I [afr-self-heal-common.c:2077:afr_self_heal_completion_cbk] 0-failover-replicate-0: background meta-data data self-heal completed on /SDS4-14/SDS4WRC2757.rvf
[2012-04-04 17:06:07.673447] I [afr-common.c:861:afr_detect_self_heal_by_iatt] 0-failover-replicate-0: permissions differ for /SDS4-14/gt500WRC2767.rvf
[2012-04-04 17:06:07.673516] I [afr-common.c:875:afr_detect_self_heal_by_iatt] 0-failover-replicate-0: size differs for /SDS4-14/gt500WRC2767.rvf
[2012-04-04 17:06:07.673549] I [afr-common.c:1038:afr_launch_self_heal] 0-failover-replicate-0: background meta-data data self-heal triggered. path: /SDS4-14/gt500WRC2767.rvf
[2012-04-04 17:06:07.675117] I [afr-self-heal-common.c:2077:afr_self_heal_completion_cbk] 0-failover-replicate-0: background meta-data data self-heal completed on /SDS4-14/SDS4WRC2786.rvf
[2012-04-04 17:06:07.681892] I [afr-common.c:861:afr_detect_self_heal_by_iatt] 0-failover-replicate-0: permissions differ for /SDS4-14/SDS4WRC2760.rvf
[2012-04-04 17:06:07.681969] I [afr-common.c:875:afr_detect_self_heal_by_iatt] 0-failover-replicate-0: size differs for /SDS4-14/SDS4WRC2760.rvf
[2012-04-04 17:06:07.682008] I [afr-common.c:1038:afr_launch_self_heal] 0-failover-replicate-0: background meta-data data self-heal triggered. path: /SDS4-14/SDS4WRC2760.rvf
[2012-04-04 17:06:07.683014] I [afr-self-heal-common.c:2077:afr_self_heal_completion_cbk] 0-failover-replicate-0: background meta-data data self-heal completed on /SDS4-14/gt500WRC2767.rvf
[2012-04-04 17:06:07.689701] I [afr-common.c:861:afr_detect_self_heal_by_iatt] 0-failover-replicate-0: permissions differ for /SDS4-14/vmw2k3r2x64WRC2772.rvf
[2012-04-04 17:06:07.689783] I [afr-common.c:875:afr_detect_self_heal_by_iatt] 0-failover-replicate-0: size differs for /SDS4-14/vmw2k3r2x64WRC2772.rvf
[2012-04-04 17:06:07.689817] I [afr-common.c:1038:afr_launch_self_heal] 0-failover-replicate-0: background meta-data data self-heal triggered. path: /SDS4-14/vmw2k3r2x64WRC2772.rvf
[2012-04-04 17:06:07.692408] I [afr-self-heal-common.c:2077:afr_self_heal_completion_cbk] 0-failover-replicate-0: background meta-data data self-heal completed on /SDS4-14/SDS4WRC2760.rvf
[2012-04-04 17:06:07.697664] I [afr-common.c:861:afr_detect_self_heal_by_iatt] 0-failover-replicate-0: permissions differ for /SDS4-14/SDS4WRC2771.rvf
[2012-04-04 17:06:07.697747] I [afr-common.c:875:afr_detect_self_heal_by_iatt] 0-failover-replicate-0: size differs for /SDS4-14/SDS4WRC2771.rvf
[2012-04-04 17:06:07.697782] I [afr-common.c:1038:afr_launch_self_heal] 0-failover-replicate-0: background meta-data data self-heal triggered. path: /SDS4-14/SDS4WRC2771.rvf
[2012-04-04 17:06:07.699238] I [afr-self-heal-common.c:2077:afr_self_heal_completion_cbk] 0-failover-replicate-0: background meta-data data self-heal completed on /SDS4-14/vmw2k3r2x64WRC2772.rvf
[2012-04-04 17:06:07.704341] I [afr-common.c:861:afr_detect_self_heal_by_iatt] 0-failover-replicate-0: permissions differ for /SDS4-14/vmw2k3r2x64WRC2792.rvf
[2012-04-04 17:06:07.704432] I [afr-common.c:1038:afr_launch_self_heal] 0-failover-replicate-0: background meta-data self-heal triggered. path: /SDS4-14/vmw2k3r2x64WRC2792.rvf
[2012-04-04 17:06:07.706531] I [afr-self-heal-common.c:2077:afr_self_heal_completion_cbk] 0-failover-replicate-0: background meta-data data self-heal completed on /SDS4-14/SDS4WRC2771.rvf
[2012-04-04 17:06:07.708105] I [afr-self-heal-common.c:2077:afr_self_heal_completion_cbk] 0-failover-replicate-0: background meta-data self-heal completed on /SDS4-14/vmw2k3r2x64WRC2792.rvf
[2012-04-04 17:06:07.709542] I [afr-common.c:861:afr_detect_self_heal_by_iatt] 0-failover-replicate-0: permissions differ for /SDS4-14/shelbyWRC2791.rvf
[2012-04-04 17:06:07.709644] I [afr-common.c:875:afr_detect_self_heal_by_iatt] 0-failover-replicate-0: size differs for /SDS4-14/shelbyWRC2791.rvf
[2012-04-04 17:06:07.709696] I [afr-common.c:1038:afr_launch_self_heal] 0-failover-replicate-0: background meta-data data self-heal triggered. path: /SDS4-14/shelbyWRC2791.rvf

*******************************************************************************
Logfile: /usr/local/var/log/glusterfs/bricks/exports-nas-segment-0003-failover.log

Given volfile:
+------------------------------------------------------------------------------+
  1: volume failover-posix
  2:     type storage/posix
  3:     option directory /exports/nas-segment-0003/failover
  4: end-volume
  5:
  6: volume failover-access-control
  7:     type features/access-control
  8:     subvolumes failover-posix
  9: end-volume
 10:
 11: volume failover-locks
 12:     type features/locks
 13:     subvolumes failover-access-control
 14: end-volume
 15:
 16: volume failover-io-threads
 17:     type performance/io-threads
 18:     subvolumes failover-locks
 19: end-volume
 20:
 21: volume failover-marker
 22:     type features/marker
 23:     option volume-uuid 1a21ec35-9077-4587-84ad-7bbd5fe22501
 24:     option timestamp-file /etc/glusterd/vols/failover/marker.tstamp
 25:     option xtime off
 26:     option quota off
 27:     subvolumes failover-io-threads
 28: end-volume
 29:
 30: volume /exports/nas-segment-0003/failover
 31:     type debug/io-stats
 32:     option latency-measurement off
 33:     option count-fop-hits off
 34:     subvolumes failover-marker
 35: end-volume
 36:
 37: volume failover-server
 38:     type protocol/server
 39:     option transport-type tcp
 40:     option auth.addr./exports/nas-segment-0003/failover.allow *
 41:     subvolumes /exports/nas-segment-0003/failover
 42: end-volume
+------------------------------------------------------------------------------+
[2012-04-04 17:06:07.390223] I [server-handshake.c:542:server_setvolume] 0-failover-server: accepted client from 192.168.5.99:1019 (version: 3.2.5)

*******************************************************************************
Logfile: /usr/local/var/log/glusterfs/bricks/exports-nas-segment-0004-failover.log

Given volfile:
+------------------------------------------------------------------------------+
  1: volume failover-posix
  2:     type storage/posix
  3:     option directory /exports/nas-segment-0004/failover
  4: end-volume
  5:
  6: volume failover-access-control
  7:     type features/access-control
  8:     subvolumes failover-posix
  9: end-volume
 10:
 11: volume failover-locks
 12:     type features/locks
 13:     subvolumes failover-access-control
 14: end-volume
 15:
 16: volume failover-io-threads
 17:     type performance/io-threads
 18:     subvolumes failover-locks
 19: end-volume
 20:
 21: volume failover-marker
 22:     type features/marker
 23:     option volume-uuid 1a21ec35-9077-4587-84ad-7bbd5fe22501
 24:     option timestamp-file /etc/glusterd/vols/failover/marker.tstamp
 25:     option xtime off
 26:     option quota off
 27:     subvolumes failover-io-threads
 28: end-volume
 29:
 30: volume /exports/nas-segment-0004/failover
 31:     type debug/io-stats
 32:     option latency-measurement off
 33:     option count-fop-hits off
 34:     subvolumes failover-marker
 35: end-volume
 36:
 37: volume failover-server
 38:     type protocol/server
 39:     option transport-type tcp
 40:     option auth.addr./exports/nas-segment-0004/failover.allow *
 41:     subvolumes /exports/nas-segment-0004/failover
 42: end-volume
+------------------------------------------------------------------------------+
[2012-04-04 17:06:07.398223] I [server-handshake.c:542:server_setvolume] 0-failover-server: accepted client from 192.168.5.99:1018 (version: 3.2.5)

*******************************************************************************
File Attributes: /exports/nas-segment-0003/failover/SDS4-14/shelbyWRC2791.rvf

SC-99 # stat /exports/nas-segment-0003/failover/SDS4-14/shelbyWRC2791.rvf
  File: `/exports/nas-segment-0003/failover/SDS4-14/shelbyWRC2791.rvf'
  Size: 42962     Blocks: 96     IO Block: 4096   regular file
Device: 890h/2192d   Inode: 141   Links: 1
Access: (2770/-rwxrws---)  Uid: ( 0/ root)  Gid: ( 1235/nas_users)
Access: 2012-04-04 17:00:17.000000000 -0700
Modify: 2012-04-03 11:01:13.000000000 -0700
Change: 2012-04-04 17:06:07.712981733 -0700

# file: exports/nas-segment-0003/failover/SDS4-14/shelbyWRC2791.rvf
trusted.afr.failover-client-0=0x000000000000000000000000
trusted.afr.failover-client-1=0x000000000000000000000000
trusted.gfid=0xf32ac22f971e47b5972a43f306689495

*******************************************************************************
File Attributes: /exports/nas-segment-0004/failover/SDS4-14/shelbyWRC2791.rvf

SC-100# stat /exports/nas-segment-0004/failover/SDS4-14/shelbyWRC2791.rvf
  File: `/exports/nas-segment-0004/failover/SDS4-14/shelbyWRC2791.rvf'
  Size: 22941     Blocks: 56     IO Block: 4096   regular file
Device: 880h/2176d   Inode: 93   Links: 1
Access: (0770/-rwxrwx---)  Uid: ( 0/ root)  Gid: ( 1235/nas_users)
Access: 2012-04-04 17:00:17.000000000 -0700
Modify: 2012-04-03 11:01:13.000000000 -0700
Change: 2012-04-04 17:06:07.713862699 -0700

# file: exports/nas-segment-0004/failover/SDS4-14/shelbyWRC2791.rvf
trusted.afr.failover-client-0=0x000000000000000000000000
trusted.afr.failover-client-1=0x000000000000000000000000
trusted.gfid=0xf32ac22f971e47b5972a43f306689495

*******************************************************************************
*******************************************************************************
File Attributes:
/exports/nas-segment-0003/failover/SDS4-14/SDS4WRC2752.rvf

SC-99 # stat /exports/nas-segment-0003/failover/SDS4-14/SDS4WRC2752.rvf
  File: `/exports/nas-segment-0003/failover/SDS4-14/SDS4WRC2752.rvf'
  Size: 44948     Blocks: 96     IO Block: 4096   regular file
Device: 890h/2192d   Inode: 219   Links: 1
Access: (2770/-rwxrws---)  Uid: ( 0/ root)  Gid: ( 1235/nas_users)
Access: 2012-04-04 17:00:17.000000000 -0700
Modify: 2012-04-03 11:04:18.000000000 -0700
Change: 2012-04-04 17:05:30.738981733 -0700

# file: exports/nas-segment-0003/failover/SDS4-14/SDS4WRC2752.rvf
trusted.afr.failover-client-0=0x000000000000000000000000
trusted.afr.failover-client-1=0x000000030000000200000000
trusted.gfid=0x786cb26427254fb3abe06ca0b26e99fc

*******************************************************************************
File Attributes: /exports/nas-segment-0004/failover/SDS4-14/SDS4WRC2752.rvf

SC-100# stat /exports/nas-segment-0004/failover/SDS4-14/SDS4WRC2752.rvf
  File: `/exports/nas-segment-0004/failover/SDS4-14/SDS4WRC2752.rvf'
  Size: 15550     Blocks: 40     IO Block: 4096   regular file
Device: 880h/2176d   Inode: 40   Links: 1
Access: (0770/-rwxrwx---)  Uid: ( 0/ root)  Gid: ( 1235/nas_users)
Access: 2012-04-04 17:04:50.000000000 -0700
Modify: 2012-04-02 10:13:35.000000000 -0700
Change: 2012-04-04 17:05:21.863862699 -0700

# file: exports/nas-segment-0004/failover/SDS4-14/SDS4WRC2752.rvf
trusted.afr.failover-client-0=0x000000000000000000000000
trusted.afr.failover-client-1=0x000000000000000000000000
trusted.gfid=0x786cb26427254fb3abe06ca0b26e99fc
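For what it's worth, as I understand the AFR changelog layout in this release, the non-zero trusted.afr value above decodes as three 32-bit big-endian pending-operation counters: data, metadata, and entry, in that order. This 'decode_afr' helper is my own sketch of that assumed layout, not a GlusterFS tool:

```shell
# Decode a trusted.afr changelog value into its three pending-operation
# counters. Assumed layout: bytes 0-3 data, 4-7 metadata, 8-11 entry,
# big-endian. Requires bash for the substring and base-16 arithmetic.
decode_afr() {
    local hex=${1#0x}
    printf 'data=%d metadata=%d entry=%d\n' \
        "$((16#${hex:0:8}))" "$((16#${hex:8:8}))" "$((16#${hex:16:8}))"
}

decode_afr 0x000000030000000200000000   # prints: data=3 metadata=2 entry=0
```

Read this way, trusted.afr.failover-client-1 on SC-99 says the SC-99 brick holds pending data and metadata operations against the other replica, which matches the SC-100 copy of SDS4WRC2752.rvf being smaller and having an older mtime.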
It seems that this problem is in part related to the use of the SGID (Set Group ID) dir/file mode attributes, which were added to the Samba 'smb.conf' file to resolve an unrelated GlusterFS issue:

Bug 802423 - GlusterFS does not work well with MS Office 2010 and Samba "posix locking = yes".
https://bugzilla.redhat.com/show_bug.cgi?id=802423

[Gluster-users] Using Samba and MSOffice files
http://gluster.org/pipermail/gluster-users/2011-November/009127.html

The thought that the SGID mode bit was somehow related to the problem arose because the 2-way replica toxic files showed the SGID bit set on one copy, but not the other. However, AFAIK, the Samba host was never changing the SGID bit or creating files without it, so the case where the SGID bit was lost must be related to replica healing in some way.

When I changed the Samba 'smb.conf' settings from:

directory mask = 2770
force directory mode = 2770
create mask = 2770
force create mode = 2770
force security mode = 2770
force directory security mode = 2770

to:

directory mask = 770
force directory mode = 770
create mask = 770
force create mode = 770
force security mode = 770
force directory security mode = 770

the "toxic" file healing problem does not occur, where it occurred most of the time otherwise. This can act as a work-around, but it should be possible to use the SGID file/directory mode bits if desired.
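For anyone reproducing this: the relevant property of the 2770 masks is the setgid bit on directories, which makes new files inherit the directory's group. A minimal local demonstration, using a throwaway temp directory rather than the NAS layout:

```shell
# Show the setgid (SGID) directory behaviour that the 2770 smb.conf
# masks produce: mode 2770 on the directory itself, and group
# inheritance for files created inside it.
d=$(mktemp -d)
chmod 2770 "$d"
touch "$d/newfile"

stat -c '%a' "$d"     # prints 2770; the leading 2 is the setgid bit
# A file created inside a g+s directory gets the directory's group:
stat -c '%g' "$d/newfile"
stat -c '%g' "$d"
```

Compare this with the shelbyWRC2791.rvf modes above: the SC-99 copy is (2770/-rwxrws---) but the SC-100 copy is (0770/-rwxrwx---), i.e. the setgid bit survived on only one replica, which is exactly the mismatch discussed here.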
Jeff,

>> I have tarred one of the problematic directories from each brick,
>> and saved the extended attributes, and can reproduce the problem
>> with these files on a different system. If it would be helpful,
>> I could upload these files.

We would like to recreate the problem in our setup. Could you provide the xattrs which result in this scenario?

Pranith
Created attachment 591616 [details]
The extended file attribute dumps, as requested.
Jeff Byers,

What are the .dmp files? How do I open them? If I just create such a directory hierarchy and assign the xattrs, should I be able to see the issue? Does it need to be done through Samba, or can it happen even on a fuse client?

Pranith
The .dmp files are the output of 'getfattr' with the '-d' option, and are in ASCII text format. This format can be used to restore attributes to files using 'setfattr'. I am not sure if you could just restore the attributes and reproduce; it has been almost 2.5 months, and I was surprised that I still had the files. Since Samba only accesses the volume through the GlusterFS fuse client, Samba probably shouldn't be necessary.
The version that this bug has been reported against does not get any updates from the Gluster Community anymore. Please verify whether this report is still valid against a current (3.4, 3.5 or 3.6) release and update the version, or close this bug. If there has been no update before 9 December 2014, this bug will get automatically closed.