Description of problem:
========================
Had a 4-node cluster with a 2*(4+2) volume as the cold tier. Mounted the volume on 3 NFS clients and started creating files under 3 different folders. Attached a 4*1 distribute tier as hot and experienced a hang on all the clients where I/O was in progress.

Did a volume start force, which is the documented workaround when we hit this issue. The I/O from two clients resumed, but there was no effect on the third client. Did a volume start force a couple more times, and the I/O resumed.

Changed the default watermark values: high to 15 and low to 5. NFS I/O from the 3 clients continued to proceed, and the low watermark was crossed.

du -H on the mountpoint showed 2 files which it was unable to access. It errored out with the message "Stale file handle".

When checked on the backend, the file is present on the hot tier brick, with a link file present in the cold tier.

nfs.log shows the below errors:

[2016-05-10 04:27:08.116850] W [MSGID: 112199] [nfs3-helpers.c:3520:nfs3_log_newfh_res] 0-nfs-nfsv3: /1k_files/file11940 => (XID: 2ae29970, LOOKUP: NFS: 70(Invalid file handle), POSIX: 116(Stale file handle)), FH: exportid 00000000-0000-0000-0000-000000000000, gfid 00000000-0000-0000-0000-000000000000, mountid 00000000-0000-0000-0000-000000000000
The message "W [MSGID: 109009] [dht-common.c:1926:dht_lookup_linkfile_cbk] 0-ozone-tier-dht: /1k_files/file11940: gfid different on data file on ozone-hot-dht, gfid local = 00000000-0000-0000-0000-000000000000, gfid node = 128d47d1-a259-433c-9ff4-50faa87e4cbf " repeated 3 times between [2016-05-10 04:26:54.071680] and [2016-05-10 04:27:08.114327]
[2016-05-10 04:34:29.879588] W [MSGID: 109009] [dht-common.c:1926:dht_lookup_linkfile_cbk] 0-ozone-tier-dht: /1k_files/file11940: gfid different on data file on ozone-hot-dht, gfid local = 00000000-0000-0000-0000-000000000000, gfid node = 128d47d1-a259-433c-9ff4-50faa87e4cbf
[2016-05-10 04:34:29.881425] W [MSGID: 109009] [dht-common.c:1670:dht_lookup_everywhere_cbk] 0-ozone-tier-dht: /1k_files/file11940: gfid differs on subvolume ozone-hot-dht, gfid local = fbd44247-495c-4f8c-bee2-b3c268136cc3, gfid node = 128d47d1-a259-433c-9ff4-50faa87e4cbf

Version-Release number of selected component (if applicable):
============================================================
3.7.9-3

How reproducible:
=================
Hit it once

Additional info:
=================

CLIENT LOGS:
--------------

[root@dhcp35-3 ~]# cd /mnt/oz
[root@dhcp35-3 oz]# df -k .
Filesystem         1K-blocks     Used Available Use% Mounted on
10.70.47.64:/ozone 565942272 13722624 552219648   3% /mnt/oz
[root@dhcp35-3 oz]# ls -l
total 12
drwxr-xr-x. 2 root root 4096 May 9 18:47 1g_files
drwxr-xr-x. 2 root root 4096 May 9 18:47 1k_files
drwxr-xr-x. 2 root root 4096 May 9 18:47 1m_files
[root@dhcp35-3 oz]# ls -l 1k_files/file11940
ls: cannot access 1k_files/file11940: Stale file handle
[root@dhcp35-3 oz]# ls -l 1m_files/file4106
ls: cannot access 1m_files/file4106: Stale file handle
[root@dhcp35-3 oz]# mount | grep oz
10.70.47.64:/ozone on /mnt/oz type nfs (rw,relatime,vers=3,rsize=1048576,wsize=1048576,namlen=255,hard,proto=tcp,timeo=600,retrans=2,sec=sys,mountaddr=10.70.47.64,mountvers=3,mountport=38465,mountproto=tcp,local_lock=none,addr=10.70.47.64)
[root@dhcp35-3 oz]#

[root@dhcp35-30 oz]# du -H
4         ./.trashcan/internal_op
8         ./.trashcan
7292356   ./1g_files
du: cannot access ‘./1k_files/file11940’: Stale file handle
12111     ./1k_files
du: cannot access ‘./1m_files/file4106’: Stale file handle
4814852   ./1m_files
12119331  .
[root@dhcp35-30 oz]#

[root@dhcp35-3 oz]# ls -l 1m_files/file4106
ls: cannot access 1m_files/file4106: Stale file handle
[root@dhcp35-3 oz]#

SERVER LOGS
---------------

[root@dhcp47-64 ~]# gluster v info

Volume Name: ozone
Type: Tier
Volume ID: 8fff87bd-478c-4f13-9381-e935c92b4357
Status: Started
Number of Bricks: 16
Transport-type: tcp
Hot Tier :
Hot Tier Type : Distribute
Number of Bricks: 4
Brick1: 10.70.47.190:/bricks/brick4/ozone
Brick2: 10.70.46.33:/bricks/brick4/ozone
Brick3: 10.70.46.121:/bricks/brick4/ozone
Brick4: 10.70.47.64:/bricks/brick4/ozone
Cold Tier:
Cold Tier Type : Distributed-Disperse
Number of Bricks: 2 x (4 + 2) = 12
Brick5: 10.70.47.64:/bricks/brick1/ozone
Brick6: 10.70.46.121:/bricks/brick1/ozone
Brick7: 10.70.46.33:/bricks/brick1/ozone
Brick8: 10.70.47.190:/bricks/brick1/ozone
Brick9: 10.70.47.64:/bricks/brick2/ozone
Brick10: 10.70.46.121:/bricks/brick2/ozone
Brick11: 10.70.46.33:/bricks/brick2/ozone
Brick12: 10.70.47.190:/bricks/brick2/ozone
Brick13: 10.70.47.64:/bricks/brick3/ozone
Brick14: 10.70.46.121:/bricks/brick3/ozone
Brick15: 10.70.46.33:/bricks/brick3/ozone
Brick16: 10.70.47.190:/bricks/brick3/ozone
Options Reconfigured:
cluster.watermark-hi: 15
cluster.watermark-low: 5
cluster.tier-mode: cache
features.ctr-enabled: on
performance.readdir-ahead: on
[root@dhcp47-64 ~]#

>>>>>>>>>>>>>>>>>>>> 1k_files/file11940 >>>>>>>>>>>>>>>>>>>>>>>

[root@dhcp47-190 ~]# ls -l /bricks/brick4/ozone/1k_files/file11940
-rw-r--r--. 2 root root 1024 May 9 18:39 /bricks/brick4/ozone/1k_files/file11940
[root@dhcp47-190 ~]# getfattr -d -m . -e hex /bricks/brick4/ozone/1k_files/file11940
getfattr: Removing leading '/' from absolute path names
# file: bricks/brick4/ozone/1k_files/file11940
security.selinux=0x73797374656d5f753a6f626a6563745f723a756e6c6162656c65645f743a733000
trusted.bit-rot.version=0x020000000000000057308126000b7358
trusted.gfid=0x128d47d1a259433c9ff450faa87e4cbf

[root@dhcp47-190 ~]#

[root@dhcp46-121 ~]# ls -l /bricks/brick2/ozone/1k_files/file11940
---------T. 2 root root 0 May 9 18:29 /bricks/brick2/ozone/1k_files/file11940
[root@dhcp46-121 ~]# ls -l /bricks/brick1/ozone/1k_files/file11940
---------T. 2 root root 0 May 9 18:29 /bricks/brick1/ozone/1k_files/file11940
[root@dhcp46-121 ~]# getfattr -d -m . -e hex /bricks/brick2/ozone/1k_files/file11940
getfattr: Removing leading '/' from absolute path names
# file: bricks/brick2/ozone/1k_files/file11940
security.selinux=0x73797374656d5f753a6f626a6563745f723a756e6c6162656c65645f743a733000
trusted.ec.config=0x0000080602000200
trusted.ec.size=0x0000000000000000
trusted.ec.version=0x00000000000000000000000000000000
trusted.gfid=0xfbd44247495c4f8cbee2b3c268136cc3
trusted.tier.tier-dht.linkto=0x6f7a6f6e652d686f742d64687400

[root@dhcp46-121 ~]#

>>>>>>>>>>>>>>>>>>>>>>> 1m_files/file4106 >>>>>>>>>>>>>>>>>>>>>>>>>>>>

[root@dhcp46-33 ~]# ls -l /bricks/brick4/ozone/1m_files/file4106
-rw-r--r--. 2 root root 1048576 May 9 18:32 /bricks/brick4/ozone/1m_files/file4106
[root@dhcp46-33 ~]# getfattr -d -m . -e hex /bricks/brick4/ozone/1m_files/file4106
getfattr: Removing leading '/' from absolute path names
# file: bricks/brick4/ozone/1m_files/file4106
security.selinux=0x73797374656d5f753a6f626a6563745f723a756e6c6162656c65645f743a733000
trusted.bit-rot.version=0x0200000000000000573081260007cf54
trusted.gfid=0xd68e87ec488a427f9adac0100d5333bc

[root@dhcp46-33 ~]#

[root@dhcp46-121 ~]# ls -l /bricks/brick1/ozone/1m_files/file4106
---------T. 2 root root 0 May 9 18:29 /bricks/brick1/ozone/1m_files/file4106
[root@dhcp46-121 ~]# getfattr -d -m . -e hex /bricks/brick1/ozone/1m_files/file4106
getfattr: Removing leading '/' from absolute path names
# file: bricks/brick1/ozone/1m_files/file4106
security.selinux=0x73797374656d5f753a6f626a6563745f723a756e6c6162656c65645f743a733000
trusted.ec.config=0x0000080602000200
trusted.ec.size=0x0000000000000000
trusted.ec.version=0x00000000000000000000000000000000
trusted.gfid=0x832e8f3ed390491e8f146d51b2761410
trusted.tier.tier-dht.linkto=0x6f7a6f6e652d686f742d64687400

[root@dhcp46-121 ~]#
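
For anyone reading the getfattr output above, the hex values can be decoded to compare them against nfs.log. The snippet below is only a rough sketch, not part of any Gluster tooling; the helper names decode_gfid and decode_linkto are made up for illustration, and the hex strings are copied from the getfattr output in this report. It assumes trusted.gfid holds a raw 16-byte UUID and that the linkto value is a NUL-terminated ASCII subvolume name, which is what the values here look like.

```python
import uuid


def _strip_0x(hex_value: str) -> str:
    """Drop the leading '0x' that `getfattr -e hex` prints."""
    return hex_value[2:] if hex_value.lower().startswith("0x") else hex_value


def decode_gfid(hex_value: str) -> str:
    """Render a trusted.gfid hex dump in the dashed UUID form used in the logs."""
    return str(uuid.UUID(hex=_strip_0x(hex_value)))


def decode_linkto(hex_value: str) -> str:
    """Decode a linkto xattr (assumed NUL-terminated ASCII) to a subvolume name."""
    raw = bytes.fromhex(_strip_0x(hex_value))
    return raw.rstrip(b"\x00").decode("ascii")


# Values copied from the getfattr output for 1k_files/file11940.
hot_gfid = decode_gfid("0x128d47d1a259433c9ff450faa87e4cbf")    # data file, hot tier
cold_gfid = decode_gfid("0xfbd44247495c4f8cbee2b3c268136cc3")   # link file, cold tier
target = decode_linkto("0x6f7a6f6e652d686f742d64687400")

print("data file gfid :", hot_gfid)
print("link file gfid :", cold_gfid)
print("link points to :", target)
print("gfids match    :", hot_gfid == cold_gfid)
```

Decoded this way, the link file's gfid corresponds to "gfid local" and the data file's gfid to "gfid node" in the dht_lookup_everywhere_cbk warning, and the linkto value names the ozone-hot-dht subvolume. The same mismatch shows up for 1m_files/file4106 if its two trusted.gfid values are passed through decode_gfid.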
Thank you for your bug report. We are not root-causing this bug any further; as a result, it is being closed as WONTFIX. Please reopen it if the problem is still observed after upgrading to the latest version.