Bug 1334577 - [Tiering]: Unable to access file(s) from nfs client; gfid mismatch between cold and hot tier entries
Keywords:
Status: CLOSED WONTFIX
Alias: None
Product: Red Hat Gluster Storage
Classification: Red Hat Storage
Component: tier
Version: rhgs-3.1
Hardware: Unspecified
OS: Unspecified
Priority: unspecified
Severity: high
Target Milestone: ---
Target Release: ---
Assignee: Nithya Balachandran
QA Contact: krishnaram Karthick
URL:
Whiteboard:
Depends On:
Blocks: 1336320
Reported: 2016-05-10 05:30 UTC by Sweta Anandpara
Modified: 2018-02-06 17:52 UTC (History)

Fixed In Version:
Doc Type: Bug Fix
Doc Text:
Clone Of:
: 1336320 (view as bug list)
Environment:
Last Closed: 2018-02-06 17:52:28 UTC
Embargoed:



Description Sweta Anandpara 2016-05-10 05:30:11 UTC
Description of problem:
========================

Had a 4-node cluster with a 2 x (4+2) distributed-disperse volume as the cold tier. Mounted the volume on 3 NFS clients and started creating files under 3 different directories. Attached a 4 x 1 distribute tier as the hot tier and hit a hang on all the clients where I/O was in progress. Ran a volume start force, which is the documented workaround for this hang. I/O resumed on two clients but not on the third. After a couple more volume start force runs, I/O resumed there as well. Then changed the default watermark values (high to 15, low to 5). NFS I/O from all 3 clients continued, and the low watermark was crossed.

du -H on the mountpoint showed 2 files that it was unable to access. It errored out with the message: Stale file handle

On the backend, each of these files is present on a hot tier brick, with a link file for it on the cold tier, and the two entries carry different gfids. nfs.log shows the following errors:

[2016-05-10 04:27:08.116850] W [MSGID: 112199] [nfs3-helpers.c:3520:nfs3_log_newfh_res] 0-nfs-nfsv3: /1k_files/file11940 => (XID: 2ae29970, LOOKUP: NFS: 70(Invalid file handle), POSIX: 116(Stale file handle)), FH: exportid 00000000-0000-0000-0000-000000000000, gfid 00000000-0000-0000-0000-000000000000, mountid 00000000-0000-0000-0000-000000000000
The message "W [MSGID: 109009] [dht-common.c:1926:dht_lookup_linkfile_cbk] 0-ozone-tier-dht: /1k_files/file11940: gfid different on data file on ozone-hot-dht, gfid local = 00000000-0000-0000-0000-000000000000, gfid node = 128d47d1-a259-433c-9ff4-50faa87e4cbf " repeated 3 times between [2016-05-10 04:26:54.071680] and [2016-05-10 04:27:08.114327]
[2016-05-10 04:34:29.879588] W [MSGID: 109009] [dht-common.c:1926:dht_lookup_linkfile_cbk] 0-ozone-tier-dht: /1k_files/file11940: gfid different on data file on ozone-hot-dht, gfid local = 00000000-0000-0000-0000-000000000000, gfid node = 128d47d1-a259-433c-9ff4-50faa87e4cbf
[2016-05-10 04:34:29.881425] W [MSGID: 109009] [dht-common.c:1670:dht_lookup_everywhere_cbk] 0-ozone-tier-dht: /1k_files/file11940: gfid differs on subvolume ozone-hot-dht, gfid local = fbd44247-495c-4f8c-bee2-b3c268136cc3, gfid node = 128d47d1-a259-433c-9ff4-50faa87e4cbf
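
When scanning a larger nfs.log, the gfid pairs in the MSGID 109009 warnings above can be pulled out mechanically. The following is only a sketch: the regular expression is modelled on the wording of the lines quoted in this report (other Gluster versions may phrase the message differently), and `parse_gfid_mismatch` is a hypothetical helper name, not a Gluster tool.

```python
import re

# Assumption: the message wording matches the two variants seen in this
# report ("gfid different on data file on" / "gfid differs on subvolume").
MISMATCH_RE = re.compile(
    r"(?P<path>/\S+): gfid (?:different on data file on|differs on subvolume) "
    r"(?P<subvol>\S+), gfid local = (?P<local>[0-9a-f-]{36}), "
    r"gfid node = (?P<node>[0-9a-f-]{36})"
)


def parse_gfid_mismatch(line):
    """Return path, subvolume and both gfids from a warning line, or None."""
    match = MISMATCH_RE.search(line)
    return match.groupdict() if match else None
```

Calling `parse_gfid_mismatch` on each line of the log and keeping the non-None results gives one record per warning, which can then be grouped by path.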



Version-Release number of selected component (if applicable):
============================================================
3.7.9-3

How reproducible:
=================
Hit it once


Additional info:
=================

CLIENT LOGS:
--------------

[root@dhcp35-3 ~]# cd /mnt/oz
[root@dhcp35-3 oz]# 
[root@dhcp35-3 oz]# 
[root@dhcp35-3 oz]# df -k .
Filesystem         1K-blocks     Used Available Use% Mounted on
10.70.47.64:/ozone 565942272 13722624 552219648   3% /mnt/oz
[root@dhcp35-3 oz]# 
[root@dhcp35-3 oz]# ls -l
total 12
drwxr-xr-x. 2 root root 4096 May  9 18:47 1g_files
drwxr-xr-x. 2 root root 4096 May  9 18:47 1k_files
drwxr-xr-x. 2 root root 4096 May  9 18:47 1m_files
[root@dhcp35-3 oz]# 
[root@dhcp35-3 oz]#
[root@dhcp35-3 oz]# 
[root@dhcp35-3 oz]# ls -l  1k_files/file11940
ls: cannot access 1k_files/file11940: Stale file handle
[root@dhcp35-3 oz]# 
[root@dhcp35-3 oz]# 
[root@dhcp35-3 oz]# ls -l  1m_files/file4106
ls: cannot access 1m_files/file4106: Stale file handle
[root@dhcp35-3 oz]# 
[root@dhcp35-3 oz]# 
[root@dhcp35-3 oz]# mount | grep oz
10.70.47.64:/ozone on /mnt/oz type nfs (rw,relatime,vers=3,rsize=1048576,wsize=1048576,namlen=255,hard,proto=tcp,timeo=600,retrans=2,sec=sys,mountaddr=10.70.47.64,mountvers=3,mountport=38465,mountproto=tcp,local_lock=none,addr=10.70.47.64)
[root@dhcp35-3 oz]# 
[root@dhcp35-30 oz]# du -H
4	./.trashcan/internal_op
8	./.trashcan
7292356	./1g_files
du: cannot access ‘./1k_files/file11940’: Stale file handle
12111	./1k_files
du: cannot access ‘./1m_files/file4106’: Stale file handle
4814852	./1m_files
12119331	.
[root@dhcp35-30 oz]# 

[root@dhcp35-3 oz]# 
[root@dhcp35-3 oz]# ls -l  1m_files/file4106
ls: cannot access 1m_files/file4106: Stale file handle
[root@dhcp35-3 oz]# 


SERVER LOGS
---------------

[root@dhcp47-64 ~]# gluster v info
 
Volume Name: ozone
Type: Tier
Volume ID: 8fff87bd-478c-4f13-9381-e935c92b4357
Status: Started
Number of Bricks: 16
Transport-type: tcp
Hot Tier :
Hot Tier Type : Distribute
Number of Bricks: 4
Brick1: 10.70.47.190:/bricks/brick4/ozone
Brick2: 10.70.46.33:/bricks/brick4/ozone
Brick3: 10.70.46.121:/bricks/brick4/ozone
Brick4: 10.70.47.64:/bricks/brick4/ozone
Cold Tier:
Cold Tier Type : Distributed-Disperse
Number of Bricks: 2 x (4 + 2) = 12
Brick5: 10.70.47.64:/bricks/brick1/ozone
Brick6: 10.70.46.121:/bricks/brick1/ozone
Brick7: 10.70.46.33:/bricks/brick1/ozone
Brick8: 10.70.47.190:/bricks/brick1/ozone
Brick9: 10.70.47.64:/bricks/brick2/ozone
Brick10: 10.70.46.121:/bricks/brick2/ozone
Brick11: 10.70.46.33:/bricks/brick2/ozone
Brick12: 10.70.47.190:/bricks/brick2/ozone
Brick13: 10.70.47.64:/bricks/brick3/ozone
Brick14: 10.70.46.121:/bricks/brick3/ozone
Brick15: 10.70.46.33:/bricks/brick3/ozone
Brick16: 10.70.47.190:/bricks/brick3/ozone
Options Reconfigured:
cluster.watermark-hi: 15
cluster.watermark-low: 5
cluster.tier-mode: cache
features.ctr-enabled: on
performance.readdir-ahead: on
[root@dhcp47-64 ~]# 

>>>>>>>>>>>>>>>>>>>>  1k_files/file11940   >>>>>>>>>>>>>>>>>>>>>>>

[root@dhcp47-190 ~]# 
[root@dhcp47-190 ~]# ls -l /bricks/brick4/ozone/1k_files/file11940
-rw-r--r--. 2 root root 1024 May  9 18:39 /bricks/brick4/ozone/1k_files/file11940
[root@dhcp47-190 ~]#
[root@dhcp47-190 ~]# 
[root@dhcp47-190 ~]# getfattr -d -m . -e hex /bricks/brick4/ozone/1k_files/file11940
getfattr: Removing leading '/' from absolute path names
# file: bricks/brick4/ozone/1k_files/file11940
security.selinux=0x73797374656d5f753a6f626a6563745f723a756e6c6162656c65645f743a733000
trusted.bit-rot.version=0x020000000000000057308126000b7358
trusted.gfid=0x128d47d1a259433c9ff450faa87e4cbf

[root@dhcp47-190 ~]#

[root@dhcp46-121 ~]# ls -l /bricks/brick2/ozone/1k_files/file11940
---------T. 2 root root 0 May  9 18:29 /bricks/brick2/ozone/1k_files/file11940
[root@dhcp46-121 ~]# ls -l /bricks/brick1/ozone/1k_files/file11940
---------T. 2 root root 0 May  9 18:29 /bricks/brick1/ozone/1k_files/file11940
[root@dhcp46-121 ~]#
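
The `---------T` listings above (regular file, zero bytes, only the sticky bit set) are what a DHT link file usually looks like on a brick. A rough way to screen for such entries from a `stat` result is sketched below. `looks_like_dht_linkfile` is a hypothetical helper, and the mode check is only a heuristic: the linkto xattr is what actually identifies a link file.

```python
import stat


def looks_like_dht_linkfile(mode, size):
    """Heuristic: regular file, zero bytes, sticky bit set, no permission bits."""
    return (
        stat.S_ISREG(mode)
        and size == 0
        and stat.S_IMODE(mode) == stat.S_ISVTX
    )


# A regular file with only the sticky bit renders the same way ls shows it.
link_mode = stat.S_IFREG | stat.S_ISVTX
print(stat.filemode(link_mode))
```

On a brick, the inputs would come from `os.stat(path)` as `st_mode` and `st_size`.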
[root@dhcp46-121 ~]# getfattr -d -m . -e hex /bricks/brick2/ozone/1k_files/file11940
getfattr: Removing leading '/' from absolute path names
# file: bricks/brick2/ozone/1k_files/file11940
security.selinux=0x73797374656d5f753a6f626a6563745f723a756e6c6162656c65645f743a733000
trusted.ec.config=0x0000080602000200
trusted.ec.size=0x0000000000000000
trusted.ec.version=0x00000000000000000000000000000000
trusted.gfid=0xfbd44247495c4f8cbee2b3c268136cc3
trusted.tier.tier-dht.linkto=0x6f7a6f6e652d686f742d64687400

[root@dhcp46-121 ~]# 
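
The hex xattr values above are easier to compare with the nfs.log messages once decoded. A minimal sketch, assuming values in the `0x...` form printed by `getfattr -e hex`; both helper names are illustrative.

```python
import uuid


def _strip_0x(hex_value):
    """Drop the leading 0x that getfattr -e hex prints."""
    return hex_value[2:] if hex_value.startswith("0x") else hex_value


def gfid_from_xattr(hex_value):
    """Render a trusted.gfid value in the dashed UUID form used in the logs."""
    return str(uuid.UUID(hex=_strip_0x(hex_value)))


def text_from_xattr(hex_value):
    """Decode a NUL-terminated string xattr such as the linkto value."""
    return bytes.fromhex(_strip_0x(hex_value)).rstrip(b"\x00").decode("ascii")


# Values copied from the getfattr output for 1k_files/file11940.
print(gfid_from_xattr("0x128d47d1a259433c9ff450faa87e4cbf"))  # hot tier data file
print(gfid_from_xattr("0xfbd44247495c4f8cbee2b3c268136cc3"))  # cold tier link file
print(text_from_xattr("0x6f7a6f6e652d686f742d64687400"))      # linkto target
```

Decoded this way, the hot tier gfid corresponds to the "gfid node" value and the cold tier gfid to the "gfid local" value in the dht_lookup_everywhere_cbk warning, and the linkto xattr names the ozone-hot-dht subvolume.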

>>>>>>>>>>>>>>>>>>>>>>>  1m_files/file4106   >>>>>>>>>>>>>>>>>>>>>>>>>>>>

[root@dhcp46-33 ~]# ls -l  /bricks/brick4/ozone/1m_files/file4106
-rw-r--r--. 2 root root 1048576 May  9 18:32 /bricks/brick4/ozone/1m_files/file4106
[root@dhcp46-33 ~]# 
[root@dhcp46-33 ~]# 
[root@dhcp46-33 ~]# 
[root@dhcp46-33 ~]# getfattr -d -m . -e hex /bricks/brick4/ozone/1m_files/file4106
getfattr: Removing leading '/' from absolute path names
# file: bricks/brick4/ozone/1m_files/file4106
security.selinux=0x73797374656d5f753a6f626a6563745f723a756e6c6162656c65645f743a733000
trusted.bit-rot.version=0x0200000000000000573081260007cf54
trusted.gfid=0xd68e87ec488a427f9adac0100d5333bc

[root@dhcp46-33 ~]#


[root@dhcp46-121 ~]# ls -l  /bricks/brick1/ozone/1m_files/file4106
---------T. 2 root root 0 May  9 18:29 /bricks/brick1/ozone/1m_files/file4106
[root@dhcp46-121 ~]# 
[root@dhcp46-121 ~]# 
[root@dhcp46-121 ~]# getfattr -d -m . -e hex /bricks/brick1/ozone/1m_files/file4106
getfattr: Removing leading '/' from absolute path names
# file: bricks/brick1/ozone/1m_files/file4106
security.selinux=0x73797374656d5f753a6f626a6563745f723a756e6c6162656c65645f743a733000
trusted.ec.config=0x0000080602000200
trusted.ec.size=0x0000000000000000
trusted.ec.version=0x00000000000000000000000000000000
trusted.gfid=0x832e8f3ed390491e8f146d51b2761410
trusted.tier.tier-dht.linkto=0x6f7a6f6e652d686f742d64687400

[root@dhcp46-121 ~]#
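
The comparison done by hand above (data file gfid versus link file gfid) can be scripted over saved `getfattr -d -m . -e hex` output. This is a sketch under assumptions: the helper names are hypothetical, and a link file is recognised simply by an xattr name ending in `.linkto`, as seen in this report.

```python
def parse_getfattr(dump):
    """Parse `getfattr -d -e hex` text into {path: {xattr_name: value}}."""
    files, current = {}, None
    for line in dump.splitlines():
        line = line.strip()
        if line.startswith("# file: "):
            current = files.setdefault(line[len("# file: "):], {})
        elif "=" in line and current is not None:
            name, _, value = line.partition("=")
            current[name] = value
    return files


def is_linkto(xattrs):
    """Assumption: any xattr name ending in .linkto marks a link file."""
    return any(name.endswith(".linkto") for name in xattrs)


def gfid_mismatch(data_xattrs, link_xattrs):
    """True when the data file and its link file carry different gfids."""
    return data_xattrs.get("trusted.gfid") != link_xattrs.get("trusted.gfid")
```

Feeding it the two dumps for 1m_files/file4106 (hot tier brick4 and cold tier brick1) would flag the pair as mismatched, which is the condition the DHT warnings report.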

Comment 14 Shyamsundar 2018-02-06 17:52:28 UTC
Thank you for your bug report.

We are not root causing this bug any further, so it is being closed as WONTFIX. Please reopen it if the problem is still observed after upgrading to the latest version.

