Bug 1610256 - [Ganesha] While performing lookups from two of the clients, "ls" command failed with "Invalid argument"
Summary: [Ganesha] While performing lookups from two of the clients, "ls" command failed with "Invalid argument"
Keywords:
Status: CLOSED CURRENTRELEASE
Alias: None
Product: GlusterFS
Classification: Community
Component: libgfapi
Version: mainline
Hardware: Unspecified
OS: Unspecified
Priority: unspecified
Severity: high
Target Milestone: ---
Assignee: Jiffin
QA Contact: bugs@gluster.org
URL:
Whiteboard:
Depends On:
Blocks: 1569657
Reported: 2018-07-31 10:12 UTC by Jiffin
Modified: 2018-10-23 15:15 UTC
CC List: 14 users

Fixed In Version: glusterfs-5.0
Clone Of: 1569657
Environment:
Last Closed: 2018-10-23 15:15:49 UTC
Regression: ---
Mount Type: ---
Documentation: ---
CRM:
Verified Versions:
Embargoed:



Description Jiffin 2018-07-31 10:12:09 UTC
+++ This bug was initially created as a clone of Bug #1569657 +++

Description of problem:

A single volume was mounted via 4 different VIPs on 4 clients (NFSv3/NFSv4). While linux untar, dbench, and iozone were running from 2 clients and parallel lookups from the other 2 clients, the lookups failed on both of those clients.

The clients on which the lookups failed ran the following sequences:
   Client 1:
  -> while true;do find . -mindepth 1 -type f;done
  -> while true;do ls -lRt;done

   Client 2:
  -> a find command in a loop

Doing "ls" on the same mount point-

[root@dhcp47-33 mani-mount]# ls
ls: reading directory .: Invalid argument
[root@dhcp47-33 mani-mount]# ls
ls: reading directory .: Invalid argument
[root@dhcp47-33 mani-mount]# ls
ls: reading directory .: Invalid argument
[root@dhcp47-33 mani-mount]# ls
ls: reading directory .: Invalid argument
[root@dhcp47-33 mani-mount]# ls
ls: reading directory .: Invalid argument
[root@dhcp47-33 mani-mount]# ls
ls: reading directory .: Invalid argument
[root@dhcp47-33 mani-mount]# ls
ls: reading directory .: Invalid argument
[root@dhcp47-33 mani-mount]# 


Files and directories can still be created on the same mount:

[root@dhcp47-33 mani-mount]# touch mani
[root@dhcp47-33 mani-mount]# touch mani1
[root@dhcp47-33 mani-mount]# touch mani2
[root@dhcp47-33 mani-mount]# touch mani3
[root@dhcp47-33 mani-mount]# mkdir ms1
[root@dhcp47-33 mani-mount]# mkdir ms2
[root@dhcp47-33 mani-mount]# ls
ls: reading directory .: Invalid argument
[root@dhcp47-33 mani-mount]# ls
ls: reading directory .: Invalid argument


Another client on which lookups failed:

[root@dhcp46-20 mani-mount]# ^C
[root@dhcp46-20 mani-mount]# ls
ls: reading directory .: Invalid argument
[root@dhcp46-20 mani-mount]# ls
ls: reading directory .: Invalid argument
[root@dhcp46-20 mani-mount]# ls
ls: reading directory .: Invalid argument
[root@dhcp46-20 mani-mount]# ls
ls: reading directory .: Invalid argument
[root@dhcp46-20 mani-mount]# ls
ls: reading directory .: Invalid argument



Unmounted and remounted the same volume on the same client with the same VIP; the issue still exists.

Mounted the same volume on another client with the same VIP; again "ls" was unable to list the contents.

Running "ls" from one of the clients on which iozone was ongoing returned data:

mani-mount]# ls
dir1  f2           linux-4.9.5.tar.xz  mani1  mani3  ms2      test
f1    linux-4.9.5  mani                mani2  ms1    run6396  test1



Version-Release number of selected component (if applicable):

# rpm -qa | grep ganesha
nfs-ganesha-gluster-2.5.5-4.el7rhgs.x86_64
glusterfs-ganesha-3.12.2-7.el7rhgs.x86_64
nfs-ganesha-2.5.5-4.el7rhgs.x86_64


How reproducible:
1/1

Steps to Reproduce:
1. Create a 4-node ganesha cluster.
2. Create a 2 x (2 + 1) arbiter volume.
3. Export the volume via ganesha.
4. Mount the volume on 4 clients using 4 different VIPs,
   2 clients with vers=3 and 2 clients with vers=4.0 (example mount commands below).
5. Perform the following data set:
  -> Client 1 (v3): run dbench first; after it completes, run iozone
  -> Client 2 (v4): lookups, finds and ls -lRt in a loop
  -> Client 3 (v3): lookups and finds
  -> Client 4 (v4): linux untars
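
Example mount commands for step 4 (the VIPs and mount point here are placeholders; the export path is the one from this report):

  # NFSv3 clients (clients 1 and 3)
  mount -t nfs -o vers=3 <VIP1>:/mani-test1 /mnt/mani-mount
  mount -t nfs -o vers=3 <VIP2>:/mani-test1 /mnt/mani-mount
  # NFSv4.0 clients (clients 2 and 4)
  mount -t nfs -o vers=4.0 <VIP3>:/mani-test1 /mnt/mani-mount
  mount -t nfs -o vers=4.0 <VIP4>:/mani-test1 /mnt/mani-mount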

Actual results:
Lookups failed from both of the clients performing lookups. There was no impact on the ongoing I/O.


Expected results:
lookups should not fail


Additional info:

No error logs explaining the lookup failures could be found in ganesha-gfapi.log.
On all 4 server nodes, ganesha is up and running.


# showmount -e
Export list for dhcp37-120.lab.eng.blr.redhat.com:
/Ganesha-lock (everyone)
/mani-test1   (everyone)

--- Additional comment from Jiffin on 2018-04-23 03:01:59 EDT ---

Reason for the error:
After performing the readdir call, ganesha's mdcache (not gluster md-cache) performs a getattr call on each entry of the dirent list to refresh its cache. When the getattr call reaches FSAL_GLUSTER, it first performs glfs_h_stat. For the directory "ms2" in the root (gfid: 59d7dc9b-e2ae-4bca-8b97-14539fe1aa7a), one of the layers in the client stack returned EINVAL (I was not able to find any packets related to this gfid). Only this server in the ganesha cluster had the issue.
The setup is no longer in that state, so I could not find which layer returned EINVAL.
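
For reference, a minimal gfapi sketch of the call path described above (this is not the FSAL_GLUSTER code; the server name is a placeholder and error handling is mostly omitted):

  /* cc stat_handle.c -o stat_handle -lgfapi */
  #include <stdio.h>
  #include <sys/stat.h>
  #include <glusterfs/api/glfs.h>
  #include <glusterfs/api/glfs-handles.h>

  int main(void)
  {
          struct stat st;
          struct glfs *fs = glfs_new("mani-test1");             /* volume from this report */
          glfs_set_volfile_server(fs, "tcp", "server1", 24007); /* placeholder server */
          glfs_init(fs);

          /* handle-based lookups, similar to the handles FSAL_GLUSTER holds */
          struct glfs_object *root = glfs_h_lookupat(fs, NULL, "/", &st, 0);
          struct glfs_object *dir  = glfs_h_lookupat(fs, root, "ms2", &st, 0);

          /* the glfs_h_stat call that returned EINVAL in this bug */
          if (glfs_h_stat(fs, dir, &st) < 0)
                  perror("glfs_h_stat");

          glfs_h_close(dir);
          glfs_h_close(root);
          glfs_fini(fs);
          return 0;
  }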

At the back end:

# getfattr -d -m "." -e hex ms2
# file: ms2
security.selinux=0x73797374656d5f753a6f626a6563745f723a676c7573746572645f627269636b5f743a733000
trusted.gfid=0x59d7dc9be2ae4bca8b9714539fe1aa7a
trusted.glusterfs.dht=0x0000000000000000000000007ffffffe
trusted.glusterfs.dht.mds=0x00000000

I found the following messages in ganesha-gfapi.log:

[2018-04-19 16:50:33.923678] E [MSGID: 101046] [dht-common.c:1857:dht_revalidate_cbk] 1-mani-test1-dht: dict is null
[2018-04-19 16:51:37.606081] I [MSGID: 109063] [dht-layout.c:713:dht_layout_normalize] 1-mani-test1-dht: Found anomalies in (null) (gfid = a0056ea8-18ac-431f-a2a0-b06a5355998f). Holes=1 overlaps=0
[2018-04-19 16:53:07.867398] I [MSGID: 109063] [dht-layout.c:713:dht_layout_normalize] 1-mani-test1-dht: Found anomalies in (null) (gfid = 859569c9-d4fa-49e0-b15a-5102b85f3c51). Holes=1 overlaps=0
[2018-04-19 16:55:28.636417] I [MSGID: 109063] [dht-layout.c:713:dht_layout_normalize] 1-mani-test1-dht: Found anomalies in (null) (gfid = 74f8d078-af5a-4cb2-9241-a1a080c47e7d). Holes=1 overlaps=0
[2018-04-19 16:59:05.204906] I [MSGID: 109063] [dht-layout.c:713:dht_layout_normalize] 1-mani-test1-dht: Found anomalies in (null) (gfid = 59e0bb2d-7ffa-444d-b071-69963db29047). Holes=1 overlaps=0
[2018-04-19 17:07:47.896369] I [MSGID: 109063] [dht-layout.c:713:dht_layout_normalize] 1-mani-test1-dht: Found anomalies in (null) (gfid = 29887570-967c-4603-a4bf-a55601b0d0f3). Holes=1 overlaps=0
[2018-04-19 17:10:27.273871] I [MSGID: 109063] [dht-layout.c:713:dht_layout_normalize] 1-mani-test1-dht: Found anomalies in (null) (gfid = 877b79e6-a47c-479d-b7b5-5879a4c21fca). Holes=1 overlaps=0
[2018-04-19 17:10:41.758168] I [MSGID: 109063] [dht-layout.c:713:dht_layout_normalize] 1-mani-test1-dht: Found anomalies in (null) (gfid = 59d7dc9b-e2ae-4bca-8b97-14539fe1aa7a). Holes=1 overlaps=0


@Manisha:

Since the setup is no longer in that state, the priority of this bug depends on how reproducible the issue is.

@Dang:

Is it okay to skip an entry whose "getattrs" failed and continue with the rest of the directory list, instead of failing the entire readdir operation?

@Nithya:
Have you encountered any similar issue with dht?

--- Additional comment from Manisha Saini on 2018-04-23 03:16:32 EDT ---

(In reply to Jiffin from comment #3)
> Since the setup is no longer in that state, the priority of this bug depends
> on how reproducible the issue is.

Jiffin, when I shared the setup, it was in the same state as far as I could tell. I don't know how the ganesha service crashed; that also needs to be looked into.
Also, since we are unable to list any files from the mount point after running "ls", this stands as a blocker to me.

I will try to reproduce the issue, but given the lack of QE bandwidth I will try to update the BZ by 26th April EOD.

Keeping the needinfo intact.


--- Additional comment from Susant Kumar Palai on 2018-04-23 05:48:00 EDT ---

From the gfapi log:

[2018-04-19 16:18:17.188419] W [MSGID: 108001] [afr-common.c:5171:afr_notify] 0-mani-test1-replicate-0: Client-quorum is not met
[2018-04-19 16:18:17.188877] I [MSGID: 114018] [client.c:2285:client_rpc_notify] 0-mani-test1-client-3: disconnected from mani-test1-client-3. Client process will keep trying to connect to glusterd until brick's port is available
[2018-04-19 16:18:17.188955] I [MSGID: 114018] [client.c:2285:client_rpc_notify] 0-mani-test1-client-4: disconnected from mani-test1-client-4. Client process will keep trying to connect to glusterd until brick's port is available
[2018-04-19 16:18:17.188976] W [MSGID: 108001] [afr-common.c:5171:afr_notify] 0-mani-test1-replicate-1: Client-quorum is not met
[2018-04-19 16:18:17.188805] I [MSGID: 114018] [client.c:2285:client_rpc_notify] 0-mani-test1-client-2: disconnected from mani-test1-client-2. Client process will keep trying to connect to glusterd until brick's port is available
[2018-04-19 16:18:17.189312] I [MSGID: 114018] [client.c:2285:client_rpc_notify] 0-mani-test1-client-5: disconnected from mani-test1-client-5. Client process will keep trying to connect to glusterd until brick's port is available
[2018-04-19 16:18:17.189342] E [MSGID: 108006] [afr-common.c:4944:__afr_handle_child_down_event] 0-mani-test1-replicate-1: All subvolumes are down. Going offline until atleast one of them comes back up.
[2018-04-19 16:18:17.190301] E [MSGID: 108006] [afr-common.c:4944:__afr_handle_child_down_event] 0-mani-test1-replicate-0: All subvolumes are down. Going offline until atleast one of them comes back up.
The message "I [MSGID: 104043] [glfs-mgmt.c:628:glfs_mgmt_getspec_cbk] 0-gfapi: No change in volfile, continuing" repeated 2 times between [2018-04-19 16:17:52.270376] and [2018-04-19 16:18:15.862041]
[2018-04-19 16:35:23.081937] I [MSGID: 109063] [dht-layout.c:713:dht_layout_normalize] 1-mani-test1-dht: Found anomalies in (null) (gfid = 7cdfc3f6-4337-4a50-a58f-0305e65cb0c0). Holes=1 overlaps=0
[2018-04-19 16:38:39.796313] I [MSGID: 109063] [dht-layout.c:713:dht_layout_normalize] 1-mani-test1-dht: Found anomalies in (null) (gfid = f4619ab7-cfc9-481c-91ad-a03fa096ecdc). Holes=1 overlaps=0
[2018-04-19 16:38:42.005686] I [MSGID: 109063] [dht-layout.c:713:dht_layout_normalize] 1-mani-test1-dht: Found anomalies in (null) (gfid = bbb57d4e-b895-4437-82bb-04c9b45a991b). Holes=1 overlaps=0
[2018-04-19 16:39:22.441298] I [MSGID: 109063] [dht-layout.c:713:dht_layout_normalize] 1-mani-test1-dht: Found anomalies in (null) (gfid = f6ac57d8-835d-4ad1-8eb2-2d970b14b312). Holes=1 overlaps=0
[2018-04-19 16:39:29.100457] I [MSGID: 109063] [dht-layout.c:713:dht_layout_normalize] 1-mani-test1-dht: Found anomalies in (null) (gfid = 29ee038c-9e56-4ff3-965a-e619d3c0eec3). Holes=1 overla

It seems the layout needed a heal and both servers went down. This will lead to lookup failures on the root itself.

Having the setup would have helped confirm the layout issue and any further client-server connection issues.

In my opinion, either the bricks were killed or there was a network partition, hence the problem.
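
For instance, the brick and heal state could be checked with (volume name taken from this report):

  # gluster volume status mani-test1
  # gluster volume heal mani-test1 info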

--- Additional comment from Daniel Gryniewicz on 2018-04-23 09:49:13 EDT ---

MDCACHE doesn't do a getattrs. The attributes of the object referenced by the dirent are passed back to MDCACHE in the callback by the FSAL. FSAL_GLUSTER uses glfs_xreaddirplus_r() to get both the file handle and its attributes, which are then passed back to MDCACHE, so no separate getattrs() should be called.

That said, MDCACHE needs the attributes when it creates the object, so we can't just skip the dirent.

--- Additional comment from Manisha Saini on 2018-04-24 06:00:05 EDT ---

(In reply to Manisha Saini from comment #4)
> I will try to reproduce the issue, but given the lack of QE bandwidth I will
> try to update the BZ by 26th April EOD.

--- Additional comment from Jiffin on 2018-06-06 02:59:38 EDT ---

I tried to recreate a similar setup on the latest ganesha build (plus the fix for 1580107) and was not able to reproduce the issue with bonnie + linux untar (for 2 hrs, tried twice) on a 2 x (2 + 1) volume. Can you please retry the following on the new build?
Please enable debug logging for ganesha and gfapi, and collect the packets on the server where "ls -ltr" is performed.

Enable debug logging for gfapi -- set diagnostics.client-log-level to DEBUG
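
For example (volume name taken from this report):

  # gluster volume set mani-test1 diagnostics.client-log-level DEBUG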

Enable debug logging for ganesha on the readdir and cache inode components by adding the following to ganesha.conf:

LOG {
        ## Default log level for all components
        #Default_Log_Level = WARN;

        ## Configure per-component log levels.
        Components {
                CACHE_INODE = FULL_DEBUG;
                CACHE_INODE_LRU = FULL_DEBUG;
                NFS_READDIR = FULL_DEBUG;
        }
}

Please restart nfs-ganesha after making these changes.
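
For example:

  # systemctl restart nfs-ganesha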

--- Additional comment from Manisha Saini on 2018-06-21 14:54:00 EDT ---

(In reply to Jiffin from comment #10)
> Please enable debug logging for ganesha and gfapi, and collect the packets
> on the server where "ls -ltr" is performed.



There are no logs generated on the ganesha server nodes to which the lookup clients are mapped. The other nodes, which serve the clients running dbench and untars, have the logs in place.

The setup details are the same as in comment #17.

The client on which lookups fail with "Invalid argument":


dhcp47-170.lab.eng.blr.redhat.com - root/redhat

[root@dhcp47-170 readdir_test]# ls
ls: reading directory .: Invalid argument
[root@dhcp47-170 readdir_test]# ls
ls: reading directory .: Invalid argument
[root@dhcp47-170 readdir_test]# ls
ls: reading directory .: Invalid argument
[root@dhcp47-170 readdir_test]# ls
ls: reading directory .: Invalid argument

Comment 1 Worker Ant 2018-07-31 10:13:34 UTC
REVIEW: https://review.gluster.org/20598 (gfapi : Set need lookup in pub_glfs_h_create_handle) posted (#1) for review on master by jiffin tony Thottan

Comment 2 Worker Ant 2018-08-06 10:06:55 UTC
REVIEW: https://review.gluster.org/20643 (cluster/dht: Extra unref on inode in discover path) posted (#3) for review on master by Susant Palai

Comment 3 Worker Ant 2018-08-14 12:04:46 UTC
COMMIT: https://review.gluster.org/20643 committed in master by "Atin Mukherjee" <amukherj> with a commit message- cluster/dht: fix inode ref management in dht_heal_path

In dht_heal_path, the inodes are created & looked up from top to down.

If the path is "a/b/c", then lookup will be done on a, then b and so
on. Here is a rough snippet of the function "dht_heal_path".

<snippet>
if (bname) {						ref_count
	- loc.inode = create/grep inode    		  1
	- syncop_lookup (loc.inode)
	- linked_inode = inode_link (loc.inode)		  2
	/*clean up current loc*/
	- loc_wipe(&loc)				  1
	/*set up parent and bname for next child */
	- loc.parent = inode
	- bname = next_child_name
}
out:
	- inode_ref (linked_inode)			  2
	- loc_wipe (&loc) 				  1
</snippet>

The problem with the above code is that if _bname_ is empty, i.e. the chain lookup is
done, then for the next iteration we populate loc.parent anyway. Now that
bname is empty, the loc_wipe is done in the _out_ section as well. Since the
loc.parent was set to the previous inode, we unintentionally lose a ref. A
dht_local_wipe as part of the DHT_STACK_UNWIND then takes away the last ref,
leading to inode_destroy.

This problem is currently observed with nfs-ganesha and the nameless lookup.
After the inode_purge, gfapi does not get the new inode to link and hence links
the inode it sent in the lookup fop, which does not have any dht-related context
(layout), leading to an "Invalid argument" error in the lookup path performed in
parallel with the tar operation.

The test was done in the following way:
 - create two nfs clients connected to two different nfs servers.
 - run untar on one client and run lookups continuously on the other.
 - Prior to this patch, "Invalid argument" was seen; it is fixed by
   the current patch.

Change-Id: Ifb90c178a2f3c16604068c7da8fa562b877f5c61
fixes: bz#1610256
Signed-off-by: Susant Palai <spalai>

Comment 4 Shyamsundar 2018-10-23 15:15:49 UTC
This bug is getting closed because a release has been made available that should address the reported issue. In case the problem is still not fixed with glusterfs-5.0, please open a new bug report.

glusterfs-5.0 has been announced on the Gluster mailing lists [1]; packages for several distributions should become available in the near future. Keep an eye on the Gluster Users mailing list [2] and the update infrastructure for your distribution.

[1] https://lists.gluster.org/pipermail/announce/2018-October/000115.html
[2] https://www.gluster.org/pipermail/gluster-users/

