Bug 1345909 - ls -lRt taking a very long time to display contents on a dist-disperse volume with nfs ganesha protocol
Summary: ls -lRt taking a very long time to display contents on a dist-disperse volume ...
Keywords:
Status: CLOSED WONTFIX
Alias: None
Product: Red Hat Gluster Storage
Classification: Red Hat Storage
Component: nfs-ganesha
Version: rhgs-3.1
Hardware: Unspecified
OS: Unspecified
Priority: unspecified
Severity: urgent
Target Milestone: ---
Target Release: ---
Assignee: Kaleb KEITHLEY
QA Contact: Manisha Saini
URL:
Whiteboard:
Depends On:
Blocks: 1330997
 
Reported: 2016-06-13 12:26 UTC by Nag Pavan Chilakam
Modified: 2020-04-06 10:55 UTC (History)
6 users

Fixed In Version:
Doc Type: If docs needed, set a value
Doc Text:
Clone Of:
Environment:
Last Closed: 2018-04-16 18:18:17 UTC
Embargoed:



Description Nag Pavan Chilakam 2016-06-13 12:26:13 UTC
As part of verification of bug 1330997 - [Disperse volume]: IO hang seen on mount with file ops,
I populated a lot of data (many files) in the volume and then issued an ls -lRt on the root.
In one case it took more than 15 hours to start displaying the contents (it may initially display only the root contents, but after that you won't see any other contents).

In this case I am just running ls -lRt on the client without doing any other I/O (single mount point).

Details of the I/O etc. that was done previously on the volume can be found in bugs such as:
1) 1330997 - [Disperse volume]: IO hang seen on mount with file ops
2) 1344675 - Stale file handle seen on the mount of dist-disperse volume when doing IOs with nfs-ganesha protocol
3) 1342426 - self heal deamon killed due to oom kills on a dist-disperse volume using nfs ganesha




[root@dhcp35-98 log]# rpm -qa|grep gluster
glusterfs-client-xlators-3.7.9-9.el7rhgs.x86_64
nfs-ganesha-gluster-2.3.1-8.el7rhgs.x86_64
glusterfs-server-3.7.9-9.el7rhgs.x86_64
glusterfs-api-3.7.9-9.el7rhgs.x86_64
glusterfs-debuginfo-3.7.9-9.el7rhgs.x86_64
glusterfs-libs-3.7.9-9.el7rhgs.x86_64
glusterfs-fuse-3.7.9-9.el7rhgs.x86_64
glusterfs-cli-3.7.9-9.el7rhgs.x86_64
glusterfs-3.7.9-9.el7rhgs.x86_64
glusterfs-ganesha-3.7.9-9.el7rhgs.x86_64
python-gluster-3.7.9-9.el7rhgs.noarch
[root@dhcp35-98 log]# rpm -qa|grep ganesha
nfs-ganesha-gluster-2.3.1-8.el7rhgs.x86_64
nfs-ganesha-2.3.1-8.el7rhgs.x86_64
nfs-ganesha-debuginfo-2.3.1-8.el7rhgs.x86_64
glusterfs-ganesha-3.7.9-9.el7rhgs.x86_64
[root@dhcp35-98 log]# rpm -qa|grep pcs
pcs-0.9.143-15.el7.x86_64

Comment 1 Soumya Koduri 2016-06-13 12:54:02 UTC
[root@dhcp35-126 ~]# mount -t nfs -o vers=3 10.70.35.98:/ec-nfsganesha /mnt/nfs
[root@dhcp35-126 ~]# cd /mnt/nfs
[root@dhcp35-126 nfs]# ls -lRt
.:
total 36910
-rw-r--r--.     1 root root  7561690 Jun 10 23:05 refreshed.ls.log
-rw-r--r--.     1 root root       55 Jun 10 18:35 sample.log
drwxr-xr-x.     2 root root     4096 Jun 10 16:08 126
-rw-r--r--.     1 root root 30212125 Jun 10 16:07 time_ls.log
drwxr-xr-x.     8 root root     4096 Jun  9 20:08 64
drwxr-xr-x. 10002 root root     4096 Jun  9 19:07 103
drwxr-xr-x.     2 root root     4096 Jun  9 17:46 162

./126:
<hang/pause>


From other mount point,

[root@dhcp35-126 ~]# ls /mnt/fuse-mnt/126 | wc -l
438717

There are 438717 entries under the '126' directory. In a packet trace taken during the listing, the majority of the traffic is 'lookup'. As part of readdir(/plus), since nfs-ganesha caches the attributes, a lookup must have been sent for each directory entry to fetch its attributes, which in turn got multiplied by 6, as EC seems to send each such lookup request to all the (4+2) bricks comprising a replica set. I think that's how EC functions.
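
For a rough sense of scale, here is a small standalone calculation of how many brick-level LOOKUP calls a single attribute-fetching pass over '126' would generate under the behaviour described above. The figures come only from this comment (438717 entries, 4+2 bricks); nothing else is measured.

#include <stdio.h>

int main(void)
{
        /* numbers from this comment: dirents under '126' and the
           4+2 EC brick count the lookups are fanned out to */
        long entries = 438717;
        int  bricks  = 6;

        printf("brick-level LOOKUP calls for one pass: %ld\n",
               entries * (long)bricks);   /* ~2.63 million */
        return 0;
}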

@Pranithk,
Could you please confirm this? What is the reason for doing a lookup on all the bricks in a replica set? Is one (or maybe 4) not enough to look up and read the attributes?



7285 12:02:24.972065  10.70.35.98 -> 10.70.35.191 GlusterFS 452 V330 LOOKUP Call
7286 12:02:24.972090  10.70.35.98 -> 10.70.35.27  GlusterFS 452 V330 LOOKUP Call
7287 12:02:24.972112  10.70.35.98 -> 10.70.35.98  GlusterFS 452 V330 LOOKUP Call
7288 12:02:24.972139  10.70.35.98 -> 10.70.35.64  GlusterFS 452 V330 LOOKUP Call
7289 12:02:24.972159  10.70.35.98 -> 10.70.35.44  GlusterFS 452 V330 LOOKUP Call
7290 12:02:24.972179  10.70.35.98 -> 10.70.35.114 GlusterFS 452 V330 LOOKUP Call
7291 12:02:24.972576  10.70.35.98 -> 10.70.35.98  GlusterFS 424 V330 LOOKUP Reply (Call In 7287)
7292 12:02:24.972600 10.70.35.191 -> 10.70.35.98  GlusterFS 424 V330 LOOKUP Reply (Call In 7285)
7293 12:02:24.972672  10.70.35.64 -> 10.70.35.98  GlusterFS 424 V330 LOOKUP Reply (Call In 7288)
7294 12:02:24.972703  10.70.35.98 -> 10.70.35.98  GlusterFS Callback 468 [RPC retransmission of #12]V1 proc-5 Call
7295 12:02:24.972711  10.70.35.98 -> 10.70.35.98  TCP 68 1012→49155 [ACK] Seq=138241 Ack=272161 Win=6147 Len=0 TSval=5579189 TSecr=5579188
7296 12:02:24.972733  10.70.35.44 -> 10.70.35.98  GlusterFS 424 V330 LOOKUP Reply (Call In 7289)
7297 12:02:24.972795  10.70.35.64 -> 10.70.35.98  GlusterFS Callback 468 [RPC retransmission of #19]V1 proc-5 Call
7298 12:02:24.972837  10.70.35.98 -> 10.70.35.64  TCP 68 1010→49155 [ACK] Seq=138241 Ack=272161 Win=4616 Len=0 TSval=5579189 TSecr=6101597
7299 12:02:24.972858  10.70.35.44 -> 10.70.35.98  GlusterFS Callback 468 [RPC retransmission of #2]V1 proc-5 Call
7300 12:02:24.972879  10.70.35.98 -> 10.70.35.44  TCP 68 1008→49155 [ACK] Seq=138241 Ack=272917 Win=1424 Len=0 TSval=5579189 TSecr=5494472
7301 12:02:24.972907 10.70.35.114 -> 10.70.35.98  GlusterFS 424 V330 LOOKUP Reply (Call In 7290)
7302 12:02:24.973029 10.70.35.114 -> 10.70.35.98  GlusterFS Callback 468 [RPC retransmission of #22]V1 proc-5 Call
7303 12:02:24.973045  10.70.35.98 -> 10.70.35.114 TCP 68 986→49155 [ACK] Seq=138241 Ack=272161 Win=3076 Len=0 TSval=5579189 TSecr=6100026
7304 12:02:24.976909  10.70.35.27 -> 10.70.35.98  GlusterFS 424 V330 LOOKUP Reply (Call In 7286)


Also, the messages below seem to be related to cache_invalidation calls. Since there is only a single client involved, these messages need to be investigated as well.

>>> 7302 12:02:24.973029 10.70.35.114 -> 10.70.35.98  GlusterFS Callback 468 [RPC retransmission of #22]V1 proc-5 Call

Comment 2 Soumya Koduri 2016-06-13 12:55:36 UTC
(gdb) b upcall_client_cache_invalidate
Breakpoint 1 at 0x7f1ee6c99760: file upcall-internal.c, line 580.
(gdb) c
Continuing.
[Switching to Thread 0x7f1ec6d97700 (LWP 2853)]

Breakpoint 1, upcall_client_cache_invalidate (this=this@entry=0x7f1ee80130d0, 
    gfid=gfid@entry=0x7f1ea004714c "\371\273\311|\265\356G\f\261\346\206\022\351\225Yx", up_client_entry=up_client_entry@entry=0x7f1ea007a820, 
    flags=flags@entry=256, stbuf=stbuf@entry=0x0, p_stbuf=p_stbuf@entry=0x0, 
    oldp_stbuf=oldp_stbuf@entry=0x0) at upcall-internal.c:580
580	{
(gdb) bt
#0  upcall_client_cache_invalidate (this=this@entry=0x7f1ee80130d0, 
    gfid=gfid@entry=0x7f1ea004714c "\371\273\311|\265\356G\f\261\346\206\022\351\225Yx", up_client_entry=up_client_entry@entry=0x7f1ea007a820, 
    flags=flags@entry=256, stbuf=stbuf@entry=0x0, p_stbuf=p_stbuf@entry=0x0, 
    oldp_stbuf=oldp_stbuf@entry=0x0) at upcall-internal.c:580
#1  0x00007f1ee6c99f62 in upcall_cache_forget (this=0x7f1ee80130d0, 
    inode=<optimized out>, up_inode_ctx=0x7f1ea0047100) at upcall-internal.c:665
#2  0x00007f1ee6c9a004 in upcall_cleanup_inode_ctx (this=<optimized out>, 
    inode=0x7f1ec7c0482c) at upcall-internal.c:368
#3  0x00007f1ee6c8dca9 in upcall_forget (this=<optimized out>, 
    inode=<optimized out>) at upcall.c:1722
#4  0x00007f1efa4bcf49 in __inode_ctx_free (inode=inode@entry=0x7f1ec7c0482c)
    at inode.c:337
#5  0x00007f1efa4bdfb7 in __inode_destroy (inode=0x7f1ec7c0482c) at inode.c:358
#6  inode_table_prune (table=table@entry=0x7f1ee80cbf60) at inode.c:1527
#7  0x00007f1efa4be244 in inode_unref (inode=0x7f1ec7bf5ec8) at inode.c:529
#8  0x00007f1efa491ee2 in loc_wipe (loc=loc@entry=0x7f1ef7a4aa30) at xlator.c:695
#9  0x00007f1efa4d0244 in call_stub_wipe_args (stub=0x7f1ef7a4a9f0)
    at call-stub.c:2511
#10 call_stub_destroy (stub=0x7f1ef7a4a9f0) at call-stub.c:2550
#11 0x00007f1ee6a84363 in iot_worker (data=0x7f1ee8054f50) at io-threads.c:215
#12 0x00007f1ef92fcdc5 in start_thread () from /lib64/libpthread.so.0
#13 0x00007f1ef8c4321d in clone () from /lib64/libc.so.6
(gdb) f 6
#6  inode_table_prune (table=table@entry=0x7f1ee80cbf60) at inode.c:1527
1527	                        __inode_destroy (del);
(gdb) l
1522	
1523	        {
1524	                list_for_each_entry_safe (del, tmp, &purge, list) {
1525	                        list_del_init (&del->list);
1526	                        __inode_forget (del, 0);
1527	                        __inode_destroy (del);
1528	                }
1529	        }
1530	
1531	        return ret;
(gdb) 


As suspected, the cache-invalidation messages seem to have been sent as part of inode_table_prune (LRU reaping). When the server does inode_forget, since it can no longer maintain upcall state for that inode, a cache_invalidation message is sent to each client to notify it of the same.
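
For readability, a simplified sketch of the flow the backtrace shows: once LRU pruning forgets an inode, the upcall xlator walks the clients that had registered interest in it and sends each one a cache-invalidation callback. This is not the actual GlusterFS source (that is in upcall-internal.c / upcall.c, as in the backtrace above); the struct layouts and send_cache_invalidation_rpc() are placeholders.

struct upcall_client {
        struct upcall_client *next;
        char                 *client_uid;   /* which NFS/gfapi client to notify */
};

struct upcall_inode_ctx {
        char                  gfid[16];     /* inode being forgotten            */
        struct upcall_client *clients;      /* clients holding cached state     */
};

/* hypothetical stand-in for the real upcall notification RPC path */
void send_cache_invalidation_rpc(const char *client_uid,
                                 const char *gfid, int flags);

/* rough equivalent of upcall_cache_forget(): the brick can no longer
   track upcall state for this inode, so every interested client is told
   to drop its cached attributes for the gfid */
static void cache_forget(struct upcall_inode_ctx *ctx)
{
        struct upcall_client *c;

        for (c = ctx->clients; c; c = c->next)
                send_cache_invalidation_rpc(c->client_uid, ctx->gfid,
                                            256 /* flag value seen in the
                                                   breakpoint above */);
}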

Comment 5 Soumya Koduri 2016-06-15 11:11:16 UTC
When checked with Pranith, he mentioned that this is by design, i.e., a lookup on any entry will be sent to all the bricks in a subvolume. So the best approach now seems to be to optimize nfs-ganesha to avoid lookups as part of readdir. Adjusting the BZ components accordingly.

When this was discussed within the nfs-ganesha community, there was a consensus that we could optimize readdir(/plus) operations by allowing FSALs to send attributes to the readdir callback routine, which can then be used to create the cache-inode(/md-cache) entry for that dirent, avoiding the lookup and stat. However, please note that this enhancement may be pursued only on the current dev branch (upstream 2.4).
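
To make the idea concrete, an illustrative sketch of a readdir callback that carries per-dirent attributes, so the caching layer can be filled in without a follow-up lookup+getattr per entry. This is not the nfs-ganesha 2.4 API; all names here are hypothetical.

/* Hypothetical interface, for illustration only. */

struct dirent_attrs {
        unsigned long long size;
        unsigned int       mode;
        long               mtime;
        /* ... whatever else the caching layer needs ... */
};

/* the callback now receives the attributes gathered during readdirp,
   instead of the caller issuing a separate LOOKUP per entry */
typedef int (*readdir_attr_cb)(const char *name,
                               const struct dirent_attrs *attrs,
                               void *cb_arg);

/* hypothetical FSAL entry point: one brick-side readdirp supplies both
   names and attributes, so the per-entry LOOKUP fan-out goes away */
int fsal_readdir_with_attrs(void *dir_handle,
                            readdir_attr_cb cb, void *cb_arg);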

Comment 6 Kaleb KEITHLEY 2016-06-20 12:59:00 UTC
The bug is really in ganesha. Optimizing ganesha is a stretch goal.

Comment 7 Nag Pavan Chilakam 2016-11-21 09:01:29 UTC
As Pranith has explained the problem, I am accepting it, since the details have been provided.

