As part of verification of bug 1330997 - [Disperse volume]: IO hang seen on mount with file ops, I populated a lot of data (many files) in the volume and then issued an ls -lRt on the root. In one case it took more than 15 hours to start displaying the contents (it may initially display only the root contents, but after that you won't see any other contents). In this case I am just running ls -lRt on the client without doing any other IOs (single mount point).

Details of the IO etc. done previously on the volume can be found in bugs like:

1) 1330997 - [Disperse volume]: IO hang seen on mount with file ops
2) 1344675 - Stale file handle seen on the mount of dist-disperse volume when doing IOs with nfs-ganesha protocol
3) 1342426 - self heal deamon killed due to oom kills on a dist-disperse volume using nfs ganesha

[root@dhcp35-98 log]# rpm -qa|grep gluster
glusterfs-client-xlators-3.7.9-9.el7rhgs.x86_64
nfs-ganesha-gluster-2.3.1-8.el7rhgs.x86_64
glusterfs-server-3.7.9-9.el7rhgs.x86_64
glusterfs-api-3.7.9-9.el7rhgs.x86_64
glusterfs-debuginfo-3.7.9-9.el7rhgs.x86_64
glusterfs-libs-3.7.9-9.el7rhgs.x86_64
glusterfs-fuse-3.7.9-9.el7rhgs.x86_64
glusterfs-cli-3.7.9-9.el7rhgs.x86_64
glusterfs-3.7.9-9.el7rhgs.x86_64
glusterfs-ganesha-3.7.9-9.el7rhgs.x86_64
python-gluster-3.7.9-9.el7rhgs.noarch
[root@dhcp35-98 log]# rpm -qa|grep ganesha
nfs-ganesha-gluster-2.3.1-8.el7rhgs.x86_64
nfs-ganesha-2.3.1-8.el7rhgs.x86_64
nfs-ganesha-debuginfo-2.3.1-8.el7rhgs.x86_64
glusterfs-ganesha-3.7.9-9.el7rhgs.x86_64
[root@dhcp35-98 log]# rpm -qa|grep pcs
pcs-0.9.143-15.el7.x86_64
[root@dhcp35-126 ~]# mount -t nfs -o vers=3 10.70.35.98:/ec-nfsganesha /mnt/nfs
[root@dhcp35-126 ~]# cd /mnt/nfs
[root@dhcp35-126 nfs]# ls -lRt
.:
total 36910
-rw-r--r--.     1 root root  7561690 Jun 10 23:05 refreshed.ls.log
-rw-r--r--.     1 root root       55 Jun 10 18:35 sample.log
drwxr-xr-x.     2 root root     4096 Jun 10 16:08 126
-rw-r--r--.     1 root root 30212125 Jun 10 16:07 time_ls.log
drwxr-xr-x.     8 root root     4096 Jun  9 20:08 64
drwxr-xr-x. 10002 root root     4096 Jun  9 19:07 103
drwxr-xr-x.     2 root root     4096 Jun  9 17:46 162

./126:
<hang/pause>

From another mount point:

[root@dhcp35-126 ~]# ls /mnt/fuse-mnt/126 | wc -l
438717

There are 438717 entries under the '126' directory.

When a packet trace was taken, we could see that the majority of the traffic is LOOKUP. As part of readdir(/plus), since nfs-ganesha caches the attributes, a lookup must have been sent for each directory entry to fetch its attributes, which in turn got multiplied by 6, as EC seems to send each such lookup request to all the (4+2) bricks comprising a subvolume. I think that's how EC functions.

@Pranithk, could you please confirm the same? What is the reason behind doing the lookup on all the bricks in a subvolume? Would one (or maybe 4) not be enough to look up and read the attributes?
7285 12:02:24.972065  10.70.35.98 -> 10.70.35.191  GlusterFS 452 V330 LOOKUP Call
7286 12:02:24.972090  10.70.35.98 -> 10.70.35.27   GlusterFS 452 V330 LOOKUP Call
7287 12:02:24.972112  10.70.35.98 -> 10.70.35.98   GlusterFS 452 V330 LOOKUP Call
7288 12:02:24.972139  10.70.35.98 -> 10.70.35.64   GlusterFS 452 V330 LOOKUP Call
7289 12:02:24.972159  10.70.35.98 -> 10.70.35.44   GlusterFS 452 V330 LOOKUP Call
7290 12:02:24.972179  10.70.35.98 -> 10.70.35.114  GlusterFS 452 V330 LOOKUP Call
7291 12:02:24.972576  10.70.35.98 -> 10.70.35.98   GlusterFS 424 V330 LOOKUP Reply (Call In 7287)
7292 12:02:24.972600  10.70.35.191 -> 10.70.35.98  GlusterFS 424 V330 LOOKUP Reply (Call In 7285)
7293 12:02:24.972672  10.70.35.64 -> 10.70.35.98   GlusterFS 424 V330 LOOKUP Reply (Call In 7288)
7294 12:02:24.972703  10.70.35.98 -> 10.70.35.98   GlusterFS Callback 468 [RPC retransmission of #12]V1 proc-5 Call
7295 12:02:24.972711  10.70.35.98 -> 10.70.35.98   TCP 68 1012→49155 [ACK] Seq=138241 Ack=272161 Win=6147 Len=0 TSval=5579189 TSecr=5579188
7296 12:02:24.972733  10.70.35.44 -> 10.70.35.98   GlusterFS 424 V330 LOOKUP Reply (Call In 7289)
7297 12:02:24.972795  10.70.35.64 -> 10.70.35.98   GlusterFS Callback 468 [RPC retransmission of #19]V1 proc-5 Call
7298 12:02:24.972837  10.70.35.98 -> 10.70.35.64   TCP 68 1010→49155 [ACK] Seq=138241 Ack=272161 Win=4616 Len=0 TSval=5579189 TSecr=6101597
7299 12:02:24.972858  10.70.35.44 -> 10.70.35.98   GlusterFS Callback 468 [RPC retransmission of #2]V1 proc-5 Call
7300 12:02:24.972879  10.70.35.98 -> 10.70.35.44   TCP 68 1008→49155 [ACK] Seq=138241 Ack=272917 Win=1424 Len=0 TSval=5579189 TSecr=5494472
7301 12:02:24.972907  10.70.35.114 -> 10.70.35.98  GlusterFS 424 V330 LOOKUP Reply (Call In 7290)
7302 12:02:24.973029  10.70.35.114 -> 10.70.35.98  GlusterFS Callback 468 [RPC retransmission of #22]V1 proc-5 Call
7303 12:02:24.973045  10.70.35.98 -> 10.70.35.114  TCP 68 986→49155 [ACK] Seq=138241 Ack=272161 Win=3076 Len=0 TSval=5579189 TSecr=6100026
7304 12:02:24.976909  10.70.35.27 -> 10.70.35.98   GlusterFS 424 V330 LOOKUP Reply (Call In 7286)

Also, the messages below seem to be related to cache_invalidation calls. Since there is only a single client involved, these messages need to be investigated as well.

>>> 7302 12:02:24.973029  10.70.35.114 -> 10.70.35.98  GlusterFS Callback 468 [RPC retransmission of #22]V1 proc-5 Call
(gdb) b upcall_client_cache_invalidate
Breakpoint 1 at 0x7f1ee6c99760: file upcall-internal.c, line 580.
(gdb) c
Continuing.
[Switching to Thread 0x7f1ec6d97700 (LWP 2853)]

Breakpoint 1, upcall_client_cache_invalidate (this=this@entry=0x7f1ee80130d0,
    gfid=gfid@entry=0x7f1ea004714c "\371\273\311|\265\356G\f\261\346\206\022\351\225Yx",
    up_client_entry=up_client_entry@entry=0x7f1ea007a820, flags=flags@entry=256,
    stbuf=stbuf@entry=0x0, p_stbuf=p_stbuf@entry=0x0, oldp_stbuf=oldp_stbuf@entry=0x0)
    at upcall-internal.c:580
580     {
(gdb) bt
#0  upcall_client_cache_invalidate (this=this@entry=0x7f1ee80130d0,
    gfid=gfid@entry=0x7f1ea004714c "\371\273\311|\265\356G\f\261\346\206\022\351\225Yx",
    up_client_entry=up_client_entry@entry=0x7f1ea007a820, flags=flags@entry=256,
    stbuf=stbuf@entry=0x0, p_stbuf=p_stbuf@entry=0x0, oldp_stbuf=oldp_stbuf@entry=0x0)
    at upcall-internal.c:580
#1  0x00007f1ee6c99f62 in upcall_cache_forget (this=0x7f1ee80130d0, inode=<optimized out>, up_inode_ctx=0x7f1ea0047100) at upcall-internal.c:665
#2  0x00007f1ee6c9a004 in upcall_cleanup_inode_ctx (this=<optimized out>, inode=0x7f1ec7c0482c) at upcall-internal.c:368
#3  0x00007f1ee6c8dca9 in upcall_forget (this=<optimized out>, inode=<optimized out>) at upcall.c:1722
#4  0x00007f1efa4bcf49 in __inode_ctx_free (inode=inode@entry=0x7f1ec7c0482c) at inode.c:337
#5  0x00007f1efa4bdfb7 in __inode_destroy (inode=0x7f1ec7c0482c) at inode.c:358
#6  inode_table_prune (table=table@entry=0x7f1ee80cbf60) at inode.c:1527
#7  0x00007f1efa4be244 in inode_unref (inode=0x7f1ec7bf5ec8) at inode.c:529
#8  0x00007f1efa491ee2 in loc_wipe (loc=loc@entry=0x7f1ef7a4aa30) at xlator.c:695
#9  0x00007f1efa4d0244 in call_stub_wipe_args (stub=0x7f1ef7a4a9f0) at call-stub.c:2511
#10 call_stub_destroy (stub=0x7f1ef7a4a9f0) at call-stub.c:2550
#11 0x00007f1ee6a84363 in iot_worker (data=0x7f1ee8054f50) at io-threads.c:215
#12 0x00007f1ef92fcdc5 in start_thread () from /lib64/libpthread.so.0
#13 0x00007f1ef8c4321d in clone () from /lib64/libc.so.6
(gdb) f 6
#6  inode_table_prune (table=table@entry=0x7f1ee80cbf60) at inode.c:1527
1527                    __inode_destroy (del);
(gdb) l
1522
1523    {
1524            list_for_each_entry_safe (del, tmp, &purge, list) {
1525                    list_del_init (&del->list);
1526                    __inode_forget (del, 0);
1527                    __inode_destroy (del);
1528            }
1529    }
1530
1531            return ret;
(gdb)

As suspected, the cache-invalidation messages seem to have been sent as part of inode_table_prune (LRU reaping). When the server does inode_forget, since it can no longer maintain upcall state for that inode, a cache_invalidation message is sent to each client to notify it about the same.
When checked with Pranith, he mentioned that it's by design, i.e., a lookup on any entry will be sent to all the bricks in a subvol. So the best approach now seems to be to optimize nfs-ganesha to avoid lookups as part of readdir. Adjusting the BZ components.

When this was discussed within the nfs-ganesha community, there was a consensus that we could optimize readdir(/plus) operations by allowing FSALs to pass attributes to the readdir callback routine, so that the cache-inode (/md-cache) entry for each dirent can be created without the extra lookup and stat. However, please note that this enhancement may be pursued only on the current dev branch (upstream 2.4).
The bug is really in ganesha. It is a stretch goal to optimize ganesha.
As Pranith has explained the problem, I am taking it that the details have been provided.