Description of problem:
find commands hang on the client for more than 12 hours when new writes are running in parallel.

Version-Release number of selected component (if applicable):
glusterfs-ganesha-3.12.2-11.el7rhgs.x86_64
nfs-ganesha-2.5.5-7.el7rhgs.x86_64
nfs-ganesha-gluster-2.5.5-7.el7rhgs.x86_64
nfs-ganesha-debuginfo-2.5.5-7.el7rhgs.x86_64

How reproducible:
2/2

Steps to Reproduce:
1. Create a 6-node ganesha cluster.
2. Create a 6*(4+2) Distributed-Disperse volume and export it via ganesha.
3. Mount the volume on 4 clients using 4 different VIPs.
4. Clients 1, 2, and 3: run a dd command in a loop.
5. Client 4: run find in a loop (while true; do find . -mindepth 1 -type f; done).

Actual results:
After roughly 2 hours, find hung on client 4 for more than 12 hours while new writes were running in parallel.

Expected results:
find should not hang while new writes are running in parallel.

Additional info:
Attaching gstack output, tcpdumps, and sosreports shortly.
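A bounded sketch of the workload in the steps above. The paths here are hypothetical stand-ins for the ganesha NFS mount, and each loop runs once instead of indefinitely; the original reproducer runs the dd and find loops forever on separate clients.

```shell
# MNT is a hypothetical local directory standing in for the NFS mount point.
MNT=/tmp/repro_mnt
mkdir -p "$MNT"

# Clients 1-3 in the original each run dd in an endless loop;
# a single small write per "client" is shown here.
for i in 1 2 3; do
    dd if=/dev/zero of="$MNT/client$i.dat" bs=1K count=1 2>/dev/null
done

# Client 4 runs find in an endless loop; one pass is shown here.
find "$MNT" -mindepth 1 -type f
```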
Based on the packet capture, I believe this is not a Ganesha issue. The NFS traffic in that file consists almost entirely of READDIR calls and READDIR replies (with the odd RENEW). The average time between a READDIR and its REPLY is 0.0003 seconds (!), with occasional delays as long as 0.001 seconds. However, there are many delays of up to 2 seconds between a REPLY and the next READDIR, which is the client's fault, not Ganesha's. Something on the client is causing huge delays. This is likely to be Gluster traffic: there are only 915 NFS packets in the trace versus 1.2 million Gluster packets (~12k/second), so the network is spending most of its time on Gluster traffic.
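The distinction drawn above (fast server replies vs. long client-side gaps before the next request) can be checked mechanically. A minimal sketch, assuming the READDIR call/reply timestamps have been exported from the capture into a two-column text file (for example via tshark -T fields); the filename and the sample values are made up for illustration:

```shell
# Each line: <call_timestamp> <reply_timestamp> for one READDIR round trip.
# These sample numbers are invented to mirror the pattern described above.
cat > /tmp/readdir_times.txt <<'EOF'
0.000 0.0003
0.0004 0.0007
2.0007 2.0010
EOF

# Mean server latency (call -> reply) and the largest client-side gap
# (previous reply -> next call). A large gap with a tiny mean latency
# points at the client, not at Ganesha.
awk '{ lat += $2 - $1; n++ }
     prev != "" && $1 - prev > maxgap { maxgap = $1 - prev }
     { prev = $2 }
     END { printf "mean_latency=%.4f max_client_gap=%.4f\n", lat/n, maxgap }' \
    /tmp/readdir_times.txt
```

With the sample data this prints a mean latency of 0.0003s but a 2-second gap between a reply and the next request, matching the pattern seen in the real trace.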
(For the record, you can do either of the 2 workarounds, but don't need to do both.)