Description of problem:
-----------------------
4 node cluster, 1 EC volume exported via Ganesha. Mounted the EC volume on 4 clients via v3/v4 and ran find, du -sh, and ll -R on each mount. On one of the clients (gqac024, mounted via gqas014), find and ll -R hung for more than 2 hours. A packet trace taken on the client showed no packets being sent from the client at all. sosreports and packet traces are in the comments.

Version-Release number of selected component (if applicable):
------------------------------------------------------------
nfs-ganesha-2.4.1-6.el7rhgs.x86_64
nfs-ganesha-debuginfo-2.4.1-6.el7rhgs.x86_64
glusterfs-ganesha-3.8.4-12.el7rhgs.x86_64

How reproducible:
-----------------
1/1

Steps to Reproduce:
-------------------
1. Mount the EC volume via v3/v4 on multiple clients.
2. Run no new writes; only ls, stat, du -sh, and find (a sketch of the workload follows the status output below).
3. Monitor the mount points.

Actual results:
---------------
ll -R and find hang.

Expected results:
-----------------
No hangs on the mount point.

Additional info:
----------------
[root@gqas009 ~]# gluster v status
Status of volume: gluster_shared_storage
Gluster process                                                       TCP Port  RDMA Port  Online  Pid
-------------------------------------------------------------------------------------------------------
Brick gqas015.sbu.lab.eng.bos.redhat.com:/var/lib/glusterd/ss_brick   49152     0          Y       26457
Brick gqas014.sbu.lab.eng.bos.redhat.com:/var/lib/glusterd/ss_brick   49152     0          Y       25391
Brick gqas009.sbu.lab.eng.bos.redhat.com:/var/lib/glusterd/ss_brick   49152     0          Y       25747
Self-heal Daemon on localhost                                         N/A       N/A        Y       17960
Self-heal Daemon on gqas010.sbu.lab.eng.bos.redhat.com                N/A       N/A        Y       13756
Self-heal Daemon on gqas015.sbu.lab.eng.bos.redhat.com                N/A       N/A        Y       17415
Self-heal Daemon on gqas014.sbu.lab.eng.bos.redhat.com                N/A       N/A        Y       17200

Task Status of Volume gluster_shared_storage
-------------------------------------------------------------------------------------------------------
There are no active volume tasks

Status of volume: replicate
Gluster process                                                       TCP Port  RDMA Port  Online  Pid
-------------------------------------------------------------------------------------------------------
Brick gqas014.sbu.lab.eng.bos.redhat.com:/bricks12/bricknew           49153     0          Y       27931
Brick gqas010.sbu.lab.eng.bos.redhat.com:/bricks12/bricknew           49152     0          Y       27177
Self-heal Daemon on localhost                                         N/A       N/A        Y       17960
Self-heal Daemon on gqas015.sbu.lab.eng.bos.redhat.com                N/A       N/A        Y       17415
Self-heal Daemon on gqas010.sbu.lab.eng.bos.redhat.com                N/A       N/A        Y       13756
Self-heal Daemon on gqas014.sbu.lab.eng.bos.redhat.com                N/A       N/A        Y       17200

Task Status of Volume replicate
-------------------------------------------------------------------------------------------------------
There are no active volume tasks

Status of volume: testvol
Gluster process                                                       TCP Port  RDMA Port  Online  Pid
-------------------------------------------------------------------------------------------------------
Brick gqas009.sbu.lab.eng.bos.redhat.com:/bricks1/brick1              49158     0          Y       29725
Brick gqas010.sbu.lab.eng.bos.redhat.com:/bricks1/brick               49164     0          Y       24750
Brick gqas014.sbu.lab.eng.bos.redhat.com:/bricks1/brick               49164     0          Y       24867
Brick gqas015.sbu.lab.eng.bos.redhat.com:/bricks1/brick               49164     0          Y       25931
Brick gqas009.sbu.lab.eng.bos.redhat.com:/bricks3/brick               49165     0          Y       25201
Brick gqas010.sbu.lab.eng.bos.redhat.com:/bricks3/brick               49165     0          Y       24769
Brick gqas009.sbu.lab.eng.bos.redhat.com:/bricks2/brick1              49153     0          Y       29771
Brick gqas010.sbu.lab.eng.bos.redhat.com:/bricks2/brick               49166     0          Y       24788
Brick gqas014.sbu.lab.eng.bos.redhat.com:/bricks2/brick               49165     0          Y       24886
Brick gqas015.sbu.lab.eng.bos.redhat.com:/bricks2/brick               49165     0          Y       25950
Brick gqas014.sbu.lab.eng.bos.redhat.com:/bricks3/brick               49166     0          Y       24905
Brick gqas015.sbu.lab.eng.bos.redhat.com:/bricks3/brick               49166     0          Y       25969
Brick gqas009.sbu.lab.eng.bos.redhat.com:/bricks4/brick1              49154     0          Y       29827
Brick gqas010.sbu.lab.eng.bos.redhat.com:/bricks4/brick               49167     0          Y       24807
Brick gqas015.sbu.lab.eng.bos.redhat.com:/bricks4/brick               49167     0          Y       25988
Brick gqas014.sbu.lab.eng.bos.redhat.com:/bricks4/brick               49167     0          Y       24924
Brick gqas009.sbu.lab.eng.bos.redhat.com:/bricks5/brick               49168     0          Y       25258
Brick gqas010.sbu.lab.eng.bos.redhat.com:/bricks5/brick               49168     0          Y       24826
Brick gqas009.sbu.lab.eng.bos.redhat.com:/bricks6/brick               49169     0          Y       25277
Brick gqas010.sbu.lab.eng.bos.redhat.com:/bricks6/brick               49169     0          Y       24845
Brick gqas015.sbu.lab.eng.bos.redhat.com:/bricks6/brick               49168     0          Y       26007
Brick gqas014.sbu.lab.eng.bos.redhat.com:/bricks6/brick               49168     0          Y       24943
Brick gqas014.sbu.lab.eng.bos.redhat.com:/bricks5/brick               49169     0          Y       24962
Brick gqas015.sbu.lab.eng.bos.redhat.com:/bricks5/brick               49169     0          Y       26026
Brick gqas009.sbu.lab.eng.bos.redhat.com:/bricks7/brick1              49155     0          Y       29909
Brick gqas010.sbu.lab.eng.bos.redhat.com:/bricks7/brick               49170     0          Y       24864
Brick gqas015.sbu.lab.eng.bos.redhat.com:/bricks7/brick               49170     0          Y       26045
Brick gqas014.sbu.lab.eng.bos.redhat.com:/bricks7/brick               49170     0          Y       24981
Brick gqas010.sbu.lab.eng.bos.redhat.com:/bricks8/brick               49171     0          Y       24883
Brick gqas009.sbu.lab.eng.bos.redhat.com:/bricks8/brick               49171     0          Y       25315
Brick gqas009.sbu.lab.eng.bos.redhat.com:/bricks9/brick               49172     0          Y       25336
Brick gqas010.sbu.lab.eng.bos.redhat.com:/bricks9/brick               49172     0          Y       24902
Brick gqas015.sbu.lab.eng.bos.redhat.com:/bricks9/brick               49171     0          Y       26064
Brick gqas014.sbu.lab.eng.bos.redhat.com:/bricks9/brick               49171     0          Y       25000
Brick gqas014.sbu.lab.eng.bos.redhat.com:/bricks8/brick               49172     0          Y       25019
Brick gqas015.sbu.lab.eng.bos.redhat.com:/bricks8/brick               49172     0          Y       26083
Brick gqas009.sbu.lab.eng.bos.redhat.com:/bricks10/brick              49173     0          Y       25355
Brick gqas010.sbu.lab.eng.bos.redhat.com:/bricks10/brick              49173     0          Y       24921
Brick gqas015.sbu.lab.eng.bos.redhat.com:/bricks10/brick              49173     0          Y       26102
Brick gqas014.sbu.lab.eng.bos.redhat.com:/bricks10/brick              49173     0          Y       25038
Brick gqas009.sbu.lab.eng.bos.redhat.com:/bricks11/brick1             49156     0          Y       30009
Brick gqas010.sbu.lab.eng.bos.redhat.com:/bricks11/brick              49174     0          Y       24940
Brick gqas009.sbu.lab.eng.bos.redhat.com:/bricks12/brick              49175     0          Y       25393
Brick gqas010.sbu.lab.eng.bos.redhat.com:/bricks12/brick              49175     0          Y       24959
Brick gqas015.sbu.lab.eng.bos.redhat.com:/bricks12/brick              49174     0          Y       26121
Brick gqas014.sbu.lab.eng.bos.redhat.com:/bricks12/brick              49174     0          Y       25057
Brick gqas014.sbu.lab.eng.bos.redhat.com:/bricks11/brick              49175     0          Y       25076
Brick gqas015.sbu.lab.eng.bos.redhat.com:/bricks11/brick              49175     0          Y       26140
Self-heal Daemon on localhost                                         N/A       N/A        Y       17960
Self-heal Daemon on gqas010.sbu.lab.eng.bos.redhat.com                N/A       N/A        Y       13756
Self-heal Daemon on gqas015.sbu.lab.eng.bos.redhat.com                N/A       N/A        Y       17415
Self-heal Daemon on gqas014.sbu.lab.eng.bos.redhat.com                N/A       N/A        Y       17200

Task Status of Volume testvol
-------------------------------------------------------------------------------------------------------
Task                 : Rebalance
ID                   : 60d3e3e4-661c-4520-9f5f-482d95d81a82
Status               : in progress

[root@gqas009 ~]#
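For reference, a minimal sketch of the workload and of the client-side diagnosis (the mount options, mount path, and tcpdump invocation here are illustrative assumptions; gqas014 and the NFS port match the setup described above):

# Mount the Ganesha-exported EC volume (v3 on some clients, v4 on others).
mount -t nfs -o vers=3 gqas014.sbu.lab.eng.bos.redhat.com:/testvol /mnt/testvol

# Read-only metadata workload; no new writes.
find /mnt/testvol > /dev/null
du -sh /mnt/testvol
ls -lR /mnt/testvol > /dev/null

# If a command hangs, capture a packet trace on the client to check whether
# any NFS traffic is leaving the host at all (in this report, none was).
tcpdump -i any -w /tmp/client.pcap port 2049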
Proposing this as a blocker for 3.2, since the application side is impacted.
*Sidenote*: The same test passed with EC over gNFS.
The setup is in the same state in case someone wants to take a look.
Verified this with the readdir chunking code disabled:

# rpm -qa | grep ganesha
nfs-ganesha-gluster-2.5.5-10.el7rhgs.x86_64
nfs-ganesha-debuginfo-2.5.5-10.el7rhgs.x86_64
nfs-ganesha-2.5.5-10.el7rhgs.x86_64
glusterfs-ganesha-3.12.2-16.el7rhgs.x86_64

Steps performed to verify the issue:
1. Create a 6 node ganesha cluster.
2. Create a 6 x (4 + 2) Distributed-Disperse volume and enable ganesha on it.
3. Mount the volume on 4 clients via v3/v4 through the same VIP.
4. Create a huge data set consisting of small, large, and empty directory sets (a sketch of one way to generate such a data set follows this comment). In detail: an equal number of files, approximately 1.1 million, averaging 8k in size, in each of the large and small directory sets. The small directory set had 12.5k directories with at most 100 files per directory; the large directory set comprised 50 directories with approximately 20k files per directory; the empty directory set consisted of 12.5k directories.
5. Once the data set is created, trigger recursive find, du -sh, and ll -R from 3 clients.

No hangs were observed for ll -R and find running recursively. However, du -sh took a long time (~2.5 hours) for a data set of ~134 GB; that is being tracked separately in https://bugzilla.redhat.com/show_bug.cgi?id=1622281.

Moving this bug to the verified state, since the ll -R and find hang issue appears to be resolved.
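For reproducibility, a rough sketch of how a data set of this shape could be generated (the directory and file counts follow the description above; the target paths, naming scheme, and use of dd to produce 8k files are assumptions, not the exact tooling used):

#!/bin/bash
# Small directory set: 12.5k directories with up to 100 files of ~8k each.
for d in $(seq 1 12500); do
    mkdir -p /mnt/testvol/small/dir$d
    for f in $(seq 1 100); do
        dd if=/dev/urandom of=/mnt/testvol/small/dir$d/file$f bs=8k count=1 status=none
    done
done

# Large directory set: 50 directories with ~20k files of ~8k each.
for d in $(seq 1 50); do
    mkdir -p /mnt/testvol/large/dir$d
    for f in $(seq 1 20000); do
        dd if=/dev/urandom of=/mnt/testvol/large/dir$d/file$f bs=8k count=1 status=none
    done
done

# Empty directory set: 12.5k directories with no files.
for d in $(seq 1 12500); do
    mkdir -p /mnt/testvol/empty/dir$d
done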
This should be moved out of 3.4, since the dirent chunking code has been removed.
Verified this BZ with:

# rpm -qa | grep ganesha
nfs-ganesha-2.7.3-7.el7rhgs.x86_64
glusterfs-ganesha-6.0-11.el7rhgs.x86_64
nfs-ganesha-gluster-2.7.3-7.el7rhgs.x86_64

Steps performed for verification:
1. Create a 4 node ganesha cluster.
2. Create a 2 x (4 + 2) Distributed-Disperse volume and enable ganesha on it.
3. Mount the volume on 4 clients via v3/v4.1 through the same VIP.
4. Create a huge data set consisting of small, large, and empty directories.
5. Once the data set is created, trigger recursive find, du -sh, and ll -R from 4 clients.

No hangs were observed for ll -R and find running recursively. Moving this BZ to the verified state.
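As a side note, a simple way to tell "hung" apart from merely "slow" during such runs is to bound each traversal with a timeout and look for processes stuck in uninterruptible sleep. This is a sketch under assumed limits (the 2-hour bound mirrors the hang duration from the original report), not the exact verification procedure:

# Fail loudly if a traversal does not complete within the bound.
timeout 2h find /mnt/testvol > /dev/null || echo "find timed out or failed"
timeout 2h ls -lR /mnt/testvol > /dev/null || echo "ls -lR timed out or failed"

# Processes blocked on a stuck NFS mount typically sit in D state.
ps -eo pid,stat,wchan:32,cmd | awk '$2 ~ /D/'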
Since the problem described in this bug report should be resolved in a recent advisory, it has been closed with a resolution of ERRATA. For information on the advisory, and where to find the updated files, follow the link below. If the solution does not work for you, open a new bug report. https://access.redhat.com/errata/RHEA-2019:3252