Description of problem:
During automation runs of BVT, one of the client mount points went into a hung state on a gluster-NFS volume.

Version-Release number of selected component (if applicable):
glusterfs-3.12.2-27.el7rhgs.x86_64

How reproducible:
1/1

Steps to Reproduce:
Below are the steps from the test case automation (a scripted sketch of these steps follows the volume info below):
1. Create a Distributed-Disperse 2 x (4 + 2) volume
2. Write IO from 2 clients
3. Add bricks while IO is in progress
4. Start rebalance
5. Wait for rebalance to complete
6. Check for IO

After step 6, the mount point is hung.

Actual results:
Mount point is hung on the client.

Expected results:
IO should succeed.

Additional info:

[root@rhsauto052 glusterfs]# gluster vol info

Volume Name: testvol_dispersed
Type: Distributed-Disperse
Volume ID: 8b194e04-200e-4d61-b8ec-2a47c036d9b0
Status: Started
Snapshot Count: 0
Number of Bricks: 2 x (4 + 2) = 12
Transport-type: tcp
Bricks:
Brick1: rhsauto052.lab.eng.blr.redhat.com:/bricks/brick0/testvol_dispersed_brick0
Brick2: rhsauto023.lab.eng.blr.redhat.com:/bricks/brick0/testvol_dispersed_brick1
Brick3: rhsauto053.lab.eng.blr.redhat.com:/bricks/brick0/testvol_dispersed_brick2
Brick4: rhsauto056.lab.eng.blr.redhat.com:/bricks/brick0/testvol_dispersed_brick3
Brick5: rhsauto026.lab.eng.blr.redhat.com:/bricks/brick0/testvol_dispersed_brick4
Brick6: rhsauto049.lab.eng.blr.redhat.com:/bricks/brick0/testvol_dispersed_brick5
Brick7: rhsauto052.lab.eng.blr.redhat.com:/bricks/brick1/testvol_dispersed_brick6
Brick8: rhsauto023.lab.eng.blr.redhat.com:/bricks/brick1/testvol_dispersed_brick7
Brick9: rhsauto053.lab.eng.blr.redhat.com:/bricks/brick1/testvol_dispersed_brick8
Brick10: rhsauto056.lab.eng.blr.redhat.com:/bricks/brick1/testvol_dispersed_brick9
Brick11: rhsauto026.lab.eng.blr.redhat.com:/bricks/brick1/testvol_dispersed_brick10
Brick12: rhsauto049.lab.eng.blr.redhat.com:/bricks/brick1/testvol_dispersed_brick11
Options Reconfigured:
transport.address-family: inet
nfs.disable: off
[root@rhsauto052 glusterfs]#
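For reference, a minimal shell sketch of the automated steps above (not the actual test code): hostnames are abbreviated to s1..s6, the /mnt/nfs mount point and the dd workload are stand-in assumptions, and, per the correction in the first update below, the volume starts as 1 x (4 + 2) and is expanded to 2 x (4 + 2):

# 1. Create and start a 1 x (4 + 2) disperse volume with gluster-NFS enabled
#    (expanded to 2 x (4 + 2) by add-brick in step 3):
gluster volume create testvol_dispersed disperse 6 redundancy 2 \
    s1:/bricks/brick0/b0 s2:/bricks/brick0/b1 s3:/bricks/brick0/b2 \
    s4:/bricks/brick0/b3 s5:/bricks/brick0/b4 s6:/bricks/brick0/b5
gluster volume set testvol_dispersed nfs.disable off
gluster volume start testvol_dispersed

# 2. Mount over NFSv3 on each client and start IO:
mount -t nfs -o vers=3 s1:/testvol_dispersed /mnt/nfs
dd if=/dev/zero of=/mnt/nfs/io_test bs=1M count=4096 &

# 3. Add a second (4 + 2) subvolume while the IO is still running:
gluster volume add-brick testvol_dispersed \
    s1:/bricks/brick1/b6 s2:/bricks/brick1/b7 s3:/bricks/brick1/b8 \
    s4:/bricks/brick1/b9 s5:/bricks/brick1/b10 s6:/bricks/brick1/b11

# 4/5. Start the rebalance and poll until it reports completed:
gluster volume rebalance testvol_dispersed start
gluster volume rebalance testvol_dispersed status

# 6. Check IO; in the failing run, any access to the mount hangs here:
df -h /mnt/nfs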
> Volume Status:

[root@rhsauto052 glusterfs]# gluster vol status
Status of volume: testvol_dispersed
Gluster process                             TCP Port  RDMA Port  Online  Pid
------------------------------------------------------------------------------
Brick rhsauto052.lab.eng.blr.redhat.com:/br
icks/brick0/testvol_dispersed_brick0        49152     0          Y       24674
Brick rhsauto023.lab.eng.blr.redhat.com:/br
icks/brick0/testvol_dispersed_brick1        49153     0          Y       9935
Brick rhsauto053.lab.eng.blr.redhat.com:/br
icks/brick0/testvol_dispersed_brick2        49153     0          Y       9714
Brick rhsauto056.lab.eng.blr.redhat.com:/br
icks/brick0/testvol_dispersed_brick3        49152     0          Y       8987
Brick rhsauto026.lab.eng.blr.redhat.com:/br
icks/brick0/testvol_dispersed_brick4        49152     0          Y       7870
Brick rhsauto049.lab.eng.blr.redhat.com:/br
icks/brick0/testvol_dispersed_brick5        49152     0          Y       8388
Brick rhsauto052.lab.eng.blr.redhat.com:/br
icks/brick1/testvol_dispersed_brick6        49153     0          Y       25292
Brick rhsauto023.lab.eng.blr.redhat.com:/br
icks/brick1/testvol_dispersed_brick7        49152     0          Y       10068
Brick rhsauto053.lab.eng.blr.redhat.com:/br
icks/brick1/testvol_dispersed_brick8        49152     0          Y       9855
Brick rhsauto056.lab.eng.blr.redhat.com:/br
icks/brick1/testvol_dispersed_brick9        49153     0          Y       9133
Brick rhsauto026.lab.eng.blr.redhat.com:/br
icks/brick1/testvol_dispersed_brick10       49153     0          Y       8007
Brick rhsauto049.lab.eng.blr.redhat.com:/br
icks/brick1/testvol_dispersed_brick11       49153     0          Y       8529
NFS Server on localhost                     2049      0          Y       25313
Self-heal Daemon on localhost               N/A       N/A        Y       25326
NFS Server on rhsauto053.lab.eng.blr.redhat
.com                                        2049      0          Y       9876
Self-heal Daemon on rhsauto053.lab.eng.blr.
redhat.com                                  N/A       N/A        Y       9885
NFS Server on rhsauto049.lab.eng.blr.redhat
.com                                        2049      0          Y       8550
Self-heal Daemon on rhsauto049.lab.eng.blr.
redhat.com                                  N/A       N/A        Y       8559
NFS Server on rhsauto023.lab.eng.blr.redhat
.com                                        2049      0          Y       10089
Self-heal Daemon on rhsauto023.lab.eng.blr.
redhat.com                                  N/A       N/A        Y       10098
NFS Server on rhsauto026.lab.eng.blr.redhat
.com                                        2049      0          Y       8028
Self-heal Daemon on rhsauto026.lab.eng.blr.
redhat.com                                  N/A       N/A        Y       8037
NFS Server on rhsauto056.lab.eng.blr.redhat
.com                                        2049      0          Y       9154
Self-heal Daemon on rhsauto056.lab.eng.blr.
redhat.com                                  N/A       N/A        Y       9163

Task Status of Volume testvol_dispersed
------------------------------------------------------------------------------
Task                 : Rebalance
ID                   : 33d42dfa-5841-4a60-9c9d-942cbcf3f47c
Status               : completed

[root@rhsauto052 glusterfs]#

> No messages logged in nfs.log

SOS Report: http://rhsqe-repo.lab.eng.blr.redhat.com/sosreports/vavuthu/nfs_hung_bvt/
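Given that nothing shows up in nfs.log, one way to see where a request is stuck would be a statedump of the gluster-NFS process; a sketch, assuming the PID from the status output above:

# Dump the state (call pool, inode/fd tables, mem pools) of the gluster-NFS
# server; the dump file is written under /var/run/gluster by default
gluster volume statedump testvol_dispersed nfs

# Equivalent: send SIGUSR1 to the NFS server PID from the status output
kill -USR1 25313

# Pending frames in the call pool indicate the xlator where a fop is stuck
grep -A 5 'global.callpool' /var/run/gluster/*.dump.*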
Update:
=======

From nfs.log-20181111, the test case started on 2018-11-09 12:32:43:

Starting Test : functional.bvt.test_cvt.TestGlusterExpandVolumeSanity_cplex_dispersed_nfs.test_expanding_volume_when_io_in_progress : 06_25_09_11_2018

[2018-11-09 12:32:43.517958] I [MSGID: 100030] [glusterfsd.c:2504:main] 0-/usr/sbin/glusterfs: Started running /usr/sbin/glusterfs version 3.12.2 (args: /usr/sbin/glusterfs -s localhost --volfile-id gluster/nfs -p /var/run/gluster/nfs/nfs.pid -l /var/log/glusterfs/nfs.log -S /var/run/gluster/427e2a195b8f1bc9.socket)

From the glusto logs (the glusto logs are in the EST time zone), the NFS volume was mounted at 2018-11-09 12:33:16 UTC and writes started at 2018-11-09 12:33:17 UTC. The hang was observed at 2018-11-09 12:36:29 UTC.

> From the above timeline, the hang happens between 2018-11-09 12:33:16 UTC and 2018-11-09 12:36:29 UTC.

> I am able to mount the same NFS volume on a different client:

[root@dhcp47-46 ~]# mount -t nfs -o vers=3 rhsauto052.lab.eng.blr.redhat.com:/testvol_dispersed /mnt/nfs_hung
[root@dhcp47-46 ~]#
[root@dhcp47-46 ~]# df -h | grep -i nfs
rhsauto052.lab.eng.blr.redhat.com:/testvol_dispersed  398G  4.8G  394G   2% /mnt/nfs_hung
[root@dhcp47-46 ~]#

Glusto logs: http://jenkins-rhs.lab.eng.blr.redhat.com:8080/view/Auto%20RHEL%207.6/job/auto-RHGS_Downstream_BVT_RHEL_7_6_RHGS_3_4_2_brew/ws/glusto_2.log

Correction to the steps to reproduce: step 1 creates a Disperse 1 x (4 + 2) volume, not a Distributed-Disperse 2 x (4 + 2) volume; all other steps remain the same.
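Since a fresh mount from a different client works, the hang looks specific to the original client's connection; a sketch of client-side checks that could narrow this down (these were not part of the original run and assume root access on the hung client):

# Show processes stuck in uninterruptible sleep (D state) and their wait channel
ps -eo pid,stat,wchan:32,cmd | awk 'NR==1 || $2 ~ /^D/'

# Dump kernel stacks of all blocked tasks into dmesg (requires sysrq enabled)
echo w > /proc/sysrq-trigger
dmesg | tail -n 60

# Client-side RPC counters; a climbing "retrans" count suggests lost replies
nfsstat -rc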
Update:
=======

> Reproduced the issue on another setup with the DEBUG log level enabled for brick-log-level and client-log-level on the server side.
> Enabled "rpcdebug -m nfs -s all" on both clients.
> Started capturing packets before adding bricks to the volume and stopped the capture after the client hung.
> tcpdumps are uploaded to http://rhsqe-repo.lab.eng.blr.redhat.com/sosreports/vavuthu/nfs_hung_on_new-setup/
> Systems are kept in the same hung state.

[root@rhsauto030 ~]# time df -h
^C

real    13m9.348s
user    0m0.000s
sys     0m0.005s

[root@rhsauto027 ~]# gluster vol info

Volume Name: testvol_dispersed
Type: Distributed-Disperse
Volume ID: 46280d4d-a2cd-4886-a07e-5075c59deb2d
Status: Started
Snapshot Count: 0
Number of Bricks: 2 x (4 + 2) = 12
Transport-type: tcp
Bricks:
Brick1: rhsauto027.lab.eng.blr.redhat.com:/bricks/brick0/testvol_dispersed_brick0
Brick2: rhsauto025.lab.eng.blr.redhat.com:/bricks/brick0/testvol_dispersed_brick1
Brick3: rhsauto021.lab.eng.blr.redhat.com:/bricks/brick0/testvol_dispersed_brick2
Brick4: rhsauto022.lab.eng.blr.redhat.com:/bricks/brick0/testvol_dispersed_brick3
Brick5: rhsauto024.lab.eng.blr.redhat.com:/bricks/brick0/testvol_dispersed_brick4
Brick6: rhsauto029.lab.eng.blr.redhat.com:/bricks/brick0/testvol_dispersed_brick5
Brick7: rhsauto027.lab.eng.blr.redhat.com:/bricks/brick1/testvol_dispersed_brick6
Brick8: rhsauto025.lab.eng.blr.redhat.com:/bricks/brick1/testvol_dispersed_brick7
Brick9: rhsauto021.lab.eng.blr.redhat.com:/bricks/brick1/testvol_dispersed_brick8
Brick10: rhsauto022.lab.eng.blr.redhat.com:/bricks/brick1/testvol_dispersed_brick9
Brick11: rhsauto024.lab.eng.blr.redhat.com:/bricks/brick1/testvol_dispersed_brick10
Brick12: rhsauto029.lab.eng.blr.redhat.com:/bricks/brick1/testvol_dispersed_brick11
Options Reconfigured:
diagnostics.client-log-level: DEBUG
diagnostics.brick-log-level: DEBUG
transport.address-family: inet
nfs.disable: off
[root@rhsauto027 ~]#
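For reference, a sketch of the debug/capture setup described above; the capture interface and output path are assumptions:

# Server side: raise gluster log levels to DEBUG (as shown in the volume
# options above)
gluster volume set testvol_dispersed diagnostics.brick-log-level DEBUG
gluster volume set testvol_dispersed diagnostics.client-log-level DEBUG

# Client side: enable kernel NFS client debugging (messages go to syslog)
rpcdebug -m nfs -s all

# Client side: start the capture before add-brick, stop it after the hang
tcpdump -i eth0 -s 0 -w /tmp/nfs_hung.pcap port 2049 &
# ... add-brick + rebalance, wait for the client to hang ...
kill %1

# Disable NFS client debugging again
rpcdebug -m nfs -c all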
Jiffin - Could you please take a look at this and see whether it is indeed a regression?
The needinfo request[s] on this closed bug have been removed as they have been unresolved for 1000 days.