Bug 1648783

Summary: client mount point is hung on gluster-NFS volume
Product: [Red Hat Storage] Red Hat Gluster Storage
Reporter: Vijay Avuthu <vavuthu>
Component: gluster-nfs
Assignee: Jiffin <jthottan>
Status: CLOSED WONTFIX
QA Contact: Jilju Joy <jijoy>
Severity: urgent
Priority: high
Version: rhgs-3.4
CC: amukherj, apaladug, dang, grajoria, jiyin, jthottan, kkeithle, mbenjamin, mchangir, rcyriac, rhinduja, rhs-bugs, sanandpa, sankarshan, skoduri, storage-qa-internal, ubansal, vavuthu, ykaul
Target Milestone: ---
Target Release: ---
Keywords: AutomationBlocker, AutomationTriaged, ZStream
Hardware: Unspecified
OS: Unspecified
Whiteboard: 3.5-qe-proposed
Doc Type: If docs needed, set a value
Story Points: ---
Last Closed: 2019-07-23 04:57:03 UTC
Type: Bug
Regression: ---
Mount Type: ---
Documentation: ---
Category: ---
oVirt Team: ---
Cloudforms Team: ---
Bug Depends On: 1655129
Bug Blocks: ---

Description Vijay Avuthu 2018-11-12 06:45:48 UTC
Description of problem:

During automated BVT runs, one of the client mount points went into a hung state on a gluster-NFS volume.


Version-Release number of selected component (if applicable):

glusterfs-3.12.2-27.el7rhgs.x86_64


How reproducible: 1/1


Steps to Reproduce:

Below are the steps from the test case automation (a command-level sketch follows below):

1. Create a Distributed-Disperse 2 x (4 + 2) volume
2. Write IO from 2 clients
3. Add bricks while IO is in progress
4. Start rebalance
5. Wait for rebalance to complete
6. Check for IO

After step 6, the mount point is hung.
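
A minimal sketch of these steps using the standard gluster CLI, assuming six servers s1..s6 and illustrative brick paths; the exact hostnames, paths, and IO workload used by the automation differ:

# 1. Create a 2 x (4 + 2) distributed-disperse volume (12 bricks) and enable gluster-NFS.
gluster volume create testvol_dispersed disperse-data 4 redundancy 2 \
    s1:/bricks/brick0/d0 s2:/bricks/brick0/d1 s3:/bricks/brick0/d2 \
    s4:/bricks/brick0/d3 s5:/bricks/brick0/d4 s6:/bricks/brick0/d5 \
    s1:/bricks/brick1/d6 s2:/bricks/brick1/d7 s3:/bricks/brick1/d8 \
    s4:/bricks/brick1/d9 s5:/bricks/brick1/d10 s6:/bricks/brick1/d11
gluster volume set testvol_dispersed nfs.disable off
gluster volume start testvol_dispersed

# 2. On each client, mount over NFSv3 and start IO (dd stands in for the automation's workload).
mount -t nfs -o vers=3 s1:/testvol_dispersed /mnt/nfs
dd if=/dev/zero of=/mnt/nfs/file1 bs=1M count=1024 &

# 3-5. Add bricks while IO is in progress (a multiple of 6 bricks for a 4+2 disperse set),
#      then rebalance and wait for completion.
gluster volume add-brick testvol_dispersed \
    s1:/bricks/brick2/d12 s2:/bricks/brick2/d13 s3:/bricks/brick2/d14 \
    s4:/bricks/brick2/d15 s5:/bricks/brick2/d16 s6:/bricks/brick2/d17
gluster volume rebalance testvol_dispersed start
gluster volume rebalance testvol_dispersed status    # poll until "completed"

# 6. Check IO / mount responsiveness; this is the point where the mount hangs.
df -h /mnt/nfs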

Actual results:

The mount point is hung on the client.

Expected results:

IO should succeed.


Additional info:

[root@rhsauto052 glusterfs]# gluster vol info
 
Volume Name: testvol_dispersed
Type: Distributed-Disperse
Volume ID: 8b194e04-200e-4d61-b8ec-2a47c036d9b0
Status: Started
Snapshot Count: 0
Number of Bricks: 2 x (4 + 2) = 12
Transport-type: tcp
Bricks:
Brick1: rhsauto052.lab.eng.blr.redhat.com:/bricks/brick0/testvol_dispersed_brick0
Brick2: rhsauto023.lab.eng.blr.redhat.com:/bricks/brick0/testvol_dispersed_brick1
Brick3: rhsauto053.lab.eng.blr.redhat.com:/bricks/brick0/testvol_dispersed_brick2
Brick4: rhsauto056.lab.eng.blr.redhat.com:/bricks/brick0/testvol_dispersed_brick3
Brick5: rhsauto026.lab.eng.blr.redhat.com:/bricks/brick0/testvol_dispersed_brick4
Brick6: rhsauto049.lab.eng.blr.redhat.com:/bricks/brick0/testvol_dispersed_brick5
Brick7: rhsauto052.lab.eng.blr.redhat.com:/bricks/brick1/testvol_dispersed_brick6
Brick8: rhsauto023.lab.eng.blr.redhat.com:/bricks/brick1/testvol_dispersed_brick7
Brick9: rhsauto053.lab.eng.blr.redhat.com:/bricks/brick1/testvol_dispersed_brick8
Brick10: rhsauto056.lab.eng.blr.redhat.com:/bricks/brick1/testvol_dispersed_brick9
Brick11: rhsauto026.lab.eng.blr.redhat.com:/bricks/brick1/testvol_dispersed_brick10
Brick12: rhsauto049.lab.eng.blr.redhat.com:/bricks/brick1/testvol_dispersed_brick11
Options Reconfigured:
transport.address-family: inet
nfs.disable: off
[root@rhsauto052 glusterfs]# 

> Volume Status:

[root@rhsauto052 glusterfs]# gluster vol status
Status of volume: testvol_dispersed
Gluster process                             TCP Port  RDMA Port  Online  Pid
------------------------------------------------------------------------------
Brick rhsauto052.lab.eng.blr.redhat.com:/br
icks/brick0/testvol_dispersed_brick0        49152     0          Y       24674
Brick rhsauto023.lab.eng.blr.redhat.com:/br
icks/brick0/testvol_dispersed_brick1        49153     0          Y       9935 
Brick rhsauto053.lab.eng.blr.redhat.com:/br
icks/brick0/testvol_dispersed_brick2        49153     0          Y       9714 
Brick rhsauto056.lab.eng.blr.redhat.com:/br
icks/brick0/testvol_dispersed_brick3        49152     0          Y       8987 
Brick rhsauto026.lab.eng.blr.redhat.com:/br
icks/brick0/testvol_dispersed_brick4        49152     0          Y       7870 
Brick rhsauto049.lab.eng.blr.redhat.com:/br
icks/brick0/testvol_dispersed_brick5        49152     0          Y       8388 
Brick rhsauto052.lab.eng.blr.redhat.com:/br
icks/brick1/testvol_dispersed_brick6        49153     0          Y       25292
Brick rhsauto023.lab.eng.blr.redhat.com:/br
icks/brick1/testvol_dispersed_brick7        49152     0          Y       10068
Brick rhsauto053.lab.eng.blr.redhat.com:/br
icks/brick1/testvol_dispersed_brick8        49152     0          Y       9855 
Brick rhsauto056.lab.eng.blr.redhat.com:/br
icks/brick1/testvol_dispersed_brick9        49153     0          Y       9133 
Brick rhsauto026.lab.eng.blr.redhat.com:/br
icks/brick1/testvol_dispersed_brick10       49153     0          Y       8007 
Brick rhsauto049.lab.eng.blr.redhat.com:/br
icks/brick1/testvol_dispersed_brick11       49153     0          Y       8529 
NFS Server on localhost                     2049      0          Y       25313
Self-heal Daemon on localhost               N/A       N/A        Y       25326
NFS Server on rhsauto053.lab.eng.blr.redhat
.com                                        2049      0          Y       9876 
Self-heal Daemon on rhsauto053.lab.eng.blr.
redhat.com                                  N/A       N/A        Y       9885 
NFS Server on rhsauto049.lab.eng.blr.redhat
.com                                        2049      0          Y       8550 
Self-heal Daemon on rhsauto049.lab.eng.blr.
redhat.com                                  N/A       N/A        Y       8559 
NFS Server on rhsauto023.lab.eng.blr.redhat
.com                                        2049      0          Y       10089
Self-heal Daemon on rhsauto023.lab.eng.blr.
redhat.com                                  N/A       N/A        Y       10098
NFS Server on rhsauto026.lab.eng.blr.redhat
.com                                        2049      0          Y       8028 
Self-heal Daemon on rhsauto026.lab.eng.blr.
redhat.com                                  N/A       N/A        Y       8037 
NFS Server on rhsauto056.lab.eng.blr.redhat
.com                                        2049      0          Y       9154 
Self-heal Daemon on rhsauto056.lab.eng.blr.
redhat.com                                  N/A       N/A        Y       9163 
 
Task Status of Volume testvol_dispersed
------------------------------------------------------------------------------
Task                 : Rebalance           
ID                   : 33d42dfa-5841-4a60-9c9d-942cbcf3f47c
Status               : completed           
 
[root@rhsauto052 glusterfs]# 

> No messages logged in nfs.log

SOS Report: http://rhsqe-repo.lab.eng.blr.redhat.com/sosreports/vavuthu/nfs_hung_bvt/

Comment 6 Vijay Avuthu 2018-11-13 09:50:28 UTC
Update:
==========

From nfs.log-20181111, the test case started at 2018-11-09 12:32:43:

Starting Test : functional.bvt.test_cvt.TestGlusterExpandVolumeSanity_cplex_dispersed_nfs.test_expanding_volume_when_io_in_progress : 06_25_09_11_2018
[2018-11-09 12:32:43.517958] I [MSGID: 100030] [glusterfsd.c:2504:main] 0-/usr/sbin/glusterfs: Started running /usr/sbin/glusterfs version 3.12.2 (args: /usr/sbin/glusterfs -s localhost --volfile-id gluster/nfs -p /var/run/gluster/nfs/nfs.pid -l /var/log/glusterfs/nfs.log -S /var/run/gluster/427e2a195b8f1bc9.socket)


From the glusto logs (glusto logs are in the EST time zone):

The NFS volume was mounted at 2018-11-09 12:33:16 UTC, and writes started at 2018-11-09 12:33:17 UTC.

The hang was observed at 2018-11-09 12:36:29 UTC.

> From the above timeline, the hang happened between 2018-11-09 12:33:16 UTC and 2018-11-09 12:36:29 UTC.
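
(The glusto timestamps are logged in EST while the gluster logs are in UTC; assuming GNU date, an EST timestamp can be normalized for comparison. The timestamp below is illustrative:)

$ date -u -d 'TZ="America/New_York" 2018-11-09 07:33:16'
Fri Nov  9 12:33:16 UTC 2018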


> I am able to mount the same NFS volume on a different client:

[root@dhcp47-46 ~]# mount -t nfs -o vers=3 rhsauto052.lab.eng.blr.redhat.com:/testvol_dispersed /mnt/nfs_hung
[root@dhcp47-46 ~]# 
[root@dhcp47-46 ~]# df -h | grep -i nfs
rhsauto052.lab.eng.blr.redhat.com:/testvol_dispersed  398G  4.8G  394G   2% /mnt/nfs_hung
[root@dhcp47-46 ~]# 



Glusto logs: http://jenkins-rhs.lab.eng.blr.redhat.com:8080/view/Auto%20RHEL%207.6/job/auto-RHGS_Downstream_BVT_RHEL_7_6_RHGS_3_4_2_brew/ws/glusto_2.log


In the steps to reproduce, I mentioned creating a Distributed-Disperse 2 x (4 + 2) volume, but it is actually a Disperse 1 x (4 + 2) volume; all other steps remain the same.

Comment 7 Vijay Avuthu 2018-11-13 12:07:16 UTC
Update:
========

> Reproduced the issue on another setup with DEBUG log level enabled for brick-log-level and client-log-level on the server side.

> Enabled "rpcdebug -m nfs -s all" on both clients.

> Started capturing packets before adding bricks to the volume and stopped the capture after the client hung (a command sketch of this capture setup follows the output below).

> tcpdumps are uploaded to:

http://rhsqe-repo.lab.eng.blr.redhat.com/sosreports/vavuthu/nfs_hung_on_new-setup/

> Systems are kept in the same hung state.

[root@rhsauto030 ~]# time df -h
^C

real	13m9.348s
user	0m0.000s
sys	0m0.005s
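
For reference, a minimal sketch of the capture setup described above, assuming the standard gluster, rpcdebug, and tcpdump tools; the interface name and capture path are placeholders:

# On a server node: raise brick and client log levels to DEBUG
# (these are the two diagnostics options visible in the vol info below).
gluster volume set testvol_dispersed diagnostics.brick-log-level DEBUG
gluster volume set testvol_dispersed diagnostics.client-log-level DEBUG

# On each client: enable full kernel NFS client debugging.
rpcdebug -m nfs -s all

# On a client: capture NFS traffic across the add-brick/rebalance window.
tcpdump -i eth0 -w /tmp/nfs_hung.pcap port 2049 &
# ... add bricks, rebalance, wait for the hang, then stop the capture:
kill %1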


[root@rhsauto027 ~]# gluster vol info
 
Volume Name: testvol_dispersed
Type: Distributed-Disperse
Volume ID: 46280d4d-a2cd-4886-a07e-5075c59deb2d
Status: Started
Snapshot Count: 0
Number of Bricks: 2 x (4 + 2) = 12
Transport-type: tcp
Bricks:
Brick1: rhsauto027.lab.eng.blr.redhat.com:/bricks/brick0/testvol_dispersed_brick0
Brick2: rhsauto025.lab.eng.blr.redhat.com:/bricks/brick0/testvol_dispersed_brick1
Brick3: rhsauto021.lab.eng.blr.redhat.com:/bricks/brick0/testvol_dispersed_brick2
Brick4: rhsauto022.lab.eng.blr.redhat.com:/bricks/brick0/testvol_dispersed_brick3
Brick5: rhsauto024.lab.eng.blr.redhat.com:/bricks/brick0/testvol_dispersed_brick4
Brick6: rhsauto029.lab.eng.blr.redhat.com:/bricks/brick0/testvol_dispersed_brick5
Brick7: rhsauto027.lab.eng.blr.redhat.com:/bricks/brick1/testvol_dispersed_brick6
Brick8: rhsauto025.lab.eng.blr.redhat.com:/bricks/brick1/testvol_dispersed_brick7
Brick9: rhsauto021.lab.eng.blr.redhat.com:/bricks/brick1/testvol_dispersed_brick8
Brick10: rhsauto022.lab.eng.blr.redhat.com:/bricks/brick1/testvol_dispersed_brick9
Brick11: rhsauto024.lab.eng.blr.redhat.com:/bricks/brick1/testvol_dispersed_brick10
Brick12: rhsauto029.lab.eng.blr.redhat.com:/bricks/brick1/testvol_dispersed_brick11
Options Reconfigured:
diagnostics.client-log-level: DEBUG
diagnostics.brick-log-level: DEBUG
transport.address-family: inet
nfs.disable: off
[root@rhsauto027 ~]#

Comment 8 Atin Mukherjee 2018-11-14 16:17:54 UTC
Jiffin - Could you please take a look at this and see if this is indeed a regression or not?

Comment 40 Red Hat Bugzilla 2023-09-14 04:42:09 UTC
The needinfo request[s] on this closed bug have been removed as they have been unresolved for 1000 days