Bug 1382912

Summary: [Ganesha] : mount fails when find hangs.
Product: [Red Hat Storage] Red Hat Gluster Storage
Reporter: Ambarish <asoman>
Component: nfs-ganesha
Assignee: Daniel Gryniewicz <dang>
Status: CLOSED ERRATA
QA Contact: Ambarish <asoman>
Severity: high
Docs Contact:
Priority: unspecified
Version: rhgs-3.2
CC: amukherj, asoman, ffilz, jthottan, kkeithle, rcyriac, rhinduja, rhs-bugs, sbhaloth, skoduri, storage-qa-internal
Target Milestone: ---
Keywords: Triaged
Target Release: RHGS 3.2.0
Hardware: x86_64
OS: Linux
Whiteboard:
Fixed In Version: nfs-ganesha-2.4.1-3
Doc Type: If docs needed, set a value
Doc Text:
Story Points: ---
Clone Of:
: 1403757 (view as bug list)
Environment:
Last Closed: 2017-03-23 06:24:20 UTC
Type: Bug
Regression: ---
Mount Type: ---
Documentation: ---
CRM:
Verified Versions:
Category: ---
oVirt Team: ---
RHEL 7.3 requirements from Atomic Host:
Cloudforms Team: ---
Target Upstream Version:
Embargoed:
Bug Depends On: 1379673    
Bug Blocks: 1351528    

Description Ambarish 2016-10-08 08:46:13 UTC
Description of problem:
-----------------------
4-node Ganesha cluster. 4 clients, each mounted from one particular server via its VIP.

Ran I/O from different mounts on 3 different clients. Ran "find" from a 4th client; find made no progress even after 36 hours of running from the command line. dds got hung too on one of the clients.

Tried to mount the volume on 4 new clients. Mounts are unsuccessful from the server that the client with the hung find had mounted from. They eventually time out.

Shared the setup with Soumya. She suspects the find hangs are causing the dd hangs (BZ#1379673). But the find hangs and mount failures might need further investigation.

To reiterate, this is the impact/observation:

* Application-side hang - find.
* Unable to mount the volume via the VIP/physical IP of the server that the client with the hung find had mounted from.

[root@gqac030 ~]#  mount -t nfs -o vers=4 192.168.79.153:/testvol /gluster-mount/ -v
mount.nfs: timeout set for Thu Oct  6 10:59:05 2016
mount.nfs: trying text-based options 'vers=4,addr=192.168.79.152,clientaddr=10.16.157.87'
^C
[root@gqac030 ~]# 

[root@gqac030 ~]# ping 192.168.79.153
PING 192.168.79.153 (192.168.79.153) 56(84) bytes of data.
64 bytes from 192.168.79.153: icmp_seq=1 ttl=64 time=0.151 ms
64 bytes from 192.168.79.153: icmp_seq=2 ttl=64 time=0.096 ms
64 bytes from 192.168.79.153: icmp_seq=3 ttl=64 time=0.091 ms


Mounts from other servers in the cluster are successful though.

pcs status was OK all along.

Unable to take a backtrace (BT).
Setup and workload details in comments.

Version-Release number of selected component (if applicable):
-------------------------------------------------------------


nfs-ganesha-2.4.0-2.el7rhgs.x86_64
glusterfs-ganesha-3.8.4-2.el7rhgs.x86_64


How reproducible:
-----------------

Reporting the first occurrence.

Steps to Reproduce:
-------------------

1. Mount the gluster volume via Ganesha.

2. Run dd from different clients.

3. Run find on the mount point from one of the clients while I/O is in progress. Check for progress continuously.

4. On another client, check whether mounts succeed from the same server that the client with the hung find had mounted from.
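
As a rough aid, the steps above can be sketched as a shell plan. This is a sketch only: the VIP, volume name, and mount point are taken from this report, the stress-file name is illustrative, and in practice each numbered command runs on a different client. Commands are collected into an array rather than executed, so the plan can be reviewed first.

```shell
#!/bin/bash
# Reproduction plan (sketch) -- values below come from this report,
# except the stress-file name, which is a placeholder.
VIP=192.168.79.153
VOL=testvol
MNT=/gluster-mount

cmds=()
plan() { cmds+=("$*"); }   # collect commands instead of running them

# 1. Mount the gluster volume via Ganesha (NFSv4) on each client
plan mount -t nfs -o vers=4 "$VIP:/$VOL" "$MNT"

# 2. Run dd I/O from different clients
plan dd if=/dev/urandom of="$MNT/stress1" conv=fdatasync bs=100 count=10000

# 3. Run find on the mount point from one client while I/O is in progress
plan find "$MNT" -mindepth 1 -type f

# 4. From a fresh client, retry the mount from the same server
plan mount -t nfs -o vers=4 "$VIP:/$VOL" "$MNT"

printf '%s\n' "${cmds[@]}"
```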

Actual results:
---------------

* find hangs.
* Mounts fail from the same server that the client with the hung find had mounted from.

Expected results:
-----------------

No hangs and successful mounts.

Additional info:
----------------

* mount vers=4

* Client/Server OS : RHEL 7.2

*Vol Config* :

Volume Name: testvol
Type: Distributed-Replicate
Volume ID: b93b99bd-d1d2-4236-98bc-08311f94e7dc
Status: Started
Snapshot Count: 0
Number of Bricks: 2 x 2 = 4
Transport-type: tcp
Bricks:
Brick1: gqas013.sbu.lab.eng.bos.redhat.com:/bricks/testvol_brick0
Brick2: gqas005.sbu.lab.eng.bos.redhat.com:/bricks/testvol_brick1
Brick3: gqas006.sbu.lab.eng.bos.redhat.com:/bricks/testvol_brick2
Brick4: gqas011.sbu.lab.eng.bos.redhat.com:/bricks/testvol_brick3
Options Reconfigured:
diagnostics.count-fop-hits: on
diagnostics.latency-measurement: on
ganesha.enable: on
features.cache-invalidation: off
nfs.disable: on
performance.readdir-ahead: on
performance.stat-prefetch: off
server.allow-insecure: on
nfs-ganesha: enable
cluster.enable-shared-storage: enable

Comment 2 Ambarish 2016-10-08 08:50:52 UTC
EXACT WORKLOAD :
-------------

*Data* - for i in {1..1000000}; do dd if=/dev/urandom of=stressc3$i conv=fdatasync bs=100 count=10000; done

*Metadata* - find . -mindepth 1 -type f

Comment 5 Ambarish 2016-10-08 10:01:38 UTC
glibc version on clients and servers: glibc-2.17-149.el7.x86_64

Comment 6 Soumya Koduri 2016-10-14 07:01:28 UTC
As noted in https://bugzilla.redhat.com/show_bug.cgi?id=1383559#c5, please collect process stack traces as well while the tests are being run.
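
For reference, one way such stack traces might be collected is sketched below. The `capture_stacks` helper and output path are illustrative, not part of this bug's tooling; on the servers, the PID would be that of ganesha.nfsd (e.g. via pgrep -x ganesha.nfsd).

```shell
#!/bin/bash
# Illustrative stack-capture helper; the function name and output path
# are assumptions, not part of this bug's tooling.
capture_stacks() {
    local pid=$1 out=$2
    # Kernel-side stack: shows where a D-state (uninterruptible) thread
    # is stuck. Reading it usually needs root, so ignore failures.
    cat "/proc/$pid/stack" > "$out" 2>/dev/null || true
    # Name/state/thread-count summary, readable without privileges.
    grep -E '^(Name|State|Threads)' "/proc/$pid/status" >> "$out"
}

# Example: sample the current shell; on a Ganesha server you would pass
# the PID from "pgrep -x ganesha.nfsd" instead.
capture_stacks "$$" /tmp/stack-sample.txt
```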

Comment 8 Ambarish 2016-10-17 12:02:51 UTC
I managed to delete the "Triaged" keyword added by Jiffin during a mid-air collision.
Re-added.

Comment 11 surabhi 2016-11-29 10:04:07 UTC
As per triage, we all agree that this BZ has to be fixed in rhgs-3.2.0. Providing qa_ack.

Comment 14 Atin Mukherjee 2016-12-06 07:16:00 UTC
Upstream fix:

https://review.gerrithub.io/304278
https://review.gerrithub.io/304279

Comment 16 Ambarish 2016-12-12 10:38:27 UTC
Raised a new BZ for the find hangs:

https://bugzilla.redhat.com/show_bug.cgi?id=1403757

Comment 19 Ambarish 2017-01-27 09:57:01 UTC
Verified on 2.4.1-6/3.8.4-13.

finds were hung (expected; see https://bugzilla.redhat.com/show_bug.cgi?id=1403757).

Subsequent mounts were successful.

Comment 21 errata-xmlrpc 2017-03-23 06:24:20 UTC
Since the problem described in this bug report should be
resolved in a recent advisory, it has been closed with a
resolution of ERRATA.

For information on the advisory, and where to find the updated
files, follow the link below.

If the solution does not work for you, open a new bug report.

https://rhn.redhat.com/errata/RHEA-2017-0493.html