Bug 1382912

Summary: [Ganesha] : mount fails when find hangs.
Product: [Red Hat Storage] Red Hat Gluster Storage
Reporter: Ambarish <asoman>
Component: nfs-ganesha
Assignee: Daniel Gryniewicz <dang>
Status: CLOSED ERRATA
QA Contact: Ambarish <asoman>
Severity: high
Docs Contact:
Priority: unspecified
Version: rhgs-3.2
CC: amukherj, asoman, ffilz, jthottan, kkeithle, rcyriac, rhinduja, rhs-bugs, sbhaloth, skoduri, storage-qa-internal
Target Milestone: ---
Keywords: Triaged
Target Release: RHGS 3.2.0
Hardware: x86_64
OS: Linux
Whiteboard:
Fixed In Version: nfs-ganesha-2.4.1-3
Doc Type: If docs needed, set a value
Doc Text:
Story Points: ---
Clone Of:
: 1403757 (view as bug list)
Environment:
Last Closed: 2017-03-23 06:24:20 UTC
Type: Bug
Regression: ---
Mount Type: ---
Documentation: ---
CRM:
Verified Versions:
Category: ---
oVirt Team: ---
RHEL 7.3 requirements from Atomic Host:
Cloudforms Team: ---
Target Upstream Version:
Embargoed:
Bug Depends On: 1379673    
Bug Blocks: 1351528    

Description Ambarish 2016-10-08 08:46:13 UTC
Description of problem:
-----------------------
4-node Ganesha cluster. 4 clients, each mounted from one particular server via its VIP.

Ran I/O from different mounts on 3 different clients. Ran "find" from a 4th client; find made no progress even after 36 hours of running from the command line. dds got hung too on one of the clients.

Tried to mount the volume on 4 new clients. Mounts are unsuccessful from the server that the client with the hung find had mounted from. They eventually time out.

Shared the setup with Soumya. She suspects the find hangs are causing the dd hangs (BZ#1379673). But the find hangs and mount failures might need further investigation.

To reiterate, this is the impact/observation:

* Application-side hang - find.
* Unable to mount the volume via the VIP/physical IP of the server that the client with the hung find had mounted from.

[root@gqac030 ~]#  mount -t nfs -o vers=4 192.168.79.153:/testvol /gluster-mount/ -v
mount.nfs: timeout set for Thu Oct  6 10:59:05 2016
mount.nfs: trying text-based options 'vers=4,addr=192.168.79.152,clientaddr=10.16.157.87'
^C
[root@gqac030 ~]# 

[root@gqac030 ~]# ping 192.168.79.153
PING 192.168.79.153 (192.168.79.153) 56(84) bytes of data.
64 bytes from 192.168.79.153: icmp_seq=1 ttl=64 time=0.151 ms
64 bytes from 192.168.79.153: icmp_seq=2 ttl=64 time=0.096 ms
64 bytes from 192.168.79.153: icmp_seq=3 ttl=64 time=0.091 ms


Mounts from other servers in the cluster are successful though.

pcs status was OK all along.

Unable to take a backtrace (BT).
Setup and workload details in comments.

Version-Release number of selected component (if applicable):
-------------------------------------------------------------


nfs-ganesha-2.4.0-2.el7rhgs.x86_64
glusterfs-ganesha-3.8.4-2.el7rhgs.x86_64


How reproducible:
-----------------

Reporting the first occurrence.

Steps to Reproduce:
-------------------

1. Mount the gluster volume via Ganesha.

2. Run dd from different clients.

3. Run find on the mount point from one of the clients while I/O is in progress. Check for progress continuously.

4. On another client, check whether mounts succeed from the same server that the client with the hung find had mounted from.
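
As a rough aid, the steps above can be sketched as a shell plan. This is a sketch only: the VIP, volume name, and mount point are taken from this report, the stress-file name is illustrative, and in practice each numbered command runs on a different client. Commands are collected into an array rather than executed, so the plan can be reviewed first.

```shell
#!/bin/bash
# Reproduction plan (sketch) -- values below come from this report,
# except the stress-file name, which is a placeholder.
VIP=192.168.79.153
VOL=testvol
MNT=/gluster-mount

cmds=()
plan() { cmds+=("$*"); }   # collect commands instead of running them

# 1. Mount the gluster volume via Ganesha (NFSv4) on each client
plan mount -t nfs -o vers=4 "$VIP:/$VOL" "$MNT"

# 2. Run dd I/O from different clients
plan dd if=/dev/urandom of="$MNT/stress1" conv=fdatasync bs=100 count=10000

# 3. Run find on the mount point from one client while I/O is in progress
plan find "$MNT" -mindepth 1 -type f

# 4. From a fresh client, retry the mount from the same server
plan mount -t nfs -o vers=4 "$VIP:/$VOL" "$MNT"

printf '%s\n' "${cmds[@]}"
```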

Actual results:
---------------

* find hangs.
* Mounts fail from the same server that the client with the hung find had mounted from.

Expected results:
-----------------

No hangs and successful mounts.

Additional info:
----------------

* mount vers=4

* Client/Server OS : RHEL 7.2

*Vol Config* :

Volume Name: testvol
Type: Distributed-Replicate
Volume ID: b93b99bd-d1d2-4236-98bc-08311f94e7dc
Status: Started
Snapshot Count: 0
Number of Bricks: 2 x 2 = 4
Transport-type: tcp
Bricks:
Brick1: gqas013.sbu.lab.eng.bos.redhat.com:/bricks/testvol_brick0
Brick2: gqas005.sbu.lab.eng.bos.redhat.com:/bricks/testvol_brick1
Brick3: gqas006.sbu.lab.eng.bos.redhat.com:/bricks/testvol_brick2
Brick4: gqas011.sbu.lab.eng.bos.redhat.com:/bricks/testvol_brick3
Options Reconfigured:
diagnostics.count-fop-hits: on
diagnostics.latency-measurement: on
ganesha.enable: on
features.cache-invalidation: off
nfs.disable: on
performance.readdir-ahead: on
performance.stat-prefetch: off
server.allow-insecure: on
nfs-ganesha: enable
cluster.enable-shared-storage: enable

Comment 2 Ambarish 2016-10-08 08:50:52 UTC
EXACT WORKLOAD :
-------------

*Data* - for i in {1..1000000}; do dd if=/dev/urandom of=stressc3$i conv=fdatasync bs=100 count=10000; done

*Metadata* - find . -mindepth 1 -type f

Comment 5 Ambarish 2016-10-08 10:01:38 UTC
glibc version on clients and servers: glibc-2.17-149.el7.x86_64

Comment 6 Soumya Koduri 2016-10-14 07:01:28 UTC
As noted in https://bugzilla.redhat.com/show_bug.cgi?id=1383559#c5, please collect process stack traces as well while the tests are being run.
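
For reference, one way such stack traces might be collected is sketched below. The `capture_stacks` helper and output path are illustrative, not part of this bug's tooling; on the servers, the PID would be that of ganesha.nfsd (e.g. via pgrep -x ganesha.nfsd).

```shell
#!/bin/bash
# Illustrative stack-capture helper; the function name and output path
# are assumptions, not part of this bug's tooling.
capture_stacks() {
    local pid=$1 out=$2
    # Kernel-side stack: shows where a D-state (uninterruptible) thread
    # is stuck. Reading it usually needs root, so ignore failures.
    cat "/proc/$pid/stack" > "$out" 2>/dev/null || true
    # Name/state/thread-count summary, readable without privileges.
    grep -E '^(Name|State|Threads)' "/proc/$pid/status" >> "$out"
}

# Example: sample the current shell; on a Ganesha server you would pass
# the PID from "pgrep -x ganesha.nfsd" instead.
capture_stacks "$$" /tmp/stack-sample.txt
```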

Comment 8 Ambarish 2016-10-17 12:02:51 UTC
I managed to delete the "Triaged" keyword added by Jiffin during a mid-air collision.
Re-added.

Comment 11 surabhi 2016-11-29 10:04:07 UTC
As per triage, we all agree that this BZ has to be fixed in rhgs-3.2.0. Providing qa_ack.

Comment 14 Atin Mukherjee 2016-12-06 07:16:00 UTC
Upstream fix:

https://review.gerrithub.io/304278
https://review.gerrithub.io/304279

Comment 16 Ambarish 2016-12-12 10:38:27 UTC
Raised a new BZ for the find hangs:

https://bugzilla.redhat.com/show_bug.cgi?id=1403757

Comment 19 Ambarish 2017-01-27 09:57:01 UTC
Verified on 2.4.1-6/3.8.4-13.

finds were hung (expected; see https://bugzilla.redhat.com/show_bug.cgi?id=1403757).

Subsequent mounts were successful.

Comment 21 errata-xmlrpc 2017-03-23 06:24:20 UTC
Since the problem described in this bug report should be
resolved in a recent advisory, it has been closed with a
resolution of ERRATA.

For information on the advisory, and where to find the updated
files, follow the link below.

If the solution does not work for you, open a new bug report.

https://rhn.redhat.com/errata/RHEA-2017-0493.html