Bug 1382912 - [Ganesha] : mount fails when find hangs.
Summary: [Ganesha] : mount fails when find hangs.
Keywords:
Status: CLOSED ERRATA
Alias: None
Product: Red Hat Gluster Storage
Classification: Red Hat Storage
Component: nfs-ganesha
Version: rhgs-3.2
Hardware: x86_64
OS: Linux
Priority: unspecified
Severity: high
Target Milestone: ---
Target Release: RHGS 3.2.0
Assignee: Daniel Gryniewicz
QA Contact: Ambarish
URL:
Whiteboard:
Depends On: 1379673
Blocks: 1351528
 
Reported: 2016-10-08 08:46 UTC by Ambarish
Modified: 2017-03-28 06:56 UTC
CC List: 11 users

Fixed In Version: nfs-ganesha-2.4.1-3
Doc Type: If docs needed, set a value
Doc Text:
Clone Of:
Clones: 1403757 (view as bug list)
Environment:
Last Closed: 2017-03-23 06:24:20 UTC
Embargoed:


Attachments


Links
System ID Private Priority Status Summary Last Updated
Red Hat Bugzilla 1403757 0 unspecified CLOSED [Ganesha] : find hangs when coupled with new writes. 2021-02-22 00:41:40 UTC
Red Hat Product Errata RHEA-2017:0493 0 normal SHIPPED_LIVE Red Hat Gluster Storage 3.2.0 nfs-ganesha bug fix and enhancement update 2017-03-23 09:19:13 UTC

Internal Links: 1403757

Description Ambarish 2016-10-08 08:46:13 UTC
Description of problem:
-----------------------
4-node Ganesha cluster. 4 clients, each mounted from one particular server via its VIP.

Ran I/O from different mounts on 3 different clients. Ran "find" from a 4th client. find did not start even after 36 hours of being run from the command line. The dd's hung too on one of the clients.

Tried to mount the volume on 4 new clients. Mounts are unsuccessful from the server that the client with the hung find had mounted from; they eventually time out.

Shared the setup with Soumya. She suspects the find hangs are causing the dd hangs (BZ#1379673), but the find hangs and mount failures might need further investigation.

To reiterate, this is the impact/observation:

* Application-side hang - find.
* Unable to mount the volume from the server VIP/physical IP (the same server that the client with the hung find had mounted from).

[root@gqac030 ~]#  mount -t nfs -o vers=4 192.168.79.153:/testvol /gluster-mount/ -v
mount.nfs: timeout set for Thu Oct  6 10:59:05 2016
mount.nfs: trying text-based options 'vers=4,addr=192.168.79.152,clientaddr=10.16.157.87'
^C
[root@gqac030 ~]# 

[root@gqac030 ~]# ping 192.168.79.153
PING 192.168.79.153 (192.168.79.153) 56(84) bytes of data.
64 bytes from 192.168.79.153: icmp_seq=1 ttl=64 time=0.151 ms
64 bytes from 192.168.79.153: icmp_seq=2 ttl=64 time=0.096 ms
64 bytes from 192.168.79.153: icmp_seq=3 ttl=64 time=0.091 ms


Mounts from other servers in the cluster are successful though.

pcs status was OK all along.

Unable to take a backtrace (BT).
Setup and workload details in comments.

Version-Release number of selected component (if applicable):
-------------------------------------------------------------


nfs-ganesha-2.4.0-2.el7rhgs.x86_64
glusterfs-ganesha-3.8.4-2.el7rhgs.x86_64


How reproducible:
-----------------

Reporting the first occurrence.

Steps to Reproduce:
-------------------

1. Mount the gluster volume via Ganesha.

2. Run dd from different clients.

3. Run find on the mount point from one of the clients while the I/O is in progress. Check its progress continuously.

4. From another client, check whether new mounts succeed from the same server that the client with the hung find had mounted from (see the shell sketch below).
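
The steps above as a shell sketch. The VIP 192.168.79.153, volume testvol, and mount point /gluster-mount are taken from this report; on a fresh setup the hosts and paths will differ:

# 1. On each client, mount the volume over NFSv4 via a server VIP:
mount -t nfs -o vers=4 192.168.79.153:/testvol /gluster-mount

# 2. On three of the clients, run the dd workload (see comment 2):
cd /gluster-mount
for i in {1..1000000}; do dd if=/dev/urandom of=stressc3$i conv=fdatasync bs=100 count=10000; done

# 3. On the fourth client, run find while the dd's are in progress:
cd /gluster-mount
find . -mindepth 1 -type f

# 4. From a fresh client, try a new mount from the VIP of the server that
#    served the client with the hung find; it should not time out:
mount -t nfs -o vers=4 192.168.79.153:/testvol /mnt -v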

Actual results:
---------------

* find hangs.
* Mounts from that server fail (the same server that the client with the hung find had mounted from).

Expected results:
-----------------

No hangs and successful mounts.

Additional info:
----------------

* mount vers=4

* Client/Server OS : RHEL 7.2

*Vol Config* :

Volume Name: testvol
Type: Distributed-Replicate
Volume ID: b93b99bd-d1d2-4236-98bc-08311f94e7dc
Status: Started
Snapshot Count: 0
Number of Bricks: 2 x 2 = 4
Transport-type: tcp
Bricks:
Brick1: gqas013.sbu.lab.eng.bos.redhat.com:/bricks/testvol_brick0
Brick2: gqas005.sbu.lab.eng.bos.redhat.com:/bricks/testvol_brick1
Brick3: gqas006.sbu.lab.eng.bos.redhat.com:/bricks/testvol_brick2
Brick4: gqas011.sbu.lab.eng.bos.redhat.com:/bricks/testvol_brick3
Options Reconfigured:
diagnostics.count-fop-hits: on
diagnostics.latency-measurement: on
ganesha.enable: on
features.cache-invalidation: off
nfs.disable: on
performance.readdir-ahead: on
performance.stat-prefetch: off
server.allow-insecure: on
nfs-ganesha: enable
cluster.enable-shared-storage: enable
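
For reference, a minimal sketch of how an equivalent 2 x 2 volume and Ganesha export could be brought up. The brick hosts and paths mirror the config above; the commands reflect the standard RHGS workflow and are not a capture of what was run on this setup:

gluster volume create testvol replica 2 \
    gqas013.sbu.lab.eng.bos.redhat.com:/bricks/testvol_brick0 \
    gqas005.sbu.lab.eng.bos.redhat.com:/bricks/testvol_brick1 \
    gqas006.sbu.lab.eng.bos.redhat.com:/bricks/testvol_brick2 \
    gqas011.sbu.lab.eng.bos.redhat.com:/bricks/testvol_brick3
gluster volume start testvol
# shared storage and the ganesha-ha.conf HA setup are assumed to be in place
gluster volume set all cluster.enable-shared-storage enable
gluster nfs-ganesha enable
gluster volume set testvol ganesha.enable on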

Comment 2 Ambarish 2016-10-08 08:50:52 UTC
EXACT WORKLOAD :
-------------

*Data* - for i in {1..1000000}; do dd if=/dev/urandom of=stressc3$i conv=fdatasync bs=100 count=10000; done

*Metadata* - find . -mindepth 1 -type f

Comment 5 Ambarish 2016-10-08 10:01:38 UTC
glibc version on clients and servers: glibc-2.17-149.el7.x86_64

Comment 6 Soumya Koduri 2016-10-14 07:01:28 UTC
As updated in https://bugzilla.redhat.com/show_bug.cgi?id=1383559#c5, please collect process stack traces as well while the tests are being run.
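
A quick way to capture those while the hang is in progress (a sketch; the ganesha.nfsd process name and output paths are assumptions, and any equivalent gdb/gstack invocation works):

# On the Ganesha server whose VIP the hung client mounted from:
gstack $(pgrep -x ganesha.nfsd) > /tmp/ganesha-stack-$(date +%s).txt

# On the client, dump the kernel stack of the hung find (needs root):
cat /proc/$(pgrep -x find)/stack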

Comment 8 Ambarish 2016-10-17 12:02:51 UTC
I managed to delete the "Triaged" keyword added by Jiffin during a mid-air collision.
Re-added.

Comment 11 surabhi 2016-11-29 10:04:07 UTC
As per the triaging, we all agree that this BZ has to be fixed in rhgs-3.2.0. Providing qa_ack.

Comment 14 Atin Mukherjee 2016-12-06 07:16:00 UTC
Upstream fix:

https://review.gerrithub.io/304278
https://review.gerrithub.io/304279

Comment 16 Ambarish 2016-12-12 10:38:27 UTC
Raised a new BZ for find hangs : 

https://bugzilla.redhat.com/show_bug.cgi?id=1403757

Comment 19 Ambarish 2017-01-27 09:57:01 UTC
Verified on 2.4.1-6/3.8.4-13.

finds were hung (expected => https://bugzilla.redhat.com/show_bug.cgi?id=1403757).

Subsequent mounts were successful.

Comment 21 errata-xmlrpc 2017-03-23 06:24:20 UTC
Since the problem described in this bug report should be
resolved in a recent advisory, it has been closed with a
resolution of ERRATA.

For information on the advisory, and where to find the updated
files, follow the link below.

If the solution does not work for you, open a new bug report.

https://rhn.redhat.com/errata/RHEA-2017-0493.html

