Description of problem:
------------------------
4-node cluster containing 3 volumes - testvol{1,2,3}.
4 clients mount these volumes (NOT in a 1:1 way):
Client 1 : testvol1 via v3 and v4, testvol3 (v3)
Client 2 : testvol1 (v3) and testvol2 (v3)
Client 3 : testvol2 (v3) and testvol3 via v3 and v4
Client 4 : testvol1 (v3), testvol3 (v3), testvol3 (v4)
About 2.5 hours into my workload, Ganesha crashed on 3/4 nodes (tracked via https://bugzilla.redhat.com/show_bug.cgi?id=1401160).
The Ganesha log is flooded with Server Fault and Stale File Handle errors (there was no rm, only writes):
02/12/2016 08:01:37 : epoch d2450000 : gqas009.sbu.lab.eng.bos.redhat.com : ganesha.nfsd-31124[work-207] posix2fsal_error :FSAL :CRIT :Mapping 107(default) to ERR_FSAL_SERVERFAULT
and,
02/12/2016 07:56:04 : epoch 52b20000 : gqas014.sbu.lab.eng.bos.redhat.com : ganesha.nfsd-19431[work-130] glusterfs_setattr2 :FSAL :CRIT :setattrs failed with error Stale file handle
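For reference, errno 107 on Linux is ENOTCONN ("Transport endpoint is not connected"), and the "(default)" tag in the first message indicates posix2fsal_error has no specific mapping for that errno and falls back to ERR_FSAL_SERVERFAULT. That would suggest the gfapi connection to the bricks dropped, rather than a genuine server-side fault. The errno name can be confirmed from the shell:

```shell
# errno 107 on Linux is ENOTCONN; print its symbolic name and message:
python3 -c 'import errno, os; print(errno.errorcode[107], os.strerror(107))'
```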
Now, this is what is concerning - after 2.5 hours of writes from various mounts, the ERR_FSAL_SERVERFAULT message was logged more than 10,000 times on 3 of my servers:
[root@gqas015 /]# cat /var/log/ganesha.log |grep -i "ERR_FSAL_SERVERFAULT" | wc -l
15563
[root@gqas015 /]#
[root@gqas010 /]# cat /var/log/ganesha.log |grep -i "ERR_FSAL_SERVERFAULT" | wc -l
10196
[root@gqas010 /]#
[root@gqas009 /]# cat /var/log/ganesha.log |grep -i "ERR_FSAL_SERVERFAULT" | wc -l
12784
[root@gqas009 /]#
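To see when the flood started and how quickly it ramps, the same logs can be bucketed by minute (a sketch against the log format shown above; the log path is the same one used in the counts):

```shell
# Count ERR_FSAL_SERVERFAULT occurrences per minute. ganesha.log lines
# begin with "DD/MM/YYYY HH:MM:SS", so field 1 is the date and the first
# five characters of field 2 are the hour and minute.
grep "ERR_FSAL_SERVERFAULT" /var/log/ganesha.log \
  | awk '{print $1, substr($2, 1, 5)}' | sort | uniq -c | tail
```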
Version-Release number of selected component (if applicable):
-------------------------------------------------------------
glusterfs-ganesha-3.8.4-5.el7rhgs.x86_64
nfs-ganesha-2.4.1-1.el7rhgs.x86_64
How reproducible:
-----------------
1/1
Steps to Reproduce:
------------------
1. Create a cluster with more than 1 volume.
2. Mount these volumes (more than 1 mount per client) via v3 and v4.
3. Pump IO.
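A minimal sketch of the steps above (hostnames, brick paths, mount points and the IO tool are placeholders, not the exact original setup; repeat for each volume and client):

```shell
# 1. Create, start and export a volume via NFS-Ganesha (assumes the
#    ganesha HA cluster is already configured on the nodes).
gluster volume create testvol1 server1:/bricks/testvol1_brick0
gluster volume start testvol1
gluster volume set testvol1 ganesha.enable on

# 2. On a client, mount the same volume over both NFS versions.
mkdir -p /mnt/testvol1_v3 /mnt/testvol1_v4
mount -t nfs -o vers=3 server1:/testvol1 /mnt/testvol1_v3
mount -t nfs -o vers=4 server1:/testvol1 /mnt/testvol1_v4

# 3. Pump IO from every mount in parallel.
dd if=/dev/zero of=/mnt/testvol1_v3/file1 bs=1M count=10240 &
dd if=/dev/zero of=/mnt/testvol1_v4/file2 bs=1M count=10240 &
wait
```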
Actual results:
---------------
Ganesha crashes and the log is flooded with errors.
Expected results:
-----------------
No crashes/errors.
Additional info:
---------------
OS : RHEL 7.3
*Vol Config*:
Volume Name: testvol1
Type: Distribute
Volume ID: 7a2dae27-0646-4284-9a34-e7b8455d439f
Status: Started
Snapshot Count: 0
Number of Bricks: 1
Transport-type: tcp
Bricks:
Brick1: gqas014.sbu.lab.eng.bos.redhat.com:/bricks/testvol1_brick0
Options Reconfigured:
ganesha.enable: on
features.cache-invalidation: on
transport.address-family: inet
performance.readdir-ahead: on
nfs.disable: on
nfs-ganesha: enable
cluster.enable-shared-storage: enable
Volume Name: testvol2
Type: Distribute
Volume ID: 5a61a980-c8e6-41d7-bd00-9ac7f51cbf5e
Status: Started
Snapshot Count: 0
Number of Bricks: 1
Transport-type: tcp
Bricks:
Brick1: gqas009.sbu.lab.eng.bos.redhat.com:/bricks/testvol2_brick1
Options Reconfigured:
ganesha.enable: on
features.cache-invalidation: on
transport.address-family: inet
performance.readdir-ahead: on
nfs.disable: on
nfs-ganesha: enable
cluster.enable-shared-storage: enable
Volume Name: testvol3
Type: Replicate
Volume ID: 298bfa41-7469-4ff2-b9d4-aafb67c5cb9b
Status: Started
Snapshot Count: 0
Number of Bricks: 1 x 2 = 2
Transport-type: tcp
Bricks:
Brick1: gqas010.sbu.lab.eng.bos.redhat.com:/bricks/testvol3_brick2
Brick2: gqas015.sbu.lab.eng.bos.redhat.com:/bricks/testvol3_brick3
Options Reconfigured:
ganesha.enable: on
features.cache-invalidation: on
transport.address-family: inet
performance.readdir-ahead: on
nfs.disable: on
nfs-ganesha: enable
cluster.enable-shared-storage: enable
[root@gqas009 tmp]#
Comment 5 Daniel Gryniewicz
2016-12-05 13:50:24 UTC