Bug 1401162 - [Ganesha + Multi-Volume/Multi-Mount] : Logs flooded with Server Fault and Stale File handle errors during writes.
Summary: [Ganesha + Multi-Volume/Multi-Mount] : Logs flooded with Server Fault and St...
Keywords:
Status: CLOSED NEXTRELEASE
Alias: None
Product: Red Hat Gluster Storage
Classification: Red Hat
Component: nfs-ganesha
Version: rhgs-3.2
Hardware: x86_64
OS: Linux
Priority: unspecified
Severity: high
Target Milestone: ---
Assignee: Kaleb KEITHLEY
QA Contact: Ambarish
URL:
Whiteboard:
Depends On:
Blocks:
TreeView+ depends on / blocked
 
Reported: 2016-12-03 04:36 UTC by Ambarish
Modified: 2019-12-31 07:17 UTC
CC: 13 users

Fixed In Version: rhgs-3.3.0
Doc Type: If docs needed, set a value
Doc Text:
Clone Of:
Environment:
Last Closed: 2017-08-23 12:24:26 UTC
Target Upstream Version:
ykaul: needinfo+



Description Ambarish 2016-12-03 04:36:13 UTC
Description of problem:
------------------------

4-node cluster hosting 3 volumes - testvol{1,2,3}.

4 clients mount these volumes (NOT in a 1:1 way):

Client 1: testvol1 (v3 and v4), testvol3 (v3)
Client 2: testvol1 (v3), testvol2 (v3)
Client 3: testvol2 (v3), testvol3 (v3 and v4)
Client 4: testvol1 (v3), testvol3 (v3), testvol3 (v4)

Almost 2.5 hours into my workload, Ganesha crashed on 3 of the 4 nodes (tracked via https://bugzilla.redhat.com/show_bug.cgi?id=1401160).

The Ganesha log is flooded with Server Fault and Stale file handle errors (there was no rm, only writes):

02/12/2016 08:01:37 : epoch d2450000 : gqas009.sbu.lab.eng.bos.redhat.com : ganesha.nfsd-31124[work-207] posix2fsal_error :FSAL :CRIT :Mapping 107(default) to ERR_FSAL_SERVERFAULT

and,

02/12/2016 07:56:04 : epoch 52b20000 : gqas014.sbu.lab.eng.bos.redhat.com : ganesha.nfsd-19431[work-130] glusterfs_setattr2 :FSAL :CRIT :setattrs failed with error Stale file handle


Now, this is what is concerning: after 2.5 hours of writes from various mounts, the ERR_FSAL_SERVERFAULT message was logged more than 10,000 times on 3 of my servers:


[root@gqas015 /]# cat /var/log/ganesha.log |grep -i "ERR_FSAL_SERVERFAULT" | wc -l
15563
[root@gqas015 /]# 

[root@gqas010 /]# cat /var/log/ganesha.log |grep -i "ERR_FSAL_SERVERFAULT" | wc -l
10196
[root@gqas010 /]# 

[root@gqas009 /]# cat /var/log/ganesha.log |grep -i "ERR_FSAL_SERVERFAULT" | wc -l
12784
[root@gqas009 /]# 
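The per-node tallies above can be produced without the `cat | grep | wc` pipeline; `grep -c` counts matching lines directly. A small sketch (the helper name is mine; it assumes the same `/var/log/ganesha.log` path on each node):

```shell
# serverfault_count FILE -> number of ERR_FSAL_SERVERFAULT lines in FILE.
# -i keeps the case-insensitive matching used in the transcript above;
# -c makes grep print the match count itself.
serverfault_count() {
    grep -ic "ERR_FSAL_SERVERFAULT" "$1"
}

# On each node, e.g.:
#   serverfault_count /var/log/ganesha.log
```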
 


Version-Release number of selected component (if applicable):
-------------------------------------------------------------


glusterfs-ganesha-3.8.4-5.el7rhgs.x86_64
nfs-ganesha-2.4.1-1.el7rhgs.x86_64


How reproducible:
-----------------

1/1

Steps to Reproduce:
------------------

1. Create a cluster with more than 1 volume.

2. Mount these volumes (more than 1 mount per client) via v3 and v4.

3. Pump IO.
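For step 2, the mixed-version mounts can be set up along these lines (a sketch for Client 1's layout; the server name and mount points are placeholders, not the hosts from this setup):

```shell
# Client 1: testvol1 over both NFS versions, plus testvol3 over v3.
# "server1" and the /mnt paths are illustrative placeholders.
mount -t nfs -o vers=3 server1:/testvol1 /mnt/testvol1_v3
mount -t nfs -o vers=4 server1:/testvol1 /mnt/testvol1_v4
mount -t nfs -o vers=3 server1:/testvol3 /mnt/testvol3_v3
```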


Actual results:
---------------

Ganesha crashes, and the logs are flooded with errors.

Expected results:
-----------------

No crashes/errors.

Additional info:
---------------

OS : RHEL 7.3

*Vol Config* :

Volume Name: testvol1
Type: Distribute
Volume ID: 7a2dae27-0646-4284-9a34-e7b8455d439f
Status: Started
Snapshot Count: 0
Number of Bricks: 1
Transport-type: tcp
Bricks:
Brick1: gqas014.sbu.lab.eng.bos.redhat.com:/bricks/testvol1_brick0
Options Reconfigured:
ganesha.enable: on
features.cache-invalidation: on
transport.address-family: inet
performance.readdir-ahead: on
nfs.disable: on
nfs-ganesha: enable
cluster.enable-shared-storage: enable
 
Volume Name: testvol2
Type: Distribute
Volume ID: 5a61a980-c8e6-41d7-bd00-9ac7f51cbf5e
Status: Started
Snapshot Count: 0
Number of Bricks: 1
Transport-type: tcp
Bricks:
Brick1: gqas009.sbu.lab.eng.bos.redhat.com:/bricks/testvol2_brick1
Options Reconfigured:
ganesha.enable: on
features.cache-invalidation: on
transport.address-family: inet
performance.readdir-ahead: on
nfs.disable: on
nfs-ganesha: enable
cluster.enable-shared-storage: enable
 
Volume Name: testvol3
Type: Replicate
Volume ID: 298bfa41-7469-4ff2-b9d4-aafb67c5cb9b
Status: Started
Snapshot Count: 0
Number of Bricks: 1 x 2 = 2
Transport-type: tcp
Bricks:
Brick1: gqas010.sbu.lab.eng.bos.redhat.com:/bricks/testvol3_brick2
Brick2: gqas015.sbu.lab.eng.bos.redhat.com:/bricks/testvol3_brick3
Options Reconfigured:
ganesha.enable: on
features.cache-invalidation: on
transport.address-family: inet
performance.readdir-ahead: on
nfs.disable: on
nfs-ganesha: enable
cluster.enable-shared-storage: enable
[root@gqas009 tmp]#

Comment 5 Daniel Gryniewicz 2016-12-05 13:50:24 UTC
107 is ENOTCONN
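On Linux, errno 107 is indeed ENOTCONN ("Transport endpoint is not connected"), i.e. the FSAL was mapping a dropped gluster connection to ERR_FSAL_SERVERFAULT. A quick check on any Linux host with Python 3 available:

```shell
# Print the symbolic name and message for errno 107 on this host.
# On Linux this prints: ENOTCONN Transport endpoint is not connected
python3 -c 'import errno, os; print(errno.errorcode[107], os.strerror(107))'
```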

