Description of problem:
------------------------
4-node cluster containing 3 volumes - testvol{1,2,3}.
4 clients mount these volumes (NOT in a 1:1 way):
Client 1 : testvol1 via v3 and v4, testvol3 (v3)
Client 2 : testvol1 (v3) and testvol2 (v3)
Client 3 : testvol2 (v3) and testvol3 via v3 and v4
Client 4 : testvol1 (v3), testvol3 (v3), testvol3 (v4)
About 2.5 hours into my workload, Ganesha crashed on 3/4 nodes (tracked via https://bugzilla.redhat.com/show_bug.cgi?id=1401160).
The Ganesha log is flooded with Server Fault and Stale File Handle errors (there was no rm, only writes):
02/12/2016 08:01:37 : epoch d2450000 : gqas009.sbu.lab.eng.bos.redhat.com : ganesha.nfsd-31124[work-207] posix2fsal_error :FSAL :CRIT :Mapping 107(default) to ERR_FSAL_SERVERFAULT
and,
02/12/2016 07:56:04 : epoch 52b20000 : gqas014.sbu.lab.eng.bos.redhat.com : ganesha.nfsd-19431[work-130] glusterfs_setattr2 :FSAL :CRIT :setattrs failed with error Stale file handle
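For reference, errno 107 on Linux is ENOTCONN ("Transport endpoint is not connected"), and the "(default)" tag in the first message indicates posix2fsal_error has no specific mapping for that errno and falls back to ERR_FSAL_SERVERFAULT. That would suggest the gfapi connection to the bricks dropped, rather than a genuine server-side fault. The errno name can be confirmed from the shell:

```shell
# errno 107 on Linux is ENOTCONN; print its symbolic name and message:
python3 -c 'import errno, os; print(errno.errorcode[107], os.strerror(107))'
```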
Now, this is what is concerning - after 2.5 hours of writes from various mounts, the ERR_FSAL_SERVERFAULT message was logged more than 10,000 times on 3 of my servers:
[root@gqas015 /]# cat /var/log/ganesha.log |grep -i "ERR_FSAL_SERVERFAULT" | wc -l
15563
[root@gqas015 /]#
[root@gqas010 /]# cat /var/log/ganesha.log |grep -i "ERR_FSAL_SERVERFAULT" | wc -l
10196
[root@gqas010 /]#
[root@gqas009 /]# cat /var/log/ganesha.log |grep -i "ERR_FSAL_SERVERFAULT" | wc -l
12784
[root@gqas009 /]#
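To see when the flood started and how quickly it ramps, the same logs can be bucketed by minute (a sketch against the log format shown above; the log path is the same one used in the counts):

```shell
# Count ERR_FSAL_SERVERFAULT occurrences per minute. ganesha.log lines
# begin with "DD/MM/YYYY HH:MM:SS", so field 1 is the date and the first
# five characters of field 2 are the hour and minute.
grep "ERR_FSAL_SERVERFAULT" /var/log/ganesha.log \
  | awk '{print $1, substr($2, 1, 5)}' | sort | uniq -c | tail
```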
Version-Release number of selected component (if applicable):
-------------------------------------------------------------
glusterfs-ganesha-3.8.4-5.el7rhgs.x86_64
nfs-ganesha-2.4.1-1.el7rhgs.x86_64
How reproducible:
-----------------
1/1
Steps to Reproduce:
------------------
1. Create a cluster with more than 1 volume.
2. Mount these volumes (more than 1 mount per client) via v3 and v4.
3. Pump IO.
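A minimal sketch of the steps above (hostnames, brick paths, mount points and the IO tool are placeholders, not the exact original setup; repeat for each volume and client):

```shell
# 1. Create, start and export a volume via NFS-Ganesha (assumes the
#    ganesha HA cluster is already configured on the nodes).
gluster volume create testvol1 server1:/bricks/testvol1_brick0
gluster volume start testvol1
gluster volume set testvol1 ganesha.enable on

# 2. On a client, mount the same volume over both NFS versions.
mkdir -p /mnt/testvol1_v3 /mnt/testvol1_v4
mount -t nfs -o vers=3 server1:/testvol1 /mnt/testvol1_v3
mount -t nfs -o vers=4 server1:/testvol1 /mnt/testvol1_v4

# 3. Pump IO from every mount in parallel.
dd if=/dev/zero of=/mnt/testvol1_v3/file1 bs=1M count=10240 &
dd if=/dev/zero of=/mnt/testvol1_v4/file2 bs=1M count=10240 &
wait
```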
Actual results:
---------------
Ganesha crashes and the log is flooded with errors.
Expected results:
-----------------
No crashes/errors.
Additional info:
---------------
OS : RHEL 7.3
*Vol Config*:
Volume Name: testvol1
Type: Distribute
Volume ID: 7a2dae27-0646-4284-9a34-e7b8455d439f
Status: Started
Snapshot Count: 0
Number of Bricks: 1
Transport-type: tcp
Bricks:
Brick1: gqas014.sbu.lab.eng.bos.redhat.com:/bricks/testvol1_brick0
Options Reconfigured:
ganesha.enable: on
features.cache-invalidation: on
transport.address-family: inet
performance.readdir-ahead: on
nfs.disable: on
nfs-ganesha: enable
cluster.enable-shared-storage: enable
Volume Name: testvol2
Type: Distribute
Volume ID: 5a61a980-c8e6-41d7-bd00-9ac7f51cbf5e
Status: Started
Snapshot Count: 0
Number of Bricks: 1
Transport-type: tcp
Bricks:
Brick1: gqas009.sbu.lab.eng.bos.redhat.com:/bricks/testvol2_brick1
Options Reconfigured:
ganesha.enable: on
features.cache-invalidation: on
transport.address-family: inet
performance.readdir-ahead: on
nfs.disable: on
nfs-ganesha: enable
cluster.enable-shared-storage: enable
Volume Name: testvol3
Type: Replicate
Volume ID: 298bfa41-7469-4ff2-b9d4-aafb67c5cb9b
Status: Started
Snapshot Count: 0
Number of Bricks: 1 x 2 = 2
Transport-type: tcp
Bricks:
Brick1: gqas010.sbu.lab.eng.bos.redhat.com:/bricks/testvol3_brick2
Brick2: gqas015.sbu.lab.eng.bos.redhat.com:/bricks/testvol3_brick3
Options Reconfigured:
ganesha.enable: on
features.cache-invalidation: on
transport.address-family: inet
performance.readdir-ahead: on
nfs.disable: on
nfs-ganesha: enable
cluster.enable-shared-storage: enable
[root@gqas009 tmp]#
Comment 5 Daniel Gryniewicz
2016-12-05 13:50:24 UTC