Description of problem:
------------------------
4-node cluster containing 3 volumes - testvol{1,2,3}. 4 clients mount these volumes (NOT in a 1:1 way):

Client 1 : testvol1 via v3 and v4, testvol3 (v3)
Client 2 : testvol1 (v3) and testvol2 (v3)
Client 3 : testvol2 (v3) and testvol3 via v3 and v4
Client 4 : testvol1 (v3), testvol3 (v3), testvol3 (v4)

Almost 2.5 hours into my workload, Ganesha crashed on 3/4 nodes (tracked via https://bugzilla.redhat.com/show_bug.cgi?id=1401160).

The Ganesha log is flooded with Server Fault and Stale File Handle errors (there was no rm, only writes):

02/12/2016 08:01:37 : epoch d2450000 : gqas009.sbu.lab.eng.bos.redhat.com : ganesha.nfsd-31124[work-207] posix2fsal_error :FSAL :CRIT :Mapping 107(default) to ERR_FSAL_SERVERFAULT

and,

02/12/2016 07:56:04 : epoch 52b20000 : gqas014.sbu.lab.eng.bos.redhat.com : ganesha.nfsd-19431[work-130] glusterfs_setattr2 :FSAL :CRIT :setattrs failed with error Stale file handle

Now, this is what is concerning - after 2.5 hours of writes from various mounts, the ERR_FSAL_SERVERFAULT message was logged more than 10000 times on 3 of my servers:

[root@gqas015 /]# cat /var/log/ganesha.log |grep -i "ERR_FSAL_SERVERFAULT" | wc -l
15563
[root@gqas015 /]#

[root@gqas010 /]# cat /var/log/ganesha.log |grep -i "ERR_FSAL_SERVERFAULT" | wc -l
10196
[root@gqas010 /]#

[root@gqas009 /]# cat /var/log/ganesha.log |grep -i "ERR_FSAL_SERVERFAULT" | wc -l
12784
[root@gqas009 /]#

Version-Release number of selected component (if applicable):
-------------------------------------------------------------
glusterfs-ganesha-3.8.4-5.el7rhgs.x86_64
nfs-ganesha-2.4.1-1.el7rhgs.x86_64

How reproducible:
-----------------
1/1

Steps to Reproduce:
------------------
1. Create a cluster with more than 1 volume.
2. Mount these volumes (more than 1 mount per client) via v3 and v4.
3. Pump IO (see the example mount/IO commands after the volume configs below).

Actual results:
---------------
Ganesha crashes and log flooding.

Expected results:
-----------------
No crashes/errors.

Additional info:
----------------
OS : RHEL 7.3

*Vol Config* :

Volume Name: testvol1
Type: Distribute
Volume ID: 7a2dae27-0646-4284-9a34-e7b8455d439f
Status: Started
Snapshot Count: 0
Number of Bricks: 1
Transport-type: tcp
Bricks:
Brick1: gqas014.sbu.lab.eng.bos.redhat.com:/bricks/testvol1_brick0
Options Reconfigured:
ganesha.enable: on
features.cache-invalidation: on
transport.address-family: inet
performance.readdir-ahead: on
nfs.disable: on
nfs-ganesha: enable
cluster.enable-shared-storage: enable

Volume Name: testvol2
Type: Distribute
Volume ID: 5a61a980-c8e6-41d7-bd00-9ac7f51cbf5e
Status: Started
Snapshot Count: 0
Number of Bricks: 1
Transport-type: tcp
Bricks:
Brick1: gqas009.sbu.lab.eng.bos.redhat.com:/bricks/testvol2_brick1
Options Reconfigured:
ganesha.enable: on
features.cache-invalidation: on
transport.address-family: inet
performance.readdir-ahead: on
nfs.disable: on
nfs-ganesha: enable
cluster.enable-shared-storage: enable

Volume Name: testvol3
Type: Replicate
Volume ID: 298bfa41-7469-4ff2-b9d4-aafb67c5cb9b
Status: Started
Snapshot Count: 0
Number of Bricks: 1 x 2 = 2
Transport-type: tcp
Bricks:
Brick1: gqas010.sbu.lab.eng.bos.redhat.com:/bricks/testvol3_brick2
Brick2: gqas015.sbu.lab.eng.bos.redhat.com:/bricks/testvol3_brick3
Options Reconfigured:
ganesha.enable: on
features.cache-invalidation: on
transport.address-family: inet
performance.readdir-ahead: on
nfs.disable: on
nfs-ganesha: enable
cluster.enable-shared-storage: enable

[root@gqas009 tmp]#
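For steps 2 and 3, something along these lines per client approximates the mount layout and write-only workload; this is only a sketch - the server name, mount points, file counts and sizes are illustrative, the actual per-client layout is listed under "Description of problem":

# Example from client 1: testvol1 over both v3 and v4, testvol3 over v3.
# <server> is any of the ganesha cluster nodes (or the ganesha VIP).
mkdir -p /mnt/testvol1_v3 /mnt/testvol1_v4 /mnt/testvol3_v3
mount -t nfs -o vers=3   <server>:/testvol1 /mnt/testvol1_v3
mount -t nfs -o vers=4.0 <server>:/testvol1 /mnt/testvol1_v4
mount -t nfs -o vers=3   <server>:/testvol3 /mnt/testvol3_v3

# Write-only IO (no rm), one writer per mount, run for a few hours:
for mnt in /mnt/testvol1_v3 /mnt/testvol1_v4 /mnt/testvol3_v3; do
    ( for i in $(seq 1 1000); do
          dd if=/dev/zero of=$mnt/file.$HOSTNAME.$i bs=1M count=10 conv=fsync
      done ) &
done
wait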
Errno 107 is ENOTCONN (Transport endpoint is not connected); per the log line above, posix2fsal_error falls through its default case and maps it to ERR_FSAL_SERVERFAULT.
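For reference, the errno value can be confirmed from the kernel headers on any of these RHEL 7 nodes (the exact header path may differ on other distributions):

grep -w 107 /usr/include/asm-generic/errno.h
#define ENOTCONN        107     /* Transport endpoint is not connected */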