| Summary: | [Ganesha + Multi-Volume/Multi-Mount] : Logs flooded with Server Fault and Stale File handle errors during writes. | | |
|---|---|---|---|
| Product: | Red Hat Gluster Storage | Reporter: | Ambarish <asoman> |
| Component: | nfs-ganesha | Assignee: | Kaleb KEITHLEY <kkeithle> |
| Status: | CLOSED NEXTRELEASE | QA Contact: | Ambarish <asoman> |
| Severity: | high | Docs Contact: | |
| Priority: | unspecified | | |
| Version: | rhgs-3.2 | CC: | amukherj, asoman, bturner, dang, ffilz, jthottan, mbenjamin, rgowdapp, rhinduja, rhs-bugs, rkavunga, skoduri, storage-qa-internal |
| Target Milestone: | --- | Keywords: | ZStream |
| Target Release: | --- | Flags: | ykaul: needinfo+ |
| Hardware: | x86_64 | | |
| OS: | Linux | | |
| Whiteboard: | | | |
| Fixed In Version: | rhgs-3.3.0 | Doc Type: | If docs needed, set a value |
| Doc Text: | | Story Points: | --- |
| Clone Of: | | Environment: | |
| Last Closed: | 2017-08-23 12:24:26 UTC | Type: | Bug |
| Regression: | --- | Mount Type: | --- |
| Documentation: | --- | CRM: | |
| Verified Versions: | | Category: | --- |
| oVirt Team: | --- | RHEL 7.3 requirements from Atomic Host: | |
| Cloudforms Team: | --- | Target Upstream Version: | |
Note: errno 107 is ENOTCONN ("Transport endpoint is not connected").
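A quick way to confirm the errno name on any Linux box with python3 (a sanity check only, not part of the reproducer; the hostname in the prompt is illustrative):

[root@gqas009 /]# python3 -c 'import errno, os; print(errno.errorcode[107], "-", os.strerror(107))'
ENOTCONN - Transport endpoint is not connected

The "(default)" in the posix2fsal_error message quoted below suggests that ENOTCONN has no dedicated FSAL error mapping and falls through to the catch-all ERR_FSAL_SERVERFAULT case, which would explain how a single recurring errno can flood the log.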
Description of problem:
------------------------
4-node cluster containing 3 volumes - testvol{1,2,3}. 4 clients mount these volumes (NOT in a 1:1 way):

Client 1 : testvol1 via v3 and v4, testvol3 (v3)
Client 2 : testvol1 (v3) and testvol2 (v3)
Client 3 : testvol2 (v3) and testvol3 via v3 and v4
Client 4 : testvol1 (v3), testvol3 (v3), testvol3 (v4)

Almost 2.5 hours into the workload, Ganesha crashed on 3 of the 4 nodes (tracked via https://bugzilla.redhat.com/show_bug.cgi?id=1401160).

The Ganesha log is flooded with Server Fault and Stale file handle errors (there was no rm, only writes):

02/12/2016 08:01:37 : epoch d2450000 : gqas009.sbu.lab.eng.bos.redhat.com : ganesha.nfsd-31124[work-207] posix2fsal_error :FSAL :CRIT :Mapping 107(default) to ERR_FSAL_SERVERFAULT

and,

02/12/2016 07:56:04 : epoch 52b20000 : gqas014.sbu.lab.eng.bos.redhat.com : ganesha.nfsd-19431[work-130] glusterfs_setattr2 :FSAL :CRIT :setattrs failed with error Stale file handle

What is concerning: after 2.5 hours of writes from various mounts, the ERR_FSAL_SERVERFAULT message was logged more than 10000 times on 3 of my servers:

[root@gqas015 /]# cat /var/log/ganesha.log |grep -i "ERR_FSAL_SERVERFAULT" | wc -l
15563
[root@gqas015 /]#

[root@gqas010 /]# cat /var/log/ganesha.log |grep -i "ERR_FSAL_SERVERFAULT" | wc -l
10196
[root@gqas010 /]#

[root@gqas009 /]# cat /var/log/ganesha.log |grep -i "ERR_FSAL_SERVERFAULT" | wc -l
12784
[root@gqas009 /]#

Version-Release number of selected component (if applicable):
-------------------------------------------------------------
glusterfs-ganesha-3.8.4-5.el7rhgs.x86_64
nfs-ganesha-2.4.1-1.el7rhgs.x86_64

How reproducible:
-----------------
1/1

Steps to Reproduce:
-------------------
1. Create a cluster with more than 1 volume.
2. Mount these volumes (more than 1 mount per client) via v3 and v4 (example mount commands at the end of this report).
3. Pump I/O.

Actual results:
---------------
Ganesha crashes and log flooding.

Expected results:
-----------------
No crashes/errors.

Additional info:
----------------
OS : RHEL 7.3

*Vol Config* :

Volume Name: testvol1
Type: Distribute
Volume ID: 7a2dae27-0646-4284-9a34-e7b8455d439f
Status: Started
Snapshot Count: 0
Number of Bricks: 1
Transport-type: tcp
Bricks:
Brick1: gqas014.sbu.lab.eng.bos.redhat.com:/bricks/testvol1_brick0
Options Reconfigured:
ganesha.enable: on
features.cache-invalidation: on
transport.address-family: inet
performance.readdir-ahead: on
nfs.disable: on
nfs-ganesha: enable
cluster.enable-shared-storage: enable

Volume Name: testvol2
Type: Distribute
Volume ID: 5a61a980-c8e6-41d7-bd00-9ac7f51cbf5e
Status: Started
Snapshot Count: 0
Number of Bricks: 1
Transport-type: tcp
Bricks:
Brick1: gqas009.sbu.lab.eng.bos.redhat.com:/bricks/testvol2_brick1
Options Reconfigured:
ganesha.enable: on
features.cache-invalidation: on
transport.address-family: inet
performance.readdir-ahead: on
nfs.disable: on
nfs-ganesha: enable
cluster.enable-shared-storage: enable

Volume Name: testvol3
Type: Replicate
Volume ID: 298bfa41-7469-4ff2-b9d4-aafb67c5cb9b
Status: Started
Snapshot Count: 0
Number of Bricks: 1 x 2 = 2
Transport-type: tcp
Bricks:
Brick1: gqas010.sbu.lab.eng.bos.redhat.com:/bricks/testvol3_brick2
Brick2: gqas015.sbu.lab.eng.bos.redhat.com:/bricks/testvol3_brick3
Options Reconfigured:
ganesha.enable: on
features.cache-invalidation: on
transport.address-family: inet
performance.readdir-ahead: on
nfs.disable: on
nfs-ganesha: enable
cluster.enable-shared-storage: enable
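For step 2 of the reproducer, mounting the same volume from a client over both NFS protocol versions looks like this (the client hostname and mount points here are illustrative, not the exact ones used in the run; the vers= option selects the NFS protocol version):

[root@client1 /]# mount -t nfs -o vers=3 gqas009.sbu.lab.eng.bos.redhat.com:/testvol1 /mnt/testvol1_v3
[root@client1 /]# mount -t nfs -o vers=4 gqas009.sbu.lab.eng.bos.redhat.com:/testvol1 /mnt/testvol1_v4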