Description of problem: ----------------------- *Setup * - 4 node Ganesha cluster,2*2 volume,mounted via v3 and v4 on 8 clients. *Workload Description* : I ran writes on all 8 clients,the description of which is under : *FIO sequential writes* - Client1,Client2,Client3,Client4 *Smallfile creates * - Client5,Client6,Client7,Client8 *dd(2500 * 1 MB files)* - Client2,Client3,Client4,Client6,Client7,Client8 *Observation* : > Ganesha crashed on 4/4 nodes(Separate BZ). > EBADFD Error message from ganesha-gfapi/brick logs from one of the nodes* : [2016-11-13 18:12:07.130327] W [MSGID: 114031] [client-rpc-fops.c:855:client3_3_writev_cbk] 0-testvol-client-3: remote operation failed [Bad file descriptor] [2016-11-13 18:12:07.130408] W [MSGID: 114031] [client-rpc-fops.c:855:client3_3_writev_cbk] 0-testvol-client-2: remote operation failed [Bad file descriptor] bricks/bricks-testvol_brick3.log:[2016-11-13 18:12:07.130134] E [MSGID: 113072] [posix.c:3380:posix_writev] 0-testvol-posix: write failed: offset 0, [Bad file descriptor] bricks/bricks-testvol_brick3.log:[2016-11-13 18:12:07.130216] E [MSGID: 115067] [server-rpc-fops.c:1348:server_writev_cbk] 0-testvol-server: 683223: WRITEV 10073 (3bc2aeb8-d197-4f9b-b6d2-c10588e17ba5) ==> (Bad file descriptor) [Bad file descriptor] > Assertion failure error messages from gfapi log on another server : [2016-11-13 18:40:41.356119] E [mem-pool.c:315:__gf_free] (-->/usr/lib64/glusterfs/3.8.4/xlator/cluster/replicate.so(+0x59a54) [0x7faf748d2a54] -->/usr/lib64/glusterfs/3.8.4/xlator/cluster/replicate.so(+0x4d654) [0x7faf748c6654] -->/lib64/libglusterfs.so.0(__gf_free+0x104) [0x7faf82d5ac54] ) 0-: Assertion failed: GF_MEM_TRAILER_MAGIC == *(uint32_t *)((char *)free_ptr + header->size) [2016-11-13 18:40:42.178843] E [mem-pool.c:315:__gf_free] (-->/usr/lib64/glusterfs/3.8.4/xlator/cluster/replicate.so(+0x2b06c) [0x7faf748a406c] -->/usr/lib64/glusterfs/3.8.4/xlator/cluster/replicate.so(+0x4d648) [0x7faf748c6648] -->/lib64/libglusterfs.so.0(__gf_free+0x104) [0x7faf82d5ac54] ) 0-: Assertion failed: GF_MEM_TRAILER_MAGIC == *(uint32_t *)((char *)free_ptr + header->size) > Other messages from ganesha logs : Acquire/Destroy Mutex - --------------------- 13/11/2016 13:16:51 : epoch 16f70000 : gqas013.sbu.lab.eng.bos.redhat.com : ganesha.nfsd-31855[work-188] nfs_dupreq_finish :RW LOCK :CRIT :Error 22, acquiring mutex 0x7f67c40d0980 (&dv->mtx) at /builddir/build/BUILD/nfs-ganesha-2.4.1/src/RPCAL/nfs_dupreq.c:1085 13/11/2016 13:55:07 : epoch 0e630000 : gqas006.sbu.lab.eng.bos.redhat.com : ganesha.nfsd-31410[work-200] nfs_dupreq_free_dupreq :RW LOCK :CRIT :Error 16, Destroy mutex 0x7f5bf80417b0 (&dv->mtx) at /builddir/build/BUILD/nfs-ganesha-2.4.1/src/RPCAL/nfs_dupreq.c:786 --- 13/11/2016 12:56:14 : epoch ccf20000 : gqas011.sbu.lab.eng.bos.redhat.com : ganesha.nfsd-31407[main] nfs_Init_svc :DISP :CRIT :Cannot acquire credentials for principal nfs --- Attribute Errors : ------------------- 13/11/2016 13:17:13 : epoch ccf20000 : gqas011.sbu.lab.eng.bos.redhat.com : ganesha.nfsd-31407[work-60] nfs3_FSALattr_To_Fattr :NFS3 :CRIT :Likely bug: FSAL did not fill in a standard NFSv3 attribute: missing 1dfce Version-Release number of selected component (if applicable): ------------------------------------------------------------- nfs-ganesha-gluster-2.4.1-1.el7rhgs.x86_64 glusterfs-ganesha-3.8.4-4.el7rhgs.x86_64 How reproducible: ----------------- Reporting the first occurence. Steps to Reproduce: ------------------- 1. Prepare 4 node Ganesha cluster,mount the volume via v3/v4 on 8 clients. 2. Run fio,smallfile creates,dds,kernel untart from various clients. 3. Check for crashes and error messages in logs Actual results: -------------- Ganesha Crashes,EBADFD and other error messages. Expected results: ----------------- No crashes/error messages Additional info: ----------------- OS : RHEL 7.3 *Vol Config* : Volume Name: testvol Type: Distributed-Replicate Volume ID: e84889ee-7bed-426f-b187-2b15fb244175 Status: Started Snapshot Count: 0 Number of Bricks: 2 x 2 = 4 Transport-type: tcp Bricks: Brick1: gqas013.sbu.lab.eng.bos.redhat.com:/bricks/testvol_brick0 Brick2: gqas005.sbu.lab.eng.bos.redhat.com:/bricks/testvol_brick1 Brick3: gqas006.sbu.lab.eng.bos.redhat.com:/bricks/testvol_brick2 Brick4: gqas011.sbu.lab.eng.bos.redhat.com:/bricks/testvol_brick3 Options Reconfigured: ganesha.enable: on features.cache-invalidation: on server.allow-insecure: on performance.stat-prefetch: off transport.address-family: inet performance.readdir-ahead: on nfs.disable: on nfs-ganesha: enable cluster.enable-shared-storage: enable
> > Ganesha crashed on 4/4 nodes(Separate BZ). This is being tracker as part of bug1394717. > > > EBADFD Error message from ganesha-gfapi/brick logs from one of the nodes* : > > [2016-11-13 18:12:07.130327] W [MSGID: 114031] > [client-rpc-fops.c:855:client3_3_writev_cbk] 0-testvol-client-3: remote > operation failed [Bad file descriptor] > [2016-11-13 18:12:07.130408] W [MSGID: 114031] > [client-rpc-fops.c:855:client3_3_writev_cbk] 0-testvol-client-2: remote > operation failed [Bad file descriptor] These errors were returned from POSIX layer. bug1394773 tracks it. > > bricks/bricks-testvol_brick3.log:[2016-11-13 18:12:07.130134] E [MSGID: > 113072] [posix.c:3380:posix_writev] 0-testvol-posix: write failed: offset 0, > [Bad file descriptor] > bricks/bricks-testvol_brick3.log:[2016-11-13 18:12:07.130216] E [MSGID: > 115067] [server-rpc-fops.c:1348:server_writev_cbk] 0-testvol-server: 683223: > WRITEV 10073 (3bc2aeb8-d197-4f9b-b6d2-c10588e17ba5) ==> (Bad file > descriptor) [Bad file descriptor] > > > > Assertion failure error messages from gfapi log on another server : > > [2016-11-13 18:40:41.356119] E [mem-pool.c:315:__gf_free] > (-->/usr/lib64/glusterfs/3.8.4/xlator/cluster/replicate.so(+0x59a54) > [0x7faf748d2a54] > -->/usr/lib64/glusterfs/3.8.4/xlator/cluster/replicate.so(+0x4d654) > [0x7faf748c6654] -->/lib64/libglusterfs.so.0(__gf_free+0x104) > [0x7faf82d5ac54] ) 0-: Assertion failed: GF_MEM_TRAILER_MAGIC == *(uint32_t > *)((char *)free_ptr + header->size) > [2016-11-13 18:40:42.178843] E [mem-pool.c:315:__gf_free] > (-->/usr/lib64/glusterfs/3.8.4/xlator/cluster/replicate.so(+0x2b06c) > [0x7faf748a406c] > -->/usr/lib64/glusterfs/3.8.4/xlator/cluster/replicate.so(+0x4d648) > [0x7faf748c6648] -->/lib64/libglusterfs.so.0(__gf_free+0x104) > [0x7faf82d5ac54] ) 0-: Assertion failed: GF_MEM_TRAILER_MAGIC == *(uint32_t > *)((char *)free_ptr + header->size) crash reported in bug1394717 addresses this. > > > Other messages from ganesha logs : > > Acquire/Destroy Mutex - > --------------------- > > 13/11/2016 13:16:51 : epoch 16f70000 : gqas013.sbu.lab.eng.bos.redhat.com : > ganesha.nfsd-31855[work-188] nfs_dupreq_finish :RW LOCK :CRIT :Error 22, > acquiring mutex 0x7f67c40d0980 (&dv->mtx) at > /builddir/build/BUILD/nfs-ganesha-2.4.1/src/RPCAL/nfs_dupreq.c:1085 > > > 13/11/2016 13:55:07 : epoch 0e630000 : gqas006.sbu.lab.eng.bos.redhat.com : > ganesha.nfsd-31410[work-200] nfs_dupreq_free_dupreq :RW LOCK :CRIT :Error > 16, Destroy mutex 0x7f5bf80417b0 (&dv->mtx) at > /builddir/build/BUILD/nfs-ganesha-2.4.1/src/RPCAL/nfs_dupreq.c:786 > As mentioned in https://bugzilla.redhat.com/show_bug.cgi?id=1394717#c4, these messages/failures may have been reported when the system is under high-load and the resources could have been unavailable. Request to update if these errors are seen even with limited load. > --- > > 13/11/2016 12:56:14 : epoch ccf20000 : gqas011.sbu.lab.eng.bos.redhat.com : > ganesha.nfsd-31407[main] nfs_Init_svc :DISP :CRIT :Cannot acquire > credentials for principal nfs > > --- > > Attribute Errors : > ------------------- > > 13/11/2016 13:17:13 : epoch ccf20000 : gqas011.sbu.lab.eng.bos.redhat.com : > ganesha.nfsd-31407[work-60] nfs3_FSALattr_To_Fattr :NFS3 :CRIT :Likely bug: > FSAL did not fill in a standard NFSv3 attribute: missing 1dfce > That leaves only this error message to be addressed as part of this BZ. Will update once have more details on it.
*** Bug 1394773 has been marked as a duplicate of this bug. ***
> 13/11/2016 13:17:13 : epoch ccf20000 : gqas011.sbu.lab.eng.bos.redhat.com : > ganesha.nfsd-31407[work-60] nfs3_FSALattr_To_Fattr :NFS3 :CRIT :Likely bug: > FSAL did not fill in a standard NFSv3 attribute: missing 1dfce This error seem to have been returned when underlying FSAL (in this case Gluster) has n't returned the requested attributes. From 1dfce, looks like the following attributes were missing - #define ATTR_CREATION 0x0000000000002000LL /* Metadata modification time */ #define ATTR_CTIME 0x0000000000004000LL /* data modification time */ #define ATTR_MTIME 0x0000000000008000LL /* space used by this file. */ #define ATTR_SPACEUSED 0x0000000000010000LL #define ATTR_FILEID 0x0000000000000040LL /* Access permission flag */ #define ATTR_MODE 0x0000000000000080LL /* Number of hard links */ #define ATTR_NUMLINKS 0x0000000000000100LL /* owner ID */ #define ATTR_OWNER 0x0000000000000200LL /* group ID */ #define ATTR_GROUP 0x0000000000000400LL /* ID of device for block special or character special files*/ #define ATTR_RAWDEV 0x0000000000000800LL #define ATTR_TYPE 0x0000000000000002LL /* file size */ #define ATTR_SIZE 0x0000000000000004LL /* filesystem id */ #define ATTR_FSID 0x0000000000000008LL /* file space reserve */ #define ATTR4_SPACE_RESERVED 0x0000000000000010LL Few of them are very basic attributes which one would expect any filesystem to always return. These errors were logged on node-11 and from gfapi.log , [2016-11-13 18:40:03.935733] E [mem-pool.c:315:__gf_free] (-->/usr/lib64/glusterfs/3.8.4/xlator/cluster/replicate.so(+0x2b06c) [0x7faf748a406c] -->/usr/lib64/glusterfs/3.8.4/xlator/cluster/replicate.so(+0x4d63c) [0x7faf748c663c] -->/lib64/libglusterfs.so.0(__gf_free+0x104) [0x7faf82d5ac54] ) 0-: Assertion failed: GF_MEM_TRAILER_MAGIC == *(uint32_t *)((char *)free_ptr + header->size) [2016-11-13 18:40:04.041093] E [mem-pool.c:315:__gf_free] (-->/usr/lib64/glusterfs/3.8.4/xlator/cluster/replicate.so(+0x1ed35) [0x7faf74897d35] -->/usr/lib64/glusterfs/3.8.4/xlator/cluster/replicate.so(+0x4d630) [0x7faf748c6630] -->/lib64/libglusterfs.so.0(__gf_free+0x104) [0x7faf82d5ac54] ) 0-: Assertion failed: GF_MEM_TRAILER_MAGIC == *(uint32_t *)((char *)free_ptr + header->size) [2016-11-13 18:40:04.093364] E [mem-pool.c:315:__gf_free] (-->/usr/lib64/glusterfs/3.8.4/xlator/cluster/replicate.so(+0x4d5aa) [0x7faf748c65aa] -->/usr/lib64/glusterfs/3.8.4/xlator/cluster/replicate.so(+0x4d464) [0x7faf748c6464] -->/lib64/libglusterfs.so.0(__gf_free+0x104) [0x7faf82d5ac54] ) 0-: Assertion failed: GF_MEM_TRAILER_MAGIC == *(uint32_t *)((char *)free_ptr + header->size) [2016-11-13 18:40:04.278724] E [mem-pool.c:315:__gf_free] (-->/usr/lib64/glusterfs/3.8.4/xlator/cluster/replicate.so(+0x2b06c) [0x7faf748a406c] -->/usr/lib64/glusterfs/3.8.4/xlator/cluster/replicate.so(+0x4d654) [0x7faf748c6654] -->/lib64/libglusterfs.so.0(__gf_free+0x104) [0x7faf82d5ac54] ) 0-: Assertion failed: GF_MEM_TRAILER_MAGIC == *(uint32_t *)((char *)free_ptr + header->size) [2016-11-13 18:40:06.037872] E [mem-pool.c:315:__gf_free] (-->/usr/lib64/glusterfs/3.8.4/xlator/cluster/replicate.so(+0x1ed35) [0x7faf74897d35] -->/usr/lib64/glusterfs/3.8.4/xlator/cluster/replicate.so(+0x4d654) [0x7faf748c6654] -->/lib64/libglusterfs.so.0(__gf_free+0x104) [0x7faf82d5ac54] ) 0-: Assertion failed: GF_MEM_TRAILER_MAGIC == *(uint32_t *)((char *)free_ptr + header->size) [2016-11-13 18:40:06.220952] E [mem-pool.c:315:__gf_free] (-->/usr/lib64/glusterfs/3.8.4/xlator/cluster/replicate.so(+0x4d5aa) [0x7faf748c65aa] -->/usr/lib64/glusterfs/3.8.4/xlator/cluster/replicate.so(+0x4d525) [0x7faf748c6525] -->/lib64/libglusterfs.so.0(__gf_free+0x104) [0x7faf82d5ac54] ) 0-: Assertion failed: GF_MEM_TRAILER_MAGIC == *(uint32_t *)((char *)free_ptr + header->size) [2016-11-13 18:40:29.052561] I [MSGID: 108031] [afr-common.c:2070:afr_local_discovery_cbk] 0-testvol-replicate-1: selecting local read_child testvol-client-3 [2016-11-13 18:40:41.356119] E [mem-pool.c:315:__gf_free] (-->/usr/lib64/glusterfs/3.8.4/xlator/cluster/replicate.so(+0x59a54) [0x7faf748d2a54] -->/usr/lib64/glusterfs/3.8.4/xlator/cluster/replicate.so(+0x4d654) [0x7faf748c6654] -->/lib64/libglusterfs.so.0(__gf_free+0x104) [0x7faf82d5ac54] ) 0-: Assertion failed: GF_MEM_TRAILER_MAGIC == *(uint32_t *)((char *)free_ptr + header->size) [2016-11-13 18:40:42.178843] E [mem-pool.c:315:__gf_free] (-->/usr/lib64/glusterfs/3.8.4/xlator/cluster/replicate.so(+0x2b06c) [0x7faf748a406c] -->/usr/lib64/glusterfs/3.8.4/xlator/cluster/replicate.so(+0x4d648) [0x7faf748c6648] -->/lib64/libglusterfs.so.0(__gf_free+0x104) [0x7faf82d5ac54] ) 0-: Assertion failed: GF_MEM_TRAILER_MAGIC == *(uint32_t *)((char *)free_ptr + header->size) [2016-11-13 18:40:42.196639] E [mem-pool.c:315:__gf_free] (-->/lib64/libglusterfs.so.0(dict_destroy+0x50) [0x7faf82d2cf00] -->/lib64/libglusterfs.so.0(data_destroy+0x5d) [0x7faf82d2c51d] -->/lib64/libglusterfs.so.0(__gf_free+0x104) [0x7faf82d5ac54] ) 0-: Assertion failed: GF_MEM_TRAILER_MAGIC == *(uint32_t *)((char *)free_ptr + header->size) the crash happened at about sametime (bug1394717). Maybe the system went into inconsistent state leading to gfapi/gluser not returning these attributes. Probably, as mentioned in the bug1394717, once we get a consistent reproducer, we could collect tcpdump and monitor the traffic.
> > EBADFD Error message from ganesha-gfapi/brick logs from one of the nodes* : > > [2016-11-13 18:12:07.130327] W [MSGID: 114031] > [client-rpc-fops.c:855:client3_3_writev_cbk] 0-testvol-client-3: remote > operation failed [Bad file descriptor] > [2016-11-13 18:12:07.130408] W [MSGID: 114031] > [client-rpc-fops.c:855:client3_3_writev_cbk] 0-testvol-client-2: remote > operation failed [Bad file descriptor] Even wrt to these errors, we could collect tcpdump and verify if the fd (being quoted as BADFD) is returned by the server or not. If yes, we can check if it was closed by any chance before WRITE is sent.
The needinfo request[s] on this closed bug have been removed as they have been unresolved for 1000 days