Bug 1394702
| Summary: | [Perf] : EBADFD , Errors while acquiring/destroying mutex , Assertion failures and Attribute errors while I/O is run from various clients on Ganesha mounts. | |||
|---|---|---|---|---|
| Product: | [Red Hat Storage] Red Hat Gluster Storage | Reporter: | Ambarish <asoman> | |
| Component: | nfs-ganesha | Assignee: | Kaleb KEITHLEY <kkeithle> | |
| Status: | CLOSED NEXTRELEASE | QA Contact: | Ambarish <asoman> | |
| Severity: | high | Docs Contact: | ||
| Priority: | unspecified | |||
| Version: | rhgs-3.2 | CC: | asoman, bturner, rhinduja, rhs-bugs, skoduri, storage-qa-internal | |
| Target Milestone: | --- | Keywords: | ZStream | |
| Target Release: | --- | |||
| Hardware: | x86_64 | |||
| OS: | Linux | |||
| Whiteboard: | ||||
| Fixed In Version: | rhgs-3.3.0 | Doc Type: | If docs needed, set a value | |
| Doc Text: | Story Points: | --- | ||
| Clone Of: | ||||
| : | 1394773 (view as bug list) | Environment: | ||
| Last Closed: | 2017-08-23 12:39:02 UTC | Type: | Bug | |
| Regression: | --- | Mount Type: | --- | |
| Documentation: | --- | CRM: | ||
| Verified Versions: | Category: | --- | ||
| oVirt Team: | --- | RHEL 7.3 requirements from Atomic Host: | ||
| Cloudforms Team: | --- | Target Upstream Version: | ||
| Embargoed: | ||||
| Bug Depends On: | 1394773 | |||
| Bug Blocks: | ||||
|
Description
Ambarish
2016-11-14 09:43:34 UTC
> > Ganesha crashed on 4/4 nodes(Separate BZ). This is being tracker as part of bug1394717. > > > EBADFD Error message from ganesha-gfapi/brick logs from one of the nodes* : > > [2016-11-13 18:12:07.130327] W [MSGID: 114031] > [client-rpc-fops.c:855:client3_3_writev_cbk] 0-testvol-client-3: remote > operation failed [Bad file descriptor] > [2016-11-13 18:12:07.130408] W [MSGID: 114031] > [client-rpc-fops.c:855:client3_3_writev_cbk] 0-testvol-client-2: remote > operation failed [Bad file descriptor] These errors were returned from POSIX layer. bug1394773 tracks it. > > bricks/bricks-testvol_brick3.log:[2016-11-13 18:12:07.130134] E [MSGID: > 113072] [posix.c:3380:posix_writev] 0-testvol-posix: write failed: offset 0, > [Bad file descriptor] > bricks/bricks-testvol_brick3.log:[2016-11-13 18:12:07.130216] E [MSGID: > 115067] [server-rpc-fops.c:1348:server_writev_cbk] 0-testvol-server: 683223: > WRITEV 10073 (3bc2aeb8-d197-4f9b-b6d2-c10588e17ba5) ==> (Bad file > descriptor) [Bad file descriptor] > > > > Assertion failure error messages from gfapi log on another server : > > [2016-11-13 18:40:41.356119] E [mem-pool.c:315:__gf_free] > (-->/usr/lib64/glusterfs/3.8.4/xlator/cluster/replicate.so(+0x59a54) > [0x7faf748d2a54] > -->/usr/lib64/glusterfs/3.8.4/xlator/cluster/replicate.so(+0x4d654) > [0x7faf748c6654] -->/lib64/libglusterfs.so.0(__gf_free+0x104) > [0x7faf82d5ac54] ) 0-: Assertion failed: GF_MEM_TRAILER_MAGIC == *(uint32_t > *)((char *)free_ptr + header->size) > [2016-11-13 18:40:42.178843] E [mem-pool.c:315:__gf_free] > (-->/usr/lib64/glusterfs/3.8.4/xlator/cluster/replicate.so(+0x2b06c) > [0x7faf748a406c] > -->/usr/lib64/glusterfs/3.8.4/xlator/cluster/replicate.so(+0x4d648) > [0x7faf748c6648] -->/lib64/libglusterfs.so.0(__gf_free+0x104) > [0x7faf82d5ac54] ) 0-: Assertion failed: GF_MEM_TRAILER_MAGIC == *(uint32_t > *)((char *)free_ptr + header->size) crash reported in bug1394717 addresses this. > > > Other messages from ganesha logs : > > Acquire/Destroy Mutex - > --------------------- > > 13/11/2016 13:16:51 : epoch 16f70000 : gqas013.sbu.lab.eng.bos.redhat.com : > ganesha.nfsd-31855[work-188] nfs_dupreq_finish :RW LOCK :CRIT :Error 22, > acquiring mutex 0x7f67c40d0980 (&dv->mtx) at > /builddir/build/BUILD/nfs-ganesha-2.4.1/src/RPCAL/nfs_dupreq.c:1085 > > > 13/11/2016 13:55:07 : epoch 0e630000 : gqas006.sbu.lab.eng.bos.redhat.com : > ganesha.nfsd-31410[work-200] nfs_dupreq_free_dupreq :RW LOCK :CRIT :Error > 16, Destroy mutex 0x7f5bf80417b0 (&dv->mtx) at > /builddir/build/BUILD/nfs-ganesha-2.4.1/src/RPCAL/nfs_dupreq.c:786 > As mentioned in https://bugzilla.redhat.com/show_bug.cgi?id=1394717#c4, these messages/failures may have been reported when the system is under high-load and the resources could have been unavailable. Request to update if these errors are seen even with limited load. > --- > > 13/11/2016 12:56:14 : epoch ccf20000 : gqas011.sbu.lab.eng.bos.redhat.com : > ganesha.nfsd-31407[main] nfs_Init_svc :DISP :CRIT :Cannot acquire > credentials for principal nfs > > --- > > Attribute Errors : > ------------------- > > 13/11/2016 13:17:13 : epoch ccf20000 : gqas011.sbu.lab.eng.bos.redhat.com : > ganesha.nfsd-31407[work-60] nfs3_FSALattr_To_Fattr :NFS3 :CRIT :Likely bug: > FSAL did not fill in a standard NFSv3 attribute: missing 1dfce > That leaves only this error message to be addressed as part of this BZ. Will update once have more details on it. *** Bug 1394773 has been marked as a duplicate of this bug. *** > 13/11/2016 13:17:13 : epoch ccf20000 : gqas011.sbu.lab.eng.bos.redhat.com : > ganesha.nfsd-31407[work-60] nfs3_FSALattr_To_Fattr :NFS3 :CRIT :Likely bug: > FSAL did not fill in a standard NFSv3 attribute: missing 1dfce This error seem to have been returned when underlying FSAL (in this case Gluster) has n't returned the requested attributes. From 1dfce, looks like the following attributes were missing - #define ATTR_CREATION 0x0000000000002000LL /* Metadata modification time */ #define ATTR_CTIME 0x0000000000004000LL /* data modification time */ #define ATTR_MTIME 0x0000000000008000LL /* space used by this file. */ #define ATTR_SPACEUSED 0x0000000000010000LL #define ATTR_FILEID 0x0000000000000040LL /* Access permission flag */ #define ATTR_MODE 0x0000000000000080LL /* Number of hard links */ #define ATTR_NUMLINKS 0x0000000000000100LL /* owner ID */ #define ATTR_OWNER 0x0000000000000200LL /* group ID */ #define ATTR_GROUP 0x0000000000000400LL /* ID of device for block special or character special files*/ #define ATTR_RAWDEV 0x0000000000000800LL #define ATTR_TYPE 0x0000000000000002LL /* file size */ #define ATTR_SIZE 0x0000000000000004LL /* filesystem id */ #define ATTR_FSID 0x0000000000000008LL /* file space reserve */ #define ATTR4_SPACE_RESERVED 0x0000000000000010LL Few of them are very basic attributes which one would expect any filesystem to always return. These errors were logged on node-11 and from gfapi.log , [2016-11-13 18:40:03.935733] E [mem-pool.c:315:__gf_free] (-->/usr/lib64/glusterfs/3.8.4/xlator/cluster/replicate.so(+0x2b06c) [0x7faf748a406c] -->/usr/lib64/glusterfs/3.8.4/xlator/cluster/replicate.so(+0x4d63c) [0x7faf748c663c] -->/lib64/libglusterfs.so.0(__gf_free+0x104) [0x7faf82d5ac54] ) 0-: Assertion failed: GF_MEM_TRAILER_MAGIC == *(uint32_t *)((char *)free_ptr + header->size) [2016-11-13 18:40:04.041093] E [mem-pool.c:315:__gf_free] (-->/usr/lib64/glusterfs/3.8.4/xlator/cluster/replicate.so(+0x1ed35) [0x7faf74897d35] -->/usr/lib64/glusterfs/3.8.4/xlator/cluster/replicate.so(+0x4d630) [0x7faf748c6630] -->/lib64/libglusterfs.so.0(__gf_free+0x104) [0x7faf82d5ac54] ) 0-: Assertion failed: GF_MEM_TRAILER_MAGIC == *(uint32_t *)((char *)free_ptr + header->size) [2016-11-13 18:40:04.093364] E [mem-pool.c:315:__gf_free] (-->/usr/lib64/glusterfs/3.8.4/xlator/cluster/replicate.so(+0x4d5aa) [0x7faf748c65aa] -->/usr/lib64/glusterfs/3.8.4/xlator/cluster/replicate.so(+0x4d464) [0x7faf748c6464] -->/lib64/libglusterfs.so.0(__gf_free+0x104) [0x7faf82d5ac54] ) 0-: Assertion failed: GF_MEM_TRAILER_MAGIC == *(uint32_t *)((char *)free_ptr + header->size) [2016-11-13 18:40:04.278724] E [mem-pool.c:315:__gf_free] (-->/usr/lib64/glusterfs/3.8.4/xlator/cluster/replicate.so(+0x2b06c) [0x7faf748a406c] -->/usr/lib64/glusterfs/3.8.4/xlator/cluster/replicate.so(+0x4d654) [0x7faf748c6654] -->/lib64/libglusterfs.so.0(__gf_free+0x104) [0x7faf82d5ac54] ) 0-: Assertion failed: GF_MEM_TRAILER_MAGIC == *(uint32_t *)((char *)free_ptr + header->size) [2016-11-13 18:40:06.037872] E [mem-pool.c:315:__gf_free] (-->/usr/lib64/glusterfs/3.8.4/xlator/cluster/replicate.so(+0x1ed35) [0x7faf74897d35] -->/usr/lib64/glusterfs/3.8.4/xlator/cluster/replicate.so(+0x4d654) [0x7faf748c6654] -->/lib64/libglusterfs.so.0(__gf_free+0x104) [0x7faf82d5ac54] ) 0-: Assertion failed: GF_MEM_TRAILER_MAGIC == *(uint32_t *)((char *)free_ptr + header->size) [2016-11-13 18:40:06.220952] E [mem-pool.c:315:__gf_free] (-->/usr/lib64/glusterfs/3.8.4/xlator/cluster/replicate.so(+0x4d5aa) [0x7faf748c65aa] -->/usr/lib64/glusterfs/3.8.4/xlator/cluster/replicate.so(+0x4d525) [0x7faf748c6525] -->/lib64/libglusterfs.so.0(__gf_free+0x104) [0x7faf82d5ac54] ) 0-: Assertion failed: GF_MEM_TRAILER_MAGIC == *(uint32_t *)((char *)free_ptr + header->size) [2016-11-13 18:40:29.052561] I [MSGID: 108031] [afr-common.c:2070:afr_local_discovery_cbk] 0-testvol-replicate-1: selecting local read_child testvol-client-3 [2016-11-13 18:40:41.356119] E [mem-pool.c:315:__gf_free] (-->/usr/lib64/glusterfs/3.8.4/xlator/cluster/replicate.so(+0x59a54) [0x7faf748d2a54] -->/usr/lib64/glusterfs/3.8.4/xlator/cluster/replicate.so(+0x4d654) [0x7faf748c6654] -->/lib64/libglusterfs.so.0(__gf_free+0x104) [0x7faf82d5ac54] ) 0-: Assertion failed: GF_MEM_TRAILER_MAGIC == *(uint32_t *)((char *)free_ptr + header->size) [2016-11-13 18:40:42.178843] E [mem-pool.c:315:__gf_free] (-->/usr/lib64/glusterfs/3.8.4/xlator/cluster/replicate.so(+0x2b06c) [0x7faf748a406c] -->/usr/lib64/glusterfs/3.8.4/xlator/cluster/replicate.so(+0x4d648) [0x7faf748c6648] -->/lib64/libglusterfs.so.0(__gf_free+0x104) [0x7faf82d5ac54] ) 0-: Assertion failed: GF_MEM_TRAILER_MAGIC == *(uint32_t *)((char *)free_ptr + header->size) [2016-11-13 18:40:42.196639] E [mem-pool.c:315:__gf_free] (-->/lib64/libglusterfs.so.0(dict_destroy+0x50) [0x7faf82d2cf00] -->/lib64/libglusterfs.so.0(data_destroy+0x5d) [0x7faf82d2c51d] -->/lib64/libglusterfs.so.0(__gf_free+0x104) [0x7faf82d5ac54] ) 0-: Assertion failed: GF_MEM_TRAILER_MAGIC == *(uint32_t *)((char *)free_ptr + header->size) the crash happened at about sametime (bug1394717). Maybe the system went into inconsistent state leading to gfapi/gluser not returning these attributes. Probably, as mentioned in the bug1394717, once we get a consistent reproducer, we could collect tcpdump and monitor the traffic. > > EBADFD Error message from ganesha-gfapi/brick logs from one of the nodes* :
>
> [2016-11-13 18:12:07.130327] W [MSGID: 114031]
> [client-rpc-fops.c:855:client3_3_writev_cbk] 0-testvol-client-3: remote
> operation failed [Bad file descriptor]
> [2016-11-13 18:12:07.130408] W [MSGID: 114031]
> [client-rpc-fops.c:855:client3_3_writev_cbk] 0-testvol-client-2: remote
> operation failed [Bad file descriptor]
Even wrt to these errors, we could collect tcpdump and verify if the fd (being quoted as BADFD) is returned by the server or not. If yes, we can check if it was closed by any chance before WRITE is sent.
The needinfo request[s] on this closed bug have been removed as they have been unresolved for 1000 days |