+++ This bug was initially created as a clone of Bug #1717757 +++ Description of problem: Setup: 2-Node VM Cluster with a Replica 2 Volume After doing several "wild" write and delete operations from a Win Client, one of the brick crashes. The crash report says: [2019-06-05 09:05:05.137156] I [MSGID: 139001] [posix-acl.c:263:posix_acl_log_permit_denied] 0-archive1-access-control: client: CTX_ID:fcab5e67-b9d9-4b72-8c15-f29de2084af3-GRAPH_ID:0-PID:189 16-HOST:fs-detlefh-c1-n2-PC_NAME:archive1-client-0-RECON_NO:-0, gfid: 494b42ad-7e40-4e27-8878-99387a80b5dc, req(uid:2000,gid:2000,perm:3,ngrps:1), ctx(uid:0,gid:0,in-groups:0,perm:755,update d-fop:LOOKUP, acl:-) [Permission denied] pending frames: frame : type(0) op(0) frame : type(0) op(23) patchset: git://git.gluster.org/glusterfs.git signal received: 11 time of crash: 2019-06-05 09:05:05 configuration details: argp 1 backtrace 1 dlfcn 1 libpthread 1 llistxattr 1 setfsid 1 spinlock 1 epoll.h 1 xattr.h 1 st_atim.tv_nsec 1 package-string: glusterfs 5.5 /usr/lib64/libglusterfs.so.0(+0x2764c)[0x7f89faa7264c] /usr/lib64/libglusterfs.so.0(gf_print_trace+0x306)[0x7f89faa7cd26] /lib64/libc.so.6(+0x361a0)[0x7f89f9c391a0] /usr/lib64/glusterfs/5.5/xlator/features/bitrot-stub.so(+0x13441)[0x7f89f22ae441] /usr/lib64/libglusterfs.so.0(default_fsetxattr+0xce)[0x7f89faaf9f8e] /usr/lib64/glusterfs/5.5/xlator/features/locks.so(+0x22636)[0x7f89f1e68636] /usr/lib64/libglusterfs.so.0(default_fsetxattr+0xce)[0x7f89faaf9f8e] /usr/lib64/libglusterfs.so.0(syncop_fsetxattr+0x26b)[0x7f89faab319b] /usr/lib64/glusterfs/5.5/xlator/features/worm.so(+0xa901)[0x7f89f1c3d901] /usr/lib64/glusterfs/5.5/xlator/features/locks.so(+0x11b66)[0x7f89f1e57b66] /usr/lib64/glusterfs/5.5/xlator/features/access-control.so(+0xaebe)[0x7f89f208febe] /usr/lib64/glusterfs/5.5/xlator/features/locks.so(+0xb081)[0x7f89f1e51081] /usr/lib64/glusterfs/5.5/xlator/features/worm.so(+0x8c23)[0x7f89f1c3bc23] /usr/lib64/glusterfs/5.5/xlator/features/read-only.so(+0x4e30)[0x7f89f1a2de30] /usr/lib64/glusterfs/5.5/xlator/features/leases.so(+0xa444)[0x7f89f181b444] /usr/lib64/glusterfs/5.5/xlator/features/upcall.so(+0x10a68)[0x7f89f1600a68] /usr/lib64/libglusterfs.so.0(default_create_resume+0x212)[0x7f89fab10132] /usr/lib64/libglusterfs.so.0(call_resume_wind+0x2cf)[0x7f89faa97e5f] /usr/lib64/libglusterfs.so.0(call_resume+0x75)[0x7f89faa983a5] /usr/lib64/glusterfs/5.5/xlator/performance/io-threads.so(+0x6088)[0x7f89f13e7088] /lib64/libpthread.so.0(+0x7569)[0x7f89f9fc4569] /lib64/libc.so.6(clone+0x3f)[0x7f89f9cfb9af] --------- Version-Release number of selected component (if applicable): v5.5 Additional info: The backtrace shows that there is a Nulllpointer for *fd in br_stub_fsetxattr: Thread 1 (Thread 0x7f89f0099700 (LWP 2171)): #0 br_stub_fsetxattr (frame=0x7f89b846a6e8, this=0x7f89ec015c00, fd=0x0, dict=0x7f89b84e9ad8, flags=0, xdata=0x0) at bit-rot-stub.c:1328 ret = 0 val = 0 sign = 0x0 priv = 0x7f89ec07ed60 op_errno = 22 __FUNCTION__ = "br_stub_fsetxattr" This results in a segmentation fault in line 1328 of bit-rot_stub.c : if (!IA_ISREG(fd->inode->ia_type)) goto wind; The bitrot-stub wants to signate a file but the corresponding fd is a Nullpointer. The full backtrace is attached!!! --- Additional comment from Amar Tumballi on 2019-06-06 06:57:25 UTC --- Not sure why this happened, because, for bitrot, a fsetxattr() call shouldn't come at all if fd is NULL. It should have been prevented at higher level itself. I found the reason after digging a bit. Ideally, in case of failure (here, worm_create_cbk() received -1, which means fd is NULL), one shouldn't consume fd and call fsetxattr(). If there is a need to do a xattr op in failure, then one should call setxattr with 'loc' passed in create() call. (you can store it in local). ---- #0 br_stub_fsetxattr (frame=0x7f89b846a6e8, this=0x7f89ec015c00, fd=0x0, dict=0x7f89b84e9ad8, flags=0, xdata=0x0) at bit-rot-stub.c:1328 ret = 0 val = 0 sign = 0x0 priv = 0x7f89ec07ed60 op_errno = 22 __FUNCTION__ = "br_stub_fsetxattr" #1 0x00007f89faaf9f8e in default_fsetxattr () from /usr/lib64/libglusterfs.so.0 No symbol table info available. #2 0x00007f89f1e68636 in pl_fsetxattr (frame=0x7f89b825ab48, this=0x7f89ec0194a0, fd=0x0, dict=0x7f89b84e9ad8, flags=0, xdata=0x0) at posix.c:1566 _new = 0x7f89b846a6e8 old_THIS = 0x7f89ec0194a0 next_xl_fn = 0x7f89faaf9ec0 <default_fsetxattr> tmp_cbk = 0x7f89f1e56680 <pl_fsetxattr_cbk> op_ret = <optimized out> op_errno = 0 lockinfo_buf = 0x0 len = 0 __FUNCTION__ = "pl_fsetxattr" #3 0x00007f89faaf9f8e in default_fsetxattr () from /usr/lib64/libglusterfs.so.0 No symbol table info available. #4 0x00007f89faab319b in syncop_fsetxattr () from /usr/lib64/libglusterfs.so.0 No symbol table info available. #5 0x00007f89f1c3d901 in worm_create_cbk (frame=frame@entry=0x7f89b8302fe8, cookie=<optimized out>, this=<optimized out>, op_ret=op_ret@entry=-1, op_errno=op_errno@entry=13, fd=fd@entry=0x0, inode=0x0, buf=0x0, preparent=0x0, postparent=0x0, xdata=0x0) at worm.c:492 ret = 0 priv = 0x7f89ec074b38 dict = 0x7f89b84e9ad8 __FUNCTION__ = "worm_create_cbk" ---- Hopefully this helps. --- Additional comment from Amar Tumballi on 2019-06-06 06:59:29 UTC --- Can you check if below works? diff --git a/xlators/features/read-only/src/worm.c b/xlators/features/read-only/src/worm.c index cc3d15b8b2..6b44eae966 100644 --- a/xlators/features/read-only/src/worm.c +++ b/xlators/features/read-only/src/worm.c @@ -431,7 +431,7 @@ worm_create_cbk(call_frame_t *frame, void *cookie, xlator_t *this, priv = this->private; GF_ASSERT(priv); - if (priv->worm_file) { + if (priv->worm_file && (op_ret >= 0)) { dict = dict_new(); if (!dict) { gf_log(this->name, GF_LOG_ERROR, ---- Great if you can confirm this. --- Additional comment from on 2019-06-06 07:08:49 UTC --- I will check it! --- Additional comment from on 2019-06-07 08:30:08 UTC --- @Amar I wrote a patch with debug logs and I will observe the bricks now. During this time I have some questions concerning your patch suggestion: 1. According to crash report from the brick locks, there was a failure in [2019-06-05 09:05:05.137156] I [MSGID: 139001] [posix-acl.c:263:posix_acl_log_permit_denied] 0-archive1-access-control: client: CTX_ID:fcab5e67-b9d9-4b72-8c15-f29de2084af3-GRAPH_ID:0-PID:189 16-HOST:fs-detlefh-c1-n2-PC_NAME:archive1-client-0-RECON_NO:-0, gfid: 494b42ad-7e40-4e27-8878-99387a80b5dc, req(uid:2000,gid:2000,perm:3,ngrps:1), ctx(uid:0,gid:0,in-groups:0,perm:755,update d-fop:LOOKUP, acl:-) [Permission denied] just before the crash. What can be the reason for this? 2. If this LOOKUP for acls fails, is it problematic to do a setxattr with loc? If we skip setting xattr when fd is NULL the file on that brick won't have the necessary xattr like trusted.worm_file and other. See an example directly after the crash: # file: gluster/brick3/glusterbrick/test/data/BC/storage.log trusted.gfid=0sag3y6RuoTgqAw//fx3ZB1Q== trusted.gfid2path.273f2255a25b2961="bd910b86-d51a-4006-a2c4-515ef5f1777a/storage.log" trusted.pgfid.bd910b86-d51a-4006-a2c4-515ef5f1777a=0sAAAAAQ== On the healthy brick I got: # file: gluster/brick3/glusterbrick/test/data/BC/storage.log trusted.afr.dirty=0sAAAAAAAAAAAAAAAA trusted.afr.test-client-0=0sAAAABAAAAAMAAAAA trusted.bit-rot.version=0sAgAAAAAAAABc+P64AAEhGQ== trusted.gfid=0sag3y6RuoTgqAw//fx3ZB1Q== trusted.gfid2path.273f2255a25b2961="bd910b86-d51a-4006-a2c4-515ef5f1777a/storage.log" trusted.glusterfs.mdata=0sAQAAAAAAAAAAAAAAAFz5AJEAAAAAMqdgMwAAAABcRwJEAAAAAAAAAAAAAAAAXPkAkQAAAAAAAAAA trusted.pgfid.bd910b86-d51a-4006-a2c4-515ef5f1777a=0sAAAAAQ== trusted.start_time="1559822481" trusted.worm_file=0sMQA= After restarting the faulty brick a heal was triggered and afterwards the file on the faulty brick is heal.It should be ensured that the broken file gets all necessary xattr. What is the better way? Triggering a setxattr with loc in worm_create_cbk or do a heal afterwards? --- Additional comment from Amar Tumballi on 2019-06-07 08:35:33 UTC --- 1. permission denied is mostly probably a issue of missing permission (uid 2000, trying to create an entry in a directory with 755, owned by uid-0 (root)). 2. I think it is better to leave it to heal. If it is a create failure, we should anyways fail the operation is my opinion. --- Additional comment from on 2019-06-07 09:57:15 UTC --- Allrigth, I will stress the system for a while and if everything is stable I will commit the patch to gerrit --- Additional comment from Amar Tumballi on 2019-06-18 13:38:51 UTC --- Looks like we need to check for 'op_ret' in most of the places in WORM code. --- Additional comment from Worker Ant on 2019-06-19 11:10:56 UTC --- REVIEW: https://review.gluster.org/22898 (WORM-Xlator: Avoid performing fsetxattr if fd is NULL) posted (#1) for review on master by David Spisla --- Additional comment from on 2019-06-19 11:26:35 UTC --- I have send a patch to gerrit. @Amar if there is any other place in the WORM Xlator whcih can cause an segfault please tell me. I will swrite some patches soon. At the moment the worm_create_cbk is the only one callback function in this xlator --- Additional comment from Worker Ant on 2019-06-21 04:21:26 UTC --- REVIEW: https://review.gluster.org/22898 (WORM-Xlator: Avoid performing fsetxattr if fd is NULL) merged (#5) on master by Amar Tumballi
Above patch is merged to master. Do you plan to backport it?
I think it should be backported to v5.x
https://review.gluster.org/#/c/glusterfs/+/22915/
This bug is moved to https://github.com/gluster/glusterfs/issues/1062, and will be tracked there from now on. Visit GitHub issues URL for further details