Bug 1722709

Summary: WORM: Segmentation Fault if bitrot stub do signature (backport to Gluster 6)
Product: [Community] GlusterFS Reporter: david.spisla
Component: bitrotAssignee: david.spisla
Status: CLOSED CURRENTRELEASE QA Contact:
Severity: high Docs Contact:
Priority: medium    
Version: 6CC: bugs, pasik, sunkumar
Target Milestone: ---   
Target Release: ---   
Hardware: Unspecified   
OS: Unspecified   
Whiteboard:
Fixed In Version: glusterfs-7.0 Doc Type: If docs needed, set a value
Doc Text:
Story Points: ---
Clone Of: 1717757 Environment:
Last Closed: 2020-02-11 14:10:38 UTC Type: ---
Regression: --- Mount Type: ---
Documentation: --- CRM:
Verified Versions: Category: ---
oVirt Team: --- RHEL 7.3 requirements from Atomic Host:
Cloudforms Team: --- Target Upstream Version:
Embargoed:

Description david.spisla 2019-06-21 05:17:13 UTC
+++ This bug was initially created as a clone of Bug #1717757 +++

Description of problem:

Setup: 2-Node VM Cluster with a Replica 2 Volume
After doing several "wild" write and delete operations from a Win Client, one of the brick crashes. The crash report says:

[2019-06-05 09:05:05.137156] I [MSGID: 139001] [posix-acl.c:263:posix_acl_log_permit_denied] 0-archive1-access-control: client: CTX_ID:fcab5e67-b9d9-4b72-8c15-f29de2084af3-GRAPH_ID:0-PID:189
16-HOST:fs-detlefh-c1-n2-PC_NAME:archive1-client-0-RECON_NO:-0, gfid: 494b42ad-7e40-4e27-8878-99387a80b5dc, req(uid:2000,gid:2000,perm:3,ngrps:1), ctx(uid:0,gid:0,in-groups:0,perm:755,update
d-fop:LOOKUP, acl:-) [Permission denied]
pending frames:
frame : type(0) op(0)
frame : type(0) op(23)
patchset: git://git.gluster.org/glusterfs.git
signal received: 11
time of crash: 
2019-06-05 09:05:05
configuration details:
argp 1
backtrace 1
dlfcn 1
libpthread 1
llistxattr 1
setfsid 1
spinlock 1
epoll.h 1
xattr.h 1
st_atim.tv_nsec 1
package-string: glusterfs 5.5
/usr/lib64/libglusterfs.so.0(+0x2764c)[0x7f89faa7264c]
/usr/lib64/libglusterfs.so.0(gf_print_trace+0x306)[0x7f89faa7cd26]
/lib64/libc.so.6(+0x361a0)[0x7f89f9c391a0]
/usr/lib64/glusterfs/5.5/xlator/features/bitrot-stub.so(+0x13441)[0x7f89f22ae441]
/usr/lib64/libglusterfs.so.0(default_fsetxattr+0xce)[0x7f89faaf9f8e]
/usr/lib64/glusterfs/5.5/xlator/features/locks.so(+0x22636)[0x7f89f1e68636]
/usr/lib64/libglusterfs.so.0(default_fsetxattr+0xce)[0x7f89faaf9f8e]
/usr/lib64/libglusterfs.so.0(syncop_fsetxattr+0x26b)[0x7f89faab319b]
/usr/lib64/glusterfs/5.5/xlator/features/worm.so(+0xa901)[0x7f89f1c3d901]
/usr/lib64/glusterfs/5.5/xlator/features/locks.so(+0x11b66)[0x7f89f1e57b66]
/usr/lib64/glusterfs/5.5/xlator/features/access-control.so(+0xaebe)[0x7f89f208febe]
/usr/lib64/glusterfs/5.5/xlator/features/locks.so(+0xb081)[0x7f89f1e51081]
/usr/lib64/glusterfs/5.5/xlator/features/worm.so(+0x8c23)[0x7f89f1c3bc23]
/usr/lib64/glusterfs/5.5/xlator/features/read-only.so(+0x4e30)[0x7f89f1a2de30]
/usr/lib64/glusterfs/5.5/xlator/features/leases.so(+0xa444)[0x7f89f181b444]
/usr/lib64/glusterfs/5.5/xlator/features/upcall.so(+0x10a68)[0x7f89f1600a68]
/usr/lib64/libglusterfs.so.0(default_create_resume+0x212)[0x7f89fab10132]
/usr/lib64/libglusterfs.so.0(call_resume_wind+0x2cf)[0x7f89faa97e5f]
/usr/lib64/libglusterfs.so.0(call_resume+0x75)[0x7f89faa983a5]
/usr/lib64/glusterfs/5.5/xlator/performance/io-threads.so(+0x6088)[0x7f89f13e7088]
/lib64/libpthread.so.0(+0x7569)[0x7f89f9fc4569]
/lib64/libc.so.6(clone+0x3f)[0x7f89f9cfb9af]
---------

Version-Release number of selected component (if applicable): v5.5


Additional info:

The backtrace shows that there is a Nulllpointer for *fd in br_stub_fsetxattr:
Thread 1 (Thread 0x7f89f0099700 (LWP 2171)):
#0  br_stub_fsetxattr (frame=0x7f89b846a6e8, this=0x7f89ec015c00, fd=0x0, dict=0x7f89b84e9ad8, flags=0, xdata=0x0) at bit-rot-stub.c:1328
        ret = 0
        val = 0
        sign = 0x0
        priv = 0x7f89ec07ed60
        op_errno = 22
        __FUNCTION__ = "br_stub_fsetxattr"

This results in a segmentation fault in line 1328 of bit-rot_stub.c :

if (!IA_ISREG(fd->inode->ia_type))
        goto wind;

The bitrot-stub wants to signate a file but the corresponding fd is a Nullpointer.

The full backtrace is attached!!!

--- Additional comment from Amar Tumballi on 2019-06-06 06:57:25 UTC ---

Not sure why this happened, because, for bitrot, a fsetxattr() call shouldn't come at all if fd is NULL. It should have been prevented at higher level itself.


I found the reason after digging a bit. Ideally, in case of failure (here, worm_create_cbk() received -1, which means fd is NULL), one shouldn't consume fd and call fsetxattr(). If there is a need to do a xattr op in failure, then one should call setxattr with 'loc' passed in create() call. (you can store it in local).

----
#0  br_stub_fsetxattr (frame=0x7f89b846a6e8, this=0x7f89ec015c00, fd=0x0, dict=0x7f89b84e9ad8, flags=0, xdata=0x0) at bit-rot-stub.c:1328
        ret = 0
        val = 0
        sign = 0x0
        priv = 0x7f89ec07ed60
        op_errno = 22
        __FUNCTION__ = "br_stub_fsetxattr"
#1  0x00007f89faaf9f8e in default_fsetxattr () from /usr/lib64/libglusterfs.so.0
No symbol table info available.
#2  0x00007f89f1e68636 in pl_fsetxattr (frame=0x7f89b825ab48, this=0x7f89ec0194a0, fd=0x0, dict=0x7f89b84e9ad8, flags=0, xdata=0x0) at posix.c:1566
        _new = 0x7f89b846a6e8
        old_THIS = 0x7f89ec0194a0
        next_xl_fn = 0x7f89faaf9ec0 <default_fsetxattr>
        tmp_cbk = 0x7f89f1e56680 <pl_fsetxattr_cbk>
        op_ret = <optimized out>
        op_errno = 0
        lockinfo_buf = 0x0
        len = 0
        __FUNCTION__ = "pl_fsetxattr"
#3  0x00007f89faaf9f8e in default_fsetxattr () from /usr/lib64/libglusterfs.so.0
No symbol table info available.
#4  0x00007f89faab319b in syncop_fsetxattr () from /usr/lib64/libglusterfs.so.0
No symbol table info available.
#5  0x00007f89f1c3d901 in worm_create_cbk (frame=frame@entry=0x7f89b8302fe8, cookie=<optimized out>, this=<optimized out>, op_ret=op_ret@entry=-1, op_errno=op_errno@entry=13, 
    fd=fd@entry=0x0, inode=0x0, buf=0x0, preparent=0x0, postparent=0x0, xdata=0x0) at worm.c:492
        ret = 0
        priv = 0x7f89ec074b38
        dict = 0x7f89b84e9ad8
        __FUNCTION__ = "worm_create_cbk"
----

Hopefully this helps.

--- Additional comment from Amar Tumballi on 2019-06-06 06:59:29 UTC ---

Can you check if below works?

diff --git a/xlators/features/read-only/src/worm.c b/xlators/features/read-only/src/worm.c
index cc3d15b8b2..6b44eae966 100644
--- a/xlators/features/read-only/src/worm.c
+++ b/xlators/features/read-only/src/worm.c
@@ -431,7 +431,7 @@ worm_create_cbk(call_frame_t *frame, void *cookie, xlator_t *this,
 
     priv = this->private;
     GF_ASSERT(priv);
-    if (priv->worm_file) {
+    if (priv->worm_file && (op_ret >= 0)) {
         dict = dict_new();
         if (!dict) {
             gf_log(this->name, GF_LOG_ERROR,


----

Great if you can confirm this.

--- Additional comment from  on 2019-06-06 07:08:49 UTC ---

I will check it!

--- Additional comment from  on 2019-06-07 08:30:08 UTC ---

@Amar

I wrote a patch with debug logs and I will observe the bricks now. During this time I have some questions concerning your patch suggestion:

1. According to crash report from the brick locks, there was a failure in

[2019-06-05 09:05:05.137156] I [MSGID: 139001] [posix-acl.c:263:posix_acl_log_permit_denied] 0-archive1-access-control: client: CTX_ID:fcab5e67-b9d9-4b72-8c15-f29de2084af3-GRAPH_ID:0-PID:189
16-HOST:fs-detlefh-c1-n2-PC_NAME:archive1-client-0-RECON_NO:-0, gfid: 494b42ad-7e40-4e27-8878-99387a80b5dc, req(uid:2000,gid:2000,perm:3,ngrps:1), ctx(uid:0,gid:0,in-groups:0,perm:755,update
d-fop:LOOKUP, acl:-) [Permission denied]

just before the crash. What can be the reason for this?

2. If this LOOKUP for acls fails, is it problematic to do a setxattr with loc? If we skip setting xattr when fd is NULL the file on that brick won't have the necessary xattr like trusted.worm_file and other. See an example directly after the crash:
# file: gluster/brick3/glusterbrick/test/data/BC/storage.log
trusted.gfid=0sag3y6RuoTgqAw//fx3ZB1Q==
trusted.gfid2path.273f2255a25b2961="bd910b86-d51a-4006-a2c4-515ef5f1777a/storage.log"
trusted.pgfid.bd910b86-d51a-4006-a2c4-515ef5f1777a=0sAAAAAQ==

On the healthy brick I got:
# file: gluster/brick3/glusterbrick/test/data/BC/storage.log
trusted.afr.dirty=0sAAAAAAAAAAAAAAAA
trusted.afr.test-client-0=0sAAAABAAAAAMAAAAA
trusted.bit-rot.version=0sAgAAAAAAAABc+P64AAEhGQ==
trusted.gfid=0sag3y6RuoTgqAw//fx3ZB1Q==
trusted.gfid2path.273f2255a25b2961="bd910b86-d51a-4006-a2c4-515ef5f1777a/storage.log"
trusted.glusterfs.mdata=0sAQAAAAAAAAAAAAAAAFz5AJEAAAAAMqdgMwAAAABcRwJEAAAAAAAAAAAAAAAAXPkAkQAAAAAAAAAA
trusted.pgfid.bd910b86-d51a-4006-a2c4-515ef5f1777a=0sAAAAAQ==
trusted.start_time="1559822481"
trusted.worm_file=0sMQA=

After restarting the faulty brick a heal was triggered and afterwards the file on the faulty brick is heal.It should be ensured that the broken file gets all necessary xattr. What is the better way? Triggering a setxattr with loc in worm_create_cbk or do a heal afterwards?

--- Additional comment from Amar Tumballi on 2019-06-07 08:35:33 UTC ---


1. permission denied is mostly probably a issue of missing permission (uid 2000, trying to create an entry in a directory with 755, owned by uid-0 (root)).


2. I think it is better to leave it to heal. If it is a create failure, we should anyways fail the operation is my opinion.

--- Additional comment from  on 2019-06-07 09:57:15 UTC ---

Allrigth, I will stress the system for a while and if everything is stable I will commit the patch to gerrit

--- Additional comment from Amar Tumballi on 2019-06-18 13:38:51 UTC ---

Looks like we need to check for 'op_ret' in most of the places in WORM code.

--- Additional comment from Worker Ant on 2019-06-19 11:10:56 UTC ---

REVIEW: https://review.gluster.org/22898 (WORM-Xlator: Avoid performing fsetxattr if fd is NULL) posted (#1) for review on master by David Spisla

--- Additional comment from  on 2019-06-19 11:26:35 UTC ---

I have send a patch to gerrit.
@Amar if there is any other place in the WORM Xlator whcih can cause an segfault please tell me. I will swrite some patches soon. At the moment the worm_create_cbk is the only one callback function in this xlator

--- Additional comment from Worker Ant on 2019-06-21 04:21:26 UTC ---

REVIEW: https://review.gluster.org/22898 (WORM-Xlator: Avoid performing fsetxattr if fd is NULL) merged (#5) on master by Amar Tumballi

Comment 1 Yaniv Kaul 2019-07-16 09:22:04 UTC
Above patch is merged to master. Do you plan to backport it?

Comment 2 david.spisla 2019-07-16 10:52:47 UTC
I think it should be backported to v6.x

Comment 3 david.spisla 2019-07-16 11:24:16 UTC
https://review.gluster.org/#/c/glusterfs/+/22916/

Comment 4 Sunny Kumar 2020-02-10 18:41:39 UTC
Patch is merged closing this bug now.

Comment 5 Sunny Kumar 2020-02-11 14:12:13 UTC
Sorry patch is in Abandoned state and there is no activity hence closed the bug.