Bug 1360125 - Tiering related core observed with "uuid_is_null () message".
Summary: Tiering related core observed with "uuid_is_null () message".
Keywords:
Status: CLOSED CURRENTRELEASE
Alias: None
Product: GlusterFS
Classification: Community
Component: tiering
Version: 3.7.13
Hardware: x86_64
OS: Linux
unspecified
medium
Target Milestone: ---
Assignee: Nithya Balachandran
QA Contact: bugs@gluster.org
URL:
Whiteboard:
Depends On: 1332542 1358196
Blocks: 1360122
TreeView+ depends on / blocked
 
Reported: 2016-07-26 04:58 UTC by Nithya Balachandran
Modified: 2016-08-02 07:25 UTC (History)
11 users (show)

Fixed In Version: glusterfs-3.7.14
Doc Type: If docs needed, set a value
Doc Text:
Clone Of: 1358196
Environment:
Last Closed: 2016-08-02 07:25:06 UTC
Regression: ---
Mount Type: ---
Documentation: ---
CRM:
Verified Versions:


Attachments (Terms of Use)

Description Nithya Balachandran 2016-07-26 04:58:28 UTC
+++ This bug was initially created as a clone of Bug #1358196 +++

+++ This bug was initially created as a clone of Bug #1332542 +++

Description of problem:

Tiering related core observed with "uuid_is_null () message".

Version-Release number of selected component (if applicable):

glusterfs-3.7.9-2

How reproducible:

Once

Steps to Reproduce:

Not sure of the exact steps which generated this core but this was observed when tiering related cases were executed.

bt as below:

#0  0x00007fa21e1543fc in uuid_is_null () from /lib64/libuuid.so.1
#1  0x00007fa21070fc09 in ctr_delete_hard_link_from_db.isra.1.constprop.4 ()
   from /usr/lib64/glusterfs/3.7.9/xlator/features/changetimerecorder.so
#2  0x00007fa210716bd1 in ctr_rename_cbk ()
   from /usr/lib64/glusterfs/3.7.9/xlator/features/changetimerecorder.so
#3  0x00007fa21092cccd in trash_common_rename_cbk ()
   from /usr/lib64/glusterfs/3.7.9/xlator/features/trash.so
#4  0x00007fa211169024 in posix_rename ()
   from /usr/lib64/glusterfs/3.7.9/xlator/storage/posix.so
#5  0x00007fa210934c27 in trash_rename ()
   from /usr/lib64/glusterfs/3.7.9/xlator/features/trash.so
#6  0x00007fa21071219d in ctr_rename ()
   from /usr/lib64/glusterfs/3.7.9/xlator/features/changetimerecorder.so
#7  0x00007fa21003449e in changelog_rename ()
   from /usr/lib64/glusterfs/3.7.9/xlator/features/changelog.so
#8  0x00007fa21e9cbb1a in default_rename () from /lib64/libglusterfs.so.0
#9  0x00007fa20b9cd4fa in ?? ()
   from /usr/lib64/glusterfs/3.7.9/xlator/features/access-control.so
#10 0x00007fa21ea3ef90 in graphyylex_destroy () from /lib64/libglusterfs.so.0
#11 0x00007fa21c4d7a54 in ?? ()
---Type <return> to continue, or q <return> to quit---
#12 0x00007fa21e9fa009 in __gf_calloc () from /lib64/libglusterfs.so.0
#13 0x00007fa20c011bd0 in ?? ()
#14 0x00007fa21c4d7a54 in ?? ()
#15 0x00007fa21e9cbb1a in default_rename () from /lib64/libglusterfs.so.0
#16 0x00007fa20b5a0dd8 in up_rename ()
   from /usr/lib64/glusterfs/3.7.9/xlator/features/upcall.so
#17 0x00007fa21e9d8542 in default_rename_resume () from /lib64/libglusterfs.so.0
#18 0x00007fa21e9f73cd in call_resume () from /lib64/libglusterfs.so.0
#19 0x00007fa20c054c70 in ?? ()
#20 0x00007fa20c054c98 in ?? ()
#21 0x00007fa1ea4eae70 in ?? ()
#22 0x00007fa20c054c98 in ?? ()
#23 0x00007fa1ea4eae70 in ?? ()
#24 0x00007fa20b393363 in iot_worker ()
   from /usr/lib64/glusterfs/3.7.9/xlator/performance/io-threads.so
#25 0x00007fa21d82fdc5 in start_thread () from /lib64/libpthread.so.0
#26 0x00007fa21d1761cd in clone () from /lib64/libc.so.6

Actual results:

Tiering related core observed with "uuid_is_null () message".

Expected results:

there should not be any core generated.

Additional info:

Dont have the exact steps to reproduce, however filing this bug so that we don't miss this issue.



--- Additional comment from Joseph Elwin Fernandes on 2016-05-09 01:08:01 EDT ---

As we dont have exact steps to reproduce, We dont know the cause of a NULL uuid or GFID. But yes handling the exception properly is important. Will send a patch with the necessary exception handling in ctr code.

Is there any chance of getting access to the brick logs?

--- Additional comment from Shashank Raj on 2016-05-09 12:24:03 EDT ---

Since i was not sure of the exact scenario which generated it and unfortunately there were lot many tests done on the same setup from the time the core got generated, i don't think i will be able to provide the exact brick logs.

but will surely keep a check on this issue and update the bugzilla with details if i hit it again.


--- Additional comment from Nithya Balachandran on 2016-07-20 02:55:33 EDT ---

RCA:

All the cores have the same backtrace


#root>gdb -ex 'set sysroot ./' -ex 'set solib-search-path ./'  -ex 'core-file ../core.18086'  ./usr/sbin/glusterfs

...

Program terminated with signal 11, Segmentation fault.
#0  0x00007fce4e4323fc in ?? ()
(gdb) bt
#0  0x00007fce4e4323fc in ?? ()
#1  0x00007fce409edc09 in gf_uuid_is_null (uuid=0x10 <Address 0x10 out of bounds>) at ../../../../libglusterfs/src/compat-uuid.h:45
#2  ctr_delete_hard_link_from_db (gfid=gfid@entry=0x10 <Address 0x10 out of bounds>, pargfid=pargfid@entry=0x20 <Address 0x20 out of bounds>, 
    basename=basename@entry=0x40 <Address 0x40 out of bounds>, fop_path=GFDB_FOP_UNDEL, this=0x7fce3c00a8e0, this=0x7fce3c00a8e0, fop_type=9)
    at ctr-helper.h:634
#3  0x00007fce409f4bd1 in ctr_rename_cbk (frame=0x7fce4c7b36b8, cookie=<optimized out>, this=0x7fce3c00a8e0, op_ret=0, op_errno=0, buf=0x7fce202b2760, 
    preoldparent=0x7fce202b27d0, postoldparent=0x7fce202b2840, prenewparent=0x7fce202b28b0, postnewparent=0x7fce202b2920, xdata=0x7fce4ef80a58)
    at changetimerecorder.c:860
#4  0x00007fce40c0accd in trash_common_rename_cbk (frame=0x7fce4c7b289c, cookie=<optimized out>, this=<optimized out>, op_ret=0, op_errno=0, 
    buf=0x7fce202b2760, preoldparent=0x7fce202b27d0, postoldparent=0x7fce202b2840, prenewparent=0x7fce202b28b0, postnewparent=0x7fce202b2920, 
    xdata=0x7fce4ef80a58) at trash.c:544
#5  0x00007fce41447024 in posix_rename (frame=0x7fce4c7b2aa0, this=<optimized out>, oldloc=<optimized out>, newloc=<optimized out>, xdata=<optimized out>)
    at posix.c:2340
#6  0x00007fce40c12c27 in trash_rename (frame=0x7fce4c7b289c, this=0x7fce3c009310, oldloc=0x7fce4c25ba30, newloc=0x7fce4c25ba70, xdata=0x7fce4ef80684)
    at trash.c:1900
#7  0x00007fce409f019d in ctr_rename (frame=0x7fce4c7b36b8, this=0x7fce3c00a8e0, oldloc=0x7fce4c25ba30, newloc=0x7fce4c25ba70, xdata=0x7fce4ef80684)
    at changetimerecorder.c:984
#8  0x00007fce4031249e in changelog_rename (frame=0x7fce4c7b32b0, this=0x7fce3c00d410, oldloc=0x7fce4c25ba30, newloc=0x7fce4c25ba70, xdata=0x7fce4ef80684)
    at changelog.c:440
#9  0x00007fce4eca9b1a in ?? ()
#10 0x00007fce4c7b3a14 in ?? ()
#11 0x00007fce4ef80684 in ?? ()
#12 0x00007fce4c7b3a14 in ?? ()
#13 0x00007fce4c25ba30 in ?? ()
#14 0x00007fce4c25ba70 in ?? ()
#15 0x00007fce3c010820 in ?? ()
#16 0x00007fce4ef80684 in ?? ()
#17 0x00007fce3bbe34fa in posix_acl_rename (frame=0x7fce4c7b32b0, this=0x7fce4c25ba70, old=0x7fce3c00f3e0, new=0x7fce4c25ba30, xdata=0x7fce3c00f3e0)
    at posix-acl.c:1531




(gdb) f 6
#6  0x00007fce40c12c27 in trash_rename (frame=0x7fce4c7b289c, this=0x7fce3c009310, oldloc=0x7fce4c25ba30, newloc=0x7fce4c25ba70, xdata=0x7fce4ef80684)
    at trash.c:1900
1900	                STACK_WIND (frame, trash_common_rename_cbk, FIRST_CHILD(this),
(gdb) p *oldloc
$1 = {path = 0x7fce10029cf0 "/tmp/RNM13/dir1", name = 0x7fce10029cfb "dir1", inode = 0x7fce205ee1f4, parent = 0x7fce2055148c, 
  gfid = "\036zd\216MwG\037\252\242j\204\233\003D\211", pargfid = "80\311\321I\002N\376\260\263WzB\352x\255"}
(gdb) p *newloc
$2 = {path = 0x7fce10029550 "/tmp/RNM13/dir2", name = 0x7fce1002955b "dir2", inode = 0x7fce20597224, parent = 0x7fce2055148c, 
  gfid = '\000' <repeats 15 times>, pargfid = "80\311\321I\002N\376\260\263WzB\352x\255"}



(gdb) f 3
#3  0x00007fce409f4bd1 in ctr_rename_cbk (frame=0x7fce4c7b36b8, cookie=<optimized out>, this=0x7fce3c00a8e0, op_ret=0, op_errno=0, buf=0x7fce202b2760, 
    preoldparent=0x7fce202b27d0, postoldparent=0x7fce202b2840, prenewparent=0x7fce202b28b0, postnewparent=0x7fce202b2920, xdata=0x7fce4ef80a58)
    at changetimerecorder.c:860
860	        ret = ctr_delete_hard_link_from_db (this,
(gdb) l
855	                        CTR_MSG_INSERT_RENAME_UNWIND_FAILED,
856	                        "Invalid link count from posix");
857	                goto out;
858	        }
859	
860	        ret = ctr_delete_hard_link_from_db (this,
861	                                    CTR_DB_REC(ctr_local).old_gfid,
862	                                    CTR_DB_REC(ctr_local).pargfid,
863	                                    CTR_DB_REC(ctr_local).file_name,
864	                                    fop_type, fop_path);
(gdb) p ctr_local
$3 = (gf_ctr_local_t *) 0x0
(gdb) 

860	        ret = ctr_delete_hard_link_from_db (this,
861	                                    CTR_DB_REC(ctr_local).old_gfid,
862	                                    CTR_DB_REC(ctr_local).pargfid,
863	                                    CTR_DB_REC(ctr_local).file_name,
864	                                    fop_type, fop_path);



#define CTR_DB_REC(ctr_local)\
        (ctr_local->gfdb_db_record)



From the oldloc and newloc values, this seems to be a directory rename.
ctr_rename () calls 
ret = ctr_insert_wind(frame, this, _inode_cx); 

static inline int
ctr_insert_wind (call_frame_t                    *frame,
                xlator_t                        *this,
                gf_ctr_inode_context_t          *ctr_inode_cx)
{

...

 if (_priv->ctr_record_wind && ctr_inode_cx->ia_type != IA_IFDIR) {
                frame->local = init_ctr_local_t (this);

--> inits local only for files.

The process crashes in the ctr_rename_cbk, when it tries to access ctr_local which is NULL.

--- Additional comment from Nithya Balachandran on 2016-07-20 05:29:20 EDT ---

From ctr_rename_cbk:


832	        ret = dict_get_uint32 (xdata , GF_RESPONSE_LINK_COUNT_XDATA,
833	                                &remaining_links);
834	        if (ret) {
835	                gf_msg (this->name, GF_LOG_ERROR, 0,
836	                        CTR_MSG_GET_CTR_RESPONSE_LINK_COUNT_XDATA_FAILED,
(gdb) 
837	                        "Failed to getting GF_RESPONSE_LINK_COUNT_XDATA");
838	                remaining_links = -1;
839	                goto out;
840	        }
841	
842	        ctr_local = frame->local;

(gdb) p remaining_links
$24 = 2



posix_rename sets GF_RESPONSE_LINK_COUNT_XDATA only if GF_REQUEST_LINK_COUNT_XDATA is set in xdata.

#6  0x00007fce40c12c27 in trash_rename (frame=0x7fce4c7b289c, this=0x7fce3c009310, oldloc=0x7fce4c25ba30, newloc=0x7fce4c25ba70, xdata=0x7fce4ef80684)
    at trash.c:1900


(gdb) p* (((dict_t*)0x00007fce4ef80684)->members_internal)
$17 = {hash_next = 0x7fce4ef806c4, prev = 0x0, next = 0x7fce4ef806c4, value = 0x7fce4c086f70, key = 0x7fce3c1fb230 "gf_request_link_count"}
(gdb) p* (((dict_t*)0x00007fce4ef80684)->members_internal->next)
$18 = {hash_next = 0x0, prev = 0x7fce4c14c408, next = 0x0, value = 0x7fce4c092d5c, key = 0x7fce3c1f1e80 "glusterfs.inodelk-dom-count"}
(gdb) p* (((dict_t*)0x00007fce4ef80684)->members_internal->value)
$19 = {is_static = 0 '\000', is_const = 0 '\000', is_stdalloc = 0 '\000', len = 2, data = 0x7fce3c11afc0 "1", refcount = 1, lock = 1}
(gdb) bt




The key "gf_request_link_count" is set in the xdata passed to posix_rename. However this should not have happened as the ctr_rename () function should have 
failed in:
ret = update_hard_link_ctx (frame, this, oldloc->inode);

because ctr_local is NULL, causing it to skip the code to set this key.


I tried to reproduce the crash using 
mv -T dir1 dir2  

where both dir1 and dir2 exist and are empty. posix_rename consistently failed to set the GF_RESPONSE_LINK_COUNT_XDATA key value in the dict so ctr_rename_cbk always exits at line 838 in the code snippet above. I was unable to reproduce the crash.

However the original bt in the description lists graphyylex_destroy in the stack. This looks like a graph change occurred but without the logs there is no way to confirm this.

--- Additional comment from Vijay Bellur on 2016-07-20 05:41:36 EDT ---

REVIEW: http://review.gluster.org/14964 (features/ctr: Check for NULL local) posted (#1) for review on master by N Balachandran (nbalacha@redhat.com)

--- Additional comment from Vijay Bellur on 2016-07-20 11:22:47 EDT ---

REVIEW: http://review.gluster.org/14964 (features/ctr: Check for NULL local) posted (#2) for review on master by N Balachandran (nbalacha@redhat.com)

--- Additional comment from Vijay Bellur on 2016-07-25 10:53:58 EDT ---

REVIEW: http://review.gluster.org/14964 (features/ctr: Check for NULL local) posted (#3) for review on master by N Balachandran (nbalacha@redhat.com)

--- Additional comment from Vijay Bellur on 2016-07-25 16:14:13 EDT ---

COMMIT: http://review.gluster.org/14964 committed in master by Vijay Bellur (vbellur@redhat.com) 
------
commit fc2a0f5eb8caae4f4fd0c023619954ed4b6ca760
Author: N Balachandran <nbalacha@redhat.com>
Date:   Wed Jul 20 15:10:05 2016 +0530

    features/ctr: Check for NULL local
    
    This is a defensive fix to prevent a crash reported
    during a rename operation. This is not reproducible
    under normal circumstances.
    
    This patch also moves ctr-messages.h to the src dir
    of the changetimerecorder xlator.
    
    Change-Id: I46eb926d67bf4c19387c8b26e354c635a5fb284c
    BUG: 1358196
    Signed-off-by: N Balachandran <nbalacha@redhat.com>
    Reviewed-on: http://review.gluster.org/14964
    NetBSD-regression: NetBSD Build System <jenkins@build.gluster.org>
    Tested-by: Zhou Zhengping <johnzzpcrystal@gmail.com>
    Smoke: Gluster Build System <jenkins@build.gluster.org>
    CentOS-regression: Gluster Build System <jenkins@build.gluster.org>
    Reviewed-by: Vijay Bellur <vbellur@redhat.com>

Comment 1 Vijay Bellur 2016-07-26 05:59:41 UTC
REVIEW: http://review.gluster.org/15007 (features/ctr: Check for NULL local) posted (#1) for review on release-3.7 by N Balachandran (nbalacha@redhat.com)

Comment 2 Vijay Bellur 2016-07-27 12:58:31 UTC
COMMIT: http://review.gluster.org/15007 committed in release-3.7 by Dan Lambright (dlambrig@redhat.com) 
------
commit bda4a2be0bba608d61bd04815e92b2d4cc63e7f2
Author: N Balachandran <nbalacha@redhat.com>
Date:   Tue Jul 26 11:27:16 2016 +0530

    features/ctr: Check for NULL local
    
    This is a defensive fix to prevent a crash reported
    during a rename operation. This is not reproducible
    under normal circumstances.
    
    This patch also moves ctr-messages.h to the src dir
    of the changetimerecorder xlator.
    
    Backported from master: http://review.gluster.org/#/c/14964/
    
    Change-Id: I2aac2d4da5752f6a0b45a70e0d97a4d506532ff4
    BUG: 1360125
    Signed-off-by: N Balachandran <nbalacha@redhat.com>
    Reviewed-on: http://review.gluster.org/15007
    Smoke: Gluster Build System <jenkins@build.gluster.org>
    NetBSD-regression: NetBSD Build System <jenkins@build.gluster.org>
    CentOS-regression: Gluster Build System <jenkins@build.gluster.org>
    Reviewed-by: Milind Changire <mchangir@redhat.com>
    Reviewed-by: Dan Lambright <dlambrig@redhat.com>

Comment 3 Kaushal 2016-08-02 07:25:06 UTC
This bug is getting closed because a release has been made available that should address the reported issue. In case the problem is still not fixed with glusterfs-3.7.14, please open a new bug report.

glusterfs-3.7.14 has been announced on the Gluster mailinglists [1], packages for several distributions should become available in the near future. Keep an eye on the Gluster Users mailinglist [2] and the update infrastructure for your distribution.

[1] https://www.gluster.org/pipermail/gluster-devel/2016-August/050319.html
[2] http://thread.gmane.org/gmane.comp.file-systems.gluster.user


Note You need to log in before you can comment on or make changes to this bug.