+++ This bug was initially created as a clone of Bug #1358196 +++ +++ This bug was initially created as a clone of Bug #1332542 +++ Description of problem: Tiering related core observed with "uuid_is_null () message". Version-Release number of selected component (if applicable): glusterfs-3.7.9-2 How reproducible: Once Steps to Reproduce: Not sure of the exact steps which generated this core but this was observed when tiering related cases were executed. bt as below: #0 0x00007fa21e1543fc in uuid_is_null () from /lib64/libuuid.so.1 #1 0x00007fa21070fc09 in ctr_delete_hard_link_from_db.isra.1.constprop.4 () from /usr/lib64/glusterfs/3.7.9/xlator/features/changetimerecorder.so #2 0x00007fa210716bd1 in ctr_rename_cbk () from /usr/lib64/glusterfs/3.7.9/xlator/features/changetimerecorder.so #3 0x00007fa21092cccd in trash_common_rename_cbk () from /usr/lib64/glusterfs/3.7.9/xlator/features/trash.so #4 0x00007fa211169024 in posix_rename () from /usr/lib64/glusterfs/3.7.9/xlator/storage/posix.so #5 0x00007fa210934c27 in trash_rename () from /usr/lib64/glusterfs/3.7.9/xlator/features/trash.so #6 0x00007fa21071219d in ctr_rename () from /usr/lib64/glusterfs/3.7.9/xlator/features/changetimerecorder.so #7 0x00007fa21003449e in changelog_rename () from /usr/lib64/glusterfs/3.7.9/xlator/features/changelog.so #8 0x00007fa21e9cbb1a in default_rename () from /lib64/libglusterfs.so.0 #9 0x00007fa20b9cd4fa in ?? () from /usr/lib64/glusterfs/3.7.9/xlator/features/access-control.so #10 0x00007fa21ea3ef90 in graphyylex_destroy () from /lib64/libglusterfs.so.0 #11 0x00007fa21c4d7a54 in ?? () ---Type <return> to continue, or q <return> to quit--- #12 0x00007fa21e9fa009 in __gf_calloc () from /lib64/libglusterfs.so.0 #13 0x00007fa20c011bd0 in ?? () #14 0x00007fa21c4d7a54 in ?? () #15 0x00007fa21e9cbb1a in default_rename () from /lib64/libglusterfs.so.0 #16 0x00007fa20b5a0dd8 in up_rename () from /usr/lib64/glusterfs/3.7.9/xlator/features/upcall.so #17 0x00007fa21e9d8542 in default_rename_resume () from /lib64/libglusterfs.so.0 #18 0x00007fa21e9f73cd in call_resume () from /lib64/libglusterfs.so.0 #19 0x00007fa20c054c70 in ?? () #20 0x00007fa20c054c98 in ?? () #21 0x00007fa1ea4eae70 in ?? () #22 0x00007fa20c054c98 in ?? () #23 0x00007fa1ea4eae70 in ?? () #24 0x00007fa20b393363 in iot_worker () from /usr/lib64/glusterfs/3.7.9/xlator/performance/io-threads.so #25 0x00007fa21d82fdc5 in start_thread () from /lib64/libpthread.so.0 #26 0x00007fa21d1761cd in clone () from /lib64/libc.so.6 Actual results: Tiering related core observed with "uuid_is_null () message". Expected results: there should not be any core generated. Additional info: Dont have the exact steps to reproduce, however filing this bug so that we don't miss this issue. --- Additional comment from Joseph Elwin Fernandes on 2016-05-09 01:08:01 EDT --- As we dont have exact steps to reproduce, We dont know the cause of a NULL uuid or GFID. But yes handling the exception properly is important. Will send a patch with the necessary exception handling in ctr code. Is there any chance of getting access to the brick logs? --- Additional comment from Shashank Raj on 2016-05-09 12:24:03 EDT --- Since i was not sure of the exact scenario which generated it and unfortunately there were lot many tests done on the same setup from the time the core got generated, i don't think i will be able to provide the exact brick logs. but will surely keep a check on this issue and update the bugzilla with details if i hit it again. --- Additional comment from Nithya Balachandran on 2016-07-20 02:55:33 EDT --- RCA: All the cores have the same backtrace #root>gdb -ex 'set sysroot ./' -ex 'set solib-search-path ./' -ex 'core-file ../core.18086' ./usr/sbin/glusterfs ... Program terminated with signal 11, Segmentation fault. #0 0x00007fce4e4323fc in ?? () (gdb) bt #0 0x00007fce4e4323fc in ?? () #1 0x00007fce409edc09 in gf_uuid_is_null (uuid=0x10 <Address 0x10 out of bounds>) at ../../../../libglusterfs/src/compat-uuid.h:45 #2 ctr_delete_hard_link_from_db (gfid=gfid@entry=0x10 <Address 0x10 out of bounds>, pargfid=pargfid@entry=0x20 <Address 0x20 out of bounds>, basename=basename@entry=0x40 <Address 0x40 out of bounds>, fop_path=GFDB_FOP_UNDEL, this=0x7fce3c00a8e0, this=0x7fce3c00a8e0, fop_type=9) at ctr-helper.h:634 #3 0x00007fce409f4bd1 in ctr_rename_cbk (frame=0x7fce4c7b36b8, cookie=<optimized out>, this=0x7fce3c00a8e0, op_ret=0, op_errno=0, buf=0x7fce202b2760, preoldparent=0x7fce202b27d0, postoldparent=0x7fce202b2840, prenewparent=0x7fce202b28b0, postnewparent=0x7fce202b2920, xdata=0x7fce4ef80a58) at changetimerecorder.c:860 #4 0x00007fce40c0accd in trash_common_rename_cbk (frame=0x7fce4c7b289c, cookie=<optimized out>, this=<optimized out>, op_ret=0, op_errno=0, buf=0x7fce202b2760, preoldparent=0x7fce202b27d0, postoldparent=0x7fce202b2840, prenewparent=0x7fce202b28b0, postnewparent=0x7fce202b2920, xdata=0x7fce4ef80a58) at trash.c:544 #5 0x00007fce41447024 in posix_rename (frame=0x7fce4c7b2aa0, this=<optimized out>, oldloc=<optimized out>, newloc=<optimized out>, xdata=<optimized out>) at posix.c:2340 #6 0x00007fce40c12c27 in trash_rename (frame=0x7fce4c7b289c, this=0x7fce3c009310, oldloc=0x7fce4c25ba30, newloc=0x7fce4c25ba70, xdata=0x7fce4ef80684) at trash.c:1900 #7 0x00007fce409f019d in ctr_rename (frame=0x7fce4c7b36b8, this=0x7fce3c00a8e0, oldloc=0x7fce4c25ba30, newloc=0x7fce4c25ba70, xdata=0x7fce4ef80684) at changetimerecorder.c:984 #8 0x00007fce4031249e in changelog_rename (frame=0x7fce4c7b32b0, this=0x7fce3c00d410, oldloc=0x7fce4c25ba30, newloc=0x7fce4c25ba70, xdata=0x7fce4ef80684) at changelog.c:440 #9 0x00007fce4eca9b1a in ?? () #10 0x00007fce4c7b3a14 in ?? () #11 0x00007fce4ef80684 in ?? () #12 0x00007fce4c7b3a14 in ?? () #13 0x00007fce4c25ba30 in ?? () #14 0x00007fce4c25ba70 in ?? () #15 0x00007fce3c010820 in ?? () #16 0x00007fce4ef80684 in ?? () #17 0x00007fce3bbe34fa in posix_acl_rename (frame=0x7fce4c7b32b0, this=0x7fce4c25ba70, old=0x7fce3c00f3e0, new=0x7fce4c25ba30, xdata=0x7fce3c00f3e0) at posix-acl.c:1531 (gdb) f 6 #6 0x00007fce40c12c27 in trash_rename (frame=0x7fce4c7b289c, this=0x7fce3c009310, oldloc=0x7fce4c25ba30, newloc=0x7fce4c25ba70, xdata=0x7fce4ef80684) at trash.c:1900 1900 STACK_WIND (frame, trash_common_rename_cbk, FIRST_CHILD(this), (gdb) p *oldloc $1 = {path = 0x7fce10029cf0 "/tmp/RNM13/dir1", name = 0x7fce10029cfb "dir1", inode = 0x7fce205ee1f4, parent = 0x7fce2055148c, gfid = "\036zd\216MwG\037\252\242j\204\233\003D\211", pargfid = "80\311\321I\002N\376\260\263WzB\352x\255"} (gdb) p *newloc $2 = {path = 0x7fce10029550 "/tmp/RNM13/dir2", name = 0x7fce1002955b "dir2", inode = 0x7fce20597224, parent = 0x7fce2055148c, gfid = '\000' <repeats 15 times>, pargfid = "80\311\321I\002N\376\260\263WzB\352x\255"} (gdb) f 3 #3 0x00007fce409f4bd1 in ctr_rename_cbk (frame=0x7fce4c7b36b8, cookie=<optimized out>, this=0x7fce3c00a8e0, op_ret=0, op_errno=0, buf=0x7fce202b2760, preoldparent=0x7fce202b27d0, postoldparent=0x7fce202b2840, prenewparent=0x7fce202b28b0, postnewparent=0x7fce202b2920, xdata=0x7fce4ef80a58) at changetimerecorder.c:860 860 ret = ctr_delete_hard_link_from_db (this, (gdb) l 855 CTR_MSG_INSERT_RENAME_UNWIND_FAILED, 856 "Invalid link count from posix"); 857 goto out; 858 } 859 860 ret = ctr_delete_hard_link_from_db (this, 861 CTR_DB_REC(ctr_local).old_gfid, 862 CTR_DB_REC(ctr_local).pargfid, 863 CTR_DB_REC(ctr_local).file_name, 864 fop_type, fop_path); (gdb) p ctr_local $3 = (gf_ctr_local_t *) 0x0 (gdb) 860 ret = ctr_delete_hard_link_from_db (this, 861 CTR_DB_REC(ctr_local).old_gfid, 862 CTR_DB_REC(ctr_local).pargfid, 863 CTR_DB_REC(ctr_local).file_name, 864 fop_type, fop_path); #define CTR_DB_REC(ctr_local)\ (ctr_local->gfdb_db_record) From the oldloc and newloc values, this seems to be a directory rename. ctr_rename () calls ret = ctr_insert_wind(frame, this, _inode_cx); static inline int ctr_insert_wind (call_frame_t *frame, xlator_t *this, gf_ctr_inode_context_t *ctr_inode_cx) { ... if (_priv->ctr_record_wind && ctr_inode_cx->ia_type != IA_IFDIR) { frame->local = init_ctr_local_t (this); --> inits local only for files. The process crashes in the ctr_rename_cbk, when it tries to access ctr_local which is NULL. --- Additional comment from Nithya Balachandran on 2016-07-20 05:29:20 EDT --- From ctr_rename_cbk: 832 ret = dict_get_uint32 (xdata , GF_RESPONSE_LINK_COUNT_XDATA, 833 &remaining_links); 834 if (ret) { 835 gf_msg (this->name, GF_LOG_ERROR, 0, 836 CTR_MSG_GET_CTR_RESPONSE_LINK_COUNT_XDATA_FAILED, (gdb) 837 "Failed to getting GF_RESPONSE_LINK_COUNT_XDATA"); 838 remaining_links = -1; 839 goto out; 840 } 841 842 ctr_local = frame->local; (gdb) p remaining_links $24 = 2 posix_rename sets GF_RESPONSE_LINK_COUNT_XDATA only if GF_REQUEST_LINK_COUNT_XDATA is set in xdata. #6 0x00007fce40c12c27 in trash_rename (frame=0x7fce4c7b289c, this=0x7fce3c009310, oldloc=0x7fce4c25ba30, newloc=0x7fce4c25ba70, xdata=0x7fce4ef80684) at trash.c:1900 (gdb) p* (((dict_t*)0x00007fce4ef80684)->members_internal) $17 = {hash_next = 0x7fce4ef806c4, prev = 0x0, next = 0x7fce4ef806c4, value = 0x7fce4c086f70, key = 0x7fce3c1fb230 "gf_request_link_count"} (gdb) p* (((dict_t*)0x00007fce4ef80684)->members_internal->next) $18 = {hash_next = 0x0, prev = 0x7fce4c14c408, next = 0x0, value = 0x7fce4c092d5c, key = 0x7fce3c1f1e80 "glusterfs.inodelk-dom-count"} (gdb) p* (((dict_t*)0x00007fce4ef80684)->members_internal->value) $19 = {is_static = 0 '\000', is_const = 0 '\000', is_stdalloc = 0 '\000', len = 2, data = 0x7fce3c11afc0 "1", refcount = 1, lock = 1} (gdb) bt The key "gf_request_link_count" is set in the xdata passed to posix_rename. However this should not have happened as the ctr_rename () function should have failed in: ret = update_hard_link_ctx (frame, this, oldloc->inode); because ctr_local is NULL, causing it to skip the code to set this key. I tried to reproduce the crash using mv -T dir1 dir2 where both dir1 and dir2 exist and are empty. posix_rename consistently failed to set the GF_RESPONSE_LINK_COUNT_XDATA key value in the dict so ctr_rename_cbk always exits at line 838 in the code snippet above. I was unable to reproduce the crash. However the original bt in the description lists graphyylex_destroy in the stack. This looks like a graph change occurred but without the logs there is no way to confirm this. --- Additional comment from Vijay Bellur on 2016-07-20 05:41:36 EDT --- REVIEW: http://review.gluster.org/14964 (features/ctr: Check for NULL local) posted (#1) for review on master by N Balachandran (nbalacha) --- Additional comment from Vijay Bellur on 2016-07-20 11:22:47 EDT --- REVIEW: http://review.gluster.org/14964 (features/ctr: Check for NULL local) posted (#2) for review on master by N Balachandran (nbalacha) --- Additional comment from Vijay Bellur on 2016-07-25 10:53:58 EDT --- REVIEW: http://review.gluster.org/14964 (features/ctr: Check for NULL local) posted (#3) for review on master by N Balachandran (nbalacha) --- Additional comment from Vijay Bellur on 2016-07-25 16:14:13 EDT --- COMMIT: http://review.gluster.org/14964 committed in master by Vijay Bellur (vbellur) ------ commit fc2a0f5eb8caae4f4fd0c023619954ed4b6ca760 Author: N Balachandran <nbalacha> Date: Wed Jul 20 15:10:05 2016 +0530 features/ctr: Check for NULL local This is a defensive fix to prevent a crash reported during a rename operation. This is not reproducible under normal circumstances. This patch also moves ctr-messages.h to the src dir of the changetimerecorder xlator. Change-Id: I46eb926d67bf4c19387c8b26e354c635a5fb284c BUG: 1358196 Signed-off-by: N Balachandran <nbalacha> Reviewed-on: http://review.gluster.org/14964 NetBSD-regression: NetBSD Build System <jenkins.org> Tested-by: Zhou Zhengping <johnzzpcrystal> Smoke: Gluster Build System <jenkins.org> CentOS-regression: Gluster Build System <jenkins.org> Reviewed-by: Vijay Bellur <vbellur>
REVIEW: http://review.gluster.org/15007 (features/ctr: Check for NULL local) posted (#1) for review on release-3.7 by N Balachandran (nbalacha)
COMMIT: http://review.gluster.org/15007 committed in release-3.7 by Dan Lambright (dlambrig) ------ commit bda4a2be0bba608d61bd04815e92b2d4cc63e7f2 Author: N Balachandran <nbalacha> Date: Tue Jul 26 11:27:16 2016 +0530 features/ctr: Check for NULL local This is a defensive fix to prevent a crash reported during a rename operation. This is not reproducible under normal circumstances. This patch also moves ctr-messages.h to the src dir of the changetimerecorder xlator. Backported from master: http://review.gluster.org/#/c/14964/ Change-Id: I2aac2d4da5752f6a0b45a70e0d97a4d506532ff4 BUG: 1360125 Signed-off-by: N Balachandran <nbalacha> Reviewed-on: http://review.gluster.org/15007 Smoke: Gluster Build System <jenkins.org> NetBSD-regression: NetBSD Build System <jenkins.org> CentOS-regression: Gluster Build System <jenkins.org> Reviewed-by: Milind Changire <mchangir> Reviewed-by: Dan Lambright <dlambrig>
This bug is getting closed because a release has been made available that should address the reported issue. In case the problem is still not fixed with glusterfs-3.7.14, please open a new bug report. glusterfs-3.7.14 has been announced on the Gluster mailinglists [1], packages for several distributions should become available in the near future. Keep an eye on the Gluster Users mailinglist [2] and the update infrastructure for your distribution. [1] https://www.gluster.org/pipermail/gluster-devel/2016-August/050319.html [2] http://thread.gmane.org/gmane.comp.file-systems.gluster.user