+++ This bug was initially created as a clone of Bug #1533269 +++

Description of problem:

During a rebalance of a 252-brick volume, as the rebalance is scanning through the initial directories, within 5-10 minutes a seemingly random peer brick process dies, which stops the rebalance processes. The brick logs contain healthy connections and disconnections up until the failure, where the brick process throws a stack trace:

pending frames:
frame : type(0) op(36)
patchset: git://git.gluster.org/glusterfs.git
signal received: 11
time of crash:
2018-01-10 21:13:21
configuration details:
argp 1
backtrace 1
dlfcn 1
libpthread 1
llistxattr 1
setfsid 1
spinlock 1
epoll.h 1
xattr.h 1
st_atim.tv_nsec 1
package-string: glusterfs 3.12.4
/usr/lib/x86_64-linux-gnu/libglusterfs.so.0(_gf_msg_backtrace_nomem+0xaa)[0x7f7000635a5a]
/usr/lib/x86_64-linux-gnu/libglusterfs.so.0(gf_print_trace+0x2e7)[0x7f700063f737]
/lib/x86_64-linux-gnu/libc.so.6(+0x354b0)[0x7f6fffa284b0]
/lib/x86_64-linux-gnu/libpthread.so.0(pthread_mutex_lock+0x4)[0x7f6fffdc6d44]
/usr/lib/x86_64-linux-gnu/libglusterfs.so.0(dict_rename_key+0x66)[0x7f7000630866]
/usr/lib/x86_64-linux-gnu/glusterfs/3.12.4/xlator/features/selinux.so(+0x1f15)[0x7f6ff87a9f15]
/usr/lib/x86_64-linux-gnu/glusterfs/3.12.4/xlator/features/marker.so(+0x11d77)[0x7f6ff8595d77]
/usr/lib/x86_64-linux-gnu/libglusterfs.so.0(default_fsetxattr+0xb5)[0x7f70006a81d5]
/usr/lib/x86_64-linux-gnu/libglusterfs.so.0(default_fsetxattr+0xb5)[0x7f70006a81d5]
/usr/lib/x86_64-linux-gnu/glusterfs/3.12.4/xlator/features/quota.so(+0xe02f)[0x7f6ff3de602f]
/usr/lib/x86_64-linux-gnu/glusterfs/3.12.4/xlator/debug/io-stats.so(+0x72d6)[0x7f6ff3bb12d6]
/usr/lib/x86_64-linux-gnu/libglusterfs.so.0(default_fsetxattr+0xb5)[0x7f70006a81d5]
/usr/lib/x86_64-linux-gnu/glusterfs/3.12.4/xlator/protocol/server.so(+0x2e3be)[0x7f6ff37763be]
/usr/lib/x86_64-linux-gnu/glusterfs/3.12.4/xlator/protocol/server.so(+0xbd19)[0x7f6ff3753d19]
/usr/lib/x86_64-linux-gnu/glusterfs/3.12.4/xlator/protocol/server.so(+0xbdb5)[0x7f6ff3753db5]
/usr/lib/x86_64-linux-gnu/glusterfs/3.12.4/xlator/protocol/server.so(+0xc75c)[0x7f6ff375475c]
/usr/lib/x86_64-linux-gnu/glusterfs/3.12.4/xlator/protocol/server.so(+0xbdfe)[0x7f6ff3753dfe]
/usr/lib/x86_64-linux-gnu/glusterfs/3.12.4/xlator/protocol/server.so(+0xc5fe)[0x7f6ff37545fe]
/usr/lib/x86_64-linux-gnu/glusterfs/3.12.4/xlator/protocol/server.so(+0xc712)[0x7f6ff3754712]
/usr/lib/x86_64-linux-gnu/glusterfs/3.12.4/xlator/protocol/server.so(+0xbdde)[0x7f6ff3753dde]
/usr/lib/x86_64-linux-gnu/glusterfs/3.12.4/xlator/protocol/server.so(+0xc804)[0x7f6ff3754804]
/usr/lib/x86_64-linux-gnu/glusterfs/3.12.4/xlator/protocol/server.so(+0x276ce)[0x7f6ff376f6ce]
/usr/lib/x86_64-linux-gnu/libgfrpc.so.0(rpcsvc_request_handler+0x96)[0x7f70003feca6]
/lib/x86_64-linux-gnu/libpthread.so.0(+0x76ba)[0x7f6fffdc46ba]
/lib/x86_64-linux-gnu/libc.so.6(clone+0x6d)[0x7f6fffafa3dd]
---------

Anywhere from 1 to 5 brick processes on various hosts will all die at the same time.

Version-Release number of selected component (if applicable):
3.12.4

How reproducible:
This happens consistently within the first 10 minutes of a rebalance.

Steps to Reproduce:
- We had an existing gluster volume with about 2PB of data in it. Since our existing gluster configs (3.7.20) were fairly old, we decided to bring down the cluster and rebuild it fresh with the existing data. All gluster 3.7.20 libraries were purged, the .glusterfs directory was deleted from each brick, and glusterd 3.12.4 was installed. All 252 bricks were re-added to the cluster and a fix-layout was performed successfully. However, when a full rebalance is initiated, peer brick processes eventually crash.

Actual results:

Expected results:

Additional info:

--- Additional comment from Atin Mukherjee on 2018-01-11 03:46:53 EST ---

Jiffin, seems to be crashing from selinux.c. Can you please check?
--- Additional comment from Jiffin on 2018-01-11 04:03:02 EST ---

Sure, I will take a look.

--- Additional comment from Worker Ant on 2018-01-17 23:02:01 EST ---

REVIEW: https://review.gluster.org/19220 (selinux-xlator : validate dict before calling dict_rename_key()) posted (#1) for review on master by jiffin tony Thottan

--- Additional comment from Jiffin on 2018-01-17 23:03:17 EST ---

From the core it looks like dict = NULL was passed to fops handled by the selinux xlator, which caused this error. A patch has been posted upstream for review: https://review.gluster.org/19220
REVIEW: https://review.gluster.org/19220 (selinux-xlator : validate dict before calling dict_rename_key()) posted (#2) for review on master by jiffin tony Thottan
COMMIT: https://review.gluster.org/19220 committed in master by "jiffin tony Thottan" <jthottan> with a commit message: selinux-xlator : validate dict before calling dict_rename_key()

Change-Id: I71da3b64e5e8c82e8842e119b2b05da3e2ace550
BUG: 1535772
Signed-off-by: Jiffin Tony Thottan <jthottan>
REVIEW: https://review.gluster.org/19241 (selinux-xlator : validate dict before calling dict_rename_key()) posted (#1) for review on release-3.12 by jiffin tony Thottan
REVISION POSTED: https://review.gluster.org/19241 (selinux-xlator : validate dict before calling dict_rename_key()) posted (#2) for review on release-3.12 by jiffin tony Thottan
This bug is being closed because a release that should address the reported issue has been made available. If the problem is still not fixed with glusterfs-4.0.0, please open a new bug report.

glusterfs-4.0.0 has been announced on the Gluster mailing lists [1], and packages for several distributions should become available in the near future. Keep an eye on the Gluster Users mailing list [2] and the update infrastructure for your distribution.

[1] http://lists.gluster.org/pipermail/announce/2018-March/000092.html
[2] https://www.gluster.org/pipermail/gluster-users/
It looks like this bug still exists in 4.0. See:

https://github.com/gluster/glusterfs/blob/release-4.0/xlators/features/selinux/src/selinux.c#L192

and:

https://github.com/gluster/glusterfs/blob/release-4.0/xlators/features/selinux/src/selinux.c#L150

Excerpt from the original report:

> Should lines 150 and 192 of xlators/features/selinux/src/selinux.c be
>     if (!priv->selinux_enabled || !dict)
> instead of
>     if (!priv->selinux_enabled && !dict)

The bug was reported against Gluster 3.13.2 and is still present in 3.13.x. See the bug report here: https://bugzilla.redhat.com/show_bug.cgi?id=1552228

This bug causes the brick process to SEGFAULT due to a NULL pointer dereference, taking the whole gluster volume down. It is reproducible during a gluster rebalance if selinux is set to a DISABLED state on the gluster node.