Bug 1533269

Summary: Random GlusterFSD process dies during rebalance
Product: [Community] GlusterFS
Component: core
Version: 3.12
Hardware: Unspecified
OS: Linux
Status: CLOSED CURRENTRELEASE
Severity: medium
Priority: unspecified
Reporter: jmackey
Assignee: bugs <bugs>
CC: amukherj, bugs, jmackey, jthottan
Fixed In Version: glusterfs-3.12.6
Doc Type: If docs needed, set a value
Type: Bug
Last Closed: 2018-03-05 07:14:08 UTC
Cloned to: 1535772, 1536294 (view as bug list)
Bug Depends On: 1535772

Description jmackey 2018-01-10 22:38:28 UTC
Description of problem:

During a rebalance of a 252-brick volume, within 5-10 minutes of the rebalance scanning through the initial directories, a seemingly random peer brick process dies, which stops the rebalance processes. The brick logs show healthy connections and disconnections up until the failure, at which point the brick process logs a stack trace:

pending frames:
frame : type(0) op(36)
patchset: git://git.gluster.org/glusterfs.git
signal received: 11
time of crash:
2018-01-10 21:13:21
configuration details:
argp 1
backtrace 1
dlfcn 1
libpthread 1
llistxattr 1
setfsid 1
spinlock 1
epoll.h 1
xattr.h 1
st_atim.tv_nsec 1
package-string: glusterfs 3.12.4
/usr/lib/x86_64-linux-gnu/libglusterfs.so.0(_gf_msg_backtrace_nomem+0xaa)[0x7f7000635a5a]
/usr/lib/x86_64-linux-gnu/libglusterfs.so.0(gf_print_trace+0x2e7)[0x7f700063f737]
/lib/x86_64-linux-gnu/libc.so.6(+0x354b0)[0x7f6fffa284b0]
/lib/x86_64-linux-gnu/libpthread.so.0(pthread_mutex_lock+0x4)[0x7f6fffdc6d44]
/usr/lib/x86_64-linux-gnu/libglusterfs.so.0(dict_rename_key+0x66)[0x7f7000630866]
/usr/lib/x86_64-linux-gnu/glusterfs/3.12.4/xlator/features/selinux.so(+0x1f15)[0x7f6ff87a9f15]
/usr/lib/x86_64-linux-gnu/glusterfs/3.12.4/xlator/features/marker.so(+0x11d77)[0x7f6ff8595d77]
/usr/lib/x86_64-linux-gnu/libglusterfs.so.0(default_fsetxattr+0xb5)[0x7f70006a81d5]
/usr/lib/x86_64-linux-gnu/libglusterfs.so.0(default_fsetxattr+0xb5)[0x7f70006a81d5]
/usr/lib/x86_64-linux-gnu/glusterfs/3.12.4/xlator/features/quota.so(+0xe02f)[0x7f6ff3de602f]
/usr/lib/x86_64-linux-gnu/glusterfs/3.12.4/xlator/debug/io-stats.so(+0x72d6)[0x7f6ff3bb12d6]
/usr/lib/x86_64-linux-gnu/libglusterfs.so.0(default_fsetxattr+0xb5)[0x7f70006a81d5]
/usr/lib/x86_64-linux-gnu/glusterfs/3.12.4/xlator/protocol/server.so(+0x2e3be)[0x7f6ff37763be]
/usr/lib/x86_64-linux-gnu/glusterfs/3.12.4/xlator/protocol/server.so(+0xbd19)[0x7f6ff3753d19]
/usr/lib/x86_64-linux-gnu/glusterfs/3.12.4/xlator/protocol/server.so(+0xbdb5)[0x7f6ff3753db5]
/usr/lib/x86_64-linux-gnu/glusterfs/3.12.4/xlator/protocol/server.so(+0xc75c)[0x7f6ff375475c]
/usr/lib/x86_64-linux-gnu/glusterfs/3.12.4/xlator/protocol/server.so(+0xbdfe)[0x7f6ff3753dfe]
/usr/lib/x86_64-linux-gnu/glusterfs/3.12.4/xlator/protocol/server.so(+0xc5fe)[0x7f6ff37545fe]
/usr/lib/x86_64-linux-gnu/glusterfs/3.12.4/xlator/protocol/server.so(+0xc712)[0x7f6ff3754712]
/usr/lib/x86_64-linux-gnu/glusterfs/3.12.4/xlator/protocol/server.so(+0xbdde)[0x7f6ff3753dde]
/usr/lib/x86_64-linux-gnu/glusterfs/3.12.4/xlator/protocol/server.so(+0xc804)[0x7f6ff3754804]
/usr/lib/x86_64-linux-gnu/glusterfs/3.12.4/xlator/protocol/server.so(+0x276ce)[0x7f6ff376f6ce]
/usr/lib/x86_64-linux-gnu/libgfrpc.so.0(rpcsvc_request_handler+0x96)[0x7f70003feca6]
/lib/x86_64-linux-gnu/libpthread.so.0(+0x76ba)[0x7f6fffdc46ba]
/lib/x86_64-linux-gnu/libc.so.6(clone+0x6d)[0x7f6fffafa3dd]
---------

Anywhere from 1 to 5 brick processes on various hosts will all die at the same time.

Version-Release number of selected component (if applicable):

3.12.4


How reproducible:

This happens consistently within the first 10 minutes of a rebalance.

Steps to Reproduce:
- We had an existing gluster volume with about 2 PB of data in it. Since our existing gluster configuration (3.7.20) was fairly old, we decided to bring down the cluster and rebuild it fresh with the existing data. All gluster 3.7.20 libraries were purged, the .glusterfs directory was deleted from each brick, and glusterd 3.12.4 was installed. All 252 bricks were re-added to the cluster and a fix-layout completed successfully. However, when a full rebalance is initiated, peer brick processes eventually crash.

Actual results:


Expected results:


Additional info:

Comment 1 Atin Mukherjee 2018-01-11 08:46:53 UTC
Jiffin, it seems to be crashing from selinux.c. Can you please check?

Comment 2 Jiffin 2018-01-11 09:03:02 UTC
Sure, I will take a look.

Comment 3 Worker Ant 2018-01-18 04:02:01 UTC
REVIEW: https://review.gluster.org/19220 (selinux-xlator : validate dict before calling dict_rename_key()) posted (#1) for review on master by jiffin tony Thottan

Comment 4 Jiffin 2018-01-18 04:03:17 UTC
From the core it looks like a NULL dict was passed to fops handled by the selinux xlator, which caused this error. A patch has been posted upstream for review: https://review.gluster.org/19220
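For illustration, here is a minimal, self-contained C sketch of the kind of NULL check the patch title describes ("validate dict before calling dict_rename_key()"). The toy_* types and functions, the xattr key strings, and the single-entry dict layout are stand-ins invented for this example, not the actual libglusterfs or selinux-xlator code; the real dict_rename_key() takes the dict's lock before touching its members, which is why the backtrace above bottoms out in pthread_mutex_lock() when the dict pointer is NULL.

/*
 * Minimal sketch (illustrative stand-ins, not GlusterFS source) of validating
 * the dict before renaming a key, as the patch title describes.
 */
#include <errno.h>
#include <pthread.h>
#include <stdio.h>
#include <string.h>

#define SELINUX_XATTR         "security.selinux"          /* client-visible key (assumed) */
#define SELINUX_GLUSTER_XATTR "trusted.glusterfs.selinux" /* remapped key (assumed) */

typedef struct toy_dict {
    pthread_mutex_t lock;   /* taken before any key manipulation */
    char key[64];           /* single-entry toy dictionary */
} toy_dict_t;

/* Stand-in for dict_rename_key(): locks the dict, then renames the key. */
static int toy_dict_rename_key(toy_dict_t *d, const char *key,
                               const char *replace_key)
{
    if (!d)                 /* without this, &d->lock is a near-NULL address   */
        return -EINVAL;     /* and pthread_mutex_lock() raises SIGSEGV (sig 11) */

    pthread_mutex_lock(&d->lock);
    if (strcmp(d->key, key) == 0)
        snprintf(d->key, sizeof(d->key), "%s", replace_key);
    pthread_mutex_unlock(&d->lock);
    return 0;
}

/* Stand-in for the xlator's (f)setxattr path: validate the dict first and
 * simply pass the fop through unchanged when no xattr dict was supplied. */
static int toy_selinux_fsetxattr(toy_dict_t *xattr_dict)
{
    if (xattr_dict)         /* the guard the patch title refers to */
        toy_dict_rename_key(xattr_dict, SELINUX_XATTR, SELINUX_GLUSTER_XATTR);
    return 0;               /* wind the fop onward as usual */
}

int main(void)
{
    /* A rebalance-driven fsetxattr that reaches the xlator with no xattr
     * dict attached no longer crashes on the guarded path. */
    toy_selinux_fsetxattr(NULL);

    toy_dict_t d = { .lock = PTHREAD_MUTEX_INITIALIZER,
                     .key  = SELINUX_XATTR };
    toy_selinux_fsetxattr(&d);
    printf("key after rename: %s\n", d.key);
    return 0;
}

Built with e.g. gcc -pthread, the guarded path simply skips the rename and winds the fop onward when no xattr dict is attached, instead of dereferencing a near-NULL lock address.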

Comment 5 Worker Ant 2018-01-18 04:28:05 UTC
REVISION POSTED: https://review.gluster.org/19220 (selinux-xlator : validate dict before calling dict_rename_key()) posted (#2) for review on master by jiffin tony Thottan

Comment 7 Worker Ant 2018-01-19 07:31:47 UTC
REVIEW: https://review.gluster.org/19241 (selinux-xlator : validate dict before calling dict_rename_key()) posted (#2) for review on release-3.12 by jiffin tony Thottan

Comment 8 Worker Ant 2018-02-02 06:47:56 UTC
COMMIT: https://review.gluster.org/19241 committed in release-3.12 by "jiffin tony Thottan" <jthottan> with a commit message- selinux-xlator : validate dict before calling dict_rename_key()

Upstream reference
>Change-Id: I71da3b64e5e8c82e8842e119b2b05da3e2ace550
>BUG: 1535772
>Signed-off-by: Jiffin Tony Thottan <jthottan>
>(cherry picked from commit bee06ccd7b80e3f5804f0c7c7c56936fed6d2b4e)

Change-Id: I71da3b64e5e8c82e8842e119b2b05da3e2ace550
BUG: 1533269

Comment 9 Jiffin 2018-03-05 07:14:08 UTC
This bug is getting closed because a release has been made available that should address the reported issue. In case the problem is still not fixed with glusterfs-3.12.6, please open a new bug report.

glusterfs-3.12.6 has been announced on the Gluster mailing lists [1]; packages for several distributions should become available in the near future. Keep an eye on the Gluster Users mailing list [2] and the update infrastructure for your distribution.

[1] http://lists.gluster.org/pipermail/gluster-users/2018-February/033552.html
[2] https://www.gluster.org/pipermail/gluster-users/

Comment 10 Red Hat Bugzilla 2023-09-14 04:15:26 UTC
The needinfo request[s] on this closed bug have been removed as they have been unresolved for 1000 days