1352805 – [GSS] Rebalance crashed

Keywords:
Status:	CLOSED ERRATA
Alias:	None
Product:	Red Hat Gluster Storage
Classification:	Red Hat Storage
Component:	distribute
Sub Component:
Version:	rhgs-3.1
Hardware:	Unspecified
OS:	Unspecified
Priority:	high
Severity:	high
Target Milestone:	---
Target Release:	RHGS 3.2.0
Assignee:	Susant Kumar Palai
QA Contact:	Prasad Desala
Docs Contact:
URL:
Whiteboard:
Depends On:
Blocks:	1351515 1351530 1359711 1362069 1362070

Reported:	2016-07-05 06:34 UTC by Oonkwee Lim
Modified:	2020-08-13 08:30 UTC
CC List:	12 users
Fixed In Version:	glusterfs-3.8.4-1
Doc Type:	Bug Fix
Doc Text:	The thread pool limit for the rebalance process was static and set to 40. On machines with enough cores for the rebalance process to attempt to create more than 40 threads, the process crashed when it accessed more memory than was allocated on the stack. The thread pool limit is now dynamic and is determined by the number of available cores.
Clone Of:
Clones:	1359711
Environment:
Last Closed:	2017-03-23 05:39:00 UTC
Embargoed:
Dependent Products:

Links
System	ID	Private	Priority	Status	Summary	Last Updated
Red Hat Product Errata	RHSA-2017:0486	0	normal	SHIPPED_LIVE	Moderate: Red Hat Gluster Storage 3.2.0 security, bug fix, and enhancement update	2017-03-23 09:18:45 UTC

Description Oonkwee Lim 2016-07-05 06:34:11 UTC

Description of problem:
It looks like rebalance crashed:

[2016-07-02 19:16:54.275504] W [dict.c:429:dict_set] (-->/usr/lib64/glusterfs/3.7.9/xlator/cluster/replicate.so(afr_lookup_xattr_req_prepare+0xb0) [0x7f03aaca9f10] -->/lib64/libglusterfs.so.0(dict_set_str+0x2c) [0x7f03b87dbd5c] -->/lib64/libglusterfs.so.0(dict_set+0xa6) [0x7f03b87d9c06] ) 0-dict: !this || !value for key=link-count [Invalid argument]
pending frames:
frame : type(0) op(0)
frame : type(0) op(0)
frame : type(0) op(0)
frame : type(0) op(0)
patchset: git://git.gluster.com/glusterfs.git
signal received: 6
time of crash:
2016-07-02 19:16:54
configuration details:
argp 1
backtrace 1
dlfcn 1
libpthread 1
llistxattr 1
setfsid 1
spinlock 1
epoll.h 1
xattr.h 1
st_atim.tv_nsec 1
package-string: glusterfs 3.7.9
/lib64/libglusterfs.so.0(_gf_msg_backtrace_nomem+0xc2)[0x7f03b87e11c2]
/lib64/libglusterfs.so.0(gf_print_trace+0x31d)[0x7f03b880697d]
/lib64/libc.so.6(+0x35670)[0x7f03b6ecd670]
/lib64/libc.so.6(gsignal+0x37)[0x7f03b6ecd5f7]
/lib64/libc.so.6(abort+0x148)[0x7f03b6ecece8]
/lib64/libc.so.6(+0x75327)[0x7f03b6f0d327]
/lib64/libc.so.6(__fortify_fail+0x37)[0x7f03b6fa6597]
/lib64/libc.so.6(__fortify_fail+0x0)[0x7f03b6fa6560]
/usr/lib64/glusterfs/3.7.9/xlator/cluster/distribute.so(gf_defrag_start_crawl+0x846)[0x7f03aa9f7756]
/lib64/libglusterfs.so.0(synctask_wrap+0x12)[0x7f03b882f262]
/lib64/libc.so.6(+0x47110)[0x7f03b6edf110]
---------

Version-Release number of selected component (if applicable):
glusterfs-3.7.9-10.el6rhs.x86_64                   Fri Jul  1 18:12:47 2016

operating-version=30712

How reproducible:
Once

Steps to Reproduce:
1. Run rebalance

Actual results:
Rebalance crashed

Expected results:
Rebalance should not crash

Additional info:
Is there a way to restart the rebalance?

sosreports:
https://api.access.redhat.com/rs/cases/01662018/attachments/a1742f83-a541-4895-b5d5-462895eb66d5
https://api.access.redhat.com/rs/cases/01662018/attachments/2b4be32c-f913-4c28-99c2-8a8cba7b4afb
https://api.access.redhat.com/rs/cases/01662018/attachments/819d02d0-a821-46bd-abf5-4da029435493

rebalance logs:
https://api.access.redhat.com/rs/cases/01662018/attachments/721b28cb-79eb-490f-bdcb-6372384d8716

https://api.access.redhat.com/rs/cases/01662018/attachments/e026c032-b916-4ba3-990f-72e34ba8f7c3

Comment 12 Susant Kumar Palai 2016-07-22 12:17:27 UTC

Thanks Oonkwee for providing the inputs.

RCA: The thread pool limit for the rebalance process is static and is currently set to 40.
The number of migrator threads created by rebalance is (number of cores - 4), which in this case is 44. Hence, while creating more than 40 threads, rebalance accesses memory beyond the stack-allocated memory, resulting in a crash.
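
The arithmetic in the RCA can be illustrated with a minimal C sketch. This is not GlusterFS source code: the function names and the two macros are hypothetical, and clamping the result to at least 1 is an assumption made only so the sketch is well-defined on small machines.

```c
#include <stdbool.h>

/* Static thread pool limit described in the RCA. */
#define STATIC_POOL_LIMIT 40

/* Cores subtracted from the total when sizing the migrator threads (per the RCA). */
#define RESERVED_CORES 4

/* Migrator threads requested on a machine with `cores` cores: cores - 4.
 * The floor of 1 is an assumption of this sketch, not taken from the bug. */
long migrator_thread_count(long cores)
{
    long n = cores - RESERVED_CORES;
    return n < 1 ? 1 : n;
}

/* True when the requested thread count no longer fits in the static pool,
 * which is the condition the RCA identifies as leading to the crash. */
bool exceeds_static_pool(long cores)
{
    return migrator_thread_count(cores) > STATIC_POOL_LIMIT;
}
```

With 48 cores, migrator_thread_count returns 44 and exceeds_static_pool returns true. With 44 cores the count is exactly 40 and still fits.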

As part of the fix, the thread pool limit needs to be dynamic. I am working on the patch and will send it upstream after testing it.
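
One way a dynamically sized pool could look is sketched below in C. This is an illustration under stated assumptions, not the upstream patch: struct worker_slot, struct worker_pool and all function names are hypothetical, and the floor of 1 for small machines is an assumption. The point is only that the storage is allocated on the heap with a capacity derived from the core count, so the number of slots always matches the number of threads to be created.

```c
#include <stdlib.h>

/* Hypothetical per-thread bookkeeping slot; not a GlusterFS type. */
struct worker_slot {
    int id;
    int in_use;
};

/* Hypothetical pool whose capacity is chosen at run time. */
struct worker_pool {
    struct worker_slot *slots;
    size_t capacity;
};

/* Pool size derived from the core count (cores - 4, as in the RCA).
 * The floor of 1 is an assumption of this sketch. */
size_t pool_size_for_cores(long cores)
{
    long n = cores - 4;
    return n < 1 ? 1 : (size_t)n;
}

/* Allocate the slots on the heap so the capacity follows the core count
 * instead of a fixed constant. Returns 0 on success, -1 on failure. */
int worker_pool_init(struct worker_pool *pool, long cores)
{
    size_t n = pool_size_for_cores(cores);

    pool->slots = calloc(n, sizeof(*pool->slots));
    if (pool->slots == NULL) {
        pool->capacity = 0;
        return -1;
    }
    pool->capacity = n;
    for (size_t i = 0; i < n; i++)
        pool->slots[i].id = (int)i;
    return 0;
}

/* Release the slots and reset the pool to an empty state. */
void worker_pool_destroy(struct worker_pool *pool)
{
    free(pool->slots);
    pool->slots = NULL;
    pool->capacity = 0;
}
```

On Linux, a caller could pass the value of sysconf(_SC_NPROCESSORS_ONLN) from <unistd.h> as the core count. For 48 cores the pool then holds 44 slots.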

Thanks,
Susant

Comment 13 Susant Kumar Palai 2016-07-25 10:43:56 UTC

Upstream Patch posted at: http://review.gluster.org/#/c/15000

Thanks,
Susant

Comment 25 Prasad Desala 2016-09-30 17:25:37 UTC

Reproduced the issue with glusterfs version 3.7.9-10 on a two-node RHGS VM cluster, with each VM configured with 48 vCPUs.
The same environment was used to verify the hotfix build. The issue is fixed and the rebalance crash was not seen.

lscpu:
======
Architecture:          x86_64
CPU op-mode(s):        32-bit, 64-bit
Byte Order:            Little Endian
CPU(s):                48
On-line CPU(s) list:   0-47
Thread(s) per core:    2
Core(s) per socket:    12
Socket(s):             2
NUMA node(s):          1
Vendor ID:             GenuineIntel
CPU family:            6
Model:                 42
Model name:            Intel Xeon E312xx (Sandy Bridge)
Stepping:              1
CPU MHz:               2199.998
BogoMIPS:              4399.99
Hypervisor vendor:     KVM
Virtualization type:   full
L1d cache:             32K
L1i cache:             32K
L2 cache:              4096K
NUMA node0 CPU(s):     0-47

Steps performed:

1) Created a two-node RHGS cluster with 48 vCPUs per node.
2) Created a distributed-replicate volume and started it.
3) FUSE-mounted the volume on a client and created files and directories on the mount point.
4) Performed an add-brick operation to trigger a rebalance.
5) Started rebalance; no crashes were seen and it completed successfully.

Also verified this BZ against glusterfs version 3.8.4-1.el7rhgs.x86_64; no rebalance crashes were seen. Hence, moving this BZ to the verified state.

Comment 29 Prasad Desala 2016-10-14 10:27:55 UTC

Verified the new builds noted above in this BZ. A configuration similar to the one in Comment 25 was used for verification. Rebalance started and completed successfully without any crashes or errors.

Comment 41 Susant Kumar Palai 2017-03-06 05:46:35 UTC

Doc looks fine.

Comment 43 errata-xmlrpc 2017-03-23 05:39:00 UTC

Since the problem described in this bug report should be
resolved in a recent advisory, it has been closed with a
resolution of ERRATA.

For information on the advisory, and where to find the updated
files, follow the link below.

If the solution does not work for you, open a new bug report.

https://rhn.redhat.com/errata/RHSA-2017-0486.html
