Description of problem:

It looks like rebalance crashed:

[2016-07-02 19:16:54.275504] W [dict.c:429:dict_set] (-->/usr/lib64/glusterfs/3.7.9/xlator/cluster/replicate.so(afr_lookup_xattr_req_prepare+0xb0) [0x7f03aaca9f10] -->/lib64/libglusterfs.so.0(dict_set_str+0x2c) [0x7f03b87dbd5c] -->/lib64/libglusterfs.so.0(dict_set+0xa6) [0x7f03b87d9c06] ) 0-dict: !this || !value for key=link-count [Invalid argument]
pending frames:
frame : type(0) op(0)
frame : type(0) op(0)
frame : type(0) op(0)
frame : type(0) op(0)
patchset: git://git.gluster.com/glusterfs.git
signal received: 6
time of crash:
2016-07-02 19:16:54
configuration details:
argp 1
backtrace 1
dlfcn 1
libpthread 1
llistxattr 1
setfsid 1
spinlock 1
epoll.h 1
xattr.h 1
st_atim.tv_nsec 1
package-string: glusterfs 3.7.9
/lib64/libglusterfs.so.0(_gf_msg_backtrace_nomem+0xc2)[0x7f03b87e11c2]
/lib64/libglusterfs.so.0(gf_print_trace+0x31d)[0x7f03b880697d]
/lib64/libc.so.6(+0x35670)[0x7f03b6ecd670]
/lib64/libc.so.6(gsignal+0x37)[0x7f03b6ecd5f7]
/lib64/libc.so.6(abort+0x148)[0x7f03b6ecece8]
/lib64/libc.so.6(+0x75327)[0x7f03b6f0d327]
/lib64/libc.so.6(__fortify_fail+0x37)[0x7f03b6fa6597]
/lib64/libc.so.6(__fortify_fail+0x0)[0x7f03b6fa6560]
/usr/lib64/glusterfs/3.7.9/xlator/cluster/distribute.so(gf_defrag_start_crawl+0x846)[0x7f03aa9f7756]
/lib64/libglusterfs.so.0(synctask_wrap+0x12)[0x7f03b882f262]
/lib64/libc.so.6(+0x47110)[0x7f03b6edf110]
---------

Version-Release number of selected component (if applicable):
glusterfs-3.7.9-10.el6rhs.x86_64 Fri Jul 1 18:12:47 2016
operating-version=30712

How reproducible:
Once

Steps to Reproduce:
1. Run rebalance

Actual results:
Rebalance crashed

Expected results:
Rebalance should not crash

Additional info:
Is there a way to restart the rebalance?
sosreports:
https://api.access.redhat.com/rs/cases/01662018/attachments/a1742f83-a541-4895-b5d5-462895eb66d5
https://api.access.redhat.com/rs/cases/01662018/attachments/2b4be32c-f913-4c28-99c2-8a8cba7b4afb
https://api.access.redhat.com/rs/cases/01662018/attachments/819d02d0-a821-46bd-abf5-4da029435493

rebalance logs:
https://api.access.redhat.com/rs/cases/01662018/attachments/721b28cb-79eb-490f-bdcb-6372384d8716
https://api.access.redhat.com/rs/cases/01662018/attachments/e026c032-b916-4ba3-990f-72e34ba8f7c3
Thanks, Oonkwee, for providing the inputs.

RCA:
The thread pool limit for the rebalance process is static and is currently 40. The number of migrator threads that rebalance creates is (number of cores) - 4, which in this case is 44. So while creating more than 40 threads, rebalance tries to access memory beyond the stack-allocated memory, which results in the crash.

As part of the fix, the thread pool needs to be dynamic. I am working on the patch and will send it upstream once it has been tested.

Thanks,
Susant
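To illustrate the RCA above, here is a minimal C sketch of the idea behind the fix: size the thread-ID array from the computed thread count instead of a fixed limit of 40. This is not the actual GlusterFS code or the upstream patch. The names STATIC_POOL_LIMIT, migrator_thread_count, exceeds_static_limit, migrator and run_migrators are hypothetical, and clamping the count to at least 1 is an assumption of this sketch.

```c
#include <pthread.h>
#include <stdlib.h>

/* Hypothetical stand-in for the old static pool limit from the RCA. */
#define STATIC_POOL_LIMIT 40

/* Thread count as described in the RCA: (number of cores) - 4.
 * Clamping to at least 1 is an assumption of this sketch. */
static long migrator_thread_count(long cores)
{
    long n = cores - 4;
    return n > 0 ? n : 1;
}

/* True when a fixed array of STATIC_POOL_LIMIT entries would overflow. */
static int exceeds_static_limit(long count)
{
    return count > STATIC_POOL_LIMIT;
}

/* Placeholder worker; a real migrator thread would move files. */
static void *migrator(void *arg)
{
    (void)arg;
    return NULL;
}

/* Start `count` workers using a heap array sized to `count`, then join
 * them. Returns the number of threads started, or -1 if the array
 * could not be allocated. */
static long run_migrators(long count)
{
    pthread_t *tids;
    long started = 0;

    if (count <= 0)
        return 0;

    tids = calloc((size_t)count, sizeof *tids);
    if (tids == NULL)
        return -1;

    for (long i = 0; i < count; i++) {
        if (pthread_create(&tids[i], NULL, migrator, NULL) != 0)
            break; /* stop on the first failure; join what was started */
        started++;
    }
    for (long i = 0; i < started; i++)
        pthread_join(tids[i], NULL);

    free(tids);
    return started;
}
```

With 48 cores, migrator_thread_count(48) gives 44, which exceeds_static_limit reports as larger than the old limit. Because run_migrators allocates the array for the requested count, the 41st to 44th thread IDs land inside the allocation rather than past the end of a fixed-size stack array.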
Upstream patch posted at: http://review.gluster.org/#/c/15000

Thanks,
Susant
Reproduced the issue with glusterfs version 3.7.9-10 on a two-node RHGS VM cluster. Each VM is configured with 48 vCPUs. The same environment was used to verify the hotfix build. The issue is fixed and the rebalance crash was not seen.

lscpu:
======
Architecture:          x86_64
CPU op-mode(s):        32-bit, 64-bit
Byte Order:            Little Endian
CPU(s):                48
On-line CPU(s) list:   0-47
Thread(s) per core:    2
Core(s) per socket:    12
Socket(s):             2
NUMA node(s):          1
Vendor ID:             GenuineIntel
CPU family:            6
Model:                 42
Model name:            Intel Xeon E312xx (Sandy Bridge)
Stepping:              1
CPU MHz:               2199.998
BogoMIPS:              4399.99
Hypervisor vendor:     KVM
Virtualization type:   full
L1d cache:             32K
L1i cache:             32K
L2 cache:              4096K
NUMA node0 CPU(s):     0-47

Here are the steps that were performed:

1) Created a two-node RHGS cluster with 48 vCPUs on each node.
2) Created a distributed-replicate volume and started it.
3) FUSE-mounted the volume on a client and created files and directories on the mount point.
4) Performed an add-brick operation so that a rebalance was needed.
5) Started rebalance. No crashes were seen during rebalance and it completed successfully.

Also verified this BZ against glusterfs version 3.8.4-1.el7rhgs.x86_64; no rebalance crashes were seen.

Hence, moving this BZ to the verified state.
Verified the new builds updated in the BZ above. A configuration similar to the one in Comment 25 was used for verification. We were able to start the rebalance, and it completed successfully without any crashes or errors.
Doc looks fine.
Since the problem described in this bug report should be resolved in a recent advisory, it has been closed with a resolution of ERRATA. For information on the advisory, and where to find the updated files, follow the link below. If the solution does not work for you, open a new bug report. https://rhn.redhat.com/errata/RHSA-2017-0486.html