Bug 1352805
| Summary: | [GSS] Rebalance crashed | |||
|---|---|---|---|---|
| Product: | [Red Hat Storage] Red Hat Gluster Storage | Reporter: | Oonkwee Lim <olim> | |
| Component: | distribute | Assignee: | Susant Kumar Palai <spalai> | |
| Status: | CLOSED ERRATA | QA Contact: | Prasad Desala <tdesala> | |
| Severity: | high | Docs Contact: | ||
| Priority: | high | |||
| Version: | rhgs-3.1 | CC: | amukherj, asrivast, bkunal, nbalacha, olim, pousley, rabhat, rcyriac, rhinduja, rhs-bugs, sankarshan, spalai | |
| Target Milestone: | --- | |||
| Target Release: | RHGS 3.2.0 | |||
| Hardware: | Unspecified | |||
| OS: | Unspecified | |||
| Whiteboard: | ||||
| Fixed In Version: | glusterfs-3.8.4-1 | Doc Type: | Bug Fix | |
| Doc Text: |
The thread pool limit for the rebalance process was static and set to 40. This meant that machines with more than 40 cores crashed when the rebalance process attempted to create more than 40 threads and access more memory than was allocated to the stack. The thread pool limit is now dynamic, and is determined based on the number of available cores.
|
Story Points: | --- | |
| Clone Of: | ||||
| : | 1359711 (view as bug list) | Environment: | ||
| Last Closed: | 2017-03-23 05:39:00 UTC | Type: | Bug | |
| Regression: | --- | Mount Type: | --- | |
| Documentation: | --- | CRM: | ||
| Verified Versions: | Category: | --- | ||
| oVirt Team: | --- | RHEL 7.3 requirements from Atomic Host: | ||
| Cloudforms Team: | --- | Target Upstream Version: | ||
| Embargoed: | ||||
| Bug Depends On: | ||||
| Bug Blocks: | 1351515, 1351530, 1359711, 1362069, 1362070 | |||
|
Description
Oonkwee Lim
2016-07-05 06:34:11 UTC
Thanks Oonkwee for providing the inputs.
RCA: The thread pool limit for the rebalance process is static and is currently set to 40.
The number of migrator threads created in rebalance is {$(no. of cores) - 4}, which in this case is 44. Hence, while creating more than 40 threads, rebalance tries to access memory beyond the stack-allocated memory, resulting in a crash.
As part of the fix, the thread pool limit needs to be dynamic. I am working on the patch and will send it upstream soon after testing it.
Thanks,
Susant
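
For readers who want the arithmetic in the RCA spelled out, here is a minimal sketch in Python. It is not the glusterfs code (which is C) and not the upstream patch. The limit of 40 and the "cores - 4" formula come from the RCA; the function names, the clamping to zero, and the way the dynamic limit is computed are assumptions made only for illustration.

```python
# Illustrative model of the RCA arithmetic; not the actual glusterfs code.

STATIC_POOL_LIMIT = 40  # static thread pool limit stated in the RCA


def migrator_threads(cores: int) -> int:
    """Migrator threads rebalance tries to create: (no. of cores) - 4.

    Clamping to zero for very small core counts is an assumption.
    """
    return max(cores - 4, 0)


def exceeds_static_limit(cores: int) -> bool:
    """True when the requested thread count overruns the static limit."""
    return migrator_threads(cores) > STATIC_POOL_LIMIT


def dynamic_pool_limit(cores: int) -> int:
    """Hypothetical dynamic limit: sized from the core count, so it can
    never be smaller than the number of threads requested."""
    return max(migrator_threads(cores), 1)


if __name__ == "__main__":
    for cores in (8, 44, 48):
        print(cores, migrator_threads(cores), exceeds_static_limit(cores),
              dynamic_pool_limit(cores))
```

With 48 cores the model requests 44 threads, which exceeds the static limit of 40; with 44 cores it requests exactly 40, which does not.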
Upstream Patch posted at: http://review.gluster.org/#/c/15000

Thanks,
Susant

Reproduced the issue with glusterfs version 3.7.9-10 on a two node RHGS VM cluster. The VMs are configured to have 48 vCPU each. The same environment has been used to verify the hotfix build. The issue is fixed and rebalance crash was not seen.

lscpu:
======
Architecture: x86_64
CPU op-mode(s): 32-bit, 64-bit
Byte Order: Little Endian
CPU(s): 48
On-line CPU(s) list: 0-47
Thread(s) per core: 2
Core(s) per socket: 12
Socket(s): 2
NUMA node(s): 1
Vendor ID: GenuineIntel
CPU family: 6
Model: 42
Model name: Intel Xeon E312xx (Sandy Bridge)
Stepping: 1
CPU MHz: 2199.998
BogoMIPS: 4399.99
Hypervisor vendor: KVM
Virtualization type: full
L1d cache: 32K
L1i cache: 32K
L2 cache: 4096K
NUMA node0 CPU(s): 0-47

Here are the steps that were performed:
1) Created a two node RHGS cluster which has 48 vCPU each.
2) Created a distributed replica volume and started it.
3) Fuse mounted the volume to a client and created files and directories on the mount point.
4) Performed Add-brick operation to trigger a rebalance.
5) Started rebalance, no crashes were seen during rebalance and it completed successfully.

Also, verified this BZ against glusterfs version 3.8.4-1.el7rhgs.x86_64 and no rebalance crashes were seen. Hence, moving this BZ state to verified.

Verified the above new builds updated in the BZ. Similar config as in Comment 25 is used for verification. We are able to start the rebalance and it completed successfully without any crashes/errors.

Doc looks fine.

Since the problem described in this bug report should be resolved in a recent advisory, it has been closed with a resolution of ERRATA. For information on the advisory, and where to find the updated files, follow the link below. If the solution does not work for you, open a new bug report.

https://rhn.redhat.com/errata/RHSA-2017-0486.html