Bug 1620544
Summary: | Brick process NOT ONLINE for heketidb and block-hosting volume | ||
---|---|---|---|
Product: | [Community] GlusterFS | Reporter: | Atin Mukherjee <amukherj> |
Component: | glusterd | Assignee: | bugs <bugs> |
Status: | CLOSED CURRENTRELEASE | QA Contact: | |
Severity: | high | Docs Contact: | |
Priority: | unspecified | ||
Version: | mainline | CC: | amukherj, bmekala, bugs, kramdoss, madam, nberry, pprakash, rhs-bugs, rtalur, sankarshan, sarumuga, storage-qa-internal, vbellur |
Target Milestone: | --- | ||
Target Release: | --- | ||
Hardware: | Unspecified | ||
OS: | Unspecified | ||
Whiteboard: | |||
Fixed In Version: | glusterfs-5.0 | Doc Type: | If docs needed, set a value |
Doc Text: | | Story Points: | ---
Clone Of: | 1620469 | Environment: | |
Last Closed: | 2018-10-23 15:17:29 UTC | Type: | Bug |
Regression: | --- | Mount Type: | --- |
Documentation: | --- | CRM: | |
Verified Versions: | | Category: | ---
oVirt Team: | --- | RHEL 7.3 requirements from Atomic Host: | |
Cloudforms Team: | --- | Target Upstream Version: | |
Embargoed: | |||
Bug Depends On: | |||
Bug Blocks: | 1620469 | |
Comment 1
Atin Mukherjee
2018-08-23 07:03:42 UTC
Root cause: Because of refactoring in the glusterd_brick_start () code path, during a friend handshake where glusterd is only supposed to connect to the brick (quorum not yet met), we ended up restarting bricks. Since brick restarts are attempted from multiple threads, this can deadlock while bringing up the very first brick when brick multiplexing is enabled.

COMMIT: https://review.gluster.org/20935 committed in master by "Atin Mukherjee" <amukherj> with the following commit message:

glusterd: glusterd_brick_start shouldn't try to bring up brick if only_connect is true

With the latest refactoring in the glusterd_brick_start () function, if we run into a situation where gf_is_service_running () returns a valid pid that is running but doesn't belong to a gluster process, then even with the only_connect flag passed as _gf_true we'd end up trying to start the brick. That causes a deadlock under brick multiplexing, as both glusterd_restart_bricks () and glusterd_do_volume_quorum_action () context-switch with each other for the same brick. The following backtrace (truncated in this report; the frame headers with function names are missing) shows the same:

```
(gdb) t a a bt
Thread 8 (Thread 0x7fcced48a700 (LWP 11959)):
  srch_vol=srch_vol@entry=0xbe0410, comp_vol=comp_vol@entry=0xc03680,
  brickinfo=brickinfo@entry=0xc14ef0) at glusterd-utils.c:5834
  brickinfo=0xc14ef0, volinfo=0xc03680, conf=<optimized out>)
  at glusterd-utils.c:5902
  brickinfo=brickinfo@entry=0xc14ef0, wait=wait@entry=_gf_false,
  only_connect=only_connect@entry=_gf_true) at glusterd-utils.c:6251
  volinfo=0xc03680, meets_quorum=_gf_true) at glusterd-server-quorum.c:402
  at glusterd-server-quorum.c:443
  iov=iov@entry=0x7fcce0004040, count=count@entry=1,
  myframe=myframe@entry=0x7fcce00023a0) at glusterd-rpc-ops.c:542
  iov=0x7fcce0004040, count=1, myframe=0x7fcce00023a0,
  fn=0x7fccf12403d0 <__glusterd_friend_add_cbk>) at glusterd-rpc-ops.c:223
  at rpc-transport.c:538
Thread 7 (Thread 0x7fccedc8b700 (LWP 11958)):
Thread 6 (Thread 0x7fccf1d67700 (LWP 11877)):
  brickinfo=brickinfo@entry=0xc14ef0) at glusterd-utils.c:5834
  at glusterd-utils.c:6251
Thread 5 (Thread 0x7fccf2568700 (LWP 11876)):
Thread 4 (Thread 0x7fccf2d69700 (LWP 11875)):
Thread 3 (Thread 0x7fccf356a700 (LWP 11874)):
Thread 2 (Thread 0x7fccf3d6b700 (LWP 11873)):
Thread 1 (Thread 0x7fccf68a8780 (LWP 11872)):
```

Fix: Ensure that when only_connect is true we never restart the brick and only attempt to connect to it (a hedged sketch of this guard follows at the end of this comment).

Test: Simulated a code change so that gf_is_service_running () always returns true, to hit the scenario.

Change-Id: Iec184e6c9e8aabef931d310f931f4d7a580f0f48
Fixes: bz#1620544
Signed-off-by: Atin Mukherjee <amukherj>

This bug is getting closed because a release has been made available that should address the reported issue. In case the problem is still not fixed with glusterfs-5.0, please open a new bug report.

glusterfs-5.0 has been announced on the Gluster mailinglists [1]; packages for several distributions should become available in the near future. Keep an eye on the Gluster Users mailinglist [2] and the update infrastructure for your distribution.

[1] https://lists.gluster.org/pipermail/announce/2018-October/000115.html
[2] https://www.gluster.org/pipermail/gluster-users/
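For illustration, here is a minimal, self-contained C sketch of the guard the fix describes. The types and helpers (brick_t, brick_is_running, connect_to_brick, start_brick) are hypothetical stand-ins, not glusterd's actual API; only the only_connect short-circuit mirrors the behavior the commit gives glusterd_brick_start ().

```c
/*
 * Hedged sketch of the only_connect guard described above. The names
 * brick_t, brick_is_running(), connect_to_brick() and start_brick() are
 * hypothetical stand-ins; the real code lives in glusterd-utils.c and
 * uses glusterd's own types, pidfiles and RPC machinery.
 */
#include <stdbool.h>
#include <stdio.h>

typedef struct {
    const char *path; /* brick directory */
    int         pid;  /* pid from the brick's pidfile; 0 if none */
} brick_t;

/* Stand-in for gf_is_service_running(): in the buggy scenario it can
 * report a live pid that does not belong to a brick process at all. */
static bool
brick_is_running(const brick_t *b)
{
    return b->pid > 0;
}

static void
connect_to_brick(const brick_t *b)
{
    printf("connect: attaching RPC to brick %s\n", b->path);
}

static void
start_brick(const brick_t *b)
{
    printf("start: (re)spawning brick %s\n", b->path);
}

/* Modeled on glusterd_brick_start(): with the fix, only_connect == true
 * can never fall through to start_brick(), no matter what the
 * liveness check reports. */
static void
brick_start(const brick_t *b, bool only_connect)
{
    if (only_connect) {
        connect_to_brick(b); /* the fix: attempt a connection only */
        return;
    }
    if (brick_is_running(b)) {
        connect_to_brick(b); /* already up: just (re)attach */
        return;
    }
    start_brick(b);
}

int
main(void)
{
    brick_t b = { "/bricks/heketidb/brick", 12345 }; /* illustrative values */
    /* Friend handshake before quorum is met: only connect. Before the
     * fix, a misreported pid could send this path into start_brick(),
     * deadlocking with glusterd_restart_bricks() under multiplexing. */
    brick_start(&b, true);
    return 0;
}
```

The point of the ordering is that the only_connect decision is taken before any process-liveness check, so a misfiring gf_is_service_running () can no longer route the handshake path into a restart.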