Bug 1628651
Field | Value
---|---
Summary | rpc marks brick disconnected from glusterd
Product | [Red Hat Storage] Red Hat Gluster Storage
Component | glusterd
Version | rhgs-3.4
Status | CLOSED WONTFIX
Severity | urgent
Priority | urgent
Reporter | Raghavendra Talur <rtalur>
Assignee | Mohit Agrawal <moagrawa>
QA Contact | Bala Konda Reddy M <bmekala>
CC | amukherj, apaladug, dmoessne, ekuric, hongkliu, mchangir, mifiedle, rcyriac, rhs-bugs, sanandpa, sankarshan, sheggodu, storage-qa-internal, vbellur
Keywords | Reopened, ZStream
Target Milestone | ---
Target Release | ---
Hardware | Unspecified
OS | Unspecified
Whiteboard | ocs-dependency-issue
Doc Type | If docs needed, set a value
Story Points | ---
Clones | 1631128, 1631329 (view as bug list)
Last Closed | 2018-10-25 08:41:03 UTC
Type | Bug
Regression | ---
Mount Type | ---
Bug Depends On | 1631128
Bug Blocks | 1598890, 1631329
Description
Raghavendra Talur, 2018-09-13 15:54:18 UTC
I am redoing the test ... # tailf /var/log/messages Did not see the file rolling ... Which particular lines should I see in that file?

RCA: This turned out to be a brick-multiplexing bug in glusterd. When a volume is stopped, individual bricks are detached from the brick process during the brick-op phase of the volume-stop transaction, and in the brick-op callback a counter called "blockers" is decremented inside a sync_lock()/sync_unlock() section. The same counter is incremented before the RPC is submitted. The purpose of this counter is to ensure that, at any given point in time, the commit phase of a transaction originates only when the counter is 0, i.e. no brick ops are still in flight.

Because heketi distributes the volume-stop commands across different CLIs and glusterds, these volume-stop requests are processed in parallel, and the additional context switches incurred by taking sync_lock/sync_unlock for every update of the blockers counter drive up the overall wait time. Since RHGS 3.3.1, if a regular transaction takes more than 180 seconds, its lock timer is forcibly expired. Unfortunately, in this situation some transactions are kicked out while they are still in progress. This is why we see unlock-timer callback entries in the glusterd log.

I believe this also had a ripple effect on the saved frames, which is why we see a critical log entry like:

[2018-08-28 21:23:49.759511] C [rpc-clnt.c:449:rpc_clnt_fill_request_info] 0-management: cannot lookup the saved frame corresponding to xid (2170)

This in turn causes a disconnect of the original brick, given that the rpc connections for all the attached bricks are copies of the parent brick's connection.

Updating the blockers counter under synclock is overkill; updating it through the GF_ATOMIC API is sufficient for the atomicity guarantee.

upstream patch : https://review.gluster.org/#/c/glusterfs/+/21221/