Bug 1427012
| Summary: | Disconnects in nfs mount leads to IO hang and mount inaccessible | | |
|---|---|---|---|
| Product: | [Community] GlusterFS | Reporter: | Raghavendra G <rgowdapp> |
| Component: | rpc | Assignee: | Raghavendra G <rgowdapp> |
| Status: | CLOSED CURRENTRELEASE | QA Contact: | |
| Severity: | urgent | Docs Contact: | |
| Priority: | unspecified | | |
| Version: | mainline | CC: | amukherj, atumball, bugs, jthottan, ksandha, rcyriac, rgowdapp, rhinduja, rhs-bugs, rjoseph, skoduri, storage-qa-internal |
| Target Milestone: | --- | Keywords: | Triaged |
| Target Release: | --- | | |
| Hardware: | All | | |
| OS: | Linux | | |
| Whiteboard: | | | |
| Fixed In Version: | glusterfs-3.11.0 | Doc Type: | If docs needed, set a value |
| Doc Text: | | Story Points: | --- |
| Clone Of: | 1425740 | | |
| : | 1428670 (view as bug list) | Environment: | |
| Last Closed: | 2017-05-30 18:45:26 UTC | Type: | Bug |
| Regression: | --- | Mount Type: | --- |
| Documentation: | --- | CRM: | |
| Verified Versions: | | Category: | --- |
| oVirt Team: | --- | RHEL 7.3 requirements from Atomic Host: | |
| Cloudforms Team: | --- | Target Upstream Version: | |
| Embargoed: | | | |
| Bug Depends On: | | | |
| Bug Blocks: | 1409135, 1425740, 1428670, 1462447 | | |
Comment 1
Worker Ant
2017-02-28 08:05:29 UTC
REVIEW: https://review.gluster.org/16784 (rpc/clnt: remove locks while notifying CONNECT/DISCONNECT) posted (#2) for review on master by Raghavendra G (rgowdapp)

REVIEW: https://review.gluster.org/16784 (rpc/clnt: remove locks while notifying CONNECT/DISCONNECT) posted (#3) for review on master by Raghavendra G (rgowdapp)

COMMIT: https://review.gluster.org/16784 committed in master by Raghavendra G (rgowdapp)

------

commit 773f32caf190af4ee48818279b6e6d3c9f2ecc79
Author: Raghavendra G <rgowdapp>
Date: Tue Feb 28 13:13:59 2017 +0530

rpc/clnt: remove locks while notifying CONNECT/DISCONNECT

Locking during notify was introduced as part of commit aa22f24f5db7659387704998ae01520708869873 [1]. That fix addressed out-of-order CONNECT/DISCONNECT events from rpc-clnt to parent xlators [2]. However, as part of handling a DISCONNECT, protocol/client unwinds saved frames (with failure) that are waiting for responses. This saved_frames_unwind can be a costly operation and hence ideally shouldn't be included in the critical section of notifylock, as it unnecessarily delays the reconnection to the same brick. Also, it's not good practice to pass control to other xlators while holding a lock, as it can lead to deadlocks. So, this patch removes locking in rpc-clnt while notifying parent xlators.

To fix [2], two changes are present in this patch:

* notify DISCONNECT before cleaning up the rpc connection (same as commit a6b63e11b7758cf1bfcb6798, patch [3]).
* protocol/client uses rpc_clnt_cleanup_and_start, which cleans up the rpc connection and does a start while handling a DISCONNECT event from rpc.

Note that patch [3] was reverted because rpc_clnt_start, called in the quick_reconnect path of protocol/client, didn't invoke connect on the transport, as the connection was not cleaned up _yet_ (cleanup had been moved past notification in rpc-clnt). This resulted in clients never attempting to connect to bricks.

Note that one of the neater ways to fix [2] (without using locks) is to introduce generation numbers to map CONNECTs and DISCONNECTs across epochs and to ignore DISCONNECT events that don't belong to the current epoch (a sketch of this idea follows at the end of this report). However, this approach is a bit complex to implement and requires time. So, the current patch is a hacky stop-gap fix until we come up with a cleaner solution.

[1] http://review.gluster.org/15916
[2] https://bugzilla.redhat.com/show_bug.cgi?id=1386626
[3] http://review.gluster.org/15681

Change-Id: I62daeee8bb1430004e28558f6eb133efd4ccf418
Signed-off-by: Raghavendra G <rgowdapp>
BUG: 1427012
Reviewed-on: https://review.gluster.org/16784
Smoke: Gluster Build System <jenkins.org>
Reviewed-by: Milind Changire <mchangir>
NetBSD-regression: NetBSD Build System <jenkins.org>
CentOS-regression: Gluster Build System <jenkins.org>

This bug is being closed because a release that should address the reported issue has been made available. If the problem is still not fixed with glusterfs-3.11.0, please open a new bug report.

glusterfs-3.11.0 has been announced on the Gluster mailing lists [1]; packages for several distributions should become available in the near future. Keep an eye on the Gluster Users mailing list [2] and the update infrastructure for your distribution.

[1] http://lists.gluster.org/pipermail/announce/2017-May/000073.html
[2] https://www.gluster.org/pipermail/gluster-users/

*** Bug 1521034 has been marked as a duplicate of this bug. ***
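To make the generation-number idea from the commit message concrete, here is a minimal standalone C sketch. All names in it (`clnt_conn_t`, `conn_connected`, `conn_disconnected`) are hypothetical illustrations, not the actual rpc-clnt API: each CONNECT bumps an epoch counter, and a DISCONNECT carrying an older epoch is recognized as stale and dropped, so no lock ever needs to be held while notifying parent xlators.

```c
/* Hypothetical sketch of the epoch/generation-number approach described
 * in the commit message above. Not GlusterFS code: names and structure
 * are illustrative only. Build with: cc sketch.c -lpthread */
#include <inttypes.h>
#include <pthread.h>
#include <stdint.h>
#include <stdio.h>

typedef struct {
    uint64_t        epoch; /* bumped on every (re)connect */
    pthread_mutex_t lock;  /* guards epoch only; never held across notify */
} clnt_conn_t;

/* Transport connected: start a new epoch and return its number. */
static uint64_t
conn_connected(clnt_conn_t *conn)
{
    uint64_t e;
    pthread_mutex_lock(&conn->lock);
    e = ++conn->epoch;
    pthread_mutex_unlock(&conn->lock);
    /* Notify parent xlators of CONNECT *without* holding the lock. */
    printf("CONNECT delivered (epoch %" PRIu64 ")\n", e);
    return e;
}

/* Transport disconnected: event_epoch is the epoch the disconnecting
 * session was tagged with at connect time. */
static void
conn_disconnected(clnt_conn_t *conn, uint64_t event_epoch)
{
    uint64_t current;
    pthread_mutex_lock(&conn->lock);
    current = conn->epoch;
    pthread_mutex_unlock(&conn->lock);

    if (event_epoch != current) {
        /* Stale DISCONNECT from an older session: a newer CONNECT has
         * already superseded it, so drop it instead of serializing
         * event delivery with a lock. */
        printf("stale DISCONNECT ignored (epoch %" PRIu64 ")\n", event_epoch);
        return;
    }
    /* Notify parent xlators of DISCONNECT without holding the lock;
     * costly work such as unwinding saved frames can now run outside
     * any critical section. */
    printf("DISCONNECT delivered (epoch %" PRIu64 ")\n", event_epoch);
}

int
main(void)
{
    static clnt_conn_t conn = { 0, PTHREAD_MUTEX_INITIALIZER };

    uint64_t e1 = conn_connected(&conn);  /* first session            */
    uint64_t e2 = conn_connected(&conn);  /* reconnect races ahead of */
    conn_disconnected(&conn, e1);         /* ... the old DISCONNECT,  */
    conn_disconnected(&conn, e2);         /* which is dropped as stale */
    return 0;
}
```

The point of the sketch is the ordering property the patch is after: the mutex protects only the counter itself, so out-of-order CONNECT/DISCONNECT delivery is handled by comparing epochs rather than by holding a lock across the notification path.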