Note: This bug is displayed in read-only format because the product is no longer active in Red Hat Bugzilla.

Bug 1427012

Summary:	Disconnects in nfs mount leads to IO hang and mount inaccessible
Product:	[Community] GlusterFS	Reporter:	Raghavendra G <rgowdapp>
Component:	rpc	Assignee:	Raghavendra G <rgowdapp>
Status:	CLOSED CURRENTRELEASE	QA Contact:
Severity:	urgent	Docs Contact:
Priority:	unspecified
Version:	mainline	CC:	amukherj, atumball, bugs, jthottan, ksandha, rcyriac, rgowdapp, rhinduja, rhs-bugs, rjoseph, skoduri, storage-qa-internal
Target Milestone:	---	Keywords:	Triaged
Target Release:	---
Hardware:	All
OS:	Linux
Whiteboard:
Fixed In Version:	glusterfs-3.11.0	Doc Type:	If docs needed, set a value
Doc Text:		Story Points:	---
Clone Of:	1425740
Clones:	1428670		Environment:
Last Closed:	2017-05-30 18:45:26 UTC	Type:	Bug
Regression:	---	Mount Type:	---
Documentation:	---	CRM:
Verified Versions:		Category:	---
oVirt Team:	---	RHEL 7.3 requirements from Atomic Host:
Cloudforms Team:	---	Target Upstream Version:
Embargoed:
Bug Depends On:
Bug Blocks:	1409135, 1425740, 1428670, 1462447

Comment 1 Worker Ant 2017-02-28 08:05:29 UTC

REVIEW: https://review.gluster.org/16784 (rpc/clnt: remove locks while notifying CONNECT/DISCONNECT) posted (#1) for review on master by Raghavendra G (rgowdapp)

Comment 2 Worker Ant 2017-02-28 10:01:58 UTC

REVIEW: https://review.gluster.org/16784 (rpc/clnt: remove locks while notifying CONNECT/DISCONNECT) posted (#2) for review on master by Raghavendra G (rgowdapp)

Comment 3 Worker Ant 2017-02-28 11:00:30 UTC

REVIEW: https://review.gluster.org/16784 (rpc/clnt: remove locks while notifying CONNECT/DISCONNECT) posted (#3) for review on master by Raghavendra G (rgowdapp)

Comment 4 Worker Ant 2017-03-01 14:36:09 UTC

COMMIT: https://review.gluster.org/16784 committed in master by Raghavendra G (rgowdapp) 
------
commit 773f32caf190af4ee48818279b6e6d3c9f2ecc79
Author: Raghavendra G <rgowdapp>
Date:   Tue Feb 28 13:13:59 2017 +0530

    rpc/clnt: remove locks while notifying CONNECT/DISCONNECT
    
    Locking during notify was introduced as part of commit
    aa22f24f5db7659387704998ae01520708869873 [1]. That change was made
    to fix out-of-order CONNECT/DISCONNECT events from rpc-clnt to parent
    xlators [2]. However, as part of handling DISCONNECT, protocol/client
    unwinds (with failure) the saved frames waiting for responses. This
    saved_frames_unwind can be a costly operation and hence ideally
    shouldn't be included in the critical section of notifylock, as it
    unnecessarily delays the reconnection to the same brick. Also, it's
    not good practice to pass control to other xlators while holding a
    lock, as it can lead to deadlocks. So, this patch removes locking in
    rpc-clnt while notifying parent xlators.
    
    To fix [2], two changes are present in this patch:
    
    * notify DISCONNECT before cleaning up the rpc connection (same as
      commit a6b63e11b7758cf1bfcb6798, patch [3]).
    * protocol/client uses rpc_clnt_cleanup_and_start, which cleans up the
      rpc connection and does a start, while handling a DISCONNECT event
      from rpc. Note that patch [3] was reverted because rpc_clnt_start,
      called in the quick_reconnect path of protocol/client, didn't invoke
      connect on the transport: the connection was not cleaned up _yet_
      (as cleanup was moved post notification in rpc-clnt). This resulted
      in clients never attempting to connect to bricks.
    
    Note that one of the neater ways to fix [2] (without using locks) is
    to introduce generation numbers to map CONNECT and DISCONNECT events
    across epochs and ignore DISCONNECT events if they don't belong to the
    current epoch. However, this approach is a bit complex to implement
    and requires time. So, the current patch is a hacky stop-gap fix till
    we come up with a cleaner solution.
    
    [1] http://review.gluster.org/15916
    [2] https://bugzilla.redhat.com/show_bug.cgi?id=1386626
    [3] http://review.gluster.org/15681
    
    Change-Id: I62daeee8bb1430004e28558f6eb133efd4ccf418
    Signed-off-by: Raghavendra G <rgowdapp>
    BUG: 1427012
    Reviewed-on: https://review.gluster.org/16784
    Smoke: Gluster Build System <jenkins.org>
    Reviewed-by: Milind Changire <mchangir>
    NetBSD-regression: NetBSD Build System <jenkins.org>
    CentOS-regression: Gluster Build System <jenkins.org>
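
The generation-number alternative mentioned in the commit message above could look roughly like the following. This is only a minimal sketch of the idea, not code from the patch or from GlusterFS: every name here (parent_state_t, parent_notify, EV_CONNECT, EV_DISCONNECT, PARENT_STATE_INIT) is hypothetical, and it assumes the transport stamps each connection attempt with a monotonically increasing epoch that travels with both its CONNECT and its DISCONNECT event. The parent takes its own lock only around the state update, so the notifier never calls into it while holding a lock of its own.

```c
#include <assert.h>
#include <pthread.h>
#include <stdbool.h>
#include <stddef.h>
#include <stdint.h>

/* All names below are hypothetical; none of this is GlusterFS API. */
typedef enum { EV_CONNECT, EV_DISCONNECT } conn_event_t;

typedef struct {
    pthread_mutex_t lock;  /* protects only the two fields below */
    uint64_t epoch;        /* newest connection generation seen so far */
    bool connected;        /* parent's view of the connection state */
} parent_state_t;

#define PARENT_STATE_INIT { PTHREAD_MUTEX_INITIALIZER, 0, false }

/*
 * Apply one event stamped with the epoch of the connection it belongs to.
 * Returns true if the event was accepted, false if it was ignored as stale.
 *
 * - CONNECT is accepted only if it starts a newer epoch.
 * - DISCONNECT is accepted only if it is not older than the current epoch,
 *   so a late DISCONNECT from a previous connection cannot tear down the
 *   state of the connection that replaced it.
 */
static bool parent_notify(parent_state_t *p, conn_event_t ev, uint64_t epoch)
{
    bool accepted = false;

    assert(p != NULL);

    pthread_mutex_lock(&p->lock);
    if (ev == EV_CONNECT && epoch > p->epoch) {
        p->epoch = epoch;
        p->connected = true;
        accepted = true;
    } else if (ev == EV_DISCONNECT && epoch >= p->epoch) {
        p->epoch = epoch;
        p->connected = false;
        accepted = true;
    }
    pthread_mutex_unlock(&p->lock);

    return accepted;
}
```

With this sketch, delivering CONNECT(epoch 2) before a delayed DISCONNECT(epoch 1) leaves the parent connected, because the stale DISCONNECT is ignored. A DISCONNECT that overtakes the CONNECT of its own epoch also wins, because the late CONNECT no longer starts a newer epoch.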

Comment 5 Shyamsundar 2017-05-30 18:45:26 UTC

This bug is getting closed because a release has been made available that should address the reported issue. In case the problem is still not fixed with glusterfs-3.11.0, please open a new bug report.

glusterfs-3.11.0 has been announced on the Gluster mailing lists [1]; packages for several distributions should become available in the near future. Keep an eye on the Gluster Users mailing list [2] and the update infrastructure for your distribution.

[1] http://lists.gluster.org/pipermail/announce/2017-May/000073.html
[2] https://www.gluster.org/pipermail/gluster-users/

Comment 6 Amar Tumballi 2018-10-08 17:15:39 UTC

*** Bug 1521034 has been marked as a duplicate of this bug. ***