Bug 1388323 - fuse mount point not accessible
Summary: fuse mount point not accessible
Keywords:
Status: CLOSED CURRENTRELEASE
Alias: None
Product: GlusterFS
Classification: Community
Component: rpc
Version: 3.9
Hardware: All
OS: Linux
Priority: unspecified
Severity: high
Target Milestone: ---
Assignee: rjoseph
QA Contact:
URL:
Whiteboard:
Depends On: 1385605 1386626 1401534
Blocks:
 
Reported: 2016-10-25 06:46 UTC by Pranith Kumar K
Modified: 2017-03-08 10:23 UTC
6 users

Fixed In Version: glusterfs-3.9.1
Clone Of: 1386626
Environment:
Last Closed: 2017-03-08 10:23:21 UTC
Regression: ---
Mount Type: ---
Documentation: ---
CRM:
Verified Versions:
Embargoed:



Comment 1 Pranith Kumar K 2016-10-25 06:46:48 UTC
Here are the logs which confirmed the race:

Connection starts here:
==============================================================================================================================================================
[2016-10-17 07:07:05.164177] I [MSGID: 114020] [client.c:2356:notify] 4-mdcache-client-1: parent translators are ready, attempting connect on transport
[2016-10-17 07:07:05.170943] I [rpc-clnt.c:1947:rpc_clnt_reconfig] 4-mdcache-client-1: changing port to 49152 (from 0)
[2016-10-17 07:07:05.176348] I [MSGID: 114057] [client-handshake.c:1446:select_server_supported_programs] 4-mdcache-client-1: Using Program GlusterFS 3.3, Num (1298437), Version (330)
[2016-10-17 07:07:05.177828] I [MSGID: 114046] [client-handshake.c:1222:client_setvolume_cbk] 4-mdcache-client-1: Connected to mdcache-client-1, attached to remote volume '/bricks/brick0/mdcache'.
[2016-10-17 07:07:05.177849] I [MSGID: 114047] [client-handshake.c:1233:client_setvolume_cbk] 4-mdcache-client-1: Server and Client lk-version numbers are not same, reopening the fds
[2016-10-17 07:07:05.178187] I [MSGID: 114035] [client-handshake.c:201:client_set_lk_version_cbk] 4-mdcache-client-1: Server lk version = 1
[2016-10-17 07:39:41.137951] W [rpc-clnt.c:717:rpc_clnt_handle_cbk] 4-mdcache-client-1: RPC call decoding failed

Seems like the connection went down here:
==============================================================================================================================================================
[2016-10-17 07:39:41.138350] E [MSGID: 114031] [client-rpc-fops.c:1654:client3_3_entrylk_cbk] 4-mdcache-client-1: remote operation failed [Transport endpoint is not connected]
[2016-10-17 07:39:41.138564] W [MSGID: 114031] [client-rpc-fops.c:1102:client3_3_getxattr_cbk] 4-mdcache-client-1: remote operation failed. Path: <gfid:00000000-0000-0000-0000-000000000001> (00000000-0000-0000-0000-000000000001). Key: trusted.glusterfs.pathinfo [Transport endpoint is not connected]
[2016-10-17 07:39:41.138720] W [MSGID: 114031] [client-rpc-fops.c:2648:client3_3_readdirp_cbk] 4-mdcache-client-1: remote operation failed [Transport endpoint is not connected]
[2016-10-17 07:39:41.142014] I [socket.c:3391:socket_submit_request] 4-mdcache-client-1: not connected (priv->connected = 0)
.....
We see logs of the new connection here:
==============================================================================================================================================================
[2016-10-17 07:39:41.145226] I [rpc-clnt.c:1947:rpc_clnt_reconfig] 4-mdcache-client-1: changing port to 49152 (from 0)
[2016-10-17 07:39:41.152533] I [MSGID: 114057] [client-handshake.c:1446:select_server_supported_programs] 4-mdcache-client-1: Using Program GlusterFS 3.3, Num (1298437), Version (330)
[2016-10-17 07:39:41.152761] W [MSGID: 114031] [client-rpc-fops.c:2938:client3_3_lookup_cbk] 4-mdcache-client-1: remote operation failed. Path: / (00000000-0000-0000-0000-000000000001) [Transport endpoint is not connected]
[2016-10-17 07:39:41.154742] W [MSGID: 114031] [client-rpc-fops.c:2648:client3_3_readdirp_cbk] 4-mdcache-client-1: remote operation failed [Transport endpoint is not connected]
[2016-10-17 07:39:41.155498] I [MSGID: 114046] [client-handshake.c:1222:client_setvolume_cbk] 4-mdcache-client-1: Connected to mdcache-client-1, attached to remote volume '/bricks/brick0/mdcache'.
[2016-10-17 07:39:41.155531] I [MSGID: 114047] [client-handshake.c:1233:client_setvolume_cbk] 4-mdcache-client-1: Server and Client lk-version numbers are not same, reopening the fds   
[2016-10-17 07:39:41.155546] I [MSGID: 114042] [client-handshake.c:1053:client_post_handshake] 4-mdcache-client-1: 22 fds open - Delaying child_up until they are re-opened
[2016-10-17 07:39:41.156306] I [MSGID: 114060] [client-handshake.c:817:client3_3_reopendir_cbk] 4-mdcache-client-1: reopendir on <gfid:00000000-0000-0000-0000-000000000001> succeeded (fd = 0)
[2016-10-17 07:39:41.156467] I [MSGID: 114060] [client-handshake.c:817:client3_3_reopendir_cbk] 4-mdcache-client-1: reopendir on <gfid:00000000-0000-0000-0000-000000000001> succeeded (fd = 1)
....
[2016-10-17 07:39:41.161125] I [MSGID: 114041] [client-handshake.c:675:client_child_up_reopen_done] 4-mdcache-client-1: last fd open'd/lock-self-heal'd - notifying CHILD-UP
[2016-10-17 07:39:41.161445] I [MSGID: 114035] [client-handshake.c:201:client_set_lk_version_cbk] 4-mdcache-client-1: Server lk version = 1

We see the disconnect notification for the first connection here:
==============================================================================================================================================================
[2016-10-17 07:39:41.167534] W [MSGID: 114031] [client-rpc-fops.c:2648:client3_3_readdirp_cbk] 4-mdcache-client-1: remote operation failed [Transport endpoint is not connected]
The message "W [MSGID: 114031] [client-rpc-fops.c:2648:client3_3_readdirp_cbk] 4-mdcache-client-1: remote operation failed [Transport endpoint is not connected]" repeated 20 times between [2016-10-17 07:39:41.167534] and [2016-10-17 07:39:41.345856]
[2016-10-17 07:39:41.352243] I [MSGID: 114018] [client.c:2280:client_rpc_notify] 4-mdcache-client-1: disconnected from mdcache-client-1. Client process will keep trying to connect to glusterd until brick's port is available
[2016-10-17 07:39:41.354562] W [MSGID: 114031] [client-rpc-fops.c:630:client3_3_unlink_cbk] 4-mdcache-client-1: remote operation failed [Transport endpoint is not connected]
[2016-10-17 07:39:41.355605] E [MSGID: 114031] [client-rpc-fops.c:1766:client3_3_xattrop_cbk] 4-mdcache-client-1: remote operation failed. Path: / (00000000-0000-0000-0000-000000000001)
The message "E [MSGID: 114031] [client-rpc-fops.c:1654:client3_3_entrylk_cbk] 4-mdcache-client-1: remote operation failed [Transport endpoint is not connected]" repeated 2 times between [2016-10-17 07:39:41.138350] and [2016-10-17 07:39:41.356994]

So what ends up happening is that even though there is a good connection, the client xlator thinks it is disconnected, because the disconnect notification for the old connection arrived after the connect notification of the newer one.
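
To illustrate the above with a minimal, hypothetical C sketch (this is not the actual client xlator code; client_state and client_notify are made-up stand-ins for the real structures and for client_rpc_notify): the notify path only tracks a boolean flag and has no way to tell which underlying socket an event belongs to, so applying the old connection's DISCONNECT after the new connection's CONNECT leaves a live connection marked offline.

#include <stdio.h>
#include <stdbool.h>

typedef enum { EV_CONNECT, EV_DISCONNECT } event_t;

/* What protocol/client tracks: just a flag, with no record of which
 * underlying socket a notification refers to. */
struct client_state {
    bool connected;
};

/* Stand-in for the notify callback: it applies whatever event arrives. */
static void client_notify(struct client_state *st, event_t ev)
{
    st->connected = (ev == EV_CONNECT);
}

int main(void)
{
    struct client_state st = { .connected = true };

    /* Order in which the notifications reached protocol/client in the
     * logs above: the new connection's CONNECT first ... */
    client_notify(&st, EV_CONNECT);
    /* ... then the stale DISCONNECT that belongs to the OLD socket. */
    client_notify(&st, EV_DISCONNECT);

    /* The socket is actually up, but the xlator now believes otherwise. */
    printf("connected = %d\n", st.connected);
    return 0;
}

Run, the sketch ends with connected = 0 even though the simulated socket is up, which matches the state the mount ended up in.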

Pranith.

Comment 2 Worker Ant 2016-10-25 06:47:38 UTC
REVIEW: http://review.gluster.org/15717 (rpc: Fix the race between notification and reconnection) posted (#1) for review on release-3.9 by Pranith Kumar Karampuri (pkarampu)

Comment 3 Worker Ant 2016-10-25 11:00:49 UTC
COMMIT: http://review.gluster.org/15717 committed in release-3.9 by Pranith Kumar Karampuri (pkarampu) 
------
commit e6c38ae1d3f3c53f8739ab2db7c4ecfdbc58fc44
Author: Pranith Kumar K <pkarampu>
Date:   Wed Oct 19 15:50:50 2016 +0530

    rpc: Fix the race between notification and reconnection
    
    Problem:
    There was a hang because unlock on an entry failed with
    ENOTCONN.
    Client thinks the connection is down, whereas the server thinks
    the connection is up.
    
    This is the race we are seeing:
    1) Connection from client to the brick disconnects.
    2) Saved frames unwind is called which unwinds all
       frames that were wound before disconnect.
    3) connection from client to the brick is re-established and
       setvolume completes.
    4) Disconnect notification for the connection in 1)
       comes now and calls client_rpc_notify() which
       marks the connection to be offline even when the
       connection is up.
    
    This is happening because I/O can retrigger connection
    before disconnect notification is sent to the higher
    layers in rpc.
    
    Fix:
    Notify the higher layers that a disconnect happened and then
    go ahead with reconnect logic.
    
    For the logs which point to the information above check:
    https://bugzilla.redhat.com/show_bug.cgi?id=1386626#c1
    
    Thanks to Raghavendra G for suggesting the correct fix.
    
     >BUG: 1386626
     >Change-Id: I3c84ba1f17010bd69049fa88ec5f0ae431f8cda9
     >Signed-off-by: Pranith Kumar K <pkarampu>
     >Reviewed-on: http://review.gluster.org/15681
     >NetBSD-regression: NetBSD Build System <jenkins.org>
     >Reviewed-by: Niels de Vos <ndevos>
     >CentOS-regression: Gluster Build System <jenkins.org>
     >Smoke: Gluster Build System <jenkins.org>
     >Reviewed-by: Raghavendra G <rgowdapp>
     >(cherry picked from commit a6b63e11b7758cf1bfcb67985e25ec02845f0995)
    
    Change-Id: Ifa721193c26b70e26b47b7698c077da0ad5f2e1d
    BUG: 1388323
    Signed-off-by: Pranith Kumar K <pkarampu>
    Reviewed-on: http://review.gluster.org/15717
    NetBSD-regression: NetBSD Build System <jenkins.org>
    Smoke: Gluster Build System <jenkins.org>
    CentOS-regression: Gluster Build System <jenkins.org>
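
As a rough illustration of the fix described in this commit (a hypothetical sketch with made-up names such as on_transport_disconnect and start_reconnect, not the real rpc-clnt internals), the transport's disconnect handler notifies the higher layers first and only then starts the reconnect logic, so a CONNECT for the new connection can never overtake the DISCONNECT of the old one:

#include <stdio.h>

enum rpc_event { RPC_CLNT_CONNECT, RPC_CLNT_DISCONNECT };

struct rpc_clnt {
    int connected;                      /* rpc-lib's own view            */
    void (*notify)(enum rpc_event ev);  /* callback into protocol/client */
};

/* Simulates the reconnect logic; in the real code this arms a timer and
 * retries the connect asynchronously. */
static void start_reconnect(struct rpc_clnt *rpc)
{
    rpc->connected = 1;
    rpc->notify(RPC_CLNT_CONNECT);
}

/* Transport-level disconnect handler after the fix: notify the higher
 * layers first, and only then go ahead with the reconnect logic. */
static void on_transport_disconnect(struct rpc_clnt *rpc)
{
    rpc->connected = 0;
    rpc->notify(RPC_CLNT_DISCONNECT);
    start_reconnect(rpc);
}

static void client_notify(enum rpc_event ev)
{
    printf("protocol/client sees %s\n",
           ev == RPC_CLNT_CONNECT ? "CONNECT" : "DISCONNECT");
}

int main(void)
{
    struct rpc_clnt rpc = { .connected = 1, .notify = client_notify };
    on_transport_disconnect(&rpc);  /* prints DISCONNECT, then CONNECT */
    return 0;
}

Note that, as the following comments show, the 3.9 patch based on this ordering turned out to introduce another race and was reverted.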

Comment 4 Worker Ant 2016-11-11 06:24:32 UTC
REVIEW: http://review.gluster.org/15825 (Revert "rpc: Fix the race between notification and reconnection") posted (#1) for review on release-3.9 by Pranith Kumar Karampuri (pkarampu)

Comment 5 Worker Ant 2016-11-11 06:27:04 UTC
REVIEW: http://review.gluster.org/15825 (Revert "rpc: Fix the race between notification and reconnection") posted (#2) for review on release-3.9 by Pranith Kumar Karampuri (pkarampu)

Comment 6 Worker Ant 2016-11-11 15:46:46 UTC
REVIEW: http://review.gluster.org/15825 (Revert "rpc: Fix the race between notification and reconnection") posted (#3) for review on release-3.9 by Raghavendra Talur (rtalur)

Comment 7 Worker Ant 2016-11-12 05:06:53 UTC
COMMIT: http://review.gluster.org/15825 committed in release-3.9 by Pranith Kumar Karampuri (pkarampu) 
------
commit 72ce732fa5c9979ea6352ea98451471b721410c1
Author: Pranith Kumar K <pkarampu>
Date:   Fri Nov 11 11:48:51 2016 +0530

    Revert "rpc: Fix the race between notification and reconnection"
    
    This reverts commit e6c38ae1d3f3c53f8739ab2db7c4ecfdbc58fc44.
    
    Mount is intermittently failing with this patch, which means
    this patch introduced another race. So until we fix that, we are
    reverting this patch and making the release.
    
    BUG: 1388323
    Change-Id: I6e110caf38fcc6a309b2abdc864bc4fbdb3a7588
    Signed-off-by: Pranith Kumar K <pkarampu>
    Reviewed-on: http://review.gluster.org/15825
    Reviewed-by: Raghavendra Talur <rtalur>
    Smoke: Gluster Build System <jenkins.org>
    NetBSD-regression: NetBSD Build System <jenkins.org>
    CentOS-regression: Gluster Build System <jenkins.org>

Comment 8 Worker Ant 2016-12-05 14:17:14 UTC
REVIEW: http://review.gluster.org/16026 (rpc: fix for race between rpc and protocol/client) posted (#1) for review on release-3.9 by Rajesh Joseph (rjoseph)

Comment 9 Worker Ant 2016-12-20 17:12:01 UTC
COMMIT: http://review.gluster.org/16026 committed in release-3.9 by Raghavendra Talur (rtalur) 
------
commit 63152c4037218ca9b89282547d6fe38399fb111a
Author: Rajesh Joseph <rjoseph>
Date:   Sat Dec 3 01:10:51 2016 +0530

    rpc: fix for race between rpc and protocol/client
    
    It is possible that the notification thread which notifies the
    protocol/client layer about the disconnection is put to sleep,
    and meanwhile a fuse thread or a timer thread initiates and
    completes a reconnection to the brick. The notification thread
    is then woken up, and the protocol/client layer updates its flags
    to indicate that the network is disconnected. No reconnection is
    initiated, because reconnection is the rpc-lib layer's
    responsibility and its flags indicate that the connection is
    connected.
    
    Fix: Serialize connect and disconnect notify
    
    > Credit: Raghavendra Talur <rtalur>
    > Reviewed-on: http://review.gluster.org/15916
    > Reviewed-by: Raghavendra G <rgowdapp>
    > Smoke: Gluster Build System <jenkins.org>
    > Reviewed-by: Raghavendra Talur <rtalur>
    > NetBSD-regression: NetBSD Build System <jenkins.org>
    > CentOS-regression: Gluster Build System <jenkins.org>
    (cherry picked from commit aa22f24f5db7659387704998ae01520708869873)
    
    Change-Id: I8ff5d1a3283b47f5c26848a42016a40bc34ffc1d
    BUG: 1388323
    Signed-off-by: Rajesh Joseph <rjoseph>
    Reviewed-on: http://review.gluster.org/16026
    Smoke: Gluster Build System <jenkins.org>
    NetBSD-regression: NetBSD Build System <jenkins.org>
    CentOS-regression: Gluster Build System <jenkins.org>
    Reviewed-by: Atin Mukherjee <amukherj>
    Reviewed-by: Raghavendra Talur <rtalur>
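
A rough way to picture the "Serialize connect and disconnect notify" fix (an illustrative sketch only; the names below are made up and the real change lives in the rpc library): if updating the connection state and delivering the matching notification happen inside a single critical section, a reconnect running in a fuse or timer thread can no longer complete in between the old connection's transport disconnect and the delivery of its DISCONNECT notification.

#include <pthread.h>
#include <stdio.h>

/* One lock serializes "update state + deliver notification" for both the
 * disconnect path and the reconnect path. */
struct conn {
    pthread_mutex_t notify_lock;
    int transport_up;   /* rpc-lib's view of the socket              */
    int client_up;      /* protocol/client's view, set when notified */
};

/* Runs in the notification thread when the old connection goes down. */
static void handle_disconnect(struct conn *c)
{
    pthread_mutex_lock(&c->notify_lock);
    c->transport_up = 0;
    c->client_up = 0;          /* DISCONNECT notification delivered here */
    pthread_mutex_unlock(&c->notify_lock);
}

/* Runs in a fuse or timer thread; it cannot interleave with the above. */
static void handle_reconnect(struct conn *c)
{
    pthread_mutex_lock(&c->notify_lock);
    c->transport_up = 1;
    c->client_up = 1;          /* CONNECT notification delivered here */
    pthread_mutex_unlock(&c->notify_lock);
}

int main(void)
{
    struct conn c = { .transport_up = 1, .client_up = 1 };
    pthread_mutex_init(&c.notify_lock, NULL);

    handle_disconnect(&c);     /* old connection's notification */
    handle_reconnect(&c);      /* new connection comes up       */

    printf("client_up=%d transport_up=%d\n", c.client_up, c.transport_up);
    pthread_mutex_destroy(&c.notify_lock);
    return 0;
}

With both paths serialized, the rpc layer and protocol/client always agree on the connection state, which removes the mismatch described in the commit message.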

Comment 10 Kaushal 2017-03-08 10:23:21 UTC
This bug is getting closed because a release has been made available that should address the reported issue. In case the problem is still not fixed with glusterfs-3.9.1, please open a new bug report.

glusterfs-3.9.1 has been announced on the Gluster mailing lists [1], and packages for several distributions should become available in the near future. Keep an eye on the Gluster Users mailing list [2] and the update infrastructure for your distribution.

[1] http://lists.gluster.org/pipermail/gluster-users/2017-January/029725.html
[2] https://www.gluster.org/pipermail/gluster-users/

