Bug 1386626 - fuse mount point not accessible
Summary: fuse mount point not accessible
Keywords:
Status: CLOSED CURRENTRELEASE
Alias: None
Product: GlusterFS
Classification: Community
Component: rpc
Version: mainline
Hardware: All
OS: Linux
unspecified
high
Target Milestone: ---
Assignee: Raghavendra Talur
QA Contact:
URL:
Whiteboard:
Depends On: 1385605
Blocks: 1388323 1401534
TreeView+ depends on / blocked
 
Reported: 2016-10-19 10:29 UTC by Pranith Kumar K
Modified: 2017-03-06 17:30 UTC (History)
7 users (show)

Fixed In Version: glusterfs-3.10.0
Doc Type: If docs needed, set a value
Doc Text:
Clone Of: 1385605
: 1388323 1401534 (view as bug list)
Environment:
Last Closed: 2017-03-06 17:30:20 UTC
Regression: ---
Mount Type: ---
Documentation: ---
CRM:
Verified Versions:
Embargoed:


Attachments (Terms of Use)

Comment 1 Pranith Kumar K 2016-10-19 10:49:14 UTC

Here are the logs which confirmed the race:

Connection starts here:
==============================================================================================================================================================
[2016-10-17 07:07:05.164177] I [MSGID: 114020] [client.c:2356:notify] 4-mdcache-client-1: parent translators are ready, attempting connect on transport
[2016-10-17 07:07:05.170943] I [rpc-clnt.c:1947:rpc_clnt_reconfig] 4-mdcache-client-1: changing port to 49152 (from 0)
[2016-10-17 07:07:05.176348] I [MSGID: 114057] [client-handshake.c:1446:select_server_supported_programs] 4-mdcache-client-1: Using Program GlusterFS 3.3, Num 
(1298437), Version (330)  
[2016-10-17 07:07:05.177828] I [MSGID: 114046] [client-handshake.c:1222:client_setvolume_cbk] 4-mdcache-client-1: Connected to mdcache-client-1, attached to re
mote volume '/bricks/brick0/mdcache'.
[2016-10-17 07:07:05.177849] I [MSGID: 114047] [client-handshake.c:1233:client_setvolume_cbk] 4-mdcache-client-1: Server and Client lk-version numbers are not 
same, reopening the fds   
[2016-10-17 07:07:05.178187] I [MSGID: 114035] [client-handshake.c:201:client_set_lk_version_cbk] 4-mdcache-client-1: Server lk version = 1
[2016-10-17 07:39:41.137951] W [rpc-clnt.c:717:rpc_clnt_handle_cbk] 4-mdcache-client-1: RPC call decoding failed

Seems like connection went down here:
==============================================================================================================================================================
[2016-10-17 07:39:41.138350] E [MSGID: 114031] [client-rpc-fops.c:1654:client3_3_entrylk_cbk] 4-mdcache-client-1: remote operation failed [Transport endpoint i
s not connected]
[2016-10-17 07:39:41.138564] W [MSGID: 114031] [client-rpc-fops.c:1102:client3_3_getxattr_cbk] 4-mdcache-client-1: remote operation failed. Path: <gfid:0000000
0-0000-0000-0000-000000000001> (00000000-0000-0000-0000-000000000001). Key: trusted.glusterfs.pathinfo [Transport endpoint is not connected]
[2016-10-17 07:39:41.138720] W [MSGID: 114031] [client-rpc-fops.c:2648:client3_3_readdirp_cbk] 4-mdcache-client-1: remote operation failed [Transport endpoint 
is not connected]
[2016-10-17 07:39:41.142014] I [socket.c:3391:socket_submit_request] 4-mdcache-client-1: not connected (priv->connected = 0)
.....
We see logs of new connection here:
==============================================================================================================================================================
[2016-10-17 07:39:41.145226] I [rpc-clnt.c:1947:rpc_clnt_reconfig] 4-mdcache-client-1: changing port to 49152 (from 0)
[2016-10-17 07:39:41.152533] I [MSGID: 114057] [client-handshake.c:1446:select_server_supported_programs] 4-mdcache-client-1: Using Program GlusterFS 3.3, Num (1298437), Version (330)
[2016-10-17 07:39:41.152761] W [MSGID: 114031] [client-rpc-fops.c:2938:client3_3_lookup_cbk] 4-mdcache-client-1: remote operation failed. Path: / (00000000-0000-0000-0000-000000000001) [Transport endpoint is not connected]
[2016-10-17 07:39:41.154742] W [MSGID: 114031] [client-rpc-fops.c:2648:client3_3_readdirp_cbk] 4-mdcache-client-1: remote operation failed [Transport endpoint is not connected]
[2016-10-17 07:39:41.155498] I [MSGID: 114046] [client-handshake.c:1222:client_setvolume_cbk] 4-mdcache-client-1: Connected to mdcache-client-1, attached to remote volume '/bricks/brick0/mdcache'.
[2016-10-17 07:39:41.155531] I [MSGID: 114047] [client-handshake.c:1233:client_setvolume_cbk] 4-mdcache-client-1: Server and Client lk-version numbers are not same, reopening the fds   
[2016-10-17 07:39:41.155546] I [MSGID: 114042] [client-handshake.c:1053:client_post_handshake] 4-mdcache-client-1: 22 fds open - Delaying child_up until they are re-opened
[2016-10-17 07:39:41.156306] I [MSGID: 114060] [client-handshake.c:817:client3_3_reopendir_cbk] 4-mdcache-client-1: reopendir on <gfid:00000000-0000-0000-0000-000000000001> succeeded (fd = 0)
[2016-10-17 07:39:41.156467] I [MSGID: 114060] [client-handshake.c:817:client3_3_reopendir_cbk] 4-mdcache-client-1: reopendir on <gfid:00000000-0000-0000-0000-000000000001> succeeded (fd = 1)
....
[2016-10-17 07:39:41.161125] I [MSGID: 114041] [client-handshake.c:675:client_child_up_reopen_done] 4-mdcache-client-1: last fd open'd/lock-self-heal'd - notifying CHILD-UP
[2016-10-17 07:39:41.161445] I [MSGID: 114035] [client-handshake.c:201:client_set_lk_version_cbk] 4-mdcache-client-1: Server lk version = 1

We see the disconnect notification for the first connection here:
==============================================================================================================================================================
[2016-10-17 07:39:41.167534] W [MSGID: 114031] [client-rpc-fops.c:2648:client3_3_readdirp_cbk] 4-mdcache-client-1: remote operation failed [Transport endpoint is not connected]
The message "W [MSGID: 114031] [client-rpc-fops.c:2648:client3_3_readdirp_cbk] 4-mdcache-client-1: remote operation failed [Transport endpoint is not connected]" repeated 20 times between [2016-10-17 07:39:41.167534] and [2016-10-17 07:39:41.345856]
[2016-10-17 07:39:41.352243] I [MSGID: 114018] [client.c:2280:client_rpc_notify] 4-mdcache-client-1: disconnected from mdcache-client-1. Client process will keep trying to connect to glusterd until brick's port is available
[2016-10-17 07:39:41.354562] W [MSGID: 114031] [client-rpc-fops.c:630:client3_3_unlink_cbk] 4-mdcache-client-1: remote operation failed [Transport endpoint is not connected]
[2016-10-17 07:39:41.355605] E [MSGID: 114031] [client-rpc-fops.c:1766:client3_3_xattrop_cbk] 4-mdcache-client-1: remote operation failed. Path: / (00000000-0000-0000-0000-000000000001)
The message "E [MSGID: 114031] [client-rpc-fops.c:1654:client3_3_entrylk_cbk] 4-mdcache-client-1: remote operation failed [Transport endpoint is not connected]" repeated 2 times between [2016-10-17 07:39:41.138350] and [2016-10-17 07:39:41.356994]

So what ends up happening is even when there is a good connection, client xlator thinks it is disconnected, because disconnect message came after connect of the newer one.

Pranith.

Comment 2 Worker Ant 2016-10-19 11:44:04 UTC
REVIEW: http://review.gluster.org/15681 (rpc: Fix the race between notification and reconnection) posted (#1) for review on master by Pranith Kumar Karampuri (pkarampu)

Comment 3 Worker Ant 2016-10-20 07:07:04 UTC
REVIEW: http://review.gluster.org/15681 (rpc: Fix the race between notification and reconnection) posted (#2) for review on master by Pranith Kumar Karampuri (pkarampu)

Comment 4 Worker Ant 2016-10-25 06:42:25 UTC
COMMIT: http://review.gluster.org/15681 committed in master by Raghavendra G (rgowdapp) 
------
commit a6b63e11b7758cf1bfcb67985e25ec02845f0995
Author: Pranith Kumar K <pkarampu>
Date:   Wed Oct 19 15:50:50 2016 +0530

    rpc: Fix the race between notification and reconnection
    
    Problem:
    There was a hang because unlock on an entry failed with
    ENOTCONN.
    Client thinks the connection is down where as server thinks
    the connection is up.
    
    This is the race we are seeing:
    1) Connection from client to the brick disconnects.
    2) Saved frames unwind is called which unwinds all
       frames that were wound before disconnect.
    3) connection from client to the brick happens and
       setvolume.
    4) Disconnect notification for the connection in 1)
       comes now and calls client_rpc_notify() which
       marks the connection to be offline even when the
       connection is up.
    
    This is happening because I/O can retrigger connection
    before disconnect notification is sent to the higher
    layers in rpc.
    
    Fix:
    Notify the higher layers that a disconnect happened and then
    go ahead with reconnect logic.
    
    For the logs which point to the information above check:
    https://bugzilla.redhat.com/show_bug.cgi?id=1386626#c1
    
    Thanks to Raghavendra G for suggesting the correct fix.
    
    BUG: 1386626
    Change-Id: I3c84ba1f17010bd69049fa88ec5f0ae431f8cda9
    Signed-off-by: Pranith Kumar K <pkarampu>
    Reviewed-on: http://review.gluster.org/15681
    NetBSD-regression: NetBSD Build System <jenkins.org>
    Reviewed-by: Niels de Vos <ndevos>
    CentOS-regression: Gluster Build System <jenkins.org>
    Smoke: Gluster Build System <jenkins.org>
    Reviewed-by: Raghavendra G <rgowdapp>

Comment 5 Worker Ant 2016-11-13 13:55:10 UTC
REVIEW: http://review.gluster.org/15838 (Revert "rpc: Fix the race between notification and reconnection") posted (#2) for review on master by Pranith Kumar Karampuri (pkarampu)

Comment 6 Worker Ant 2016-11-16 09:25:34 UTC
COMMIT: http://review.gluster.org/15838 committed in master by Pranith Kumar Karampuri (pkarampu) 
------
commit 922af37e33acb59c6f8beca40a9314593b484d1a
Author: Pranith Kumar Karampuri <pkarampu>
Date:   Sun Nov 13 06:53:15 2016 -0700

    Revert "rpc: Fix the race between notification and reconnection"
    
    This reverts commit a6b63e11b7758cf1bfcb67985e25ec02845f0995.
    
    Nithya and Rajesh found that the mount fails sometimes after this patch
    was merged so reverting it.
    
    BUG: 1386626
    Change-Id: I959a5b6c7da61368cf4c67c98193c6e8fdd1755d
    Signed-off-by: Pranith Kumar K <pkarampu>
    Reviewed-on: http://review.gluster.org/15838
    Reviewed-by: N Balachandran <nbalacha>
    NetBSD-regression: NetBSD Build System <jenkins.org>
    CentOS-regression: Gluster Build System <jenkins.org>
    Smoke: Gluster Build System <jenkins.org>

Comment 7 Worker Ant 2016-11-23 07:12:20 UTC
REVIEW: http://review.gluster.org/15911 (rpc: test for race between rpc and p/c) posted (#1) for review on master by Raghavendra Talur (rtalur)

Comment 8 Worker Ant 2016-11-23 10:21:17 UTC
REVIEW: http://review.gluster.org/15911 (rpc: test for race between rpc and p/c) posted (#2) for review on master by Raghavendra Talur (rtalur)

Comment 9 Worker Ant 2016-11-23 11:27:09 UTC
REVIEW: http://review.gluster.org/15911 (rpc: test for race between rpc and p/c) posted (#3) for review on master by Raghavendra Talur (rtalur)

Comment 10 Worker Ant 2016-11-23 11:32:18 UTC
REVIEW: http://review.gluster.org/15915 (rpc: test and fix for race between rpc and p/c) posted (#1) for review on master by Raghavendra Talur (rtalur)

Comment 11 Worker Ant 2016-11-23 11:37:18 UTC
REVIEW: http://review.gluster.org/15911 (rpc: test for race between rpc and p/c) posted (#4) for review on master by Raghavendra Talur (rtalur)

Comment 12 Worker Ant 2016-11-23 12:39:31 UTC
REVIEW: http://review.gluster.org/15916 (rpc: fix for race between rpc and protocol/client) posted (#1) for review on master by Raghavendra Talur (rtalur)

Comment 13 Worker Ant 2016-11-23 12:47:37 UTC
REVIEW: http://review.gluster.org/15916 (rpc: fix for race between rpc and protocol/client) posted (#2) for review on master by Raghavendra Talur (rtalur)

Comment 14 Worker Ant 2016-11-23 13:29:24 UTC
REVIEW: http://review.gluster.org/15916 (rpc: fix for race between rpc and protocol/client) posted (#3) for review on master by Raghavendra Talur (rtalur)

Comment 15 Worker Ant 2016-11-24 04:56:09 UTC
REVIEW: http://review.gluster.org/15916 (rpc: fix for race between rpc and protocol/client) posted (#4) for review on master by Raghavendra Talur (rtalur)

Comment 16 Worker Ant 2016-11-24 08:00:22 UTC
REVIEW: http://review.gluster.org/15916 (rpc: fix for race between rpc and protocol/client) posted (#5) for review on master by Raghavendra Talur (rtalur)

Comment 17 Worker Ant 2016-11-24 08:47:25 UTC
REVIEW: http://review.gluster.org/15916 (rpc: fix for race between rpc and protocol/client) posted (#6) for review on master by Raghavendra Talur (rtalur)

Comment 18 Worker Ant 2016-11-24 13:02:18 UTC
REVIEW: http://review.gluster.org/15916 (rpc: fix for race between rpc and protocol/client) posted (#7) for review on master by Raghavendra Talur (rtalur)

Comment 19 Worker Ant 2016-11-25 03:44:35 UTC
REVIEW: http://review.gluster.org/15916 (rpc: fix for race between rpc and protocol/client) posted (#8) for review on master by Raghavendra Talur (rtalur)

Comment 20 Worker Ant 2016-11-25 07:50:36 UTC
REVIEW: http://review.gluster.org/15916 (rpc: fix for race between rpc and protocol/client) posted (#9) for review on master by Raghavendra Talur (rtalur)

Comment 21 Worker Ant 2016-11-25 11:46:47 UTC
REVIEW: http://review.gluster.org/15916 (rpc: fix for race between rpc and protocol/client) posted (#10) for review on master by Raghavendra Talur (rtalur)

Comment 22 Worker Ant 2016-11-25 13:00:00 UTC
REVIEW: http://review.gluster.org/15916 (rpc: fix for race between rpc and protocol/client) posted (#11) for review on master by Raghavendra Talur (rtalur)

Comment 23 Worker Ant 2016-11-27 15:24:43 UTC
REVIEW: http://review.gluster.org/15916 (rpc: fix for race between rpc and protocol/client) posted (#12) for review on master by Raghavendra Talur (rtalur)

Comment 24 Worker Ant 2016-11-28 03:16:11 UTC
REVIEW: http://review.gluster.org/15916 (rpc: fix for race between rpc and protocol/client) posted (#13) for review on master by Raghavendra Talur (rtalur)

Comment 25 Worker Ant 2016-11-28 04:36:19 UTC
REVIEW: http://review.gluster.org/15916 (rpc: fix for race between rpc and protocol/client) posted (#14) for review on master by Raghavendra Talur (rtalur)

Comment 26 Worker Ant 2016-11-28 08:59:08 UTC
REVIEW: http://review.gluster.org/15916 (rpc: fix for race between rpc and protocol/client) posted (#15) for review on master by Raghavendra Talur (rtalur)

Comment 27 Worker Ant 2016-11-28 09:42:52 UTC
REVIEW: http://review.gluster.org/15916 (rpc: fix for race between rpc and protocol/client) posted (#16) for review on master by Raghavendra Talur (rtalur)

Comment 28 Worker Ant 2016-12-02 19:43:40 UTC
REVIEW: http://review.gluster.org/16017 (rpc: Fix race between disconnect and connect notify) posted (#1) for review on master by Rajesh Joseph (rjoseph)

Comment 29 Worker Ant 2016-12-05 09:11:06 UTC
REVIEW: http://review.gluster.org/16017 (rpc: Fix race between disconnect and connect notify) posted (#2) for review on master by Rajesh Joseph (rjoseph)

Comment 30 Worker Ant 2016-12-05 09:23:31 UTC
REVIEW: http://review.gluster.org/15916 (rpc: fix for race between rpc and protocol/client) posted (#17) for review on master by Rajesh Joseph (rjoseph)

Comment 31 Worker Ant 2016-12-05 14:09:51 UTC
COMMIT: http://review.gluster.org/15916 committed in master by Atin Mukherjee (amukherj) 
------
commit aa22f24f5db7659387704998ae01520708869873
Author: Rajesh Joseph <rjoseph>
Date:   Sat Dec 3 01:10:51 2016 +0530

    rpc: fix for race between rpc and protocol/client
    
    It is possible that the notification thread which notifies
    protocol/client layer about the disconnection is put to sleep
    and meanwhile, a fuse thread or a timer thread initiates and
    completes reconnection to the brick. The notification thread
    is then woken up and protocol/client layer updates its flags
    to indicate that network is disconnected. No reconnection is
    initiated because reconnection is rpc-lib layer's responsibility
    and its flags indicate that connection is connected.
    
    Fix: Serialize connect and disconnect notify
    
    Credit: Raghavendra Talur <rtalur>
    Change-Id: I8ff5d1a3283b47f5c26848a42016a40bc34ffc1d
    BUG: 1386626
    Signed-off-by: Rajesh Joseph <rjoseph>
    Reviewed-on: http://review.gluster.org/15916
    Reviewed-by: Raghavendra G <rgowdapp>
    Smoke: Gluster Build System <jenkins.org>
    Reviewed-by: Raghavendra Talur <rtalur>
    NetBSD-regression: NetBSD Build System <jenkins.org>
    CentOS-regression: Gluster Build System <jenkins.org>

Comment 32 Shyamsundar 2017-03-06 17:30:20 UTC
This bug is getting closed because a release has been made available that should address the reported issue. In case the problem is still not fixed with glusterfs-3.10.0, please open a new bug report.

glusterfs-3.10.0 has been announced on the Gluster mailinglists [1], packages for several distributions should become available in the near future. Keep an eye on the Gluster Users mailinglist [2] and the update infrastructure for your distribution.

[1] http://lists.gluster.org/pipermail/gluster-users/2017-February/030119.html
[2] https://www.gluster.org/pipermail/gluster-users/


Note You need to log in before you can comment on or make changes to this bug.