Bug 1462447

Summary: brick maintenance - no client reconnect
Product: [Community] GlusterFS
Reporter: Markus Stockhausen <mst>
Component: rpc
Assignee: Niels de Vos <ndevos>
Status: CLOSED CURRENTRELEASE
Severity: high
Priority: unspecified
Version: 3.8
CC: bugs, ndevos
Hardware: Unspecified
OS: Unspecified
Fixed In Version: glusterfs-3.8.14
Last Closed: 2017-07-15 12:22:59 UTC
Type: Bug
Bug Depends On: 1427012
Bug Blocks: 1464927
Attachments:
- client without reconnect
- debug log

Description Markus Stockhausen 2017-06-17 18:59:12 UTC
Description of problem:

We are using GlusterFS in an oVirt environment. The setup might be a little bit special, but here we go:

- 3 cluster storage nodes that have 2 bricks each to provide a distributed replicated gluster filesystem

- a virtual IP that is managed by keepalived between the storage nodes

- several oVirt Gluster clients that connect to the storage via native GlusterFS FUSE mounts (initially over the virtual IP)
 
- Network infrastructure between all nodes is IPoIB

Version-Release number of selected component (if applicable):

CentOS 7
glusterfs 3.8.12

How reproducible:

40-60%

Steps to Reproduce:
1. restart storage node for maintenance
2. wait for resync to complete

Actual results:
Resync does not finish. New files are continuously added to the list of unsynced files. Looking at the client connections, we see that the clients have not reconnected to the node that was in maintenance.

Expected results:
Client nodes should reconnect.

Comment 1 Markus Stockhausen 2017-06-17 19:03:49 UTC
Status on storage side is:

gluster volume status gluster1 clients | grep -i -E "brick|conn"
Client connections for volume gluster1
Brick : cfilers101:/var/data/brick1/brick
Clients connected : 10
Brick : cfilers201:/var/data/brick2/brick <- was in maintenance
Clients connected : 7
Brick : cfilers201:/var/data/brick1/brick <- was in maintenance
Clients connected : 9
Brick : cfilers301:/var/data/brick2/brick
Clients connected : 10
Brick : cfilers301:/var/data/brick1/brick
Clients connected : 10
Brick : cfilers101:/var/data/brick2/brick
Clients connected : 10

We observed two symptoms:

First: clients try to reconnect and flood the Gluster logs, but cannot reconnect to the restarted storage node. See here for more details: http://lists.gluster.org/pipermail/gluster-users/2017-June/031525.html

Second: clients show no sign of reconnecting; no new entries appear in their logs.

Log of one client attached.

Comment 2 Markus Stockhausen 2017-06-17 19:06:12 UTC
Created attachment 1288635 [details]
client without reconnect

Comment 3 Markus Stockhausen 2017-06-17 19:09:46 UTC
# yum list installed ...
glusterfs.x86_64                       3.8.12-1.el7               @ovirt-4.1-centos-gluster38
glusterfs-api.x86_64                   3.8.12-1.el7               @ovirt-4.1-centos-gluster38
glusterfs-cli.x86_64                   3.8.12-1.el7               @ovirt-4.1-centos-gluster38
glusterfs-client-xlators.x86_64        3.8.12-1.el7               @ovirt-4.1-centos-gluster38
glusterfs-fuse.x86_64                  3.8.12-1.el7               @ovirt-4.1-centos-gluster38
glusterfs-libs.x86_64                  3.8.12-1.el7               @ovirt-4.1-centos-gluster38
kernel.x86_64                          3.10.0-514.el7             @anaconda
kernel.x86_64                          3.10.0-514.16.1.el7        @updates
kernel.x86_64                          3.10.0-514.21.1.el7        @updates

Comment 4 Markus Stockhausen 2017-06-17 19:20:19 UTC
Might this be due to BZ 1428070, which is missing from the 3.8 release?

Comment 5 Markus Stockhausen 2017-06-18 07:38:02 UTC
I ran several tests and can confirm that restarting cluster brick nodes leaves clients permanently disconnected in ~50% of my tests. Attached is a client debug log that hopefully shows the error in detail.

Additionally, I'm raising the severity to high, as this makes our oVirt installation with Gluster nearly unusable.

Comment 6 Markus Stockhausen 2017-06-18 07:38:36 UTC
Created attachment 1288829 [details]
debug log

Comment 7 Markus Stockhausen 2017-06-22 19:37:48 UTC
I did some further tests.

BZ 1428070 does NOT fix the issue; the situation described there is never hit.

After adding some further log messages to the code, I would say that the patch "rpc/clnt: remove locks while notifying CONNECT/DISCONNECT" from BZ 1427012 should be the solution to the problem.

It was included in 3.10.1 but never backported to 3.8. Maybe one of the developers can tell whether it will integrate cleanly.

Comment 8 Worker Ant 2017-07-10 07:51:42 UTC
REVIEW: https://review.gluster.org/17733 (rpc/clnt: remove locks while notifying CONNECT/DISCONNECT) posted (#1) for review on release-3.8 by Niels de Vos (ndevos)

Comment 9 Worker Ant 2017-07-11 08:27:38 UTC
REVIEW: https://review.gluster.org/17743 (rpc: bump up conn->cleanup_gen in rpc_clnt_reconnect_cleanup) posted (#1) for review on release-3.8 by Niels de Vos (ndevos)

Comment 10 Worker Ant 2017-07-11 08:28:14 UTC
REVIEW: https://review.gluster.org/17743 (rpc: bump up conn->cleanup_gen in rpc_clnt_reconnect_cleanup) posted (#2) for review on release-3.8 by Niels de Vos (ndevos)

Comment 11 Worker Ant 2017-07-11 13:25:35 UTC
COMMIT: https://review.gluster.org/17733 committed in release-3.8 by Niels de Vos (ndevos) 
------
commit 9d5328e0e07353501392b33e9553300bd34a32ad
Author: Niels de Vos <ndevos>
Date:   Mon Jul 10 09:45:02 2017 +0200

    rpc/clnt: remove locks while notifying CONNECT/DISCONNECT
    
    Locking during notify was introduced as part of commit
    aa22f24f5db7659387704998ae01520708869873 [1]. That fix was meant to
    prevent out-of-order CONNECT/DISCONNECT events from rpc-clnt to
    parent xlators [2]. However, as part of handling DISCONNECT,
    protocol/client unwinds saved frames (with failure) that are waiting
    for responses. This saved_frames_unwind can be a costly operation
    and hence ideally shouldn't be included in the critical section of
    notifylock, as it unnecessarily delays the reconnection to the same
    brick. Also, it is not good practice to pass control to other
    xlators while holding a lock, as it can lead to deadlocks. So, this
    patch removes locking in rpc-clnt while notifying parent xlators.
    
    To fix [2], two changes are present in this patch:
    
    * notify DISCONNECT before cleaning up rpc connection (same as commit
      a6b63e11b7758cf1bfcb6798, patch [3]).
    * protocol/client uses rpc_clnt_cleanup_and_start, which cleans up the
      rpc connection and does a start while handling a DISCONNECT event
      from rpc. Note that patch [3] was reverted because rpc_clnt_start,
      called in the quick_reconnect path of protocol/client, didn't invoke
      connect on the transport as the connection was not cleaned up _yet_
      (cleanup had been moved to after the notification in rpc-clnt). This
      resulted in clients never attempting to connect to bricks.
    
    Note that one of the neater ways to fix [2] (without using locks) is
    to introduce generation numbers to map CONNECTs and DISCONNECTs
    across epochs and ignore DISCONNECT events if they don't belong to
    the current epoch. However, this approach is a bit complex to
    implement and requires time. So, the current patch is a hacky
    stop-gap fix until we come up with a cleaner solution.
    
    [1] http://review.gluster.org/15916
    [2] https://bugzilla.redhat.com/show_bug.cgi?id=1386626
    [3] http://review.gluster.org/15681
    
    Cherry picked from commit 773f32caf190af4ee48818279b6e6d3c9f2ecc79:
    > Change-Id: I62daeee8bb1430004e28558f6eb133efd4ccf418
    > Signed-off-by: Raghavendra G <rgowdapp>
    > BUG: 1427012
    > Reviewed-on: https://review.gluster.org/16784
    > Smoke: Gluster Build System <jenkins.org>
    > Reviewed-by: Milind Changire <mchangir>
    > NetBSD-regression: NetBSD Build System <jenkins.org>
    > CentOS-regression: Gluster Build System <jenkins.org>
    
    Change-Id: I62daeee8bb1430004e28558f6eb133efd4ccf418
    Reported-by: Markus Stockhausen <mst>
    Signed-off-by: Niels de Vos <ndevos>
    BUG: 1462447
    Reviewed-on: https://review.gluster.org/17733
    Smoke: Gluster Build System <jenkins.org>
    CentOS-regression: Gluster Build System <jenkins.org>
    Reviewed-by: Milind Changire <mchangir>
    Reviewed-by: Raghavendra G <rgowdapp>
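
[Editorial note] To make the pattern above easier to follow, here is a minimal, self-contained C sketch of what the commit describes. It is not the actual rpc-clnt code; names such as conn_t, conn_on_disconnect and parent_notify are invented for illustration. The connection state is updated while the lock is held, but the DISCONNECT notification to the parent layer is delivered only after the lock is released, so an expensive handler (such as unwinding saved frames) can no longer delay reconnection or re-enter the rpc layer under the lock.

/* Hypothetical sketch, not the actual GlusterFS rpc-clnt code.
 * Build with: gcc -pthread sketch.c */
#include <pthread.h>
#include <stdio.h>

typedef enum { CONN_UP, CONN_DOWN } conn_state_t;

typedef void (*conn_notify_fn)(void *owner, conn_state_t event);

typedef struct {
    pthread_mutex_t lock;   /* protects the connection state only */
    conn_state_t    state;
    conn_notify_fn  notify; /* callback into the parent/owner layer */
    void           *owner;
} conn_t;

/* Handle a transport disconnect. */
static void conn_on_disconnect(conn_t *conn)
{
    conn_notify_fn notify = NULL;
    void *owner = NULL;

    /* Critical section: only flip internal state and copy out what is
     * needed for the notification; do not call into other layers here. */
    pthread_mutex_lock(&conn->lock);
    conn->state = CONN_DOWN;
    notify = conn->notify;
    owner = conn->owner;
    pthread_mutex_unlock(&conn->lock);

    /* Notify the parent layer outside the lock. Even if the handler is
     * expensive, it cannot block reconnection while the lock is held. */
    if (notify)
        notify(owner, CONN_DOWN);
}

/* Example parent-layer handler standing in for protocol/client. */
static void parent_notify(void *owner, conn_state_t event)
{
    (void)owner;
    printf("parent layer got %s\n",
           event == CONN_DOWN ? "DISCONNECT" : "CONNECT");
    /* ... fail pending frames, then trigger cleanup and reconnect ... */
}

int main(void)
{
    conn_t conn = { .state = CONN_UP, .notify = parent_notify, .owner = NULL };

    pthread_mutex_init(&conn.lock, NULL);
    conn_on_disconnect(&conn);   /* simulate the transport going away */
    pthread_mutex_destroy(&conn.lock);
    return 0;
}

Running the sketch simply prints the DISCONNECT notification; the point is that it is emitted after the mutex has already been released.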

Comment 12 Worker Ant 2017-07-11 13:25:44 UTC
COMMIT: https://review.gluster.org/17743 committed in release-3.8 by Niels de Vos (ndevos) 
------
commit 96e7452155c3fd1b160a97d88c03f1bf31b0be97
Author: Atin Mukherjee <amukherj>
Date:   Sat Mar 18 16:29:10 2017 +0530

    rpc: bump up conn->cleanup_gen in rpc_clnt_reconnect_cleanup
    
    Commit 086436a introduced a generation number (cleanup_gen) to ensure
    that the rpc layer doesn't end up cleaning up the connection object
    if the application layer has already destroyed it. Bumping up
    cleanup_gen was done only in rpc_clnt_connection_cleanup(). However,
    the same is needed in rpc_clnt_reconnect_cleanup() too; without it,
    if the object gets destroyed through the reconnect event in the
    application layer, the rpc layer will still try to delete the
    object, resulting in a double free and a crash.
    
    Peer probing an invalid host/IP was the basic test to catch this issue.
    
    Cherry picked from commit 39e09ad1e0e93f08153688c31433c38529f93716:
    > Change-Id: Id5332f3239cb324cead34eb51cf73d426733bd46
    > BUG: 1433578
    > Signed-off-by: Atin Mukherjee <amukherj>
    > Reviewed-on: https://review.gluster.org/16914
    > Smoke: Gluster Build System <jenkins.org>
    > NetBSD-regression: NetBSD Build System <jenkins.org>
    > Reviewed-by: Milind Changire <mchangir>
    > CentOS-regression: Gluster Build System <jenkins.org>
    > Reviewed-by: Jeff Darcy <jeff.us>
    
    Change-Id: Id5332f3239cb324cead34eb51cf73d426733bd46
    BUG: 1462447
    Signed-off-by: Niels de Vos <ndevos>
    Reviewed-on: https://review.gluster.org/17743
    Smoke: Gluster Build System <jenkins.org>
    Reviewed-by: Milind Changire <mchangir>
    CentOS-regression: Gluster Build System <jenkins.org>
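
[Editorial note] The generation guard this commit describes can be illustrated with a minimal C sketch. It is not the actual GlusterFS code; names like conn_t, conn_cleanup and deferred_cleanup are invented for illustration. Every cleanup path bumps cleanup_gen, and a deferred cleanup only acts if the generation it captured when it was scheduled is still current, which is what prevents the double free described above.

/* Hypothetical sketch, not the actual GlusterFS rpc code. */
#include <stdint.h>
#include <stdio.h>

typedef struct {
    uint64_t cleanup_gen;   /* bumped by every cleanup path */
} conn_t;

/* Any path that tears down connection resources must bump the
 * generation; the bug was that one path (reconnect cleanup) did not. */
static void conn_cleanup(conn_t *conn)
{
    /* ... release connection resources ... */
    conn->cleanup_gen++;
    printf("cleanup ran, generation is now %llu\n",
           (unsigned long long)conn->cleanup_gen);
}

/* A deferred cleanup (e.g. fired from a timer) remembers the generation
 * it was scheduled under and only acts if nothing ran in between. */
static void deferred_cleanup(conn_t *conn, uint64_t scheduled_gen)
{
    if (conn->cleanup_gen != scheduled_gen) {
        printf("stale cleanup (gen %llu != %llu), skipping\n",
               (unsigned long long)scheduled_gen,
               (unsigned long long)conn->cleanup_gen);
        return; /* someone else already cleaned up: avoid a double free */
    }
    conn_cleanup(conn);
}

int main(void)
{
    conn_t conn = { .cleanup_gen = 0 };

    /* A deferred cleanup gets scheduled at generation 0 ... */
    uint64_t scheduled_gen = conn.cleanup_gen;

    /* ... but another layer cleans up first and bumps the generation. */
    conn_cleanup(&conn);

    /* When the deferred cleanup finally fires, it detects it is stale. */
    deferred_cleanup(&conn, scheduled_gen);
    return 0;
}

Running the sketch shows the deferred cleanup detecting that it is stale and skipping the second teardown.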

Comment 13 Niels de Vos 2017-07-15 12:22:59 UTC
This bug is being closed because a release has been made available that should address the reported issue. If the problem is still not fixed with glusterfs-3.8.14, please open a new bug report.

glusterfs-3.8.14 has been announced on the Gluster mailing lists [1]; packages for several distributions should become available in the near future. Keep an eye on the Gluster Users mailing list [2] and on the update infrastructure for your distribution.

[1] http://lists.gluster.org/pipermail/announce/2017-July/000077.html
[2] https://www.gluster.org/pipermail/gluster-users/