Bug 1325491

Summary: Daemons cannot connect to GlusterD when management encryption is enabled
Product: [Community] GlusterFS
Component: rpc
Reporter: Kaushal <kaushal>
Assignee: Kaushal <kaushal>
Status: CLOSED CURRENTRELEASE
Severity: unspecified
Priority: unspecified
Version: 3.7.10
CC: bugs
Target Milestone: ---
Target Release: ---
Hardware: Unspecified
OS: Unspecified
Fixed In Version: glusterfs-3.7.11
Doc Type: Bug Fix
Clones: 1325492 (view as bug list)
Last Closed: 2016-04-19 07:13:37 UTC
Type: Bug
Bug Depends On: 1325492
Bug Blocks: 1324405

Description Kaushal 2016-04-09 07:14:40 UTC
..which causes them to fail to fetch their volfiles and not start.

Comment 1 Vijay Bellur 2016-04-09 08:01:17 UTC
REVIEW: http://review.gluster.org/13931 (socket: Don't cleanup encrypted transport in socket_connect()) posted (#2) for review on release-3.7 by Kaushal M (kaushal)

Comment 2 Vijay Bellur 2016-04-10 04:55:08 UTC
COMMIT: http://review.gluster.org/13931 committed in release-3.7 by Kaushal M (kaushal) 
------
commit 6a1d6da4588726ea0e1d0b0b6eb204a9d829db19
Author: Kaushal M <kaushal>
Date:   Thu Apr 7 20:21:18 2016 +0530

    socket: Don't cleanup encrypted transport in socket_connect()
    
    ..instead, clean up only in socket_poller()
    
    Backport of be99ddd from master
    
    With commit d117466, socket_poller() wasn't launched from
    socket_connect() (for encrypted connections) if connect() failed. This
    was done to prevent the socket private data from being unreffed twice,
    by the cleanups in both socket_poller() and socket_connect(). This
    allowed future reconnects to happen successfully.
    
    Whether a socket reconnects is decided by the registered rpc notify
    function. The above change worked with glusterd, as the glusterd rpc
    notify function (glusterd_peer_rpc_notify()) continuously allowed
    reconnects on failure.
    
    mgmt_rpc_notify(), the rpc notify function in glusterfsd, behaves
    differently.
    
    For a DISCONNECT event, if more volfile servers are available, or if
    more addresses are available in the dns cache, it allows reconnects. If
    not, it terminates the program.
    
    For a CONNECT event, it attempts a volfile fetch rpc request. If sending
    this rpc fails, it immediately terminates the program.
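
A minimal sketch of that decision flow, assuming hypothetical helper names (illustrative stand-ins, not glusterfsd functions):

    /* Minimal sketch of the decision flow described above; every helper
     * here is a hypothetical stand-in, not an actual glusterfsd function. */
    #include <stdbool.h>
    #include <stdio.h>
    #include <stdlib.h>

    typedef enum { RPC_CONNECT, RPC_DISCONNECT } rpc_event_t;

    /* Stand-ins for glusterfsd's volfile-server and DNS-cache state. */
    static bool more_volfile_servers_left(void)  { return false; }
    static bool more_dns_addresses_cached(void)  { return false; }
    static bool send_volfile_fetch_request(void) { return false; }
    static void reconnect(void)                  { puts("reconnecting"); }

    static void mgmt_notify(rpc_event_t event)
    {
        switch (event) {
        case RPC_DISCONNECT:
            /* Keep trying while another volfile server or another resolved
             * address is still available; otherwise give up and exit. */
            if (more_volfile_servers_left() || more_dns_addresses_cached())
                reconnect();
            else
                exit(EXIT_FAILURE);
            break;
        case RPC_CONNECT:
            /* A connected transport immediately requests the volfile.  If
             * even sending the request fails, the daemon terminates. */
            if (!send_volfile_fetch_request())
                exit(EXIT_FAILURE);
            break;
        }
    }

    int main(void)
    {
        /* With the bug described below, a failed encrypted connection was
         * reported as CONNECT, so the fetch failed and the daemon exited. */
        mgmt_notify(RPC_CONNECT);
        return 0;
    }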
    
    One side effect of commit d117466 was that the encrypted socket was
    unintentionally registered with epoll on a connect failure. A weird
    thing happens because of this: the epoll notifier notifies
    mgmt_rpc_notify() of a CONNECT event, instead of a DISCONNECT as
    expected. This causes mgmt_rpc_notify() to attempt an unsuccessful
    volfile fetch rpc request, and terminate.
    (I still don't know why epoll raises the CONNECT event.)
    
    Commit 46bd29e fixed some issues with IPv6 in GlusterFS. This caused
    address resolution in GlusterFS to also request IPv6 addresses
    (AF_UNSPEC) instead of just IPv4. On most systems, this causes the IPv6
    addresses to be returned first.
    
    GlusterD listens on 0.0.0.0:24007 by default. While this attaches to all
    interfaces, it only listens on IPv4 addresses. GlusterFS daemons and
    bricks are given 'localhost' as the volfile server. This resolves to
    '::1' as the first address.
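
The address ordering is easy to observe with a small getaddrinfo() probe; the exact output depends on the local resolver configuration, but on most systems ::1 is listed first for 'localhost':

    #include <stdio.h>
    #include <string.h>
    #include <netdb.h>
    #include <netinet/in.h>
    #include <arpa/inet.h>
    #include <sys/socket.h>

    int main(void)
    {
        struct addrinfo hints, *res, *p;
        char buf[INET6_ADDRSTRLEN];
        int rc;

        memset(&hints, 0, sizeof(hints));
        hints.ai_family   = AF_UNSPEC;   /* IPv4 and IPv6, as after commit 46bd29e */
        hints.ai_socktype = SOCK_STREAM;

        rc = getaddrinfo("localhost", "24007", &hints, &res);
        if (rc != 0) {
            fprintf(stderr, "getaddrinfo: %s\n", gai_strerror(rc));
            return 1;
        }

        /* A client that tries these addresses in order connects to
         * [::1]:24007 first, which a server bound only to 0.0.0.0:24007
         * (IPv4) will never accept. */
        for (p = res; p != NULL; p = p->ai_next) {
            const void *addr = (p->ai_family == AF_INET)
                ? (const void *)&((struct sockaddr_in *)p->ai_addr)->sin_addr
                : (const void *)&((struct sockaddr_in6 *)p->ai_addr)->sin6_addr;
            printf("%s\n", inet_ntop(p->ai_family, addr, buf, sizeof(buf)));
        }

        freeaddrinfo(res);
        return 0;
    }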
    
    When using management encryption, the above reasons cause the daemon
    processes to fail to fetch volfiles and terminate.
    
    Solution
    --------
    The solution to this is simple. Instead of cleaning up the encrypted
    socket in socket_connect(), launch socket_poller() and let it clean up
    the socket instead. This prevents the unintentional registration with
    epoll, and socket_poller() sends the correct events to the rpc notify
    functions, which allows proper reconnects to happen.
    
    Change-Id: Idb0c0a828743cccca51cfdd1aa6458cfa0a9d100
    BUG: 1325491
    Signed-off-by: Kaushal M <kaushal>
    Reviewed-on: http://review.gluster.org/13931
    Smoke: Gluster Build System <jenkins.com>
    NetBSD-regression: NetBSD Build System <jenkins.org>
    Tested-by: Gluster Build System <jenkins.com>
    CentOS-regression: Gluster Build System <jenkins.com>
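
For reference, a minimal sketch of the pattern the fix follows; transport_t, notify_fn and the other names are illustrative stand-ins, not the actual rpc-transport/socket code. On a failed connect() of an encrypted transport, socket_connect() no longer frees anything itself: it still starts socket_poller(), which reports a single, proper DISCONNECT and performs the one and only cleanup.

    /* Simplified sketch of the fix (not the actual GlusterFS source). */
    #include <pthread.h>
    #include <stdbool.h>
    #include <stdio.h>
    #include <stdlib.h>

    typedef enum { EVENT_CONNECT, EVENT_DISCONNECT } event_t;

    typedef struct transport {
        int  sock;        /* fd of the connection, -1 when connect() failed */
        bool encrypted;   /* management encryption => own polling thread    */
        void (*notify_fn)(struct transport *t, event_t ev);
    } transport_t;

    static void transport_cleanup(transport_t *t)
    {
        /* The single place where the transport's private data is released. */
        puts("cleaning up transport once");
        free(t);
    }

    static void *socket_poller(void *arg)
    {
        transport_t *t = arg;

        if (t->sock < 0) {
            /* connect() failed earlier: report it as a DISCONNECT, exactly
             * like a connection that died later, then clean up here. */
            t->notify_fn(t, EVENT_DISCONNECT);
            transport_cleanup(t);
            return NULL;
        }

        t->notify_fn(t, EVENT_CONNECT);
        /* ... a real poller would now loop on the socket ... */
        transport_cleanup(t);
        return NULL;
    }

    static int socket_connect(transport_t *t, pthread_t *thr)
    {
        /* Pretend the TCP connect failed (e.g. [::1]:24007 refused). */
        t->sock = -1;

        if (t->encrypted) {
            /* Fix: do NOT clean up or register with epoll here; hand the
             * failed socket to socket_poller() and let it do the cleanup. */
            return pthread_create(thr, NULL, socket_poller, t);
        }

        return -1;   /* non-encrypted transports keep their usual error path */
    }

    static void mgmt_notify(transport_t *t, event_t ev)
    {
        (void)t;
        printf("rpc notify: %s\n", ev == EVENT_CONNECT ? "CONNECT" : "DISCONNECT");
    }

    int main(void)
    {
        transport_t *t = calloc(1, sizeof(*t));
        pthread_t    thr;

        t->encrypted = true;
        t->notify_fn = mgmt_notify;

        if (socket_connect(t, &thr) == 0)
            pthread_join(thr, NULL);
        return 0;
    }

Build with -pthread; the point is simply that a real DISCONNECT (rather than a spurious CONNECT) now reaches the notify function, which can then decide whether to reconnect.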

Comment 3 Kaushal 2016-04-19 07:13:37 UTC
This bug is being closed because a release has been made available that should address the reported issue. If the problem is still not fixed with glusterfs-3.7.11, please open a new bug report.

glusterfs-3.7.11 has been announced on the Gluster mailing lists [1]; packages for several distributions should become available in the near future. Keep an eye on the Gluster Users mailing list [2] and the update infrastructure for your distribution.

[1] https://www.gluster.org/pipermail/gluster-users/2016-April/026321.html
[2] http://thread.gmane.org/gmane.comp.file-systems.gluster.user