Bug 980754

Summary: glusterd: inconsistent results
Product: [Community] GlusterFS Reporter: Kaushal <kaushal>
Component: cli Assignee: Kaushal <kaushal>
Status: CLOSED CURRENTRELEASE QA Contact:
Severity: urgent Docs Contact:
Priority: urgent    
Version: mainline CC: dblack, gluster-bugs, jbyers, joe.lin, nsathyan, rhs-bugs, senaik, surs, vagarwal, vbellur
Target Milestone: ---   
Target Release: ---   
Hardware: x86_64   
OS: Linux   
Whiteboard:
Fixed In Version: glusterfs-3.5.0 Doc Type: Bug Fix
Doc Text:
Story Points: ---
Clone Of: 979861
: 995286 (view as bug list) Environment:
Last Closed: 2014-04-17 11:43:06 UTC Type: Bug
Regression: --- Mount Type: ---
Documentation: --- CRM:
Verified Versions: Category: ---
oVirt Team: --- RHEL 7.3 requirements from Atomic Host:
Cloudforms Team: --- Target Upstream Version:
Embargoed:
Bug Depends On:    
Bug Blocks: 979861, 995286    

Description Kaushal 2013-07-03 07:45:38 UTC
+++ This bug was initially created as a clone of Bug #979861 +++

The `gluster' command reports that glusterd is not operational even though glusterd is alive:

[root@wingo ~]# gluster volume info

No volumes present
Connection failed. Please check if gluster daemon is operational.
[root@wingo ~]# gluster volume status
Connection failed. Please check if gluster daemon is operational.
[root@wingo ~]# gluster peer status
peer status: failed
Connection failed. Please check if gluster daemon is operational.
[root@wingo ~]# pgrep glusterd
2751
[root@wingo ~]# 
[root@wingo ~]# telnet localhost 24007
Trying ::1...
telnet: connect to address ::1: Connection refused
Trying 127.0.0.1...
Connected to localhost.
Escape character is '^]'.
^]

==========================================================================
The setup comprises four machines: tex, mater, van, wingo

I ran volume operations at intervals of about 30 seconds. Approximately 15 set and unset operations on the volume were done.

Uploaded sosreports from the machines. They have hostnames as part of the filename.

--- Additional comment from krishnan parthasarathi on 2013-07-02 16:42:01 IST ---

Root cause:

The bug synopsis has a rather apocalyptic tone for what is being observed :-) What I guess is being called inconsistent is the following message on stderr, which is generally associated with the glusterd service being down:
"Connection failed. Please check if gluster daemon is operational"

The CLI prints that message because it is unable to make RPCs to glusterd. This is because (for reasons that follow) CLI requests are being made from port numbers > 1024, and glusterd 'drops' such requests.
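
As a rough illustration of the check described above, here is a minimal sketch and not glusterd's actual code. The function names and the exact boundary are assumptions; the conventional privileged range on Linux is ports 1 to 1023, and the rejection text is the one glusterd writes to its log.

```python
# Illustrative sketch only: the names below are hypothetical and do not
# mirror glusterd's rpcsvc implementation.
PRIVILEGED_PORT_CEILING = 1024  # ports below this are conventionally privileged


def is_privileged(port: int) -> bool:
    """Return True if the source port lies in the privileged range 1..1023."""
    return 0 < port < PRIVILEGED_PORT_CEILING


def handle_request(source_port: int) -> str:
    """Accept requests from privileged source ports and fail all others."""
    if not is_privileged(source_port):
        return "Request received from non-privileged port. Failing request"
    return "accepted"
```

A CLI that could only obtain an ephemeral source port (for example 49152) would be rejected by such a check even though the daemon is running and reachable.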

How did the system run out of port numbers < 1024?
Executing the gluster CLI in a loop results in an active close of the CLI's TCP connection with glusterd. Actively closed TCP connections go into the TIME_WAIT state. What this means to us is that the port is 'held' by the system for up to 2*MSL (2 mins). At this rate, we pile up TCP connections in TIME_WAIT until no port below 1024 is free, and later CLI requests go out from higher ports.

This is still a serious (transient) resource leak since we would expect monitoring agents to constantly consult glusterd via gluster CLI for volume health and status.
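
One way to observe this pile-up on Linux is to count sockets in TIME_WAIT whose local port is below 1024. The sketch below parses text in the /proc/net/tcp format, where state code 06 means TIME_WAIT and addresses are hex IP:port pairs. The function name and the sample rows are made up for illustration; on a real machine the text would come from reading /proc/net/tcp.

```python
TIME_WAIT = "06"  # hex state code for TIME_WAIT in /proc/net/tcp


def count_time_wait_privileged(proc_net_tcp_text: str, ceiling: int = 1024) -> int:
    """Count TIME_WAIT sockets whose local port is below `ceiling`."""
    count = 0
    for line in proc_net_tcp_text.splitlines()[1:]:  # skip the header row
        fields = line.split()
        if len(fields) < 4:
            continue
        # fields[1] is the local address as HEXIP:HEXPORT, fields[3] is the state
        local_port = int(fields[1].rsplit(":", 1)[1], 16)
        if fields[3] == TIME_WAIT and local_port < ceiling:
            count += 1
    return count


# Fabricated sample: two TIME_WAIT sockets on ports 1023 and 1022, one on an
# ephemeral port, and one listener on 24007 (0x5DC7).
SAMPLE = """\
  sl  local_address rem_address   st tx_queue rx_queue tr tm->when retrnsmt   uid  timeout inode
   0: 0100007F:03FF 0100007F:5DC7 06 00000000:00000000 03:00000A5C 00000000     0        0 0
   1: 0100007F:03FE 0100007F:5DC7 06 00000000:00000000 03:00000A5C 00000000     0        0 0
   2: 0100007F:C350 0100007F:5DC7 06 00000000:00000000 03:00000A5C 00000000     0        0 0
   3: 00000000:5DC7 00000000:0000 0A 00000000:00000000 00:00000000 00000000     0        0 12345
"""
```

If this count climbs towards the size of the privileged range while the CLI is run in a loop, that would be consistent with the explanation above.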

Comment 1 Anand Avati 2013-07-03 11:08:26 UTC
REVIEW: http://review.gluster.org/5280 (cli,glusterd: Use unix domain sockets for cli-glusterd communication) posted (#1) for review on master by Kaushal M (kaushal)

Comment 2 Anand Avati 2013-07-03 12:40:52 UTC
REVIEW: http://review.gluster.org/5280 (cli,glusterd: Use unix domain sockets for cli-glusterd communication) posted (#2) for review on master by Kaushal M (kaushal)

Comment 3 Anand Avati 2013-07-05 03:34:07 UTC
REVIEW: http://review.gluster.org/5280 (cli,glusterd: Use unix domain sockets for cli-glusterd communication) posted (#3) for review on master by Kaushal M (kaushal)

Comment 4 Anand Avati 2013-07-08 07:02:14 UTC
REVIEW: http://review.gluster.org/5280 (cli,glusterd: Use unix domain sockets for cli-glusterd communication) posted (#4) for review on master by Kaushal M (kaushal)

Comment 5 senaik 2013-07-08 09:19:20 UTC
Version : 3.4.0.12rhs.beta3-1.el6rhs.x86_64

Facing the below issue 
-----------------------
1) Created a distributed volume. While starting the volume, got the error message 'volume start failed'; on trying to start the volume again, it reports that the volume has already been started.
 
gluster volume create vol_12 10.70.34.85:/rhs/brick1/A1 10.70.34.105:/rhs/brick1/A2 10.70.34.86:/rhs/brick1/A3 10.70.34.85:/rhs/brick1/A4 10.70.34.105:/rhs/brick1/A5
volume create: vol_12: success: please start the volume to access data

[root@fillmore tmp]# gluster v start vol_12
volume start: vol_12: failed: Commit failed on 10.70.34.85. Please check the log file for more details.

[root@fillmore tmp]# gluster v start vol_12
volume start: vol_12: failed: Volume vol_12 already started

[root@fillmore tmp]# gluster v i vol_12
 
Volume Name: vol_12
Type: Distribute
Volume ID: 570901ec-377a-4690-b81d-8a4824deb797
Status: Started
Number of Bricks: 5
Transport-type: tcp
Bricks:
Brick1: 10.70.34.85:/rhs/brick1/A1
Brick2: 10.70.34.105:/rhs/brick1/A2
Brick3: 10.70.34.86:/rhs/brick1/A3
Brick4: 10.70.34.85:/rhs/brick1/A4
Brick5: 10.70.34.105:/rhs/brick1/A5

-----------part of log from 10.70.34.85-------------- 

[2013-07-08 09:17:02.303536] E [rpcsvc.c:519:rpcsvc_handle_rpc_call] 0-glusterd: Request received from non-privileged port. Failing request
[2013-07-08 09:17:02.314110] E [rpcsvc.c:519:rpcsvc_handle_rpc_call] 0-glusterd: Request received from non-privileged port. Failing request
[2013-07-08 09:17:02.413361] E [rpcsvc.c:519:rpcsvc_handle_rpc_call] 0-glusterd: Request received from non-privileged port. Failing request
[2013-07-08 09:17:02.418914] E [rpcsvc.c:519:rpcsvc_handle_rpc_call] 0-glusterd: 

-------------------------------------------------------
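
To check whether a glusterd log shows the same symptom, a small scan for the rejection message can help. This is a sketch that assumes the log line layout seen in the excerpt above; the function name is hypothetical, and a truncated line is deliberately not counted.

```python
import re

# Matches error lines from rpcsvc_handle_rpc_call that mention the
# non-privileged port rejection, capturing the bracketed timestamp.
REJECTED = re.compile(
    r"^\[(?P<ts>[^\]]+)\] E \[rpcsvc\.c:\d+:rpcsvc_handle_rpc_call\] "
    r".*Request received from non-privileged port"
)


def rejected_timestamps(log_text: str) -> list:
    """Return the timestamps of all rejected-request lines in the log text."""
    stamps = []
    for line in log_text.splitlines():
        match = REJECTED.match(line)
        if match:
            stamps.append(match.group("ts"))
    return stamps
```

Applied to the excerpt above, this would pick up the three complete lines and skip the cut-off fourth one.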

Comment 7 Anand Avati 2013-07-30 05:50:50 UTC
REVIEW: http://review.gluster.org/5280 (cli,glusterd: Use unix domain sockets for cli-glusterd communication) posted (#5) for review on master by Kaushal M (kaushal)

Comment 8 Anand Avati 2013-08-06 11:56:06 UTC
REVIEW: http://review.gluster.org/5280 (cli,glusterd: Changes to cli-glusterd communication) posted (#6) for review on master by Kaushal M (kaushal)

Comment 9 Anand Avati 2013-09-25 12:12:36 UTC
REVIEW: http://review.gluster.org/5280 (cli,glusterd: Changes to cli-glusterd communication) posted (#7) for review on master by Kaushal M (kaushal)

Comment 10 Anand Avati 2013-09-25 13:24:47 UTC
REVIEW: http://review.gluster.org/5280 (cli,glusterd: Changes to cli-glusterd communication) posted (#8) for review on master by Kaushal M (kaushal)

Comment 11 Anand Avati 2013-09-26 03:03:12 UTC
REVIEW: http://review.gluster.org/5280 (cli,glusterd: Changes to cli-glusterd communication) posted (#9) for review on master by Kaushal M (kaushal)

Comment 12 Anand Avati 2013-10-03 06:39:04 UTC
REVIEW: http://review.gluster.org/5280 (cli,glusterd: Changes to cli-glusterd communication) posted (#10) for review on master by Kaushal M (kaushal)

Comment 13 Anand Avati 2013-10-17 05:21:15 UTC
REVIEW: http://review.gluster.org/5280 (cli,glusterd: Changes to cli-glusterd communication) posted (#11) for review on master by Kaushal M (kaushal)

Comment 14 Anand Avati 2013-10-17 05:24:41 UTC
REVIEW: http://review.gluster.org/5280 (cli,glusterd: Changes to cli-glusterd communication) posted (#12) for review on master by Kaushal M (kaushal)

Comment 15 Anand Avati 2013-10-17 18:26:57 UTC
COMMIT: http://review.gluster.org/5280 committed in master by Vijay Bellur (vbellur) 
------
commit fc637b14cfad4d08e72bee7064194c8007a388d0
Author: Kaushal M <kaushal>
Date:   Wed Jul 3 16:31:22 2013 +0530

    cli,glusterd: Changes to cli-glusterd communication
    
    Glusterd changes:
    With this patch, glusterd creates a socket file in
    DATADIR/run/glusterd.socket , and listen on it for cli requests. It
    listens for 2 rpc programs on the socket file,
    - The glusterd cli rpc program, for all cli commands
    - A reduced glusterd handshake program, just for the 'system:: getspec'
      command
    
    The location of the socket file can be changed with the glusterd option
    'glusterd-sockfile'.
    
    To retain compatibility with the '--remote-host' cli option, glusterd
    also listens for the cli requests on port 24007. But, for the sake of
    security, it listens using a reduced cli rpc program on the port. The
    reduced rpc program only contains read-only procs used for 'volume
    (info|list|status)', 'peer status' and 'system:: getwd' cli commands.
    
    CLI changes:
    The gluster cli now uses the glusterd socket file for communicating with
    glusterd by default. A new option '--gluster-sock' has been added to
    allow specifying the sockfile used to connect. Using the '--remote-host'
    option will make cli connect to the given host & port.
    
    Tests changes:
    cluster.rc has been modified to make use of socket files and use
    different log files for each glusterd.
    Some of the tests using cluster.rc have been fixed.
    
    Change-Id: Iaf24bc22f42f8014a5fa300ce37c7fc9b1b92b53
    BUG: 980754
    Signed-off-by: Kaushal M <kaushal>
    Reviewed-on: http://review.gluster.org/5280
    Tested-by: Gluster Build System <jenkins.com>
    Reviewed-by: Krishnan Parthasarathi <kparthas>
    Reviewed-by: Vijay Bellur <vbellur>
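
For readers unfamiliar with the approach, the sketch below shows why a Unix domain socket sidesteps the port problem: the client connects to a filesystem path, so no TCP source port is consumed and nothing is left in TIME_WAIT. It is a generic Python illustration, not glusterd or CLI code; the socket path, the echo reply, and the function names are all made up.

```python
import os
import socket
import tempfile
import threading


def serve_once(server: socket.socket) -> None:
    """Accept a single connection and echo the request back with a prefix."""
    conn, _ = server.accept()
    with conn:
        request = conn.recv(1024)
        conn.sendall(b"echo:" + request)


def round_trip(message: bytes) -> bytes:
    """Send `message` over a Unix domain socket and return the reply."""
    with tempfile.TemporaryDirectory() as tmpdir:
        # Hypothetical socket path, standing in for a daemon's socket file.
        sock_path = os.path.join(tmpdir, "demo.socket")
        with socket.socket(socket.AF_UNIX, socket.SOCK_STREAM) as server:
            server.bind(sock_path)
            server.listen(1)
            worker = threading.Thread(target=serve_once, args=(server,))
            worker.start()
            with socket.socket(socket.AF_UNIX, socket.SOCK_STREAM) as client:
                client.connect(sock_path)  # a path, not a host and port
                client.sendall(message)
                reply = client.recv(1024)
            worker.join()
        return reply
```

Access to such a socket is governed by filesystem permissions on the socket file rather than by the source port of the caller.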

Comment 16 Niels de Vos 2014-04-17 11:43:06 UTC
This bug is getting closed because a release has been made available that should address the reported issue. In case the problem is still not fixed with glusterfs-3.5.0, please reopen this bug report.

glusterfs-3.5.0 has been announced on the Gluster Developers mailing list [1]; packages for several distributions should become available in the near future. Keep an eye on the Gluster Users mailing list [2] and the update infrastructure for your distribution.

[1] http://thread.gmane.org/gmane.comp.file-systems.gluster.devel/6137
[2] http://thread.gmane.org/gmane.comp.file-systems.gluster.user