Bug 1813029 - volume brick fails to come online because other process is using port 49152
Summary: volume brick fails to come online because other process is using port 49152
Keywords:
Status: CLOSED UPSTREAM
Alias: None
Product: GlusterFS
Classification: Community
Component: glusterd
Version: 6
Hardware: x86_64
OS: Linux
Priority: unspecified
Severity: urgent
Target Milestone: ---
Assignee: bugs@gluster.org
QA Contact:
URL:
Whiteboard:
Depends On:
Blocks:
 
Reported: 2020-03-12 18:34 UTC by Amgad
Modified: 2020-03-16 02:40 UTC (History)
1 user (show)

Fixed In Version:
Clone Of:
Environment:
Last Closed: 2020-03-16 02:40:06 UTC
Regression: ---
Mount Type: ---
Documentation: ---
CRM:
Verified Versions:
Embargoed:


Attachments (Terms of Use)

Description Amgad 2020-03-12 18:34:00 UTC
Description of problem:
When booting, glusterd starts volume bricks at port 49152. If that port is already in use by any other process, even transiently (for example by mistral), glusterd does not move on to the next port number and we end up with the volume brick offline.
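The expected fallback behavior can be sketched as a plain bind loop (a minimal Python illustration, not glusterd's actual code; the retry count and host are arbitrary assumptions):

```python
import errno
import socket

def bind_next_free(start_port, host="127.0.0.1", max_tries=10):
    """Try start_port first; on EADDRINUSE advance to the next port."""
    for port in range(start_port, start_port + max_tries):
        sock = socket.socket(socket.AF_INET, socket.SOCK_STREAM)
        try:
            sock.bind((host, port))
            sock.listen(1)
            return sock, port
        except OSError as exc:
            sock.close()
            if exc.errno != errno.EADDRINUSE:
                raise
    raise RuntimeError("no free port within %d tries" % max_tries)

# Occupy whatever port the loop picks first, then show that a second
# caller skips past it instead of failing outright.
blocker, base = bind_next_free(49152)
server, port = bind_next_free(base)
assert port > base
```

This is what the bug asks for: a brick that finds 49152 busy should come up on the next free port, exactly as it does after a manual volume stop/start.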

Status of volume: cacerts
Gluster process                             TCP Port  RDMA Port  Online  Pid
------------------------------------------------------------------------------
Brick 10.10.10.12:/gluster/cacerts/brick    N/A       N/A        N       N/A  

If we do a volume stop/start (without restarting glusterd), the brick comes online on the next available port number (49156):

Status of volume: cacerts
Gluster process                             TCP Port  RDMA Port  Online  Pid
------------------------------------------------------------------------------
Brick 10.10.10.12:/gluster/cacerts/brick    49156     0          Y       17660

This happened during a reboot of a VM: the mistral process had established a connection using port 49152 on the server, so the port was in use (see the mistral connection to amqp):

-----------------

-bash-4.2# lsof -ni -p 4927 | grep 49152
lsof: WARNING: can't stat() fuse.glusterfs file system /cacerts
Output information may be incomplete.
glusterfs 682 root 10u IPv4 156905 0t0 TCP 10.10.10.12:49089->10.10.10.13:49152 (ESTABLISHED)
nginx 807 nginx 78u IPv4 3471078 0t0 TCP 10.10.10.9:amqp->10.10.10.12:49152 (ESTABLISHED)
mistral-s 4927 mistral 12u IPv4 3476971 0t0 TCP 10.10.10.12:49152->10.10.10.9:amqp (ESTABLISHED)
glusterfs 8852 root 13u IPv4 76064 0t0 TCP 10.10.10.12:49141->10.10.10.13:49152 (ESTABLISHED)

---------------
GlusterFS therefore cannot claim the port and fails with these messages:
----------------------
/var/log/glusterfs/bricks/gluster-cacerts-brick.log
[2020-03-11 11:29:48.311164] W [socket.c:721:__socket_rwv] 0-glusterfs: readv on 10.10.10.12:24007 failed (No data available)
[2020-03-11 11:29:48.311208] I [glusterfsd-mgmt.c:2443:mgmt_rpc_notify] 0-glusterfsd-mgmt: disconnected from remote-host: 10.10.10.12
[2020-03-11 11:29:48.311220] I [glusterfsd-mgmt.c:2463:mgmt_rpc_notify] 0-glusterfsd-mgmt: Exhausted all volfile servers
[2020-03-11 11:30:08.541800] I [MSGID: 115036] [server.c:499:server_rpc_notify] 0-cacerts-server: disconnecting connection from CTX_ID:577576fc-fc99-4874-95d9-1acd3ad25153-GRAPH_ID:0-PID:20906-HOST:cbnd-2-server-PC_NAME:cacerts-client-1-RECON_NO:-0
[2020-03-11 11:30:08.542166] I [MSGID: 101055] [client_t.c:436:gf_client_unref] 0-cacerts-server: Shutting down connection CTX_ID:577576fc-fc99-4874-95d9-1acd3ad25153-GRAPH_ID:0-PID:20906-HOST:cbnd-2-server-PC_NAME:cacerts-client-1-RECON_NO:-0
[2020-03-11 11:35:09.957365] I [MSGID: 100030] [glusterfsd.c:2847:main] 0-/usr/sbin/glusterfsd: Started running /usr/sbin/glusterfsd version 6.5 (args: /usr/sbin/glusterfsd -s 10.10.10.12 --volfile-id cacerts.10.10.10.12.gluster-cacerts-brick -p /var/run/gluster/vols/cacerts/10.10.10.12-gluster-cacerts-brick.pid -S /var/run/gluster/6378652cef803b2a.socket --brick-name /gluster/cacerts/brick -l /var/log/glusterfs/bricks/gluster-cacerts-brick.log --xlator-option *-posix.glusterd-uuid=a9a3e24e-599a-4c45-9fbb-fe9914193d33 --process-name brick --brick-port 49152 --xlator-option cacerts-server.listen-port=49152)
[2020-03-11 11:35:09.957872] I [glusterfsd.c:2556:daemonize] 0-glusterfs: Pid of current running process is 6574
[2020-03-11 11:35:09.963535] I [socket.c:904:__socket_server_bind] 0-socket.glusterfsd: closing (AF_UNIX) reuse check socket 9
[2020-03-11 11:35:09.967195] I [MSGID: 101190] [event-epoll.c:680:event_dispatch_epoll_worker] 0-epoll: Started thread with index 0
[2020-03-11 11:35:09.967320] I [MSGID: 101190] [event-epoll.c:680:event_dispatch_epoll_worker] 0-epoll: Started thread with index 1
[2020-03-11 11:37:59.244311] I [rpcsvc.c:2698:rpcsvc_set_outstanding_rpc_limit] 0-rpc-service: Configured rpc.outstanding-rpc-limit with value 64
[2020-03-11 11:37:59.244561] E [socket.c:923:__socket_server_bind] 0-tcp.cacerts-server: binding to failed: Address already in use
[2020-03-11 11:37:59.244578] E [socket.c:925:__socket_server_bind] 0-tcp.cacerts-server: Port is already in use
[2020-03-11 11:38:00.244719] E [socket.c:923:__socket_server_bind] 0-tcp.cacerts-server: binding to failed: Address already in use
[2020-03-11 11:38:00.244771] E [socket.c:925:__socket_server_bind] 0-tcp.cacerts-server: Port is already in use
[2020-03-11 11:38:01.244877] E [socket.c:923:__socket_server_bind] 0-tcp.cacerts-server: binding to failed: Address already in use
[2020-03-11 11:38:01.244919] E [socket.c:925:__socket_server_bind] 0-tcp.cacerts-server: Port is already in use
[2020-03-11 11:38:02.245025] E [socket.c:923:__socket_server_bind] 0-tcp.cacerts-server: binding to failed: Address already in use
[2020-03-11 11:38:02.245066] E [socket.c:925:__socket_server_bind] 0-tcp.cacerts-server: Port is already in use
[2020-03-11 11:38:03.245174] E [socket.c:923:__socket_server_bind] 0-tcp.cacerts-server: binding to failed: Address already in use
[2020-03-11 11:38:03.245216] E [socket.c:925:__socket_server_bind] 0-tcp.cacerts-server: Port is already in use
[2020-03-11 11:38:04.245348] E [socket.c:923:__socket_server_bind] 0-tcp.cacerts-server: binding to failed: Address already in use
[2020-03-11 11:38:04.245399] E [socket.c:925:__socket_server_bind] 0-tcp.cacerts-server: Port is already in use
[2020-03-11 11:38:05.245519] E [socket.c:923:__socket_server_bind] 0-tcp.cacerts-server: binding to failed: Address already in use
[2020-03-11 11:38:05.245568] E [socket.c:925:__socket_server_bind] 0-tcp.cacerts-server: Port is already in use
[2020-03-11 11:38:06.245681] E [socket.c:923:__socket_server_bind] 0-tcp.cacerts-server: binding to failed: Address already in use
[2020-03-11 11:38:06.245741] E [socket.c:925:__socket_server_bind] 0-tcp.cacerts-server: Port is already in use
[2020-03-11 11:38:07.245848] E [socket.c:923:__socket_server_bind] 0-tcp.cacerts-server: binding to failed: Address already in use
[2020-03-11 11:38:07.245902] E [socket.c:925:__socket_server_bind] 0-tcp.cacerts-server: Port is already in use
[2020-03-11 11:38:08.246040] E [socket.c:923:__socket_server_bind] 0-tcp.cacerts-server: binding to failed: Address already in use
[2020-03-11 11:38:08.246093] E [socket.c:925:__socket_server_bind] 0-tcp.cacerts-server: Port is already in use
[2020-03-11 11:38:09.246190] E [socket.c:3718:socket_listen] 0-tcp.cacerts-server: __socket_server_bind failed;closing socket 12
[2020-03-11 11:38:09.246242] W [rpcsvc.c:1991:rpcsvc_create_listener] 0-rpc-service: listening on transport failed
[2020-03-11 11:38:09.246298] W [MSGID: 115045] [server.c:1186:server_init] 0-cacerts-server: creation of listener failed
[2020-03-11 11:38:09.247097] E [MSGID: 101019] [xlator.c:629:xlator_init] 0-cacerts-server: Initialization of volume 'cacerts-server' failed, review your volfile again
[2020-03-11 11:38:09.247124] E [MSGID: 101066] [graph.c:362:glusterfs_graph_init] 0-cacerts-server: initializing translator failed
[2020-03-11 11:38:09.247154] E [MSGID: 101176] [graph.c:725:glusterfs_graph_activate] 0-graph: init failed
[2020-03-11 11:38:09.248308] W [glusterfsd.c:1570:cleanup_and_exit] (->/usr/sbin/glusterfsd(mgmt_getspec_cbk+0x873) [0x557305bb0383] -->/usr/sbin/glusterfsd(glusterfs_process_volfp+0x239) [0x557305ba95f9] -->/usr/sbin/glusterfsd(cleanup_and_exit+0x54) [0x557305ba8a84] ) 0: received signum (-1), shutting down

-------------------------------------------------------------------------
When I rebooted that VM (where glusterd was started), it came back with the volume brick on port 49152

Status of volume: cacerts
Gluster process TCP Port RDMA Port Online Pid
------------------------------------------------------------------------------
Brick 10.10.10.12:/gluster/cacerts/brick 49152 0 Y 4838

*** So there's a race condition.
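The race is possible because the Linux ephemeral (client-side) port range overlaps Gluster's brick port range starting at 49152, so any outbound connection can transiently land on a brick port. A quick check (the 32768-60999 bounds are the common Linux defaults, an assumption; verify with `sysctl net.ipv4.ip_local_port_range` on the affected host):

```python
def in_ephemeral_range(port, low=32768, high=60999):
    """True if the kernel may hand out `port` as a client-side source port."""
    return low <= port <= high

# Gluster bricks start at 49152, squarely inside the default range, so a
# client connection (like mistral's to amqp) can grab the port first.
assert in_ephemeral_range(49152)
```

One possible mitigation, independent of the glusterd fix requested here, is excluding the brick ports from ephemeral allocation via the `net.ipv4.ip_local_reserved_ports` sysctl; whether that is appropriate is a deployment decision.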

How it should work:
glusterd should move to the next available port number for volume bricks when the one it is trying is already in use.

How reproducible:
This doesn't happen all the time, but occasionally. 

Steps to Reproduce:
1. Set up a 3-VM cluster with 4 replica-3 volumes
2. Add files to the volumes
3. Reboot a VM and check gluster volume status
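The collision itself can be reproduced without Gluster: hold a port as a client-side source port, then try to listen on it. A self-contained Python sketch (the amqp peer is simulated by a local listener, and a freshly probed free port stands in for 49152):

```python
import errno
import socket

# Stand-in for the amqp server that mistral talks to.
server = socket.socket()
server.bind(("127.0.0.1", 0))
server.listen(1)
dest = server.getsockname()

# Pick a currently free port to play the role of 49152.
probe = socket.socket()
probe.bind(("127.0.0.1", 0))
port = probe.getsockname()[1]
probe.close()

# The "mistral" side claims that port as its *source* port and connects out.
client = socket.socket()
client.setsockopt(socket.SOL_SOCKET, socket.SO_REUSEADDR, 1)
client.bind(("127.0.0.1", port))
client.connect(dest)

# The "brick" now tries to listen on the same port and gets EADDRINUSE,
# matching the __socket_server_bind errors in the brick log above.
brick = socket.socket()
try:
    brick.bind(("127.0.0.1", port))
    failed = False
except OSError as exc:
    failed = exc.errno == errno.EADDRINUSE
```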

Actual results:


Expected results:


Additional info:

Comment 1 Amgad 2020-03-13 15:55:45 UTC
any update!

Comment 2 Worker Ant 2020-03-16 02:40:06 UTC
This bug is moved to https://github.com/gluster/glusterfs/issues/1101, and will be tracked there from now on. Visit GitHub issues URL for further details

