Description of problem:
When booting, glusterd starts volume bricks beginning at port 49152. If that port is already in use by any other process, even transiently (like mistral), glusterd does not move on to the next port number and we end up with the volume brick offline.

Status of volume: cacerts
Gluster process                             TCP Port  RDMA Port  Online  Pid
------------------------------------------------------------------------------
Brick 10.10.10.12:/gluster/cacerts/brick    N/A       N/A        N       N/A

If we do a volume stop/start (no restart of glusterd), the brick comes online on the next available port number (49156):

Status of volume: cacerts
Gluster process                             TCP Port  RDMA Port  Online  Pid
------------------------------------------------------------------------------
Brick 10.10.10.12:/gluster/cacerts/brick    49156     0          Y       17660

This happens during reboot of a VM: the mistral process established a connection to amqp using source port 49152 on the server, so the port is already in use (see the mistral connection to amqp below).
-----------------
-bash-4.2# lsof -ni -p 4927 | grep 49152
lsof: WARNING: can't stat() fuse.glusterfs file system /cacerts
      Output information may be incomplete.
glusterfs   682    root   10u  IPv4  156905  0t0  TCP 10.10.10.12:49089->10.10.10.13:49152 (ESTABLISHED)
nginx       807   nginx   78u  IPv4 3471078  0t0  TCP 10.10.10.9:amqp->10.10.10.12:49152 (ESTABLISHED)
mistral-s  4927 mistral   12u  IPv4 3476971  0t0  TCP 10.10.10.12:49152->10.10.10.9:amqp (ESTABLISHED)
glusterfs  8852    root   13u  IPv4   76064  0t0  TCP 10.10.10.12:49141->10.10.10.13:49152 (ESTABLISHED)
---------------

GlusterFS is then unable to bind to that port and fails with the following messages:
----------------------
/var/log/glusterfs/bricks/gluster-cacerts-brick.log

[2020-03-11 11:29:48.311164] W [socket.c:721:__socket_rwv] 0-glusterfs: readv on 10.10.10.12:24007 failed (No data available)
[2020-03-11 11:29:48.311208] I [glusterfsd-mgmt.c:2443:mgmt_rpc_notify] 0-glusterfsd-mgmt: disconnected from remote-host: 10.10.10.12
[2020-03-11 11:29:48.311220] I [glusterfsd-mgmt.c:2463:mgmt_rpc_notify] 0-glusterfsd-mgmt: Exhausted all volfile servers
[2020-03-11 11:30:08.541800] I [MSGID: 115036] [server.c:499:server_rpc_notify] 0-cacerts-server: disconnecting connection from CTX_ID:577576fc-fc99-4874-95d9-1acd3ad25153-GRAPH_ID:0-PID:20906-HOST:cbnd-2-server-PC_NAME:cacerts-client-1-RECON_NO:-0
[2020-03-11 11:30:08.542166] I [MSGID: 101055] [client_t.c:436:gf_client_unref] 0-cacerts-server: Shutting down connection CTX_ID:577576fc-fc99-4874-95d9-1acd3ad25153-GRAPH_ID:0-PID:20906-HOST:cbnd-2-server-PC_NAME:cacerts-client-1-RECON_NO:-0
[2020-03-11 11:35:09.957365] I [MSGID: 100030] [glusterfsd.c:2847:main] 0-/usr/sbin/glusterfsd: Started running /usr/sbin/glusterfsd version 6.5 (args: /usr/sbin/glusterfsd -s 10.10.10.12 --volfile-id cacerts.10.10.10.12.gluster-cacerts-brick -p /var/run/gluster/vols/cacerts/10.10.10.12-gluster-cacerts-brick.pid -S /var/run/gluster/6378652cef803b2a.socket --brick-name /gluster/cacerts/brick -l /var/log/glusterfs/bricks/gluster-cacerts-brick.log --xlator-option *-posix.glusterd-uuid=a9a3e24e-599a-4c45-9fbb-fe9914193d33 --process-name brick --brick-port 49152 --xlator-option cacerts-server.listen-port=49152)
[2020-03-11 11:35:09.957872] I [glusterfsd.c:2556:daemonize] 0-glusterfs: Pid of current running process is 6574
[2020-03-11 11:35:09.963535] I [socket.c:904:__socket_server_bind] 0-socket.glusterfsd: closing (AF_UNIX) reuse check socket 9
[2020-03-11 11:35:09.967195] I [MSGID: 101190] [event-epoll.c:680:event_dispatch_epoll_worker] 0-epoll: Started thread with index 0
[2020-03-11 11:35:09.967320] I [MSGID: 101190] [event-epoll.c:680:event_dispatch_epoll_worker] 0-epoll: Started thread with index 1
[2020-03-11 11:37:59.244311] I [rpcsvc.c:2698:rpcsvc_set_outstanding_rpc_limit] 0-rpc-service: Configured rpc.outstanding-rpc-limit with value 64
[2020-03-11 11:37:59.244561] E [socket.c:923:__socket_server_bind] 0-tcp.cacerts-server: binding to failed: Address already in use
[2020-03-11 11:37:59.244578] E [socket.c:925:__socket_server_bind] 0-tcp.cacerts-server: Port is already in use
[2020-03-11 11:38:00.244719] E [socket.c:923:__socket_server_bind] 0-tcp.cacerts-server: binding to failed: Address already in use
[2020-03-11 11:38:00.244771] E [socket.c:925:__socket_server_bind] 0-tcp.cacerts-server: Port is already in use
[2020-03-11 11:38:01.244877] E [socket.c:923:__socket_server_bind] 0-tcp.cacerts-server: binding to failed: Address already in use
[2020-03-11 11:38:01.244919] E [socket.c:925:__socket_server_bind] 0-tcp.cacerts-server: Port is already in use
[2020-03-11 11:38:02.245025] E [socket.c:923:__socket_server_bind] 0-tcp.cacerts-server: binding to failed: Address already in use
[2020-03-11 11:38:02.245066] E [socket.c:925:__socket_server_bind] 0-tcp.cacerts-server: Port is already in use
[2020-03-11 11:38:03.245174] E [socket.c:923:__socket_server_bind] 0-tcp.cacerts-server: binding to failed: Address already in use
[2020-03-11 11:38:03.245216] E [socket.c:925:__socket_server_bind] 0-tcp.cacerts-server: Port is already in use
[2020-03-11 11:38:04.245348] E [socket.c:923:__socket_server_bind] 0-tcp.cacerts-server: binding to failed: Address already in use
[2020-03-11 11:38:04.245399] E [socket.c:925:__socket_server_bind] 0-tcp.cacerts-server: Port is already in use
[2020-03-11 11:38:05.245519] E [socket.c:923:__socket_server_bind] 0-tcp.cacerts-server: binding to failed: Address already in use
[2020-03-11 11:38:05.245568] E [socket.c:925:__socket_server_bind] 0-tcp.cacerts-server: Port is already in use
[2020-03-11 11:38:06.245681] E [socket.c:923:__socket_server_bind] 0-tcp.cacerts-server: binding to failed: Address already in use
[2020-03-11 11:38:06.245741] E [socket.c:925:__socket_server_bind] 0-tcp.cacerts-server: Port is already in use
[2020-03-11 11:38:07.245848] E [socket.c:923:__socket_server_bind] 0-tcp.cacerts-server: binding to failed: Address already in use
[2020-03-11 11:38:07.245902] E [socket.c:925:__socket_server_bind] 0-tcp.cacerts-server: Port is already in use
[2020-03-11 11:38:08.246040] E [socket.c:923:__socket_server_bind] 0-tcp.cacerts-server: binding to failed: Address already in use
[2020-03-11 11:38:08.246093] E [socket.c:925:__socket_server_bind] 0-tcp.cacerts-server: Port is already in use
[2020-03-11 11:38:09.246190] E [socket.c:3718:socket_listen] 0-tcp.cacerts-server: __socket_server_bind failed;closing socket 12
[2020-03-11 11:38:09.246242] W [rpcsvc.c:1991:rpcsvc_create_listener] 0-rpc-service: listening on transport failed
[2020-03-11 11:38:09.246298] W [MSGID: 115045] [server.c:1186:server_init] 0-cacerts-server: creation of listener failed
[2020-03-11 11:38:09.247097] E [MSGID: 101019] [xlator.c:629:xlator_init] 0-cacerts-server: Initialization of volume 'cacerts-server' failed, review your volfile again
[2020-03-11 11:38:09.247124] E [MSGID: 101066] [graph.c:362:glusterfs_graph_init] 0-cacerts-server: initializing translator failed
[2020-03-11 11:38:09.247154] E [MSGID: 101176] [graph.c:725:glusterfs_graph_activate] 0-graph: init failed
[2020-03-11 11:38:09.248308] W [glusterfsd.c:1570:cleanup_and_exit] (->/usr/sbin/glusterfsd(mgmt_getspec_cbk+0x873) [0x557305bb0383] -->/usr/sbin/glusterfsd(glusterfs_process_volfp+0x239) [0x557305ba95f9] -->/usr/sbin/glusterfsd(cleanup_and_exit+0x54) [0x557305ba8a84] ) 0: received signum (-1), shutting down
-------------------------------------------------------------------------

When I rebooted that VM (the one where glusterd was started), it came back with the volume brick on port 49152:

-bash-4.2# gluster volume status
Status of volume: cacerts
Gluster process                             TCP Port  RDMA Port  Online  Pid
------------------------------------------------------------------------------
Brick 10.10.10.12:/gluster/cacerts/brick    49152     0          Y       4838

*** So there's a race condition.

How it should work:
glusterd should move to the next available port number for volume bricks if the one it is trying is already in use.

How reproducible:
This doesn't happen all the time, but occasionally.

Steps to Reproduce:
1. Set up a 3-VM cluster with 4 volumes, replica 3
2. Add files to the volumes
3. Reboot a VM and check gluster volume status

Actual results:
Occasionally a brick stays offline after the reboot because its assigned port (49152) was already taken by another process, and glusterd does not retry on another port.

Expected results:
The brick comes online after the reboot, on the next available port if the first one is in use.

Additional info:
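Not an official fix, but a possible mitigation sketch while this is open: the collision happens because the kernel's ephemeral (outgoing) port range on these hosts overlaps the brick port range GlusterFS starts from (49152), so any outgoing connection such as mistral->amqp can transiently grab a brick port. Assuming the bricks on a host only need a small range above 49152, reserving that range from ephemeral allocation via the net.ipv4.ip_local_reserved_ports sysctl should keep other processes off it; an already-offline brick can then be restarted without taking the volume down using "gluster volume start <vol> force". The exact range below (49152-49251) is an assumption for illustration; size it to the number of bricks per host.
-----------------
# Check the ephemeral port range; by default it usually spans 32768-60999,
# which includes 49152 and therefore the GlusterFS brick ports.
sysctl net.ipv4.ip_local_port_range

# Reserve the brick port range so it is never handed out as a source port
# for outgoing connections (49152-49251 is an illustrative assumption).
sysctl -w net.ipv4.ip_local_reserved_ports=49152-49251
# Add the same setting to /etc/sysctl.conf (or a file under /etc/sysctl.d/)
# so it survives reboots.

# If a brick is already offline, restart just the missing brick processes
# without stopping the volume:
gluster volume start cacerts force
-----------------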
Any update?
This bug has been moved to https://github.com/gluster/glusterfs/issues/1101 and will be tracked there from now on. Visit the GitHub issue URL for further details.