Bug 1540249 - Gluster is trying to use a port outside documentation and firewalld's glusterfs.xml
Keywords:
Status: CLOSED WORKSFORME
Alias: None
Product: GlusterFS
Classification: Community
Component: glusterd
Version: 3.12
Hardware: Unspecified
OS: Unspecified
Priority: unspecified
Severity: unspecified
Target Milestone: ---
Assignee: bugs@gluster.org
QA Contact:
URL:
Whiteboard:
Depends On:
Blocks:
 
Reported: 2018-01-30 15:39 UTC by devianca
Modified: 2018-10-05 02:47 UTC (History)
6 users

Fixed In Version:
Doc Type: If docs needed, set a value
Doc Text:
Clone Of:
Environment:
Last Closed: 2018-10-05 02:47:45 UTC
Regression: ---
Mount Type: ---
Documentation: ---
CRM:
Verified Versions:


Attachments (Terms of Use)
this is a fresh log from a situation when a node is stuck (removed all logs before run) (12.76 KB, application/zip)
2018-01-30 15:39 UTC, devianca
no flags Details
both nodes when node2 got stuck (both logs are 'fresh') (23.78 KB, application/zip)
2018-01-30 15:57 UTC, devianca
no flags Details
statedump from node2 when node2 is stuck (9.66 KB, text/plain)
2018-01-30 15:59 UTC, devianca
no flags Details

Description devianca 2018-01-30 15:39:21 UTC
Created attachment 1388503 [details]
this is a fresh log from a situation when a node is stuck (removed all logs before run)

Description of problem:
I wrote a script to catch these situations in a reboot loop: in my replica-2 setup, after a reboot a node (statistically ~15% of the time) cannot connect to the other node and stays Disconnected forever in `gluster pool list`, while at the same time the other node reports the first node as Connected, and for that other node everything works perfectly.

I narrowed it down to firewalld denying connections over port 49151, which is what causes this.

Node1: 10.250.1.2
Node2: 10.250.1.1
Both CentOS 7, with the glusterfs.xml service opened on the 'gluster' interconnect zone.
This is a quick paste of this exact situation when a node is stuck:

[root@ProdigyX ~]# uptime
 13:43:47 up 7 min,  1 user,  load average: 0,00, 0,04, 0,05
[root@ProdigyX ~]# systemctl is-active glusterd
active
[root@ProdigyX ~]# networkctl status bond1
● 9: bond1
   Link File: n/a
Network File: /etc/systemd/network/bond1.network
        Type: ether
       State: routable (configured)
      Driver: bonding
  HW Address: 26:bb:b5:40:75:92
         MTU: 9198
     Address: 10.250.1.2 <------------------------------------------------------------------------------------
[root@ProdigyX ~]# firewall-cmd --get-active-zones
gluster
  interfaces: bond1
[root@ProdigyX ~]# firewall-cmd --permanent --info-zone=gluster
gluster (active)
  target: default
  icmp-block-inversion: no
  interfaces: bond1 <----------------------------------------------------------------------------------------- interface bond1
  sources:
  services: glusterfs <--------------------------------------------------------------------------------------- whole service opened
  ports:
  protocols:
  masquerade: no
  forward-ports:
  source-ports:
  icmp-blocks:
  rich rules:
 
[root@ProdigyX ~]# cat /usr/lib/firewalld/services/glusterfs.xml
<?xml version="1.0" encoding="utf-8"?>
<service>
<short>glusterfs-static</short>
<description>Default ports for gluster-distributed storage</description>
<port protocol="tcp" port="24007"/>    <!--For glusterd -->
<port protocol="tcp" port="24008"/>    <!--For glusterd RDMA port management -->
<port protocol="tcp" port="24009"/>    <!--For glustereventsd -->
<port protocol="tcp" port="38465"/>    <!--Gluster NFS service -->
<port protocol="tcp" port="38466"/>    <!--Gluster NFS service -->
<port protocol="tcp" port="38467"/>    <!--Gluster NFS service -->
<port protocol="tcp" port="38468"/>    <!--Gluster NFS service -->
<port protocol="tcp" port="38469"/>    <!--Gluster NFS service -->
<port protocol="tcp" port="49152-49664"/>  <!--512 ports for bricks --> <-------------------------------------- 49152 opened
</service>
[root@ProdigyX ~]# cat /var/log/messages | grep "10.250.1.1" | tail -10 <-------------------------------------- it's seeking 49151 instead of 49152
Jan 30 13:42:11 ProdigyX kernel: STATE_INVALID_DROP: IN=bond1 OUT= MAC=26:bb:b5:40:75:92:68:05:ca:69:9e:fc:08:00 SRC=10.250.1.1 DST=10.250.1.2 LEN=60 TOS=0x00 PREC=0x00 TTL=64 ID=0 DF PROTO=TCP SPT=24007 DPT=49151 WINDOW=27438 RES=0x00 ACK SYN URGP=0
Jan 30 13:43:16 ProdigyX kernel: STATE_INVALID_DROP: IN=bond1 OUT= MAC=26:bb:b5:40:75:92:68:05:ca:69:9e:fc:08:00 SRC=10.250.1.1 DST=10.250.1.2 LEN=60 TOS=0x00 PREC=0x00 TTL=64 ID=0 DF PROTO=TCP SPT=24007 DPT=49151 WINDOW=27438 RES=0x00 ACK SYN URGP=0
Jan 30 13:43:17 ProdigyX kernel: STATE_INVALID_DROP: IN=bond1 OUT= MAC=26:bb:b5:40:75:92:68:05:ca:69:9e:fc:08:00 SRC=10.250.1.1 DST=10.250.1.2 LEN=60 TOS=0x00 PREC=0x00 TTL=64 ID=0 DF PROTO=TCP SPT=24007 DPT=49151 WINDOW=27438 RES=0x00 ACK SYN URGP=0
Jan 30 13:43:18 ProdigyX kernel: STATE_INVALID_DROP: IN=bond1 OUT= MAC=26:bb:b5:40:75:92:68:05:ca:69:9e:fc:08:00 SRC=10.250.1.1 DST=10.250.1.2 LEN=60 TOS=0x00 PREC=0x00 TTL=64 ID=0 DF PROTO=TCP SPT=24007 DPT=49151 WINDOW=27438 RES=0x00 ACK SYN URGP=0
Jan 30 13:43:19 ProdigyX kernel: STATE_INVALID_DROP: IN=bond1 OUT= MAC=26:bb:b5:40:75:92:68:05:ca:69:9e:fc:08:00 SRC=10.250.1.1 DST=10.250.1.2 LEN=60 TOS=0x00 PREC=0x00 TTL=64 ID=0 DF PROTO=TCP SPT=24007 DPT=49151 WINDOW=27438 RES=0x00 ACK SYN URGP=0
Jan 30 13:43:22 ProdigyX kernel: STATE_INVALID_DROP: IN=bond1 OUT= MAC=26:bb:b5:40:75:92:68:05:ca:69:9e:fc:08:00 SRC=10.250.1.1 DST=10.250.1.2 LEN=60 TOS=0x00 PREC=0x00 TTL=64 ID=0 DF PROTO=TCP SPT=24007 DPT=49151 WINDOW=27438 RES=0x00 ACK SYN URGP=0
Jan 30 13:43:23 ProdigyX kernel: STATE_INVALID_DROP: IN=bond1 OUT= MAC=26:bb:b5:40:75:92:68:05:ca:69:9e:fc:08:00 SRC=10.250.1.1 DST=10.250.1.2 LEN=60 TOS=0x00 PREC=0x00 TTL=64 ID=0 DF PROTO=TCP SPT=24007 DPT=49151 WINDOW=27438 RES=0x00 ACK SYN URGP=0
Jan 30 13:43:28 ProdigyX kernel: STATE_INVALID_DROP: IN=bond1 OUT= MAC=26:bb:b5:40:75:92:68:05:ca:69:9e:fc:08:00 SRC=10.250.1.1 DST=10.250.1.2 LEN=60 TOS=0x00 PREC=0x00 TTL=64 ID=0 DF PROTO=TCP SPT=24007 DPT=49151 WINDOW=27438 RES=0x00 ACK SYN URGP=0
Jan 30 13:43:31 ProdigyX kernel: STATE_INVALID_DROP: IN=bond1 OUT= MAC=26:bb:b5:40:75:92:68:05:ca:69:9e:fc:08:00 SRC=10.250.1.1 DST=10.250.1.2 LEN=60 TOS=0x00 PREC=0x00 TTL=64 ID=0 DF PROTO=TCP SPT=24007 DPT=49151 WINDOW=27438 RES=0x00 ACK SYN URGP=0
Jan 30 13:43:40 ProdigyX kernel: STATE_INVALID_DROP: IN=bond1 OUT= MAC=26:bb:b5:40:75:92:68:05:ca:69:9e:fc:08:00 SRC=10.250.1.1 DST=10.250.1.2 LEN=60 TOS=0x00 PREC=0x00 TTL=64 ID=0 DF PROTO=TCP SPT=24007 DPT=49151 WINDOW=27438 RES=0x00 ACK SYN URGP=0
[root@ProdigyX ~]# lsof -P | grep ':49151'
glusterd  1164                 root   13u     IPv4              38496       0t0        TCP ProdigyX:49151->10.250.1.1:24007 (SYN_SENT)
glusterd  1164                 root   16u     IPv4              19066       0t0        TCP ProdigyX:24007->10.250.1.1:49151 (ESTABLISHED)
glusterti 1164 1165            root   13u     IPv4              38496       0t0        TCP ProdigyX:49151->10.250.1.1:24007 (SYN_SENT)
glusterti 1164 1165            root   16u     IPv4              19066       0t0        TCP ProdigyX:24007->10.250.1.1:49151 (ESTABLISHED)
glustersi 1164 1166            root   13u     IPv4              38496       0t0        TCP ProdigyX:49151->10.250.1.1:24007 (SYN_SENT)
glustersi 1164 1166            root   16u     IPv4              19066       0t0        TCP ProdigyX:24007->10.250.1.1:49151 (ESTABLISHED)
glusterme 1164 1167            root   13u     IPv4              38496       0t0        TCP ProdigyX:49151->10.250.1.1:24007 (SYN_SENT)
glusterme 1164 1167            root   16u     IPv4              19066       0t0        TCP ProdigyX:24007->10.250.1.1:49151 (ESTABLISHED)
glustersp 1164 1168            root   13u     IPv4              38496       0t0        TCP ProdigyX:49151->10.250.1.1:24007 (SYN_SENT)
glustersp 1164 1168            root   16u     IPv4              19066       0t0        TCP ProdigyX:24007->10.250.1.1:49151 (ESTABLISHED)
glustersp 1164 1169            root   13u     IPv4              38496       0t0        TCP ProdigyX:49151->10.250.1.1:24007 (SYN_SENT)
glustersp 1164 1169            root   16u     IPv4              19066       0t0        TCP ProdigyX:24007->10.250.1.1:49151 (ESTABLISHED)
glustergd 1164 1171            root   13u     IPv4              38496       0t0        TCP ProdigyX:49151->10.250.1.1:24007 (SYN_SENT)
glustergd 1164 1171            root   16u     IPv4              19066       0t0        TCP ProdigyX:24007->10.250.1.1:49151 (ESTABLISHED)
glusterep 1164 1172            root   13u     IPv4              38496       0t0        TCP ProdigyX:49151->10.250.1.1:24007 (SYN_SENT)
glusterep 1164 1172            root   16u     IPv4              19066       0t0        TCP ProdigyX:24007->10.250.1.1:49151 (ESTABLISHED)
[root@ProdigyX ~]# ss -tpl | grep 49151
[root@ProdigyX ~]# ss -tpl | grep 49152
LISTEN     0      10         *:49152                    *:*                     users:(("glusterfsd",pid=1193,fd=11))
[root@ProdigyX ~]# gluster pool list
UUID                                    Hostname        State
xxx    10.250.1.1      Disconnected <-------------------------------------------------------------------------- other node
yyy    localhost       Connected
[root@ProdigyX ~]# sleep 60
[root@ProdigyX ~]# systemctl stop firewalld
[root@ProdigyX ~]# sleep 60
[root@ProdigyX ~]# gluster pool list
UUID                                    Hostname        State
xxx    10.250.1.1      Connected <----------------------------------------------------------------------------- sometimes it works by this point; sometimes glusterd also needs a restart to become Connected
yyy    localhost       Connected
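The mismatch visible in the paste can be checked mechanically: the dropped packets carry DPT=49151, while glusterfs.xml only opens 24007-24009, 38465-38469, and 49152-49664. A minimal sketch (log line abbreviated from the /var/log/messages paste above):

```shell
#!/usr/bin/env bash
# Pull the destination port out of a STATE_INVALID_DROP line (format as in the
# /var/log/messages paste above) and test it against the glusterfs.xml ranges.
line='Jan 30 13:42:11 ProdigyX kernel: STATE_INVALID_DROP: IN=bond1 SRC=10.250.1.1 DST=10.250.1.2 PROTO=TCP SPT=24007 DPT=49151'
dpt=$(printf '%s\n' "$line" | grep -o 'DPT=[0-9]*' | cut -d= -f2)
if { [ "$dpt" -ge 24007 ] && [ "$dpt" -le 24009 ]; } || \
   { [ "$dpt" -ge 38465 ] && [ "$dpt" -le 38469 ]; } || \
   { [ "$dpt" -ge 49152 ] && [ "$dpt" -le 49664 ]; }; then
  echo "port $dpt is covered by glusterfs.xml"
else
  echo "port $dpt is NOT covered by glusterfs.xml"
fi
```

For the logged packets this prints "port 49151 is NOT covered by glusterfs.xml", which is exactly the one-off-the-range symptom in the title.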

Comment 1 devianca 2018-01-30 15:45:21 UTC
To remind everyone: this is a firewall issue, I bet.

Nevertheless, here is some more info from a situation like this:

node1:
[root@ProdigyX ~]# gluster pool list
UUID                                    Hostname        State
xxx    10.250.1.1      Disconnected
yyy    localhost       Connected
[root@ProdigyX ~]# gluster volume info replica1
 
Volume Name: replica1
Type: Replicate
Volume ID: zzz
Status: Started
Snapshot Count: 0
Number of Bricks: 1 x 2 = 2
Transport-type: tcp
Bricks:
Brick1: 10.250.1.2:/array0/brick1
Brick2: 10.250.1.1:/raid0array1/brick2
Options Reconfigured:
client.bind-insecure: off
performance.client-io-threads: off
auth.allow: 10.250.1.1,10.250.1.2
transport.address-family: inet
nfs.disable: on
server.event-threads: 8
performance.io-thread-count: 64
performance.cache-size: 32MB
performance.write-behind-window-size: 64MB
server.allow-insecure: off
[root@ProdigyX ~]# gluster volume status replica1
Status of volume: replica1
Gluster process                             TCP Port  RDMA Port  Online  Pid
------------------------------------------------------------------------------
Brick 10.250.1.2:/array0/brick1             49152     0          Y       1193
Self-heal Daemon on localhost               N/A       N/A        Y       1176
 
Task Status of Volume replica1
------------------------------------------------------------------------------
There are no active volume tasks
 
[root@ProdigyX ~]# ps auxwww | grep gluster
root      1164  0.3  0.0 604644 10732 ?        Ssl  14:00   0:00 /usr/sbin/glusterd -p /var/run/glusterd.pid --log-level INFO
root      1176  0.0  0.0 593156  7644 ?        Ssl  14:00   0:00 /usr/sbin/glusterfs -s localhost --volfile-id gluster/glustershd -p /var/run/gluster/gluste shd/glustershd.pid -l /var/log/glusterfs/glustershd.log -S /var/run/gluster/6c565af3c4462a80c526b26f74d90dca.socket --xlator-option *replicate*.node-uuid=yyy
root      1193  0.0  0.0 1612800 16356 ?       Ssl  14:00   0:00 /usr/sbin/glusterfsd -s 10.250.1.2 --volfile-id replica1.10.250.1.2.array0-brick1 -p /var/run/gluster/vols/replica1/10.250.1.2-array0-brick1.pid -S /var/run/gluster/9329e7359a1938faf4767c564e490de5.socket --brick-name /array0/brick1 -l /var/log/glusterfs/bricks/array0-brick1.log --xlator-option *-posix.glusterd-uuid=yyy --brick-port 49152 --xlator-option replica1-server.listen-port=49152
root      2171  0.1  0.0 550584  9496 ?        Ssl  14:01   0:00 /usr/sbin/glusterfs --volfile-server=127.0.0.1 --volfile-id=/replica1 /gluster
root      2343  0.0  0.0 112676   972 pts/0    S+   14:01   0:00 grep --color=auto gluster

node2:
[root@BUNKER ~]# gluster pool list
UUID                                    Hostname        State
yyy    10.250.1.2      Connected
xxx    localhost       Connected
You have new mail in /var/spool/mail/root
[root@BUNKER ~]# gluster volume info replica1
 
Volume Name: replica1
Type: Replicate
Volume ID: zzz
Status: Started
Snapshot Count: 0
Number of Bricks: 1 x 2 = 2
Transport-type: tcp
Bricks:
Brick1: 10.250.1.2:/array0/brick1
Brick2: 10.250.1.1:/raid0array1/brick2
Options Reconfigured:
server.allow-insecure: off
performance.write-behind-window-size: 64MB
performance.cache-size: 32MB
performance.io-thread-count: 64
server.event-threads: 8
nfs.disable: on
transport.address-family: inet
auth.allow: 10.250.1.1,10.250.1.2
performance.client-io-threads: off
client.bind-insecure: off
[root@BUNKER ~]# gluster volume status replica1
Status of volume: replica1
Gluster process                             TCP Port  RDMA Port  Online  Pid
------------------------------------------------------------------------------
Brick 10.250.1.2:/array0/brick1             49152     0          Y       1193
Brick 10.250.1.1:/raid0array1/brick2        49152     0          Y       1400
Self-heal Daemon on localhost               N/A       N/A        Y       1503
Self-heal Daemon on 10.250.1.2              N/A       N/A        Y       1176
 
Task Status of Volume replica1
------------------------------------------------------------------------------
There are no active volume tasks
 
[root@BUNKER ~]# ps auxwww |grep gluster
root       869  0.0  0.2 604652  9560 ?        Ssl  12:35   0:01 /usr/sbin/glusterd -p /var/run/glusterd.pid --log-level INFO
root      1400  0.0  0.5 1416192 20312 ?       Ssl  12:35   0:00 /usr/sbin/glusterfsd -s 10.250.1.1 --volfile-id replica1.10.250.1.1.raid0array1-brick2 -p /var/run/gluster/vols/replica1/10.250.1.1-raid0array1-brick2.pid -S /var/run/gluster/b1821a9027697b21620a8c5abc7a8fb9.socket --brick-name /raid0array1/brick2 -l /var/log/glusterfs/bricks/raid0array1-brick2.log --xlator-option *-posix.glusterd-uuid=xxx --brick-port 49152 --xlator-option replica1-server.listen-port=49152
root      1503  0.0  0.4 683312 17008 ?        Ssl  12:39   0:00 /usr/sbin/glusterfs -s localhost --volfile-id gluster/glustershd -p /var/run/gluster/glustershd/glustershd.pid -l /var/log/glusterfs/glustershd.log -S /var/run/gluster/941dd362146b7df8e34dbccb220962e5.socket --xlator-option *replicate*.node-uuid=xxx
root      1918  0.0  0.0 112680   984 pts/0    S+   14:04   0:00 grep --color=auto gluster

Comment 2 devianca 2018-01-30 15:57:07 UTC
Created attachment 1388511 [details]
both nodes when node2 got stuck (both logs are 'fresh')

both nodes when node2 got stuck (both logs are 'fresh')

Comment 3 devianca 2018-01-30 15:59:51 UTC
Created attachment 1388512 [details]
statedump from node2 when node2 is stuck

statedump from node2 when node2 is stuck

Comment 4 Atin Mukherjee 2018-01-30 16:01:06 UTC
Please note that from the ps output I can clearly see that the brick port (listen-port) allocated by glusterd is 49152.

root      1400  0.0  0.5 1416192 20312 ?       Ssl  12:35   0:00 /usr/sbin/glusterfsd -s 10.250.1.1 --volfile-id replica1.10.250.1.1.raid0array1-brick2 -p /var/run/gluster/vols/replica1/10.250.1.1-raid0array1-brick2.pid -S /var/run/gluster/b1821a9027697b21620a8c5abc7a8fb9.socket --brick-name /raid0array1/brick2 -l /var/log/glusterfs/bricks/raid0array1-brick2.log --xlator-option *-posix.glusterd-uuid=xxx --brick-port 49152 --xlator-option replica1-server.listen-port=49152

I think what we're suspecting here is that a port used by the client to communicate with glusterd might be blocked.

Could you provide output of:

netstat -nap | grep 49151

Comment 5 devianca 2018-01-30 16:03:16 UTC
netstat is the last command:

[root@ProdigyX ~]# gluster pool list
UUID                                    Hostname        State
2f6697f4-2529-4072-910c-8862fdc43562    10.250.1.1      Disconnected
f7976943-b81a-4bb4-a1fb-06253bf064c4    localhost       Connected
[root@ProdigyX ~]# gluster peer status
Number of Peers: 1
 
Hostname: 10.250.1.1
Uuid: 2f6697f4-2529-4072-910c-8862fdc43562
State: Peer in Cluster (Disconnected)
[root@ProdigyX ~]# gluster volume status
Status of volume: replica1
Gluster process                             TCP Port  RDMA Port  Online  Pid
------------------------------------------------------------------------------
Brick 10.250.1.2:/array0/brick1             49152     0          Y       1192
Self-heal Daemon on localhost               N/A       N/A        Y       1175
 
Task Status of Volume replica1
------------------------------------------------------------------------------
There are no active volume tasks
 
[root@ProdigyX ~]# gluster volume info
 
Volume Name: replica1
Type: Replicate
Volume ID: 5331fac2-42b6-4530-bf79-1ec0236efbc4
Status: Started
Snapshot Count: 0
Number of Bricks: 1 x 2 = 2
Transport-type: tcp
Bricks:
Brick1: 10.250.1.2:/array0/brick1
Brick2: 10.250.1.1:/raid0array1/brick2
Options Reconfigured:
client.bind-insecure: off
performance.client-io-threads: off
auth.allow: 10.250.1.1,10.250.1.2
transport.address-family: inet
nfs.disable: on
server.event-threads: 8
performance.io-thread-count: 64
performance.cache-size: 32MB
performance.write-behind-window-size: 64MB
server.allow-insecure: off
[root@ProdigyX ~]# netstat -nap | grep 49151
tcp        0      1 10.250.1.2:49151        10.250.1.1:24007        SYN_SENT    1163/glusterd
tcp        0      0 10.250.1.2:24007        10.250.1.1:49151        ESTABLISHED 1163/glusterd

Comment 6 devianca 2018-01-30 16:05:29 UTC
Without rebooting either node, the same commands on node1:

[root@BUNKER ~]# gluster pool list
UUID                                    Hostname        State
f7976943-b81a-4bb4-a1fb-06253bf064c4    10.250.1.2      Connected
2f6697f4-2529-4072-910c-8862fdc43562    localhost       Connected
[root@BUNKER ~]# gluster peer status
Number of Peers: 1

Hostname: 10.250.1.2
Uuid: f7976943-b81a-4bb4-a1fb-06253bf064c4
State: Peer in Cluster (Connected)
[root@BUNKER ~]# gluster volume status
Status of volume: replica1
Gluster process                             TCP Port  RDMA Port  Online  Pid
------------------------------------------------------------------------------
Brick 10.250.1.2:/array0/brick1             49152     0          Y       1192
Brick 10.250.1.1:/raid0array1/brick2        49152     0          Y       1395
Self-heal Daemon on localhost               N/A       N/A        Y       1147
Self-heal Daemon on 10.250.1.2              N/A       N/A        Y       1175

Task Status of Volume replica1
------------------------------------------------------------------------------
There are no active volume tasks

[root@BUNKER ~]# gluster volume info

Volume Name: replica1
Type: Replicate
Volume ID: 5331fac2-42b6-4530-bf79-1ec0236efbc4
Status: Started
Snapshot Count: 0
Number of Bricks: 1 x 2 = 2
Transport-type: tcp
Bricks:
Brick1: 10.250.1.2:/array0/brick1
Brick2: 10.250.1.1:/raid0array1/brick2
Options Reconfigured:
client.bind-insecure: off
performance.client-io-threads: off
auth.allow: 10.250.1.1,10.250.1.2
transport.address-family: inet
nfs.disable: on
server.event-threads: 8
performance.io-thread-count: 64
performance.cache-size: 32MB
performance.write-behind-window-size: 64MB
server.allow-insecure: off
[root@BUNKER ~]# netstat -nap | grep 49151
tcp        0      0 10.250.1.1:24007        10.250.1.2:49151        SYN_RECV    -
tcp        0      0 10.250.1.1:49151        10.250.1.2:24007        ESTABLISHED 870/glusterd

Comment 7 devianca 2018-01-30 16:06:39 UTC
I meant node2* here

Comment 8 Atin Mukherjee 2018-02-01 11:36:12 UTC
Based on the netstat output shared (mentioned below)
[root@BUNKER ~]# netstat -nap | grep 49151
tcp        0      0 10.250.1.1:24007        10.250.1.2:49151        SYN_RECV    -
tcp        0      0 10.250.1.1:49151        10.250.1.2:24007        ESTABLISHED 870/glusterd

49151 looks to be the outbound port glusterd has used to communicate with other glusterd instances. In our deployments we have seen this outbound port stay within the privileged port range, but in this particular case it picked 49151, and since 49151 is blocked, none of the communication from glusterd at 10.250.1.1 to the other glusterd instances is happening, which is why you see nodes stuck in the CLI command.

Just a wild guess: was that port initially unblocked, and then blocked after glusterd came up?

Comment 9 devianca 2018-02-01 11:53:04 UTC
No. If anything, the exact opposite happened:

0. format
1. made systemd-networkd interconnect interfaces
- all ports blocked
2. made new "gluster" zone
3. added glusterfs.xml to "gluster" zone
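The zone setup in steps 2-3 above could be sketched with firewall-cmd like this (zone name 'gluster' and interface 'bond1' are taken from this report; these are not necessarily the reporter's exact commands):

```shell
# Hedged sketch of the setup described in steps 2-3: create the zone, open the
# whole glusterfs service in it, and bind the interconnect interface to it.
firewall-cmd --permanent --new-zone=gluster
firewall-cmd --permanent --zone=gluster --add-service=glusterfs
firewall-cmd --permanent --zone=gluster --add-interface=bond1
firewall-cmd --reload
```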

Thanks for staying and helping!

Comment 10 devianca 2018-02-02 22:46:00 UTC
What's the progress?

More logs, check uptime, firewalld+glusterd start times, pool list:

Node1:
[root@ProdigyX ~]# uptime
 23:42:48 up 3 days,  6:50,  1 user,  load average: 0,01, 0,03, 0,05
[root@ProdigyX ~]# systemctl status firewalld --no-pager --full
● firewalld.service - firewalld - dynamic firewall daemon
   Loaded: loaded (/usr/lib/systemd/system/firewalld.service; enabled; vendor preset: enabled)
   Active: active (running) since tor 2018-01-30 16:52:39 CET; 3 days ago
     Docs: man:firewalld(1)
 Main PID: 859 (firewalld)
   CGroup: /system.slice/firewalld.service
           └─859 /usr/bin/python -Es /usr/sbin/firewalld --nofork --nopid --debug=10

jan 30 16:52:38 ProdigyX systemd[1]: Starting firewalld - dynamic firewall daemon...
jan 30 16:52:39 ProdigyX systemd[1]: Started firewalld - dynamic firewall daemon.
jan 30 16:52:40 ProdigyX firewalld[859]: WARNING: ICMP type 'beyond-scope' is not supported by the kernel for ipv6.
jan 30 16:52:40 ProdigyX firewalld[859]: WARNING: beyond-scope: INVALID_ICMPTYPE: No supported ICMP type., ignoring for run-time.
jan 30 16:52:40 ProdigyX firewalld[859]: WARNING: ICMP type 'failed-policy' is not supported by the kernel for ipv6.
jan 30 16:52:40 ProdigyX firewalld[859]: WARNING: failed-policy: INVALID_ICMPTYPE: No supported ICMP type., ignoring for run-time.
jan 30 16:52:40 ProdigyX firewalld[859]: WARNING: ICMP type 'reject-route' is not supported by the kernel for ipv6.
jan 30 16:52:40 ProdigyX firewalld[859]: WARNING: reject-route: INVALID_ICMPTYPE: No supported ICMP type., ignoring for run-time.
[root@ProdigyX ~]# systemctl status glusterd --no-pager --full
● glusterd.service - GlusterFS, a clustered file-system server
   Loaded: loaded (/etc/systemd/system/glusterd.service; enabled; vendor preset: disabled)
  Drop-In: /etc/systemd/system/glusterd.service.d
           └─override.conf
   Active: active (running) since tor 2018-01-30 16:53:19 CET; 3 days ago
 Main PID: 1163 (glusterd)
   CGroup: /system.slice/glusterd.service
           ├─1163 /usr/sbin/glusterd -p /var/run/glusterd.pid --log-level INFO
           ├─1175 /usr/sbin/glusterfs -s localhost --volfile-id gluster/glustershd -p /var/run/gluster/glustershd/glustershd.pid -l /var/log/glusterfs/glustershd.log -S /var/run/gluster/6c565af3c4462a80c526b26f74d90dca.socket --xlator-option *replicate*.node-uuid=f7976943-b81a-4bb4-a1fb-06253bf064c4
           └─1192 /usr/sbin/glusterfsd -s 10.250.1.2 --volfile-id replica1.10.250.1.2.array0-brick1 -p /var/run/gluster/vols/replica1/10.250.1.2-array0-brick1.pid -S /var/run/gluster/9329e7359a1938faf4767c564e490de5.socket --brick-name /array0/brick1 -l /var/log/glusterfs/bricks/array0-brick1.log --xlator-option *-posix.glusterd-uuid=f7976943-b81a-4bb4-a1fb-06253bf064c4 --brick-port 49152 --xlator-option replica1-server.listen-port=49152

jan 30 16:52:43 ProdigyX systemd[1]: Starting GlusterFS, a clustered file-system server...
jan 30 16:52:48 ProdigyX bash[1174]: Local brick online.
jan 30 16:53:19 ProdigyX systemd[1]: Started GlusterFS, a clustered file-system server.
[root@ProdigyX ~]# gluster pool list
UUID                                    Hostname        State
2f6697f4-2529-4072-910c-8862fdc43562    10.250.1.1      Disconnected
f7976943-b81a-4bb4-a1fb-06253bf064c4    localhost       Connected
[root@ProdigyX ~]# gluster peer status
Number of Peers: 1

Hostname: 10.250.1.1
Uuid: 2f6697f4-2529-4072-910c-8862fdc43562
State: Peer in Cluster (Disconnected)
[root@ProdigyX ~]# gluster volume status
Status of volume: replica1
Gluster process                             TCP Port  RDMA Port  Online  Pid
------------------------------------------------------------------------------
Brick 10.250.1.2:/array0/brick1             49152     0          Y       1192
Self-heal Daemon on localhost               N/A       N/A        Y       1175

Task Status of Volume replica1
------------------------------------------------------------------------------
There are no active volume tasks

[root@ProdigyX ~]# netstat -nap | grep 49151
tcp        0      1 10.250.1.2:49151        10.250.1.1:24007        SYN_SENT    1163/glusterd
tcp        0      0 10.250.1.2:24007        10.250.1.1:49151        ESTABLISHED 1163/glusterd

Node2:
[root@BUNKER ~]# uptime
 23:42:08 up 3 days,  6:54,  1 user,  load average: 0,00, 0,01, 0,05
[root@BUNKER ~]# systemctl status firewalld --no-pager --full
● firewalld.service - firewalld - dynamic firewall daemon
   Loaded: loaded (/usr/lib/systemd/system/firewalld.service; enabled; vendor preset: enabled)
   Active: active (running) since tor 2018-01-30 16:47:48 CET; 3 days ago
     Docs: man:firewalld(1)
 Main PID: 670 (firewalld)
   CGroup: /system.slice/firewalld.service
           └─670 /usr/bin/python -Es /usr/sbin/firewalld --nofork --nopid

jan 30 16:47:47 BUNKER systemd[1]: Starting firewalld - dynamic firewall daemon...
jan 30 16:47:48 BUNKER systemd[1]: Started firewalld - dynamic firewall daemon.
jan 30 16:47:49 BUNKER firewalld[670]: WARNING: ICMP type 'beyond-scope' is not supported by the kernel for ipv6.
jan 30 16:47:49 BUNKER firewalld[670]: WARNING: beyond-scope: INVALID_ICMPTYPE: No supported ICMP type., ignoring for run-time.
jan 30 16:47:49 BUNKER firewalld[670]: WARNING: ICMP type 'failed-policy' is not supported by the kernel for ipv6.
jan 30 16:47:49 BUNKER firewalld[670]: WARNING: failed-policy: INVALID_ICMPTYPE: No supported ICMP type., ignoring for run-time.
jan 30 16:47:49 BUNKER firewalld[670]: WARNING: ICMP type 'reject-route' is not supported by the kernel for ipv6.
jan 30 16:47:49 BUNKER firewalld[670]: WARNING: reject-route: INVALID_ICMPTYPE: No supported ICMP type., ignoring for run-time.
[root@BUNKER ~]# systemctl status glusterd --no-pager --full
● glusterd.service - GlusterFS, a clustered file-system server
   Loaded: loaded (/etc/systemd/system/glusterd.service; enabled; vendor preset: disabled)
  Drop-In: /etc/systemd/system/glusterd.service.d
           └─override.conf
   Active: active (running) since tor 2018-01-30 16:47:52 CET; 3 days ago
 Main PID: 870 (glusterd)
   CGroup: /system.slice/glusterd.service
           ├─ 870 /usr/sbin/glusterd -p /var/run/glusterd.pid --log-level INFO
           ├─1147 /usr/sbin/glusterfs -s localhost --volfile-id gluster/glustershd -p /var/run/gluster/glustershd/glustershd.pid -l /var/log/glusterfs/glustershd.log -S /var/run/gluster/941dd362146b7df8e34dbccb220962e5.socket --xlator-option *replicate*.node-uuid=2f6697f4-2529-4072-910c-8862fdc43562
           └─1395 /usr/sbin/glusterfsd -s 10.250.1.1 --volfile-id replica1.10.250.1.1.raid0array1-brick2 -p /var/run/gluster/vols/replica1/10.250.1.1-raid0array1-brick2.pid -S /var/run/gluster/b1821a9027697b21620a8c5abc7a8fb9.socket --brick-name /raid0array1/brick2 -l /var/log/glusterfs/bricks/raid0array1-brick2.log --xlator-option *-posix.glusterd-uuid=2f6697f4-2529-4072-910c-8862fdc43562 --brick-port 49152 --xlator-option replica1-server.listen-port=49152

jan 30 16:47:50 BUNKER systemd[1]: Starting GlusterFS, a clustered file-system server...
jan 30 16:47:52 BUNKER systemd[1]: Started GlusterFS, a clustered file-system server.
[root@BUNKER ~]# gluster pool list
UUID                                    Hostname        State
f7976943-b81a-4bb4-a1fb-06253bf064c4    10.250.1.2      Connected
2f6697f4-2529-4072-910c-8862fdc43562    localhost       Connected
[root@BUNKER ~]# gluster peer status
Number of Peers: 1

Hostname: 10.250.1.2
Uuid: f7976943-b81a-4bb4-a1fb-06253bf064c4
State: Peer in Cluster (Connected)
[root@BUNKER ~]# gluster volume status
Status of volume: replica1
Gluster process                             TCP Port  RDMA Port  Online  Pid
------------------------------------------------------------------------------
Brick 10.250.1.2:/array0/brick1             49152     0          Y       1192
Brick 10.250.1.1:/raid0array1/brick2        49152     0          Y       1395
Self-heal Daemon on localhost               N/A       N/A        Y       1147
Self-heal Daemon on 10.250.1.2              N/A       N/A        Y       1175

Task Status of Volume replica1
------------------------------------------------------------------------------
There are no active volume tasks

[root@BUNKER ~]# netstat -nap | grep 49151
tcp        0      0 10.250.1.1:24007        10.250.1.2:49151        SYN_RECV    -
tcp        0      0 10.250.1.1:49151        10.250.1.2:24007        ESTABLISHED 870/glusterd

Comment 11 Atin Mukherjee 2018-02-05 14:48:44 UTC
I haven't had time to root-cause this yet. I'll pass this to my colleague Gaurav Yadav to take a look. If this issue is blocking you from moving forward, could you just open port 49151 for now? We shall come back on this soon.
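The temporary workaround suggested here could look like the following sketch (the zone name 'gluster' comes from the reporter's setup):

```shell
# Workaround sketch: open TCP 49151 in the 'gluster' zone until the root
# cause is fixed, then reload so the permanent rule takes effect.
firewall-cmd --permanent --zone=gluster --add-port=49151/tcp
firewall-cmd --reload
```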

Gaurav - could you take a look at this?

Comment 12 Kaleb KEITHLEY 2018-02-08 13:11:35 UTC
In my case it's SELinux that appears to be blocking the bind() on 49151. I can start the volume if I disable SELinux. (I only tested a single-brick volume on localhost, so I would not be affected by firewalld.)

Comment 13 Atin Mukherjee 2018-02-08 13:51:53 UTC
(In reply to Kaleb KEITHLEY from comment #12)
> In my case it's that SELinux appears to be blocking the bind() on 49151. I
> can start the volume if I disable SELinux. (I only tested on a single brick
> volume on the localhost so I would not be affected by the firewalld.)

You mean 49151 was allocated as a brick port? That can never happen! In glusterd's pmap the base port is set to 49152, and that is where brick port allocation starts. As per my analysis in comment 8, it was the outbound port, which glusterd uses to send requests to other glusterd instances, that was blocked.

I couldn't reproduce this problem on my Fedora 27 system with SELinux enabled.

[root@atinmu glusterfs]# sestatus
SELinux status:                 enabled
SELinuxfs mount:                /sys/fs/selinux
SELinux root directory:         /etc/selinux
Loaded policy name:             targeted
Current mode:                   enforcing
Mode from config file:          enforcing
Policy MLS status:              enabled
Policy deny_unknown status:     allowed
Memory protection checking:     actual (secure)
Max kernel policy version:      31
[root@atinmu glusterfs]# 
[root@atinmu glusterfs]# glusterd
[root@atinmu glusterfs]# cat /etc/fedora-release
Fedora release 27 (Twenty Seven)


[root@atinmu glusterfs]# gluster v create test-vol x.x.x.x:/tmp/b{1..4} force
volume create: test-vol: success: please start the volume to access data
[root@atinmu glusterfs]# gluster v start test-vol
volume start: test-vol: success
[root@atinmu glusterfs]# gluster v status
Status of volume: test-vol
Gluster process                             TCP Port  RDMA Port  Online  Pid
------------------------------------------------------------------------------
Brick 192.168.0.106:/tmp/b1                 49152     0          Y       22410
Brick 192.168.0.106:/tmp/b2                 49153     0          Y       22431
Brick 192.168.0.106:/tmp/b3                 49154     0          Y       22452
Brick 192.168.0.106:/tmp/b4                 49155     0          Y       22473
 
Task Status of Volume test-vol
------------------------------------------------------------------------------
There are no active volume tasks
 
[root@atinmu glusterfs]# netstat -nap | grep 49151
tcp        0      0 x.x.x.x:24007     x.x.x.x:49151     ESTABLISHED 21932/glusterd      
tcp        0      0 x.x.x.x:49151     x.x.x.x:24007     ESTABLISHED 22410/glusterfsd   

x.x.x.x is the same IP throughout (redacted)

Comment 14 Kaleb KEITHLEY 2018-02-09 13:17:28 UTC
From your own netstat at the end of comment 13:

>[root@atinmu glusterfs]# netstat -nap | grep 49151
> tcp        0      0 x.x.x.x:24007     x.x.x.x:49151     ESTABLISHED 21932/glusterd      
> tcp        0      0 x.x.x.x:49151     x.x.x.x:24007     ESTABLISHED 22410/glusterfsd   
                      ^^^^^^^^^^^^^

SELinux is denying the bind() on port 49151 by glusterfsd running locally.

SELinux is also denying the mkdir_p of (/var)/run/glusterd/snaps by glusterd.

This is on Fedora 28 (rawhide) and RHEL 7+1.
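For anyone reproducing this, AVC denials of this kind end up in the audit log; on a live node `ausearch -m AVC -ts recent` lists them. A minimal sketch of what such a record might look like and how to pull the port out of it (the AVC line below is hand-written for illustration, not captured from this bug):

```shell
# Hand-written example of an SELinux AVC denial for glusterfsd binding a
# port; real records live in /var/log/audit/audit.log.
avc='type=AVC msg=audit(1518100000.123:456): avc:  denied  { name_bind } for  pid=22410 comm="glusterfsd" src=49151 scontext=system_u:system_r:glusterd_t:s0 tcontext=system_u:object_r:unreserved_port_t:s0 tclass=tcp_socket'

# Pull out the denied port and the acting process.
port=$(grep -o 'src=[0-9]*' <<<"$avc" | cut -d= -f2)
comm=$(grep -o 'comm="[^"]*"' <<<"$avc" | cut -d'"' -f2)
echo "denied: $comm tried to bind port $port"
# -> denied: glusterfsd tried to bind port 49151
```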

Comment 15 Kaleb KEITHLEY 2018-02-09 13:41:04 UTC
correction:
> selinux is also denying the mkdir_p of (/var)/run/gluster/snaps by glusterd

Comment 16 Gaurav Yadav 2018-02-09 17:39:00 UTC
I tried to replicate the issue in my setup by creating a new zone and assigning the glusterfs service to the newly created zone.

Here is my cluster state:
Node1
[root@localhost glusterfs]# gluster peer status
Number of Peers: 1

Hostname: 192.168.122.120
Uuid: 695ce317-e35b-4f4a-b85d-326df13308fc
State: Accepted peer request (Connected)
[root@localhost glusterfs]# 

Node2
[root@localhost ~]# gluster peer status
Number of Peers: 1

Hostname: 192.168.122.79
Uuid: a3fa5d12-2614-4a52-a85c-a96d33bb116d
State: Accepted peer request (Disconnected)

Node1
[root@localhost glusterfs]#  netstat -tap | grep gluster
tcp        0      0 0.0.0.0:24007           0.0.0.0:*               LISTEN      14180/glusterd      
tcp        0      0 localhost.localdo:49151 192.168.122.120:24007   ESTABLISHED 14180/glusterd

Node2
[root@localhost gaurav]# netstat -tap | grep gluster
tcp        0      0 0.0.0.0:24007           0.0.0.0:*               LISTEN      25958/glusterd      
tcp        0      0 localhost.localdo:24007 192.168.122.79:49151    ESTABLISHED 25958/glusterd      
tcp        0      1 localhost.localdo:49151 192.168.122.79:24007    SYN_SENT    25958/glusterd

I too suspect it's a firewall issue; in order to confirm it, I need some info from your setup.

Please provide the output of the commands below:
1. firewall-cmd --list-ports --zone=gluster
2. firewall-cmd --list-services --zone=gluster

Comment 17 devianca 2018-02-09 21:06:49 UTC
Thanks for helping! I'm more tired of googling "gluster 49151" than words can describe.

[root@ProdigyX ~]# gluster pool list
UUID                                    Hostname        State
2f6697f4-2529-4072-910c-8862fdc43562    10.250.1.1      Disconnected
f7976943-b81a-4bb4-a1fb-06253bf064c4    localhost       Connected
[root@ProdigyX ~]# gluster peer status
Number of Peers: 1

Hostname: 10.250.1.1
Uuid: 2f6697f4-2529-4072-910c-8862fdc43562
State: Peer in Cluster (Disconnected)
[root@ProdigyX ~]# firewall-cmd --list-ports --zone=gluster

[root@ProdigyX ~]# firewall-cmd --list-services --zone=gluster
glusterfs
[root@ProdigyX ~]# netstat -tap | grep gluster
tcp        0      0 0.0.0.0:49152           0.0.0.0:*               LISTEN      1194/glusterfsd
tcp        0      0 0.0.0.0:24007           0.0.0.0:*               LISTEN      1165/glusterd
tcp        0      0 ProdigyX:24007          ProdigyX:49149          ESTABLISHED 1165/glusterd
tcp        0      0 localhost:24007         localhost:49150         ESTABLISHED 1165/glusterd
tcp        0      0 ProdigyX:49152          ProdigyX:1016           ESTABLISHED 1194/glusterfsd
tcp        0      0 ProdigyX:1019           10.250.1.1:49152        ESTABLISHED 1177/glusterfs
tcp        0      1 ProdigyX:49151          10.250.1.1:24007        SYN_SENT    1165/glusterd
tcp        0      0 localhost:49150         localhost:24007         ESTABLISHED 1177/glusterfs
tcp        0      0 ProdigyX:1016           ProdigyX:49152          ESTABLISHED 2038/glusterfs
tcp        0      0 localhost:24007         localhost:49148         ESTABLISHED 1165/glusterd
tcp        0      0 ProdigyX:24007          10.250.1.1:49151        ESTABLISHED 1165/glusterd
tcp        0      0 ProdigyX:1015           10.250.1.1:49152        ESTABLISHED 2038/glusterfs
tcp        0      0 ProdigyX:49149          ProdigyX:24007          ESTABLISHED 1194/glusterfsd
tcp        0      0 ProdigyX:1020           ProdigyX:49152          ESTABLISHED 1177/glusterfs
tcp        0      0 localhost:49148         localhost:24007         ESTABLISHED 2038/glusterfs
tcp        0      0 ProdigyX:49152          10.250.1.1:exp2         ESTABLISHED 1194/glusterfsd
tcp        0      0 ProdigyX:49152          ProdigyX:1020           ESTABLISHED 1194/glusterfsd

Comment 18 devianca 2018-02-09 21:09:03 UTC
Also, to Gaurav: finally someone with this bug besides me. Did you do anything special to replicate it? What do you think the problem is? It can't be the firewall, because gluster should not touch port 49151, or am I horribly mistaken?

Comment 19 Gaurav Yadav 2018-02-11 18:13:24 UTC
For me it was a firewall issue; on one of the nodes the firewall was blocking the connection.
Now I am able to perform the full set of operations.

N1
firewall-cmd --get-active-zones
gluster
  interfaces: ens3
N2
firewall-cmd --get-active-zones
gluster
  interfaces: ens3

firewall-cmd --permanent --info-zone=gluster
gluster
  target: default
  icmp-block-inversion: no
  interfaces: 
  sources: 
  services: glusterfs
  ports: 22/tcp
  protocols: 
  masquerade: no
  forward-ports: 
  sourceports: 
  icmp-blocks: 
  rich rules: 

gluster pool list
UUID					Hostname       	State
b38f67e6-e67b-4e0b-834f-87df268b6243	192.168.122.120	Connected 
9d39dcde-cf55-4f8b-b889-4240fb4b72c8	localhost      	Connected 

[root@localhost glusterfs]# gluster peer status
Number of Peers: 1

Hostname: 192.168.122.79
Uuid: 9d39dcde-cf55-4f8b-b889-4240fb4b72c8
State: Peer in Cluster (Connected)

[root@localhost glusterfs]# gluster peer status
Number of Peers: 1

Hostname: 192.168.122.120
Uuid: b38f67e6-e67b-4e0b-834f-87df268b6243
State: Peer in Cluster (Connected)

[root@localhost glusterfs]# netstat -nap | grep gluster
tcp        0      0 0.0.0.0:49152           0.0.0.0:*               LISTEN      10825/glusterfsd    
tcp        0      0 0.0.0.0:24007           0.0.0.0:*               LISTEN      10677/glusterd      
tcp        0      0 192.168.122.120:24007   192.168.122.120:49150   ESTABLISHED 10677/glusterd      
tcp        0      0 192.168.122.120:49150   192.168.122.120:24007   ESTABLISHED 10825/glusterfsd    
tcp        0      0 192.168.122.120:24007   192.168.122.79:49151    ESTABLISHED 10677/glusterd      
tcp        0      0 192.168.122.120:49151   192.168.122.79:24007    ESTABLISHED 10677/glusterd      
unix  2      [ ACC ]     STREAM     LISTENING     75338    10677/glusterd       /var/run/glusterd.socket
unix  2      [ ACC ]     STREAM     LISTENING     94304    10825/glusterfsd     /var/run/gluster/changelog-93b92d74198dc111ed4be6a4e0d52563.sock
unix  2      [ ACC ]     STREAM     LISTENING     75696    10825/glusterfsd     /var/run/gluster/6f4035c6b1db1fa531293ed9e2b3891a.socket
unix  3      [ ]         STREAM     CONNECTED     94316    10825/glusterfsd     /var/run/gluster/6f4035c6b1db1fa531293ed9e2b3891a.socket
unix  3      [ ]         STREAM     CONNECTED     94315    10677/glusterd       
unix  2      [ ]         DGRAM                    74473    10677/glusterd       
unix  2      [ ]         DGRAM                    77064    10825/glusterfsd 


Port allocation for the various processes:

1023 - 1       : Port range for clients, if bind-secure is turned on
49151 - 1024   : Port range for clients, if bind-insecure is on
                   (falls back to this range if the ports above exhaust)
49152 - 65535  : Port range for bricks
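The table above can be sketched as a tiny classifier (my paraphrase of the documented policy, not gluster's actual source code):

```shell
# Classify a TCP port according to the gluster port policy described above.
classify_port() {
    local port=$1
    if (( port >= 49152 )); then
        echo "brick"
    elif (( port >= 1024 )); then
        echo "client (bind-insecure fallback)"
    elif (( port >= 1 )); then
        echo "client (bind-secure)"
    else
        echo "invalid"
    fi
}

classify_port 1019    # -> client (bind-secure)
classify_port 49151   # -> client (bind-insecure fallback)
classify_port 49152   # -> brick
```

This matches the netstat output in this bug: the glusterfs mounts connect from ports just below 1024, while 49151 is the topmost client-range port, one below the first brick port.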

Comment 20 devianca 2018-02-12 09:06:45 UTC
1. In my situation both nodes have only 1 brick, so why should the ports get exhausted?

2. Are you indicating that port 49151 is perfectly normal? If so, why isn't that range added to glusterfs.xml?

3. How do I fix this now? Manually open the whole range 1024-49151, even though both nodes have only 1 brick and they are the only clients connecting?

Comment 21 devianca 2018-02-12 09:40:33 UTC
Both *were already* set; check https://bugzilla.redhat.com/show_bug.cgi?id=1540249#c1

Now I have also disabled both client mounts, and I am still getting Disconnected:

node1:

[root@ProdigyX ~]# uptime
 10:37:35 up 11 min,  1 user,  load average: 0,00, 0,01, 0,03
[root@ProdigyX ~]# systemctl status gluster.mount
● gluster.mount - Mount Gluster
   Loaded: loaded (/etc/systemd/system/gluster.mount; disabled; vendor preset: disabled)
   Active: inactive (dead)
    Where: /gluster
     What: 127.0.0.1:/replica1
[root@ProdigyX ~]# gluster pool list
UUID                                    Hostname        State
2f6697f4-2529-4072-910c-8862fdc43562    10.250.1.1      Disconnected
f7976943-b81a-4bb4-a1fb-06253bf064c4    localhost       Connected
[root@ProdigyX ~]# gluster peer status
Number of Peers: 1

Hostname: 10.250.1.1
Uuid: 2f6697f4-2529-4072-910c-8862fdc43562
State: Peer in Cluster (Disconnected)
[root@ProdigyX ~]# gluster volume status
Status of volume: replica1
Gluster process                             TCP Port  RDMA Port  Online  Pid
------------------------------------------------------------------------------
Brick 10.250.1.2:/array0/brick1             49152     0          Y       1390
Self-heal Daemon on localhost               N/A       N/A        Y       1373

Task Status of Volume replica1
------------------------------------------------------------------------------
There are no active volume tasks

[root@ProdigyX ~]# gluster volume info

Volume Name: replica1
Type: Replicate
Volume ID: 5331fac2-42b6-4530-bf79-1ec0236efbc4
Status: Started
Snapshot Count: 0
Number of Bricks: 1 x 2 = 2
Transport-type: tcp
Bricks:
Brick1: 10.250.1.2:/array0/brick1
Brick2: 10.250.1.1:/raid0array1/brick2
Options Reconfigured:
client.bind-insecure: off
performance.client-io-threads: off
auth.allow: 10.250.1.1,10.250.1.2
transport.address-family: inet
nfs.disable: on
server.event-threads: 8
performance.io-thread-count: 64
performance.cache-size: 32MB
performance.write-behind-window-size: 64MB
server.allow-insecure: off
[root@ProdigyX ~]# netstat -tap | grep gluster
tcp        0      0 0.0.0.0:49152           0.0.0.0:*               LISTEN      1390/glusterfsd
tcp        0      0 0.0.0.0:24007           0.0.0.0:*               LISTEN      1361/glusterd
tcp        0      0 ProdigyX:1019           10.250.1.1:49152        ESTABLISHED 1373/glusterfs
tcp        0      0 ProdigyX:1020           ProdigyX:49152          ESTABLISHED 1373/glusterfs
tcp        0      0 ProdigyX:49152          10.250.1.1:exp2         ESTABLISHED 1390/glusterfsd
tcp        0      0 ProdigyX:24007          10.250.1.1:49151        ESTABLISHED 1361/glusterd
tcp        0      1 ProdigyX:49151          10.250.1.1:24007        SYN_SENT    1361/glusterd
tcp        0      0 ProdigyX:24007          ProdigyX:49149          ESTABLISHED 1361/glusterd
tcp        0      0 ProdigyX:49152          ProdigyX:1020           ESTABLISHED 1390/glusterfsd
tcp        0      0 ProdigyX:49149          ProdigyX:24007          ESTABLISHED 1390/glusterfsd
tcp        0      0 localhost:49150         localhost:24007         ESTABLISHED 1373/glusterfs
tcp        0      0 localhost:24007         localhost:49150         ESTABLISHED 1361/glusterd

node2:

[root@BUNKER ~]# uptime
 10:37:34 up 26 min,  1 user,  load average: 0,00, 0,01, 0,05
[root@BUNKER ~]# systemctl status gluster.mount
● gluster.mount - Mount Gluster
   Loaded: loaded (/etc/systemd/system/gluster.mount; disabled; vendor preset: disabled)
   Active: inactive (dead)
    Where: /gluster
     What: 127.0.0.1:/replica1
[root@BUNKER ~]# gluster pool list
UUID                                    Hostname        State
f7976943-b81a-4bb4-a1fb-06253bf064c4    10.250.1.2      Connected
2f6697f4-2529-4072-910c-8862fdc43562    localhost       Connected
[root@BUNKER ~]# gluster peer status
Number of Peers: 1

Hostname: 10.250.1.2
Uuid: f7976943-b81a-4bb4-a1fb-06253bf064c4
State: Peer in Cluster (Connected)
[root@BUNKER ~]# gluster volume status
Status of volume: replica1
Gluster process                             TCP Port  RDMA Port  Online  Pid
------------------------------------------------------------------------------
Brick 10.250.1.2:/array0/brick1             49152     0          Y       1390
Brick 10.250.1.1:/raid0array1/brick2        49152     0          Y       1334
Self-heal Daemon on localhost               N/A       N/A        Y       1149
Self-heal Daemon on 10.250.1.2              N/A       N/A        Y       1373

Task Status of Volume replica1
------------------------------------------------------------------------------
There are no active volume tasks

[root@BUNKER ~]# gluster volume info

Volume Name: replica1
Type: Replicate
Volume ID: 5331fac2-42b6-4530-bf79-1ec0236efbc4
Status: Started
Snapshot Count: 0
Number of Bricks: 1 x 2 = 2
Transport-type: tcp
Bricks:
Brick1: 10.250.1.2:/array0/brick1
Brick2: 10.250.1.1:/raid0array1/brick2
Options Reconfigured:
client.bind-insecure: off
performance.client-io-threads: off
auth.allow: 10.250.1.1,10.250.1.2
transport.address-family: inet
nfs.disable: on
server.event-threads: 8
performance.io-thread-count: 64
performance.cache-size: 32MB
performance.write-behind-window-size: 64MB
server.allow-insecure: off
[root@BUNKER ~]# netstat -tap | grep gluster
tcp        0      0 0.0.0.0:49152           0.0.0.0:*               LISTEN      1334/glusterfsd
tcp        0      0 0.0.0.0:24007           0.0.0.0:*               LISTEN      866/glusterd
tcp        0      0 BUNKER:49152            10.250.1.2:1019         ESTABLISHED 1334/glusterfsd
tcp        0      0 BUNKER:exp2             10.250.1.2:49152        ESTABLISHED 1149/glusterfs
tcp        0      0 BUNKER:24007            BUNKER:49149            ESTABLISHED 866/glusterd
tcp        0      0 BUNKER:1020             BUNKER:49152            ESTABLISHED 1149/glusterfs
tcp        0      0 BUNKER:49151            10.250.1.2:24007        ESTABLISHED 866/glusterd
tcp        0      0 BUNKER:49152            BUNKER:1020             ESTABLISHED 1334/glusterfsd
tcp        0      0 localhost:49150         localhost:24007         ESTABLISHED 1149/glusterfs
tcp        0      0 localhost:24007         localhost:49150         ESTABLISHED 866/glusterd
tcp        0      0 BUNKER:49149            BUNKER:24007            ESTABLISHED 1334/glusterfsd

Comment 22 devianca 2018-02-12 09:45:37 UTC
For faster debugging, here's the script I'm using to catch it:

[root@ProdigyX ~]# systemctl cat glusterd
# /etc/systemd/system/glusterd.service
[Unit]
Description=GlusterFS, a clustered file-system server
Requires=rpcbind.service
After=network.target rpcbind.service
Before=network-online.target

[Service]
Type=forking
PIDFile=/var/run/glusterd.pid
LimitNOFILE=65536
Environment="LOG_LEVEL=INFO"
EnvironmentFile=-/etc/sysconfig/glusterd
ExecStart=/usr/sbin/glusterd -p /var/run/glusterd.pid  --log-level $LOG_LEVEL $GLUSTERD_OPTIONS
KillMode=process
SuccessExitStatus=15

[Install]
WantedBy=multi-user.target

# /etc/systemd/system/glusterd.service.d/override.conf
[Unit]
PartOf=array0.mount
Wants=systemd-networkd-wait-online.service
#Wants=gluster.mount
After=array0.mount
After=systemd-networkd-wait-online.service
#Before=gluster.mount
[Service]
ExecStartPost=/usr/bin/bash -c '_ip="10.250.1.2"; while :; do gluster volume status | grep "$_ip" | grep -q Y && break; sleep 0.5; done; echo "Local brick online."'
ExecStartPost=/usr/bin/bash /root/glusterd_health
StandardOutput=journal
Restart=on-failure
RestartSec=5s

[root@ProdigyX ~]# cat glusterd_health
#!/bin/bash
# Reboot-loop harness: while the peer shows Connected, log the state and
# reboot again; if a boot comes up stuck (peer Disconnected), the 30 s
# window expires and the node stays up with fresh logs for inspection.

echo "DATE: $(date +%G-%m-%d" "%H:%M:%S)" >> /root/glusterd_health.log

log_it() {
        gluster volume status >> /root/glusterd_health.log
        gluster pool list >> /root/glusterd_health.log
        # start the next boot with clean gluster logs
        rm -Rf /var/log/glusterfs/*
        sync
}

_ip="10.250.1.1"
_until=$(($(date +%s) + 30))
until [[ "$(date +%s)" -gt "$_until" ]]; do
        gluster pool list | grep "$_ip" | grep -q Connected && log_it && reboot
        sleep 0.5
done

Comment 23 devianca 2018-02-12 09:47:02 UTC
(In reply to devianca from comment #21)
> Both *were already* set, check
> https://bugzilla.redhat.com/show_bug.cgi?id=1540249#c1
> 
> Now, also disabled both clients, and still getting Disconnected,
> [remainder of comment 21 quoted verbatim; snipped]

Both *were already* set; I meant these:
server.allow-insecure: off
client.bind-insecure: off

Comment 24 Raghavendra Talur 2018-02-12 10:10:33 UTC
Hi,

Gaurav reproduced the same issue that you are hitting. Here is the explanation for it.

Gluster has some sockets in listening mode and some that are just clients.
For example, a glusterd process creates a LISTENING socket on port 24007 and also creates outgoing client connections using the connect() call. The glusterfs.xml that we provide *only* lists the ports that we bind to and LISTEN on.


In your setup we have

10.250.1.1     10.250.1.2
24007          24007
49151          49151

where 24007 are listening ports and 49151 are client ports. Now, 10.250.1.2 sent a packet from 49151 to 10.250.1.1:24007; that is what you see as SYN_SENT. It was also received by 10.250.1.1, and we see that as SYN_RECV. The log message below shows that when 10.250.1.1:24007 replied with a SYN_ACK, the reply was dropped:

Jan 30 13:42:11 ProdigyX kernel: STATE_INVALID_DROP: IN=bond1 OUT= MAC=26:bb:b5:40:75:92:68:05:ca:69:9e:fc:08:00 SRC=10.250.1.1 DST=10.250.1.2 LEN=60 TOS=0x00 PREC=0x00 TTL=64 ID=0 DF PROTO=TCP SPT=24007 DPT=49151 WINDOW=27438 RES=0x00 ACK SYN URGP=0

I think glusterfs.xml isn't applied on node 10.250.1.1. Please check once; maybe the zone isn't mapped to the interface.
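To make the direction of the drop explicit, the relevant fields of the log message above can be pulled apart (the line is abridged here to the fields we need; the `field` helper is illustrative, not a standard tool):

```shell
# Abridged fields of the STATE_INVALID_DROP log message quoted above.
line='STATE_INVALID_DROP: SRC=10.250.1.1 DST=10.250.1.2 PROTO=TCP SPT=24007 DPT=49151 ACK SYN'

# Extract one KEY=value field from the log line.
field() { grep -o "$1=[^ ]*" <<<"$line" | cut -d= -f2; }

echo "dropped reply: $(field SRC):$(field SPT) -> $(field DST):$(field DPT)"
# -> dropped reply: 10.250.1.1:24007 -> 10.250.1.2:49151
```

So the dropped packet is the ACK+SYN travelling from the listening glusterd (24007) back to the peer's outgoing client port (49151), exactly as described above.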

Comment 25 devianca 2018-02-12 13:38:21 UTC
Wow, thanks for the quick reply. Please check this and tell me if it's not mapped.

node1:

[root@ProdigyX ~]# networkctl
IDX LINK             TYPE               OPERATIONAL SETUP
  1 lo               loopback           carrier     unmanaged
  2 i210.0           ether              carrier     configured
  3 i210.1           ether              off         unmanaged
  4 i350.0           ether              routable    configured
  5 i350.1           ether              carrier     configured
  6 i350.2           ether              carrier     configured
  7 i350.3           ether              carrier     configured
  8 bond0            ether              off         unmanaged
  9 bond1            ether              routable    configured

9 links listed.
[root@ProdigyX ~]# networkctl status
●      State: routable
     Address: xxx on i350.0
              10.250.1.2 on bond1
     Gateway: xxx (MITAC INTERNATIONAL CORP.) on i350.0
         DNS: xxx
         NTP: xxx
[root@ProdigyX ~]# firewall-cmd --info-zone=gluster --permanent
gluster (active)
  target: default
  icmp-block-inversion: no
  interfaces: bond1
  sources:
  services: glusterfs
  ports:
  protocols:
  masquerade: no
  forward-ports:
  source-ports:
  icmp-blocks:
  rich rules:


node2:

[root@BUNKER ~]# networkctl
IDX LINK             TYPE               OPERATIONAL SETUP
  1 lo               loopback           carrier     unmanaged
  2 enp1s0           ether              carrier     configured
  3 enp2s0           ether              carrier     configured
  4 enp3s0           ether              carrier     configured
  5 enp4s0           ether              carrier     configured
  6 bond0            ether              routable    configured
  7 enp1s0.120       ether              routable    configured

7 links listed.
[root@BUNKER ~]# networkctl status
●      State: routable
     Address: 10.250.1.1 on bond0
              xxx on enp1s0.120
     Gateway: xxx (MITAC INTERNATIONAL CORP.) on enp1s0.120
         DNS: xxx
         NTP: xxx
[root@BUNKER ~]# firewall-cmd --info-zone=gluster --permanent
gluster (active)
  target: default
  icmp-block-inversion: no
  interfaces: bond0
  sources:
  services: glusterfs
  ports:
  protocols:
  masquerade: no
  forward-ports:
  source-ports:
  icmp-blocks:
  rich rules:

Comment 26 Gaurav Yadav 2018-02-12 19:28:07 UTC
Could you please check the ZONE entry using the commands mentioned below on both nodes.
Node 1:
cat /etc/sysconfig/network-scripts/ifcfg-bond0

Node2:
cat /etc/sysconfig/network-scripts/ifcfg-bond1

Comment 27 devianca 2018-02-12 20:41:14 UTC
I'm not using network.service. I'm using systemd-networkd.service...

You can see from the previous outputs that the zone is active:

[root@BUNKER ~]# firewall-cmd --info-zone=gluster --permanent
gluster (active)
...
[root@ProdigyX ~]# firewall-cmd --info-zone=gluster --permanent
gluster (active)
...

More output:
[root@ProdigyX ~]# firewall-cmd --get-active-zones
gluster
  interfaces: bond1

[root@BUNKER ~]# firewall-cmd --get-active-zones
gluster
  interfaces: bond0
public
  interfaces: enp1s0

Comment 28 devianca 2018-02-12 20:42:56 UTC
So let me get this straight for myself: are you saying that firewalld is blocking traffic that wants to go OUT (and not IN)?

Comment 29 Kaushal 2018-02-13 11:24:48 UTC
(In reply to devianca from comment #28)
> So let me get this straight for myself. What you are saying is that
> firewalld is blocking traffic that wants to go OUT (and not IN)?

Sure seems like it. But that is strange.

A default new firewalld zone should block all incoming traffic, other than those that have been allowed via services/ports etc. All outgoing traffic should be allowed.

In your case, the firewall appears to be dropping the connection-establishment replies (SYN_ACK) for an outgoing connection. That really shouldn't be happening, but I have no idea why it does. Maybe we should reach out to the firewalld project or other firewall experts for answers.
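One plausible mechanism, offered as an assumption rather than a finding: the STATE_INVALID_DROP message suggests the ruleset drops packets that conntrack classifies as INVALID, and around a reboot a peer's SYN_ACK for a stale or half-open connection can be classified exactly that way (out-of-window for the tracked state). If that is the cause, temporarily relaxing conntrack's TCP window tracking would confirm or rule it out:

```
# /etc/sysctl.d/90-conntrack-liberal.conf (hypothetical file name; diagnostic only)
# Do not mark out-of-window TCP packets as INVALID, so the firewall's
# INVALID-state drop rule stops eating the peer's SYN_ACK replies.
net.netfilter.nf_conntrack_tcp_be_liberal = 1
```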

Comment 30 Atin Mukherjee 2018-10-05 02:47:45 UTC
We couldn't reproduce this issue in-house, and as explained in the earlier comments, this doesn't look like a gluster-specific issue at this point.

