Bug 1540249
Summary: | Gluster is trying to use a port outside documentation and firewalld's glusterfs.xml | |
---|---|---|---
Product: | [Community] GlusterFS | Reporter: | devianca
Component: | glusterd | Assignee: | bugs <bugs>
Status: | CLOSED WORKSFORME | QA Contact: |
Severity: | unspecified | Docs Contact: |
Priority: | unspecified | |
Version: | 3.12 | CC: | amukherj, bugs, devianca, kaushal, kkeithle, rtalur
Target Milestone: | --- | |
Target Release: | --- | |
Hardware: | Unspecified | |
OS: | Unspecified | |
Whiteboard: | | |
Fixed In Version: | | Doc Type: | If docs needed, set a value
Doc Text: | | Story Points: | ---
Clone Of: | | Environment: |
Last Closed: | 2018-10-05 02:47:45 UTC | Type: | Bug
Regression: | --- | Mount Type: | ---
Documentation: | --- | CRM: |
Verified Versions: | | Category: | ---
oVirt Team: | --- | RHEL 7.3 requirements from Atomic Host: |
Cloudforms Team: | --- | Target Upstream Version: |
Embargoed: | | |
Attachments: |
to remind everyone, this is a firewall issue i bet. but nevertheless, this is some more info from a situation like this: node1: [root@ProdigyX ~]# gluster pool list UUID Hostname State xxx 10.250.1.1 Disconnected yyy localhost Connected [root@ProdigyX ~]# gluster volume info replica1 Volume Name: replica1 Type: Replicate Volume ID: zzz Status: Started Snapshot Count: 0 Number of Bricks: 1 x 2 = 2 Transport-type: tcp Bricks: Brick1: 10.250.1.2:/array0/brick1 Brick2: 10.250.1.1:/raid0array1/brick2 Options Reconfigured: client.bind-insecure: off performance.client-io-threads: off auth.allow: 10.250.1.1,10.250.1.2 transport.address-family: inet nfs.disable: on server.event-threads: 8 performance.io-thread-count: 64 performance.cache-size: 32MB performance.write-behind-window-size: 64MB server.allow-insecure: off [root@ProdigyX ~]# gluster volume status replica1 Status of volume: replica1 Gluster process TCP Port RDMA Port Online Pid ------------------------------------------------------------------------------ Brick 10.250.1.2:/array0/brick1 49152 0 Y 1193 Self-heal Daemon on localhost N/A N/A Y 1176 Task Status of Volume replica1 ------------------------------------------------------------------------------ There are no active volume tasks [root@ProdigyX ~]# ps auxwww | grep gluster root 1164 0.3 0.0 604644 10732 ? Ssl 14:00 0:00 /usr/sbin/glusterd -p /var/run/glusterd.pid --log-level INFO root 1176 0.0 0.0 593156 7644 ? Ssl 14:00 0:00 /usr/sbin/glusterfs -s localhost --volfile-id gluster/glustershd -p /var/run/gluster/gluste shd/glustershd.pid -l /var/log/glusterfs/glustershd.log -S /var/run/gluster/6c565af3c4462a80c526b26f74d90dca.socket --xlator-option *replicate*.node-uuid=yyy root 1193 0.0 0.0 1612800 16356 ? Ssl 14:00 0:00 /usr/sbin/glusterfsd -s 10.250.1.2 --volfile-id replica1.10.250.1.2.array0-brick1 -p /var/run/gluster/vols/replica1/10.250.1.2-array0-brick1.pid -S /var/run/gluster/9329e7359a1938faf4767c564e490de5.socket --brick-name /array0/brick1 -l /var/log/glusterfs/bricks/array0-brick1.log --xlator-option *-posix.glusterd-uuid=yyy --brick-port 49152 --xlator-option replica1-server.listen-port=49152 root 2171 0.1 0.0 550584 9496 ? 
Ssl 14:01 0:00 /usr/sbin/glusterfs --volfile-server=127.0.0.1 --volfile-id=/replica1 /gluster root 2343 0.0 0.0 112676 972 pts/0 S+ 14:01 0:00 grep --color=auto gluster node2: [root@BUNKER ~]# gluster pool list UUID Hostname State yyy 10.250.1.2 Connected xxx localhost Connected You have new mail in /var/spool/mail/root [root@BUNKER ~]# gluster volume info replica1 Volume Name: replica1 Type: Replicate Volume ID: zzz Status: Started Snapshot Count: 0 Number of Bricks: 1 x 2 = 2 Transport-type: tcp Bricks: Brick1: 10.250.1.2:/array0/brick1 Brick2: 10.250.1.1:/raid0array1/brick2 Options Reconfigured: server.allow-insecure: off performance.write-behind-window-size: 64MB performance.cache-size: 32MB performance.io-thread-count: 64 server.event-threads: 8 nfs.disable: on transport.address-family: inet auth.allow: 10.250.1.1,10.250.1.2 performance.client-io-threads: off client.bind-insecure: off [root@BUNKER ~]# gluster volume status replica1 Status of volume: replica1 Gluster process TCP Port RDMA Port Online Pid ------------------------------------------------------------------------------ Brick 10.250.1.2:/array0/brick1 49152 0 Y 1193 Brick 10.250.1.1:/raid0array1/brick2 49152 0 Y 1400 Self-heal Daemon on localhost N/A N/A Y 1503 Self-heal Daemon on 10.250.1.2 N/A N/A Y 1176 Task Status of Volume replica1 ------------------------------------------------------------------------------ There are no active volume tasks [root@BUNKER ~]# ps auxwww |grep gluster root 869 0.0 0.2 604652 9560 ? Ssl 12:35 0:01 /usr/sbin/glusterd -p /var/run/glusterd.pid --log-level INFO root 1400 0.0 0.5 1416192 20312 ? Ssl 12:35 0:00 /usr/sbin/glusterfsd -s 10.250.1.1 --volfile-id replica1.10.250.1.1.raid0array1-brick2 -p /var/run/gluster/vols/replica1/10.250.1.1-raid0array1-brick2.pid -S /var/run/gluster/b1821a9027697b21620a8c5abc7a8fb9.socket --brick-name /raid0array1/brick2 -l /var/log/glusterfs/bricks/raid0array1-brick2.log --xlator-option *-posix.glusterd-uuid=xxx --brick-port 49152 --xlator-option replica1-server.listen-port=49152 root 1503 0.0 0.4 683312 17008 ? Ssl 12:39 0:00 /usr/sbin/glusterfs -s localhost --volfile-id gluster/glustershd -p /var/run/gluster/glustershd/glustershd.pid -l /var/log/glusterfs/glustershd.log -S /var/run/gluster/941dd362146b7df8e34dbccb220962e5.socket --xlator-option *replicate*.node-uuid=xxx root 1918 0.0 0.0 112680 984 pts/0 S+ 14:04 0:00 grep --color=auto gluster Created attachment 1388511 [details]
both nodes when node2 got stuck (both logs are 'fresh')
Created attachment 1388512 [details]
statedump from node2 when node2 is stuck
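(For anyone following along, a minimal sketch of how statedumps like the attached one can be produced; the volume name is the one from this report, and the default dump directory /var/run/gluster may differ per build:)
# dump the state of all brick processes of the volume
gluster volume statedump replica1
# dump the state of the glusterd management daemon itself
kill -SIGUSR1 $(pidof glusterd)
ls /var/run/gluster/*.dump.*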
Please note that from the ps output I can clearly see that the brick port (listen-port) allocated by glusterd is 49152. root 1400 0.0 0.5 1416192 20312 ? Ssl 12:35 0:00 /usr/sbin/glusterfsd -s 10.250.1.1 --volfile-id replica1.10.250.1.1.raid0array1-brick2 -p /var/run/gluster/vols/replica1/10.250.1.1-raid0array1-brick2.pid -S /var/run/gluster/b1821a9027697b21620a8c5abc7a8fb9.socket --brick-name /raid0array1/brick2 -l /var/log/glusterfs/bricks/raid0array1-brick2.log --xlator-option *-posix.glusterd-uuid=xxx --brick-port 49152 --xlator-option replica1-server.listen-port=49152 I think what we're suspecting here is a port which is used by client to communicate to glusterd which might be blocked. Could you provide output of: netstat -nap | grep 49151 netstat is the last: [root@ProdigyX ~]# gluster pool list UUID Hostname State 2f6697f4-2529-4072-910c-8862fdc43562 10.250.1.1 Disconnected f7976943-b81a-4bb4-a1fb-06253bf064c4 localhost Connected [root@ProdigyX ~]# gluster peer status Number of Peers: 1 Hostname: 10.250.1.1 Uuid: 2f6697f4-2529-4072-910c-8862fdc43562 State: Peer in Cluster (Disconnected) [root@ProdigyX ~]# gluster volume status Status of volume: replica1 Gluster process TCP Port RDMA Port Online Pid ------------------------------------------------------------------------------ Brick 10.250.1.2:/array0/brick1 49152 0 Y 1192 Self-heal Daemon on localhost N/A N/A Y 1175 Task Status of Volume replica1 ------------------------------------------------------------------------------ There are no active volume tasks [root@ProdigyX ~]# gluster volume info Volume Name: replica1 Type: Replicate Volume ID: 5331fac2-42b6-4530-bf79-1ec0236efbc4 Status: Started Snapshot Count: 0 Number of Bricks: 1 x 2 = 2 Transport-type: tcp Bricks: Brick1: 10.250.1.2:/array0/brick1 Brick2: 10.250.1.1:/raid0array1/brick2 Options Reconfigured: client.bind-insecure: off performance.client-io-threads: off auth.allow: 10.250.1.1,10.250.1.2 transport.address-family: inet nfs.disable: on server.event-threads: 8 performance.io-thread-count: 64 performance.cache-size: 32MB performance.write-behind-window-size: 64MB server.allow-insecure: off [root@ProdigyX ~]# netstat -nap | grep 49151 tcp 0 1 10.250.1.2:49151 10.250.1.1:24007 SYN_SENT 1163/glusterd tcp 0 0 10.250.1.2:24007 10.250.1.1:49151 ESTABLISHED 1163/glusterd with no rebooting of any of the node, same commands on node1: [root@BUNKER ~]# gluster pool list UUID Hostname State f7976943-b81a-4bb4-a1fb-06253bf064c4 10.250.1.2 Connected 2f6697f4-2529-4072-910c-8862fdc43562 localhost Connected [root@BUNKER ~]# gluster peer status Number of Peers: 1 Hostname: 10.250.1.2 Uuid: f7976943-b81a-4bb4-a1fb-06253bf064c4 State: Peer in Cluster (Connected) [root@BUNKER ~]# gluster volume status Status of volume: replica1 Gluster process TCP Port RDMA Port Online Pid ------------------------------------------------------------------------------ Brick 10.250.1.2:/array0/brick1 49152 0 Y 1192 Brick 10.250.1.1:/raid0array1/brick2 49152 0 Y 1395 Self-heal Daemon on localhost N/A N/A Y 1147 Self-heal Daemon on 10.250.1.2 N/A N/A Y 1175 Task Status of Volume replica1 ------------------------------------------------------------------------------ There are no active volume tasks [root@BUNKER ~]# gluster volume info Volume Name: replica1 Type: Replicate Volume ID: 5331fac2-42b6-4530-bf79-1ec0236efbc4 Status: Started Snapshot Count: 0 Number of Bricks: 1 x 2 = 2 Transport-type: tcp Bricks: Brick1: 10.250.1.2:/array0/brick1 Brick2: 10.250.1.1:/raid0array1/brick2 Options Reconfigured: 
client.bind-insecure: off performance.client-io-threads: off auth.allow: 10.250.1.1,10.250.1.2 transport.address-family: inet nfs.disable: on server.event-threads: 8 performance.io-thread-count: 64 performance.cache-size: 32MB performance.write-behind-window-size: 64MB server.allow-insecure: off
[root@BUNKER ~]# netstat -nap | grep 49151
tcp 0 0 10.250.1.1:24007 10.250.1.2:49151 SYN_RECV -
tcp 0 0 10.250.1.1:49151 10.250.1.2:24007 ESTABLISHED 870/glusterd
I meant node2* here.
Based on the netstat output shared (mentioned below):
[root@BUNKER ~]# netstat -nap | grep 49151
tcp 0 0 10.250.1.1:24007 10.250.1.2:49151 SYN_RECV -
tcp 0 0 10.250.1.1:49151 10.250.1.2:24007 ESTABLISHED 870/glusterd
49151 looks to be the outbound port glusterd has used to communicate with the other glusterd instances. In our deployments this outbound port usually stays within the privileged range, but in this particular case it picked 49151, and since 49151 is effectively blocked, none of the communication from glusterd at 10.250.1.1 to the other glusterd instances is getting through, which is why the nodes appear stuck in the CLI command. Just a wild guess: was the port initially unblocked and only blocked after glusterd came up?
No. If anything, the exact opposite happened:
0. format
1. made systemd-networkd interconnect interfaces - all ports blocked
2. made new "gluster" zone
3. added glusterfs.xml to "gluster" zone
Thanks for staying and helping!
What's the progress?
More logs; check uptime, firewalld+glusterd start times, pool list:
Node1:
[root@ProdigyX ~]# uptime
23:42:48 up 3 days, 6:50, 1 user, load average: 0,01, 0,03, 0,05
[root@ProdigyX ~]# systemctl status firewalld --no-pager --full
● firewalld.service - firewalld - dynamic firewall daemon Loaded: loaded (/usr/lib/systemd/system/firewalld.service; enabled; vendor preset: enabled) Active: active (running) since tor 2018-01-30 16:52:39 CET; 3 days ago Docs: man:firewalld(1) Main PID: 859 (firewalld) CGroup: /system.slice/firewalld.service └─859 /usr/bin/python -Es /usr/sbin/firewalld --nofork --nopid --debug=10
jan 30 16:52:38 ProdigyX systemd[1]: Starting firewalld - dynamic firewall daemon...
jan 30 16:52:39 ProdigyX systemd[1]: Started firewalld - dynamic firewall daemon.
jan 30 16:52:40 ProdigyX firewalld[859]: WARNING: ICMP type 'beyond-scope' is not supported by the kernel for ipv6.
jan 30 16:52:40 ProdigyX firewalld[859]: WARNING: beyond-scope: INVALID_ICMPTYPE: No supported ICMP type., ignoring for run-time.
jan 30 16:52:40 ProdigyX firewalld[859]: WARNING: ICMP type 'failed-policy' is not supported by the kernel for ipv6.
jan 30 16:52:40 ProdigyX firewalld[859]: WARNING: failed-policy: INVALID_ICMPTYPE: No supported ICMP type., ignoring for run-time.
jan 30 16:52:40 ProdigyX firewalld[859]: WARNING: ICMP type 'reject-route' is not supported by the kernel for ipv6.
jan 30 16:52:40 ProdigyX firewalld[859]: WARNING: reject-route: INVALID_ICMPTYPE: No supported ICMP type., ignoring for run-time.
[root@ProdigyX ~]# systemctl status glusterd --no-pager --full ● glusterd.service - GlusterFS, a clustered file-system server Loaded: loaded (/etc/systemd/system/glusterd.service; enabled; vendor preset: disabled) Drop-In: /etc/systemd/system/glusterd.service.d └─override.conf Active: active (running) since tor 2018-01-30 16:53:19 CET; 3 days ago Main PID: 1163 (glusterd) CGroup: /system.slice/glusterd.service ├─1163 /usr/sbin/glusterd -p /var/run/glusterd.pid --log-level INFO ├─1175 /usr/sbin/glusterfs -s localhost --volfile-id gluster/glustershd -p /var/run/gluster/glustershd/glustershd.pid -l /var/log/glusterfs/glustershd.log -S /var/run/gluster/6c565af3c4462a80c526b26f74d90dca.socket --xlator-option *replicate*.node-uuid=f7976943-b81a-4bb4-a1fb-06253bf064c4 └─1192 /usr/sbin/glusterfsd -s 10.250.1.2 --volfile-id replica1.10.250.1.2.array0-brick1 -p /var/run/gluster/vols/replica1/10.250.1.2-array0-brick1.pid -S /var/run/gluster/9329e7359a1938faf4767c564e490de5.socket --brick-name /array0/brick1 -l /var/log/glusterfs/bricks/array0-brick1.log --xlator-option *-posix.glusterd-uuid=f7976943-b81a-4bb4-a1fb-06253bf064c4 --brick-port 49152 --xlator-option replica1-server.listen-port=49152 jan 30 16:52:43 ProdigyX systemd[1]: Starting GlusterFS, a clustered file-system server... jan 30 16:52:48 ProdigyX bash[1174]: Local brick online. jan 30 16:53:19 ProdigyX systemd[1]: Started GlusterFS, a clustered file-system server. [root@ProdigyX ~]# gluster pool list UUID Hostname State 2f6697f4-2529-4072-910c-8862fdc43562 10.250.1.1 Disconnected f7976943-b81a-4bb4-a1fb-06253bf064c4 localhost Connected [root@ProdigyX ~]# gluster peer status Number of Peers: 1 Hostname: 10.250.1.1 Uuid: 2f6697f4-2529-4072-910c-8862fdc43562 State: Peer in Cluster (Disconnected) [root@ProdigyX ~]# gluster volume status Status of volume: replica1 Gluster process TCP Port RDMA Port Online Pid ------------------------------------------------------------------------------ Brick 10.250.1.2:/array0/brick1 49152 0 Y 1192 Self-heal Daemon on localhost N/A N/A Y 1175 Task Status of Volume replica1 ------------------------------------------------------------------------------ There are no active volume tasks [root@ProdigyX ~]# netstat -nap | grep 49151 tcp 0 1 10.250.1.2:49151 10.250.1.1:24007 SYN_SENT 1163/glusterd tcp 0 0 10.250.1.2:24007 10.250.1.1:49151 ESTABLISHED 1163/glusterd Node2: [root@BUNKER ~]# uptime 23:42:08 up 3 days, 6:54, 1 user, load average: 0,00, 0,01, 0,05 [root@BUNKER ~]# systemctl status firewalld --no-pager --full ● firewalld.service - firewalld - dynamic firewall daemon Loaded: loaded (/usr/lib/systemd/system/firewalld.service; enabled; vendor preset: enabled) Active: active (running) since tor 2018-01-30 16:47:48 CET; 3 days ago Docs: man:firewalld(1) Main PID: 670 (firewalld) CGroup: /system.slice/firewalld.service └─670 /usr/bin/python -Es /usr/sbin/firewalld --nofork --nopid jan 30 16:47:47 BUNKER systemd[1]: Starting firewalld - dynamic firewall daemon... jan 30 16:47:48 BUNKER systemd[1]: Started firewalld - dynamic firewall daemon. jan 30 16:47:49 BUNKER firewalld[670]: WARNING: ICMP type 'beyond-scope' is not supported by the kernel for ipv6. jan 30 16:47:49 BUNKER firewalld[670]: WARNING: beyond-scope: INVALID_ICMPTYPE: No supported ICMP type., ignoring for run-time. jan 30 16:47:49 BUNKER firewalld[670]: WARNING: ICMP type 'failed-policy' is not supported by the kernel for ipv6. 
jan 30 16:47:49 BUNKER firewalld[670]: WARNING: failed-policy: INVALID_ICMPTYPE: No supported ICMP type., ignoring for run-time. jan 30 16:47:49 BUNKER firewalld[670]: WARNING: ICMP type 'reject-route' is not supported by the kernel for ipv6. jan 30 16:47:49 BUNKER firewalld[670]: WARNING: reject-route: INVALID_ICMPTYPE: No supported ICMP type., ignoring for run-time. [root@BUNKER ~]# systemctl status glusterd --no-pager --full ● glusterd.service - GlusterFS, a clustered file-system server Loaded: loaded (/etc/systemd/system/glusterd.service; enabled; vendor preset: disabled) Drop-In: /etc/systemd/system/glusterd.service.d └─override.conf Active: active (running) since tor 2018-01-30 16:47:52 CET; 3 days ago Main PID: 870 (glusterd) CGroup: /system.slice/glusterd.service ├─ 870 /usr/sbin/glusterd -p /var/run/glusterd.pid --log-level INFO ├─1147 /usr/sbin/glusterfs -s localhost --volfile-id gluster/glustershd -p /var/run/gluster/glustershd/glustershd.pid -l /var/log/glusterfs/glustershd.log -S /var/run/gluster/941dd362146b7df8e34dbccb220962e5.socket --xlator-option *replicate*.node-uuid=2f6697f4-2529-4072-910c-8862fdc43562 └─1395 /usr/sbin/glusterfsd -s 10.250.1.1 --volfile-id replica1.10.250.1.1.raid0array1-brick2 -p /var/run/gluster/vols/replica1/10.250.1.1-raid0array1-brick2.pid -S /var/run/gluster/b1821a9027697b21620a8c5abc7a8fb9.socket --brick-name /raid0array1/brick2 -l /var/log/glusterfs/bricks/raid0array1-brick2.log --xlator-option *-posix.glusterd-uuid=2f6697f4-2529-4072-910c-8862fdc43562 --brick-port 49152 --xlator-option replica1-server.listen-port=49152 jan 30 16:47:50 BUNKER systemd[1]: Starting GlusterFS, a clustered file-system server... jan 30 16:47:52 BUNKER systemd[1]: Started GlusterFS, a clustered file-system server. [root@BUNKER ~]# gluster pool list UUID Hostname State f7976943-b81a-4bb4-a1fb-06253bf064c4 10.250.1.2 Connected 2f6697f4-2529-4072-910c-8862fdc43562 localhost Connected [root@BUNKER ~]# gluster peer status Number of Peers: 1 Hostname: 10.250.1.2 Uuid: f7976943-b81a-4bb4-a1fb-06253bf064c4 State: Peer in Cluster (Connected) [root@BUNKER ~]# gluster volume status Status of volume: replica1 Gluster process TCP Port RDMA Port Online Pid ------------------------------------------------------------------------------ Brick 10.250.1.2:/array0/brick1 49152 0 Y 1192 Brick 10.250.1.1:/raid0array1/brick2 49152 0 Y 1395 Self-heal Daemon on localhost N/A N/A Y 1147 Self-heal Daemon on 10.250.1.2 N/A N/A Y 1175 Task Status of Volume replica1 ------------------------------------------------------------------------------ There are no active volume tasks [root@BUNKER ~]# netstat -nap | grep 49151 tcp 0 0 10.250.1.1:24007 10.250.1.2:49151 SYN_RECV - tcp 0 0 10.250.1.1:49151 10.250.1.2:24007 ESTABLISHED 870/glusterd I haven't got time to root cause this yet. I'll pass this to one of my colleague Gaurav Yadav to take a look at this. If this issue is blocking you to move forward could you just open 49151 port for now and move forward. We shall be able to come back on this sooner. Gaurav - could you take a look at this? In my case it's that SELinux appears to be blocking the bind() on 49151. I can start the volume if I disable SELinux. (I only tested on a single brick volume on the localhost so I would not be affected by the firewalld.) (In reply to Kaleb KEITHLEY from comment #12) > In my case it's that SELinux appears to be blocking the bind() on 49151. I > can start the volume if I disable SELinux. 
(I only tested on a single brick > volume on the localhost so I would not be affected by the firewalld.) You mean 49151 was allocated to a brick port? That can never happen! As in glusterd's pmap the base port is set to 49152 and that from where the brick port allocation starts. As per my analysis at comment 8 it was the outband port which glusterd uses to send requests to other glusterd instances which was blocked. I couldn't reproduce this problem on my fedora27 system with selinux mode enabled. [root@atinmu glusterfs]# sestatus SELinux status: enabled SELinuxfs mount: /sys/fs/selinux SELinux root directory: /etc/selinux Loaded policy name: targeted Current mode: enforcing Mode from config file: enforcing Policy MLS status: enabled Policy deny_unknown status: allowed Memory protection checking: actual (secure) Max kernel policy version: 31 [root@atinmu glusterfs]# [root@atinmu glusterfs]# [root@atinmu glusterfs]# [root@atinmu glusterfs]# [root@atinmu glusterfs]# glusterd [root@atinmu glusterfs]# cat /etc/fedora-release Fedora release 27 (Twenty Seven) [root@atinmu glusterfs]# gluster v create test-vol x.x.x.x:/tmp/b{1..4} force volume create: test-vol: success: please start the volume to access data [root@atinmu glusterfs]# gluster v start test-vol volume start: test-vol: success [root@atinmu glusterfs]# gluster v status Status of volume: test-vol Gluster process TCP Port RDMA Port Online Pid ------------------------------------------------------------------------------ Brick 192.168.0.106:/tmp/b1 49152 0 Y 22410 Brick 192.168.0.106:/tmp/b2 49153 0 Y 22431 Brick 192.168.0.106:/tmp/b3 49154 0 Y 22452 Brick 192.168.0.106:/tmp/b4 49155 0 Y 22473 Task Status of Volume test-vol ------------------------------------------------------------------------------ There are no active volume tasks [root@atinmu glusterfs]# netstat -nap | grep 49151 tcp 0 0 x.x.x.x:24007 x.x.x.x:49151 ESTABLISHED 21932/glusterd tcp 0 0 x.x.x.x:49151 x.x.x.x:24007 ESTABLISHED 22410/glusterfsd x.x.x.x - same IP only from your own netstat at the end of comment 13 >[root@atinmu glusterfs]# netstat -nap | grep 49151 > tcp 0 0 x.x.x.x:24007 x.x.x.x:49151 ESTABLISHED 21932/glusterd > tcp 0 0 x.x.x.x:49151 x.x.x.x:24007 ESTABLISHED 22410/glusterfsd ^^^^^^^^^^^^^ selinux is denying the bind() on port 49151 by glusterfsd running locally selinux is also denying the mkdir_p of (/var)/run/glusterd/snaps by glusterd This is on Fedora 28 (rawhide) and rhel 7+1 correction:
> selinux is also denying the mkdir_p of (/var)/run/gluster/snaps by glusterd
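(A quick way to confirm the SELinux denials mentioned above; this is a generic sketch assuming the audit tools are installed, not output taken from this report:)
# look for recent AVC denials involving gluster or port 49151
ausearch -m avc -ts recent | grep -iE 'gluster|49151'
# optionally turn the denials into a human-readable report (setroubleshoot)
sealert -a /var/log/audit/audit.log
# confirm the failure disappears in permissive mode before filing an selinux-policy bug
setenforce 0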
I tried replicating the issue in my setup by creating a new zone and assigning the glusterfs service to it. Here is my cluster state:
Node1
[root@localhost glusterfs]# gluster peer status
Number of Peers: 1
Hostname: 192.168.122.120 Uuid: 695ce317-e35b-4f4a-b85d-326df13308fc State: Accepted peer request (Connected)
[root@localhost glusterfs]#
Node2
[root@localhost ~]# gluster peer status
Number of Peers: 1
Hostname: 192.168.122.79 Uuid: a3fa5d12-2614-4a52-a85c-a96d33bb116d State: Accepted peer request (Disconnected)
Node1
[root@localhost glusterfs]# netstat -tap | grep gluster
tcp 0 0 0.0.0.0:24007 0.0.0.0:* LISTEN 14180/glusterd
tcp 0 0 localhost.localdo:49151 192.168.122.120:24007 ESTABLISHED 14180/glusterd
Node2
[root@localhost gaurav]# netstat -tap | grep gluster
tcp 0 0 0.0.0.0:24007 0.0.0.0:* LISTEN 25958/glusterd
tcp 0 0 localhost.localdo:24007 192.168.122.79:49151 ESTABLISHED 25958/glusterd
tcp 0 1 localhost.localdo:49151 192.168.122.79:24007 SYN_SENT 25958/glusterd
I too suspect it's a firewall issue. In order to confirm, I need some info from your setup; please provide the output of the commands below.
1. firewall-cmd --list-ports --zone=gluster
2. firewall-cmd --list-services --zone=gluster
Thanks for helping! I'm so tired of googling "gluster 49151" that words cannot describe.
[root@ProdigyX ~]# gluster pool list
UUID Hostname State
2f6697f4-2529-4072-910c-8862fdc43562 10.250.1.1 Disconnected
f7976943-b81a-4bb4-a1fb-06253bf064c4 localhost Connected
[root@ProdigyX ~]# gluster peer status
Number of Peers: 1
Hostname: 10.250.1.1 Uuid: 2f6697f4-2529-4072-910c-8862fdc43562 State: Peer in Cluster (Disconnected)
[root@ProdigyX ~]# firewall-cmd --list-ports --zone=gluster
[root@ProdigyX ~]# firewall-cmd --list-services --zone=gluster
glusterfs
[root@ProdigyX ~]# netstat -tap | grep gluster
tcp 0 0 0.0.0.0:49152 0.0.0.0:* LISTEN 1194/glusterfsd
tcp 0 0 0.0.0.0:24007 0.0.0.0:* LISTEN 1165/glusterd
tcp 0 0 ProdigyX:24007 ProdigyX:49149 ESTABLISHED 1165/glusterd
tcp 0 0 localhost:24007 localhost:49150 ESTABLISHED 1165/glusterd
tcp 0 0 ProdigyX:49152 ProdigyX:1016 ESTABLISHED 1194/glusterfsd
tcp 0 0 ProdigyX:1019 10.250.1.1:49152 ESTABLISHED 1177/glusterfs
tcp 0 1 ProdigyX:49151 10.250.1.1:24007 SYN_SENT 1165/glusterd
tcp 0 0 localhost:49150 localhost:24007 ESTABLISHED 1177/glusterfs
tcp 0 0 ProdigyX:1016 ProdigyX:49152 ESTABLISHED 2038/glusterfs
tcp 0 0 localhost:24007 localhost:49148 ESTABLISHED 1165/glusterd
tcp 0 0 ProdigyX:24007 10.250.1.1:49151 ESTABLISHED 1165/glusterd
tcp 0 0 ProdigyX:1015 10.250.1.1:49152 ESTABLISHED 2038/glusterfs
tcp 0 0 ProdigyX:49149 ProdigyX:24007 ESTABLISHED 1194/glusterfsd
tcp 0 0 ProdigyX:1020 ProdigyX:49152 ESTABLISHED 1177/glusterfs
tcp 0 0 localhost:49148 localhost:24007 ESTABLISHED 2038/glusterfs
tcp 0 0 ProdigyX:49152 10.250.1.1:exp2 ESTABLISHED 1194/glusterfsd
tcp 0 0 ProdigyX:49152 ProdigyX:1020 ESTABLISHED 1194/glusterfsd
Also to Gaurav: finally someone with this bug besides me. Did you do anything special to reproduce it? What do you think the problem is? It can't be the firewall, because gluster should not touch port 49151, or am I horribly mistaken?
For me it was a firewall issue; on one of the nodes the firewall was blocking the connection. Now I am able to perform the full set of operations.
N1
firewall-cmd --get-active-zones
gluster interfaces: ens3
N2
firewall-cmd --get-active-zones
gluster interfaces: ens3
firewall-cmd --permanent --info-zone=gluster
gluster target: default icmp-block-inversion: no interfaces: sources: services: glusterfs ports: 22/tcp protocols: masquerade: no forward-ports: sourceports: icmp-blocks: rich rules:
gluster pool list
UUID Hostname State
b38f67e6-e67b-4e0b-834f-87df268b6243 192.168.122.120 Connected
9d39dcde-cf55-4f8b-b889-4240fb4b72c8 localhost Connected
[root@localhost glusterfs]# gluster peer status
Number of Peers: 1
Hostname: 192.168.122.79 Uuid: 9d39dcde-cf55-4f8b-b889-4240fb4b72c8 State: Peer in Cluster (Connected
[root@localhost glusterfs]# gluster peer status
Number of Peers: 1
Hostname: 192.168.122.120 Uuid: b38f67e6-e67b-4e0b-834f-87df268b6243 State: Peer in Cluster (Connected)
[root@localhost glusterfs]# netstat -nap | grep gluster
tcp 0 0 0.0.0.0:49152 0.0.0.0:* LISTEN 10825/glusterfsd
tcp 0 0 0.0.0.0:24007 0.0.0.0:* LISTEN 10677/glusterd
tcp 0 0 192.168.122.120:24007 192.168.122.120:49150 ESTABLISHED 10677/glusterd
tcp 0 0 192.168.122.120:49150 192.168.122.120:24007 ESTABLISHED 10825/glusterfsd
tcp 0 0 192.168.122.120:24007 192.168.122.79:49151 ESTABLISHED 10677/glusterd
tcp 0 0 192.168.122.120:49151 192.168.122.79:24007 ESTABLISHED 10677/glusterd
unix 2 [ ACC ] STREAM LISTENING 75338 10677/glusterd /var/run/glusterd.socket
unix 2 [ ACC ] STREAM LISTENING 94304 10825/glusterfsd /var/run/gluster/changelog-93b92d74198dc111ed4be6a4e0d52563.sock
unix 2 [ ACC ] STREAM LISTENING 75696 10825/glusterfsd /var/run/gluster/6f4035c6b1db1fa531293ed9e2b3891a.socket
unix 3 [ ] STREAM CONNECTED 94316 10825/glusterfsd /var/run/gluster/6f4035c6b1db1fa531293ed9e2b3891a.socket
unix 3 [ ] STREAM CONNECTED 94315 10677/glusterd
unix 2 [ ] DGRAM 74473 10677/glusterd
unix 2 [ ] DGRAM 77064 10825/glusterfsd
Port allocation for the various processes:
1023 - 1 : port range for clients, if bind-secure is turned on
49151 - 1024 : port range for clients, if bind-insecure is on (also the fallback if the ports in the range above are exhausted)
49152 - 65535 : port range for bricks
1. In my situation, both nodes have only 1 brick, so why should it get exhausted?
2. Are you indicating that port 49151 is perfectly normal? If so, why isn't that range added to glusterfs.xml?
3. How to fix this now? Manually open the whole range 1024-49151, even though both nodes have only 1 brick and they are the only clients connecting?
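(For reference, a sketch of the stop-gap suggested earlier in the thread, opening the outbound glusterd port in the gluster zone on both nodes; the zone name comes from this report, and whether to open only 49151 or the full insecure client range is exactly the open question above:)
firewall-cmd --permanent --zone=gluster --add-port=49151/tcp
# or, for the whole insecure client range discussed above:
# firewall-cmd --permanent --zone=gluster --add-port=1024-49151/tcp
firewall-cmd --reload
firewall-cmd --list-ports --zone=gluster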
Both *were already* set, check https://bugzilla.redhat.com/show_bug.cgi?id=1540249#c1 Now, also disabled both clients, and still getting Disconnected, node1: [root@ProdigyX ~]# uptime 10:37:35 up 11 min, 1 user, load average: 0,00, 0,01, 0,03 [root@ProdigyX ~]# systemctl status gluster.mount ● gluster.mount - Mount Gluster Loaded: loaded (/etc/systemd/system/gluster.mount; disabled; vendor preset: disabled) Active: inactive (dead) Where: /gluster What: 127.0.0.1:/replica1 [root@ProdigyX ~]# gluster pool list UUID Hostname State 2f6697f4-2529-4072-910c-8862fdc43562 10.250.1.1 Disconnected f7976943-b81a-4bb4-a1fb-06253bf064c4 localhost Connected [root@ProdigyX ~]# gluster peer status Number of Peers: 1 Hostname: 10.250.1.1 Uuid: 2f6697f4-2529-4072-910c-8862fdc43562 State: Peer in Cluster (Disconnected) [root@ProdigyX ~]# gluster volume status Status of volume: replica1 Gluster process TCP Port RDMA Port Online Pid ------------------------------------------------------------------------------ Brick 10.250.1.2:/array0/brick1 49152 0 Y 1390 Self-heal Daemon on localhost N/A N/A Y 1373 Task Status of Volume replica1 ------------------------------------------------------------------------------ There are no active volume tasks [root@ProdigyX ~]# gluster volume info Volume Name: replica1 Type: Replicate Volume ID: 5331fac2-42b6-4530-bf79-1ec0236efbc4 Status: Started Snapshot Count: 0 Number of Bricks: 1 x 2 = 2 Transport-type: tcp Bricks: Brick1: 10.250.1.2:/array0/brick1 Brick2: 10.250.1.1:/raid0array1/brick2 Options Reconfigured: client.bind-insecure: off performance.client-io-threads: off auth.allow: 10.250.1.1,10.250.1.2 transport.address-family: inet nfs.disable: on server.event-threads: 8 performance.io-thread-count: 64 performance.cache-size: 32MB performance.write-behind-window-size: 64MB server.allow-insecure: off [root@ProdigyX ~]# netstat -tap | grep gluster tcp 0 0 0.0.0.0:49152 0.0.0.0:* LISTEN 1390/glusterfsd tcp 0 0 0.0.0.0:24007 0.0.0.0:* LISTEN 1361/glusterd tcp 0 0 ProdigyX:1019 10.250.1.1:49152 ESTABLISHED 1373/glusterfs tcp 0 0 ProdigyX:1020 ProdigyX:49152 ESTABLISHED 1373/glusterfs tcp 0 0 ProdigyX:49152 10.250.1.1:exp2 ESTABLISHED 1390/glusterfsd tcp 0 0 ProdigyX:24007 10.250.1.1:49151 ESTABLISHED 1361/glusterd tcp 0 1 ProdigyX:49151 10.250.1.1:24007 SYN_SENT 1361/glusterd tcp 0 0 ProdigyX:24007 ProdigyX:49149 ESTABLISHED 1361/glusterd tcp 0 0 ProdigyX:49152 ProdigyX:1020 ESTABLISHED 1390/glusterfsd tcp 0 0 ProdigyX:49149 ProdigyX:24007 ESTABLISHED 1390/glusterfsd tcp 0 0 localhost:49150 localhost:24007 ESTABLISHED 1373/glusterfs tcp 0 0 localhost:24007 localhost:49150 ESTABLISHED 1361/glusterd node2: [root@BUNKER ~]# uptime 10:37:34 up 26 min, 1 user, load average: 0,00, 0,01, 0,05 [root@BUNKER ~]# systemctl status gluster.mount ● gluster.mount - Mount Gluster Loaded: loaded (/etc/systemd/system/gluster.mount; disabled; vendor preset: disabled) Active: inactive (dead) Where: /gluster What: 127.0.0.1:/replica1 [root@BUNKER ~]# gluster pool list UUID Hostname State f7976943-b81a-4bb4-a1fb-06253bf064c4 10.250.1.2 Connected 2f6697f4-2529-4072-910c-8862fdc43562 localhost Connected [root@BUNKER ~]# gluster peer status Number of Peers: 1 Hostname: 10.250.1.2 Uuid: f7976943-b81a-4bb4-a1fb-06253bf064c4 State: Peer in Cluster (Connected) [root@BUNKER ~]# gluster volume status Status of volume: replica1 Gluster process TCP Port RDMA Port Online Pid ------------------------------------------------------------------------------ Brick 10.250.1.2:/array0/brick1 49152 0 Y 1390 Brick 
10.250.1.1:/raid0array1/brick2 49152 0 Y 1334
Self-heal Daemon on localhost N/A N/A Y 1149
Self-heal Daemon on 10.250.1.2 N/A N/A Y 1373
Task Status of Volume replica1
------------------------------------------------------------------------------
There are no active volume tasks
[root@BUNKER ~]# gluster volume info
Volume Name: replica1 Type: Replicate Volume ID: 5331fac2-42b6-4530-bf79-1ec0236efbc4 Status: Started Snapshot Count: 0 Number of Bricks: 1 x 2 = 2 Transport-type: tcp Bricks: Brick1: 10.250.1.2:/array0/brick1 Brick2: 10.250.1.1:/raid0array1/brick2 Options Reconfigured: client.bind-insecure: off performance.client-io-threads: off auth.allow: 10.250.1.1,10.250.1.2 transport.address-family: inet nfs.disable: on server.event-threads: 8 performance.io-thread-count: 64 performance.cache-size: 32MB performance.write-behind-window-size: 64MB server.allow-insecure: off
[root@BUNKER ~]# netstat -tap | grep gluster
tcp 0 0 0.0.0.0:49152 0.0.0.0:* LISTEN 1334/glusterfsd
tcp 0 0 0.0.0.0:24007 0.0.0.0:* LISTEN 866/glusterd
tcp 0 0 BUNKER:49152 10.250.1.2:1019 ESTABLISHED 1334/glusterfsd
tcp 0 0 BUNKER:exp2 10.250.1.2:49152 ESTABLISHED 1149/glusterfs
tcp 0 0 BUNKER:24007 BUNKER:49149 ESTABLISHED 866/glusterd
tcp 0 0 BUNKER:1020 BUNKER:49152 ESTABLISHED 1149/glusterfs
tcp 0 0 BUNKER:49151 10.250.1.2:24007 ESTABLISHED 866/glusterd
tcp 0 0 BUNKER:49152 BUNKER:1020 ESTABLISHED 1334/glusterfsd
tcp 0 0 localhost:49150 localhost:24007 ESTABLISHED 1149/glusterfs
tcp 0 0 localhost:24007 localhost:49150 ESTABLISHED 866/glusterd
tcp 0 0 BUNKER:49149 BUNKER:24007 ESTABLISHED 1334/glusterfsd
For faster debugging for all of you, here's the script I'm using to catch it:
[root@ProdigyX ~]# systemctl cat glusterd
# /etc/systemd/system/glusterd.service
[Unit]
Description=GlusterFS, a clustered file-system server
Requires=rpcbind.service
After=network.target rpcbind.service
Before=network-online.target
[Service]
Type=forking
PIDFile=/var/run/glusterd.pid
LimitNOFILE=65536
Environment="LOG_LEVEL=INFO"
EnvironmentFile=-/etc/sysconfig/glusterd
ExecStart=/usr/sbin/glusterd -p /var/run/glusterd.pid --log-level $LOG_LEVEL $GLUSTERD_OPTIONS
KillMode=process
SuccessExitStatus=15
[Install]
WantedBy=multi-user.target
# /etc/systemd/system/glusterd.service.d/override.conf
[Unit]
PartOf=array0.mount
Wants=systemd-networkd-wait-online.service
#Wants=gluster.mount
After=array0.mount
After=systemd-networkd-wait-online.service
#Before=gluster.mount
[Service]
ExecStartPost=/usr/bin/bash -c '_ip="10.250.1.2"; while :; do gluster volume status | grep "$_ip" | grep -q Y && break; sleep 0.5; done; echo "Local brick online."'
ExecStartPost=/usr/bin/bash /root/glusterd_health
StandardOutput=journal
Restart=on-failure
RestartSec=5s
[root@ProdigyX ~]# cat glusterd_health
#!/bin/bash
echo "DATE: $(date +%G-%m-%d" "%H:%M:%S)" >> /root/glusterd_health.log
log_it() {
gluster volume status >> /root/glusterd_health.log
gluster pool list >> /root/glusterd_health.log
rm -Rf /var/log/glusterfs/*
sync
}
_ip="10.250.1.1"
_until=$(($(date +%s) + 30))
until [[ "$(date +%s)" -gt "$_until" ]]; do
gluster pool list | grep "$_ip" | grep -q Connected && log_it && reboot
sleep 0.5
done
(In reply to devianca from comment #21)
> [full node1/node2 status output quoted above]
Both *were already* set; I meant these:
server.allow-insecure: off
client.bind-insecure: off
Hi, Gaurav reproduced the same issue that you are hitting. Here is the explanation for it. Gluster has some sockets in listening mode and some which are just clients. For example, a glusterd process creates a LISTENING port at 24007 and also creates a client connection using a connect() call. The glusterfs.xml that we provide *only* lists those ports that we bind to and LISTEN at. In your setup we have:
10.250.1.1    10.250.1.2
24007         24007
49151         49151
where 24007 are listening ports and 49151 are client ports. Now, 10.250.1.2 sent a packet from 49151 to 10.250.1.1:24007. That is what you see as SYN_SENT. It was also received by 10.250.1.1, and we see that as SYN_RECV. The log message below shows that when 10.250.1.1:24007 replied back with SYN_ACK, it was dropped.
Jan 30 13:42:11 ProdigyX kernel: STATE_INVALID_DROP: IN=bond1 OUT= MAC=26:bb:b5:40:75:92:68:05:ca:69:9e:fc:08:00 SRC=10.250.1.1 DST=10.250.1.2 LEN=60 TOS=0x00 PREC=0x00 TTL=64 ID=0 DF PROTO=TCP SPT=24007 DPT=49151 WINDOW=27438 RES=0x00 ACK SYN URGP=0
I think glusterfs.xml isn't applied at node 10.250.1.1. Please check once; maybe the zone isn't mapped to the interface.
Wow, thanks for the quick reply. Please check this, and tell me if it's not mapped.
node1:
[root@ProdigyX ~]# networkctl
IDX LINK TYPE OPERATIONAL SETUP
1 lo loopback carrier unmanaged
2 i210.0 ether carrier configured
3 i210.1 ether off unmanaged
4 i350.0 ether routable configured
5 i350.1 ether carrier configured
6 i350.2 ether carrier configured
7 i350.3 ether carrier configured
8 bond0 ether off unmanaged
9 bond1 ether routable configured
9 links listed.
[root@ProdigyX ~]# networkctl status
● State: routable Address: xxx on i350.0 10.250.1.2 on bond1 Gateway: xxx (MITAC INTERNATIONAL CORP.) on i350.0
DNS: xxx
NTP: xxx
[root@ProdigyX ~]# firewall-cmd --info-zone=gluster --permanent
gluster (active) target: default icmp-block-inversion: no interfaces: bond1 sources: services: glusterfs ports: protocols: masquerade: no forward-ports: source-ports: icmp-blocks: rich rules:
node2:
[root@BUNKER ~]# networkctl
IDX LINK TYPE OPERATIONAL SETUP
1 lo loopback carrier unmanaged
2 enp1s0 ether carrier configured
3 enp2s0 ether carrier configured
4 enp3s0 ether carrier configured
5 enp4s0 ether carrier configured
6 bond0 ether routable configured
7 enp1s0.120 ether routable configured
7 links listed.
[root@BUNKER ~]# networkctl status
● State: routable Address: 10.250.1.1 on bond0 xxx on enp1s0.120 Gateway: xxx (MITAC INTERNATIONAL CORP.) on enp1s0.120 DNS: xxx NTP: xxx
[root@BUNKER ~]# firewall-cmd --info-zone=gluster --permanent
gluster (active) target: default icmp-block-inversion: no interfaces: bond0 sources: services: glusterfs ports: protocols: masquerade: no forward-ports: source-ports: icmp-blocks: rich rules:
Could you please check the ZONE entry with the commands below, on both nodes.
Node 1: cat /etc/sysconfig/network-scripts/ifcfg-bond0
Node2: cat /etc/sysconfig/network-scripts/ifcfg-bond1
I'm not using network.service. I'm using systemd-networkd.service... You can see from the previous outputs that it is (active):
[root@BUNKER ~]# firewall-cmd --info-zone=gluster --permanent
gluster (active) ...
[root@ProdigyX ~]# firewall-cmd --info-zone=gluster --permanent
gluster (active) ...
More output:
[root@ProdigyX ~]# firewall-cmd --get-active-zones
gluster interfaces: bond1
[root@BUNKER ~]# firewall-cmd --get-active-zones
gluster interfaces: bond0
public interfaces: enp1s0
So let me get this straight for myself. What you are saying is that firewalld is blocking traffic that wants to go OUT (and not IN)?
(In reply to devianca from comment #28) > So let me get this straight for myself. What you are saying is that firewalld is blocking traffic that wants to go OUT (and not IN)?
Sure seems like it. But that is strange. A default new firewalld zone should block all incoming traffic other than what has been allowed via services/ports etc.; all outgoing traffic should be allowed. In your case, the firewall appears to be dropping the connection-establishment replies (SYN_ACK) for an outgoing connection. That really shouldn't be happening, but I have no idea why it does. Maybe we should reach out to the firewalld project or other firewall experts for answers.
We couldn't reproduce this issue in house, and as explained in the earlier comments this doesn't look like a Gluster-specific issue at this moment.
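(Not taken from this report, just a hedged sketch of how one might dig into why the SYN_ACK of an outgoing connection gets dropped; STATE_INVALID_DROP suggests the packet arrived in conntrack state INVALID, and these commands assume firewalld with LogDenied enabled and conntrack-tools installed:)
firewall-cmd --get-log-denied                      # 'all'/'unicast' is what produces the STATE_INVALID_DROP messages
conntrack -E -p tcp --dport 24007                  # follow connection-tracking events for the glusterd port
sysctl net.netfilter.nf_conntrack_tcp_be_liberal   # 1 relaxes the window checks that can mark such replies INVALID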
Created attachment 1388503 [details] this is a fresh log from a situation when a node is stuck (removed all logs before run) Description of problem: Wrote a script to catch these situations in a reboot-loop where in my replica-2, a node after reboot (statistically ~15%) cannot connect to other node and stays like that forever in `gluster pool list` while at the same time other node says that the first node is Connected and for this other node everything is working perfectly. I narrowed it down to firewalld denying connections over port 49151 and this is causing it. Node1: 10.250.1.2 Node2: 10.250.1.1 Both Centos7 with glusterfs.xml openned on 'gluster' interconnect zone This is a quick paste of this exact situation when a node is stuck: [root@ProdigyX ~]# uptime 13:43:47 up 7 min, 1 user, load average: 0,00, 0,04, 0,05 [root@ProdigyX ~]# systemctl is-active glusterd active [root@ProdigyX ~]# networkctl status bond1 ● 9: bond1 Link File: n/a Network File: /etc/systemd/network/bond1.network Type: ether State: routable (configured) Driver: bonding HW Address: 26:bb:b5:40:75:92 MTU: 9198 Address: 10.250.1.2 <------------------------------------------------------------------------------------ [root@ProdigyX ~]# firewall-cmd --get-active-zones gluster interfaces: bond1 [root@ProdigyX ~]# firewall-cmd --permanent --info-zone=gluster gluster (active) target: default icmp-block-inversion: no interfaces: bond1 <----------------------------------------------------------------------------------------- interface bond1 sources: services: glusterfs <--------------------------------------------------------------------------------------- whole service opened ports: protocols: masquerade: no forward-ports: source-ports: icmp-blocks: rich rules: [root@ProdigyX ~]# cat /usr/lib/firewalld/services/glusterfs.xml <?xml version="1.0" encoding="utf-8"?> <service> <short>glusterfs-static</short> <description>Default ports for gluster-distributed storage</description> <port protocol="tcp" port="24007"/> <!--For glusterd --> <port protocol="tcp" port="24008"/> <!--For glusterd RDMA port management --> <port protocol="tcp" port="24009"/> <!--For glustereventsd --> <port protocol="tcp" port="38465"/> <!--Gluster NFS service --> <port protocol="tcp" port="38466"/> <!--Gluster NFS service --> <port protocol="tcp" port="38467"/> <!--Gluster NFS service --> <port protocol="tcp" port="38468"/> <!--Gluster NFS service --> <port protocol="tcp" port="38469"/> <!--Gluster NFS service --> <port protocol="tcp" port="49152-49664"/> <!--512 ports for bricks --> <-------------------------------------- 49152 opened </service> [root@ProdigyX ~]# cat /var/log/messages | grep "10.250.1.1" | tail -10 <-------------------------------------- its seeking 49151 instead of 49152 Jan 30 13:42:11 ProdigyX kernel: STATE_INVALID_DROP: IN=bond1 OUT= MAC=26:bb:b5:40:75:92:68:05:ca:69:9e:fc:08:00 SRC=10.250.1.1 DST=10.250.1.2 LEN=60 TOS=0x00 PREC=0x00 TTL=64 ID=0 DF PROTO=TCP SPT=24007 DPT=49151 WINDOW=27438 RES=0x00 ACK SYN URGP=0 Jan 30 13:43:16 ProdigyX kernel: STATE_INVALID_DROP: IN=bond1 OUT= MAC=26:bb:b5:40:75:92:68:05:ca:69:9e:fc:08:00 SRC=10.250.1.1 DST=10.250.1.2 LEN=60 TOS=0x00 PREC=0x00 TTL=64 ID=0 DF PROTO=TCP SPT=24007 DPT=49151 WINDOW=27438 RES=0x00 ACK SYN URGP=0 Jan 30 13:43:17 ProdigyX kernel: STATE_INVALID_DROP: IN=bond1 OUT= MAC=26:bb:b5:40:75:92:68:05:ca:69:9e:fc:08:00 SRC=10.250.1.1 DST=10.250.1.2 LEN=60 TOS=0x00 PREC=0x00 TTL=64 ID=0 DF PROTO=TCP SPT=24007 DPT=49151 WINDOW=27438 RES=0x00 ACK SYN URGP=0 Jan 30 13:43:18 ProdigyX 
kernel: STATE_INVALID_DROP: IN=bond1 OUT= MAC=26:bb:b5:40:75:92:68:05:ca:69:9e:fc:08:00 SRC=10.250.1.1 DST=10.250.1.2 LEN=60 TOS=0x00 PREC=0x00 TTL=64 ID=0 DF PROTO=TCP SPT=24007 DPT=49151 WINDOW=27438 RES=0x00 ACK SYN URGP=0 Jan 30 13:43:19 ProdigyX kernel: STATE_INVALID_DROP: IN=bond1 OUT= MAC=26:bb:b5:40:75:92:68:05:ca:69:9e:fc:08:00 SRC=10.250.1.1 DST=10.250.1.2 LEN=60 TOS=0x00 PREC=0x00 TTL=64 ID=0 DF PROTO=TCP SPT=24007 DPT=49151 WINDOW=27438 RES=0x00 ACK SYN URGP=0 Jan 30 13:43:22 ProdigyX kernel: STATE_INVALID_DROP: IN=bond1 OUT= MAC=26:bb:b5:40:75:92:68:05:ca:69:9e:fc:08:00 SRC=10.250.1.1 DST=10.250.1.2 LEN=60 TOS=0x00 PREC=0x00 TTL=64 ID=0 DF PROTO=TCP SPT=24007 DPT=49151 WINDOW=27438 RES=0x00 ACK SYN URGP=0 Jan 30 13:43:23 ProdigyX kernel: STATE_INVALID_DROP: IN=bond1 OUT= MAC=26:bb:b5:40:75:92:68:05:ca:69:9e:fc:08:00 SRC=10.250.1.1 DST=10.250.1.2 LEN=60 TOS=0x00 PREC=0x00 TTL=64 ID=0 DF PROTO=TCP SPT=24007 DPT=49151 WINDOW=27438 RES=0x00 ACK SYN URGP=0 Jan 30 13:43:28 ProdigyX kernel: STATE_INVALID_DROP: IN=bond1 OUT= MAC=26:bb:b5:40:75:92:68:05:ca:69:9e:fc:08:00 SRC=10.250.1.1 DST=10.250.1.2 LEN=60 TOS=0x00 PREC=0x00 TTL=64 ID=0 DF PROTO=TCP SPT=24007 DPT=49151 WINDOW=27438 RES=0x00 ACK SYN URGP=0 Jan 30 13:43:31 ProdigyX kernel: STATE_INVALID_DROP: IN=bond1 OUT= MAC=26:bb:b5:40:75:92:68:05:ca:69:9e:fc:08:00 SRC=10.250.1.1 DST=10.250.1.2 LEN=60 TOS=0x00 PREC=0x00 TTL=64 ID=0 DF PROTO=TCP SPT=24007 DPT=49151 WINDOW=27438 RES=0x00 ACK SYN URGP=0 Jan 30 13:43:40 ProdigyX kernel: STATE_INVALID_DROP: IN=bond1 OUT= MAC=26:bb:b5:40:75:92:68:05:ca:69:9e:fc:08:00 SRC=10.250.1.1 DST=10.250.1.2 LEN=60 TOS=0x00 PREC=0x00 TTL=64 ID=0 DF PROTO=TCP SPT=24007 DPT=49151 WINDOW=27438 RES=0x00 ACK SYN URGP=0 [root@ProdigyX ~]# lsof -P | grep ':49151' glusterd 1164 root 13u IPv4 38496 0t0 TCP ProdigyX:49151->10.250.1.1:24007 (SYN_SENT) glusterd 1164 root 16u IPv4 19066 0t0 TCP ProdigyX:24007->10.250.1.1:49151 (ESTABLISHED) glusterti 1164 1165 root 13u IPv4 38496 0t0 TCP ProdigyX:49151->10.250.1.1:24007 (SYN_SENT) glusterti 1164 1165 root 16u IPv4 19066 0t0 TCP ProdigyX:24007->10.250.1.1:49151 (ESTABLISHED) glustersi 1164 1166 root 13u IPv4 38496 0t0 TCP ProdigyX:49151->10.250.1.1:24007 (SYN_SENT) glustersi 1164 1166 root 16u IPv4 19066 0t0 TCP ProdigyX:24007->10.250.1.1:49151 (ESTABLISHED) glusterme 1164 1167 root 13u IPv4 38496 0t0 TCP ProdigyX:49151->10.250.1.1:24007 (SYN_SENT) glusterme 1164 1167 root 16u IPv4 19066 0t0 TCP ProdigyX:24007->10.250.1.1:49151 (ESTABLISHED) glustersp 1164 1168 root 13u IPv4 38496 0t0 TCP ProdigyX:49151->10.250.1.1:24007 (SYN_SENT) glustersp 1164 1168 root 16u IPv4 19066 0t0 TCP ProdigyX:24007->10.250.1.1:49151 (ESTABLISHED) glustersp 1164 1169 root 13u IPv4 38496 0t0 TCP ProdigyX:49151->10.250.1.1:24007 (SYN_SENT) glustersp 1164 1169 root 16u IPv4 19066 0t0 TCP ProdigyX:24007->10.250.1.1:49151 (ESTABLISHED) glustergd 1164 1171 root 13u IPv4 38496 0t0 TCP ProdigyX:49151->10.250.1.1:24007 (SYN_SENT) glustergd 1164 1171 root 16u IPv4 19066 0t0 TCP ProdigyX:24007->10.250.1.1:49151 (ESTABLISHED) glusterep 1164 1172 root 13u IPv4 38496 0t0 TCP ProdigyX:49151->10.250.1.1:24007 (SYN_SENT) glusterep 1164 1172 root 16u IPv4 19066 0t0 TCP ProdigyX:24007->10.250.1.1:49151 (ESTABLISHED) [root@ProdigyX ~]# ss -tpl | grep 49151 [root@ProdigyX ~]# ss -tpl | grep 49152 LISTEN 0 10 *:49152 *:* users:(("glusterfsd",pid=1193,fd=11)) [root@ProdigyX ~]# gluster pool list UUID Hostname State xxx 10.250.1.1 Disconnected 
<-------------------------------------------------------------------------- other node
yyy localhost Connected
[root@ProdigyX ~]# sleep 60
[root@ProdigyX ~]# systemctl stop firewalld
[root@ProdigyX ~]# sleep 60
[root@ProdigyX ~]# gluster pool list
UUID Hostname State
xxx 10.250.1.1 Connected <----------------------------------------------------------------------------- sometimes it works up to this point; sometimes glusterd also needs to be restarted before it shows Connected
yyy localhost Connected
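(If the client/outbound range ever needs to be part of the service definition itself, a sketch of a local override; firewalld reads /etc/firewalld/services in preference to /usr/lib/firewalld/services, and the extra range here is an assumption based on the port-allocation explanation above, not an upstream recommendation:)
cp /usr/lib/firewalld/services/glusterfs.xml /etc/firewalld/services/glusterfs.xml
# then add inside <service>:
#   <port protocol="tcp" port="1024-49151"/>  <!-- insecure client / outbound glusterd ports -->
firewall-cmd --reload
firewall-cmd --permanent --info-service=glusterfs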