The 'gluster volume heal <volname> info' command always performs DNS name resolution. Previously, when DNS resolution took 30 seconds to fail, the command itself failed because the delay exceeded the timeout value set for gluster AFR volumes.
With this fix, 'gluster volume heal <volname> info' does not try to resolve the erroneous address, and it works correctly even when the DNS server configuration has issues.
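The slow-lookup condition described above can be checked from a node with a small diagnostic. This is only a sketch and not part of the fix: the function names, the 5-second threshold, and the injectable resolver argument are assumptions made for illustration, and the hostnames to check would have to come from the volume's own brick list.

```python
import socket
import time


def time_resolution(host, resolver=socket.getaddrinfo):
    """Return (resolved, seconds) for a single lookup of `host`."""
    start = time.monotonic()
    try:
        resolver(host, None)
        resolved = True
    except OSError:  # socket.gaierror is a subclass of OSError
        resolved = False
    return resolved, time.monotonic() - start


def slow_hosts(hosts, threshold=5.0, resolver=socket.getaddrinfo):
    """List hosts whose lookup failed or took longer than `threshold` seconds.

    The threshold is an arbitrary illustrative value, not a gluster setting.
    """
    flagged = []
    for host in hosts:
        resolved, seconds = time_resolution(host, resolver)
        if not resolved or seconds > threshold:
            flagged.append((host, resolved, round(seconds, 1)))
    return flagged
```

A lookup that fails only after tens of seconds would be consistent with the behaviour described in this note, although it does not by itself prove that the heal info failure has the same cause.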
Description of problem:
When running gluster volume heal <vol> info, several bricks are showing "Transport endpoint not connected"
Version-Release number of selected component (if applicable):
RHEL: 7.2
RHGS: glusterfs-3.12.2-47.el7rhgs.x86_64
glusterfs-api-3.12.2-47.el7rhgs.x86_64
glusterfs-cli-3.12.2-47.el7rhgs.x86_64
glusterfs-client-xlators-3.12.2-47.el7rhgs.x86_64
glusterfs-fuse-3.12.2-47.el7rhgs.x86_64
glusterfs-geo-replication-3.12.2-47.el7rhgs.x86_64
glusterfs-libs-3.12.2-47.el7rhgs.x86_64
glusterfs-rdma-3.12.2-47.el7rhgs.x86_64
glusterfs-server-3.12.2-47.el7rhgs.x86_64
kernel: kernel-3.10.0-327.18.2.el7.x86_64
How reproducible:
Ongoing
Steps to Reproduce:
Run gluster volume heal <vol> info from node 7 or node 8. All bricks on node 4 show as "Transport endpoint not connected".
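To see at a glance which bricks are affected, the heal info output can be filtered for the disconnected status. The following is a minimal sketch that assumes the usual "Brick ..." / "Status: ..." layout of the command's text output. The sample text, hostnames, and brick paths are hypothetical and were not taken from this report.

```python
# Hypothetical sample of heal info output; hostnames and paths are made up.
SAMPLE = """\
Brick node4:/bricks/brick1/vol
Status: Transport endpoint not connected
Number of entries: -

Brick node7:/bricks/brick1/vol
Status: Connected
Number of entries: 0
"""


def disconnected_bricks(heal_info_text):
    """Return the bricks whose Status line reports a disconnected transport."""
    bricks = []
    current = None
    for line in heal_info_text.splitlines():
        line = line.strip()
        if line.startswith("Brick "):
            # Remember which brick the following Status line belongs to.
            current = line[len("Brick "):]
        elif line.startswith("Status:") and current is not None:
            status = line.split(":", 1)[1].strip()
            if status == "Transport endpoint not connected":
                bricks.append(current)
    return bricks


print(disconnected_bricks(SAMPLE))
```

Running the same filter against output captured on different nodes would make it easy to confirm that only nodes 7 and 8 report the node 4 bricks as disconnected.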
Additional info:
Client VMs running various applications are having trouble connecting to gluster volumes. This is what originally presented as the problem. After sequentially restarting the gluster nodes and checking for healing, the transport endpoint messages were noticed.
During troubleshooting we performed the following:
1. Initially noted that several bricks were down. Force restarted the volumes and most bricks came back online. Afterwards, gluster volume status showed nearly all bricks and self-heal daemons online. There were a couple of outliers, but most volumes appeared fine.
2. We then tried stopping gluster services with systemctl stop glusterd; pkill glusterfs; pkill glusterfsd followed by systemctl start glusterd, sequentially on each node. Again, gluster volume status showed only a couple of bricks offline, but the transport messages continued on nodes 7 and 8 for all bricks on node 4.
3. We then tried stopping glusterd on all nodes and then starting it back up sequentially. No improvement in the transport messages.
4. We noticed that op.version was set to 30712 (RHGS 3.1 update 3). We had the customer set op.version to 31305.
5. Requested the customer to check with their end-users to see if the applications were responding but at the time this BZ is being opened, we do not have a response yet.
6. New gluster node sosreports (post changes) and at least one client sosreport from a client having difficulties have been requested. Original sosreports are on collab-shell. We will continue adding information to this BZ as it becomes available.
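The per-node sequence from step 2 can be written out as an ordered plan so that nothing is skipped when it is repeated across nodes. This sketch only builds the list of commands quoted in step 2 and does not execute anything. The node names are hypothetical, and how the commands are actually run on each node (for example over ssh) is left open.

```python
# Commands exactly as quoted in step 2 of the troubleshooting notes.
STOP_COMMANDS = ["systemctl stop glusterd", "pkill glusterfs", "pkill glusterfsd"]
START_COMMAND = "systemctl start glusterd"


def rolling_restart_plan(nodes):
    """Return (node, command) pairs, finishing one node before the next starts."""
    plan = []
    for node in nodes:
        for command in STOP_COMMANDS:
            plan.append((node, command))
        plan.append((node, START_COMMAND))
    return plan


# Hypothetical node names, used only to show the ordering.
for node, command in rolling_restart_plan(["node1", "node2"]):
    print(f"{node}: {command}")
```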
At this point, we are not sure if the transport messages and the clients having trouble with gluster volumes are related.
Since the problem described in this bug report should be
resolved in a recent advisory, it has been closed with a
resolution of ERRATA.
For information on the advisory, and where to find the updated
files, follow the link below.
If the solution does not work for you, open a new bug report.
https://access.redhat.com/errata/RHBA-2020:0288