Bug 1758923

Summary: [GSS] gluster volume heal info showing "Transport endpoint not connected"
Product: [Red Hat Storage] Red Hat Gluster Storage Reporter: Cal Calhoun <ccalhoun>
Component: replicate Assignee: Pranith Kumar K <pkarampu>
Status: CLOSED ERRATA QA Contact: Nag Pavan Chilakam <nchilaka>
Severity: urgent Docs Contact:
Priority: urgent    
Version: rhgs-3.4 CC: akrishna, amukherj, ksubrahm, moagrawa, mpandey, nchilaka, nravinas, olim, pkarampu, pprakash, ravishankar, rcarrier, rhs-bugs, saraut, sheggodu, skandark, storage-qa-internal
Target Milestone: --- Keywords: ZStream
Target Release: RHGS 3.5.z Batch Update 1   
Hardware: x86_64   
OS: Linux   
Whiteboard:
Fixed In Version: glusterfs-6.0-23 Doc Type: Bug Fix
Doc Text:
The 'gluster volume heal <volname> info' command always performed a DNS name resolution. When a DNS lookup took 30 seconds to fail, the command failed because it exceeded the timeout value set for gluster AFR volumes. With this fix, 'gluster volume heal <volname> info' no longer resolves the erroneous address and works correctly even when the DNS server configuration has issues.
Story Points: ---
Clone Of: Environment:
Last Closed: 2020-01-30 06:42:47 UTC Type: Bug
Regression: --- Mount Type: ---
Documentation: --- CRM:
Verified Versions: Category: ---
oVirt Team: --- RHEL 7.3 requirements from Atomic Host:
Cloudforms Team: --- Target Upstream Version:
Embargoed:
Bug Depends On: 1765017, 1793085, 1793096    
Bug Blocks:    
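
The Doc Text above attributes the failure to DNS lookups that take a long time to fail. The following is a minimal Python sketch for timing a single lookup against a time budget when triaging a node. It is not taken from the glusterfs fix; the function names, the budget parameter, and the hostname in the usage line are hypothetical.

```python
import socket
import time


def time_resolution(hostname, resolver=socket.getaddrinfo):
    """Return (elapsed_seconds, resolved_ok) for one lookup of hostname.

    resolver is injectable so the timing logic can be exercised without
    touching a real DNS server.
    """
    start = time.monotonic()
    try:
        resolver(hostname, None)
        ok = True
    except socket.gaierror:
        # Name resolution failed; the elapsed time is still of interest.
        ok = False
    return time.monotonic() - start, ok


def exceeds_budget(hostname, budget_seconds, resolver=socket.getaddrinfo):
    """True if a single lookup (successful or not) took longer than the budget."""
    elapsed, _ = time_resolution(hostname, resolver)
    return elapsed > budget_seconds
```

For example, exceeds_budget("node4.example.com", 5.0) would report whether one lookup of that (hypothetical) name took more than five seconds on the node where it is run.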

Description Cal Calhoun 2019-10-06 20:49:51 UTC
Description of problem:

  When running gluster volume heal <vol> info, several bricks are showing "Transport endpoint not connected"

Version-Release number of selected component (if applicable):

    RHEL: 7.2
    RHGS: glusterfs-3.12.2-47.el7rhgs.x86_64
          glusterfs-api-3.12.2-47.el7rhgs.x86_64
          glusterfs-cli-3.12.2-47.el7rhgs.x86_64
          glusterfs-client-xlators-3.12.2-47.el7rhgs.x86_64
          glusterfs-fuse-3.12.2-47.el7rhgs.x86_64
          glusterfs-geo-replication-3.12.2-47.el7rhgs.x86_64
          glusterfs-libs-3.12.2-47.el7rhgs.x86_64
          glusterfs-rdma-3.12.2-47.el7rhgs.x86_64
          glusterfs-server-3.12.2-47.el7rhgs.x86_64
  kernel: kernel-3.10.0-327.18.2.el7.x86_64

How reproducible:

  Ongoing

Steps to Reproduce:

  Run gluster volume heal <vol> info from node 7 or node 8; all bricks on node 4 show as "Transport endpoint not connected"
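
  When the heal info output is long, the affected bricks can be picked out with a small script. The sketch below assumes the output has already been captured to text and that each brick is reported as a "Brick <host>:<path>" line followed by a "Status:" line. The sample text and its hostnames and paths are illustrative, not output from this case.

```python
def disconnected_bricks(heal_info_text):
    """Return the bricks whose Status line mentions a transport endpoint error."""
    bricks = []
    current = None
    for raw in heal_info_text.splitlines():
        line = raw.strip()
        if line.startswith("Brick "):
            # Remember which brick the following Status line belongs to.
            current = line[len("Brick "):]
        elif line.startswith("Status:") and current is not None:
            if "transport endpoint" in line.lower():
                bricks.append(current)
    return bricks


# Illustrative sample only; hostnames and brick paths are hypothetical.
SAMPLE = """\
Brick node4:/bricks/b1
Status: Transport endpoint is not connected
Number of entries: -

Brick node7:/bricks/b1
Status: Connected
Number of entries: 0
"""

print(disconnected_bricks(SAMPLE))
```

  Matching is done on the substring "transport endpoint" in lower case so that small wording differences in the status message do not matter.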

Additional info:

  Client VMs running various applications are having trouble connecting to gluster volumes.  This is what originally presented as the problem.  After sequentially restarting the gluster nodes and checking for healing, the transport endpoint messages were noticed.

  During troubleshooting we performed the following:

  1. Initially noted that there were several bricks down.  Force restarted volumes and most bricks came back online.  Afterwards, for the most part, gluster volume status shows all bricks and self-heal daemons online.  There are a couple of outliers but most volumes appeared fine.

  2. We then tried stopping gluster services with systemctl stop glusterd; pkill glusterfs; pkill glusterfsd, followed by systemctl start glusterd, sequentially on each node.  Again, gluster volume status showed only a couple of bricks offline, but the transport messages continued on nodes 7 and 8 for all bricks on node 4.

  3. We then tried stopping glusterd on all nodes then starting back up sequentially.  No improvement in the transport messages.

  4. We noticed that op.version was set to 30712 (RHGS 3.1 update 3).  We had the customer set op.version to 31305.

  5. Asked the customer to check with their end users whether the applications were responding, but at the time this BZ is being opened we do not have a response yet.

  6. New gluster node sosreports (post changes) and at least one client sosreport from a client having difficulties have been requested.  Original sosreports are on collab-shell.  We will continue adding information to this BZ as it becomes available.
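
  The per-node restart in step 2 can be written down as an ordered plan so that each node is finished before the next one is touched. This is a dry-run sketch only: it uses the four commands quoted in step 2, prints them rather than executing anything, and the node names are hypothetical.

```python
# Commands exactly as quoted in step 2, in the order they were run.
RESTART_STEPS = [
    "systemctl stop glusterd",
    "pkill glusterfs",
    "pkill glusterfsd",
    "systemctl start glusterd",
]


def rolling_restart_plan(nodes):
    """Return (node, command) pairs, completing one node before the next."""
    return [(node, cmd) for node in nodes for cmd in RESTART_STEPS]


# Dry run: print the plan for two hypothetical nodes.
for node, cmd in rolling_restart_plan(["node1", "node2"]):
    print(f"{node}: {cmd}")
```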

At this point, we are not sure if the transport messages and the clients having trouble with gluster volumes are related.

Comment 45 Pranith Kumar K 2019-11-06 08:20:28 UTC
https://review.gluster.org/c/glusterfs/+/23606

Comment 54 Anjana KD 2020-01-29 17:32:13 UTC
Kindly verify the updated doc text in the Doc Text field.

Comment 57 errata-xmlrpc 2020-01-30 06:42:47 UTC
Since the problem described in this bug report should be
resolved in a recent advisory, it has been closed with a
resolution of ERRATA.

For information on the advisory, and where to find the updated
files, follow the link below.

If the solution does not work for you, open a new bug report.

https://access.redhat.com/errata/RHBA-2020:0288