Bug 1722209

Summary: [GSS] Issues accessing gluster mount / df -h hanging
Product: [Red Hat Storage] Red Hat Gluster Storage    Reporter: Cal Calhoun <ccalhoun>
Component: glusterfs    Assignee: Xavi Hernandez <jahernan>
Status: CLOSED ERRATA    QA Contact: Bala Konda Reddy M <bmekala>
Severity: urgent    Docs Contact:
Priority: urgent
Version: rhgs-3.4    CC: amanzane, amukherj, atoborek, bkunal, jahernan, mnapolis, moagrawa, nbalacha, nravinas, olim, pasik, rhs-bugs, sarora, sheggodu, skandark, spalai, srakonde, vbellur, vdas, vpandey
Target Milestone: ---    Flags: sankarshan: needinfo- (x3)
Target Release: RHGS 3.5.0   
Hardware: x86_64   
OS: Linux   
Whiteboard:
Fixed In Version: glusterfs-6.0-8 Doc Type: No Doc Update
Doc Text:
Story Points: ---
Clone Of: Environment:
Last Closed: 2019-10-30 12:22:00 UTC Type: Bug
Regression: --- Mount Type: ---
Documentation: --- CRM:
Verified Versions: Category: ---
oVirt Team: --- RHEL 7.3 requirements from Atomic Host:
Cloudforms Team: --- Target Upstream Version:
Embargoed:
Bug Depends On: 1727068    
Bug Blocks: 1696809    

Description Cal Calhoun 2019-06-19 17:09:39 UTC
Description of problem:
  Customer is seeing a gluster mount point becoming either unmounted or unresponsive.

  Initially it was becoming unmounted and, when they saw a glusterfsd crash, they made a couple of changes recommended by us in previous cases with similar issues:

  1) gluster volume set all cluster.op-version 31303
  2) gluster volume set DC03SP2 performance.parallel-readdir off
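
  The effect of these changes can be confirmed with the gluster CLI (a quick sketch; the volume name DC03SP2 is taken from the report above):

    # Show the cluster op-version currently in effect
    gluster volume get all cluster.op-version

    # Confirm that parallel-readdir is now disabled on the affected volume
    gluster volume get DC03SP2 performance.parallel-readdir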

  The next issue is that running df -h hangs.  They are able to run df -h individually for all other filesystems except the gluster mount.

  One of the nodes became inaccessible and required a reboot
  - putty and iRMC console just showed a blank screen after login
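
  When the FUSE mount hangs, the remaining filesystems can be checked without touching it by excluding the glusterfs type, and the mount itself can be probed with a timeout instead of a blocking df (a sketch; the mount point /mnt/DC03SP2 is an assumption, not taken from the case data):

    # List all filesystems except glusterfs FUSE mounts, avoiding the hang
    df -h -x fuse.glusterfs

    # Probe the gluster mount with a timeout rather than blocking indefinitely
    timeout 10 stat -f /mnt/DC03SP2 || echo "gluster mount appears hung"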

  They configured kdump on the gluster nodes and generated an NMI kdump from iRMC to provide information for this case.

  Note that these gluster nodes mount the volume locally and are used as the target for the customer's application.
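
  For context, a local FUSE mount of the volume on a gluster node typically looks like the following (a sketch; the mount point is an assumption, only the volume name DC03SP2 comes from the case):

    # Mount the volume locally over FUSE on the gluster node itself
    mount -t glusterfs localhost:/DC03SP2 /mnt/DC03SP2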

Version-Release number of selected component (if applicable):

  OS: 
    RHEL 7.6

  Kernel:
    kernel-3.10.0-957.1.3.el7.x86_64 

  Gluster:
    glusterfs-3.12.2-25.el7rhgs.x86_64
    glusterfs-api-3.12.2-25.el7rhgs.x86_64
    glusterfs-client-xlators-3.12.2-25.el7rhgs.x86_64
    glusterfs-cli-3.12.2-25.el7rhgs.x86_64
    glusterfs-events-3.12.2-25.el7rhgs.x86_64
    glusterfs-fuse-3.12.2-25.el7rhgs.x86_64
    glusterfs-geo-replication-3.12.2-25.el7rhgs.x86_64
    glusterfs-libs-3.12.2-25.el7rhgs.x86_64
    glusterfs-rdma-3.12.2-25.el7rhgs.x86_64
    glusterfs-server-3.12.2-25.el7rhgs.x86_64
    gluster-nagios-addons-0.2.10-2.el7rhgs.x86_64
    gluster-nagios-common-0.2.4-1.el7rhgs.noarch 

How reproducible:

  Happens roughly once every 24 hours on one or another node but can't be reproduced on demand.

Steps to Reproduce:

  N/A

Actual results:

  Mount point becomes unresponsive

Expected results:

  Mount point should remain responsive

Additional info:

  Supporting info extracted on collab shell.

  First two NMI cores are on optimus and the last one is being processed now.

Comment 13 Amar Tumballi 2019-07-01 11:20:06 UTC
Sameer,

> > [2019-06-12 17:47:10.135284] E [socket.c:2369:socket_connect_finish] 0-DC03SP2-client-68: connection to 192.168.88.209:24007 failed (No route to host); disconnecting socket
> > [2019-06-12 17:47:10.135298] E [socket.c:2369:socket_connect_finish] 0-DC03SP2-client-71: connection to 192.168.88.209:24007 failed (No route to host); disconnecting socket

'No route to host' errors generally indicate network issues. Since this hang is happening from one of the clients, please check whether DNS for all the gluster nodes resolves properly. If there is an issue in DNS resolution, the gluster processes can certainly end up in a hung state.
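
A quick way to verify this from the affected client is to check name resolution for every peer and confirm that glusterd's management port is reachable (a sketch; the hostnames are placeholders, while the IP and port 24007 come from the log above):

    # Check that each gluster node's hostname resolves from this client
    for host in node1.example.com node2.example.com node3.example.com; do
        getent hosts "$host" || echo "DNS resolution failed for $host"
    done

    # Confirm that glusterd's management port on the peer seen in the log is reachable
    timeout 5 bash -c 'cat < /dev/null > /dev/tcp/192.168.88.209/24007' && echo "port 24007 reachable"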

Comment 47 Xavi Hernandez 2019-10-24 07:23:43 UTC
*** Bug 1751666 has been marked as a duplicate of this bug. ***

Comment 49 errata-xmlrpc 2019-10-30 12:22:00 UTC
Since the problem described in this bug report should be
resolved in a recent advisory, it has been closed with a
resolution of ERRATA.

For information on the advisory, and where to find the updated
files, follow the link below.

If the solution does not work for you, open a new bug report.

https://access.redhat.com/errata/RHEA-2019:3249