Bug 1254514

Summary: gstatus: Status message doesn't show the storage node name which is down
Product: [Red Hat Storage] Red Hat Gluster Storage Reporter: Anil Shah <ashah>
Component: gstatus    Assignee: Sachidananda Urs <surs>
Status: CLOSED ERRATA QA Contact: Anil Shah <ashah>
Severity: urgent Docs Contact:
Priority: unspecified    
Version: rhgs-3.1    CC: asrivast, byarlaga, surs, vagarwal
Target Milestone: ---    Keywords: ZStream
Target Release: RHGS 3.1.1   
Hardware: x86_64   
OS: Linux   
Whiteboard:
Fixed In Version: gstatus-0.65-1 Doc Type: Bug Fix
Doc Text:
Story Points: ---
Clone Of: Environment:
Last Closed: 2015-10-05 07:23:49 UTC Type: Bug
Regression: --- Mount Type: ---
Documentation: --- CRM:
Verified Versions: Category: ---
oVirt Team: --- RHEL 7.3 requirements from Atomic Host:
Cloudforms Team: --- Target Upstream Version:
Embargoed:
Bug Depends On:    
Bug Blocks: 1251815    

Description Anil Shah 2015-08-18 10:04:44 UTC
Description of problem:

When one of the storage nodes of the cluster is down, running the gstatus command doesn't show the name of the node that is down in the Status message.

Version-Release number of selected component (if applicable):

[root@localhost ~]# gstatus --version
gstatus 0.64

[root@localhost ~]# rpm -qa | grep glusterfs
glusterfs-api-3.7.1-11.el7rhgs.x86_64
glusterfs-cli-3.7.1-11.el7rhgs.x86_64
glusterfs-libs-3.7.1-11.el7rhgs.x86_64
glusterfs-client-xlators-3.7.1-11.el7rhgs.x86_64
glusterfs-server-3.7.1-11.el7rhgs.x86_64
glusterfs-rdma-3.7.1-11.el7rhgs.x86_64
glusterfs-3.7.1-11.el7rhgs.x86_64
glusterfs-fuse-3.7.1-11.el7rhgs.x86_64
glusterfs-geo-replication-3.7.1-11.el7rhgs.x86_64


How reproducible:

100%

Steps to Reproduce:

1. Create a 6x2 distributed-replicate volume
2. Mount the volume on a client over FUSE
3. Bring down one of the storage nodes and check gstatus, e.g. gstatus -a (see the command sketch after these steps)
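
The commands below are a minimal sketch of these steps, assuming a four-node trusted pool; the hostnames (node1..node4), brick paths, and mount point are placeholders rather than the ones used in this report.

# 1. Create and start a 6x2 distributed-replicate volume (12 bricks, replica 2;
#    consecutive bricks form the replica pairs)
gluster volume create testvol replica 2 \
    node1:/rhs/brick1/b1 node2:/rhs/brick1/b2 \
    node3:/rhs/brick1/b3 node4:/rhs/brick1/b4 \
    node1:/rhs/brick2/b5 node2:/rhs/brick2/b6 \
    node3:/rhs/brick2/b7 node4:/rhs/brick2/b8 \
    node1:/rhs/brick3/b9 node2:/rhs/brick3/b10 \
    node3:/rhs/brick3/b11 node4:/rhs/brick3/b12
gluster volume start testvol

# 2. Mount the volume over FUSE on the client
mount -t glusterfs node1:/testvol /mnt/testvol

# 3. Bring one storage node down (shut the node down, or stop glusterd and
#    the brick processes on it), then check the status from another node
gstatus -a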

Actual results:
The Status message doesn't show the name of the storage node which is down.

[root@knightandday ~]# gstatus -a
 
     Product: RHGS vserver3.1    Capacity: 119.00 GiB(raw bricks)
      Status: UNHEALTHY(13)                198.00 MiB(raw used)
   Glusterfs: 3.7.1                         50.00 GiB(usable from volumes)
  OverCommit: Yes               Snapshots:   1

   Nodes       :  2/  4		  Volumes:   0 Up
   Self Heal   :  2/  4		             0 Up(Degraded)
   Bricks      :  6/ 12		             1 Up(Partial)
   Connections :  0/   0                     0 Down

Volume Information
	testvol          UP(PARTIAL) - 6/12 bricks up - Distributed-Replicate
	                 Capacity: (0% used) 99.00 MiB/50.00 GiB (used/total)
	                 Snapshots: 1
	                 Self Heal:  6/12
	                 Tasks Active: None
	                 Protocols: glusterfs:on  NFS:on  SMB:on
	                 Gluster Connectivty: 0 hosts, 0 tcp connections


Status Messages
  - Cluster is UNHEALTHY
  - Volume 'testvol' is in a PARTIAL state, some data is inaccessible data, due to missing bricks
  - WARNING -> Write requests may fail against volume 'testvol'
  - Cluster node '' is down
  - Self heal daemon is down on 
  - Cluster node '' is down
  - Self heal daemon is down on 
  - Brick 10.70.47.3:/rhs/brick3/b12 in volume 'testvol' is down/unavailable
  - Brick 10.70.47.2:/rhs/brick3/b11 in volume 'testvol' is down/unavailable
  - Brick 10.70.47.3:/rhs/brick2/b8 in volume 'testvol' is down/unavailable
  - Brick 10.70.47.2:/rhs/brick2/b7 in volume 'testvol' is down/unavailable
  - Brick 10.70.47.2:/rhs/brick1/b3 in volume 'testvol' is down/unavailable
  - Brick 10.70.47.3:/rhs/brick1/b4 in volume 'testvol' is down/unavailable
  - INFO -> Not all bricks are online, so capacity provided is NOT accurate



Expected results:

The Status message should display the name of the storage node which is down.

Additional info:

Comment 3 Sachidananda Urs 2015-08-27 07:26:20 UTC
After discussions with Anil, it was decided to remove the per-node self-heal status messages and instead report the number of nodes that are down/up.
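
For illustration, a minimal shell sketch of the wording agreed on above (a singular message when exactly one node is down, a count otherwise); gstatus itself is written in Python, so this is only a sketch of the message logic, not the tool's actual code.

down_nodes=2   # placeholder: number of peers reported as down
if [ "$down_nodes" -eq 1 ]; then
    echo "  - One of the nodes in the cluster is down"
elif [ "$down_nodes" -gt 1 ]; then
    echo "  - $down_nodes nodes in the cluster are down"
fi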

Sample output 1:

Status Messages
  - Cluster is UNHEALTHY
  - One of the nodes in the cluster is down
  - Brick 10.70.47.129:/gluster/brick1 in volume 'glustervol' is down/unavailable
  - INFO -> Not all bricks are online, so capacity provided is NOT accurate


Sample output 2:

Status Messages
  - Cluster is UNHEALTHY
  - Volume 'glustervol' is in a PARTIAL state, some data is inaccessible data, due to missing bricks
  - WARNING -> Write requests may fail against volume 'glustervol'
  - 2 nodes in the cluster are down
  - Brick 10.70.46.185:/gluster/brick1 in volume 'glustervol' is down/unavailable
  - Brick 10.70.47.129:/gluster/brick1 in volume 'glustervol' is down/unavailable
  - INFO -> Not all bricks are online, so capacity provided is NOT accurate

Comment 4 Anil Shah 2015-09-02 12:32:17 UTC
[root@rhs-client46 yum.repos.d]# gstatus -a
 
     Product: RHGS Server v3.1   Capacity:   2.70 TiB(raw bricks)
      Status: UNHEALTHY(4)                  67.00 MiB(raw used)
   Glusterfs: 3.7.1                          2.70 TiB(usable from volumes)
  OverCommit: No                Snapshots:   0

   Nodes       :  2/  4		  Volumes:   0 Up
   Self Heal   :  2/  4		             1 Up(Degraded)
   Bricks      :  2/  4		             0 Up(Partial)
   Connections :  4/  16                     0 Down

Volume Information
	vol0             UP(DEGRADED) - 2/4 bricks up - Distributed-Replicate
	                 Capacity: (0% used) 67.00 MiB/2.70 TiB (used/total)
	                 Snapshots: 0
	                 Self Heal:  2/ 4
	                 Tasks Active: None
	                 Protocols: glusterfs:on  NFS:on  SMB:on
	                 Gluster Connectivty: 4 hosts, 16 tcp connections


Status Messages
  - Cluster is UNHEALTHY
  - 2 nodes in the cluster are down
  - Brick 10.70.36.71:/rhs/brick1/b02 in volume 'vol0' is down/unavailable
  - Brick 10.70.36.46:/rhs/brick1/b03 in volume 'vol0' is down/unavailable
  - INFO -> Not all bricks are online, so capacity provided is NOT accurate


Bug verified on build glusterfs-3.7.1-14.el7rhgs.x86_64

[root@rhs-client46 yum.repos.d]# gstatus --version
gstatus 0.65

Comment 6 errata-xmlrpc 2015-10-05 07:23:49 UTC
Since the problem described in this bug report should be
resolved in a recent advisory, it has been closed with a
resolution of ERRATA.

For information on the advisory, and where to find the updated
files, follow the link below.

If the solution does not work for you, open a new bug report.

https://rhn.redhat.com/errata/RHSA-2015-1845.html