Bug 1566161

Summary: Brick reporting limited
Product: [Red Hat Storage] Red Hat Gluster Storage Reporter: Ju Lim <julim>
Component: glusterd Assignee: Atin Mukherjee <amukherj>
Status: CLOSED NOTABUG QA Contact: Bala Konda Reddy M <bmekala>
Severity: medium Docs Contact:
Priority: unspecified    
Version: rhgs-3.4 CC: rhs-bugs, storage-qa-internal, vbellur
Target Milestone: ---   
Target Release: ---   
Hardware: All   
OS: Linux   
Whiteboard:
Fixed In Version: Doc Type: If docs needed, set a value
Doc Text:
Story Points: ---
Clone Of: Environment:
Last Closed: 2018-04-12 14:04:49 UTC Type: Bug
Regression: --- Mount Type: ---
Documentation: --- CRM:
Verified Versions: Category: ---
oVirt Team: --- RHEL 7.3 requirements from Atomic Host:
Cloudforms Team: --- Target Upstream Version:
Embargoed:

Description Ju Lim 2018-04-11 16:06:36 UTC
Description of problem:
Currently, there are only 2 states, i.e. it's running or it's not. There's no way to distinguish stopped vs. down vs. hung.

Web Administration is reporting a brick that is not running as Stopped or Down.
gstatus says DOWN or Down/Unavailable.

There are inconsistencies in how we report this, but more importantly the user has no idea today whether a brick that is not running is stopped, hung, or in some other state.

From a user perspective, if it's stopped, it usually means that a user had to stop it, so it may not require intervention. However, if it's down, hung, or not running for some other reason, there was likely an unplanned incident that warrants the user's attention.

Additional info:
See also https://github.com/Tendrl/ui/issues/920.

Comment 2 Atin Mukherjee 2018-04-12 14:04:49 UTC
(In reply to Ju Lim from comment #0)
> Description of problem:
> Currently, there are only 2 states, i.e. it's running or it's not.
> There's no way to distinguish stopped vs. down vs. hung.
> 
> Web Administration is reporting a brick that is not running as Stopped or
> Down.
> gstatus says DOWN or Down/Unavailable.
> 
> There are inconsistencies in how we report this, but more importantly the user
> has no idea today whether a brick that is not running is stopped, hung, or
> in some other state.

Let me start with the output of gluster volume status and then explain it:

root@d75059e3585a:/mnt/test-vol-mnt1# gluster v status
Status of volume: test-vol1
Gluster process                             TCP Port  RDMA Port  Online  Pid
------------------------------------------------------------------------------
Brick 172.17.0.2:/tmp/b1                    49153     0          Y       22078
Brick 172.17.0.3:/tmp/b1                    49152     0          Y       13512
Brick 172.17.0.4:/tmp/b1                    49152     0          Y       31274
Self-heal Daemon on localhost               N/A       N/A        Y       22068
Self-heal Daemon on 172.17.0.3              N/A       N/A        Y       13502
Self-heal Daemon on 172.17.0.4              N/A       N/A        Y       31264

From glusterd's perspective, a brick's status is tracked internally as either STARTED or STOPPED. The CLI interprets this as Online = Y when the brick is in the started state and N/A when the brick is in the stopped state. The gluster get-state dump captures this detail in the form:

Volume<X>.Brick<Y>.status: Started/Stopped

The gluster-integration component in Tendrl then consumes these attributes and presents them in the UI.
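
A consumer of the state dump could pick out these attributes with a small parser. The following Python sketch is illustrative only, not the gluster-integration code: the function name parse_brick_status and the sample text are made up for the example, and only the Volume<X>.Brick<Y>.status key shape is taken from the line above.

```python
import re

# Matches lines such as "Volume1.Brick2.status: Started".
# Only this key shape comes from the comment above; the rest is illustrative.
BRICK_STATUS_RE = re.compile(r"^Volume(\d+)\.Brick(\d+)\.status:\s*(\S+)$")


def parse_brick_status(dump_text):
    """Return {(volume_index, brick_index): status} found in state-dump text."""
    statuses = {}
    for line in dump_text.splitlines():
        match = BRICK_STATUS_RE.match(line.strip())
        if match:
            volume, brick, status = match.groups()
            statuses[(int(volume), int(brick))] = status
    return statuses


# Hypothetical sample input, not captured from a real cluster.
sample = "Volume1.Brick1.status: Started\nVolume1.Brick2.status: Stopped\n"
print(parse_brick_status(sample))
```

Lines that do not match the pattern are ignored, so the same function can be fed a complete dump.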

So in Gluster's design a stopped brick is exactly the same as a brick that is not running. If we really want to differentiate between these two states, we need to check both the volume status and the brick status. If the volume status is started and the brick is N/A, the brick is down; if the volume status is stopped and the brick is N/A, the brick was stopped manually as part of stopping the volume. This kind of aggregation has to happen at the higher layer, not at glusterd.

Please note that, from a gluster administration perspective, the entity on which an admin can perform operations in this case is the volume. Apart from specific recovery procedures, there is no way to stop a particular brick of a volume other than killing the brick process, which from glusterd's perspective is exactly the same as the volume being stopped (which in turn stops the bricks). An RPC disconnect between glusterd and the brick likewise marks the brick as stopped. Also, if a brick process is stuck or hung, glusterd has no way of knowing unless the brick process explicitly sends an event, as there is no ping-pong mechanism between glusterd and the brick process.
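
The volume-plus-brick aggregation described above could be sketched at the higher layer roughly as follows. This is a minimal illustration under stated assumptions, not Tendrl's actual logic: the function name classify_brick and the labels "running", "down" and "stopped" are invented for the example. A hung brick cannot be identified this way, because glusterd has no means of noticing it.

```python
def classify_brick(volume_started, brick_online):
    """Combine volume state and brick state into one label.

    volume_started: True if the volume status is Started.
    brick_online:   True if the brick is reported as started/online.
    The returned labels are hypothetical names chosen for this sketch.
    """
    if brick_online:
        return "running"
    if volume_started:
        # Volume is started but the brick is not: unplanned, needs attention.
        return "down"
    # Volume itself is stopped, so the brick went down with it.
    return "stopped"


print(classify_brick(volume_started=True, brick_online=False))
print(classify_brick(volume_started=False, brick_online=False))
```

Keeping the rule in a separate function lets the reporting layer apply it to every brick without changing anything in glusterd.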

With the above explanation, I'm closing this bug as there's nothing we can do at the glusterd layer.
