Bug 843817 - Volume status shows up, but RHS nodes are all offline
Summary: Volume status shows up, but RHS nodes are all offline
Keywords:
Status: CLOSED DUPLICATE of bug 844333
Alias: None
Product: Red Hat Gluster Storage
Classification: Red Hat Storage
Component: rhsc
Version: 2.0
Hardware: Unspecified
OS: Linux
Priority: medium
Severity: low
Target Milestone: ---
Target Release: ---
Assignee: Shireesh
QA Contact: Sudhir D
URL:
Whiteboard:
Depends On:
Blocks:
 
Reported: 2012-07-27 12:42 UTC by Paul Cuzner
Modified: 2013-07-28 23:18 UTC
CC List: 5 users

Fixed In Version:
Doc Type: Bug Fix
Doc Text:
Clone Of:
Environment:
Last Closed: 2012-11-16 08:52:38 UTC
Embargoed:


Attachments
Volume status shows up (112.06 KB, image/png) - 2012-07-27 12:42 UTC, Paul Cuzner
Node status when the volume is up (125.34 KB, image/png) - 2012-07-27 12:43 UTC, Paul Cuzner

Description Paul Cuzner 2012-07-27 12:42:49 UTC
Created attachment 600766 [details]
Volume status shows up

Description of problem:
The volume status in the console is shown as up - even when the nodes that provide the volume are not online.

Version-Release number of selected component (if applicable):
2.0

How reproducible:


Steps to Reproduce:
1. Create a cluster, nodes and a volume.
2. Shut down the nodes and rhs-c.
3. Restart rhs-c and log in.
4. View the node state, then view the volume state.
  
Actual results:
The volume shows as up, but the nodes are actually down - obviously not correct!

Expected results:
The volume state should always be accurate - if the nodes and bricks are offline, the volume should be shown as offline.

Additional info:
Attaching screenshots that illustrate the discrepancy.

Comment 1 Paul Cuzner 2012-07-27 12:43:28 UTC
Created attachment 600767 [details]
Node status when the volume is up

Comment 3 Shireesh 2012-09-25 14:29:20 UTC
The rhs-c engine not being able to talk to any of the nodes (node status 'Non Responsive') doesn't necessarily mean that they are not up - it could be a network issue. Hence we can't simply show the volume status as down.

So I think it's ok to show the last known status for volumes.

Comment 4 Paul Cuzner 2012-10-02 10:19:33 UTC
(In reply to comment #3)
> The rhs-c engine not being able to talk to any of the nodes (node status
> 'Non Responsive') doesn't necessarily mean that they are not up - it could
> be a network issue. Hence we can't simply show the volume status as down.
> 
> So I think it's ok to show the last known status for volumes.

I accept that the nodes showing as inoperable does not automatically mean the volume is unavailable - as long as vdsm and data access are on different subnets or NICs. I assume you're not checking this programmatically?

So perhaps this discrepancy between node and volume state should result in an "unknown" state against the volume (yellow symbol). At least this unknown state would highlight a potential issue to our customers' Operations teams, and may tie in to future plans for alert/monitoring.

Ignoring the status of the servers because of an unknown network topology is not the right answer. If you intend the GUI to show the last known status of the volume - you need to add a timestamp to provide a reference point for the state.
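
As a rough sketch of what I mean (hypothetical types only, not actual RHS-C/oVirt engine code - names like VolumeStatusResolver and VolumeDisplayStatus are made up for illustration): when no host in the cluster is reachable, report the volume as UNKNOWN together with the time its state was last confirmed, instead of repeating the stale "up" status:

import java.time.Instant;
import java.util.List;

enum HostStatus { UP, DOWN, NON_RESPONSIVE }
enum VolumeDisplayStatus { UP, DOWN, UNKNOWN }

class VolumeStatusView {
    final VolumeDisplayStatus status;
    final Instant lastConfirmed;   // when the engine last verified the state

    VolumeStatusView(VolumeDisplayStatus status, Instant lastConfirmed) {
        this.status = status;
        this.lastConfirmed = lastConfirmed;
    }
}

class VolumeStatusResolver {
    // If no host in the cluster is reachable, the engine cannot verify the
    // volume state, so report UNKNOWN plus the last-confirmed timestamp
    // instead of silently repeating the stale "up" status.
    VolumeStatusView resolve(VolumeDisplayStatus lastKnownStatus,
                             Instant lastConfirmed,
                             List<HostStatus> clusterHosts) {
        boolean anyHostReachable = clusterHosts.stream()
                .anyMatch(h -> h == HostStatus.UP);
        return anyHostReachable
                ? new VolumeStatusView(lastKnownStatus, lastConfirmed)
                : new VolumeStatusView(VolumeDisplayStatus.UNKNOWN, lastConfirmed);
    }
}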

Comment 5 Shireesh 2012-10-03 13:08:51 UTC
> So perhaps this discrepancy between node and volume state should result in
> an "unknown" state against the volume (yellow symbol). At least this unknown
> state would highlight a potential issue to our customers' Operations teams,
> and may tie in to future plans for alert/monitoring.

In such cases, the status of all the hosts in the cluster will anyway be 'Down' or 'Non-responsive'. Is that not enough indication for the admin that something needs to be done?

We could look at generating an alert in such cases. Also, how do you feel about showing the yellow/red symbol on the cluster itself instead of on all volumes, if none of its servers are accessible? (This is covered by one of the requirements in https://bugzilla.redhat.com/show_bug.cgi?id=844333)

> 
> Ignoring the status of the servers because of an unknown network topology is
> not the right answer. If you intend the GUI to show the last known status of
> the volume - you need to add a timestamp to provide a reference point for
> the state.

Comment 6 Paul Cuzner 2012-10-09 10:25:45 UTC
(In reply to comment #5)
> > So perhaps this discrepancy between node and volume state should result in
> > an "unknown" state against the volume (yellow symbol). At least this unknown
> > state would highlight a potential issue to our customers' Operations teams,
> > and may tie in to future plans for alert/monitoring.
> 
> In such cases, the status of all the hosts in the cluster will anyway be
> 'Down' or 'Non-responsive'. Is that not enough indication for the admin that
> something needs to be done?
> 
> We could look at generating an alert in such cases. Also, how do you feel
> about showing the yellow/red symbol on the cluster itself instead of on all
> volumes, if none of its servers are accessible? (This is covered by one of
> the requirements in https://bugzilla.redhat.com/show_bug.cgi?id=844333)
> 
> > 
> > Ignoring the status of the servers because of an unknown network topology is
> > not the right answer. If you intend the GUI to show the last known status of
> > the volume - you need to add a timestamp to provide a reference point for
> > the state.

If all the nodes are down, but the volume shows up - that's an inconsistent view from an administrator's standpoint, and will just add confusion. If we can be consistent - then we should be!

I believe that component errors should propagate to their "parents"; brick -> node -> volume -> cluster, for example. This allows for drill-down, and also allows events to be weighted - i.e. with a replicated volume, if a node is offline, the node would show red, but the cluster and volume would be yellow.
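
As a rough sketch of the roll-up I have in mind (hypothetical code only - Severity and StatusRollup are illustrative names, not anything that exists in RHS-C/oVirt): a failed child under a parent that has redundancy (e.g. a replicated volume) marks the parent degraded/yellow, and the parent only goes failed/red once every child has failed:

import java.util.List;

enum Severity { OK, DEGRADED, FAILED }   // green, yellow, red

final class StatusRollup {
    // Roll child severities up to their parent. One failed child under a
    // redundant parent (e.g. one brick of a replica pair, or one node of a
    // replicated volume) yields DEGRADED (yellow); the parent only becomes
    // FAILED (red) when every child has failed.
    static Severity rollUp(List<Severity> children, boolean parentHasRedundancy) {
        long failed = children.stream().filter(s -> s == Severity.FAILED).count();
        boolean anyDegraded = children.stream().anyMatch(s -> s == Severity.DEGRADED);

        if (failed == 0) {
            return anyDegraded ? Severity.DEGRADED : Severity.OK;
        }
        if (failed == children.size()) {
            return Severity.FAILED;
        }
        return parentHasRedundancy ? Severity.DEGRADED : Severity.FAILED;
    }
}

// For example, rollUp(List.of(Severity.FAILED, Severity.OK), true) yields
// DEGRADED, while rollUp(List.of(Severity.FAILED, Severity.FAILED), true)
// yields FAILED.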

Comment 7 Shireesh 2012-10-09 11:27:54 UTC
> If all the nodes are down, but the volume shows up - that's an inconsistent
> view from an administrator's standpoint, and will just add confusion. If we
> can be consistent - then we should be!

First of all, I think this is not a very common scenario (all nodes down), so I'm not sure it's worth the effort.

Second, we can show the status as 'unknown', but then what about all the other attributes of the volume? If none of the nodes are accessible for some reason and the bricks have changed (added/removed/replaced), that will not be reflected in the UI. The same applies to other properties such as options. So in spite of showing the status as 'unknown', the inconsistencies will still be there.

One option could be to completely hide the 'Volumes' tab itself if all nodes are down/non-responsive. The admin will only see the red symbol on the cluster and all the nodes being offline, and will understand why the volumes are not being displayed. What do you think?

> 
> I believe that component errors should propagate to their "parents" ; brick
> -> node -> volume -> cluster for example. This allows for drill down, and
> also allows events to be weighted - i.e. with a replicated volume, if a node
> is offline, the node would show red, but the cluster and volume would be
> yellow.

Comment 8 Paul Cuzner 2012-10-09 13:36:30 UTC
(In reply to comment #7)
> > If all the nodes are down, but the volume shows up - that's an inconsistent
> > view from an administrator's standpoint, and will just add confusion. If we
> > can be consistent - then we should be!
> 
> First of all, I think this is not a very common scenario (all nodes down),
> hence not sure if it's worth the effort.
> 
> Second, we can show 'status' as 'unknown', but then what about all other
> attributes of the volume? If all nodes are not accessible because of some
> reason, and the bricks have changed (added/removed/replaced), that will not
> be reflected in the UI. Same with other properties like options. So in spite
> of showing status as 'unknown', the inconsistencies will still be there.
> 
> One option could be to completely hide the 'volumes' tab itself if all nodes
> are down/non-responsive. Admin will only see the red symbol on the cluster,
> and also all the nodes being offline, and will understand why the volumes
> are not being displayed. What do you think?
> 
> > 
> > I believe that component errors should propagate to their "parents" ; brick
> > -> node -> volume -> cluster for example. This allows for drill down, and
> > also allows events to be weighted - i.e. with a replicated volume, if a node
> > is offline, the node would show red, but the cluster and volume would be
> > yellow.

Agreed - the specific error is not common and is unlikely to occur in the real world.

However, it does highlight:
a) that the status of objects in RHS-C is not completely linked
b) that alert propagation is either not implemented or has missed the volume/node relationship.

I'm not saying we need to focus on this specific event and code around it - but instead we need to ensure that error events are managed and related to their parent/child objects so the administrator gets an accurate view of the current state.

Is this method of managing errors in the plan? If not, does it sound like a reasonable approach to adopt?

PC

Comment 9 Shireesh 2012-10-09 15:13:47 UTC
> I'm not saying we need to focus on this specific event and code around it -
> but instead we need to ensure that error events are managed and related to
> their parent/child objects so the administrator gets an accurate view of the
> current state.

If so, can we treat this as a duplicate of the other bug you have created? (https://bugzilla.redhat.com/show_bug.cgi?id=844333)

> 
> Is this method of managing errors in the plan? If not, does it sound like a
> reasonable approach to adopt?

It is definitely a very good feature to have, but it requires framework-level enhancements in oVirt. We want to introduce it, but there are no concrete plans/timelines yet. We'll track it against bug 844333.

Comment 10 Shireesh 2012-11-16 08:52:38 UTC

*** This bug has been marked as a duplicate of bug 844333 ***

