1356216 – cluster alert summary on main dashboard provides invalid data - design issues

Keywords:
Status:	CLOSED WONTFIX
Alias:	None
Product:	Red Hat Storage Console
Classification:	Red Hat Storage
Component:	core
Sub Component:
Version:	2
Hardware:	Unspecified
OS:	Unspecified
Priority:	unspecified
Severity:	medium
Target Milestone:	---
Target Release:	3
Assignee:	sankarshan
QA Contact:	sds-qe-bugs
Docs Contact:
URL:
Whiteboard:
Depends On:
Blocks:

Reported:	2016-07-13 17:48 UTC by Martin Bukatovic
Modified:	2018-11-19 05:40 UTC
CC List:	4 users
Fixed In Version:
Doc Type:
Doc Text:
Clone Of:
Environment:
Last Closed:	2018-11-19 05:40:12 UTC
Embargoed:

Attachments
screenshot 1: case one - main dashboard (79.16 KB, image/png) 2016-07-13 17:51 UTC, Martin Bukatovic
screenshot 2: case one - cluster list (19.45 KB, image/png) 2016-07-13 17:52 UTC, Martin Bukatovic
screenshot 3: case one - filtered event list (50.45 KB, image/png) 2016-07-13 17:54 UTC, Martin Bukatovic
screenshot 4: case two - main dashboard (76.35 KB, image/png) 2016-07-13 17:56 UTC, Martin Bukatovic
screenshot 5: case two - filtered event list (50.13 KB, image/png) 2016-07-13 17:57 UTC, Martin Bukatovic

Links
System	ID	Private	Priority	Status	Summary	Last Updated
Red Hat Bugzilla	1359103	0	unspecified	CLOSED	Clusters widget on main dashboard doesn't report cluster in a warning state	2021-02-22 00:41:40 UTC

Internal Links: 1359103

Description Martin Bukatovic 2016-07-13 17:48:10 UTC

Description of problem
======================

When there are some warning/critical events, the cluster alert counter box on the
main dashboard may show incorrect values, which conflict with the alert counters on
the cluster list page and in the event list.

I'm not 100% sure how to reproduce this issue. But since the dashboard-related
features are considered a high priority now, I'm providing my current evidence
even though it's without a proper reproducer. This way, the QE team will know that
it's necessary to retest this with extra care, trying to find a reproducer
with future dev freeze builds.

Version-Release
===============

On RHSC 2.0 server machine:

rhscon-core-selinux-0.0.28-1.el7scon.noarch
rhscon-core-0.0.28-1.el7scon.x86_64
rhscon-ui-0.0.42-1.el7scon.noarch
rhscon-ceph-0.0.27-1.el7scon.x86_64
ceph-installer-1.0.12-3.el7scon.noarch
ceph-ansible-1.0.5-23.el7scon.noarch

How reproducible
================

I don't know.

Steps to Reproduce
==================

I'm not 100% sure.

1. Install RHSC 2.0 following the documentation.
2. Accept a few nodes for the ceph cluster.
3. Create new ceph cluster named 'alpha'.
4. Create 2 RBDs (along with new backing pool each time) in the cluster.
5. Break something so that you will have a few warning and/or critical events
   (this step needs to be specified better).
6. Check Clusters overview (a box titled "1 Clusters") on the Main Dashboard.

note for step 5:
I noticed this when I was reproducing BZ 1355723 and/or BZ 1354603

Actual results
==============

I hit this issue 2 times (on 2 different clusters, using the same builds).

case one
--------

On the Main Dashboard, there is a box entitled "1 Clusters", reporting:
"2 active alerts" next to the warning icon (pficon-warning-triangle-o),
see screenshot #1.

When I click on the link there (the value, number 2 itself), I get to the
Clusters list page (see screenshot #2), which is filtered by:

 * alarmstatus: critical
 * alarmstatus: major

In this list there is 1 cluster (in a warning state), which however reports
4 alerts next to a red error icon (pficon-error-circle-o).

When I click on the link there (the value, number 4 itself), I get to the
Events list page (see screenshot #3), which is filtered by:

 * cluster: alpha
 * severity: critical & warning
 * status: active

In the list there, there are 4 events in total, 2 are warning, 2 are critical.

This means that Main Dashboard alert status conflicts with alert status from
Cluster list, and with Events list (which is filtered to show active, critical
and warning events only).
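
The Events list filter from case one can be sketched as follows. This is only
an illustration, not the console's code: the `Event` record, its field names,
the status values and the sample data are assumptions chosen to mirror
screenshot #3, and the info-level and cleared events are hypothetical additions
that show what the filter excludes.

```python
from collections import Counter
from dataclasses import dataclass


@dataclass(frozen=True)
class Event:
    cluster: str
    severity: str  # e.g. "critical", "warning", "info"
    status: str    # e.g. "active", "cleared" (hypothetical values)


def filter_events(events, cluster, severities, status):
    """Keep events matching the cluster, one of the severities, and the status."""
    return [
        e for e in events
        if e.cluster == cluster and e.severity in severities and e.status == status
    ]


# Hypothetical data mirroring case one: 2 warning + 2 critical active events.
events = [
    Event("alpha", "critical", "active"),
    Event("alpha", "critical", "active"),
    Event("alpha", "warning", "active"),
    Event("alpha", "warning", "active"),
    Event("alpha", "info", "active"),      # excluded by severity
    Event("alpha", "warning", "cleared"),  # excluded by status
]

shown = filter_events(events, "alpha", {"critical", "warning"}, "active")
by_severity = Counter(e.severity for e in shown)
print(len(shown), dict(by_severity))
```

With this sample data the filter keeps 4 events (2 critical, 2 warning), which
is the total that the cluster list counter reports in case one.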

case two
--------

The same use case, but with different discrepancies.

Screenshot #4 shows Main Dashboard with 1 critical and 4 warning alerts, but
in the Event list (screenshot #5), I see 4 critical events.

Expected results
================

Main dashboard cluster status data should be aligned with event/alert data
provided elsewhere in the console, such as:

 * cluster item in the list of clusters
 * Events list page linked from cluster item (previous one)

Comment 1 Martin Bukatovic 2016-07-13 17:51:39 UTC

Created attachment 1179348
screenshot 1: case one - main dashboard

Comment 2 Martin Bukatovic 2016-07-13 17:52:54 UTC

Created attachment 1179349
screenshot 2: case one - cluster list

Comment 3 Martin Bukatovic 2016-07-13 17:54:46 UTC

Created attachment 1179350
screenshot 3: case one - filtered event list

Comment 4 Martin Bukatovic 2016-07-13 17:56:05 UTC

Created attachment 1179351
screenshot 4: case two - main dashboard

Comment 5 Martin Bukatovic 2016-07-13 17:57:15 UTC

Created attachment 1179352
screenshot 5: case two - filtered event list

Comment 6 Nishanth Thomas 2016-07-14 12:50:12 UTC

A clear reproducer is required for this bug.

Comment 7 Darshan 2016-07-14 13:16:30 UTC

This behavior is as per the Design provided by: https://docs.google.com/presentation/d/1E7ZHHMYufugMjuVceluP7FCfUM9CQNsN5QWmGruRth0/edit#slide=id.gc5e5a5c3c_0_12

To elaborate, the behavior is as follows:

1. In the main dashboard, the number beside "X" (critical icon) indicates the number of clusters whose status is error. Please refer to slides 15 & 16.

2. In the main dashboard, the number beside "!" (warning icon) indicates the number of major/critical (not warning or minor) alarms in all clusters. Please refer to slides 15 & 16.

3. In the cluster list view, the last column (alerts) shows the number of all alerts in the cluster (here, any severity other than cleared (info)). Please refer to slide 21.

Please provide your thoughts.
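
The three counters described in this comment can be sketched as below. This is
a rough model under stated assumptions, not the product's implementation: the
`Cluster` record, the function names and the status values are hypothetical,
and the sample data assumes that the 4 events from case one (2 critical,
2 warning) map one-to-one onto alarms.

```python
from dataclasses import dataclass, field


@dataclass
class Cluster:
    name: str
    status: str  # "ok", "warning" or "error" (assumed values)
    alarm_severities: list = field(default_factory=list)


def dashboard_error_clusters(clusters):
    """Item 1: number of clusters whose status is error."""
    return sum(1 for c in clusters if c.status == "error")


def dashboard_active_alerts(clusters):
    """Item 2: major/critical alarms across all clusters."""
    return sum(
        1
        for c in clusters
        for s in c.alarm_severities
        if s in {"major", "critical"}
    )


def cluster_list_alerts(cluster):
    """Item 3: every alarm on the cluster that is not cleared (info)."""
    return sum(1 for s in cluster.alarm_severities if s != "info")


# Hypothetical model of case one: a cluster in a warning state.
alpha = Cluster("alpha", "warning", ["critical", "critical", "warning", "warning"])

print(dashboard_error_clusters([alpha]))
print(dashboard_active_alerts([alpha]))
print(cluster_list_alerts(alpha))
```

Under these assumptions the dashboard alert counter is 2 while the cluster list
counter is 4, which matches the numbers in case one: the two counters follow
different rules, so they are not expected to agree.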

Comment 8 Martin Bukatovic 2016-07-15 10:46:13 UTC

(In reply to Darshan from comment #7)
> This behavior is as per the Design provided by:
> https://docs.google.com/presentation/d/1E7ZHHMYufugMjuVceluP7FCfUM9CQNsN5QWmGruRth0/edit#slide=id.gc5e5a5c3c_0_12

Good catch, thanks for pointing this out. I should have definitely checked
this out before creating this BZ - as it turns out that the status on the
dashboard is invalid anyway, but in a different way compared to my original
description in the BZ (which was not based on the design document description). 

Which means that QE team still needs to recheck with the latest builds later
anyway. See my quick reply with details inline.

> To elaborate the behavior is as follows:
> 
> 1. In main dashboard, the number beside "X" (critical icon) indicates the
> number of cluster whose status is error. Please refer slide 15 & 16

Ok, so the document states that there should be:

> count of clusters with Error or Warning status

not just Error as you just mentioned.

In the screenshot #4 (of case two), I see this icon with number 1 next to it,
which is correct according to your description and the design document. So ok.

But in the first case, on screenshot #1, I don't see this Cluster Status
anywhere, even though the cluster is in a warning state (as can be
seen on screenshot #2).

So for this reason, I would consider "Cluster Status" counter still broken.

> 2. In main dashboard, the number beside "!" (warning icon) indicates the
> number of major/critical (not warning and minor) alarms in all clusters.
> Please refer slide 15 & 16.

You are right. Even the tooltip text (visible when one hovers the cursor over
the alert counter) states "2 active alerts".

So far, so good.

But I have a question, why do I get to the list of clusters when I click
on this alarm status counter? The design document which you refer states that
I should get this:

> Filtered event view showing major and critical alerts across all clusters.

But I got a list of clusters instead. Which is why I was confused about the
meaning of this counter and ignored the meaning in the tooltip.

So based on this, the issue here is that the link of the alarm status counter
points to an incorrect page, which confuses the meaning of the counter even
though the counter itself (and its tooltip) report correct data.

> 3. In cluster list view, the last column (alerts) shows the number of all
> the alerts in the cluster(here any severity other than cleared(info)).
> Please refer slide 21.

I'm not sure I understand this slide correctly, as I see both types of icons in
the example here:

 * 1st cluster reports 5 alerts next to a warning icon
 * 4th cluster reports 5 alerts next to an error icon
 * 6th cluster reports 5 alerts without using any icon at all

Moreover, I don't think it's a good idea to use icons for both alert types on
one page, but for only one particular type on another page.

So it seems that we need to ask the design team to check my concerns here.

Comment 9 Martin Bukatovic 2016-07-15 10:48:17 UTC

Ju and Matt, could you check my concern about the alert counter on the cluster
list page, as described in the last part of comment 8?

Comment 10 Ju Lim 2016-07-19 13:44:02 UTC

In reviewing this bug, I see 2 things raised:
(1) Why does the drilldown from the dashboard go to the cluster list vs. an event list?
The implementation is as agreed upon (i.e. going to the cluster list). This was a decision we made based on a bug in the past. Putting our “user” hat on, the rationale was that the user would want to see the related alerts, but would still have to look at issues on a cluster-by-cluster basis. Hence we ended up drilling down to the cluster list.

(2) The alert indicator/count of the cluster is misleading, as it aggregates all the critical and error alerts together. Part of the problem is the icon: we're overloading it to mean a single severity level (in the Dashboard and other places), but in the list view it shows an aggregation of multiple severity levels. To fix: we either provide a new icon to cover the roll-up or aggregation, OR limit the alert indicator to show only the most severe level (but then it potentially leaves out other severity levels, which is not ideal).
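
The two options in (2) can be compared with a small sketch. Everything here is
hypothetical: the function names, the "rollup" icon label and in particular the
severity ordering are assumptions made for illustration, not taken from the
design document.

```python
from collections import Counter

# Assumed ordering, most severe first; the relative rank of "minor" and
# "warning" is a guess.
SEVERITY_ORDER = ["critical", "major", "minor", "warning"]


def rollup_indicator(severities):
    """Option A: one aggregate count, paired with a dedicated roll-up icon."""
    return "rollup", sum(1 for s in severities if s in SEVERITY_ORDER)


def most_severe_indicator(severities):
    """Option B: show only the most severe level present and its count."""
    counts = Counter(severities)
    for level in SEVERITY_ORDER:
        if counts[level]:
            return level, counts[level]
    return None  # no uncleared alerts


alerts = ["critical", "critical", "warning", "warning"]
print(rollup_indicator(alerts))
print(most_severe_indicator(alerts))
```

For this sample, option A reports all 4 alerts under a neutral icon, while
option B reports 2 critical alerts and hides the 2 warnings, which is the
drawback noted above.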

Comment 11 Ju Lim 2016-07-19 15:23:52 UTC

Regarding (2) above, the icon accompanying the # Alerts count in the Cluster List (and Host List) pages: I'd suggest removing the icon to reduce user confusion, since # Alerts represents all uncleared alerts for a given object.

Comment 12 Ju Lim 2016-07-19 15:48:57 UTC

This is also likely applicable to the other list views that show # Alerts, e.g. Pools List, RBD List.

Comment 13 Martin Bukatovic 2016-07-22 09:45:40 UTC

During execution of test case RHSC-265 (web/main_dashboard_page_check), I
noticed problems with some counters again. Since the original description of
this BZ was not written based on proper understanding of the design documents,
and the comments are discussing mostly design tweaks (comment 10, comment 11,
comment 12), I'm creating a new BZ 1359103 for this to prevent confusion under this
BZ. This way, proper triage and work management of the issue will be
possible.

Comment 15 Shubhendu Tripathi 2018-11-19 05:40:12 UTC

This product is EOL now
