Description of problem
======================

When there are some warning/critical events, the cluster alert counter box on
the main dashboard may show incorrect values, which conflict with the alert
counters on the cluster list page and in the event list.

I'm not 100% sure how to reproduce this issue. But since the dashboard related
features are considered a high priority now, I'm providing my current evidence
even though it's without a proper reproducer. This way, the QE team will know
that it's necessary to retest this with extra care, trying to find a
reproducer with future dev freeze builds.

Version-Release
===============

On RHSC 2.0 server machine:

rhscon-core-selinux-0.0.28-1.el7scon.noarch
rhscon-core-0.0.28-1.el7scon.x86_64
rhscon-ui-0.0.42-1.el7scon.noarch
rhscon-ceph-0.0.27-1.el7scon.x86_64
ceph-installer-1.0.12-3.el7scon.noarch
ceph-ansible-1.0.5-23.el7scon.noarch

How reproducible
================

I don't know.

Steps to Reproduce
==================

I'm not 100% sure.

1. Install RHSC 2.0 following the documentation.
2. Accept a few nodes for the ceph cluster.
3. Create a new ceph cluster named 'alpha'.
4. Create 2 RBDs (along with a new backing pool each time) in the cluster.
5. Break something so that you will have a few warning and/or critical events
   (this step needs to be specified better).
6. Check the Clusters overview (a box titled "1 Clusters") on the Main Dashboard.

note for step 5: I noticed this when I was reproducing BZ 1355723 and/or
BZ 1354603

Actual results
==============

I hit this issue 2 times (on 2 different clusters, using the same builds).

case one
--------

On the Main Dashboard, there is a box titled "1 Clusters", reporting:
"2 active alerts" next to the warning icon (pficon-warning-triangle-o), see
screenshot #1.

When I click on the link there (the value, number 2 itself), I get to the
Clusters list page (see screenshot #2), which is filtered by:

* alarmstatus: critical
* alarmstatus: major

In this list there is 1 cluster (in a warning state), which however reports
4 alerts next to the red error icon (pficon-error-circle-o).

When I click on the link there (the value, number 4 itself), I get to the
Events list page (see screenshot #3), which is filtered by:

* cluster: alpha
* severity: critical & warning
* status: active

In the list there, there are 4 events in total: 2 are warning, 2 are critical.

This means that the Main Dashboard alert status conflicts with the alert
status from the Cluster list, and with the Events list (which is filtered to
show active, critical and warning events only).

case two
--------

The same use case, but with different discrepancies.

Screenshot #4 shows the Main Dashboard with 1 critical and 4 warning alerts,
but in the Event list (screenshot #5), I see 4 critical events.

Expected results
================

Main dashboard cluster status data should be aligned with event/alert data
provided elsewhere in the console, such as:

* cluster item in the list of clusters
* Events list page linked from the cluster item (previous one)
Created attachment 1179348 screenshot 1: case one - main dashboard
Created attachment 1179349 screenshot 2: case one - cluster list
Created attachment 1179350 screenshot 3: case one - filtered event list
Created attachment 1179351 screenshot 4: case two - main dashboard
Created attachment 1179352 screenshot 5: case two - filtered event list
A clear reproducer is required for this bug.
This behavior is as per the design provided by:
https://docs.google.com/presentation/d/1E7ZHHMYufugMjuVceluP7FCfUM9CQNsN5QWmGruRth0/edit#slide=id.gc5e5a5c3c_0_12

To elaborate, the behavior is as follows:

1. In the main dashboard, the number beside "X" (critical icon) indicates the
   number of clusters whose status is error. Please refer to slides 15 & 16.

2. In the main dashboard, the number beside "!" (warning icon) indicates the
   number of major/critical (not warning and minor) alarms in all clusters.
   Please refer to slides 15 & 16.

3. In the cluster list view, the last column (alerts) shows the number of all
   the alerts in the cluster (here, any severity other than cleared (info)).
   Please refer to slide 21.

Please provide your thoughts.
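The three counting rules above can be sketched in Python as follows. This is only an illustration of the rules as described in this comment: the record layout, field names, severity labels and function names are all assumptions and are not taken from the RHSC code base. The sample data mirrors case one from the description (one cluster in a warning state, 2 critical and 2 warning events).

```python
# Hypothetical data model; field names and severity labels are assumptions.
CLUSTERS = [{"name": "alpha", "status": "warning"}]
ALERTS = [
    {"cluster": "alpha", "severity": "critical"},
    {"cluster": "alpha", "severity": "critical"},
    {"cluster": "alpha", "severity": "warning"},
    {"cluster": "alpha", "severity": "warning"},
]

# Assumed reading of "any severity other than cleared (info)".
NOT_COUNTED = {"cleared", "info"}


def dashboard_error_clusters(clusters):
    """Rule 1: number of clusters whose status is error."""
    return sum(1 for c in clusters if c["status"] == "error")


def dashboard_active_alerts(alerts):
    """Rule 2: major/critical alarms across all clusters."""
    return sum(1 for a in alerts if a["severity"] in {"major", "critical"})


def cluster_list_alerts(alerts, cluster_name):
    """Rule 3: all alerts of one cluster except cleared ones."""
    return sum(
        1
        for a in alerts
        if a["cluster"] == cluster_name and a["severity"] not in NOT_COUNTED
    )


print(dashboard_error_clusters(CLUSTERS))    # rule 1
print(dashboard_active_alerts(ALERTS))       # rule 2
print(cluster_list_alerts(ALERTS, "alpha"))  # rule 3
```

Under these assumptions, rule 2 gives 2 and rule 3 gives 4 for the same cluster, which would match the "2 active alerts" on the dashboard and the 4 alerts in the cluster list from case one.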
(In reply to Darshan from comment #7)
> This behavior is as per the Design provided by:
> https://docs.google.com/presentation/d/
> 1E7ZHHMYufugMjuVceluP7FCfUM9CQNsN5QWmGruRth0/edit#slide=id.gc5e5a5c3c_0_12

Good catch, thanks for pointing this out. I should have definitely checked
this before creating this BZ, as it turns out that the status on the
dashboard is invalid anyway, but in a different way compared to my original
description in the BZ (which was not based on the design document
description). This means that the QE team still needs to recheck with the
latest builds later anyway.

See my quick reply with details inline.

> To elaborate the behavior is as follows:
>
> 1. In main dashboard, the number beside "X" (critical icon) indicates the
> number of cluster whose status is error. Please refer slide 15 & 16

Ok, so the document states that there should be:

> count of clusters with Error or Warning status

not just Error as you just mentioned.

In screenshot #4 (of case two), I see this icon with number 1 next to it,
which is correct according to your description and the design document. So ok.

But in the first case, on screenshot #1, I don't see this Cluster Status
anywhere, even though the cluster is in a warning state (as can be seen on
screenshot #2).

So for this reason, I would consider the "Cluster Status" counter still broken.

> 2. In main dashboard, the number beside "!" (warning icon) indicates the
> number of major/critical (not warning and minor) alarms in all clusters.
> Please refer slide 15 & 16.

You are right. Even the tooltip text (visible when one hovers the cursor over
the alert counter) states "2 active alerts". So far, so good.

But I have a question: why do I get to the list of clusters when I click on
this alarm status counter? The design document which you refer to states that
I should get this:

> Filtered event view showing major and critical alerts across all clusters.

But I got the list of clusters instead. Which is why I was confused about the
meaning of this counter and ignored the meaning in the tooltip.

So based on this, the issue here is that the link of the alarm status counter
points to an incorrect page, which confuses the meaning of the counter even
though the counter itself (and its tooltip) report correct data.

> 3. In cluster list view, the last column (alerts) shows the number of all
> the alerts in the cluster(here any severity other than cleared(info)).
> Please refer slide 21.

I'm not sure I understand this slide right, as I see both types of icons in
the example there:

* 1st cluster reports 5 alerts next to a warning icon
* 4th cluster reports 5 alerts next to an error icon
* 6th cluster reports 5 alerts without using any icon at all

Moreover, I don't think that using an icon for both types on one page, and for
one particular type only on another page, is a good idea.

So it seems that we need to ask the design team to check my concerns here.
Ju and Matt, could you check my concern about the alert counter on the cluster list page, as described in the last part of comment 8?
In reviewing this bug, I see 2 things raised:

(1) Why does the drilldown from the dashboard go to the cluster list vs. an
event list? The implementation is as agreed upon (i.e. going to the cluster
list). This was a decision we made based on some bug in the past, and as we
put our "user" hat on, the rationale was that the user would want to see the
related alerts, but would then still have to look at issues on a cluster by
cluster basis. Hence we ended up with drilling down to the cluster list.

(2) The alert indicator/count of the cluster is misleading, as it aggregates
all the critical and error alerts together. Part of the problem is the icon:
we're overloading it to mean a single severity level (in the Dashboard and
other places), but in the list view it shows an aggregation of multiple
severity levels. To fix: we either provide a new icon to cover the roll-up or
aggregation, OR limit the alert indicator to show only the most severe
severity level (but then it potentially leaves off other severity levels,
which is not ideal).
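To illustrate the trade-off between the two fix options in (2), here is a minimal Python sketch. The severity names, their ranking, the "rollup" label and both function names are assumptions made for this example only; they do not come from the design document or the product.

```python
# Assumed severity ranking, from least to most severe.
SEVERITY_ORDER = ["minor", "warning", "major", "critical"]


def rollup_indicator(severities):
    """Option A: one aggregate count shown with a dedicated roll-up icon."""
    counted = [s for s in severities if s in SEVERITY_ORDER]
    return ("rollup", len(counted))


def most_severe_indicator(severities):
    """Option B: show only the most severe level present and its count."""
    counted = [s for s in severities if s in SEVERITY_ORDER]
    if not counted:
        return None
    top = max(counted, key=SEVERITY_ORDER.index)
    return (top, counted.count(top))


# Hypothetical cluster with 2 critical and 2 warning alerts.
alerts = ["critical", "critical", "warning", "warning"]
print(rollup_indicator(alerts))
print(most_severe_indicator(alerts))
```

With this sample input, option A reports all 4 alerts under a severity-neutral label, while option B reports only the 2 critical alerts and leaves the 2 warnings out of the indicator, which is the drawback noted above.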
Regarding (2) above, i.e. the icon accompanying # Alerts on the Cluster List (and Host List) pages: I'd suggest removing the icon to reduce user confusion, since # Alerts represents all uncleared alerts for a given object.
This is also likely applicable to the other list views that show # Alerts, e.g. Pools List, RBD List.
During execution of test case RHSC-265 (web/main_dashboard_page_check), I noticed problems with some counters again. Since the original description of this BZ was not based on a proper understanding of the design documents, and the comments mostly discuss design tweaks (comment 10, comment 11, comment 12), I'm creating a new BZ 1359103 for this to prevent confusion under this BZ. This way, proper triage and work management of the issue will be possible.
This product is EOL now