Bug 1360889 - significant number of utilization charts shown "no data available" placeholder across all dashboards
Keywords:
Status: CLOSED ERRATA
Alias: None
Product: Red Hat Storage Console
Classification: Red Hat
Component: core
Version: 2
Hardware: Unspecified
OS: Unspecified
Priority: unspecified
Severity: high
Target Milestone: ---
Target Release: 2
Assignee: anmol babu
QA Contact: Martin Bukatovic
URL:
Whiteboard:
Duplicates: 1361130
Depends On:
Blocks: 1346302 1351592 Console-2-GA
 
Reported: 2016-07-27 17:41 UTC by Martin Bukatovic
Modified: 2016-08-23 19:58 UTC (History)

Fixed In Version: rhscon-core-0.0.39-1.el7scon.x86_64.rpm
Doc Type: If docs needed, set a value
Doc Text:
Clone Of:
Environment:
Last Closed: 2016-08-23 19:58:15 UTC
Target Upstream Version:


Attachments
screenshot 1 (16.58 KB, image/png)
2016-07-27 17:42 UTC, Martin Bukatovic
screenshot 2 (14.84 KB, image/png)
2016-07-27 17:42 UTC, Martin Bukatovic
screenshot 3 (86.43 KB, image/png)
2016-07-27 17:42 UTC, Martin Bukatovic
screenshot 4 (32.60 KB, image/png)
2016-07-27 17:43 UTC, Martin Bukatovic
screenshot 5 (80.13 KB, image/png)
2016-07-27 17:43 UTC, Martin Bukatovic
screenshot 6 (26.63 KB, image/png)
2016-07-27 17:43 UTC, Martin Bukatovic
screenshot 7 (232.72 KB, image/png)
2016-07-27 17:44 UTC, Martin Bukatovic
screenshot 8 (18.40 KB, image/png)
2016-07-27 17:44 UTC, Martin Bukatovic
screenshot 9 (92.79 KB, image/png)
2016-07-27 17:44 UTC, Martin Bukatovic
screenshot 10: cpu utilization of particular host stuck in high values while no longer true (39.25 KB, image/png)
2016-07-29 14:36 UTC, Martin Bukatovic
screenshot 11: additional evidence for comment 11 (cpu utilization stuck in high values while actual data are too low again) (84.85 KB, image/png)
2016-07-29 14:42 UTC, Martin Bukatovic
qe verification evidence: screenshot of host dashboard after the test (163.01 KB, image/png)
2016-08-09 17:12 UTC, Martin Bukatovic


Links
System ID Priority Status Summary Last Updated
Gerrithub.io 285704 None None None 2016-08-04 05:11:24 UTC
Red Hat Bugzilla 1357768 None None None Never
Red Hat Bugzilla 1365578 None None None Never
Red Hat Product Errata RHEA-2016:1754 normal SHIPPED_LIVE New packages: Red Hat Storage Console 2.0 2017-04-18 19:09:06 UTC

Internal Links: 1357768 1365578

Description Martin Bukatovic 2016-07-27 17:41:28 UTC
Description of problem
======================

When a ceph cluster is created, I see lots of "No data available" placeholders
across dashboard pages (Main Dashboard, Cluster dashboard, host dashboards) and
listing pages (eg. Host list), while I can see that the data are
available in the graphite db.

The problem is likely caused by reporting small values as no data by mistake
(see details below in this bug report).

Because of this, about half of the charts on a dashboard may not be displayed
for a first time user after the cluster is created (so it's immediately
noticeable).

Version-Release
===============

On RHSC 2.0 server machine:

rhscon-core-0.0.36-1.el7scon.x86_64
rhscon-ui-0.0.50-1.el7scon.noarch
rhscon-ceph-0.0.36-1.el7scon.x86_64
rhscon-core-selinux-0.0.36-1.el7scon.noarch
ceph-installer-1.0.14-1.el7scon.noarch
ceph-ansible-1.0.5-31.el7scon.noarch

On Ceph storage machines:

rhscon-agent-0.0.16-1.el7scon.noarch
rhscon-core-selinux-0.0.36-1.el7scon.noarch
ceph-base-10.2.2-27.el7cp.x86_64
ceph-selinux-10.2.2-27.el7cp.x86_64
ceph-common-10.2.2-27.el7cp.x86_64
calamari-server-1.4.7-1.el7cp.x86_64

How reproducible
================

100 %

Steps to Reproduce
==================

To make the reproducer simple, we focus on cpu utilization, but it applies
to any value which is low enough.

So make sure you don't do anything else with the cluster which would make the
cpu utilization go up - it helps a lot with reproducing the issue.

1. Install RHSC 2.0 following the documentation.
2. Accept a few nodes for the ceph cluster.
3. Create new ceph cluster named 'alpha'.
4. Check the following pages:

 * main dashboard
 * cluster dashboard
 * list of hosts
 * hosts dashboard (for all machines)

5. At this point, check the 1st part of the *Actual results* section below.
   Continue only after going through all *After step 5* checks.

6. Run the following command on all ceph machines:

   stress --cpu ${N} --vm 1

   where N is the number of cpus of the particular machine.
7. Wait about 15 to 30 minutes for collectd and RHSC 2.0 to process and report
   the new utilization state.
8. Check cpu utilization reported by RHSC 2.0 (on all dashboards as listed
   in step 4).
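
For step 6, a minimal sketch of building the `stress` invocation with N taken
from the local cpu count. The helper name `stress_command` is made up for
illustration and is not part of RHSC or of stress itself; only the `--cpu` and
`--vm` options shown in step 6 are used.

```python
import os


def stress_command(cpu_count=None):
    """Build the argument list for `stress --cpu N --vm 1` (hypothetical helper)."""
    # os.cpu_count() may return None when the count cannot be determined,
    # so fall back to a single worker in that case.
    n = cpu_count if cpu_count is not None else (os.cpu_count() or 1)
    return ["stress", "--cpu", str(n), "--vm", "1"]


# Print the command line for this machine; it could be handed to
# subprocess.run() on a host where stress is installed.
print(" ".join(stress_command()))
```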

Actual results
==============

There are "No data available" placeholders for a lot of (sometimes most) charts
on any given page.

See screenshot 5 for example: on the host dashboard, I see 6 no data available
out of 13 charts in total (we have lost almost half of the charts there ...).

Or on screenshot 6 - System Performance widget of main dashboard - I see that
4 out of 7 charts report no data.

After step 5
------------

At first sight it seems random, but in my case I see an interesting pattern
here. Let's compare cpu and memory utilization charts:

 * *memory utilization* values are reported properly on every chart
 * *cpu utilization* is not shown almost anywhere

See:

 * screenshot 1: main dashboard (cpu charts report no data, memory charts
   are shown)
 * screenshot 2: cluster dashboard (cpu sparkline shown, but cpu donut reports
   no data - while both memory charts are shown properly)
 * screenshot 3: host list page, all cpu donut charts (one for each host) show
   no data, while all memory ones show data properly
 * screenshot 4: host dashboard, cpu donut reports no data, while cpu sparkline
   is shown - and again, no problem with memory charts
   (this is consistent across all hosts with a single exception, where the
   memory donut is not shown as well)

But when I check the graphite web interface, I can see that the cpu utilization
data are available for all machines - see screenshot 7.
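
The same check can be scripted instead of done by eye. Below is a minimal
sketch, assuming the reply has the shape of Graphite's render API output with
`format=json` (a list of series, each with `[value, timestamp]` datapoints).
The target name and the numbers are made up for illustration; the point is
that only a null value means "no data", while 0.0 is a real reading.

```python
import json


def latest_datapoint(series):
    """Return (value, timestamp) of the newest non-null datapoint, or None."""
    for value, timestamp in reversed(series["datapoints"]):
        if value is not None:  # 0.0 is real data, only null means "no data"
            return value, timestamp
    return None


# Sample shaped like a Graphite render reply (format=json).
# The target name and all values are hypothetical.
SAMPLE = json.loads("""
[{"target": "collectd.host1.cpu.percent-user",
  "datapoints": [[0.4, 1469640000], [0.1, 1469640060],
                 [0.0, 1469640120], [null, 1469640180]]}]
""")

for series in SAMPLE:
    print(series["target"], latest_datapoint(series))
```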

After step 6
------------

When I recheck the System Performance widget on cluster dashboard, I can now
finally see some data there: see screenshot 8. The same applies to other places
affected by this issue - eg. list of hosts - screenshot 9.

Expected results
================

There should be no *No data available* placeholders when the data are actually
available in the graphite db.

This is especially important for a first time use case, when significant number
of values reported by console would be relatively small.

Additional info
===============

There are tons of errors which may be related to data fetching in various logs.
I'm going to add a link to the full tarball and point out some later in the
comments.

Comment 1 Martin Bukatovic 2016-07-27 17:42:10 UTC
Created attachment 1184729 [details]
screenshot 1

Comment 2 Martin Bukatovic 2016-07-27 17:42:28 UTC
Created attachment 1184730 [details]
screenshot 2

Comment 3 Martin Bukatovic 2016-07-27 17:42:48 UTC
Created attachment 1184731 [details]
screenshot 3

Comment 4 Martin Bukatovic 2016-07-27 17:43:17 UTC
Created attachment 1184732 [details]
screenshot 4

Comment 5 Martin Bukatovic 2016-07-27 17:43:39 UTC
Created attachment 1184733 [details]
screenshot 5

Comment 6 Martin Bukatovic 2016-07-27 17:43:58 UTC
Created attachment 1184734 [details]
screenshot 6

Comment 7 Martin Bukatovic 2016-07-27 17:44:21 UTC
Created attachment 1184735 [details]
screenshot 7

Comment 8 Martin Bukatovic 2016-07-27 17:44:41 UTC
Created attachment 1184736 [details]
screenshot 8

Comment 9 Martin Bukatovic 2016-07-27 17:44:59 UTC
Created attachment 1184737 [details]
screenshot 9

Comment 10 anmol babu 2016-07-28 05:22:31 UTC
Martin, I acknowledge the absence of cpu donut graph as a valid bug given the time series data is appearing. But regarding, the n/w utilization graph being shown as no data available is valid if that particular node is a vm bcoz the command ethtool that we use to get b/w provides b/w only for physical nics and this is expected. Regarding swap utilization being shown as no data, we have observed this only when swap is not available on the machine. Please ensure to try configuring swap space and still if the graph is not available then its a bug. And storage being no data available for mon nodes is also expected bcoz for us storage contribution of a node is only from its osds. Apart from cpu, I see iops and latency not appearing in some cases which is also a bug.

Comment 11 Nishanth Thomas 2016-07-28 12:16:47 UTC
*** Bug 1361130 has been marked as a duplicate of this bug. ***

Comment 12 Martin Bukatovic 2016-07-29 14:36:21 UTC
Created attachment 1185576 [details]
screenshot 10: cpu utilization of particular host stuck in high values while no longer true

I noticed another consequence of this issue:

When the cpu utilization goes back down to values which are recognized as
no data, cpu charts across all dashboards are stuck reporting the high cpu
usage, which is no longer true.

See for example screenshot 10, the cpu donut chart shows 68% cpu utilization,
while the sparkline chart below shows 0.1% and manual check (via top) reports
values from 0.0 to 0.4% user cpu utilization.
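
One possible model of this behaviour, as a sketch only: if a reading that is
wrongly classified as missing is simply skipped, the chart keeps showing the
last value it accepted. Everything here (`misread`, `DonutChart`, the 1.0
threshold) is hypothetical and not taken from the RHSC code; it only shows
that a single misclassification step could explain both the initial
placeholder and the stuck high value.

```python
def misread(sample, threshold=1.0):
    """Stand-in for the fault: values below a made-up threshold come back as None."""
    return None if sample < threshold else sample


class DonutChart:
    """Hypothetical holder of the number a donut chart displays."""

    def __init__(self):
        self.displayed = None  # None would render as "No data available"

    def update(self, sample):
        value = misread(sample)
        # Assumed behaviour: a reading classified as missing is skipped,
        # so the previously displayed number stays on screen.
        if value is not None:
            self.displayed = value


chart = DonutChart()
history = []
for sample in [0.1, 68.0, 0.1]:  # idle, under stress, idle again
    chart.update(sample)
    history.append(chart.displayed)

print(history)
```

In this model the first idle reading leaves the placeholder in place, and the
last idle reading leaves the chart at the value seen under stress.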

Comment 13 Martin Bukatovic 2016-07-29 14:37:42 UTC
Increasing severity based on comment 12 and the PM defined priorities for GA.

Comment 14 Martin Bukatovic 2016-07-29 14:42:52 UTC
Created attachment 1185581 [details]
screenshot 11: additional evidence for comment 11 (cpu utilization stuck in high values while actual data are too low again)

Attaching a screenshot of the host list. Note that all but one machine are
affected here, which means that this will invalidate aggregated cpu charts for
particular clusters and for all clusters combined.

Comment 15 Martin Bukatovic 2016-07-29 16:25:33 UTC
Thank you for checking the issue here. It seems to be a subtle bug with big
consequences.

(In reply to anmol babu from comment #10)
> Martin, I acknowledge the absence of cpu donut graph as a valid bug given
> the time series data is appearing.

Ok.

> But regarding, the n/w utilization graph
> being shown as no data available is valid if that particular node is a vm
> bcoz the command ethtool that we use to get b/w provides b/w only for
> physical nics and this is expected.

On my cluster (libvirt virtual machines), I see that there are:

* on main dashboard: Throughput sparkline, but not Latency
* on cluster dashboard: Throughput sparkline, but not Latency
* on host dashboard: Throughput and Latency sparklines
  (but note that Latency is 0.2 ms there, so it's a small value)

Could you confirm that this is expected? If there should be no network data
for virtual machines, why do I see Throughput on every dashboard and
Latency on Host dashboards only?

If your hypothesis were true, I should see a consistent behaviour across all
charts and dashboards, which is not the case.

Given the fact that the Throughput data are not small (eg. 19616.0 KB/s)
and that the Latency values are small (0.2 ms), I would suspect that the same
issue as for cpu applies here as well: since Throughput values are not small,
these charts are not affected, while the Latency charts, having small values,
are more likely to be mislabeled as no data - which matches the observations here.

> Regarding swap utilization being shown
> as no data, we have observed this only when swap is not available on the
> machine. Please ensure to try configuring swap space and still if the graph
> is not available then its a bug.

Ok, let's check this in detail.

~~~
dhcp-126-85.lab.eng.brq.redhat.com
Swap:          2047           0        2047

dhcp-126-83.lab.eng.brq.redhat.com
Swap:             0           0           0

dhcp-126-79.lab.eng.brq.redhat.com
Swap:          2047           5        2042

dhcp-126-84.lab.eng.brq.redhat.com
Swap:          2047           0        2047

dhcp-126-80.lab.eng.brq.redhat.com
Swap:          2047           0        2047

dhcp-126-82.lab.eng.brq.redhat.com
Swap:             0           0           0

dhcp-126-78.lab.eng.brq.redhat.com
Swap:          2047         141        1906

dhcp-126-81.lab.eng.brq.redhat.com
Swap:          2047           0        2047
~~~

So for some reason, I have swap disabled on 2 machines:

* dhcp-126-83.lab.eng.brq.redhat.com
* dhcp-126-82.lab.eng.brq.redhat.com

and dashboard for these hosts shows no data for swap (both donut and sparkline
charts are replaced with "No data available" placeholder icon) - so good.

But for the other machines, I see this (I list only those which are part of the
cluster, excluding dhcp-126-83 which has already been discussed):

* dhcp-126-79.lab.eng.brq.redhat.com - donut shows no data, sparkline shows low
  values (such as 0.2%)
* dhcp-126-84.lab.eng.brq.redhat.com - donut shows no data, sparkline shows
  zeroes (0.0%)
* dhcp-126-80.lab.eng.brq.redhat.com - donut shows no data, sparkline shows
  zeroes (0.0%)
* dhcp-126-81.lab.eng.brq.redhat.com - donut shows no data, sparkline shows
  zeroes (0.0%)

Based on this check, I would assume that the swap charts have the same problem
as the cpu charts - when the values are too low, no data placeholder is shown
even though there are actual data available. And in the same way as cpu charts,
donut chart is more likely to be affected than the sparkline.
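
The sparkline values can be cross-checked against the swap listing above. A
minimal sketch, assuming the lines come from `free -m` (so the columns are
total, used and free in MB); the function name is made up for illustration.

```python
def swap_percent_used(free_line):
    """Parse a `Swap: total used free` line; return None when swap is disabled."""
    label, total, used, _free = free_line.split()
    if label != "Swap:":
        raise ValueError("not a Swap line: %r" % free_line)
    total, used = int(total), int(used)
    if total == 0:
        return None  # no swap configured: "no data" is the correct answer here
    return round(100.0 * used / total, 1)


# Lines copied from the listing above.
print(swap_percent_used("Swap:          2047           5        2042"))  # dhcp-126-79
print(swap_percent_used("Swap:          2047           0        2047"))  # dhcp-126-84
print(swap_percent_used("Swap:             0           0           0"))  # dhcp-126-83
```

For dhcp-126-79 this gives 0.2, which agrees with the 0.2% shown on its
sparkline, and for the hosts with unused swap it gives 0.0 rather than a
missing value.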

> And storage being no data available for mon
> nodes is also expected bcoz for us storage contribution of a node is only
> from its osds.

You are right, there are no data reported for MON machines and that's ok.
While the OSD machines have proper data shown on the host dashboards.

But if my hypothesis that the issue is caused by low values reported in the
chart is true, these charts are not very likely to be affected as I have some
data there loaded already (values about 20%).

> Apart from cpu, I see iops and latency not appearing in some
> cases which is also a bug

I agree.

But here I'm a bit confused as well. Why do you talk about Latency here? I
thought that it is a network chart, which we already discussed - see the
beginning of this comment.

Summary
=======

To sum it up, while in some cases no data placeholders are correct (eg. storage
on mon machines, swap on machines with swap deactivated), all the other cases
of other charts are very likely affected by the same issue as the cpu charts.

Do you agree with my interpretation?

Comment 16 anmol babu 2016-07-30 16:04:35 UTC
Martin, the bug was due to a slight change in the format of graphite responses (unexpected spaces being appended), and my fix now handles these cases.
This used to happen in certain cases when the stats have fewer digits. So, with my fix I expect cpu utilization, swap, iops, ping and all other stats to work, given the following:

1. Network utilization on the host dashboard (not n/w throughput) will be available
   only if physical nics are present, and hence it is expected to be no data
   available on vms.
2. The stats on the main and cluster specific dashboards are currently
   synced/calculated once every 5 mins.
3. Stats like memory, swap and network utilization will be synced to the console if
   and only if used, total and percent-used are available.
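
To illustrate the kind of failure described above, here is a sketch that
contrasts a strict numeric check with a whitespace-tolerant one. It is not the
actual fix (which is not shown in this bug); the function names, the sample
tokens and the use of the literal `None` as a missing-value marker are
assumptions made for the example.

```python
import re

# Hypothetical strict pattern: digits with an optional fraction, nothing else.
_STRICT = re.compile(r"\d+(\.\d+)?")


def parse_strict(token):
    """Strict reader: any stray whitespace makes the value look like no data."""
    return float(token) if _STRICT.fullmatch(token) else None


def parse_tolerant(token):
    """Strip whitespace first; only an empty token or 'None' counts as missing."""
    token = token.strip()
    if token in ("", "None"):
        return None
    return float(token)


# Short numbers padded with spaces, as might happen when a value has fewer digits.
tokens = ["19616.0", " 0.1", "0.0 ", "None"]
print([parse_strict(t) for t in tokens])
print([parse_tolerant(t) for t in tokens])
```

With the strict reader only the long value survives, which mirrors the
observation that large Throughput values were displayed while small cpu, swap
and latency values were not.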

Comment 17 Martin Bukatovic 2016-08-09 15:13:43 UTC
I consider the issue of missing Network Utilization charts for virtual machines
as a separate problem (BZ 1365578) based on anmol's comment 16 - and as such I
would ignore it during validation of this BZ.

Comment 18 Martin Bukatovic 2016-08-09 17:09:39 UTC
This is a comment with QE verification details.

Version-Release
===============

On RHSC 2.0 server machine:

ceph-installer-1.0.14-1.el7scon.noarch
ceph-ansible-1.0.5-32.el7scon.noarch
rhscon-ui-0.0.52-1.el7scon.noarch
rhscon-core-selinux-0.0.41-1.el7scon.noarch
rhscon-ceph-0.0.40-1.el7scon.x86_64
rhscon-core-0.0.41-1.el7scon.x86_64

On Ceph 2.0 machines:

ceph-common-10.2.2-36.el7cp.x86_64
ceph-selinux-10.2.2-36.el7cp.x86_64
rhscon-core-selinux-0.0.41-1.el7scon.noarch
rhscon-agent-0.0.18-1.el7scon.noarch

Verification
============

When I created a cluster, I no longer see lots of "No data available"
placeholders across dashboard pages as before:

* there is no such placeholder on main dashboard
* there is no such placeholder on cluster dashboard
* host list page shows "0.0%" when current values are too low
* on host dashboards, all "No data placeholders" are valid:
   - "no data" reported for Storage on monitor machines
   - "no data" reported for Swap on OSD machines (BZ 1364167)
  or known issues:
   - "no data" reported for Network Utilization charts - see BZ 1365578
      (I consider this to be a known issue as described in comment 17)

Moreover, I can clearly see data reported properly for small values, eg. on
the Swap donut chart and related sparkline, I see "176.0 KB of 2.0 GB".

After running `stress --cpu` as described in the BZ, I see the values go
up and then go down again some time after the stress is no longer running
(it could be argued that the sheer time needed for this is too long - longer
than the time needed to notice the initial increase - but that's not a
concern of this BZ).

>> VERIFIED

Comment 19 Martin Bukatovic 2016-08-09 17:12:53 UTC
Created attachment 1189361 [details]
qe verification evidence: screenshot of host dashboard after the test

Comment 21 errata-xmlrpc 2016-08-23 19:58:15 UTC
Since the problem described in this bug report should be
resolved in a recent advisory, it has been closed with a
resolution of ERRATA.

For information on the advisory, and where to find the updated
files, follow the link below.

If the solution does not work for you, open a new bug report.

https://access.redhat.com/errata/RHEA-2016:1754

