Description of problem
======================
When a ceph cluster is created, I see lots of "No data available" placeholders across dashboard pages (main dashboard, cluster dashboard, host dashboards) and listing pages (e.g. host list), while I can see that the data are available in the graphite db. The problem is likely caused by small values being reported as no data by mistake (see details below in this bug report). Because of this, about half of the charts on a dashboard may not be displayed for a first time user right after the cluster is created, so it's immediately noticeable.

Version-Release
===============

On RHSC 2.0 server machine:

rhscon-core-0.0.36-1.el7scon.x86_64
rhscon-ui-0.0.50-1.el7scon.noarch
rhscon-ceph-0.0.36-1.el7scon.x86_64
rhscon-core-selinux-0.0.36-1.el7scon.noarch
ceph-installer-1.0.14-1.el7scon.noarch
ceph-ansible-1.0.5-31.el7scon.noarch

On Ceph storage machines:

rhscon-agent-0.0.16-1.el7scon.noarch
rhscon-core-selinux-0.0.36-1.el7scon.noarch
ceph-base-10.2.2-27.el7cp.x86_64
ceph-selinux-10.2.2-27.el7cp.x86_64
ceph-common-10.2.2-27.el7cp.x86_64
calamari-server-1.4.7-1.el7cp.x86_64

How reproducible
================
100 %

Steps to Reproduce
==================
To keep the reproducer simple, we focus on cpu utilization, but the problem applies to any value which is low enough. So make sure you don't do anything else with the cluster which would drive the cpu utilization up - it helps a lot with reproducing the issue.

1. Install RHSC 2.0 following the documentation.
2. Accept a few nodes for the ceph cluster.
3. Create a new ceph cluster named 'alpha'.
4. Check the following pages:
   * main dashboard
   * cluster dashboard
   * list of hosts
   * host dashboard (for all machines)
5. At this point, check the 1st part of the *Actual results* section below. Continue only after going through all the *After step 5* checks.
6. Run the following command on all ceph machines:

   stress --cpu ${N} --vm 1

   where N is the number of cpus of the particular machine.
7. Wait about 15 or 30 minutes for collectd and RHSC 2.0 to process and report the new utilization state.
8. Check cpu utilization reported by RHSC 2.0 (on all dashboards as listed in step 4).

Actual results
==============
There are "No data available" placeholders for a lot of (sometimes most) charts on any given page. See screenshot 5 for an example: on the host dashboard, I see 6 "no data available" placeholders out of 13 charts in total (we have lost almost half of the charts there ...). Or see screenshot 6 - the System Performance widget of the main dashboard - where 4 out of 7 charts report no data.

After step 5
------------
At first sight it seems random, but in my case I see an interesting pattern. Let's compare the cpu and memory utilization charts:

* *memory utilization* values are reported properly on every chart
* *cpu utilization* is not shown almost anywhere

See:

* screenshot 1: main dashboard (cpu charts report no data, memory charts are shown)
* screenshot 2: cluster dashboard (cpu sparkline shown, but cpu donut reports no data - while both memory charts are shown properly)
* screenshot 3: host list page, all cpu donut charts (one for each host) show no data, while all memory ones show data properly
* screenshot 4: host dashboard, cpu donut reports no data, while cpu sparkline is shown - and again, no problem with the memory charts (this is consistent across all hosts with a single exception, where the memory donut is not shown either)

But when I check the graphite web interface, I can see that the cpu utilization data are available for all machines - see screenshot 7.

After step 6
------------
When I recheck the System Performance widget on the cluster dashboard, I can now finally see some data there: see screenshot 8. The same applies to the other places affected by this issue - e.g. the list of hosts - screenshot 9.

Expected results
================
There should be no *No data available* placeholders when the data are actually available in the graphite db. This is especially important for the first time use case, when a significant number of the values reported by the console will be relatively small.

Additional info
===============
There are lots of errors which may be related to data fetching in various logs. I'm going to add a link to the full tarball and point out some of them later in the comments.
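The "data are available in graphite" check above can also be done without the web UI, by querying the standard Graphite render API with `format=json` and looking for non-null datapoints. A minimal sketch; the server URL, metric path, and the canned response below are illustrative assumptions, not values taken from this cluster:

```python
import json
from urllib.parse import urlencode

def render_url(base, target, frm="-30min"):
    """Build a Graphite render API URL returning JSON datapoints."""
    return base + "/render?" + urlencode({"target": target, "from": frm, "format": "json"})

def has_data(render_json_text):
    """Return True if any series in a render API JSON response contains
    at least one non-null datapoint (even a very small one)."""
    series_list = json.loads(render_json_text)
    return any(
        value is not None
        for series in series_list
        for value, _timestamp in series["datapoints"]
    )

# Canned response for illustration (hypothetical host/metric names):
sample = json.dumps([{
    "target": "dhcp-126-79_example_com.cpu.percent-user",
    "datapoints": [[0.1, 1469700000], [0.0, 1469700060], [None, 1469700120]],
}])
print(has_data(sample))  # small values such as 0.1 or 0.0 still count as data
```

If `has_data` is true for a metric while the console shows "No data available", the fault is on the console side, which is exactly the situation described in this report.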
Created attachment 1184729 [details] screenshot 1
Created attachment 1184730 [details] screenshot 2
Created attachment 1184731 [details] screenshot 3
Created attachment 1184732 [details] screenshot 4
Created attachment 1184733 [details] screenshot 5
Created attachment 1184734 [details] screenshot 6
Created attachment 1184735 [details] screenshot 7
Created attachment 1184736 [details] screenshot 8
Created attachment 1184737 [details] screenshot 9
Martin, I acknowledge the absence of the cpu donut graph as a valid bug, given that the time series data is appearing. But the n/w utilization graph showing "no data available" is valid if that particular node is a vm, because the ethtool command that we use to get bandwidth provides it only for physical nics, and this is expected. Regarding swap utilization being shown as no data, we have observed this only when swap is not available on the machine. Please try configuring swap space; if the graph is still not available, then it's a bug. And storage being "no data available" for mon nodes is also expected, because for us the storage contribution of a node comes only from its osds. Apart from cpu, I see iops and latency not appearing in some cases, which is also a bug.
*** Bug 1361130 has been marked as a duplicate of this bug. ***
Created attachment 1185576 [details] screenshot 10: cpu utilization of a particular host stuck at high values which are no longer true

I noticed another consequence of this issue: when the cpu utilization goes back down to values which are treated as no data, the cpu charts across all dashboards get stuck reporting the previous high cpu usage, which is no longer true. See for example screenshot 10: the cpu donut chart shows 68% cpu utilization, while the sparkline chart below shows 0.1% and a manual check (via top) reports values from 0.0 to 0.4% user cpu utilization.
Increasing severity based on comment 12 and the PM defined priorities for GA.
Created attachment 1185581 [details] screenshot 11: additional evidence for comment 11 (cpu utilization stuck at high values while the actual data are too low again)

Attaching a screenshot of the host list; note that all but one machine are affected here, which means this will invalidate the aggregated cpu charts for particular clusters and for all clusters combined.
Thank you for checking the issue here. It seems to be a subtle bug with big consequences.

(In reply to anmol babu from comment #10)
> Martin, I acknowledge the absence of cpu donut graph as a valid bug given
> the time series data is appearing.

Ok.

> But regarding, the n/w utilization graph
> being shown as no data available is valid if that particular node is a vm
> bcoz the command ethtool that we use to get b/w provides b/w only for
> physical nics and this is expected.

On my cluster (libvirt virtual machines), I see:

* on the main dashboard: Throughput sparkline, but no Latency
* on the cluster dashboard: Throughput sparkline, but no Latency
* on host dashboards: both Throughput and Latency sparklines (but note that Latency is 0.2 ms there, so it's a small value)

Could you confirm that this is expected? If there should be no network data for virtual machines, why do I see Throughput on every dashboard and Latency on host dashboards only? If your hypothesis were true, I should see consistent behaviour across all charts and dashboards, which is not the case. Given that the Throughput values are not small (e.g. 19616.0 KB/s) while the Latency values are small (0.2 ms), I suspect the same issue as for cpu applies here as well: since the Throughput values are not small, those charts are not affected, while the Latency charts, having small values, are more likely to be mislabeled as no data - which matches the observations here.

> Regarding swap utilization being shown
> as no data, we have observed this only when swap is not available on the
> machine. Please ensure to try configuring swap space and still if the graph
> is not available then its a bug.

Ok, let's check this in detail.
~~~
dhcp-126-85.lab.eng.brq.redhat.com  Swap:  2047    0  2047
dhcp-126-83.lab.eng.brq.redhat.com  Swap:     0    0     0
dhcp-126-79.lab.eng.brq.redhat.com  Swap:  2047    5  2042
dhcp-126-84.lab.eng.brq.redhat.com  Swap:  2047    0  2047
dhcp-126-80.lab.eng.brq.redhat.com  Swap:  2047    0  2047
dhcp-126-82.lab.eng.brq.redhat.com  Swap:     0    0     0
dhcp-126-78.lab.eng.brq.redhat.com  Swap:  2047  141  1906
dhcp-126-81.lab.eng.brq.redhat.com  Swap:  2047    0  2047
~~~

So for some reason, I have swap disabled on 2 machines:

* dhcp-126-83.lab.eng.brq.redhat.com
* dhcp-126-82.lab.eng.brq.redhat.com

and the dashboards for these hosts show no data for swap (both the donut and the sparkline chart are replaced with the "No data available" placeholder icon) - so far so good.

But for the other machines, I see this (I list only those which are part of the cluster, excluding dhcp-126-83 which has already been discussed):

* dhcp-126-79.lab.eng.brq.redhat.com - donut shows no data, sparkline shows low values (such as 0.2%)
* dhcp-126-84.lab.eng.brq.redhat.com - donut shows no data, sparkline shows zeroes (0.0%)
* dhcp-126-80.lab.eng.brq.redhat.com - donut shows no data, sparkline shows zeroes (0.0%)
* dhcp-126-81.lab.eng.brq.redhat.com - donut shows no data, sparkline shows zeroes (0.0%)

Based on this check, I would assume that the swap charts have the same problem as the cpu charts: when the values are too low, the no data placeholder is shown even though actual data are available. And just as with the cpu charts, the donut chart is more likely to be affected than the sparkline.

> And storage being no data available for mon
> nodes is also expected bcoz for us storage contribution of a node is only
> from its osds.

You are right, there are no data reported for MON machines and that's ok, while the OSD machines have proper data shown on the host dashboards.
But if my hypothesis is true that the issue is caused by low values reported in the chart, these charts are not very likely to be affected, as I already have some data loaded there (values around 20%).

> Apart from cpu, I see iops and latency not appearing in some
> cases which is also a bug

I agree. But here I'm a bit confused as well. Why do you mention Latency here? I thought that is a network chart, which we already discussed - see the beginning of this comment.

Summary
=======
To sum it up: while in some cases the no data placeholders are correct (e.g. storage on mon machines, swap on machines with swap deactivated), all the other cases are very likely affected by the same issue as the cpu charts. Do you agree with my interpretation?
Martin, the bug was due to a slight change in the format of the graphite responses (unexpected spaces being appended), and my fix now handles these cases. This used to happen in certain cases when there were fewer digits in the stats. So, with my fix I expect cpu utilization, swap, iops, ping and all the other stats to work, given the following:

1. Network utilization on the host dashboard (not n/w throughput) will be available only if physical nics are present, and hence "no data available" is expected on vms.
2. The stats on the main and cluster specific dashboards are currently synced/calculated once every 5 mins.
3. Stats like memory, swap and network utilization will be synced to the console if and only if used, total and percent-used are available.
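The failure mode described above (stats with fewer digits coming back padded with spaces and then being misread as no data) can be illustrated with a small sketch. This is not the actual fix, just a demonstration of the technique, assuming Graphite's raw response format ("target,start,end,step|v1,v2,..."); the metric name and values below are hypothetical:

```python
def parse_raw_datapoints(line):
    """Parse a Graphite raw-format response line
    ("target,start,end,step|v1,v2,...") into floats,
    tolerating stray whitespace around each value."""
    _header, _, values = line.partition("|")
    datapoints = []
    for token in values.split(","):
        token = token.strip()          # the crucial step: drop padding spaces
        if token in ("", "None"):
            datapoints.append(None)    # genuinely missing sample
        else:
            datapoints.append(float(token))
    return datapoints

# Hypothetical response where small values come back padded with spaces;
# a strict float(" 0.1") parse would still work, but a strict equality or
# lookup on the unstripped token would not match, so small values get lost.
line = "alpha.cpu.percent-user,1469700000,1469700180,60| 0.1, 0.0,None"
print(parse_raw_datapoints(line))  # [0.1, 0.0, None] - small values survive
```

Stripping each token before interpreting it makes a value like `" 0.1"` indistinguishable from `"0.1"`, so only an explicit `None` (or empty token) is treated as a missing sample.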
I consider the issue of missing Network Utilization charts for virtual machines a separate problem (BZ 1365578), based on anmol's comment 16 - and as such I will ignore it during validation of this BZ.
This is a comment with QE verification details.

Version-Release
===============

On RHSC 2.0 server machine:

ceph-installer-1.0.14-1.el7scon.noarch
ceph-ansible-1.0.5-32.el7scon.noarch
rhscon-ui-0.0.52-1.el7scon.noarch
rhscon-core-selinux-0.0.41-1.el7scon.noarch
rhscon-ceph-0.0.40-1.el7scon.x86_64
rhscon-core-0.0.41-1.el7scon.x86_64

On Ceph 2.0 machines:

ceph-common-10.2.2-36.el7cp.x86_64
ceph-selinux-10.2.2-36.el7cp.x86_64
rhscon-core-selinux-0.0.41-1.el7scon.noarch
rhscon-agent-0.0.18-1.el7scon.noarch

Verification
============
When I created a cluster, I no longer see lots of "No data available" placeholders across dashboard pages as before:

* there is no such placeholder on the main dashboard
* there is no such placeholder on the cluster dashboard
* the host list page shows "0.0%" when the current values are too low
* on host dashboards, all "No data available" placeholders are either valid:
  - "no data" reported for Storage on monitor machines
  - "no data" reported for Swap on OSD machines (BZ 1364167)
  or known issues:
  - "no data" reported for Network Utilization charts - see BZ 1365578 (I consider this a known issue as described in comment 17)

Moreover, I can clearly see that data are reported properly for small values; e.g. on the Swap donut chart and the related sparkline, I see "176.0 KB of 2.0 GB".

After running `stress --cpu` as described in the BZ, I see the values go up and then go down again once stress is no longer running (even though it could be argued that the sheer time needed for this is too long - longer than the time needed to notice the initial increase - but that's not a concern of this BZ).

>> VERIFIED
Created attachment 1189361 [details] qe verification evidence: screenshot of host dashboard after the test
Since the problem described in this bug report should be resolved in a recent advisory, it has been closed with a resolution of ERRATA. For information on the advisory, and where to find the updated files, follow the link below. If the solution does not work for you, open a new bug report. https://access.redhat.com/errata/RHEA-2016:1754