
Bug 1363759

Summary:	Dashboard - storage values are not refreshed correctly and shows zeros
Product:	[oVirt] ovirt-engine-dashboard	Reporter:	Lukas Svaty <lsvaty>
Component:	Core	Assignee:	Alexander Wels <awels>
Status:	CLOSED CURRENTRELEASE	QA Contact:	Pavel Novotny <pnovotny>
Severity:	medium	Docs Contact:
Priority:	medium
Version:	unspecified	CC:	awels, bugs, lsvaty, mgoldboi, oourfali, sradco
Target Milestone:	ovirt-4.0.6	Flags:	rule-engine: ovirt-4.0.z+ mgoldboi: planning_ack+ oourfali: devel_ack+ lsvaty: testing_ack+
Target Release:	---
Hardware:	All
OS:	All
Whiteboard:
Fixed In Version:		Doc Type:	If docs needed, set a value
Doc Text:		Story Points:	---
Clone Of:		Environment:
Last Closed:	2016-11-15 11:41:35 UTC	Type:	Bug
Regression:	---	Mount Type:	---
Documentation:	---	CRM:
Verified Versions:		Category:	---
oVirt Team:	UX	RHEL 7.3 requirements from Atomic Host:
Cloudforms Team:	---	Target Upstream Version:
Embargoed:

Description Lukas Svaty 2016-08-03 13:39:12 UTC

Version-Release number of selected component (if applicable):
rhevm-4.0.2.3-0.1.el7ev.noarch
ovirt-engine-dashboard-1.0.1-0.el7ev.x86_64

How reproducible:
100%

The negative value for available space is no longer reported.
However, there is still a time window after removing or just detaching
a 2nd storage domain in which the available space is incorrectly
reported as 0 TiB.
The zero value is corrected after a few minutes.

Environment:
2 NFS storages, each one has: Size 767 GB, Available: 224 GB, Used: 543 GB

Take #1:
* only 1 storage is present: 
  Dashboard @ 4:53:13 PM: 0.2 Available of 0.7 TiB / 0.5 TiB Used
* 2nd storage added: 
  Dashboard @ 4:57:58 PM: 0.4 Available of 1.5 TiB / 1.1 TiB Used
* 2nd storage detached (but not removed):
  Dashboard @ 5:03:05 PM: 0 Available of 0.7 TiB / 0.7 TiB Used  <<-- WRONG
  Dashboard @ 5:27:03 PM: 0.2 Available of 0.7 TiB / 0.5 TiB Used  <<-- CORRECT

Take #2:
* 2nd storage added: 
  5:38:53 PM: 0.4 Available of 1.5 TiB / 1.1 TiB Used
* 2nd storage detached and removed:
  5:43:59 PM - 0 Available of 0.7 TiB / 0.7 TiB Used  <<-- WRONG
  5:50:46 PM - 0.2 Available of 0.7 TiB / 0.5 TiB Used  <<-- CORRECT

Take #3:
* 2nd storage added: 
  5:57:38 PM: 0.4 Available of 1.5 TiB / 1.1 TiB Used
* 2nd storage detached (but not removed):
  6:01:57 PM - 0 Available of 0.7 TiB / 0.7 TiB Used  <<-- WRONG
  6:14:10 PM - 0.2 Available of 0.7 TiB / 0.5 TiB Used  <<-- CORRECT
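
The dashboard readings above can be cross-checked against the environment numbers with a small sketch. It assumes (this is not stated in the report) that the "GB" figures are binary units and that the dashboard rounds TiB values to one decimal; the names `to_tib` and `dashboard_line` are hypothetical.

```python
GIB_PER_TIB = 1024  # assumption: the reported "GB" values are binary (GiB)


def to_tib(gib: float) -> float:
    """Convert GiB to TiB, rounded to one decimal (assumed dashboard rounding)."""
    return round(gib / GIB_PER_TIB, 1)


def dashboard_line(domains: int, size: float = 767, available: float = 224,
                   used: float = 543) -> str:
    """Build the expected dashboard text for N identical storage domains."""
    return (f"{to_tib(domains * available)} Available of "
            f"{to_tib(domains * size)} TiB / {to_tib(domains * used)} TiB Used")


if __name__ == "__main__":
    print(dashboard_line(1))  # one storage domain attached
    print(dashboard_line(2))  # two storage domains attached
```

Under these assumptions, the one-domain and two-domain lines match the readings marked CORRECT above.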

Actual results:
Zeros are shown until the next load from DWH.

Expected results:
Old values should stay until the next load from DWH.

Comment 1 Oved Ourfali 2016-08-04 10:45:57 UTC

As mentioned in the original bug, I don't see how we can fix that without polling DWH very frequently, which will have an impact on performance.
I don't think this should block the dashboard RFE.

Currently targeting it to 4.1 for consideration there.

Comment 2 Lukas Svaty 2016-08-04 11:18:25 UTC

Why are the zeros displayed and not the last status gathered from dwh?

Comment 3 Oved Ourfali 2016-08-04 11:43:19 UTC

(In reply to Lukas Svaty from comment #2)
> Why are the zeros displayed and not the last status gathered from dwh?

Alexander?

Comment 4 Alexander Wels 2016-08-04 12:11:01 UTC

Because the last status from the DWH actually returns a negative value. Basically the following happens:

1. DWH reads data from engine database, and the dashboard gets the values and shows them.
2. User detaches storage domain from engine.
3. The storage domain is flagged as unavailable in one of the tables in the DWH, but the used data is not changed.
4. The query looks up the total available storage (which is now down one storage domain), but the used data has not been updated yet, so that is still the value from before.
5. Available = total - used. But total is lower now while used is not. This can lead to negative values. We have code in place that basically sets a value to 0 if it is negative.
6. After a while the used value is updated as well, and everything is fine.

IMO the problem is the different intervals at which some of the data is updated in the DWH.
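
Steps 4 to 6 can be illustrated with the numbers from the description. This is only a sketch of the arithmetic, not the dashboard's actual code; the function name `available_space` and the per-domain constants are hypothetical, and the GB values are taken from the reported environment.

```python
def available_space(total_gb: float, used_gb: float) -> float:
    """Available = total - used, with a negative result reported as 0."""
    return max(0.0, total_gb - used_gb)


# Per-domain values from the reported environment (illustrative constants).
TOTAL_PER_SD = 767.0
USED_PER_SD = 543.0

# Both storage domains attached: total and used agree.
both = available_space(2 * TOTAL_PER_SD, 2 * USED_PER_SD)

# Right after the detach: total already excludes the 2nd domain,
# but used still includes it, so the raw difference is negative.
stale = available_space(TOTAL_PER_SD, 2 * USED_PER_SD)

# Once used has been updated as well, the value is correct again.
settled = available_space(TOTAL_PER_SD, USED_PER_SD)

if __name__ == "__main__":
    print(both, stale, settled)
```

The middle case is the window in which the dashboard shows 0 Available.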

Comment 5 Oved Ourfali 2016-08-04 12:17:39 UTC

(In reply to Alexander Wels from comment #4)
> Because the last status from the DWH actually returns a negative value.
> Basically the following happens:
> 
> 1. DWH reads data from engine database, and the dashboard gets the values
> and shows them.
> 2. User detaches storage domain from engine.
> 3. The storage domain is flagged as unavailable in one of the tables in the
> DWH, but the used data is not changed.
> 4. The query looks up the total available storage (which is now down one
> storage domain), but the used data has not been updated yet, so that is
> still the value from before.

Do both the total and used come from DWH?

> 5. Available = total - used. But total is lower now while used is not. This
> can lead to negative values. We have code in place that basically sets a
> value to 0 if it is negative.
> 6. After a while the used value is updated as well, and everything is fine.
> 
> IMO the problem is the different intervals at which some of the data is
> updated in the DWH.

Comment 6 Alexander Wels 2016-08-04 12:20:46 UTC

Yes, everything in the dashboard comes from the DWH, with the exception of the inventory cards at the top. So total/used come from the DWH database.

Comment 7 Oved Ourfali 2016-08-04 12:43:47 UTC

Shirly - what do you suggest we do?
We take the last 5 minutes of data, so the total in the latest sample doesn't contain the detached domain, but the previous samples do contain it.
How can we overcome this?

Comment 8 Alexander Wels 2016-08-04 13:19:13 UTC

Actually, took another look at the queries and this is the problem:

The total query basically looks at the last sample to determine if it should include a SD in the calculation or not. So when you detach the SD, the next sample will immediately exclude the detached SD.

The used query takes the average used over the last 5 minutes (this is what is shown in the center donut, and this is also the reason we have 2 queries, one for total and one for used). Now if you detach the SD, the last sample will not be included in the average, but the previous 4 will be.

We have several options to fix this:
1. Make the total an average over the last 5 minutes like the used.
2. Make the used not an average, but simply look at the last sample.
3. Modify the query to exclude all samples from the average if the last sample says the SD is not active.

@Moran,
Which option would you like?
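
A rough simulation of the mismatch described above, and of what option 2 would change. This is a sketch and not the real DWH queries: the sample layout, the one-sample-per-minute window, and all function names are assumptions made for illustration.

```python
from statistics import mean

# Hypothetical sample layout: SD name -> (active, total_gb, used_gb).
Sample = dict[str, tuple[bool, float, float]]


def make_samples() -> list[Sample]:
    """Five one-minute samples; sd2 is detached just before the last one."""
    attached = {"sd1": (True, 767.0, 543.0), "sd2": (True, 767.0, 543.0)}
    detached = {"sd1": (True, 767.0, 543.0), "sd2": (False, 767.0, 543.0)}
    return [attached] * 4 + [detached]


def total_last_sample(samples: list[Sample]) -> float:
    """Total: only SDs that are active in the most recent sample."""
    return sum(total for active, total, _ in samples[-1].values() if active)


def used_window_average(samples: list[Sample]) -> float:
    """Current used: per-SD average over the samples in which it was active."""
    names = {name for sample in samples for name in sample}
    result = 0.0
    for name in names:
        values = [s[name][2] for s in samples if name in s and s[name][0]]
        if values:
            result += mean(values)
    return result


def used_last_sample(samples: list[Sample]) -> float:
    """Option 2: used taken from the most recent sample only."""
    return sum(used for active, _, used in samples[-1].values() if active)


def available(total: float, used: float) -> float:
    """Negative results are clamped to 0, as described in comment 4."""
    return max(0.0, total - used)


if __name__ == "__main__":
    samples = make_samples()
    total = total_last_sample(samples)
    print("current: ", available(total, used_window_average(samples)))
    print("option 2:", available(total, used_last_sample(samples)))
```

With total and used both read from the last sample, the detached SD drops out of both sides at the same time, so the difference cannot go negative for this reason.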

Comment 9 Moran Goldboim 2016-08-17 15:46:13 UTC

(In reply to Alexander Wels from comment #8)
> Actually, took another look at the queries and this is the problem:
> 
> The total query basically looks at the last sample to determine if it should
> include a SD in the calculation or not. So when you detach the SD the next
> sample will immediately exclude the detach SD.
> 
> The used query takes the average used over the last 5 minutes (this is what
> is shown in the center donut, and this is also the reason we have 2 queries,
> one for total and one for used). Now if you detach the SD, the last sample
> will not be included in the average, but the previous 4 will be.
> 
> We have several options to fix this:
> 1. Make the total an average over the last 5 minutes like the used.
> 2. Make the used not an average, but simply look at the last sample.
> 3. Modify the query to exclude all samples from the average if the last
> sample says the SD is not active.
> 
> @Moran,
> Which option would you like?

The most appealing to me would be option 2, since I think it gives the current status, and the nature of storage statistics is different from CPU and memory, which are very dynamic and need to be normalized. What do you think would be the downsides of going with this option?

Comment 10 Alexander Wels 2016-08-17 16:09:19 UTC

I personally don't see any downsides to any of the options; in certain circumstances they will give slightly different data, all of which is valid IMO. If you want to go with option #2, I will implement that.

Comment 11 Moran Goldboim 2016-08-17 19:48:28 UTC

(In reply to Alexander Wels from comment #10)
> I personally don't see any downsides to any of the options, it will in
> certain circumstances give slightly different data. All of which are valid
> IMO. If you want to go with option #2, I will implement that.

Let's just make sure that if we make this change, we do it in a consistent manner across the dashboard.

Comment 12 Shirly Radco 2016-09-04 08:58:15 UTC

(In reply to Moran Goldboim from comment #11)
> (In reply to Alexander Wels from comment #10)
> > I personally don't see any downsides to any of the options, it will in
> > certain circumstances give slightly different data. All of which are valid
> > IMO. If you want to go with option #2, I will implement that.
> 
> let's just make sure that if we do this change, that we do it in a
> consistent manner across the dashboard

Can you please attach a screenshot of the issue?

Comment 13 Lukas Svaty 2016-11-15 11:41:35 UTC

I was not able to reproduce this issue; it was fixed in a previous release.
Thus closing.