Bug 1398553 - DWH sampling is too high
Summary: DWH sampling is too high
Keywords:
Status: CLOSED UPSTREAM
Alias: None
Product: Red Hat Enterprise Virtualization Manager
Classification: Red Hat
Component: ovirt-engine-dwh
Version: 4.0.3
Hardware: Unspecified
OS: Unspecified
Priority: unspecified
Severity: high
Target Milestone: ovirt-4.1.0-beta
Target Release: ---
Assignee: Shirly Radco
QA Contact: Lukas Svaty
URL:
Whiteboard:
Depends On: 1395608 1478859
Blocks:
Reported: 2016-11-25 08:52 UTC by Michal Skrivanek
Modified: 2019-04-28 13:25 UTC

Fixed In Version:
Doc Type: If docs needed, set a value
Doc Text:
Clone Of: 1395608
Environment:
Last Closed: 2016-12-15 10:30:39 UTC
oVirt Team: Metrics
Target Upstream Version:
Embargoed:



Description Michal Skrivanek 2016-11-25 08:52:52 UTC
cloning to downstream
requesting z-stream fix
adding high severity due to internal production instance impact

+++ This bug was initially created as a clone of Bug #1395608 +++

Description of problem:
DWH sampling rate is currently 20 seconds, which incurs a large amount of work on postgres, specifically if the history db is hosted with the engine db.

This leads to overhead on postgres performance, which affects engine performance:
- dwh takes almost 50% of the queries on the db
- increased io overhead
- contention on autovacuum workers
- little effect on the dashboard [1] which is not a core functionality 


[1] The cpu dashboard is a 24-hour overview. VDSM already samples an average of the last 15s and engine collects that every 15s, so a 20s sample by dwh will miss some engine samplings anyway.


actual: sample interval of 20s
expected: interval of 60s, if not higher
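
The point in footnote [1] can be illustrated with a small sketch. It is an idealized model, not DWH code: it assumes the engine refreshes its statistics exactly every 15s, that DWH polls on a fixed interval starting at the same instant, and that each poll reads only the latest engine sample. The function name and the one-hour window are hypothetical choices for illustration.

```python
def observed_engine_samples(dwh_interval_s, engine_interval_s=15, window_s=3600):
    """Return (seen, total): how many distinct engine samples a periodic
    DWH poll reads within the window, and how many the engine produced.

    Assumption: both timers are perfectly periodic and start at t=0.
    """
    total = window_s // engine_interval_s
    # Each DWH poll at time t reads the engine sample with index t // engine_interval_s.
    seen = {t // engine_interval_s for t in range(0, window_s, dwh_interval_s)}
    return len(seen), total


for interval in (20, 60):
    seen, total = observed_engine_samples(interval)
    print(f"DWH every {interval}s: reads {seen} of {total} engine samples")
```

Under these assumptions a 20s interval still skips one engine sample in every four, so it does not capture every engine collection; it only triples the number of polls compared with 60s.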

--- Additional comment from Michal Skrivanek on 2016-11-22 11:01:44 CET ---

requesting a 4.0.z fix since it is a regression in performance and basically just a revert of earlier patch to set it back to 60s

--- Additional comment from Shirly Radco on 2016-11-22 12:48:59 CET ---

(In reply to Roy Golan from comment #0)
> Description of problem:
> DWH sampling rate is currently 20 seconds, which incurs a large amount of
> work on postgres, specifically if the history db is hosted with the
> engine db.
> 
> This leads to overhead on postgres performance, which affects engine
> performance:
> - dwh takes almost 50% of the queries on the db

We are sampling a small amount of data with simple non-joining selects every 20 seconds, which is fine as long as it doesn't impact engine usability.

> - increased io overhead

iowait can be increased and yet everything is fine, or it can be 0% and nothing works.

> - contention on autovacuum workers

Since we are not locking any engine table, this has 0 impact on the engine DB.

> - little effect on the dashboard [1] which is not a core functionality 
> 
> 
> [1] The cpu dashboard is a 24-hour overview. VDSM already samples an average
> of the last 15s and engine collects that every 15s, so a 20s sample by dwh
> will miss some engine samplings anyway.


The dwh is not meant to be in sync with the vdsm rate.
We are collecting 5 samples out of the 6 VDSM reports, which gives quite good average and max calculations compared with the 1 out of 6 that we had with the previous interval.

> 
> 
> actual: sample interval of 20s
> expected: interval of 60s, if not higher

This requires further testing.
I have asked to check the effect on the engine performance against the postgres db before and after the change from 20 back to 60 seconds.

Also, follow up on testing Juan's suggestion to limit the java heap size.

--- Additional comment from Michal Skrivanek on 2016-11-25 08:40:37 CET ---

Please go through the email threads, results from Roy, me, again.

(In reply to Shirly Radco from comment #2)
> (In reply to Roy Golan from comment #0)
> > Description of problem:
> > DWH sampling rate is currently 20 seconds, which incurs a large amount of
> > work on postgres, specifically if the history db is hosted with the
> > engine db.
> > 
> > This leads to overhead on postgres performance, which affects engine
> > performance:
> > - dwh takes almost 50% of the queries on the db
> 
> We are sampling a small amount of data with simple non-joining selects every
> 20 seconds, which is fine as long as it doesn't impact engine usability.

The previous statement clearly says it does impact engine usability.

> 
> > - increased io overhead
> 
> iowait can be increased and yet everything is fine or it can be 0% and
> nothing works.

increased io overhead does impact the engine usability. It is *not* "fine".

> 
> > - contention on autovacuum workers
> 
> Since we are not locking any engine table, this has 0 impact on the engine DB.

testing proves it does
 
> 
> > - little effect on the dashboard [1] which is not a core functionality 
> > 
> > 
> > [1] The cpu dashboard is a 24-hour overview. VDSM already samples an average
> > of the last 15s and engine collects that every 15s, so a 20s sample by dwh
> > will miss some engine samplings anyway.
> 
> 
> The dwh is not meant to be in sync with the vdsm rate.
> We are collecting 5 samples out of the 6 VDSM reports, which gives quite good
> average and max calculations compared with the 1 out of 6 that we had with
> the previous interval.

How is this related to the argument above? What Roy is saying is that there is no reason to sample so often, as it doesn't improve accuracy or any data; it just increases load.

> 
> > 
> > 
> > actual: sample interval of 20s
> > expected: interval of 60s, if not higher
> 
> This requires further testing.
> I have asked to check the effect on the engine performance against the
> postgres db before and after the change from 20 back to 60 seconds.
> 
> Also, follow up on testing Juan's suggestion to limit the java heap size.

Yes, we can continue exploring that. But first, please move back to a sane default interval.
And in comment #1 I requested to do that in 4.0.z.

Comment 1 Yaniv Kaul 2016-12-01 13:18:36 UTC
Deferring to 4.0.7. We have other fixes (users table needless updates) that went in, we'll revisit in 4.0.7.

Comment 2 Yaniv Lavi 2016-12-15 10:20:15 UTC
Moving to 4.1. It has not been proven that the sampling rate causes any real impact on the engine. If this is proven, we can consider moving this back.

