Bug 1398553

Summary: DWH sampling is too high
Product: Red Hat Enterprise Virtualization Manager
Reporter: Michal Skrivanek <michal.skrivanek>
Component: ovirt-engine-dwh
Assignee: Shirly Radco <sradco>
Status: CLOSED UPSTREAM
QA Contact: Lukas Svaty <lsvaty>
Severity: high
Docs Contact:
Priority: unspecified
Version: 4.0.3
CC: bugs, lsurette, michal.skrivanek, pstehlik, rbalakri, rgolan, Rhev-m-bugs, sradco, srevivo, ykaul, ylavi
Target Milestone: ovirt-4.1.0-beta
Keywords: Performance
Target Release: ---
Hardware: Unspecified
OS: Unspecified
Whiteboard:
Fixed In Version:
Doc Type: If docs needed, set a value
Doc Text:
Story Points: ---
Clone Of: 1395608
Environment:
Last Closed: 2016-12-15 10:30:39 UTC
Type: Bug
Regression: ---
Mount Type: ---
Documentation: ---
CRM:
Verified Versions:
Category: ---
oVirt Team: Metrics
RHEL 7.3 requirements from Atomic Host:
Cloudforms Team: ---
Target Upstream Version:
Embargoed:
Bug Depends On: 1395608, 1478859
Bug Blocks:

Description Michal Skrivanek 2016-11-25 08:52:52 UTC
cloning to downstream
requesting z-stream fix
adding high severity due to internal production instance impact

+++ This bug was initially created as a clone of Bug #1395608 +++

Description of problem:
DWH sampling rate is currently 20 seconds, and that incurs a large amount of work, on postgres specifically if the history db is hosted with the engine db.

This leads to overhead on postgres performance, which affects engine performance:
- dwh takes almost 50% of the queries on the db
- increased io overhead
- contention on autovacuum workers
- little effect on the dashboard [1] which is not a core functionality 
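
A rough back-of-the-envelope sketch of what the first bullet implies. It assumes (this is not established in this report) that the number of DWH queries scales linearly with the sampling frequency and that the engine's own query volume stays constant; the function name is hypothetical.

```python
def dwh_query_share(share_now, old_interval_s, new_interval_s):
    """Estimate the DWH share of all DB queries after changing the interval.

    Assumption: DWH query volume is proportional to 1 / interval, and all
    other (engine) queries are unaffected by the change.
    """
    other = 1.0 - share_now
    # Fewer sampling rounds per unit of time means proportionally fewer queries.
    dwh_scaled = share_now * old_interval_s / new_interval_s
    return dwh_scaled / (dwh_scaled + other)


# "almost 50%" at 20s, moved back to 60s under the assumptions above
estimate = dwh_query_share(0.5, 20, 60)
print(f"estimated DWH share at 60s: {estimate:.0%}")
```

Under those assumptions the DWH share would drop from roughly half to roughly a quarter of the queries; real numbers would have to come from measurement.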


[1] The CPU dashboard is a 24-hour overview. VDSM already samples an average of the last 15s and the engine collects that every 15s, so a 20s sample by DWH will miss some engine samples anyway.
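
The cadence argument in [1] can be illustrated with a small simulation. This is only a sketch under simplifying assumptions: the engine refreshes its value exactly every 15s, DWH reads exactly every N seconds, both start at t=0, and there is no jitter. The function name is hypothetical.

```python
def engine_samples_seen(dwh_interval_s, engine_interval_s=15, window_s=3600):
    """Return (distinct engine samples read by DWH, engine samples produced)."""
    produced = window_s // engine_interval_s
    # A DWH read at time t sees the most recent engine refresh, i.e. the
    # refresh with index t // engine_interval_s.
    seen = {t // engine_interval_s for t in range(0, window_s, dwh_interval_s)}
    return len(seen), produced


for interval in (20, 60):
    seen, produced = engine_samples_seen(interval)
    rounds_per_day = 86400 // interval
    print(f"{interval}s: {seen}/{produced} engine samples per hour, "
          f"{rounds_per_day} sampling rounds per day")
```

In this idealized model a 20s interval still skips one engine sample in four, while tripling the number of sampling rounds compared with 60s.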


actual: sample interval of 20s
expected: interval of 60s, if not higher

--- Additional comment from Michal Skrivanek on 2016-11-22 11:01:44 CET ---

requesting a 4.0.z fix since it is a performance regression and basically just a revert of an earlier patch, setting it back to 60s

--- Additional comment from Shirly Radco on 2016-11-22 12:48:59 CET ---

(In reply to Roy Golan from comment #0)
> Description of problem:
> DWH sampling rate is currently 20 seconds, and that incurs a large amount of
> work, on postgres specifically if the history db is hosted with the
> engine db.
> 
> This leads to overhead on postgres performance, which affects engine
> performance:
> - dwh takes almost 50% of the queries on the db

We are sampling a small amount of data with simple non-joining selects every 20 seconds, which is fine as long as it doesn't impact engine usability

> - increased io overhead

iowait can be increased and yet everything is fine or it can be 0% and nothing works.

> - contention on autovacuum workers

Since we are not locking any engine table, this has 0 impact on the engine DB.

> - little effect on the dashboard [1] which is not a core functionality 
> 
> 
> [1] The CPU dashboard is a 24-hour overview. VDSM already samples an average
> of the last 15s and the engine collects that every 15s, so a 20s sample by DWH
> will miss some engine samples anyway.


DWH is not meant to be in sync with the VDSM rate.
We are collecting 5 samples out of the 6 VDSM reports, which gives quite good average and max calculations compared with the 1 out of 6 that we had with the previous interval.

> 
> 
> actual: sample interval of 20s
> expected: interval of 60s, if not higher

This requires further testing.
I have asked to check the effect on the engine performance against the postgres db before and after the change from 20 back to 60 seconds.

Also, follow up by testing Juan's suggestion to limit the Java heap size.

--- Additional comment from Michal Skrivanek on 2016-11-25 08:40:37 CET ---

Please go through the email threads again, including the results from Roy and me.

(In reply to Shirly Radco from comment #2)
> (In reply to Roy Golan from comment #0)
> > Description of problem:
> > DWH sampling rate is currently 20 seconds, and that incurs a large amount of
> > work, on postgres specifically if the history db is hosted with the
> > engine db.
> > 
> > This leads to overhead on postgres performance, which affects engine
> > performance:
> > - dwh takes almost 50% of the queries on the db
> 
> We are sampling a small amount of data with simple non-joining selects every
> 20 seconds, which is fine as long as it doesn't impact engine usability

The previous statement clearly says it does impact engine usability

> 
> > - increased io overhead
> 
> iowait can be increased and yet everything is fine or it can be 0% and
> nothing works.

increased io overhead does impact the engine usability. It is *not* "fine".

> 
> > - contention on autovacuum workers
> 
> > Since we are not locking any engine table, this has 0 impact on the engine DB.

testing proves it does
 
> 
> > - little effect on the dashboard [1] which is not a core functionality 
> > 
> > 
> > [1] The CPU dashboard is a 24-hour overview. VDSM already samples an average
> > of the last 15s and the engine collects that every 15s, so a 20s sample by DWH
> > will miss some engine samples anyway.
> 
> 
> DWH is not meant to be in sync with the VDSM rate. 
> We are collecting 5 samples out of the 6 VDSM reports, which gives quite good
> average and max calculations compared with the 1 out of 6 that we had
> with the previous interval.

How is this related to the argument above? What Roy is saying is that there is no reason to sample so often, as it doesn't improve accuracy or any data; it just increases load

> 
> > 
> > 
> > actual: sample interval of 20s
> > expected: interval of 60s, if not higher
> 
> This requires further testing.
> I have asked to check the effect on the engine performance against the
> postgres db before and after the change from 20 back to 60 seconds.
> 
> Also, follow up by testing Juan's suggestion to limit the Java heap size.

Yes, we can continue exploring that. But first, please move back to a sane default interval.
And in comment #1 I requested that this be done in 4.0.z

Comment 1 Yaniv Kaul 2016-12-01 13:18:36 UTC
Deferring to 4.0.7. Other fixes went in (needless updates to the users table); we'll revisit in 4.0.7.

Comment 2 Yaniv Lavi 2016-12-15 10:20:15 UTC
Moving to 4.1. It has not been proven that the sampling rate has any real impact on the engine. If that is proven, we can consider moving this back.