cloning to downstream
requesting z-stream fix
adding high severity due to internal production instance impact

+++ This bug was initially created as a clone of Bug #1395608 +++

Description of problem:
DWH sampling rate is currently 20 seconds, and that incurs a large amount of work on postgres, specifically if the history db is hosted with the engine db.

This leads to overhead on postgres performance, which affects engine performance:
- dwh takes almost 50% of the queries on the db
- increased io overhead
- contention on autovacuum workers
- little effect on the dashboard [1] which is not a core functionality

[1] The cpu dashboard is a 24 hours overview. VDSM already samples an average of the last 15s and engine collects that every 15s, so a 20s sample by dwh will miss some engine samplings anyway.

actual: sample interval of 20s
expected: interval of 60s, if not higher

--- Additional comment from Michal Skrivanek on 2016-11-22 11:01:44 CET ---

requesting a 4.0.z fix since it is a regression in performance and basically just a revert of the earlier patch, setting it back to 60s

--- Additional comment from Shirly Radco on 2016-11-22 12:48:59 CET ---

(In reply to Roy Golan from comment #0)
> Description of problem:
> DWH sampling rate is currently 20 seconds, and that incurs a large amount of
> work on postgres, specifically if the history db is hosted with the
> engine db.
>
> This leads to overhead on postgres performance, which affects engine
> performance:
> - dwh takes almost 50% of the queries on the db

We are sampling a small amount of data with simple non-joining selects every 20 seconds, which is fine as long as it doesn't impact the engine usability.

> - increased io overhead

iowait can be increased and yet everything is fine, or it can be 0% and nothing works.

> - contention on autovacuum workers

Since we are not locking any engine table, this has 0 impact on the engine DB.

> - little effect on the dashboard [1] which is not a core functionality
>
>
> [1] The cpu dashboard is a 24 hours overview. VDSM already samples an average
> of the last 15s and engine collects that every 15s, so a 20s sample by dwh
> will miss some engine samplings anyway.

The dwh is not meant to be in sync with the vdsm rate.
We are collecting 5 samples out of the 6 VDSM reports, which gives quite good average and max calculations compared with the 1 out of 6 that we had with the previous interval.

>
>
> actual: sample interval of 20s
> expected: interval of 60s, if not higher

This requires further testing.
I have asked to check the effect on the engine performance against the postgres db before and after the change from 20 back to 60 seconds.

Also, test Juan's suggestion to limit the java heap size.

--- Additional comment from Michal Skrivanek on 2016-11-25 08:40:37 CET ---

Please go through the email threads and the results from Roy and me again.

(In reply to Shirly Radco from comment #2)
> (In reply to Roy Golan from comment #0)
> > Description of problem:
> > DWH sampling rate is currently 20 seconds, and that incurs a large amount of
> > work on postgres, specifically if the history db is hosted with the
> > engine db.
> >
> > This leads to overhead on postgres performance, which affects engine
> > performance:
> > - dwh takes almost 50% of the queries on the db
>
> We are sampling a small amount of data with simple non-joining selects every
> 20 seconds, which is fine as long as it doesn't impact the engine usability.

The previous statement clearly says it does impact the engine usability.

>
> > - increased io overhead
>
> iowait can be increased and yet everything is fine, or it can be 0% and
> nothing works.

increased io overhead does impact the engine usability. It is *not* "fine".

>
> > - contention on autovacuum workers
>
> Since we are not locking any engine table, this has 0 impact on the engine DB.

testing proves it does

> > - little effect on the dashboard [1] which is not a core functionality
> >
> >
> > [1] The cpu dashboard is a 24 hours overview. VDSM already samples an average
> > of the last 15s and engine collects that every 15s, so a 20s sample by dwh
> > will miss some engine samplings anyway.
>
>
> The dwh is not meant to be in sync with the vdsm rate.
> We are collecting 5 samples out of the 6 VDSM reports, which gives quite good
> average and max calculations compared with the 1 out of 6 that we had with
> the previous interval.

how is this related to the argument above? What Roy is saying is that there is no reason to sample so often, as it doesn't improve accuracy or any data; it just increases load.

> >
> >
> > actual: sample interval of 20s
> > expected: interval of 60s, if not higher
>
> This requires further testing.
> I have asked to check the effect on the engine performance against the
> postgres db before and after the change from 20 back to 60 seconds.
>
> Also, test Juan's suggestion to limit the java heap size.

yes, we can continue exploring that. But first, please move back to a sane default interval. And in comment #1 I requested to do that in 4.0.z.
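For reference, the coverage side of the interval argument can be sketched with a small idealized model. This is only an illustration, not how dwh actually collects: it assumes the engine refreshes its value exactly every 15s starting at t=0, that the poller reads on a fixed, clock-aligned interval, and the function name `observed_fraction` is hypothetical.

```python
def observed_fraction(engine_interval_s: int, poll_interval_s: int, window_s: int) -> float:
    """Fraction of engine refreshes that a fixed-interval poller gets to read.

    Idealized assumption: the engine publishes a fresh value at
    t = 0, engine_interval_s, 2 * engine_interval_s, ... and each poll
    reads whichever refresh is current at poll time.
    """
    total_refreshes = len(range(0, window_s, engine_interval_s))
    # Index of the refresh that is current at each poll time; a set drops repeats.
    seen = {t // engine_interval_s for t in range(0, window_s, poll_interval_s)}
    return len(seen) / total_refreshes


if __name__ == "__main__":
    window = 3600  # one hour, in seconds
    for poll in (20, 60):
        share = observed_fraction(15, poll, window)
        print(f"poll every {poll}s: {share:.0%} of engine refreshes read, "
              f"{window // poll} polls/hour")
```

Under these assumptions a 20s poll reads 75% of the engine refreshes with 180 polls per hour, and a 60s poll reads 25% with 60 polls per hour. The model says nothing about the database cost of each poll, which is the part that needs measuring.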
Deferring to 4.0.7. Other fixes have gone in (needless updates to the users table), so we'll revisit this in 4.0.7.
Moving to 4.1. It has not been proven that the sampling rate has any real impact on the engine. If that is proven, we can consider moving this back.